PROCESSORS, METHODS, SYSTEM AND COMMANDS FOR DATA ELEMENT COMPARISON
Data element comparison instructions facilitate efficient processing of non-aligned packed data operands in SIMD architectures, addressing alignment issues in CSR format to enhance scalar product calculations and improve machine learning performance.
Patent Information
- Authority / Receiving Office
- DE · DE
- Patent Type
- Patents
- Current Assignee / Owner
- INTEL CORP
- Filing Date
- 2016-08-24
- Publication Date
- 2026-07-02
AI Technical Summary
Existing processors face challenges in efficiently processing packed data operands, particularly in SIMD architectures, when dealing with sparse matrices represented in CSR format, as the alignment of data elements is disrupted, leading to inefficiencies in operations like scalar product calculations.
Data element comparison instructions allow packed data operands to be compared and aligned within a processor. A decoding unit decodes the instruction and an execution unit generates result mask operands, which enables efficient operations on non-aligned data elements such as those in CSR format.
This improves the performance of scalar product calculations on sparse vectors, and therefore of machine learning and other applications, by efficiently identifying and isolating the matching non-zero values within sparse matrices.
Description
BACKGROUND

Technical field

The embodiments described here generally relate to processors. In particular, they relate to processors for processing packed data operands.

Background information

Many processors feature single-instruction, multiple-data (SIMD) architectures. In SIMD architectures, a packed data instruction, vector instruction, or SIMD instruction can act on multiple data elements or multiple pairs of data elements simultaneously or in parallel. The processor may have parallel execution hardware that responds to the packed data instruction by performing the multiple operations concurrently or in parallel.

Multiple data elements can be packed within a register or memory location as packed data or vector data. Within the packed data, the bits of the register or other storage location can be logically divided into a sequence of data elements. For example, a 256-bit wide packed data register can contain four 64-bit data elements, eight 32-bit data elements, sixteen 16-bit data elements, and so on. Each data element can represent a separate, individual piece of data (e.g., a pixel color, a component of a complex number, etc.) that can be operated on separately and/or independently of the others.

US 2015/0186141 A1 concerns a processor for comparing packed data in response to packed data comparison instructions. An example processor includes a decoding unit for decoding a versatile packed data comparison instruction that indicates a first packed source data operand containing a first plurality of data elements and a second packed source data operand containing a second plurality of corresponding data elements. The instruction also indicates a source comparison operation specification operand, which includes comparison operation indicators, each indicating a potentially different comparison operation for a different corresponding pair of data elements from the first and second source operands.
The example further includes an execution unit that, in response to the instruction, stores a result in a destination storage location indicated by the instruction. The result comprises outcome indicators, each corresponding to a different comparison operation indicator. Each outcome indicator indicates the result of the comparison operation performed on the corresponding pair of data elements, as specified by the respective comparison operation indicator.

SUMMARY OF THE INVENTION

The present invention is defined by a processor according to main claim 1. The dependent claims define further developments of the invention.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention can best be understood with reference to the following description and the accompanying drawings, which are used to illustrate the embodiments. In the drawings:

Fig. 1 is a block diagram of a portion of an exemplary sparse matrix.
Fig. 2 illustrates a compressed sparse row representation of a subset of the columns of rows 1 and 2 of the sparse matrix of Fig. 1.
Fig. 3 is a block diagram of an embodiment of a processor capable of executing an embodiment of a data element comparison instruction.
Fig. 4 is a block flow diagram of an embodiment of a method for executing an embodiment of a data element comparison instruction.
Fig. 5 is a block diagram of a first exemplary embodiment of a data element comparison operation.
Fig. 6 is a block diagram of a second exemplary embodiment of a data element comparison operation.
Fig. 7 is a block diagram of a third exemplary embodiment of a data element comparison operation.
Fig. 8 is a block diagram of a fourth exemplary embodiment of a data element comparison operation.
Fig. 9 is a block diagram of an exemplary masked data element merge operation.
Fig. 10 is a block diagram of an exemplary embodiment of a suitable set of packed data operation mask registers.
Fig. 11 is a block diagram of an exemplary embodiment of a suitable set of packed data registers.
Figures 12A-C are block diagrams illustrating a generic vector-friendly instruction format and its instruction templates according to embodiments of the invention.
Figures 13A-B are block diagrams illustrating an exemplary specific vector-friendly instruction format and an opcode field according to embodiments of the invention.
Figures 14A-D are block diagrams illustrating an exemplary specific vector-friendly instruction format and its fields according to embodiments of the invention.
Fig. 15 is a block diagram of an embodiment of a register architecture.
Fig. 16A is a block diagram illustrating an embodiment of an in-order pipeline and an out-of-order issue/execution pipeline with register renaming.
Fig. 16B is a block diagram of an embodiment of a processor core containing a front-end unit coupled to an execution engine unit, both of which are coupled to a memory unit.
Fig. 17A is a block diagram of an embodiment of a single processor core, together with its connection to the on-die interconnect network and its local subset of the Level 2 (L2) cache.
Fig. 17B is a block diagram of an embodiment of an expanded view of a portion of the processor core of Fig. 17A.
Fig. 18 is a block diagram of an embodiment of a processor that may have more than one core, an integrated memory controller, and integrated graphics.
Fig. 19 is a block diagram of a first embodiment of a computer architecture.
Fig. 20 is a block diagram of a second embodiment of a computer architecture.
Fig. 21 is a block diagram of a third embodiment of a computer architecture.
Fig. 22 is a block diagram of a fourth embodiment of a computer architecture.
Fig. 23 is a block diagram illustrating the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set according to embodiments of the invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

Disclosed herein are data element comparison instructions, processors for executing the instructions, methods performed by the processors when processing or executing the instructions, and systems comprising one or more processors for processing or executing the instructions. Numerous specific details (e.g., specific instruction operations, data formats, processor configurations, microarchitectural details, sequences of operations, etc.) are set forth in the following description. However, the embodiments can be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the understanding of the description.

The data element comparison instructions disclosed herein are general-purpose instructions and are not limited to any known use. Instead, these instructions can be used for various purposes and/or in various ways based on the creativity of the programmer, compiler, or the like. In some embodiments, these instructions can be used to process data associated with sparse matrices, although the scope of protection of the invention is not limited in this way.
In some embodiments, these instructions can be used to process data associated with a compressed sparse row (CSR) representation, although the scope of protection of the invention is not limited in this way. To further illustrate certain concepts, specific uses of these instructions for processing indices in a CSR format, which can be used to represent the indices and values of a sparse matrix, are described below, although it should be recognized that this is only one possible use of these instructions. Representatively, this can be useful in data analysis, high-performance computing, machine learning, sparse linear algebra problems, and the like.

In other embodiments, these instructions can be used to process other types of data besides sparse matrices and/or data in CSR format. For example, they can be used to process multimedia data, graphics data, audio data, video data, pixels, text sequence data, string data, financial data, other types of integer data, or the like. Furthermore, such processing can serve various purposes, such as identifying, selecting, merging, removing, or modifying duplicate data elements, among others.

Fig. 1 is a block diagram of a portion of an exemplary sparse matrix 100. The matrix generally represents a two-dimensional data structure in which the data values are arranged in rows and columns. The data values can also be referred to simply as values or data elements. The illustrated exemplary sparse matrix has at least thirty-nine columns and at least two rows, and optionally more. Alternatively, other sparse matrices may have more rows and/or fewer or more columns. The values of the first row are shown as a* values, where the asterisk (*) represents the number of the column containing the value.
Similarly, the values of the second row are shown as b* values, where the asterisk (*) represents the number of the column containing the value. The value in row 1, column 7 is a7, the value in row 1, column 23 is a23, the value in row 2, column 15 is b15, and so on.

In many different applications, it may be desirable to operate on two vectors, such as two rows of a sparse matrix. This can be done, for example, for scalar product (dot product) calculations of sparse vectors. Such calculations are widely used in machine learning applications, for example. Examples include the kernelized support vector machine (SVM), the open-source libSVM, kernelized principal component analysis, and the like. A kernel commonly used in such applications is the squared distance calculation pattern, also known as the L2 norm between two vectors. The squared distance function, f, (||f||) between two vectors α and β is represented by Equation 1:

$$\|\alpha - \beta\|^2 = \|\alpha\|^2 + \|\beta\|^2 - 2\,(\alpha \cdot \beta) \qquad \text{(Equation 1)}$$

The inner product (·) between the two vectors α and β, which can be sparse vectors, is represented as a scalar product calculation, as shown in Equation 2:

$$\alpha \cdot \beta = \sum_{i} \alpha_i \, \beta_i \qquad \text{(Equation 2)}$$

Such scalar product calculations of sparse vectors tend to contribute significantly to the overall computation time of machine learning and other applications. Accordingly, increasing the performance of such calculations can help improve the performance of machine learning and other applications.

Referring again to Fig. 1, the matrix 100 is described as sparse because a significant number or proportion of its values are zero. Such zero values often have special mathematical properties; for example, multiplication by zero produces a product of zero. When the values in different rows of the same column are multiplied, any zero value produces a product of zero, whereas multiplying two non-zero values can produce a non-zero product.
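The relationship between the squared distance and the scalar product can be checked numerically. The following is a minimal sketch in Python; the two short vectors are made-up illustrative values (an assumption, not data from the figures), and the helper names are hypothetical.

```python
def dot(u, v):
    """Scalar (dot) product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(u, v))


def squared_distance(u, v):
    """Squared L2 distance computed directly from the element differences."""
    return sum((x - y) ** 2 for x, y in zip(u, v))


# Illustrative sparse vectors (hypothetical values); most entries are zero.
alpha = [0, 2, 3, 0, 0, 0, 5, 0]
beta = [0, 0, 4, 0, 1, 0, 6, 0]

# Only positions where BOTH entries are non-zero contribute to the scalar product.
contributing = [i for i, (x, y) in enumerate(zip(alpha, beta)) if x != 0 and y != 0]
sparse_dot = sum(alpha[i] * beta[i] for i in contributing)

# Squared distance computed directly and via the expansion of Equation 1.
direct = squared_distance(alpha, beta)
expanded = dot(alpha, alpha) + dot(beta, beta) - 2 * dot(alpha, beta)

print(contributing, sparse_dot, direct, expanded)
```

In this sketch, two of the eight positions carry the whole scalar product, which is why skipping the zero entries is attractive for sparse data.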
For example, multiplying the data elements in rows 1 and 2 of column 2 (i.e., a2 * 0) produces a product of zero, whereas multiplying the data elements in rows 1 and 2 of column 3 (i.e., a3 * b3) produces a non-zero product. Furthermore, in the specific case of multiply-accumulate or scalar product calculations, such zero values do not contribute to the overall accumulation value or scalar product. Accordingly, in these and certain other uses, it may be desirable to ignore the zero values in the sparse matrix. In the sparse matrix of this particular example, only three pairs of values from rows 1 and 2 occupy a common column and are both non-zero, as indicated by reference 102. Specifically, this is true for a3 and b3, a7 and b7, and a23 and b23. In some embodiments, it may be advantageous to identify and/or isolate such pairs of values efficiently. As will be further explained below, the data element comparison instructions disclosed herein are useful for this purpose, although they are not limited to it.

Fig. 2 illustrates a compressed sparse row (CSR) representation 204 of a subset of the columns of rows 1 and 2 of the sparse matrix of Fig. 1. In the CSR representation or format, the values of the matrix and/or a vector (e.g., a single row of the matrix) are represented by a 2-tuple or pair consisting of an index and a corresponding value. In the case of the aforementioned sparse matrix, the index can, for example, represent the column number, while the value can represent the data value for a given row in that column. The <index:value> 2-tuples or pairs can generally be grouped together in increasing index order for all non-zero data values in a row. The end of a row can be delimited by a sentinel value, such as negative one (i.e., -1). The zero values can be omitted or "compressed" out of the CSR representation.
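The encoding just described can be sketched in a few lines of Python. This is an illustration under stated assumptions: the function name `to_csr_row` is hypothetical, the row is only a short made-up prefix (columns 1 to 7) using the symbolic names from the text, and storing the sentinel as a pair with an empty value is a choice made here, not something the description specifies.

```python
SENTINEL = -1  # end-of-row marker mentioned in the description


def to_csr_row(dense_row, first_column=1):
    """Return (index, value) pairs for the non-zero entries of one row.

    Pairs are emitted in increasing index order and the list is closed
    with a sentinel index. Pairing the sentinel with None is an assumption.
    """
    pairs = [
        (col, val)
        for col, val in enumerate(dense_row, start=first_column)
        if val != 0
    ]
    pairs.append((SENTINEL, None))
    return pairs


# Hypothetical prefix of row 1: non-zero values in columns 2, 3 and 7.
row1_prefix = [0, "a2", "a3", 0, 0, 0, "a7"]

print(to_csr_row(row1_prefix))
```

The zeros in columns 1, 4, 5 and 6 simply disappear from the output, which is the "compression" the text refers to.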
For example, the CSR representations for a subset of the columns in row 1 and for a subset of the columns in row 2 can be represented as shown in Fig. 2.

As can easily be seen, such a CSR format omits the zero values (which, for example, cannot contribute to a scalar product or certain other types of operations). However, a likely consequence of the CSR representation or format is that values that were in the same column of a matrix (or set of vectors), such as the data values a3 and b3, may not be in the same relative 2-tuple position and/or may not be "aligned" when converted to the CSR representation. This is partly due to the removal of generally different numbers of zeros, and/or zeros in different positions, from the different vectors. This lack of alignment is shown in the illustration by reference numeral 206. For example, in the matrix shown in Fig. 1, the values a3 and b3 are both in column 3 and vertically aligned. In the CSR representation of rows 1 and 2, however, the tuple <3:a3> is in the second position from the left in the list of tuples (because a3 is the second non-zero value in row 1), whereas the pair <3:b3> in row 2 is in the first position from the left in the list of tuples (because b3 is the first non-zero value in row 2). Similarly, the data elements a7 and b7, and a23 and b23, can also be in different relative positions in the CSR format.

A likely consequence of this is that when the data is processed in vector processors, packed data processors, or single-instruction, multiple-data (SIMD) processors, values that were in the same column of the matrix may no longer occupy the same corresponding, vertically aligned data element positions of the packed data operands, vectors, or SIMD operands. In some embodiments, it may be desirable to operate on values in the same column (e.g., in the case of vector multiplication, etc.).
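The loss of alignment can be made concrete with a small sketch. The index lists below are assumptions chosen to be consistent with the text (index 3 is second in row 1 and first in row 2; columns 3, 7 and 23 are shared); the entry 9 is purely hypothetical filler.

```python
# Column indices of the non-zero values after CSR compression.
row1_indices = [2, 3, 7, 9, 23]   # 9 is a hypothetical extra non-zero column
row2_indices = [3, 7, 15, 23]

# A "vertical" SIMD operation pairs elements by position, not by column.
vertical_pairs = list(zip(row1_indices, row2_indices))

# Which positional pairs actually refer to the same column?
aligned = [a == b for a, b in vertical_pairs]

# The columns that really are shared by both rows.
common_columns = sorted(set(row1_indices) & set(row2_indices))

print(vertical_pairs)
print(aligned)
print(common_columns)
```

None of the positional pairs refers to the same column, even though three columns are shared, so a plain element-wise multiply of the two value lists would combine unrelated values.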
This tends to present certain challenges when implementing operations on such values efficiently, because vector operations, packed data operations, or SIMD operations are often designed to operate on corresponding, vertically aligned data elements. For example, an instruction set might include a packed multiplication instruction to multiply a corresponding pair of least significant data elements of a first and a second packed source data operand, to multiply a corresponding pair of next-to-least significant data elements of the first and second packed source data operands, and so on. Conversely, the packed multiplication instruction may not be operable to multiply data elements in non-corresponding or non-vertically aligned positions.

Fig. 3 is a block diagram of an embodiment of a processor 310 that is operable to execute an embodiment of a data element comparison instruction 312. In some embodiments, the processor may be a general-purpose processor (e.g., a general-purpose microprocessor or a central processing unit (CPU) of the type used in a desktop computer, laptop computer, or other computer). Alternatively, the processor may be a special-purpose processor. Examples of suitable special-purpose processors include, but are not limited to, network processors, communication processors, cryptographic processors, graphics processors, coprocessors, embedded processors, digital signal processors (DSPs), and controllers (e.g., microcontrollers). The processor may have any of various complex instruction set computing (CISC) architectures, reduced instruction set computing (RISC) architectures, very long instruction word (VLIW) architectures, hybrid architectures, or other types of architectures, or it may have a combination of different architectures (different cores may, for example, have different architectures).

During operation, the processor 310 can receive a data element comparison instruction 312.
The instruction can be received, for example, from memory over a bus or other interconnect. The instruction can represent a macroinstruction, an assembly language instruction, a machine code instruction, or another instruction or control signal from an instruction set of the processor. In some embodiments, the data element comparison instruction can explicitly specify (e.g., by one or more fields or a set of bits) or otherwise indicate (e.g., implicitly indicate) a first packed source data operand 322, specify or otherwise indicate a second packed source data operand 324, and specify or otherwise indicate at least one destination memory location 326 where a first result mask operand 328 and, optionally, a second result mask operand 330 are to be stored. In some embodiments, there may be at least four or at least eight data elements in each of the first and second packed source data operands. In some embodiments, the data elements may represent indices corresponding to a CSR representation, although the scope of protection of the invention is not limited in this way. As an example, the instruction may include source and/or destination operand specification fields to specify registers, memory locations, or other storage locations for the operands. Alternatively, one or more of these operands may optionally be implicit in the instruction (e.g., implicit in an opcode of the instruction).

Referring again to Fig. 3, in some embodiments the first packed source data operand 322 can optionally be stored in a first packed data register of a set of packed data registers 320, while the second packed source data operand 324 can optionally be stored in a second packed data register of the set of packed data registers 320. Alternatively, memory locations or other storage locations can optionally be used for one or more of these operands.
Each of the packed data registers can represent an on-die storage location that is operable to store packed data, vector data, or single-instruction, multiple-data (SIMD) data. Packed data registers can represent architecturally visible registers, or architectural registers visible to software and/or a programmer, and/or they can be the registers indicated by instructions of the processor's instruction set to identify operands. These architectural registers are distinct from other non-architectural registers in a given microarchitecture (e.g., temporary registers, reorder buffers, retirement registers, etc.). Packed data registers can be implemented in various ways in different microarchitectures and are not restricted to any particular type of design. Examples of suitable types of packed data registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations thereof. Specific examples of suitable packed data registers include those shown and described in Fig. 11, but are not limited to these.

Referring again to Fig. 3, the processor in some embodiments may optionally include a set of packed data operation mask registers 332. Each of the packed data operation mask registers may represent an on-die storage location that is operable to store at least one packed data operation mask. The packed data operation mask registers may represent architecturally visible registers, or architectural registers visible to software and/or a programmer, and/or they can be the registers indicated by instructions of the processor's instruction set to identify operands. Examples of suitable types of packed data operation mask registers include, but are not limited to, dedicated physical registers, dynamically allocated physical registers using register renaming, and combinations thereof. Specific examples of suitable packed data operation mask registers include those shown and described in Fig. 10 and the mask or k-mask registers described in the latter part of the application, but are not limited to these.

As further shown, in some embodiments the one or more destination memory locations 326 can optionally be one or more packed data operation mask registers in the set of packed data operation mask registers 332. In some embodiments, a first packed data operation mask register can optionally be used to store the first result mask operand 328, and a second packed data operation mask register can optionally be used to store the second result mask operand 330, as further explained below (e.g., in relation to Fig. 5). In other embodiments, a single packed data operation mask register can optionally be used to store both the first result mask operand 328 and the second result mask operand 330, as further explained below (e.g., in relation to Fig. 6). In still other embodiments, the first result mask operand 328 and the second result mask operand 330 can optionally be stored in a packed data register within the set of packed data registers 320, as further explained below (e.g., in connection with Fig. 8). The result mask operands can, for example, be stored in a different packed data register than those used to store the first and second packed source data operands. Alternatively, a packed data register used for either the first or the second packed source data operand can optionally be reused to store the first and second result mask operands. For example, the instruction can specify a source/destination packed data register that the processor implicitly understands to be used both initially for a packed source data operand and subsequently to store the result mask operands.

Referring again to Fig. 3, the processor includes a decoding unit or decoder 314. The decoding unit can receive and decode the data element comparison instruction.
The decoding unit can output one or more relatively lower-level instructions or control signals 316 (e.g., one or more microinstructions, micro-operations, microcode entry points, decoded instructions, or control signals, etc.) that reflect, represent, and/or are derived from the relatively higher-level data element comparison instruction. In some embodiments, the decoding unit can have one or more input structures (e.g., port(s), interconnect(s), an interface) to receive the data element comparison instruction, instruction recognition and decoding logic coupled to them to recognize and decode the data element comparison instruction, and one or more output structures (e.g., port(s), interconnect(s), an interface) coupled to that logic to output the lower-level instruction(s) or control signal(s). The decoding unit may be implemented using various mechanisms, including, but not limited to, microcode read-only memories (ROMs), lookup tables, hardware implementations, programmable logic arrays (PLAs), and other mechanisms suitable for implementing decoding units.

In some embodiments, instead of the data element comparison instruction being provided directly to the decoding unit, an instruction emulator, translator, morpher, interpreter, or other instruction translation module may optionally be used. Various types of instruction translation modules may be implemented in software, hardware, firmware, or a combination thereof.
In some embodiments, the instruction translation module may reside outside the processor, such as on a separate die and/or in memory (e.g., as a static, dynamic, or runtime emulation module). For example, the instruction translation module can receive the data element comparison instruction, which may be from a first instruction set, and then emulate, translate, morph, interpret, or otherwise convert it into one or more corresponding intermediate instructions or control signals, which may be from a second, different instruction set. The one or more intermediate instructions or control signals from the second instruction set can be provided to a decoder (e.g., decoder 314), which can then decode them into one or more lower-level instructions or control signals that can be executed by the processor's native hardware (e.g., one or more execution units).

Referring again to Fig. 3, the execution unit 318 is coupled with the decoding unit 314, the packed data registers 320, and optionally the packed data operation mask registers 332 (e.g., if the result mask operands 328, 330 are to be stored therein). The execution unit can receive the one or more decoded or otherwise converted instructions or control signals 316 that represent and/or are derived from the data element comparison instruction. The execution unit can also receive the first packed source data operand 322 and the second packed source data operand 324. In response to and/or as a result of the data element comparison instruction (e.g., in response to the one or more instructions or control signals decoded from the instruction), the execution unit can be operable to store the first result mask operand 328 and the optional second result mask operand 330 at the one or more destination memory locations 326 indicated by the instruction.
In some embodiments, at least one result mask operand (e.g., the first result mask operand 328) may contain a different mask element for each corresponding data element in one of the first and second packed source data operands (e.g., the first packed source data operand 322), at the same relative position within the operands. In some embodiments, each mask element may indicate whether the corresponding data element in that one of the first and second packed source data operands (e.g., the first packed source data operand 322) is equal to any of the data elements in the other of the first and second packed source data operands (e.g., the second packed source data operand 324).

In some embodiments, the first result mask operand 328 can contain a different mask element for each corresponding data element in the first packed source data operand 322 at the same relative position within the operands, wherein each mask element of the first result mask operand 328 can indicate whether the corresponding data element in the first packed source data operand 322 is equal to any of the data elements in the second packed source data operand 324. In some embodiments, the second result mask operand 330 can contain a different mask element for each corresponding data element in the second packed source data operand 324 at the same relative position within the operands, wherein each mask element of the second result mask operand 330 can indicate whether the corresponding data element in the second packed source data operand 324 is equal to any of the data elements in the first packed source data operand 322. In some embodiments, each mask element can be a single mask bit. In some embodiments, the result can be any of those shown and described in Figures 5-8, although the scope of the invention is not limited in this way.
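The semantics of the two result mask operands can be modelled in software. The following Python sketch is only an emulation of the described behaviour, not the hardware or any real instruction; the function name `compare_elements` and the four-element index vectors are assumptions for illustration.

```python
def compare_elements(src1, src2):
    """Emulate the described comparison on two lists of indices.

    mask1[i] is 1 if src1[i] equals ANY element of src2, else 0.
    mask2[j] is 1 if src2[j] equals ANY element of src1, else 0.
    One mask bit per source element, at the same relative position.
    """
    mask1 = [int(any(a == b for b in src2)) for a in src1]
    mask2 = [int(any(b == a for a in src1)) for b in src2]
    return mask1, mask2


# Hypothetical packed index operands (e.g., CSR column indices).
src1 = [2, 3, 7, 9]
src2 = [3, 7, 15, 23]

mask1, mask2 = compare_elements(src1, src2)
print(mask1, mask2)
```

Here the set bits in each mask mark the elements of that operand whose index also appears somewhere in the other operand, regardless of position.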
The execution unit and/or the processor may include specific or particular logic (e.g., transistors, integrated circuitry, or other hardware potentially combined with firmware (e.g., instructions stored in non-volatile memory) and/or software) that is operable to execute the data element comparison instruction and/or to store the result in response to and/or as a result of the data element comparison instruction (e.g., in response to one or more instructions or control signals decoded from the data element comparison instruction). In some embodiments, the execution unit may have one or more input structures (e.g., port(s), interconnect(s), an interface) for receiving the source operands, circuitry or logic coupled to them for receiving and processing the source operands and generating the result operands, and one or more output structures (e.g., port(s), interconnect(s), an interface) coupled to that circuitry or logic for outputting the result operands. In some embodiments, the execution unit may optionally include comparison circuitry or logic coupled to the data elements of the source operands by a fully connected crossbar, so that each data element in the first packed source data operand can be compared with each data element in the second packed source data operand, allowing an all-to-all comparison to be performed. For example, in some embodiments, if there are N integer elements in the first packed source data operand and M integer elements in the second packed source data operand, then N * M comparisons can be performed.

To avoid obscuring the description, a relatively simple processor 310 has been shown and described. However, the processor may optionally include other processor components.
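The all-to-all comparison mentioned above can be pictured as an N x M grid of equality tests that is then reduced along rows and columns. The sketch below is a software illustration of that idea only; the names and the operand values are assumptions, and it says nothing about how an actual crossbar is built.

```python
def comparison_matrix(src1, src2):
    """Return an N x M grid where cell [i][j] is True if src1[i] == src2[j]."""
    return [[a == b for b in src2] for a in src1]


# Hypothetical operands with N = 4 and M = 8 integer elements.
src1 = [2, 3, 7, 9]
src2 = [3, 7, 15, 23, 31, 40, 41, 50]

grid = comparison_matrix(src1, src2)

# Total number of element-to-element comparisons: N * M.
comparison_count = sum(len(row) for row in grid)

# OR-reduce each row: does src1[i] match anything in src2?
row_any = [any(row) for row in grid]

# OR-reduce each column: does src2[j] match anything in src1?
col_any = [any(grid[i][j] for i in range(len(src1))) for j in range(len(src2))]

print(comparison_count, row_any, col_any)
```

The row reductions correspond to a mask over the first operand and the column reductions to a mask over the second operand.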
Different embodiments may, for example, include different combinations and configurations of the components shown and described in any of Figures 15-18. All of the processor components can be coupled together to enable them to operate as intended.

Fig. 4 is a block flow diagram of an embodiment of a method 436 for executing an embodiment of a data element comparison instruction. In various embodiments, the method can be executed by a processor, an instruction processing device, or another digital logic device. In some embodiments, method 436 can be executed by and/or within the processor 310 of Fig. 3. The components, features, and specific optional details described here for processor 310 also optionally apply to method 436. Alternatively, method 436 can be executed by and/or within a similar or different processor or device. Furthermore, processor 310 can execute methods that are the same as, similar to, or different from method 436.

Block 437 of the method includes receiving the data element comparison instruction. The instruction can be received at a processor or a portion thereof (e.g., an instruction fetch unit, a decoding unit, a bus interface unit, etc.). It can be received from a source outside the processor and/or off-die (e.g., from memory, an interconnect, etc.) or from a source within the processor and/or on-die (e.g., from an instruction cache, an instruction queue, etc.). The data element comparison instruction can specify or otherwise indicate a first packed source data operand containing at least four data elements, or in some cases at least eight or more data elements; a second packed source data operand containing at least four data elements, or in some cases at least eight or more data elements; and one or more destination memory locations.
In some embodiments, the data elements can represent indices corresponding to a CSR representation, although the scope of protection of the invention is not limited in this way. In block 438, at least one result mask operand can be stored at the one or more destination storage locations in response to and/or as a result of the data element comparison instruction. The at least one result mask operand can contain a different mask element for each corresponding data element in one of the first and second packed source data operands at the same relative position within the operands. Each mask element can indicate whether the corresponding data element in that one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. In some embodiments, at least two result mask operands are stored. In some embodiments, the two result mask operands can be stored in a single mask register. In other embodiments, the two result mask operands can be stored in two different mask registers. In still other embodiments, the two result mask operands can be stored in a packed data operand, for example, by storing one bit of each of the first and second result mask operands in each data element of the packed data operand.

The illustrated method includes architectural operations (e.g., those visible from a software perspective). In other embodiments, the method may optionally include one or more microarchitectural operations. For example, the instruction may be fetched, decoded, and scheduled out of order, the source operands may be accessed, an execution unit may perform the microarchitectural operations to implement the instruction, and so on. In some embodiments, the microarchitectural operations to implement the instruction may optionally include comparing each data element of the first packed source data operand with each data element of the second packed source data operand.
In some embodiments, crossbar-based hardware comparison logic may be used to perform these comparisons. In some embodiments, the method can optionally be executed during or as part of an algorithm to accelerate arithmetic between one sparse vector and another (e.g., a scalar product of two sparse vectors), although the scope of protection of the invention is not limited in this way. In some embodiments, the result mask operands stored in response to the instruction can be used to join or gather together the data elements that the result mask operands indicate as matching in the packed source data operands. In some embodiments, the result mask operands can, for example, be specified as a source operand of a masked data element join instruction and used by that instruction. In other embodiments, the result mask operand(s) can be minimally processed, and the resulting mask operand(s) can then be specified as the source operand(s) of the masked data element join instruction(s) and used by those instruction(s).

Fig. 5 is a block diagram illustrating a first exemplary embodiment of a data element comparison operation 540 that can be executed in response to a first exemplary embodiment of a data element comparison instruction. The instruction can specify or otherwise indicate a first packed source data operand 522 and a second packed source data operand 524. These source operands can be stored in packed data registers, at memory locations, or at other storage locations as previously described. In the illustrated embodiment, both the first and second packed source data operands are 512-bit operands comprising sixteen 32-bit data elements, although operands of other sizes, data elements of other sizes, and other numbers of data elements may optionally be used in other embodiments.
Typically, the number of data elements in each packed source data operand may be equal to the size in bits of the packed source data operand divided by the size in bits of a single data element. In various embodiments, the size of each of the packed source data operands may be 64 bits, 128 bits, 256 bits, 512 bits, or 1024 bits, although the scope of protection of the invention is not limited to these. In various embodiments, the size of each data element may be 8 bits, 16 bits, 32 bits, or 64 bits, although the scope of protection of the invention is not limited to these. Other sizes of the packed data operands and other sizes of the data elements are also suitable. In various embodiments, there can be at least four, at least eight, at least sixteen, at least thirty-two, or more than thirty-two data elements (e.g., at least sixty-four data elements) in each of the packed source data operands. Often, the number of data elements in the first and second packed source data operands is the same, although this is not required. For further illustration, some examples of suitable alternative formats are mentioned, although the scope of protection of the invention is not limited to only these formats. A first exemplary format is a packed 128-bit byte format containing sixteen 8-bit data elements. A second exemplary format is a packed 128-bit word format containing eight 16-bit data elements. A third exemplary format is a packed 256-bit byte format containing thirty-two 8-bit data elements. A fourth exemplary format is a packed 256-bit word format containing sixteen 16-bit data elements. A fifth exemplary format is a packed 256-bit double-word format containing eight 32-bit data elements. A sixth exemplary format is a packed 512-bit word format containing thirty-two 16-bit data elements. A seventh exemplary format is a packed 512-bit double-word format containing sixteen 32-bit data elements.
An eighth exemplary format is a packed 512-bit quad-word format containing eight 64-bit data elements.

As shown, in some embodiments, in response to the instruction and/or the operation, a first result mask operand 528 can be generated and stored in a first mask register 532-1 specified by the instruction, and a second result mask operand 530 can be generated and stored in a second mask register 532-2 specified by the instruction. In some embodiments, the first and second packed source data operands 522 and 524 can be input into an execution unit 518. The execution unit can generate and store the result mask operands in response to the instruction (e.g., as controlled by one or more instructions or one or more control signals 516 decoded from the instruction). In some embodiments, this may involve the execution unit comparing each data element in the first packed source data operand with each data element in the second packed source data operand. For example, each of the sixteen data elements in the first packed source data operand may be compared with each of the sixteen data elements in the second packed source data operand, for a total of two hundred and fifty-six comparisons. Each result mask operand can correspond to a different one of the packed source data operands. In the illustrated embodiment, for example, the first result mask operand can correspond to the first packed source data operand, while the second result mask operand can correspond to the second packed source data operand. In some embodiments, each result mask operand can have the same number of mask elements as the number of data elements in the corresponding packed source data operand.
In the illustrated embodiments, each of the mask elements is a single bit. As shown, the first result mask operand can have sixteen 1-bit mask elements, each corresponding to a different one of the sixteen data elements of the first packed source data operand at the same relative position within the operands, and the second result mask operand can have sixteen 1-bit mask elements, each corresponding to a different one of the sixteen data elements of the second packed source data operand at the same relative position within the operands. More generally, if a first packed source data operand has N data elements and a second packed source data operand has M data elements, N * M comparisons can be performed, and a first N-bit result mask corresponding to the first packed source data operand and a second M-bit result mask corresponding to the second packed source data operand can be stored. In some embodiments, each mask element can have a value (in this case, a bit value) that indicates whether its corresponding source data element (e.g., at the same relative position) in its corresponding packed source data operand matches any of the source data elements in the other, non-corresponding packed source data operand. For example, each bit in the first result mask operand can have a bit value that indicates whether its corresponding data element (e.g., at the same relative position) in the first packed source data operand matches any of the data elements in the second packed source data operand, whereas each bit in the second result mask operand can have a bit value that indicates whether its corresponding data element (e.g., at the same relative position) in the second packed source data operand matches any of the data elements in the first packed source data operand.
According to one possible convention used in the illustrated embodiment, each mask bit set to a binary one (i.e., 1) can indicate that its corresponding data element in its corresponding packed source data operand matches, or is equal to, at least one data element in the other, non-corresponding packed source data operand. Conversely, each mask bit cleared to a binary zero (i.e., 0) can indicate that its corresponding data element in its corresponding packed source data operand does not match, or is not equal to, any of the data elements in the other, non-corresponding packed source data operand. The opposite convention is also suitable for other embodiments. In the specific illustrated exemplary embodiment, the only data elements in the first packed source data operand that are equal to data elements in the second packed source data operand are those with the values 3, 7, and 23. Considering the first packed source data operand, the data element with the value 3 is located at the second data element position from the left (least significant) end, the data element with the value 7 is located at the third data element position from the left (least significant) end, and the data element with the value 23 is located at the tenth data element position from the left (least significant) end. Correspondingly, in the first result mask operand, only the second, third, and tenth mask bits from the left (least significant) end are set to a binary one (i.e., 1) to indicate that the corresponding data elements in the first packed source data operand match at least one data element in the second packed source data operand, whereas all other bits are cleared to a binary zero (i.e., 0) to indicate that the corresponding data elements in the first packed source data operand do not match any data elements in the second packed source data operand.
Similarly, considering the second packed source data operand, the data element with the value 3 is located at the first data element position from the left (least significant) end, the data element with the value 7 is located at the fourth data element position from the left (least significant) end, and the data element with the value 23 is located at the ninth data element position from the left (least significant) end. Accordingly, in the second result mask operand, only the first, fourth, and ninth mask bits from the left (least significant) end are set to a binary one (i.e., 1) to indicate that the corresponding data elements in the second packed source data operand match at least one data element in the first packed source data operand, whereas all other bits are cleared to a binary zero (i.e., 0) to indicate that the corresponding data elements in the second packed source data operand do not match any data elements in the first packed source data operand.

In some embodiments, the first and second mask registers can be registers of a set of architectural registers of a processor that are used by the masked packed data instructions of the processor's instruction set to mask, predicate, or conditionally control packed data operations. For example, in some embodiments, the first and second mask registers can be registers in the set of packed data operation mask registers 322 shown in Fig. 3.
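As a rough illustration of the worked example above, the following Python sketch emulates the all-to-all comparison in software and builds the two result masks as integers, where bit i corresponds to data element position i counted from the least significant end. Only the positions of the values 3, 7, and 23 are taken from the description; the function name `compare_elements` and the remaining filler values are hypothetical and were chosen so that nothing else matches. This is a sketch of the described behavior, not of any particular hardware implementation.

```python
def compare_elements(src1, src2):
    """Return (mask1, mask2).

    Bit i of mask1 is set if src1[i] equals any element of src2.
    Bit j of mask2 is set if src2[j] equals any element of src1.
    """
    mask1 = 0
    mask2 = 0
    for i, a in enumerate(src1):
        for j, b in enumerate(src2):  # N * M comparisons in total
            if a == b:
                mask1 |= 1 << i
                mask2 |= 1 << j
    return mask1, mask2


# Hypothetical 16-element operands. Only 3, 7, and 23 appear in both:
# in SRC1 at positions 2, 3, and 10 (1-based), in SRC2 at positions 1, 4, and 9.
SRC1 = [1, 3, 7, 9, 11, 13, 15, 17, 19, 23, 25, 27, 29, 31, 33, 35]
SRC2 = [3, 4, 6, 7, 8, 10, 12, 14, 23, 24, 26, 28, 30, 32, 34, 36]

mask1, mask2 = compare_elements(SRC1, SRC2)
# mask1 has bits 1, 2, and 9 set; mask2 has bits 0, 3, and 8 set (0-based).
```

With these operands, each mask has exactly three bits set, one per matching element, at the same relative positions as the matching elements in the corresponding source operand.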
The masked packed data instructions can be operative to specify the mask registers as source operands (for example, they can have a field to specify the mask registers as source operands) to be used to mask, predicate, or conditionally control a packed data operation. In some embodiments, the masking, predication, or conditional control can be provided at per-data-element granularity, so that operations on different data elements or pairs of corresponding data elements can be masked, predicated, or conditionally controlled separately and/or independently of the others. Each mask bit can, for example, have a first value to allow the operation to be performed and to allow the corresponding result data element to be stored at the destination, or it can have a second value to prevent the operation from being performed and/or to prevent the corresponding result data element from being stored at the destination. According to one possible convention, a mask bit cleared to a binary zero (i.e., 0) can represent a masked-out operation for which the corresponding operation should not be performed and/or the corresponding result should not be stored, whereas a mask bit set to a binary one (i.e., 1) can represent an unmasked operation for which the corresponding operation should be performed and the corresponding result should be stored. The opposite convention is also possible.

In the embodiment illustrated in Fig. 5, the first and second result mask operands are stored in different mask registers (e.g., in different packed data operation mask registers).
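The per-element masking convention described above can be sketched in Python as follows. The function name `masked_add` is hypothetical, and leaving masked-out destination elements unchanged is an assumption made for this sketch; the description only states that a masked-out result is not stored.

```python
def masked_add(dest, a, b, mask):
    """Per-element masked add.

    Bit i of mask set   -> element i is computed and stored: a[i] + b[i].
    Bit i of mask clear -> element i is masked out; dest[i] is kept as-is
                           (assumed behavior for this sketch).
    """
    return [
        x + y if (mask >> i) & 1 else d
        for i, (d, x, y) in enumerate(zip(dest, a, b))
    ]


# Only elements 0 and 2 are unmasked.
result = masked_add([9, 9, 9, 9], [1, 2, 3, 4], [10, 20, 30, 40], 0b0101)
```

Each mask bit controls exactly one element position, so the operations on different elements are enabled or suppressed independently of one another.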
A possible advantage for some embodiments is that each result mask operand and/or each mask register is directly suitable for use as a packed source data operation mask operand for a masked or predicated packed data instruction, such as a masked or predicated data element join instruction (e.g., a VPCOMPRESS instruction), although the scope of protection of the invention is not limited to such use. For example, two instances of the masked or predicated data element join instruction can each use a different one of the first and second result mask operands as a source mask operand, predicate operand, or conditional control operand for a data element join operation, substantially without requiring any additional processing of the first and second result mask operands. The unmasked bits or mask elements of the result masks or mask registers can correspond to the matching indices of the CSR tuples that were compared, and the masked or predicated data element join instruction can use these unmasked bits or mask elements to join the corresponding values of these CSR tuples. Further details on how such masked or predicated data element join instructions can be used in this way are discussed below.

Fig. 6 is a block diagram illustrating a second exemplary embodiment of a data element comparison operation 640, which can be executed in response to a second exemplary embodiment of a data element comparison instruction. Operation 640 has certain similarities to operation 540 shown in Fig. 5. To avoid obscuring the description, the different and/or additional features of operation 640 are primarily described, without repeating all of the optionally similar or common features and details of operation 540. However, it should be understood that the previously described features and details of operation 540, including its variations and alternative embodiments, may also optionally apply to operation 640 unless otherwise stated or otherwise clearly evident. As in the embodiment shown in Fig.
6, the instruction can specify or otherwise indicate a first packed source data operand 622 and a second packed source data operand 624. The first and second packed source data operands can be input into an execution unit 618. The execution unit can, in response to the instruction (e.g., as controlled by one or more instructions or one or more control signals 616 decoded from the instruction), generate and store a first result mask operand 628 and a second result mask operand 630. One difference between the embodiment according to Fig. 6 and the embodiment according to Fig. 5 is that the first and second result mask operands are stored in a single mask register 632, instead of each being stored in a separate mask register (e.g., the first mask register 532-1 and the second mask register 532-2). Specifically, the first result mask operand 628 is stored in the least significant 16 bits of the single mask register, while the second result mask operand 630 is stored in the next adjacent 16 bits of the single mask register. Alternatively, the positions of the first and second result mask operands can optionally be reversed. In either case, the least significant portion of the single mask register (e.g., the least significant 16 bits) corresponds to one of the packed source data operands (in this case, the first packed source data operand), while a more significant portion of the single mask register (e.g., the next more significant 16 bits) corresponds to the other of the packed source data operands (in this case, the second packed source data operand). In the illustration, the mask register has only 32 bits, although in other embodiments it can have fewer or more bits, such as 64 bits.
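The single-register layout described above can be sketched in Python as follows: the first 16-bit result mask occupies the least significant 16 bits, the second occupies the next 16 bits, and a right shift by 16 brings the second mask back down to the low bits. The helper names `pack_masks` and `unpack_masks` are hypothetical, and the example mask values are illustrative only.

```python
MASK_BITS = 16                     # one mask bit per data element, sixteen elements
LOW_MASK = (1 << MASK_BITS) - 1    # 0xFFFF


def pack_masks(mask1, mask2):
    """Place mask1 in bits [0:15] and mask2 in bits [16:31] of one register value."""
    return (mask1 & LOW_MASK) | ((mask2 & LOW_MASK) << MASK_BITS)


def unpack_masks(reg):
    """Recover both masks; the second one needs only a 16-bit right shift."""
    return reg & LOW_MASK, (reg >> MASK_BITS) & LOW_MASK


reg = pack_masks(0x0206, 0x0109)   # illustrative mask values
first, second = unpack_masks(reg)
```

The first mask is usable as-is from the low 16 bits, whereas the second becomes usable after the shift.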
In some embodiments, the least significant first result mask operand may be directly suitable for use as a packed source data operation mask operand for a masked packed data instruction, such as a masked or predicated data element join instruction (e.g., a VPCOMPRESS instruction), although the scope of protection of the invention is not limited to such use. Furthermore, a simple shift may be used to shift bits [16:31] of the mask register into bits [0:15], so that the second result mask operand also becomes directly suitable for use as a packed source data operation mask operand for a masked packed data instruction, such as a masked or predicated data element join instruction (e.g., a VPCOMPRESS instruction), although the scope of protection of the invention is not limited to such use.

Fig. 7 is a block diagram illustrating a third exemplary embodiment of a data element comparison operation 740, which can be executed in response to a third exemplary embodiment of the data element comparison instruction. Operation 740 has certain similarities to operation 540 shown in Fig. 5. To avoid obscuring the description, the different and/or additional features of operation 740 are primarily described, without repeating all of the optionally similar or common features and details of operation 540. However, it should be understood that the previously described features and details of operation 540, including its variations and alternative embodiments, may also optionally apply to operation 740 unless otherwise stated or otherwise clearly evident. As in the embodiment shown in Fig. 7, the instruction can specify or otherwise indicate a first packed source data operand 722 and a second packed source data operand 724. The first and second packed source data operands can be input into an execution unit 718.
The execution unit can generate and store a result in response to the instruction (e.g., as controlled by one or more instructions or one or more control signals 716 decoded from the instruction). One difference between the embodiment according to Fig. 7 and the embodiment according to Fig. 5 is that the execution unit 718 can generate and store only a single result mask operand 728. In some embodiments, the single result mask operand can be stored in a mask register (e.g., a packed data operation mask register). In some embodiments, the single result mask operand can correspond to one of the first and second packed source data operands (e.g., the first packed source data operand in the illustrated example). In some embodiments, the result mask operand 728 and/or the mask register 732 can be directly suitable for use as a packed source data operation mask operand for a masked packed data instruction, such as a masked or predicated data element join instruction (e.g., a VPCOMPRESS instruction), although the scope of protection of the invention is not limited to such use. Another instance of the instruction (with the same opcode) can then be executed to generate the result mask operand for the other packed source data operand.

Fig. 8 is a block diagram illustrating a fourth exemplary embodiment of a data element comparison operation 840, which can be executed in response to a fourth exemplary embodiment of the data element comparison instruction. Operation 840 has certain similarities to operation 540 shown in Fig. 5. To avoid obscuring the description, the different and/or additional features of operation 840 are primarily described, without repeating all of the optionally similar or common features and details of operation 540.
However, it should be understood that the previously described features and details of operation 540, including its variations and alternative embodiments, may also optionally apply to operation 840 unless otherwise stated or otherwise clearly evident. As in the embodiment shown in Fig. 8, the instruction can specify or otherwise indicate a first packed source data operand 822 and a second packed source data operand 824. The first and second packed source data operands can be input into an execution unit 818. The execution unit can, in response to the instruction (e.g., as controlled by one or more instructions or one or more control signals 816 decoded from the instruction), generate and store a first result mask operand 828 and a second result mask operand 830. One difference between the embodiment according to Fig. 8 and the embodiment according to Fig. 5 is that the execution unit 818 can generate the first and second result mask operands 828, 830 and store them in a packed result data operand 820. The packed result data operand can be stored, for example, in a packed data register, at a memory location, or at another storage location. In one embodiment, the packed result data operand or the register can be a 512-bit operand or register, although the scope of protection of the invention is not limited in this way. Another difference is that the mask bits of the first and second result mask operands can be interspersed among other, non-mask bits. As shown, two bits in each result data element of the packed result data operand can be used as mask bits. One of these two bits in each data element can be used for the first result mask operand, while the other can be used for the second result mask operand.
For example, the two least significant bits of each data element can optionally be used, the two most significant bits of each data element can optionally be used, the least significant and the most significant bits can optionally be used, or any other pair of bits can optionally be used. In the illustrated embodiment, the two least significant bits are used, with the less significant of the two used for the first result mask operand and the more significant of the two used for the second result mask operand, although this is not required.

The following pseudocode represents an exemplary implementation of a data element comparison instruction named VXBARCMPU:

    VXBARCMPU{Q|DQ} VDEST, SRC1, SRC2
    // The instruction creates 2 masks for n indices in each of SRC1 and SRC2
    // VDEST, SRC1 and SRC2 are each a register for packed data
    VDEST = 0;   // initialize; VDEST holds the final 2-bit masks
    for i ← 1 to n          // n=16 (Q) or 8 (DQ)
        for j ← 1 to n      // n=16 (Q) or 8 (DQ)
            bool match = (SRC1.element[i] == SRC2.element[j]) ? 1 : 0   // n^2 compares
            VDEST.element[i].bit[0] = VDEST.element[i].bit[0] | match;  // bit0
            VDEST.element[j].bit[1] = VDEST.element[j].bit[1] | match;  // bit1

In this pseudocode, Q represents a 32-bit quadword, while DQ represents a 64-bit double quadword. The symbol "|" represents the logical OR operation. The term "match" represents the result of the comparison for equality, e.g., of integers.

In the embodiments shown in Figures 5-8, each bit in the result mask operand provides a summary or cumulative indication of whether its corresponding source data element matches any of the source data elements in the other, non-corresponding source operand. Furthermore, in the embodiments shown in Figures 5-8, each result mask operand has the same number of mask bits as the number of data elements in its corresponding source operand.
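For readers who prefer something executable, the VXBARCMPU pseudocode can be emulated in Python roughly as follows. Each destination element is modeled as a plain integer whose bit 0 and bit 1 are the two mask bits. The function `extract_masks` is a hypothetical helper, not part of the described instruction; it shows one possible form of the light post-processing needed to turn the packed result into two conventional bit masks. The four-element operands are illustrative only, whereas the pseudocode uses n = 16 or n = 8.

```python
def vxbarcmpu(src1, src2):
    """Emulate the pseudocode: returns one integer per element.

    Bit 0 of result[i] is set if src1[i] matches any element of src2.
    Bit 1 of result[j] is set if src2[j] matches any element of src1.
    """
    if len(src1) != len(src2):
        raise ValueError("the pseudocode assumes n elements in both sources")
    n = len(src1)
    vdest = [0] * n                      # VDEST = 0
    for i in range(n):
        for j in range(n):               # n^2 comparisons
            match = int(src1[i] == src2[j])
            vdest[i] |= match            # bit 0
            vdest[j] |= match << 1       # bit 1
    return vdest


def extract_masks(vdest):
    """Hypothetical post-processing: gather bit 0 and bit 1 of each element."""
    mask1 = 0
    mask2 = 0
    for k, elem in enumerate(vdest):
        mask1 |= (elem & 1) << k
        mask2 |= ((elem >> 1) & 1) << k
    return mask1, mask2


vdest = vxbarcmpu([1, 3, 7, 9], [3, 4, 6, 7])   # illustrative 4-element operands
masks = extract_masks(vdest)
```

Here the values 3 and 7 match, so bit 0 is set in elements 1 and 2 of the result (their positions in the first source), and bit 1 is set in elements 0 and 3 (their positions in the second source).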
These mask bits are in a format generally well-suited for use as a mask operand for a masked packed data instruction, such as a masked or predicated data element join instruction (e.g., a masked VPCOMPRESS instruction). An alternative approach would be to store one bit per comparison, that is, a number of bits equal to the number of comparisons performed. Each of these bits alone would not provide a summary or cumulative indication of whether its corresponding source data element matches any of the source data elements in the other, non-corresponding operand. Instead, each of these per-comparison bits would correspond to a single comparison performed between a different combination of a data element from the first packed source data operand and a data element from the second packed source data operand. In the case of two packed source data operands, each containing N data elements, N * N comparisons can be performed, and N * N result mask bits can be stored using this alternative approach. In the case of two sixteen-data-element operands, two hundred and fifty-six comparisons can be performed, and instead of just two 16-bit result masks, a 256-bit result mask can be stored. A potential drawback of such an alternative approach, however, is that the result mask operand tends to be in a less useful and/or less efficient format for certain types of subsequent operations. For example, no single per-comparison bit indicates, without further processing, whether a data element in one source has a matching data element in the other source. These per-comparison result mask bits, as such, may not be well-suited for use as a mask operand for a masked packed data instruction, such as a masked or predicated data element join instruction (e.g., a masked VPCOMPRESS instruction), without further processing. Additionally, the extra bits provided for all comparison results may tend to consume more interconnect bandwidth, register space, power, and so on.
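To make the contrast concrete, the following Python sketch models the alternative approach as an N x N matrix of per-comparison bits and shows the further processing (an OR-reduction over rows and over columns) needed before the result can serve as a pair of per-element masks. The function names `compare_all_pairs` and `summarize` are hypothetical, and the four-element operands are illustrative only.

```python
def compare_all_pairs(src1, src2):
    """Alternative approach: one bit per comparison (len(src1) * len(src2) bits)."""
    return [[int(a == b) for b in src2] for a in src1]


def summarize(matrix):
    """Extra processing the alternative needs: OR-reduce rows and columns.

    Bit i of mask1 = OR of row i; bit j of mask2 = OR of column j.
    """
    mask1 = 0
    mask2 = 0
    for i, row in enumerate(matrix):
        for j, bit in enumerate(row):
            if bit:
                mask1 |= 1 << i
                mask2 |= 1 << j
    return mask1, mask2


matrix = compare_all_pairs([1, 3, 7, 9], [3, 4, 6, 7])
total_bits = sum(len(row) for row in matrix)    # 16 bits for N = 4
mask1, mask2 = summarize(matrix)
```

For N = 4 the per-comparison result occupies 16 bits, compared with two 4-bit summary masks; for N = 16 the same ratio gives 256 bits versus two 16-bit masks.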
In contrast, each of the first and second result mask operands 528, 530, and/or each of the first and second mask registers 532-1, 532-2, can be directly used as a source mask by a masked packed data instruction (such as a masked VPCOMPRESS instruction). Likewise, the first result mask operand 628 can be directly used as a source mask by a masked packed data instruction (such as a masked VPCOMPRESS instruction), while the second result mask operand 630 can easily be made directly usable (e.g., by a simple 16-bit shift). Similarly, the result mask operand 728 and/or mask register 732 can be directly used as a source mask by a masked packed data instruction (such as a masked VPCOMPRESS instruction).

In any of the embodiments shown in Figures 3-8, certain comparisons can optionally be avoided if it is fixed for the instruction (e.g., fixed or implied by an opcode of the instruction), or can otherwise be ensured, that the data elements of the source operands are arranged in ascending order (as may be the case, for example, when working with the indices of data in CSR format or with certain other types of data). For example, comparisons can be avoided if it can be easily determined that none of the elements in the packed source data operands would match. For example, logic could be included to test whether the lowest-value data element in the first packed source data operand is greater than the highest-value data element in the second packed source data operand, or whether the highest-value data element in the first packed source data operand is less than the lowest-value data element in the second packed source data operand, and, if either of these conditions is true, to avoid comparing every data element of one source with every data element of the other. This can help reduce power consumption, although it is optional. Fig.
9 is a block diagram of an example of a masked data element join operation 996 that can be executed in response to a masked data element join instruction. An example of such an instruction suitable for the embodiments is the VPCOMPRESS instruction in x86, although the use of this instruction is not required. The masked data element join instruction can specify a packed source data operand 997. In some embodiments, the packed source data operand can store data values corresponding to the indices of a CSR format. For example, the packed source data operand can store data values corresponding to the indices of one of the first packed source data operands 522, 622, 722, or 822. Referring again to the sparse matrix in Fig. 1, the data value a3 corresponds to index 3 of column 3, the data value a7 corresponds to index 7 of column 7, and so on. The masked data element join instruction can also specify a source mask operand 928. In various embodiments, the source mask operand can be the first result mask operand 528, the first result mask operand 628, or the result mask operand 728. Alternatively, the packed result data operand 820 can be minimally processed to generate the source mask operand 928. The packed source data operand 997 and the source mask operand 928 can be provided to an execution unit 918. The execution unit can be operative, in response to the instruction and/or operation, to store the packed result data operand 998. In some embodiments, the instruction/operation can cause the execution unit to store the active data elements of the packed source data operand 997, that is, those whose mask bits at the same relative positions in the source mask operand 928 are set to a binary one, contiguously at the least significant data element positions of the packed result data operand. All remaining data elements of the packed result data operand can be zeroed out.
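The join operation just described can be sketched in Python as follows. The function name `compress_zeroing` is hypothetical, the source values are placeholders, and the mask value is illustrative; the sketch models only the behavior stated above (active elements packed contiguously at the least significant positions, remaining elements zeroed) and is not a definition of any real instruction.

```python
def compress_zeroing(src, mask):
    """Pack the elements of src whose mask bit is set into the lowest
    positions of the result, preserving their order; zero the rest."""
    active = [value for i, value in enumerate(src) if (mask >> i) & 1]
    return active + [0] * (len(src) - len(active))


# Placeholder values 100..115; mask bits 1, 2, and 9 are set (0-based).
values = list(range(100, 116))
packed = compress_zeroing(values, (1 << 1) | (1 << 2) | (1 << 9))
```

Applying the same step to a second value operand with its own result mask would place the matching values of both operands at the same low-order positions, which is the vertical alignment that element-wise SIMD arithmetic needs.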
As shown, the three values a3, a7, and a23 of the packed source data operand, which are the only three active values with corresponding mask bits set, can be joined together at the three least significant data element positions of the packed result data operand, with all more significant result data elements set to zero. In this case, the VPCOMPRESS instruction uses zeroing masking, in which the masked-out result data elements are set to zero. Further instances of a masked data element join instruction can be executed similarly to join the matching values b3, b7, and b23 at the three least significant data element positions of another packed result data operand. The second result mask operand 530, for example, can be used together with the corresponding values from the CSR representation of row 2 of the sparse matrix 100. This approach allows the matching data values of data represented in a CSR format to be isolated, joined, and placed into vertical SIMD alignment at the same relative positions in the packed data operands. Such operations can be repeated until the ends of the vectors or rows of the sparse matrix are reached. This can facilitate efficient vertical SIMD processing of these matching data values. In one aspect, this can be used advantageously to improve the performance of arithmetic operations on sparse vectors.

Fig. 10 is a block diagram of an exemplary embodiment of a suitable set of packed data operation mask registers 1032. In the illustrated embodiment, the set contains eight registers, designated k0 to k7. Alternative embodiments may contain either fewer than eight registers (e.g., two, four, six, etc.) or more than eight registers (e.g., sixteen, thirty-two, etc.). Each of these registers can be used to store a packed data operation mask. In the illustrated embodiment, each of the registers is 64 bits wide. In alternative embodiments, the widths of the registers may be either wider than 64 bits (e.g., 80 bits, 128 bits, etc.)
or narrower than 64 bits (e.g., 8 bits, 16 bits, 32 bits, etc.). The registers may be implemented in various ways and are not restricted to any particular circuit or design type. Examples of suitable registers include, but are not limited to, dedicated physical registers and dynamically allocated physical registers using register renaming. In some embodiments, the packed data operation mask registers 1032 can be a separate, dedicated set of architectural registers. In some embodiments, instructions can encode or specify the packed data operation mask registers in different bits or in one or more different fields of an instruction format than those used to encode or specify other types of registers (e.g., packed data registers). For example, an instruction can use three bits (e.g., a 3-bit field) to encode or specify any one of the eight packed data operation mask registers k0 through k7. In alternative embodiments, either fewer or more bits can be used depending on the number of packed data operation mask registers. In a specific implementation, only the packed data operation mask registers k1 to k7 (but not k0) can be addressed as a predicate operand to predicate a masked packed data operation. Register k0 can be used as a regular source or destination, but it cannot be encoded as a predicate operand (e.g., if k0 is specified, it has a "no-mask" encoding), although this is not required. Fig. 11 is a block diagram of an exemplary embodiment of a suitable set of packed data registers 1120. The packed data registers comprise thirty-two 512-bit packed data registers, designated ZMM0 to ZMM31. In the illustrated embodiment, the lower-order 256 bits of the lower sixteen registers, namely ZMM0-ZMM15, are aliased or overlaid on the corresponding 256-bit packed data registers, designated YMM0-YMM15, although this is not necessary.
Likewise, in the illustrated embodiment, the lower-order 128 bits of registers YMM0-YMM15 are aliased or overlaid on the corresponding 128-bit packed data registers, designated XMM0-XMM15, although this is also not necessary. The 512-bit registers ZMM0 to ZMM31 are operative to hold packed 512-bit data, packed 256-bit data, or packed 128-bit data. The 256-bit registers YMM0-YMM15 are operative to hold packed 256-bit data or packed 128-bit data. The 128-bit registers XMM0-XMM15 are operative to hold packed 128-bit data. In some embodiments, each of the registers can be used to store either packed floating-point data or packed integer data. Different data element sizes are supported, including at least 8-bit byte data, 16-bit word data, 32-bit doubleword data, 32-bit single-precision floating-point data, 64-bit quadword data, and 64-bit double-precision floating-point data. In alternative embodiments, other numbers of registers and / or other register sizes can be used. In other embodiments, the registers may or may not use aliasing of larger registers onto smaller registers and / or may or may not be used to store floating-point data. An instruction set contains one or more instruction formats. A given instruction format defines various fields (number of bits, bit locations) to specify, among other things, the operation to be performed (the opcode) and the operand(s) on which the operation is to be performed. Some instruction formats are further subdivided by defining instruction templates (or subformats).
The instruction templates of a given instruction format may, for example, be defined such that they contain different subsets of the fields of the instruction format (the contained fields are typically in the same order, with at least some having different bit positions because fewer fields are included), and / or be defined such that they contain a given field that is interpreted differently.Consequently, each command of an ISA is expressed using a given command format (and in a given command template of the command templates of that command format, if defined), containing fields to specify the operation and operands. For example, an ADD command has a specific opcode and a command format that includes an opcode field to specify these opcode and operand fields to select operands (source1 / destination and source2); where an occurrence of this ADD command in a command stream has specific contents in the operand fields that select specific operands. A set of SIMD extensions, known as the Advanced Vector Extensions (AVX) (AVX1 and AVX2), which use the Vector Extension Encoding Scheme (VEX Encoding Scheme), has been released and / or published (see, for example,Intel® 64 and IA-32 Architectures Software Developers Manual, October 2011; and see Intel® Advanced Vector Extensions Programming Reference, June 2011). Example command formats The implementations of the command(s) described here can be embodied in various formats. In addition, exemplary systems, architectures, and pipelines are described in detail below. The implementations of the command(s) can be executed in such systems, architectures, and pipelines, but are not limited to those described in detail. The VEX command format VEX encoding allows instructions to have more than two operands and enables SIMD vector registers to be longer than 128 bits. Using a VEX prefix provides a syntax for three (or more) operands. 
For example, previous two-operand instructions performed operations such as A = A + B, which overwrites a source operand. Using a VEX prefix allows instructions to perform non-destructive operations, such as A = B + C. Fig. 12A illustrates an example AVX instruction format containing a VEX prefix 1202, a true opcode field 1230, a Mod R / M byte 1240, a SIB byte 1250, a displacement field 1262, and an IMM8 1272. Fig. 12B illustrates which fields from Fig. 12A form a full opcode field 1274 and a basic operation field 1242. Fig. 12C illustrates which fields from Fig. 12A form a register index field 1244. The VEX prefix (bytes 0-2) 1202 is encoded in a three-byte form. The first byte is the format field 1240 (VEX byte 0, bits [7:0]), which contains an explicit C4 byte value (the unique value used to distinguish the C4 instruction format). The second and third bytes (VEX bytes 1-2) contain a number of bit fields that provide specific capabilities. Specifically, the REX field 1205 (VEX byte 1, bits [7-5]) consists of a VEX.R bit field (VEX byte 1, bit [7] - R), a VEX.X bit field (VEX byte 1, bit [6] - X), and a VEX.B bit field (VEX byte 1, bit [5] - B). Other fields of the instructions encode the lower three bits of the register indices, as is known in the art (rrr, xxx and bbb), so that Rrrr, Xxxx and Bbbb can be formed by adding VEX.R, VEX.X and VEX.B. The opcode mapping field 1215 (VEX byte 1, bits [4:0] - mmmmm) contains the content to encode an implied leading opcode byte. The W field 1264 (VEX byte 2, bit [7] - W) is represented by the notation VEX.W and provides different functions depending on the instruction.
The role of VEX.vvvv 1220 (VEX byte 2, bits [6:3] - vvvv) can include the following: 1) VEX.vvvv encodes the first source register operand, is specified in inverted (1's complement) form, and is valid for instructions with 2 or more source operands; 2) VEX.vvvv encodes the destination register operand and is specified in 1's complement form for certain vector shifts; or 3) VEX.vvvv does not encode an operand, the field being reserved, and should contain 1111b. If the size field VEX.L 1268 (VEX byte 2, bit [2] - L) = 0, it specifies a 128-bit vector; if VEX.L = 1, it specifies a 256-bit vector. The prefix encoding field 1225 (VEX byte 2, bits [1:0] - pp) provides additional bits for the basic operation field. The true opcode field 1230 (byte 3) is also known as the opcode byte. Part of the opcode is specified in this field. The MOD R / M field 1240 (byte 4) contains a MOD field 1242 (bits [7-6]), a REG field 1244 (bits [5-3]), and an R / M field 1246 (bits [2-0]). The role of the REG field 1244 can be either to encode the destination register operand or a source register operand (the rrr of Rrrr), or to be treated as an opcode extension and not used to encode any instruction operand. The role of the R / M field 1246 can be either to encode the instruction operand that references a memory address, or to encode either the destination register operand or a source register operand. Scale, Index, Base (SIB) - The contents of the scale field 1250 (byte 5) contain SS 1252 (bits [7-6]), which is used for memory address generation. The contents of SIB.xxx 1254 (bits [5-3]) and SIB.bbb 1256 (bits [2-0]) were referred to previously with regard to the register indices Xxxx and Bbbb. The displacement field 1262 and the immediate field (IMM8) 1272 contain address data.
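The bit positions given above for the three-byte VEX prefix can be summarized in a short decoding sketch. This is an illustrative model only, assuming the layout exactly as stated in the text (byte 0 equal to C4, then the R, X, B, mmmmm, W, vvvv, L and pp fields); the function name `decode_vex3`, the dictionary keys, and the sample bytes are hypothetical, and the R, X and B bits are returned as stored without further interpretation.

```python
def decode_vex3(prefix: bytes) -> dict:
    """Split a three-byte VEX prefix into the bit fields described in the text."""
    if len(prefix) != 3 or prefix[0] != 0xC4:
        raise ValueError("not a three-byte VEX prefix (byte 0 must be C4)")
    b1, b2 = prefix[1], prefix[2]
    vvvv = (b2 >> 3) & 0xF
    return {
        "R": (b1 >> 7) & 1,          # VEX byte 1, bit [7]
        "X": (b1 >> 6) & 1,          # VEX byte 1, bit [6]
        "B": (b1 >> 5) & 1,          # VEX byte 1, bit [5]
        "mmmmm": b1 & 0x1F,          # VEX byte 1, bits [4:0]
        "W": (b2 >> 7) & 1,          # VEX byte 2, bit [7]
        "vvvv": vvvv,                # VEX byte 2, bits [6:3], as stored
        "src1_reg": (~vvvv) & 0xF,   # undo the 1's complement encoding
        "vector_bits": 256 if (b2 >> 2) & 1 else 128,  # VEX.L
        "pp": b2 & 0x3,              # VEX byte 2, bits [1:0]
    }


# Hypothetical prefix bytes chosen only to exercise each field.
fields = decode_vex3(bytes([0xC4, 0xE1, 0x75]))
print(fields["mmmmm"], fields["src1_reg"], fields["vector_bits"], fields["pp"])
# 1 1 256 1
```

A stored vvvv value of 1111b decodes to register 0 in this model, which matches the reserved value the text gives for instructions that do not use the field.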
The generic vector-friendly command format A vector-friendly command format is a command format that is suitable for vector commands (for example, it has certain fields that are specific to vector operations). While the embodiments described here support both vector and scalar operations using the vector-friendly command format, alternative embodiments use only vector operations. Figures 13A-13B are block diagrams illustrating a generic vector-friendly instruction format and its instruction templates according to embodiments of the invention. Figure 13A is a block diagram illustrating a generic vector-friendly instruction format and its Class A instruction templates according to embodiments of the invention; while Figure 13B is a block diagram illustrating the generic vector-friendly instruction format and its Class B instruction templates according to embodiments of the invention. Specifically, a generic vector-friendly instruction format 1300 for which the Class A and Class B instruction templates are defined, each of which includes a non-memory-accessible instruction template 1305 and a memory-accessible instruction template 1320. The term "generic" in the context of the vector-friendly instruction format refers to the instruction format being not bound to any specific instruction set. 
While embodiments of the invention are described in which the vector-friendly instruction format supports the following: a vector operand length (or size) of 64 bytes with data element widths (or sizes) of 32 bits (4 bytes) or 64 bits (8 bytes) (so that a 64-byte vector consists of either 16 doubleword-size elements or 8 quadword-size elements); a vector operand length (or size) of 64 bytes with data element widths (or sizes) of 16 bits (2 bytes) or 8 bits (1 byte); a vector operand length (or size) of 32 bytes with data element widths (or sizes) of 32 bits (4 bytes), 64 bits (8 bytes), 16 bits (2 bytes), or 8 bits (1 byte); and a vector operand length (or size) of 16 bytes with data element widths (or sizes) of 32 bits (4 bytes), 64 bits (8 bytes), 16 bits (2 bytes) or 8 bits (1 byte); alternative embodiments can support more, fewer and / or different sizes of vector operands (e.g., 256-byte vector operands) with more, fewer, or different widths of data elements (e.g., data element widths of 128 bits (16 bytes)). The Class A instruction templates in Fig. 13A contain the following: 1) within the no-memory-access instruction templates 1305, an instruction template of operation 1310 of the full round control type without memory access and an instruction template of operation 1315 of the data transformation type without memory access are shown; and 2) within the memory-access instruction templates 1320, a temporal instruction template 1325 with memory access and a non-temporal instruction template 1330 with memory access are shown.
The Class B instruction templates in Fig. 13B contain the following: 1) within the instruction templates without memory access 1305, an instruction template of operation 1312 of the partial round control type with write mask control and without memory access and an instruction template of operation 1317 of the VSIZE type with write mask control and without memory access are shown; and 2) within the instruction templates with memory access 1320, an instruction template with write mask control 1327 and memory access is shown. The generic vector-friendly instruction format 1300 contains the following fields, which are listed below in the order illustrated in Figs. 13A-13B. The format field 1340 - a specific value (an instruction format identifier value) in this field uniquely identifies the vector-friendly instruction format and, consequently, the occurrences of instructions in the vector-friendly instruction format within instruction streams. This field is optional in the sense that it is not required for an instruction set that contains only the generic vector-friendly instruction format. The basic operation field 1342 - its content distinguishes the various basic operations. The register index field 1344 - its contents specify, directly or through address generation, the locations of the source and destination operands, whether they reside in registers or in memory. It contains a sufficient number of bits to select N registers from a P × Q register file (e.g., 32 × 512, 16 × 128, 32 × 1024, 64 × 1024). While in one embodiment N can be up to three source registers and one destination register, alternative embodiments can support more or fewer source and destination registers (e.g., they can support up to two sources, with one of these sources also acting as the destination; they can support up to three sources, with one of these sources also acting as the destination; or they can support up to two sources and one destination).
Modifier field 1346 - its contents distinguish between occurrences of instructions in the generic vector instruction format that specify memory access and those that do not; that is, between the no-memory-access instruction templates 1305 and the memory-access instruction templates 1320. The memory-access operations read and / or write to the memory hierarchy (in some cases specifying the source and / or destination addresses using the values in the registers), while the no-memory-access operations do not (e.g., the source and destinations are registers). While in one embodiment this field also distinguishes between three different ways to perform memory address calculations, alternative embodiments may support more, fewer, or different ways to perform memory address calculations. The magnification operation field 1350 - its contents distinguish which of a variety of different operations is to be performed in addition to the basic operation. This field is context-specific. In one embodiment of the invention, this field is subdivided into a class field 1368, an alpha field 1352, and a beta field 1354. The magnification operation field 1350 enables common groups of operations to be executed in a single instruction instead of in 2, 3, or 4 instructions. The scale field 1360 - its content allows scaling of the content of the index field for memory address generation (e.g., for address generation that uses 2^scale * index + base). The displacement field 1362A - its contents are used as part of memory address generation (e.g., for address generation that uses 2^scale * index + base + displacement).
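The address generation expressions named for the scale field and the displacement field can be sketched as a small calculation. This is a hedged illustration only: the function name `effective_address` and the sample values are hypothetical, and the restriction of the scale value to the range 0 to 3 is an assumption based on a two-bit scale encoding, which the text states for the SIB byte but not for field 1360.

```python
def effective_address(base: int, index: int, scale: int, displacement: int = 0) -> int:
    """Compute 2**scale * index + base + displacement."""
    if scale not in (0, 1, 2, 3):
        # Assumption: a two-bit scale field, so the multiplier is 1, 2, 4 or 8.
        raise ValueError("scale must be in the range 0..3")
    return (index << scale) + base + displacement


# Hypothetical values: base 0x1000, index 5, scale 3 (multiply by 8), displacement 16.
addr = effective_address(base=0x1000, index=5, scale=3, displacement=16)
print(hex(addr))  # 0x1038
```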
The displacement factor field 1362B (it should be noted that the juxtaposition of displacement field 1362A directly above displacement factor field 1362B indicates that one or the other is used) - its contents are used as part of address generation; it specifies a displacement factor to be scaled by the size of a memory access (N), where N is the number of bytes in the memory access (e.g., for address generation using 2^scale * index + base + scaled displacement). Redundant lower-order bits are ignored, so the contents of the displacement factor field are multiplied by the total size (N) of the memory operand to produce the final displacement to be used when calculating an effective address. The value of N is determined by the processor hardware at runtime based on the full opcode field 1374 (described later) and the data manipulation field 1354C. The displacement field 1362A and the displacement factor field 1362B are optional in the sense that they are not used for the no-memory-access instruction templates 1305 and / or different implementations may implement only one or neither of them. The data element width field 1364 - its content specifies which of a number of data element widths is to be used (in some embodiments for all instructions; in other embodiments only for some of the instructions). This field is optional in the sense that it is not required if only one data element width is supported and / or the data element widths are supported using some aspect of the opcodes. The write mask field 1370 - its content controls, on a per-data-element-position basis, whether the data element position in the target vector operand reflects the result of the base operation and the magnification operation. Class A instruction templates support merging write masking, while Class B instruction templates support both merging and zeroing write masking.
When merging, the vector masks allow any set of elements in the target to be protected from updates during the execution of any operation (specified by the base operation and the magnification operation); in one embodiment, the old value of each element in the target is retained where the corresponding mask bit is 0. In contrast, when zeroing, the vector masks allow any set of elements in the target to be zeroed during the execution of any operation (specified by the base operation and the magnification operation); in one embodiment, an element of the target is set to 0 if the corresponding mask bit has a value of 0. A subset of this functionality is the ability to control the vector length of the operation being performed (i.e., the span of elements being modified, from the first to the last); however, it is not necessary for the elements being modified to be consecutive. Consequently, the write mask field 1370 enables partial vector operations, including load operations, store operations, arithmetic operations, logical operations, and so on. While embodiments of the invention are described in which the content of the write mask field 1370 selects one of a number of write mask registers that contains the write mask to be used (so that the content of the write mask field 1370 indirectly identifies the masking to be performed), alternative embodiments instead or additionally allow the content of the write mask field 1370 to directly specify the masking to be performed. The immediate field 1372 - its content allows the specification of an immediate value. This field is optional in the sense that it is not present in an implementation of the generic vector-friendly format that does not support immediate values, and it is not present in instructions that do not use immediate values. The class field 1368 - its contents distinguish between different classes of instructions. In Figures 13A-B, the contents of this field select between instructions of class A and class B.
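The difference between merging and zeroing write masking described above can be shown with a short software model. This is a sketch under stated assumptions rather than a description of the hardware: the function name `apply_write_mask`, the list-based vectors, and the sample values are hypothetical, and bit i of the mask is taken to control element i.

```python
def apply_write_mask(result, old_target, mask_bits, zeroing):
    """Combine a computed result with the old target under a write mask.

    Where the mask bit is 1 the new result element is written. Where it is 0,
    merging keeps the old target element and zeroing writes 0.
    """
    out = []
    for i, (new, old) in enumerate(zip(result, old_target)):
        if (mask_bits >> i) & 1:
            out.append(new)
        else:
            out.append(0 if zeroing else old)
    return out


# Hypothetical element-wise addition used as the base operation.
a = [1, 2, 3, 4]
b = [10, 20, 30, 40]
summed = [x + y for x, y in zip(a, b)]   # [11, 22, 33, 44]
old_target = [9, 9, 9, 9]
mask = 0b0101                            # elements 0 and 2 are active

print(apply_write_mask(summed, old_target, mask, zeroing=False))  # [11, 9, 33, 9]
print(apply_write_mask(summed, old_target, mask, zeroing=True))   # [11, 0, 33, 0]
```

The active elements need not be consecutive, which corresponds to the statement that the modified elements do not have to form a contiguous span.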
In Figures 13A-B, the rectangles with rounded corners are used to indicate a specific value present in a field (e.g., class A 1368A and class B 1368B for the class field 1368 in Figures 13A-B). The instruction templates of class A In the case of the Class A no-memory-access instruction templates 1305, the alpha field 1352 is interpreted as an RS field 1352A, the content of which distinguishes which of the various magnification operation types is to be executed (e.g., rounding 1352A.1 and data transformation 1352A.2 are specified for the no-memory-access instruction templates of operation 1310 of the round type and operation 1315 of the data transformation type, respectively), while the beta field 1354 distinguishes which of the operations of the specified type is to be executed. The scale field 1360, the displacement field 1362A, and the displacement scale field 1362B are not present in the no-memory-access instruction templates 1305. Instruction templates without memory access - the operation of the full round control type In the instruction template of operation 1310 of the full round control type without memory access, the beta field 1354 is interpreted as a round control field 1354A, the content(s) of which provide static rounding. While in the described embodiments of the invention the round control field 1354A contains a field 1356 for suppressing all floating-point exceptions (SAE) and a rounding operation control field 1358, alternative embodiments can support the encoding of these two concepts in the same field or can have only one or the other of these concepts / fields (e.g., can have only the rounding operation control field 1358). The SAE field 1356 - its content distinguishes whether exception event reporting is to be disabled or not; if the content of the SAE field 1356 indicates that suppression is enabled, a given instruction does not report any kind of floating-point exception flag and does not raise any floating-point exception handler.
The rounding operation control field 1358 - its contents distinguish which of a group of rounding operations is to be performed (e.g., round up, round down, round toward zero, and round to nearest). Consequently, the rounding operation control field 1358 allows the rounding mode to be changed on a per-instruction basis. In an embodiment of the invention in which a processor includes a control register for specifying the rounding modes, the contents of the rounding operation control field 1358 override this register value. Instruction templates without memory access - the operation of the data transformation type In the instruction template of operation 1315 of the data transformation type without memory access, the beta field 1354 is interpreted as a data transformation field 1354B, the content of which distinguishes which of a number of data transformations is to be performed (e.g., no data transformation, swizzle, broadcast). In the case of a Class A instruction template with memory access 1320, the alpha field 1352 is interpreted as an eviction hint field 1352B, the content of which distinguishes which eviction hint is to be used (in Fig. 13A, temporal 1352B.1 and non-temporal 1352B.2 are specified for the temporal instruction template with memory access 1325 and the non-temporal instruction template with memory access 1330, respectively), while the beta field 1354 is interpreted as a data manipulation field 1354C, the content of which distinguishes which of a number of data manipulation operations (also known as primitives) is to be performed (e.g., no manipulation; broadcast; up-conversion of a source; and down-conversion of a target). The instruction templates with memory access 1320 contain the scale field 1360 and optionally the displacement field 1362A or the displacement scale field 1362B. The vector storage commands perform vector load operations from and vector store operations to memory with conversion support.
Like regular vector commands, the vector storage commands transfer data to and from memory in a data-element-wise manner, with the elements actually transferred being prescribed by the contents of the vector mask selected as the write mask. The command templates with memory access - time-based Temporal data refers to data that is likely to be reused soon enough to benefit from cache storage. However, this is a hint, and different processors may implement it in different ways, including ignoring it entirely. The command templates with memory access - not time-based It is unlikely that the non-temporal data will be reused soon enough to benefit from cache storage in the Level 1 cache, where it should be given priority for clearing. However, this is a hint, and different processors may implement it in different ways, including ignoring the hint entirely. The command templates of class B In the case of the command templates of class B, the alpha field 1352 is interpreted as a write mask control field 1352C (Z), the content of which distinguishes whether the write masking controlled by the write mask field 1370 should be a merge or a zeroing. In the case of the command templates without memory access 1305 of class B, part of the beta field 1354 is interpreted as an RL field 1357A, the content of which distinguishes which of the various enlargement operation types is to be executed (e.g., rounding 1357A.1 and vector length (VSIZE) 1357A.2 are respectively specified for the command template of operation 1312 of the type of partial round control with write mask control and without memory access, and the command template of operations 1317 of the VSIZE type with write mask control and without memory access), while the rest of the beta field 1354 distinguishes which of the operations of the specified type is to be executed. 
The instruction templates without memory access 1305 do not include the scale field 1360, the displacement field 1362A and the displacement scale field 1362B. In the instruction template of operation 1312 of the partial round control type with write mask control and without memory access, the remainder of the beta field 1354 is interpreted as a rounding operation field 1359A, and exception event reporting is disabled (a given instruction does not report any kind of floating-point exception flag and does not raise any floating-point exception handler). The rounding operation control field 1359A - just like the rounding operation control field 1358, its contents distinguish which of a group of rounding operations is to be performed (e.g., round up, round down, round toward zero, and round to nearest). Consequently, the rounding operation control field 1359A allows the rounding mode to be changed on a per-instruction basis. In an embodiment of the invention in which a processor includes a control register for specifying the rounding modes, the contents of the rounding operation control field 1359A override this register value. In the instruction template of operation 1317 of the VSIZE type with write mask control and without memory access, the remainder of the beta field 1354 is interpreted as a vector length field 1359B, the content of which distinguishes which of a number of data vector lengths is to be used (e.g., 128, 256 or 512 bytes). In the case of a Class B instruction template with memory access 1320, part of the beta field 1354 is interpreted as a broadcast field 1357B, the content of which determines whether a data manipulation operation of the broadcast type is to be performed, while the remainder of the beta field 1354 is interpreted as the vector length field 1359B. The instruction templates with memory access 1320 contain the scale field 1360 and optionally the displacement field 1362A or the displacement scale field 1362B.
Regarding the generic vector-friendly command format 1300, it is shown that a full opcode field 1374 contains the format field 1340, the basic operation field 1342, and the data element width field 1364. While one embodiment is shown in which the full opcode field 1374 contains all of these fields, in embodiments that do not support all of them, the full opcode field 1374 contains fewer than all of these fields. The full opcode field 1374 provides the operation code (the opcode). The magnification operation field 1350, the data element width field 1364 and the write mask field 1370 allow these features to be specified on a per-command basis in the generic vector-friendly command format. The combination of the write mask field and the data element width field creates typed commands because it allows the mask to be applied based on different data element widths. The various instruction templates found within Class A and Class B are advantageous in different situations. In some embodiments of the invention, different processors or different cores within a processor can support only Class A, only Class B, or both classes. A high-performance out-of-order general-purpose core intended for general-purpose computing can support only Class B; a core intended primarily for graphics and / or scientific (throughput) computing can support only Class A; and a core intended for both can support both (obviously, a core that incorporates any mixture of templates and instructions from both classes, but not all templates and instructions from both classes, is within the scope of this invention).Furthermore, a single processor can contain multiple cores, all of which support the same class, or in which different cores support different classes. 
For example, in a processor with separate graphics and general-purpose cores, one of the graphics cores, primarily intended for graphics and / or scientific computing, may support only class A, while one or more of the general-purpose cores may be high-performance general-purpose cores with out-of-order execution and register renaming, intended for general-purpose computing, and supporting only class B. Another processor, lacking a separate graphics core, may contain one or more in-order or out-of-order general-purpose cores that support both class A and class B. Naturally, the features of one class may also be implemented in other embodiments of the invention in the other class.Programs written in a higher-level language would be translated (e.g., time-compiled or statically compiled) into a variety of different executable forms, including the following: 1) a form containing only the instructions of the class(es) supported for execution by the target processor; or 2) a form containing alternative routines written using different combinations of instructions from all classes, and containing control flow code that selects the routines to be executed based on the instructions supported by the processor currently executing the code. An example of a specific vector-friendly command format Fig. 14 is a block diagram illustrating an exemplary specific vector-friendly instruction format according to embodiments of the invention. Fig. 14 shows a specific vector-friendly instruction format 1400, which is specific in that it specifies the location, size, interpretation, and order of the fields, as well as the values for some of these fields. The specific vector-friendly instruction format 1400 can be used to extend the x86 instruction set, with some of the fields being similar to or the same as those used in the existing x86 instruction set and its extension (e.g., AVX). 
This format remains consistent with the prefix encoding field, the true opcode byte field, the MOD R / M field, the SIB field, the displacement field, and the immediate fields of the existing x86 instruction set with extensions. The fields of Fig. 13, into which the fields of Fig. 14 map, are illustrated. It should be recognized that, although the embodiments of the invention are described with reference to the specific vector-friendly instruction format 1400 in the context of the generic vector-friendly instruction format 1300 for illustrative purposes, the invention is not limited to the specific vector-friendly instruction format 1400, except where claimed. The generic vector-friendly instruction format 1300, for example, contemplates various possible sizes for the different fields, whereas the specific vector-friendly instruction format 1400 is shown to have fields with specific sizes. While, as a specific example, the data element width field 1364 is illustrated as a one-bit field in the specific vector-friendly instruction format 1400, the invention is not limited in this way (i.e., the generic vector-friendly instruction format 1300 contemplates other sizes of the data element width field 1364). The generic vector-friendly instruction format 1300 contains the following fields, which are listed below in the order illustrated in Fig. 14A. The EVEX prefix (bytes 0-3) 1402 - is encoded in a four-byte form. The format field 1340 (EVEX byte 0, bits [7:0]) - the first byte (EVEX byte 0) is the format field 1340, containing 0x62 (the unique value used to distinguish the vector-friendly instruction format in one embodiment of the invention). The second through fourth bytes (EVEX bytes 1-3) contain a number of bit fields that provide specific capabilities. The REX field 1405 (EVEX byte 1, bits [7-5]) consists of an EVEX.R bit field (EVEX byte 1, bit [7] - R), an EVEX.X bit field (EVEX byte 1, bit [6] - X), and an EVEX.B bit field (EVEX byte 1, bit [5] - B).
The EVEX.R, EVEX.X, and EVEX.B bit fields provide the same functionality as the corresponding VEX bit fields and are encoded in one's complement form; that is, ZMM0 is encoded as 1111B, and ZMM15 is encoded as 0000B. Other fields of the instructions encode the lower three bits of the register indices, as is known in the art (rrr, xxx, and bbb), so that Rrrr, Xxxx, and Bbbb can be formed by adding EVEX.R, EVEX.X, and EVEX.B.

The REX' field 1310 - this is the first part of the REX' field 1310 and is the EVEX.R' bit field (EVEX byte 1, bit [4] - R'), which is used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. In one embodiment of the invention, this bit, along with others as specified below, is stored in a bit-inverted format to distinguish it (in the well-known x86 32-bit mode) from the BOUND instruction, whose true opcode byte is 62 but which does not accept the value 11 in the MOD field of the MOD R/M field (described below); alternative embodiments of the invention do not store this and the other bits specified below in the inverted format. A value of 1 is used to encode the lower 16 registers. In other words, R'Rrrr is formed by combining EVEX.R', EVEX.R, and the other rrr from the other fields.

The opcode mapping field 1415 (EVEX byte 1, bits [3:0] - mmmm) - its contents encode an implied leading opcode byte (0F, 0F 38, or 0F 3A).

The data element width field 1364 (EVEX byte 2, bit [7] - W) - is represented by the notation EVEX.W. EVEX.W is used to define the granularity (the size) of the data type (either 32-bit data elements or 64-bit data elements).
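The formation of the five-bit register index R'Rrrr described above can be sketched as follows. This is a minimal illustration, not the claimed hardware: the function name `reg_index` is hypothetical, and the sketch assumes the embodiment in which EVEX.R and EVEX.R' are stored bit-inverted while the rrr bits (taken here from the REG field of the MOD R/M byte) are stored as-is.

```python
def reg_index(evex_byte1: int, modrm: int) -> int:
    """Form the 5-bit register index R'Rrrr (hypothetical helper).

    Assumes EVEX.R (byte 1, bit 7) and EVEX.R' (byte 1, bit 4) are stored
    inverted, and rrr is the REG field (bits 5:3) of the MOD R/M byte.
    """
    r_stored = (evex_byte1 >> 7) & 1        # EVEX.R as stored (inverted)
    r_prime_stored = (evex_byte1 >> 4) & 1  # EVEX.R' as stored (inverted)
    rrr = (modrm >> 3) & 0b111              # lower three index bits, not inverted
    # Undo the inversion, then concatenate R' : R : rrr.
    return ((r_prime_stored ^ 1) << 4) | ((r_stored ^ 1) << 3) | rrr


# Stored R' = 1 and R = 1 select the lowest register group; rrr = 000 gives index 0.
print(reg_index(0b1001_0000, 0b00_000_000))
# Stored R' = 0 and R = 0 with rrr = 111 give the highest index of the 32-register set.
print(reg_index(0b0000_0000, 0b00_111_000))
```

With a stored R' of 1, the index always falls in the lower 16 registers, which matches the statement that a value of 1 encodes the lower 16 registers.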
The EVEX.vvvv field 1420 (EVEX byte 2, bits [6:3] - vvvv) - the role of EVEX.vvvv can be as follows: 1) EVEX.vvvv encodes the first source register operand, specified in inverted (one's complement) form, and is valid for instructions with two or more source operands; 2) EVEX.vvvv encodes the destination register operand, specified in one's complement form, for certain vector shifts; or 3) EVEX.vvvv does not encode any operand, the field is reserved, and it should contain 1111b. Consequently, the EVEX.vvvv field 1420 encodes the four low-order bits of the first source register specifier, stored in inverted (one's complement) form. Depending on the instruction, an additional EVEX bit field is used to extend the specifier size to 32 registers.

The class field 1368 (EVEX.U; EVEX byte 2, bit [2] - U) - if EVEX.U = 0, it specifies class A or EVEX.U0; if EVEX.U = 1, it specifies class B or EVEX.U1.

The prefix encoding field 1425 (EVEX byte 2, bits [1:0] - pp) - provides additional bits for the basic operation field. In addition to providing support for the legacy SSE instructions in the EVEX prefix format, this also has the advantage of compacting the SIMD prefix (instead of requiring one byte to express the SIMD prefix, the EVEX prefix requires only two bits). To support the legacy SSE instructions that use a SIMD prefix (66H, F2H, F3H) in both the legacy format and the EVEX prefix format, in one embodiment these legacy SIMD prefixes are encoded in the SIMD prefix encoding field.
At runtime, they are expanded into the legacy SIMD prefix before being provided to the decoder's PLA (so the PLA can execute both the legacy and EVEX formats of these legacy instructions without modification). Although newer instructions could use the contents of the EVEX prefix encoding field directly as an opcode extension, certain embodiments expand it in a similar way for consistency but allow different meanings to be specified through these legacy SIMD prefixes. An alternative embodiment can redesign the PLA to support the 2-bit SIMD prefix encodings and consequently does not require the expansion.

The alpha field 1352 (EVEX byte 3, bit [7] - EH; also known as EVEX.EH, EVEX.rs, EVEX.RL, EVEX.WriteMaskControl, and EVEX.N; also illustrated with α) - as previously described, this field is context-specific.

The beta field 1354 (EVEX byte 3, bits [6:4] - SSS; also known as EVEX.s2-0, EVEX.r2-0, EVEX.rr1, EVEX.LL0, and EVEX.LLB; also illustrated with βββ) - as previously described, this field is context-specific.

The REX' field 1310 - this is the remainder of the REX' field and is the EVEX.V' bit field (EVEX byte 3, bit [3] - V'), which can be used to encode either the upper 16 or the lower 16 registers of the extended 32-register set. This bit is stored in a bit-inverted format. A value of 1 is used to encode the lower 16 registers. In other words, V'VVVV is formed by combining EVEX.V' and EVEX.vvvv.

The write mask field 1370 (EVEX byte 3, bits [2:0] - kkk) - its contents specify the index of a register in the write mask registers, as previously described. In one embodiment of the invention, the specific value EVEX.kkk = 000 has a special behavior implying that no write mask is used for the particular instruction (this can be implemented in various ways, including the use of a write mask that is hardwired to all ones, or hardware that bypasses the masking hardware).

The actual opcode field 1430 (byte 4) is also known as the opcode byte.
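As an illustration of the byte and bit positions listed above, the following sketch splits a four-byte EVEX prefix into its raw bit fields. It is only a sketch under stated assumptions: the names `EvexPrefix` and `parse_evex_prefix` are hypothetical, the fields are returned exactly as stored (no inversion is undone), and no claim is made that a hardware decoder works this way.

```python
from typing import NamedTuple


class EvexPrefix(NamedTuple):
    """Raw (as-stored) bit fields of EVEX bytes 1-3; names are illustrative."""
    R: int
    X: int
    B: int
    R_prime: int
    mmmm: int
    W: int
    vvvv: int
    U: int
    pp: int
    alpha: int
    beta: int
    V_prime: int
    kkk: int


def parse_evex_prefix(prefix: bytes) -> EvexPrefix:
    """Split a 4-byte EVEX prefix into fields using the positions given in the text."""
    if len(prefix) != 4 or prefix[0] != 0x62:
        raise ValueError("expected four bytes starting with the format byte 0x62")
    _, b1, b2, b3 = prefix
    return EvexPrefix(
        R=(b1 >> 7) & 1,         # EVEX byte 1, bit [7]
        X=(b1 >> 6) & 1,         # EVEX byte 1, bit [6]
        B=(b1 >> 5) & 1,         # EVEX byte 1, bit [5]
        R_prime=(b1 >> 4) & 1,   # EVEX byte 1, bit [4]
        mmmm=b1 & 0xF,           # EVEX byte 1, bits [3:0]
        W=(b2 >> 7) & 1,         # EVEX byte 2, bit [7]
        vvvv=(b2 >> 3) & 0xF,    # EVEX byte 2, bits [6:3]
        U=(b2 >> 2) & 1,         # EVEX byte 2, bit [2]
        pp=b2 & 0b11,            # EVEX byte 2, bits [1:0]
        alpha=(b3 >> 7) & 1,     # EVEX byte 3, bit [7]
        beta=(b3 >> 4) & 0b111,  # EVEX byte 3, bits [6:4]
        V_prime=(b3 >> 3) & 1,   # EVEX byte 3, bit [3]
        kkk=b3 & 0b111,          # EVEX byte 3, bits [2:0]
    )


fields = parse_evex_prefix(bytes([0x62, 0xF1, 0x7C, 0x48]))
print(fields)
```

In this example prefix, the stored vvvv value is 1111b and kkk is 000, which correspond to the "no operand encoded" and "no write mask" cases described above.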
Part of the opcode is specified in the actual opcode field 1430.

The MOD R/M field 1440 (byte 5) contains the MOD field 1442, the REG field 1444, and the R/M field 1446. As previously described, the contents of the MOD field 1442 distinguish between operations with and without memory access. The role of the REG field 1444 can be summarized in two situations: it either encodes the destination register operand or a source register operand, or it is treated as an opcode extension and is not used to encode any instruction operand. The role of the R/M field 1446 can include encoding the instruction operand that references a memory address, or encoding either the destination register operand or a source register operand.

The Scale, Index, Base (SIB) byte (byte 6) - as previously described, the contents of scale field 1350 are used for memory address generation. SIB.xxx 1454 and SIB.bbb 1456 - the contents of these fields were previously referenced with regard to register indices Xxxx and Bbbb.

The displacement field 1362A (bytes 7-10) - if the MOD field 1442 contains 10, bytes 7-10 are the displacement field 1362A, which works in the same way as the legacy 32-bit displacement (disp32) and operates at byte granularity.

The displacement factor field 1362B (byte 7) - if the MOD field 1442 contains 01, byte 7 is the displacement factor field 1362B. The location of this field is the same as that of the 8-bit displacement (disp8) of the legacy x86 instruction set, which operates at byte granularity. Because disp8 is sign-extended, it can only address offsets between -128 and 127 bytes; with respect to 64-byte cache lines, disp8 uses 8 bits that can be set to only four really useful values: -128, -64, 0, and 64. Because a larger range is often needed, disp32 is used; however, disp32 requires 4 bytes.
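The range limitation of a sign-extended one-byte displacement can be checked with a short sketch. The helper names are hypothetical; the optional scaling parameter `n` anticipates the scaled interpretation (disp8 * N) that the text describes next, with N = 64 chosen here purely as an example operand size.

```python
def sign_extend8(byte: int) -> int:
    """Interpret an 8-bit value (0..255) as a two's-complement signed integer."""
    if not 0 <= byte <= 0xFF:
        raise ValueError("disp8 must fit in one byte")
    return byte - 0x100 if byte & 0x80 else byte


def byte_offset(disp8_byte: int, n: int = 1) -> int:
    """Byte offset for a one-byte displacement.

    n = 1 models the plain byte-granular disp8; n > 1 models a displacement
    scaled by the memory operand size N (an assumed example value).
    """
    return sign_extend8(disp8_byte) * n


# Plain disp8 covers -128..127 bytes.
print(byte_offset(0x80), byte_offset(0x7F))

# Only four of those offsets are multiples of a 64-byte cache line.
useful = [d for d in range(-128, 128) if d % 64 == 0]
print(useful)

# The same single byte, scaled by N = 64, reaches a much larger range.
print(byte_offset(0x80, 64), byte_offset(0x7F, 64))
```

The encoding stays one byte long in both cases; only the interpretation of the stored value changes.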
Unlike disp8 and disp32, the displacement factor field 1362B is a reinterpretation of disp8; when the displacement factor field 1362B is used, the actual displacement is determined by the content of the displacement factor field multiplied by the size of the memory operand access (N). This type of displacement is called disp8 * N. This reduces the average instruction length (a single byte is used for the displacement, but with a much larger range). Such a compressed displacement is based on the assumption that the effective displacement is a multiple of the memory access granularity, so the redundant low-order bits of the address offset do not need to be encoded. In other words, the displacement factor field 1362B replaces the 8-bit displacement of the legacy x86 instruction set. Consequently, the displacement factor field 1362B is encoded in the same way as an 8-bit displacement of the x86 instruction set (so there are no changes to the ModRM/SIB encoding rules), with the sole exception that disp8 is overloaded to disp8 * N. In other words, there are no changes to the encoding rules or encoding lengths, only to the hardware's interpretation of the displacement value (the hardware must scale the displacement by the size of the memory operand to obtain a byte-wise address offset). The immediate field 1372 works as previously described.

The complete opcode field

Fig. 14B is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that constitute the complete opcode field 1374 according to one embodiment of the invention. Specifically, the complete opcode field 1374 includes the format field 1340, the basic operation field 1342, and the data element width field 1364 (W). The basic operation field 1342 includes the prefix encoding field 1425, the opcode mapping field 1415, and the actual opcode field 1430.

The register index field

Fig. 14C is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that form the register index field 1344 according to one embodiment of the invention.
Specifically, the register index field 1344 includes the REX field 1405, the REX' field 1410, the MODR/M.reg field 1444, the MODR/M.r/m field 1446, the VVVV field 1420, the xxx field 1454, and the bbb field 1456.

The augmentation operation field

Fig. 14D is a block diagram illustrating the fields of the specific vector-friendly instruction format 1400 that form the augmentation operation field 1350 according to one embodiment of the invention. If the class field 1368 (U) contains 0, it signifies EVEX.U0 (class A 1368A); if it contains 1, it signifies EVEX.U1 (class B 1368B). If U = 0 and the MOD field 1442 contains 11 (signifying a no-memory-access operation), the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the rs field 1352A. If the rs field 1352A contains a 1 (round 1352A.1), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the round control field 1354A. The round control field 1354A contains a one-bit SAE field 1356 and a two-bit round operation field 1358. If the rs field 1352A contains a 0 (data transform 1352A.2), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data transform field 1354B. If U = 0 and the MOD field 1442 contains 00, 01, or 10 (signifying a memory access operation), the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the eviction hint field 1352B (EH field), and the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as a three-bit data manipulation field 1354C. If U = 1, the alpha field 1352 (EVEX byte 3, bit [7] - EH) is interpreted as the write mask control field 1352C (Z).
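The class A (U = 0) cases above amount to a small decision tree over the MOD field and the alpha bit. The sketch below only names which role the alpha and beta fields take in each case; the function name and the returned labels are illustrative, and the internal bit layout of the round control field is deliberately left undecoded because the text does not specify it.

```python
def interpret_class_a(mod: int, alpha: int, beta: int) -> dict:
    """Name the roles of alpha (1 bit) and beta (3 bits) when EVEX.U = 0.

    Hypothetical helper: it labels the fields and passes the raw beta bits
    through without decoding them further.
    """
    if mod not in (0b00, 0b01, 0b10, 0b11) or alpha not in (0, 1) or not 0 <= beta <= 7:
        raise ValueError("mod is 2 bits, alpha is 1 bit, beta is 3 bits")
    if mod == 0b11:
        # No memory access: alpha acts as the rs field 1352A.
        if alpha == 1:
            return {"alpha": "rs field 1352A = round (1352A.1)",
                    "beta": "round control field 1354A",
                    "beta_bits": beta}
        return {"alpha": "rs field 1352A = data transform (1352A.2)",
                "beta": "data transform field 1354B",
                "beta_bits": beta}
    # MOD = 00, 01, or 10: memory access operation.
    return {"alpha": "eviction hint field 1352B",
            "beta": "data manipulation field 1354C",
            "beta_bits": beta}


print(interpret_class_a(0b11, 1, 0b010)["beta"])
print(interpret_class_a(0b01, 0, 0b000)["alpha"])
```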
If U = 1 and the MOD field 1442 contains 11 (signifying a no-memory-access operation), part of the beta field 1354 (EVEX byte 3, bit [4] - S0) is interpreted as the RL field 1357A; if it contains a 1 (round 1357A.1), the remainder of the beta field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the round operation field 1359A, whereas if the RL field 1357A contains a 0 (VSIZE 1357.A2), the remainder of the beta field 1354 (EVEX byte 3, bits [6-5] - S2-1) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0). If U = 1 and the MOD field 1442 contains 00, 01, or 10 (signifying a memory access operation), the beta field 1354 (EVEX byte 3, bits [6:4] - SSS) is interpreted as the vector length field 1359B (EVEX byte 3, bits [6-5] - L1-0) and the broadcast field 1357B (EVEX byte 3, bit [4] - B).

An exemplary register architecture

Fig. 15 is a block diagram of a register architecture 1500 according to an embodiment of the invention. In the illustrated embodiment, there are 32 vector registers 1510, each 512 bits wide; these registers are designated zmm0 to zmm31. The 256 low-order bits of the lower 16 zmm registers are overlaid on the registers ymm0-15. The 128 low-order bits of the lower 16 zmm registers (the 128 low-order bits of the ymm registers) are overlaid on the registers xmm0-15. The specific vector-friendly instruction format 1400 acts on this overlaid register file as illustrated in the tables below.
Instruction templates that do not contain the vector length field 1359B:
  A (Fig. 13A; U = 0): 1310, 1315, 1325, 1330; zmm registers (the vector length is 64 bytes)
  B (Fig. 13B; U = 1): 1312; zmm registers (the vector length is 64 bytes)
Instruction templates that contain the vector length field 1359B:
  B (Fig. 13B; U = 1): 1317, 1327; zmm, ymm, or xmm registers (the vector length is 64 bytes, 32 bytes, or 16 bytes, depending on the vector length field 1359B)

In other words, the vector length field 1359B selects between a maximum length and one or more other shorter lengths, each such shorter length being half the length of the preceding length; the instruction templates without the vector length field 1359B act on the maximum vector length. Furthermore, in one embodiment, the class B instruction templates of the specific vector-friendly instruction format 1400 operate on packed or scalar single/double-precision floating-point data and packed or scalar integer data. The scalar operations are operations performed on the lowest-order data element position in a zmm/ymm/xmm register; the higher-order data element positions are, depending on the embodiment, either left as they were before the instruction or set to zero.

The write mask registers 1515 - in the illustrated embodiment, there are 8 write mask registers (k0 to k7), each with a size of 64 bits. In an alternative embodiment, the write mask registers 1515 have a size of 16 bits. As previously described, in one embodiment of the invention, the vector mask register k0 cannot be used as a write mask; if the encoding that would normally specify k0 is used for a write mask, it selects the hardwired write mask 0xFFFF, which effectively disables write masking for that instruction.

The general-purpose registers 1525 - in the illustrated embodiment, there are sixteen 64-bit general-purpose registers which, together with the existing x86 addressing modes, are used to address the memory operands.
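The vector length selection and the special handling of the k0 encoding can be illustrated with a brief sketch. Several points are assumptions rather than statements from the text: the mapping of the two-bit field values to lengths follows the publicly documented AVX-512 encoding, the write is modeled as merging-style masking (unselected elements keep their previous value), and all function names are hypothetical.

```python
# Assumed mapping of the 2-bit vector length field to bytes (public AVX-512
# convention; the text only states that each shorter length is half the previous).
VECTOR_LENGTH_BYTES = {0b10: 64, 0b01: 32, 0b00: 16}


def vector_length_bytes(length_field: int) -> int:
    """Return the vector length in bytes selected by the vector length field."""
    return VECTOR_LENGTH_BYTES[length_field]


def effective_write_mask(kkk: int, k_regs: list) -> int:
    """Select the 16-bit write mask; the k0 encoding yields the hardwired 0xFFFF."""
    return 0xFFFF if kkk == 0 else k_regs[kkk] & 0xFFFF


def masked_write(dest: list, result: list, mask: int) -> list:
    """Merging-style masked write (assumed): keep dest where the mask bit is 0."""
    return [r if (mask >> i) & 1 else d
            for i, (d, r) in enumerate(zip(dest, result))]


k_regs = [0] * 8
k_regs[1] = 0b101                      # enable elements 0 and 2 only
old = [0] * 16
new = list(range(1, 17))

partial = masked_write(old, new, effective_write_mask(1, k_regs))
unmasked = masked_write(old, new, effective_write_mask(0, k_regs))
print(partial[:4])
print(unmasked == new)
print(vector_length_bytes(0b01))
```

Encoding kkk = 000 therefore behaves as if every element were enabled, which is what "disabling write masking" means in practice.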
These general-purpose registers are referred to by the names RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP, and R8 through R15.

The scalar floating-point stack register file (x87 stack) 1545, on which the MMX packed integer flat register file 1550 is aliased - in the illustrated embodiment, the x87 stack is an eight-element stack used to perform scalar floating-point operations on 32/64/80-bit floating-point data using the x87 instruction set extension, while the MMX registers are used both to perform operations on packed 64-bit integer data and to hold operands for some operations performed between the MMX and XMM registers.

Alternative embodiments of the invention may use wider or narrower registers. Additionally, alternative embodiments of the invention may use more, fewer, or different register files and registers.

Exemplary core architectures, processors and computer architectures

Processor cores can be implemented in various ways, for different purposes, and in different processors. Implementations of such cores may include, for example: 1) an in-order general-purpose core intended for general-purpose computing; 2) a high-performance out-of-order general-purpose core intended for general-purpose computing; 3) a specialized core intended primarily for graphics and/or scientific (throughput) computing.
The implementations of the various processors may include: 1) a CPU containing one or more in-order general-purpose cores intended for general-purpose computing and/or one or more out-of-order general-purpose cores intended for general-purpose computing; and 2) a coprocessor containing one or more specialized cores intended primarily for graphics and/or scientific (throughput) computing. Such different processors lead to different computer system architectures, which may include: 1) the coprocessor on a chip separate from the CPU; 2) the coprocessor on a separate die in the same package as a CPU; 3) the coprocessor on the same die as a CPU (in which case such a coprocessor is sometimes referred to as special logic, such as integrated graphics and/or scientific (throughput) logic, or as special cores); and 4) a system-on-a-chip that may contain, on the same die, the described CPU (sometimes referred to as the application core(s) or application processor(s)), the coprocessor described above, and additional functionality. Exemplary core architectures are described next, followed by descriptions of exemplary processors and computer architectures.

Exemplary core architectures

A block diagram of an in-order and out-of-order core

Fig. 16A is a block diagram illustrating both an exemplary in-order pipeline and an exemplary out-of-order issue/execution pipeline with register renaming according to embodiments of the invention. Fig. 16B is a block diagram illustrating both an exemplary embodiment of an in-order architecture core and an exemplary out-of-order issue/execution architecture core with register renaming, contained in a processor according to embodiments of the invention. The boxes with solid lines in Figs.
16A-B illustrate the in-order pipeline and the in-order core, while the optional addition of boxes with dashed lines illustrates the out-of-order issue/execution pipeline with register renaming and the out-of-order issue/execution core with register renaming. Given that the in-order aspect is a subset of the out-of-order aspect, the out-of-order aspect is described. In Fig. 16A, a processor pipeline 1600 includes a fetch stage 1602, a length decode stage 1604, a decode stage 1606, an allocation stage 1608, a rename stage 1610, a scheduling stage (also referred to as a dispatch or issue stage) 1612, a register read/memory read stage 1614, an execution stage 1616, a write-back/memory write stage 1618, an exception handling stage 1622, and a commit stage 1624. Fig. 16B shows a processor core 1690 containing a front-end unit 1630 coupled to an execution engine unit 1650, both of which are coupled to a memory unit 1670. The core 1690 can be a reduced instruction set computing (RISC) core, a complex instruction set computing (CISC) core, a very long instruction word (VLIW) core, or a hybrid or alternative core type. As a further option, the core 1690 can be a special-purpose core, such as a network or communications core, a compression engine, a coprocessor core, a general-purpose graphics processing unit (GPGPU) core, a graphics core, or the like. The front-end unit 1630 contains a branch prediction unit 1632, which is coupled to an instruction cache unit 1634, which is coupled to an instruction translation lookaside buffer (TLB) 1636, which is coupled to an instruction fetch unit 1638, which is coupled to a decoder unit 1640. The decoder unit 1640 (or decoder) can decode instructions and produce as an output one or more micro-operations, microcode entry points, microinstructions, other instructions, or other control signals that are decoded from, otherwise reflect, or are derived from the original instructions. The decoder unit 1640 can be implemented using various different mechanisms.
Examples of suitable mechanisms include, but are not limited to, lookup tables, hardware implementations, programmable logic arrays (PLAs), microcode read-only memories (ROMs), etc. In one embodiment, the core 1690 contains a microcode ROM or other medium that stores microcode for specific macroinstructions (e.g., in the decoder unit 1640 or elsewhere within the front-end unit 1630). The decoder unit 1640 is coupled to a rename/allocator unit 1652 in the execution engine unit 1650. The execution engine unit 1650 contains the rename/allocator unit 1652, which is coupled to a retirement unit 1654 and to a set of one or more scheduler units 1656. The scheduler unit(s) 1656 represent any number of different schedulers, including reservation stations, a central instruction window, etc. The scheduler unit(s) 1656 are coupled to the physical register file unit(s) 1658. Each of the physical register file units 1658 represents one or more physical register files, different ones of which store one or more different data types, such as scalar integer, scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point, status (e.g., an instruction pointer that is the address of the next instruction to be executed), etc. In one embodiment, the physical register file(s) unit 1658 comprises a vector register unit, a write mask register unit, and a scalar register unit. These register units can provide architectural vector registers, vector mask registers, and general-purpose registers. The physical register file(s) unit 1658 is overlapped by the retirement unit 1654 to illustrate the various ways in which register renaming and out-of-order execution can be implemented (e.g., using reorder buffer(s) and retirement register file(s); using future file(s), history buffer(s), and retirement register file(s); using register maps and a pool of registers; etc.).
The retirement unit 1654 and the physical register file unit(s) 1658 are coupled to the execution cluster(s) 1660. The execution cluster 1660 contains a set of one or more execution units 1662 and a set of one or more memory access units 1664. The execution units 1662 can perform various operations (e.g., shifts, addition, subtraction, multiplication) on various data types (e.g., scalar floating-point, packed integer, packed floating-point, vector integer, vector floating-point). While some embodiments may contain a number of execution units dedicated to specific functions or sets of functions, other embodiments may contain only one execution unit or multiple execution units that all perform all functions. The scheduler unit(s) 1656, the physical register file unit(s) 1658, and the execution cluster(s) 1660 are shown as potentially multiple because certain embodiments create separate pipelines for specific data types/operations (e.g., a scalar integer pipeline, a scalar floating-point/packed integer/packed floating-point/vector integer/vector floating-point pipeline, and/or a memory access pipeline, each having its own scheduler unit, its own physical register file unit(s), and/or its own execution cluster; in the case of a separate memory access pipeline, certain embodiments are implemented in which only the execution cluster of that pipeline has the memory access unit(s) 1664). It should also be recognized that where separate pipelines are used, one or more of these pipelines may be out-of-order issue/execution and the rest in-order. The set of memory access units 1664 is coupled to the memory unit 1670, which contains a data TLB unit 1672, which is coupled to a data cache unit 1674, which is coupled to a level 2 cache unit (L2 cache unit) 1676.
In an exemplary embodiment, the memory access units 1664 can contain a load unit, a store address unit, and a store data unit, each of which is coupled to the data TLB unit 1672 in the memory unit 1670. The instruction cache unit 1634 is further coupled to the level 2 cache unit (L2 cache unit) 1676 in the memory unit 1670. The L2 cache unit 1676 is coupled to one or more other levels of cache and finally to main memory. For example, the exemplary out-of-order issue/execution core architecture with register renaming can implement the pipeline 1600 as follows: 1) the instruction fetch unit 1638 performs the fetch and length decode stages 1602 and 1604; 2) the decoder unit 1640 performs the decode stage 1606; 3) the rename/allocator unit 1652 performs the allocation stage 1608 and the rename stage 1610; 4) the scheduler unit(s) 1656 perform the scheduling stage 1612; 5) the physical register file unit(s) 1658 and the memory unit 1670 perform the register read/memory read stage 1614; the execution cluster 1660 performs the execution stage 1616; 6) the memory unit 1670 and the physical register file unit(s) 1658 perform the write-back/memory write stage 1618; 7) various units may be involved in the exception handling stage 1622; and 8) the retirement unit 1654 and the physical register file unit(s) 1658 perform the commit stage 1624. The core 1690 can support one or more instruction sets (e.g., the x86 instruction set (with some extensions added in newer versions); the MIPS instruction set from MIPS Technologies of Sunnyvale, CA; the ARM instruction set (with optional additional extensions, such as NEON) from ARM Holdings of Sunnyvale, CA), including the instruction(s) described here. In one embodiment, the core 1690 includes logic to support an instruction set extension for packed data (e.g., AVX1, AVX2), thereby allowing the operations used by many multimedia applications to be performed using packed data.
It should be recognized that the core can support multithreading (the execution of two or more parallel sets of operations or threads) and can do so in various ways, including time-sliced multithreading, simultaneous multithreading (where a single physical core provides a logical core for each of the threads that the physical core is simultaneously multithreading), or a combination thereof (e.g., time-sliced fetching and decoding followed by simultaneous multithreading, as in Intel® Hyperthreading Technology). While register renaming is described in the context of out-of-order execution, it should be recognized that register renaming can also be used in an in-order architecture. Furthermore, while the illustrated embodiment of the processor includes separate instruction and data cache units 1634/1674 and a shared L2 cache unit 1676, alternative embodiments may have a single internal cache for both instructions and data, such as an internal Level 1 cache (L1 cache), or multiple levels of internal cache. In some embodiments, the system may include a combination of an internal cache and an external cache located outside the core and/or the processor. Alternatively, all of the cache may be located outside the core and/or the processor.

A specific exemplary in-order core architecture

Figures 17A-B illustrate a block diagram of a more specific exemplary in-order core architecture, where the core would be one of several logic blocks (including other cores of the same type and/or other types) on a single chip. Depending on the application, the logic blocks communicate through a high-bandwidth interconnection network (e.g., a ring network) with some fixed-function logic, memory I/O interfaces, and other necessary I/O logic. Fig. 17A is a block diagram of a single processor core together with its connection to the on-die interconnection network 1702 and with its local subset of the Level 2 cache (L2 cache) 1704 according to embodiments of the invention.
In one embodiment, an instruction decoder 1700 supports the x86 instruction set with the instruction set extension for packed data. An L1 cache 1706 allows low-latency access to cache memory by the scalar and vector units. While in one embodiment (to simplify the design) a scalar unit 1708 and a vector unit 1710 use separate register sets (scalar registers 1712 and vector registers 1714, respectively), and the data transferred between them is written to memory and then read back from a Level 1 cache (L1 cache) 1706, alternative embodiments of the invention may use a different approach (e.g., using a single register set or including a communication path that allows data to be transferred between the two register files without being written and read back). The local subset of the L2 cache 1704 is part of a global L2 cache, which is divided into separate local subsets, one for each processor core. Each processor core has a direct access path to its own local subset of the L2 cache 1704. Data read by a processor core is stored in its L2 cache subset 1704, where it can be accessed quickly, in parallel with other processor cores accessing their own local L2 cache subsets. Data written by a processor core is stored in its own L2 cache subset 1704 and is flushed from other subsets as needed. The ring network ensures coherence for shared data. The ring network is bidirectional to allow agents, such as processor cores, L2 caches, and other logic blocks, to communicate with each other within the chip. Each ring data path is 1012 bits wide in each direction. Fig. 17B is an expanded view of a portion of the processor core in Fig. 17A according to embodiments of the invention. Fig. 17B includes both an L1 data cache portion 1706A of the L1 cache 1706 and further details regarding the vector unit 1710 and the vector registers 1714.
Specifically, the vector unit 1710 is a 16-wide vector processing unit (VPU) (see the 16-wide ALU 1728) that executes one or more of integer instructions, single-precision floating-point instructions, and double-precision floating-point instructions. The VPU supports swizzling of the register inputs with the swizzling unit 1720, numerical conversion with the numerical conversion units 1722A-B, and replication with the replication unit 1724 on the memory input. The write mask registers 1726 allow predication of the resulting vector writes.

A processor with integrated memory controller and integrated graphics

Fig. 18 is a block diagram of a processor 1800, which may have more than one core, an integrated memory controller, and integrated graphics, according to embodiments of the invention. The boxes with solid lines in Fig. 18 illustrate a processor 1800 with a single core 1802A, a system agent 1810, and a set of one or more bus controller units 1816, while the optional addition of boxes with dashed lines illustrates an alternative processor 1800 with multiple cores 1802A-N, a set of one or more integrated memory controller units 1814 in the system agent unit 1810, and special logic 1808. Consequently, different implementations of the processor 1800 can include: 1) a CPU in which the special logic 1808 is integrated graphics and/or scientific (throughput) logic (which may contain one or more cores) and the cores 1802A-N are one or more general-purpose cores (e.g., in-order general-purpose cores, out-of-order general-purpose cores, or a combination of both); 2) a coprocessor in which the cores 1802A-N are a large number of special-purpose cores intended primarily for graphics and/or scientific (throughput) computing; and 3) a coprocessor in which the cores 1802A-N are a large number of in-order general-purpose cores.
Therefore, the processor 1800 can be a general-purpose processor, a coprocessor, or a special-purpose processor, such as a network or communications processor, a compression engine, a graphics processor, a GPGPU (a general-purpose graphics processing unit), a high-throughput many-integrated-core (MIC) coprocessor (containing 30 or more cores), an embedded processor, or the like. The processor can be implemented on one or more chips. The processor 1800 can be part of one or more substrates and/or implemented on one or more substrates using any of a number of process technologies, such as BiCMOS, CMOS, or NMOS. The memory hierarchy includes one or more levels of cache within the cores, a set of one or more shared cache units 1806, and an external memory (not shown) coupled to the set of integrated memory controller units 1814. The set of shared cache units 1806 may include one or more mid-level caches, such as Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache, a last-level cache (LLC), and/or combinations thereof. While in one embodiment a ring-based interconnection unit 1812 connects the integrated graphics logic 1808, the set of shared cache units 1806, and the system agent unit 1810/the integrated memory controller unit(s) 1814, alternative embodiments may employ any number of well-known techniques to interconnect such units. In one embodiment, coherence is maintained between one or more cache units 1806 and the cores 1802A-N. In some embodiments, one or more of the cores 1802A-N are capable of multithreading. The system agent 1810 contains the components that coordinate and operate the cores 1802A-N. The system agent unit 1810 can, for example, include a power control unit (PCU) and a display unit. The PCU can be, or contain, the logic and components required to control the power state of the cores 1802A-N and the integrated graphics logic 1808.
The display unit is used to drive one or more externally connected displays. The cores 1802A-N can be homogeneous or heterogeneous with respect to the architectural instruction set; i.e., two or more of the cores 1802A-N may be capable of executing the same instruction set, while others may be capable of executing only a subset of that instruction set or a different instruction set. Exemplary computer architectures Figures 19-21 are block diagrams of exemplary computer architectures. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, mobile phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems or electronic devices capable of incorporating a processor and / or other execution logic as disclosed here are suitable. Figure 19 shows a block diagram of a system 1900 according to an embodiment of the present invention. The system 1900 can include one or more processors 1910, 1915 coupled to a controller hub 1920. In one embodiment, the controller hub 1920 includes a graphics memory controller hub (GMCH) 1990 and an input / output hub (IOH) 1950 (which may be located on separate chips); the GMCH 1990 includes memory and graphics controllers to which a memory 1940 and a coprocessor 1945 are coupled; the IOH 1950 couples the input / output devices (I / O devices) 1960 to the GMCH 1990. Alternatively, one or both of the memory and graphics controllers are integrated within the processor (as described here), the memory 1940 and the coprocessor 1945 are directly coupled to the processor 1910, and the controller hub 1920 is located in a single chip with the IOH 1950.
The optional nature of the additional processors 1915 is indicated in Fig. 19 with dashed lines. Each processor 1910, 1915 can contain one or more of the processing cores described here and can be any version of the processor 1800. The memory 1940 can be, for example, a dynamic random-access memory (DRAM), a phase-change memory (PCM), or a combination of both. In at least one embodiment, the controller hub 1920 communicates with the processor(s) 1910, 1915 via a multi-drop bus, such as a front-side bus (FSB), a point-to-point interface such as a QuickPath Interconnect (QPI), or a similar connection 1995. In one embodiment, the coprocessor 1945 is a special-purpose processor, such as a high-throughput MIC processor, a network or communications processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like. In another embodiment, the controller hub 1920 may include an integrated graphics accelerator. There may be various differences between the physical resources 1910 and 1915 across a spectrum of metrics of merit, including architectural, microarchitectural, thermal, and power consumption characteristics, and the like. In one embodiment, the processor 1910 executes instructions that control data processing operations of a general type. Coprocessor instructions may be embedded within these instructions. The processor 1910 recognizes these coprocessor instructions as being of a type that should be executed by the attached coprocessor 1945. Accordingly, the processor 1910 issues these coprocessor instructions (or control signals representing the coprocessor instructions) on a coprocessor bus or other interconnect to the coprocessor 1945. The coprocessor(s) 1945 accepts the received coprocessor instructions and executes them. Figure 20 shows a block diagram of a first, more specific exemplary system 2000 according to an embodiment of the present invention.
As shown in Figure 20, the multiprocessor system 2000 is a point-to-point interconnect system comprising a first processor 2070 and a second processor 2080 coupled via a point-to-point interconnect 2050. Each of the processors 2070 and 2080 can be any version of the processor 1800. In one embodiment of the invention, the processors 2070 and 2080 are the processors 1910 and 1915, respectively, while the coprocessor 2038 is the coprocessor 1945. In another embodiment, the processors 2070 and 2080 are the processor 1910 and the coprocessor 1945, respectively. The processors 2070 and 2080 are shown as containing the integrated memory controller units (IMC units) 2072 and 2082, respectively. The processor 2070 also includes the point-to-point interfaces (PP interfaces) 2076 and 2078 as part of its bus controller units; similarly, the processor 2080 contains the PP interfaces 2086 and 2088. The processors 2070 and 2080 can exchange information via a point-to-point interface (PP interface) 2050 using the PP interface circuits 2078 and 2088. As shown in Fig. 20, the IMCs 2072 and 2082 couple the processors to corresponding memories, namely a memory 2032 and a memory 2034, which can be portions of the main memory that are locally attached to the respective processors. The processors 2070 and 2080 can each exchange information with a chipset 2090 via the individual PP interfaces 2052 and 2054 using the point-to-point interface circuits 2076, 2094, 2086, and 2098. The chipset 2090 can optionally exchange information with the coprocessor 2038 via a high-performance interface 2039. In one embodiment, the coprocessor 2038 is a special-purpose processor, such as a high-throughput MIC processor, a network or communications processor, a compression engine, a graphics processor, a GPGPU, an embedded processor, or the like.
A shared cache (not shown) may be included in either processor or located outside of both processors, yet connected to the processors via the PP interconnect, so that the local cache information of one or both processors can be stored in the shared cache if a processor is placed into a low-power mode. The chipset 2090 can be coupled to a first bus 2016 via an interface 2096. In one embodiment, the first bus 2016 can be a peripheral component interconnect bus (PCI bus) or a bus such as a PCI Express bus or another third-generation I / O interconnect bus, although the scope of protection of the present invention is not limited in this way. As shown in Fig. 20, various I / O devices 2014, together with a bus bridge 2018 that couples the first bus 2016 to a second bus 2020, can be connected to the first bus 2016. In one embodiment, one or more additional processor(s) 2015, such as coprocessors, high-throughput MIC processors, GPGPUs, accelerators (such as graphics accelerators or digital signal processing units (DSPs)), field-programmable gate arrays, or any other processor, are connected to the first bus 2016. In one embodiment, the second bus 2020 can be a low-pin-count bus (LPC bus). In one embodiment, various devices can be connected to the second bus 2020, including, for example, a keyboard and / or a mouse 2022, communication devices 2027, and a storage unit 2028, such as a disk drive or other mass storage device that can contain instructions / code and data 2030. Furthermore, an audio I / O 2024 can be coupled to the second bus 2020. Note that other architectures are possible. Instead of the point-to-point architecture according to Fig. 20, the system can implement a multi-drop bus or another such architecture. Figure 21 shows a block diagram of a second, more specific exemplary system 2100 according to an embodiment of the present invention.
Identical elements in Figures 20 and 21 bear the same reference numerals, with certain aspects from Figure 20 being omitted from Figure 21 to avoid obscuring other aspects of Figure 21. Figure 21 illustrates that the processors 2070 and 2080 can contain integrated memory and I / O control logic ("CL") 2072 and 2082, respectively. Consequently, the CL 2072 and 2082 contain the integrated memory controller units and I / O control logic. Figure 21 also shows that not only are the memories 2032 and 2034 connected to the CL 2072 and 2082, but the I / O devices 2114 are also connected to the control logic 2072 and 2082. The legacy I / O devices 2115 are connected to the chipset 2090. Figure 22 shows a block diagram of a SoC 2200 according to an embodiment of the present invention. Similar elements in Figure 18 bear the same reference numerals. Furthermore, the boxes with dashed lines are optional features in more advanced SoCs. In Figure 22, an interconnect unit(s) 2202 is coupled to: an application processor 2210, comprising a set of one or more cores 212A-N and shared cache unit(s) 1806; a system agent unit 1810; a bus controller unit(s) 1816; an integrated memory controller unit(s) 1814; a set of one or more coprocessors 2220, which may include integrated graphics logic, an image processor, an audio processor, and a video processor; a static random-access memory unit (SRAM unit) 2230; a direct memory access unit (DMA unit) 2232; and a display unit 2240 for coupling to one or more external displays. In one embodiment, the coprocessor(s) 2220 includes a special-purpose processor, such as a network or communications processor, a compression engine, a GPGPU, a high-throughput MIC processor, an embedded processor, or the like. The embodiments of the mechanisms disclosed herein can be implemented in hardware, software, firmware, or a combination of such implementation approaches.
The embodiments of the invention can be implemented as computer programs or program code executed in programmable systems comprising at least one processor, a storage system (including volatile and / or non-volatile memory and / or storage elements), at least one input device, and at least one output device. The program code, such as code 2030 illustrated in Fig. 20, can be applied to input instructions to perform the functions described herein and to generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system comprising a processor, such as a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor. The program code can be implemented in a high-level procedural programming language or an object-oriented programming language to communicate with a processing system. Optionally, the program code can also be implemented in assembly language or machine language. In fact, the mechanisms described here are not limited in scope to any specific programming language. In any case, the language can be a compiled or an interpreted language. One or more aspects of at least one embodiment may be implemented by representative instructions stored in a machine-readable medium that represent various logic within the processor. When read by a machine, these instructions cause the machine to fabricate logic to execute the techniques described herein. Such representations, known as "IP cores," may be stored in a tangible machine-readable medium and may be supplied to various customers or manufacturing facilities for loading into the fabrication machines that actually make the logic or processor.
Such machine-readable storage media may, without limitation, include non-transitory, tangible arrangements of articles manufactured or formed by a machine or device, including storage media such as hard disks, any other type of disk including floppy disks, optical disks, compact disk read-only memories (CD-ROMs), rewritable compact disks (CD-RWs), and magneto-optical disks, semiconductor devices such as read-only memories (ROMs), random-access memories (RAMs) such as dynamic random-access memories (DRAMs) and static random-access memories (SRAMs), erasable programmable read-only memories (EPROMs), flash memories, electrically erasable programmable read-only memories (EEPROMs), phase-change memory (PCM), magnetic or optical cards, or any other type of media suitable for storing electronic instructions. Accordingly, the embodiments of the invention also include non-transitory, tangible, machine-readable media containing instructions or design data, such as hardware description language (HDL) data, which define the structures, circuits, devices, processors, and / or system features described herein. Such embodiments may also be referred to as program products. Emulation (including binary translation, code morphing, etc.) In some cases, an instruction converter can be used to convert an instruction from a source instruction set to a target instruction set. The instruction converter can, for example, translate (e.g., using static binary translation or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter can be implemented in software, hardware, firmware, or a combination thereof. The instruction converter can reside within the processor, outside the processor, or partly within and partly outside the processor. Fig.
23 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set according to embodiments of the invention. In the illustrated embodiment, the instruction converter is a software instruction converter, although alternatively the instruction converter can be implemented in software, firmware, hardware, or various combinations thereof. Fig. 23 shows a program in a high-level language 2302 that can be compiled using an x86 compiler 2304 to generate binary x86 code 2306 that can be executed natively by a processor with at least one x86 instruction set core 2316. The processor with at least one x86 instruction set core 2316 represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software designed to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 2304 represents a compiler that can be operated to generate binary x86 code 2306 (e.g., object code) that can be executed, with or without additional linkage processing, on the processor with at least one x86 instruction set core 2316. Fig. 23 similarly shows the program in the high-level language 2302, which can be compiled using a compiler 2308 for an alternative instruction set to generate binary code 2310 of the alternative instruction set that can be executed natively by a processor without at least one x86 instruction set core 2314 (e.g., a processor with cores that execute the MIPS instruction set from MIPS Technologies of Sunnyvale, CA, and / or that execute the ARM instruction set from ARM Holdings of Sunnyvale, CA). The instruction converter 2312 is used to convert the binary x86 code 2306 into code that can be executed natively by the processor without an x86 instruction set core 2314. This converted code is likely not the same as the binary code 2310 of the alternative instruction set because an instruction converter capable of this is difficult to make; however, the converted code accomplishes the general operation and is made up of instructions from the alternative instruction set. Consequently, the instruction converter 2312 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, enables a processor or other electronic device that does not have an x86 instruction set processor or core to execute the binary x86 code 2306. The components, features, and details described for one of Figures 1-2 and 5-11 may also apply to any of Figures 3-4. Furthermore, the components, features, and details described for any of the devices may optionally also apply to any of the methods that can be carried out in the embodiments by and / or with such a device. Each of the processors described herein may be included in any of the computer systems disclosed herein (e.g., Figures 19-23). In some embodiments, the computer system may include dynamic random-access memory (DRAM).
Alternatively, the computer system may include a type of volatile memory that does not need to be refreshed, or flash memory. The instructions disclosed herein can be executed on any of the processors shown herein that incorporate any of the microarchitectures shown herein, in any of the systems shown herein. The instructions disclosed herein can exhibit any of the features of the instruction formats shown herein (e.g., in Figures 12-14). In the description and claims, the terms "coupled" and / or "connected" may have been used together with their derivatives. These terms are not intended as synonyms for each other. Instead, in the embodiments, "connected" may be used to indicate that two or more elements are in direct physical and / or electrical contact with each other. "Coupled" may mean that two or more elements are in direct physical and / or electrical contact with each other. However, "coupled" may also mean that two or more elements are not in direct contact with each other but nevertheless cooperate or interact. For example, an execution unit may be coupled to a register and / or a decoding unit through one or more intervening components. Arrows are used in the figures to show connections and couplings. The terms "logic," "unit," "module," or "component" may have been used in the description and / or claims. Each of these terms can be used to refer to hardware, firmware, software, or various combinations thereof. In exemplary embodiments, each of these terms may refer to integrated circuitry, application-specific integrated circuits, analog circuits, digital circuits, programmed logic devices, memory devices containing instructions, and the like, and various combinations thereof. In some embodiments, these may include at least some hardware (e.g., transistors, gates, other circuitry components, etc.). The term "and / or" may have been used.
The term "and / or," as used here, means one or the other or both (A and / or B means, for example, A or B or both A and B). Specific details have been set forth in the above description to provide a comprehensive understanding of the embodiments. However, other embodiments may be practiced without some of these specific details. The scope of protection of the invention is not to be determined by the specific examples provided above, but only by the claims below. In other cases, well-known circuits, structures, devices, and operations have been shown in block diagrams and / or without details to avoid obscuring the description. Where deemed appropriate, reference numerals or the end sections of reference numerals have been repeated between figures to indicate corresponding or analogous elements, which may optionally have similar or the same properties unless otherwise specified or clearly evident. Certain operations can be performed by hardware components or can be embodied in machine-readable or circuit-executable instructions that can be used to cause and / or result in a machine, circuit, or hardware component (e.g., a processor, a portion of a processor, a circuit, etc.) programmed with the instructions to perform the operations. The operations can also optionally be performed by a combination of hardware and software. A processor, machine, circuit, or hardware component may contain a specific or special circuit arrangement or other logic (e.g., hardware, potentially combined with firmware and / or software) capable of executing and / or processing the instruction and storing a result in response to the instruction. Some embodiments include a manufactured article (e.g., a computer program product) containing a machine-readable medium. The medium may include a mechanism that provides, e.g., stores, information in a form readable by the machine. 
The machine-readable medium may provide, or have stored thereon, an instruction or sequence of instructions that, if and / or when executed by a machine, causes the machine to perform one or more operations, methods, or techniques disclosed herein. In some embodiments, the machine-readable medium may include a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium may include, for example, a floppy disk, an optical storage medium, an optical disk, an optical data storage device, a CD-ROM, a magnetic disk, a magneto-optical disk, a read-only memory (ROM), a programmable ROM (PROM), an erasable and programmable ROM (EPROM), an electrically erasable and programmable ROM (EEPROM), a random-access memory (RAM), a static RAM (SRAM), a dynamic RAM (DRAM), a flash memory, a phase-change memory, a phase-change data storage material, a non-volatile memory, a non-volatile data storage device, a non-transitory memory, a non-transitory data storage device, or the like. The non-transitory machine-readable storage medium does not consist of a transitory propagated signal. In some embodiments, the storage medium may include a tangible medium containing solid matter. Examples of suitable machines include, but are not limited to, a general-purpose processor, a special-purpose processor, a digital logic circuit, an integrated circuit, or the like. Other examples of suitable machines include a computer system or electronic device containing a processor, a digital logic circuit, or an integrated circuit. Examples of such computer systems or electronic devices include, but are not limited to, desktop computers, laptop computers, notebook computers, tablet computers, netbooks, smartphones, mobile phones, servers, network devices (e.g., routers and switches), mobile internet devices (MIDs), media players, smart televisions, nettops, set-top boxes, and video game controllers.
References throughout this description to "one embodiment," "an embodiment," "one or more embodiments," or "some embodiments" indicate, for example, that a particular feature may be included in the practice of the invention, but is not necessarily required to be. Similarly, in the description, various features are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in understanding the various inventive aspects. However, this method of disclosure should not be interpreted as reflecting an intention that the invention requires more features than are expressly recited in each claim. Instead, as the following claims reflect, the inventive aspects lie in fewer than all the features of a single disclosed embodiment. Consequently, the claims following the detailed description are hereby expressly incorporated into this detailed description, each claim standing on its own as a separate embodiment of the invention. EXAMPLE EMBODIMENTS The following examples relate to further embodiments. The features in the examples can be used in one or more embodiments. Example 1 is a processor that includes a decoding unit for decoding a data element comparison instruction. The data element comparison instruction specifies a first packed source data operand containing at least four data elements, a second packed source data operand containing at least four data elements, and one or more destination storage locations. The processor also includes an execution unit coupled to the decoding unit. In response to the data element comparison instruction, the execution unit stores at least one result mask operand at the one or more destination storage locations.
The at least one result mask operand contains, for each corresponding data element in one of the first and second packed source data operands, a different mask element at the same relative position. Each mask element indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. Example 2 includes the processor of Example 1, in which the execution unit, in response to the instruction, stores two result mask operands at the one or more destination storage locations. The two result mask operands include a first result mask operand, which contains a different mask element for each corresponding data element in the first packed source data operand at the same relative position. Each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand. A second result mask operand contains a different mask element for each corresponding data element in the second packed source data operand at the same relative position. Each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand. Example 3 includes the processor of Example 2, wherein the one or more destination storage locations comprise a first mask register and a second mask register, and wherein, in response to the instruction, the execution unit stores the first result mask operand in the first mask register and stores the second result mask operand in the second mask register.
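The two-mask behavior described in Examples 1 to 3 can be sketched in a few lines of Python. This is only an illustrative software model under stated assumptions, not the hardware implementation: the function name `compare_elements` is hypothetical, the packed operands are modeled as plain lists, and each mask element is modeled as a single bit (0 or 1).

```python
def compare_elements(src1, src2):
    """Model of the data element comparison described in Examples 1 to 3.

    src1, src2: lists standing in for the two packed source data operands.
    Returns (mask1, mask2), where mask1[i] is 1 if src1[i] equals any
    element of src2, and mask2[j] is 1 if src2[j] equals any element of src1.
    """
    # Sets give an "equal to any element" test for each position.
    elems1, elems2 = set(src1), set(src2)
    mask1 = [1 if x in elems2 else 0 for x in src1]
    mask2 = [1 if y in elems1 else 0 for y in src2]
    return mask1, mask2


# Hypothetical operands holding four indices each (e.g., CSR column indices).
first_mask, second_mask = compare_elements([0, 2, 3, 7], [1, 2, 7, 9])
print(first_mask)   # [0, 1, 0, 1]
print(second_mask)  # [0, 1, 1, 0]
```

Each mask has one element per source data element, at the same relative position, which is the property the examples rely on.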
Example 4 includes the processor of Example 2, wherein the one or more destination storage locations comprise a single mask register, and wherein the execution unit, in response to the instruction, stores the first result mask operand and the second result mask operand in the single mask register. Example 5 includes the processor of Example 4, wherein, in response to the instruction, the execution unit stores the first result mask operand in a least significant section of the single mask register and stores the second result mask operand in a section of the single mask register that is more significant than the least significant section. Example 6 includes the processor of Example 1, wherein the execution unit, in response to the instruction, stores both a first result mask operand and a second result mask operand in a packed data register, and wherein each data element in the packed data register has both a mask element of the first result mask operand and a mask element of the second result mask operand. Example 7 includes the processor of Example 1, wherein the execution unit, in response to the instruction, stores a single result mask operand in a single mask register. Example 8 includes the processor of Example 1, wherein the execution unit, in response to the instruction, stores the at least one result mask operand in at least one mask register, and wherein an instruction set of the processor contains masked packed data instructions that are operable to specify the at least one mask register as a storage location for a source mask operand to be used to mask an operation on packed data. Example 9 includes the processor of one of Examples 1 to 8, wherein the execution unit, in response to the instruction, stores a number of result mask bits in the at least one result mask operand that is not greater than the number of data elements in the first and second packed source data operands.
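The single-register layout of Examples 4 and 5 can be illustrated with a small Python sketch. The names `pack_masks` and `unpack_masks` are hypothetical, the mask register is modeled as a Python integer, and the mapping of element position i to bit i is an assumption made for illustration; the text only states that the first mask occupies the least significant section and the second a more significant section.

```python
def pack_masks(mask1, mask2):
    """Pack two single-bit masks into one integer 'mask register'.

    mask1 goes into the least significant bits; mask2 goes into the
    bits directly above it (assumed layout, for illustration only).
    """
    value = 0
    for i, bit in enumerate(mask1):
        value |= (bit & 1) << i
    for j, bit in enumerate(mask2):
        value |= (bit & 1) << (len(mask1) + j)
    return value


def unpack_masks(value, n1, n2):
    """Recover the two masks from the packed integer."""
    mask1 = [(value >> i) & 1 for i in range(n1)]
    mask2 = [(value >> (n1 + j)) & 1 for j in range(n2)]
    return mask1, mask2


packed = pack_masks([0, 1, 0, 1], [0, 1, 1, 0])
print(bin(packed))                 # 0b1101010
print(unpack_masks(packed, 4, 4))  # ([0, 1, 0, 1], [0, 1, 1, 0])
```

With four elements per operand, the two masks together occupy eight bits, consistent with Example 9, where the number of result mask bits does not exceed the number of data elements in the two source operands.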
Example 10 includes the processor of one of Examples 1 to 8, wherein the execution unit, in response to the instruction, stores the at least one result mask operand in which each mask element contains a single mask bit. Example 11 includes the processor of one of Examples 1 to 8, wherein the decoding unit decodes the instruction specifying the first packed source data operand containing at least eight data elements and the second packed source data operand containing at least eight data elements. Example 12 includes the processor of one of Examples 1 to 8, wherein the decoding unit decodes the instruction specifying the first packed source data operand containing at least 512 bits and the second packed source data operand containing at least 512 bits. Example 13 is a method in a processor that includes receiving a data element comparison instruction. The data element comparison instruction specifies a first packed source data operand containing at least four data elements, a second packed source data operand containing at least four data elements, and one or more destination storage locations. The method also includes storing at least one result mask operand at the one or more destination storage locations in response to the data element comparison instruction. The at least one result mask operand contains, for each corresponding data element in one of the first and second packed source data operands, a different mask element at the same relative position. Each mask element indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. Example 14 includes the method of Example 13, wherein the storing includes storing a first result mask operand at the one or more destination storage locations.
For each corresponding data element in the first packed source data operand, the first result mask operand contains a different mask element at the same relative position. Each mask element of the first result mask operand indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand. The storing also includes storing a second result mask operand at the one or more destination storage locations. For each corresponding data element in the second packed source data operand, the second result mask operand contains a different mask element at the same relative position. Each mask element of the second result mask operand indicates whether the corresponding data element in the second packed source data operand is equal to any of the data elements in the first packed source data operand. Example 15 includes the method of Example 14, wherein storing the first result mask operand includes storing the first result mask operand in a first mask register, and wherein storing the second result mask operand includes storing the second result mask operand in a second mask register. Example 16 includes the method of Example 14, wherein storing the first result mask operand and storing the second result mask operand involves storing both the first and second result mask operands in a single mask register. Example 17 includes the method of Example 13, wherein storing the at least one result mask operand at the one or more destination storage locations includes storing both a first result mask operand and a second result mask operand in a packed result data operand. Example 18 includes the method of Example 13, further including receiving a masked packed data instruction that specifies the at least one result mask operand as a predicate operand. Example 19 is a system for processing instructions that includes an interconnect and a processor coupled to the interconnect.
The processor receives a data element comparison instruction. The instruction specifies a first packed source data operand containing at least four data elements, a second packed source data operand containing at least four data elements, and one or more destination storage locations. In response to the instruction, the processor stores at least one result mask operand at the one or more destination storage locations. The at least one result mask operand contains a different mask bit for each corresponding data element in one of the first and second packed source data operands at the same relative position. Each mask bit indicates whether the corresponding data element in one of the first and second packed source data operands is equal to any of the data elements in the other of the first and second packed source data operands. The system also includes a dynamic random-access memory (DRAM) coupled to the interconnect. The DRAM optionally stores a sparse vector arithmetic algorithm. The sparse vector arithmetic algorithm optionally includes a masked data element join instruction that specifies the at least one result mask operand as a source operand to mask a data element join operation. Example 20 includes the system of Example 19, wherein the processor, in response to the instruction, stores two result mask operands, each corresponding to a different one of the packed source data operands, and wherein the two result mask operands are to be stored in at least one mask register. Example 21 is an article of manufacture containing a non-transitory machine-readable storage medium. The non-transitory machine-readable storage medium stores a data element comparison instruction. The instruction specifies a first packed source data operand containing at least four data elements, specifies a second packed source data operand containing at least four data elements, and specifies one or more destination storage locations.
If executed by a machine, the instruction causes the machine to perform operations that include storing a first result mask operand at the one or more destination memory locations. The first result mask operand contains a different mask bit for each corresponding data element in the first packed source data operand at the same relative position. Each mask bit indicates whether the corresponding data element in the first packed source data operand is equal to any of the data elements in the second packed source data operand.
Example 22 includes the article of manufacture of Example 21, wherein the instruction, if executed by a machine, causes the machine to perform operations that include storing a second result mask operand in the one or more destination memory locations. Optionally, the one or more destination memory locations also include at least one mask register. Optionally, the first and second result mask operands together do not have more mask bits than the number of data elements in the first and second packed source data operands.
Example 23 includes the processor of one of Examples 1 through 8, further comprising an optional branch prediction unit for predicting branches and an optional instruction prefetch unit coupled to the branch prediction unit, wherein the instruction prefetch unit prefetches instructions including the data element comparison instruction.
The processor may also optionally include an optional Level 1 instruction cache (L1 instruction cache) coupled to the instruction prefetch unit, wherein the L1 instruction cache stores instructions, an optional L1 data cache for storing data, and an optional Level 2 cache (L2 cache) for storing both data and instructions. The processor may also optionally include an instruction fetch unit coupled to the decoder, the L1 instruction cache, and the L2 cache to fetch the data element comparison instruction from one of the L1 and L2 caches in some cases and to provide the data element comparison instruction to the decoder. The processor may also optionally include a register renamer to rename registers, an optional scheduler to schedule one or more operations decoded from the data element comparison instruction for execution, and an optional store unit to store the execution results of the data element comparison instruction.
Example 24 includes a system-on-a-chip that includes at least one interconnect, the processor of one of Examples 1 to 8 coupled to the at least one interconnect, an optional graphics processing unit (GPU) coupled to the at least one interconnect, an optional digital signal processor (DSP) coupled to the at least one interconnect, an optional display controller coupled to the at least one interconnect, an optional memory controller coupled to the at least one interconnect, an optional wireless modem coupled to the at least one interconnect, an optional image signal processor coupled to the at least one interconnect, an optional Universal Serial Bus (USB) 3.0 compatible controller coupled to the at least one interconnect, an optional Bluetooth 4.0 compatible controller coupled to the at least one interconnect, a further optional compatible controller coupled to the at least one interconnect, and an optional wireless transmitter / receiver controller coupled to the at least one interconnect.
Example 25 is a processor or other device for performing the method of any of Examples 13 to 18, or for being operational to perform the method of any of Examples 13 to 18.
Example 26 is a processor or other device containing means for carrying out the method of any of Examples 13 to 18.
Example 27 is an article of manufacture which optionally includes a non-transitory machine-readable medium which optionally stores or otherwise provides an instruction which, if and / or when executed by a processor, computer system, electronic device or other machine, is operative to cause the machine to perform the method of any of Examples 13 to 18.
Example 28 is a processor or other device substantially as described herein.
Example 29 is a processor or other device capable of performing any method substantially as described herein.
Example 30 is a processor or other device for executing any data element comparison instruction substantially as described herein (which, for example, has components to execute any data element comparison instruction substantially as described herein, or is operational to execute any data element comparison instruction substantially as described herein).
Example 31 is a computer system or other electronic device comprising a processor that has a decoding unit for decoding the instructions of a first instruction set. The processor also has one or more execution units. The electronic device also includes a storage device coupled to the processor. The storage device stores a first instruction, which may be any of the data element comparison instructions substantially as disclosed herein, and which is derived from a second instruction set. The storage device also stores instructions for translating the first instruction into one or more instructions of the first instruction set.
The one or more instructions of the first instruction set, when executed by the processor, cause the processor to store one of the results of the first instruction disclosed herein.
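The comparison semantics set out in the examples above, in which each mask bit indicates whether a data element of one source operand is equal to any data element of the other, can be illustrated by the following minimal software sketch. It is not the disclosed hardware implementation. The function names `compare_elements`, `pack_masks` and `sparse_dot` are hypothetical, mask bit i is assumed to correspond to data element i, and the sparse scalar product assumes that each index list is sorted in ascending order without duplicates, so that the values selected by the two masks line up pairwise.

```python
from typing import List, Tuple


def compare_elements(src1: List[int], src2: List[int]) -> Tuple[int, int]:
    """Model of the two result masks (hypothetical helper, not a real API).

    Bit i of the first mask is set if src1[i] equals any element of src2;
    bit j of the second mask is set if src2[j] equals any element of src1.
    """
    mask1 = 0
    for i, a in enumerate(src1):
        if any(a == b for b in src2):
            mask1 |= 1 << i
    mask2 = 0
    for j, b in enumerate(src2):
        if any(b == a for a in src1):
            mask2 |= 1 << j
    return mask1, mask2


def pack_masks(mask1: int, mask2: int, n: int) -> int:
    """Place both masks in one register image: mask1 in the low n bits,
    mask2 in the next n bits (this bit layout is an assumption)."""
    return mask1 | (mask2 << n)


def sparse_dot(idx_a: List[int], val_a: List[int],
               idx_b: List[int], val_b: List[int]) -> int:
    """Scalar product of two sparse vectors given as (index, value) lists.

    Assumes each index list is sorted ascending with no duplicates, so the
    values selected by the two masks are aligned pairwise.
    """
    mask_a, mask_b = compare_elements(idx_a, idx_b)
    sel_a = [v for i, v in enumerate(val_a) if (mask_a >> i) & 1]
    sel_b = [v for j, v in enumerate(val_b) if (mask_b >> j) & 1]
    return sum(x * y for x, y in zip(sel_a, sel_b))
```

For the index operands `[1, 4, 7, 9]` and `[4, 5, 9, 12]`, the sketch yields the masks `0b1010` and `0b0101`, since only the indices 4 and 9 occur in both operands. With the values `[2, 3, 5, 1]` and `[10, 20, 30, 40]`, `sparse_dot` returns 3 * 10 + 1 * 30 = 60.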
Claims
1. Processor (1800), comprising: a plurality of cores (1802) for executing instructions for processing sparse matrices or parts thereof (996); an interconnection network (1812) for connecting the plurality of cores (1802) to one or more memories (1814); wherein a core of the plurality of cores (1802) processes a plurality of sparse source vectors connected to a sparse source matrix, the core comprising a circuit (918) for compressing a first sparse source vector (997) comprising a first plurality of packed data elements with zero values and a second plurality of packed data elements with non-zero values to produce a first compressed result vector (998) comprising the second plurality of packed data elements with non-zero values specified in a mask (928), the mask (928) having a first bit value at each position in the mask (928) which corresponds to a position of one of the packed data elements with zero values in the first sparse source vector (997), and comprises a second bit value at each position in the mask (928) corresponding to a position of one of the packed data elements with non-zero values in the first sparse source vector (997); wherein the circuit (918) serves to compress the first compressed result vector (998) by arranging the packed data elements with non-zero values of the first plurality of packed data elements at consecutive packed data positions in the first compressed result vector (998) without changing any relative order of the packed data elements with non-zero values; and wherein the core transfers the first compressed result vector (998) via the interconnection network (1812) to a memory of the one or more memories (1814).
2. Processor (1800) according to claim 1, wherein the interconnection network (1812) comprises one or more point-to-point connections to directly connect an output of one core of the plurality of cores (1802) to an input of another core of the plurality of cores (1802).
3. Processor (1800) according to claim 1 or 2, wherein the core performs a dot product operation based on the first compressed result vector (998) and a second sparse source vector comprising a second plurality of packed data elements.
4. Processor (1800) according to one of claims 1 to 3, wherein the circuit (918) for compression comprises an execution circuit for executing an instruction to consolidate masked data elements, wherein the execution circuit, in response to the instruction to consolidate masked data elements, serves to: generate the first compressed result vector; and generate the mask.
5. Processor (1800) according to one of claims 1 to 4, wherein each core of the plurality of cores (1802) comprises a local memory (1804), wherein the processor (1800) further comprises: a shared memory (1806) connected to the interconnection network (1812), wherein the shared memory (1806) is shared by the plurality of cores (1802).
6. Processor (1800) according to one of claims 1 to 5, further comprising: an interface for coupling the plurality of cores (1802) with one or more devices.
7. Processor (1800) according to claim 6, wherein one of the devices comprises a coprocessor for executing coprocessor instructions.
8. Processor (1800) according to one of claims 1 to 7, wherein the interconnection network (1812) comprises a ring connection.
9. Processor (1800) according to one of claims 1 to 8, wherein a core of the plurality of cores (1802) comprises: a scalar unit (1708) for accessing a first set of registers (1712); and a vector unit (1710) for accessing a second set of registers (1714).
10. Processor (1800) according to claim 9, wherein a register of the second set of registers (1714) stores a plurality of packed data elements.
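By way of illustration only, the mask-based compression of claim 1 and the dot product of claim 3 may be modelled in software as in the following sketch. It is not the claimed circuit. The function names `compress` and `dot_with_compressed` are hypothetical, and the sketch assumes that the first bit value is 0 (zero-valued element) and the second bit value is 1 (non-zero element), with mask bit i corresponding to packed data position i.

```python
from typing import List, Tuple


def compress(vec: List[int]) -> Tuple[List[int], int]:
    """Return (compressed, mask) for a sparse vector.

    Non-zero elements are moved to consecutive positions without changing
    their relative order. Bit i of the mask is 1 if vec[i] is non-zero and
    0 otherwise (the choice of bit values is an assumption).
    """
    compressed: List[int] = []
    mask = 0
    for i, x in enumerate(vec):
        if x != 0:
            compressed.append(x)
            mask |= 1 << i
    return compressed, mask


def dot_with_compressed(compressed: List[int], mask: int,
                        other: List[int]) -> int:
    """Dot product of a compressed vector (with its mask) and a second,
    uncompressed sparse vector of the same logical length."""
    total = 0
    k = 0  # next unread element of the compressed vector
    for pos, y in enumerate(other):
        if (mask >> pos) & 1:
            total += compressed[k] * y
            k += 1
    return total
```

For the source vector `[0, 3, 0, 5, 2, 0, 0, 1]`, `compress` returns `[3, 5, 2, 1]` together with the mask `0b10011010`. The dot product with `[4, 0, 0, 6, 7, 0, 1, 0]` is 5 * 6 + 2 * 7 = 44, the same value obtained from the uncompressed vectors.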