Addition instruction with independent carry chain
By parallel execution of ADCX and ADOX instructions, the problem of low efficiency in existing addition operations is solved, achieving efficient addition operations and improving the processor's computing performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INTEL CORP
- Filing Date
- 2011-12-22
- Publication Date
- 2026-06-30
AI Technical Summary
Existing addition operations are inefficient in computationally intensive tasks, impacting overall performance, especially in public-key encryption and SSL transaction processing, leading to performance slowdowns.
By using the ADCX and ADOX instructions, addition operations are performed in parallel using two independent carry chains. The carry flag and overflow flag are modified respectively without affecting other arithmetic flags, thus achieving parallel addition.
It improves the efficiency of addition operations, reduces data dependencies, enables the parallel execution of multiple addition instructions, and enhances the processor's computing performance.
Figure CN114461275B_ABST
Abstract
Description
[0001] This application is a further divisional application of divisional application 201710828668.9. Divisional application 201710828668.9 is itself a divisional application of the invention patent application entitled "Addition Instruction with Independent Carry Chain", with PCT international application number PCT/US2011/066941, international filing date December 22, 2011, and Chinese national phase application number 201180075816.5.
Technical Field
[0002] Embodiments of the present invention generally relate to computer processor architecture and, more specifically, to instructions that, when executed, lead to specific results.
Background Technology
[0003] Addition instructions are typically included within an Instruction Set Architecture (ISA). Addition instructions are often invoked many times within a multiplication. For example, public-key encryption generally involves long integer operations requiring multi-precision multiplication. Operations such as modular exponentiation are highly computationally intensive and involve a large number of additions. A server responsible for establishing a company's Secure Sockets Layer (SSL) transactions may receive a large number of connection requests from enterprise clients within a short time span. Each transaction involves cryptographic operations including numerous integer multiplications and additions. Inefficient addition operations can slow down overall performance.
Attached Figure Description
[0004] In the figures of the accompanying drawings, embodiments of the invention are illustrated by way of example rather than limitation, and like reference numerals denote similar elements. It should be noted that different references to "an" or "one" embodiment in this disclosure do not necessarily refer to the same embodiment, and such references mean at least one. Furthermore, when a particular feature, structure, or characteristic is described in connection with one embodiment, it is believed to be within the knowledge of a person skilled in the art to implement such feature, structure, or characteristic in connection with other embodiments, whether or not explicitly described.
[0005] Figure 1 is a block diagram of an example embodiment of a processor having an instruction set containing one or more addition instructions.
[0006] Figure 2 shows an example of a multiplication operation that includes addition.
[0007] Figure 3 shows an example of sample code including addition instructions.
[0008] Figure 4 shows an example of parallel execution of addition instructions.
[0009] Figure 5 is a block diagram of an example embodiment of an instruction processing apparatus having an execution unit operable to execute instructions containing an example embodiment of the addition instructions.
[0010] Figure 6 shows an example of a flag register.
[0011] Figure 7 is a flowchart illustrating an example embodiment of a method for processing addition instructions.
[0012] Figure 8 is a block diagram of a system according to an embodiment of the present invention.
[0013] Figure 9 is a block diagram of a second system according to an embodiment of the present invention.
[0014] Figure 10 is a block diagram of a third system according to an embodiment of the present invention.
[0015] Figure 11 is a block diagram of a System-on-a-Chip (SoC) according to an embodiment of the present invention.
[0016] Figure 12 is a block diagram of a single-core processor and a multi-core processor with an integrated memory controller and graphics device according to embodiments of the present invention.
[0017] Figure 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set, according to an embodiment of the present invention.
Detailed Implementation
[0018] Numerous specific details are set forth in the following description. However, it should be understood that various embodiments of the invention may be practiced without these specific details. In other instances, well-known circuits, structures, and techniques have not been shown in detail to avoid obscuring the understanding of this description.
[0020] Embodiments of the present invention provide a mechanism for efficiently multiplying long integers. Specifically, embodiments of the present invention provide a mechanism for efficiently multiplying large long integers using addition operations that execute in parallel.
[0021] Figure 1 is a block diagram of an example embodiment of processor 100. Processor 100 can be any of a variety of Complex Instruction Set Computing (CISC) processors, various Reduced Instruction Set Computing (RISC) processors, various Very Long Instruction Word (VLIW) processors, various hybrids thereof, or other types of processors entirely. In one or more embodiments, processor 100 can be a general-purpose processor (e.g., a general-purpose microprocessor of the type manufactured by Intel Corporation of Santa Clara, California, USA), although this is not required. Alternatively, the instruction processing apparatus can be a special-purpose processor. Examples of suitable special-purpose processors include, but are not limited to, network processors, communication processors, cryptographic processors, graphics processors, coprocessors, embedded processors, digital signal processors (DSPs), and controllers (e.g., microcontrollers), to name just a few.
[0022] Processor 100 has an instruction set architecture (ISA) 101. The instruction set architecture 101 represents the portion of the architecture of processor 100 related to programming. The instruction set architecture 101 typically includes the native instructions, architectural registers, data types, addressing modes, memory architecture, interrupt and exception handling, and external input and output (I/O) of processor 100. The instruction set architecture 101 differs from a microarchitecture, which typically represents the specific processor design technology chosen to implement the instruction set architecture. Processors with different microarchitectures can share a common instruction set architecture. For example, some microprocessors from Intel Corporation in Santa Clara, California, and some microprocessors from Advanced Micro Devices, Inc. in Sunnyvale, California, use fundamentally different internal microarchitectures to implement similar portions of the x86 instruction set.
[0023] Instruction set architecture 101 includes architectural registers (e.g., an architectural register set) 106. In one embodiment, architectural registers 106 include general purpose (GP) registers, flag registers, vector registers, write mask registers, scalar floating-point registers, and other registers. Architectural registers 106 represent on-board processor storage locations. Architectural registers 106 may also be referred to herein simply as registers. The phrases architectural register, register set, and register are used herein to refer to registers that are visible to software and/or the programmer (e.g., software-visible) and/or that are specified by macro instructions to identify operands, unless otherwise specified or clearly apparent. These registers 106 contrast with other, non-architectural registers in a given microarchitecture (e.g., temporary registers, reorder buffers, retirement registers, etc.).
[0024] The instruction set architecture 101 shown also includes an instruction set 102 supported by the processor 100. Instruction set 102 includes several different types of instructions. The instructions in instruction set 102 represent macro instructions (e.g., instructions provided to the processor 100 for execution), as opposed to micro instructions or micro-operations (e.g., those obtained when decoder 129 of the processor 100 decodes macro instructions).
[0025] In one embodiment, instruction set 102 includes one or more addition instructions 103 (e.g., ADCX instruction 104 and ADOX instruction 105) operable to cause processor 100 to add two operands (e.g., two quadwords (Qwords), two doublewords (Dwords), or two operands with other data widths). The ADCX 104 and ADOX 105 instructions use two independent carry chains and can therefore be executed in parallel once their respective data inputs are available.
[0026] Processor 100 also includes execution logic 109. Execution logic 109 is operable for executing or processing instructions of instruction set 102. Execution logic 109 may include execution units, functional units, arithmetic logic units, logic units, arithmetic units, etc. Processor 100 also includes a decoder for decoding macro instructions into micro instructions or micro-operations for execution by execution logic 109.
[0027] To further illustrate an embodiment of the addition instruction 103, it may be helpful to consider an exemplary scenario in which addition is required. Figure 2 is a diagram illustrating an exemplary scenario of calculating the expression (S[7:0] = A_i x B[7:0] + S[7:0]), where A_i is a quadword (Qword), and each of B_n and S_n (n = 0, 1, ..., 7) is also a quadword. Each quadword is 64 bits wide. In Figure 2, S[7:0] (denoted 230) at the top is the initial partial sum, and S[7:0] (denoted 240) at the bottom is the resulting partial sum. Each multiplication operation 210 (A_i x B_n, n = 0, 1, ..., 7) generates a (64 x 64) = 128-bit product. Each product is represented as (Hi_n:Lo_n), as shown in the entries on the diagonal of the figure, where Hi_n is the high-order part (i.e., the most significant half) and Lo_n is the low-order part (i.e., the least significant half). This product can be added to the partial sum S_n with a minimal number of micro-operations (μops) and minimal latency. One way of adding this product to the partial sum requires two addition operations, each using a separate carry chain:
[0028] S_n = S_n + Lo_n (Equation 1)
[0029] S_n = S_n + Hi_{n-1} (Equation 2).
[0030] Assume that S[7:0] is initially all zero. After the exemplary addition operations 220 shown by the vertical dashed lines in Figure 2, the addition operations are equivalent to: S0 = S0 + Lo0, S1 = S1 + Lo1 + Hi0, S2 = S2 + Lo2 + Hi1, S3 = S3 + Lo3 + Hi2, etc.
[0031] In the example of Figure 2, A_i is multiplied by B_n, n = 0, 1, ..., 7, where A_i can be a part of a first long integer A, and each B_n can be a part of a second long integer B. The multiplication uses S_n, n = 0, 1, ..., 7, to store the partial sums. After the multiplication operations (A0 x B_n, n = 0, 1, ..., 7), the calculation can move on to A1 x B_n, A2 x B_n, and so on, until all parts of the long integer A have been processed. Each multiplication operation can use S_n to accumulate the partial sums. Finally, S[7:0] 240 holds the final result.
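The independence of the two carry chains can be illustrated with a minimal Python sketch. It is not the patent's sample code: the helper names (`to_limbs`, `from_limbs`, `mul_add_row`) are hypothetical, Python integers stand in for 64-bit registers, and the additions of Equation 1 and Equation 2 are deliberately run as two separate passes, each with its own carry, to show that neither chain needs the other's carry.

```python
MASK64 = (1 << 64) - 1  # one quadword


def to_limbs(value, count):
    """Split a non-negative integer into `count` 64-bit limbs, least significant first."""
    return [(value >> (64 * k)) & MASK64 for k in range(count)]


def from_limbs(limbs):
    """Inverse of to_limbs."""
    return sum(limb << (64 * k) for k, limb in enumerate(limbs))


def mul_add_row(a_i, b, s):
    """Return S = a_i x B + S as len(b) + 1 limbs; b and s must hold len(b) limbs each."""
    n = len(b)
    products = [a_i * b_k for b_k in b]          # 128-bit products Hi_k:Lo_k
    lo = [p & MASK64 for p in products]
    hi = [p >> 64 for p in products]
    out = list(s) + [0]                          # S_0 .. S_n

    carry = 0                                    # first carry chain
    for k in range(n):                           # Equation 1: S_k = S_k + Lo_k
        total = out[k] + lo[k] + carry
        out[k], carry = total & MASK64, total >> 64
    out[n] += carry                              # S_n was 0, so this cannot overflow

    carry = 0                                    # second, independent carry chain
    for k in range(1, n + 1):                    # Equation 2: S_k = S_k + Hi_{k-1}
        total = out[k] + hi[k - 1] + carry
        out[k], carry = total & MASK64, total >> 64
    assert carry == 0                            # a_i x B + S always fits in n + 1 limbs
    return out
```

For example, `from_limbs(mul_add_row(3, [5], [7]))` evaluates to 22, i.e. 3 x 5 + 7. Running the two passes one after the other is only a way of making the separation visible; the point of the scheme is that the two chains could equally be interleaved or run side by side.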
[0032] Embodiments of the present invention provide an addition instruction 103 that can be executed efficiently. Specifically, in a processor having multiple arithmetic logic units (ALUs), the additions in Equations 1 and 2 can be performed in parallel by two different ALUs once their respective data inputs (Lo_n, Hi_{n-1}) are available. In one embodiment, the addition in Equation 1 can be performed by one of the ADCX 104 / ADOX 105 instructions, and the addition in Equation 2 can be performed by the other of the ADCX 104 / ADOX 105 instructions.
[0033] In one embodiment, ADCX 104 does not modify any arithmetic flag other than CF (the carry flag), and ADOX 105 does not modify any arithmetic flag other than OF (the overflow flag). That is, ADCX 104 only reads and writes the CF flag without changing the other flags, and ADOX 105 only reads and writes the OF flag without changing the other flags. By restricting each addition instruction to accessing only one flag, two or more such addition instructions (each accessing a different flag) can be defined and executed without any data dependency between them. This is in contrast to existing addition instructions, which overwrite multiple or all arithmetic flags and therefore cannot be executed independently.
[0034] In an alternative embodiment, ADCX 104 and ADOX 105 use their respective associated flags (i.e., CF and OF, respectively) for carry input and carry output, and neither modifies the flag associated with the other. However, ADCX 104 and ADOX 105 may also modify the other arithmetic flags (e.g., SF, PF, AF, ZF, etc.), such as by setting them to zero or another predetermined value.
[0035] In one embodiment, the addition instruction is defined as follows:
[0036] ADCX 104:
[0037] CF:regdst = reg1 + reg2 + CF; and
[0038] ADOX 105:
[0039] OF:regdst=reg1+reg2+OF.
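As a rough illustration of these definitions, the following Python sketch models each instruction as a function over a small flags record. The `Flags` class and the function names are hypothetical, the operand width is assumed to be 64 bits, and only four flags are modelled; it is a behavioural sketch, not a description of the hardware.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # assumed 64-bit operand width


@dataclass
class Flags:
    cf: int = 0  # carry flag, used only by adcx
    of: int = 0  # overflow flag, used only by adox
    zf: int = 0  # left untouched by both functions
    sf: int = 0  # left untouched by both functions


def adcx(flags, reg1, reg2):
    """CF:regdst = reg1 + reg2 + CF. Reads and writes CF only."""
    total = reg1 + reg2 + flags.cf
    flags.cf = total >> 64
    return total & MASK64


def adox(flags, reg1, reg2):
    """OF:regdst = reg1 + reg2 + OF. Reads and writes OF only."""
    total = reg1 + reg2 + flags.of
    flags.of = total >> 64
    return total & MASK64


flags = Flags(zf=1)
low = adcx(flags, MASK64, 1)   # wraps to 0 and sets CF
high = adox(flags, 5, 6)       # OF was 0, so plain 5 + 6; CF is not consumed
```

After these two calls `low` is 0, `high` is 11, CF is 1, OF is 0, and ZF still holds the value it was given, which is the property that lets the two additions proceed without waiting on each other's flag.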
[0040] Although the flags CF and OF are described in the specification, it should be understood that any two different arithmetic flags of the processor's flag register can be used for the addition operations in (Equation 1) and (Equation 2). Furthermore, as mentioned above, different arithmetic flags can be used to similarly define other addition instructions; for example, the ADAX instruction can be defined as reading and writing only the AF flag without changing the other flags, the ADPF instruction can be defined as reading and writing only the PF flag without changing the other flags, and so on. The data widths of reg1, reg2, and regdst are the same and can be of any size. In some embodiments, the destination regdst can be the same as reg1 or reg2 (i.e., one of the source registers is overwritten).
[0041] Figure 3 is an example of sample code 300, which includes ADCX and ADOX (i.e., ADCX 104 and ADOX 105 of Figure 1) in the multiplication of two long numbers A[0:N-1] x B[0:N-1]. Each of A_i and B_n (i, n = 0, ..., N-1) is a quadword (although different data widths can be used). Sample code 300 breaks down the computation into a sequence of A_i x B[0:N-1] (i = 0, ..., N-1), such as the diagonal sequence shown in Figure 2. This calculation can be grouped into triples of MULX, ADCX, and ADOX. In one embodiment, where the data width is 64 bits, the MULX instruction is defined as performing an unsigned multiplication of a 64-bit number by another 64-bit number (stored in the RDX register as an implicit operand) without affecting any arithmetic flags.
[0042] MULX: r64a, r64b, r/m64
[0043] Here r64a represents the first 64-bit register, which stores the most significant half of the multiplication product, r64b represents the second 64-bit register, which stores the least significant half of the multiplication product, and r/m64 represents the 64-bit register or memory location used as the input to the multiplication. At the beginning of sample code 300, assume that the 64-bit value A_i has been assigned to the RDX register and that an XOR operation has been performed to clear all arithmetic flags. In one embodiment, each of rax, rbx, and RDX is a 64-bit register, such as a general-purpose register.
[0044] Using the example of Figure 2, sample code 300 corresponds to the following operations:
[0045] Hi0:Lo0 = A_i x B0
[0046] CF:S0 = S0 + Lo0 + CF
[0047] OF:S1 = S1 + Hi0 + OF
[0048] Hi1:Lo1 = A_i x B1
[0049] CF:S1 = S1 + Lo1 + CF
[0050] OF:S2 = S2 + Hi1 + OF
[0051] Hi2:Lo2 = A_i x B2
[0052] CF:S2 = S2 + Lo2 + CF
[0053] OF:S3 = S3 + Hi2 + OF
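The operation sequence above can be traced with a short Python emulation. This is a sketch under stated assumptions rather than sample code 300 itself: `mulx` and `multiply_limbs` are hypothetical names, Python integers stand in for 64-bit registers, the CF and OF flags are plain local variables, and the step that folds the last carry into the top quadword of each row is an assumption, since the listing above stops after the third triple.

```python
MASK64 = (1 << 64) - 1


def mulx(rdx, src):
    """Model of MULX: unsigned 64 x 64 -> (hi, lo); no flags are involved."""
    product = rdx * src
    return product >> 64, product & MASK64


def multiply_limbs(a, b):
    """Multiply two little-endian lists of 64-bit limbs; returns len(a) + len(b) limbs."""
    n = len(b)
    s = [0] * (len(a) + n)
    for i, a_i in enumerate(a):                 # A_i plays the role of RDX
        cf = of = 0                             # the initial XOR clears the flags
        for k, b_k in enumerate(b):
            hi, lo = mulx(a_i, b_k)             # Hi_k:Lo_k = A_i x B_k
            total = s[i + k] + lo + cf          # CF:S_k = S_k + Lo_k + CF
            s[i + k], cf = total & MASK64, total >> 64
            total = s[i + k + 1] + hi + of      # OF:S_(k+1) = S_(k+1) + Hi_k + OF
            s[i + k + 1], of = total & MASK64, total >> 64
        # End of row (assumed step): fold the last CF carry into the row's top limb.
        total = s[i + n] + cf
        s[i + n] = total & MASK64
        assert of == 0 and total >> 64 == 0     # a full row never overflows its window
    return s
```

Within row i, the quadword written as S_k in the listing corresponds to `s[i + k]`, because each new A_i shifts the window of partial sums up by one quadword. As a small check, `multiply_limbs([3], [5])` returns `[15, 0]`.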
[0054] Because ADCX and ADOX use two different flags, they can be executed in parallel as long as their respective data inputs are available. In some embodiments with 3 dispatch ports (i.e., 3 ALUs), assuming MULX, ADCX, and ADOX are on different ALUs, each with a throughput of 1, a single-μop MULX, a single-μop ADCX, and a single-μop ADOX can achieve a maximum throughput of one cycle per multiplication triple (i.e., the triple MULX/ADCX/ADOX). In another embodiment, MULX costs 2 μops and ADCX and ADOX each cost 1 μop. Therefore, assuming they are all on different ALUs with a throughput of 1, at least 4 ALUs are needed to achieve the maximum throughput of one cycle per multiplication triple. MULX, ADCX, and ADOX can work on machines with fewer ALUs, but maximum performance will not be achieved.
[0055] Figure 4 is a block diagram illustrating an embodiment of parallel processing of multiplication triples. The diagram shows that a MULX can begin execution in each cycle. As execution continues, in each cycle (e.g., 1 μop), a new MULX can begin execution simultaneously with a pair of ADCX and ADOX (e.g., as shown in each of cycles 3-6). Specifically, ADCX and ADOX can be processed in parallel during the same cycle. The MULX can take any number of cycles (e.g., one, two, or more): as long as a sufficient number of ALUs are available at the start of each cycle to support the start of a new MULX, a throughput of one cycle per triple can be achieved regardless of the length of the MULX operation.
[0056] The example of Figure 4 illustrates the parallel execution of ADCX and ADOX within the same cycle. However, these two instructions can be executed in different cycles as long as their respective data inputs are available. Since there is no data dependency (i.e., on carry input/output) between the two instructions, each of ADCX and ADOX can be executed as soon as the half of the associated multiplication result that it consumes is available. For example, if the least significant half of the multiplication result is generated in the nth cycle and the most significant half is generated in the (n+1)th cycle, the ADCX and ADOX that use these results can be executed as early as the cycles immediately following, when their respective sources are available (i.e., in the (n+1)th and (n+2)th cycles, respectively). That is, ADCX and ADOX can be executed in parallel or in any order without any data dependency.
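The throughput argument above can be checked with a toy timing model in Python. Everything in it is an assumption made for illustration: an unlimited number of ALUs, a pipelined MULX whose results become usable `mul_latency` cycles after issue, one-cycle additions, and a hypothetical `shared_chain` mode in which both additions of a triple read and write the same flag and therefore serialise. The function name `schedule_triples` is also hypothetical.

```python
def schedule_triples(num_triples, mul_latency=3, shared_chain=False):
    """Return (adcx_cycles, adox_cycles): the cycle in which each addition executes."""
    adcx_cycles, adox_cycles = [], []
    prev_adcx = prev_adox = -1                   # cycles of the previous additions
    for n in range(num_triples):
        ready = n + mul_latency                  # MULX n issues in cycle n
        # ADCX n needs Lo_n, the CF written by ADCX n-1, and S_n written by ADOX n-1.
        adcx = max(ready, prev_adcx + 1, prev_adox + 1)
        if shared_chain:
            adox = adcx + 1                      # must wait for the flag ADCX n just wrote
        else:
            adox = max(ready, prev_adox + 1)     # needs only Hi_n and the previous OF
        adcx_cycles.append(adcx)
        adox_cycles.append(adox)
        prev_adcx, prev_adox = adcx, adox
    return adcx_cycles, adox_cycles


independent = schedule_triples(4)                    # two carry chains
serialised = schedule_triples(4, shared_chain=True)  # one shared flag
```

Under these assumptions the two-chain schedule executes one ADCX/ADOX pair per cycle (cycles 3, 4, 5, 6 for both instructions), while the shared-flag schedule needs two cycles per triple (ADCX in cycles 3, 5, 7, 9 and ADOX in cycles 4, 6, 8, 10). Changing `mul_latency` shifts the starting cycle but not the one-triple-per-cycle rate of the two-chain case.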
[0057] Figure 5 is a block diagram of an embodiment of an instruction processing apparatus 515 having an execution unit 540 operable to execute instructions including an example embodiment of addition instruction 103 of Figure 1. In some embodiments, instruction processing apparatus 515 may be a processor and/or may be included in a processor (e.g., processor 100 of Figure 1 or a similar processor). Alternatively, instruction processing apparatus 515 may be included in a different processor or electronic system.
[0058] Instruction processing apparatus 515 receives one or more addition instructions 103 (e.g., ADCX 104 and ADOX 105 of Figure 1). Decoder 530, which may be decoder 129 of Figure 1 or a similar decoder, receives the instructions in the form of higher-level machine instructions or macro instructions and decodes them to generate lower-level micro-operations, microcode entry points, microinstructions, or other lower-level instructions or control signals that reflect and/or are derived from the original higher-level instructions. The lower-level instructions or control signals can implement the operation of the higher-level instructions through lower-level (e.g., circuit-level or hardware-level) operations. The decoder 530 can be implemented using a variety of different mechanisms. Examples of suitable mechanisms include, but are not limited to, microcode, read-only memories (ROMs), lookup tables, hardware implementations, programmable logic arrays (PLAs), and other mechanisms for implementing decoders known in the art. In some microarchitecture embodiments, macro instructions can be executed directly without first being decoded by a decoder.
[0059] Execution unit 540 is coupled to decoder 530. Execution unit 540 may receive from decoder 530 one or more micro-operations, microcode entry points, micro-instructions, other instructions, or other control signals reflecting or derived from the received addition instruction 103. Execution unit 540 includes addition logic 542 to perform addition.
[0060] Execution unit 540 also receives input from registers, such as general purpose (GP) registers 570. Execution unit 540 receives carry input from flag register 580 and stores carry output in flag register 580. In one embodiment, a first addition instruction (e.g., ADCX 104) uses a first flag 581 for carry input and carry output, and a second addition instruction (e.g., ADOX 105) uses a second flag 582 for carry input and carry output. As described above, more addition instructions can be provided, each using a different flag for carry input and carry output.
[0061] To avoid obscuring the description, a relatively simple instruction processing apparatus 515 has been shown and described. It should be understood that other embodiments may have more than one execution unit. For example, the apparatus may include multiple execution units of different types, such as an arithmetic unit, an arithmetic logic unit (ALU), an integer unit, a floating-point unit, etc. At least one of these units may be responsive to embodiments of the addition instructions disclosed herein. Further embodiments of the instruction processing apparatus or processor may have multiple cores, logical processors, or execution engines. Execution units operable to execute one or more of the addition instructions may be included in at least one, at least two, most, or all of the cores, logical processors, or execution engines.
[0062] The instruction processing apparatus 515 or processor may optionally include one or more other known components. For example, other embodiments may include one or more of, or various combinations of, instruction fetch logic, scheduling logic, branch prediction logic, instruction and data caches, instruction and data translation lookaside buffers, prefetch buffers, microinstruction queues, microinstruction sequencers, bus interface units, second or higher-level caches, instruction scheduling logic, retirement logic, register renaming logic, etc. It should be appreciated that many different combinations and configurations of these components actually exist in processors, and the scope of the invention is not limited to any of these known combinations and configurations.
[0063] Figure 6 illustrates an example embodiment of an EFLAGS register 600, which represents a flag register with multiple flags. In one embodiment, the EFLAGS register 600 is a 32-bit register that includes a set of status flags (also known as arithmetic flags, e.g., the COSPAZ flags), a control flag, and a set of system flags.
[0064] The status flags include a carry flag (CF, bit 0) 610, a parity flag (PF, bit 2), an auxiliary carry flag (AF, bit 4), a zero flag (ZF, bit 6), a sign flag (SF, bit 7), and an overflow flag (OF, bit 11) 620. As described above, in one or more embodiments, the carry flag (CF, bit 0) and the overflow flag (OF, bit 11) may be used as the first and second flags 581, 582 associated with the addition instructions disclosed herein. CF and OF are emphasized for this reason, but their use is not required.
[0065] System flags include the trap flag (TF, bit 8), interrupt enable flag (IF, bit 9), I/O privilege level (IOPL, bits 12-13), nested task (NT, bit 14), resume flag (RF, bit 16), virtual-8086 mode (VM, bit 17), alignment check (AC, bit 18), virtual interrupt flag (VIF, bit 19), virtual interrupt pending flag (VIP, bit 20), and ID flag (ID, bit 21). Control flags include the direction flag (DF, bit 10). Bits 22-31 of EFLAGS are reserved.
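The bit positions listed above can be captured in a small Python table. The helper names `get_flag` and `set_flag` are hypothetical, and the two-bit IOPL field (bits 12-13) is left out to keep the sketch to single-bit flags.

```python
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "OF": 11,   # status flags
    "DF": 10,                                                 # control flag
    "TF": 8, "IF": 9, "NT": 14, "RF": 16, "VM": 17,           # system flags
    "AC": 18, "VIF": 19, "VIP": 20, "ID": 21,
}


def get_flag(eflags, name):
    """Return the value (0 or 1) of the named flag in a 32-bit EFLAGS image."""
    return (eflags >> EFLAGS_BITS[name]) & 1


def set_flag(eflags, name, value):
    """Return a copy of the EFLAGS image with only the named flag changed."""
    mask = 1 << EFLAGS_BITS[name]
    return (eflags | mask) if value else (eflags & ~mask)


eflags = set_flag(0, "CF", 1)        # only bit 0 changes
eflags = set_flag(eflags, "OF", 1)   # only bit 11 changes
```

After the two calls `eflags` equals `0x801`: bits 0 and 11 are set and every other bit is untouched, mirroring the way the first and second flags are updated independently of one another.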
[0066] EFLAGS register 600 is a specific example embodiment of a register with appropriate flags for implementing one or more embodiments, but this specific register and these specific flags are not specifically required.
[0067] Figure 7 is a flowchart of an example embodiment of a method 700 of processing an example embodiment of an addition instruction (e.g., addition instruction 103 of Figure 1). In various embodiments, method 700 may be executed by a general-purpose processor, a special-purpose processor (e.g., a graphics processor or digital signal processor), or another type of digital logic device or instruction processing apparatus. In some embodiments, method 700 may be executed by processor 100 of Figure 1, instruction processing apparatus 515 of Figure 5, or a similar processor or instruction processing apparatus. Alternatively, method 700 may be executed by different embodiments of processors or instruction processing apparatuses. Furthermore, processor 100 of Figure 1 and instruction processing apparatus 515 of Figure 5 can execute embodiments of operations and methods that are the same as, similar to, or different from those of method 700.
[0068] In one embodiment, method 700 includes a processor receiving a first addition instruction (block 710). The first addition instruction indicates a first flag in a flag register. The processor then receives a second addition instruction (block 720). The second addition instruction indicates a second flag in the flag register. The first and second addition instructions are executed without data dependency between them (block 730). The processor stores the carry output from the first addition instruction in the first flag and does not modify the second flag in the flag register (block 740). The processor also stores the carry output from the second addition instruction in the second flag and does not modify the first flag in the flag register (block 750).
[0069] The illustrated method includes operations visible from outside the processor or instruction processing device (e.g., from a software viewpoint). In other embodiments, the method may optionally include one or more other operations (e.g., one or more operations occurring within the processor or instruction processing device). As an example, upon receiving an instruction, the instruction may be decoded, converted, emulated, or otherwise transformed into one or more other instructions or control signals.
[0070] Exemplary computer systems and processors - Figures 8-12
[0071] Figures 8-12 are exemplary computer systems and processors. Other system designs and configurations known in the art for laptops, desktops, handheld PCs, personal digital assistants, engineering workstations, servers, network devices, network hubs, switches, embedded processors, digital signal processors (DSPs), graphics devices, video game devices, set-top boxes, microcontrollers, cellular phones, portable media players, handheld devices, and various other electronic devices are also suitable. In general, a wide variety of systems and electronic devices capable of incorporating the processors and/or other execution logic disclosed herein are suitable.
[0072] Referring now to Figure 8, shown is a block diagram of a system 1300 according to an embodiment of the present invention. System 1300 may include one or more processors 1310, 1315 coupled to a graphics memory controller hub (GMCH) 1320. The optional nature of the additional processor 1315 is indicated in Figure 8 by dashed lines.
[0073] Each processor 1310, 1315 may be a version of processor 1700. However, it should be noted that integrated graphics logic and integrated memory control unit may not be present in processor 1310, 1315.
[0074] Figure 8 shows that the GMCH 1320 may be coupled to a memory 1340, which may be, for example, dynamic random access memory (DRAM). In at least one embodiment, the DRAM may be associated with a non-volatile cache.
[0075] The GMCH 1320 may be a chipset or part of a chipset. The GMCH 1320 may communicate with processors 1310 and 1315 and control the interaction between processors 1310 and 1315 and memory 1340. The GMCH 1320 may also serve as an accelerated bus interface between processors 1310 and 1315 and other components of system 1300. In at least one embodiment, the GMCH 1320 communicates with processors 1310 and 1315 via a multi-drop bus such as a front-side bus (FSB) 1395.
[0076] In addition, the GMCH 1320 is coupled to a display 1345 (such as a flat panel display). The GMCH 1320 may include an integrated graphics accelerator. The GMCH 1320 is also coupled to an input/output (I/O) controller hub (ICH) 1350, which can be used to couple various peripheral devices to the system 1300. The embodiment of Figure 8 shows, by way of example, an external graphics device 1360, which may be a discrete graphics device coupled to ICH 1350, along with another peripheral device 1370.
[0077] Optionally, additional or different processors may also be present in system 1300. For example, additional processor 1315 may include an additional processor identical to processor 1310, an additional processor heterogeneous or asymmetric to processor 1310, an accelerator (such as a graphics accelerator or digital signal processing (DSP) unit), a field-programmable gate array, or any other processor. There can be a variety of differences between physical resources 1310 and 1315 across a range of metrics, including architectural, microarchitectural, thermal, and power consumption characteristics. These differences may effectively manifest themselves as asymmetry and heterogeneity among processing elements 1310 and 1315. For at least one embodiment, the various processing elements 1310 and 1315 may reside in the same die package.
[0078] Referring now to Figure 9, shown is a block diagram of a second system 1400 according to an embodiment of the present invention. As shown in Figure 9, the multiprocessor system 1400 is a point-to-point interconnect system and includes a first processor 1470 and a second processor 1480 coupled via a point-to-point interconnect 1450. As shown in Figure 9, each of processors 1470 and 1480 may be a version of processor 1700.
[0079] Alternatively, one or more of the processors 1470 and 1480 may be elements other than processors, such as accelerators or field-programmable gate arrays.
[0080] Although only two processors 1470 and 1480 are shown, it should be understood that the scope of the invention is not limited thereto. In other embodiments, one or more additional processing elements may be present in a given processor.
[0081] Processor 1470 may also include an integrated memory controller hub (IMC) 1472 and point-to-point (PP) interfaces 1476 and 1478. Similarly, the second processor 1480 includes an IMC 1482 and PP interfaces 1486 and 1488. Processors 1470 and 1480 can exchange data via a point-to-point (PtP) interface 1450 using PtP interface circuits 1478 and 1488. As shown in Figure 9, IMCs 1472 and 1482 couple the processors to their respective memories, namely memory 1442 and memory 1444, which may be portions of main memory locally attached to the respective processors.
[0082] Processors 1470 and 1480 can each exchange data with chipset 1490 via their respective PP interfaces 1452 and 1454 using point-to-point interface circuits 1476, 1494, 1486, and 1498. Chipset 1490 can also exchange data with high-performance graphics circuit 1438 via high-performance graphics interface 1439.
[0083] A shared cache (not shown) may be included within any one processor or may be included outside of two processors but still connected to these processors via a PP interconnect, so that if a processor is placed in a low-power mode, the local cache information of any one or both processors can be stored in the shared cache.
[0084] Chipset 1490 may be coupled to first bus 1416 via interface 1496. In one embodiment, first bus 1416 may be a Peripheral Component Interconnect (PCI) bus, or a bus such as a PCI Express bus or another third-generation I/O interconnect bus, although the scope of the invention is not limited thereto.
[0085] As shown in Figure 9, various I/O devices 1414 can be coupled to the first bus 1416, along with a bus bridge 1418 that couples the first bus 1416 to a second bus 1420. In one embodiment, the second bus 1420 may be a low pin count (LPC) bus. In one embodiment, various devices may be coupled to the second bus 1420, including, for example, a keyboard and/or mouse 1422, a communication device 1426, and a data storage unit 1428, such as a disk drive or other mass storage device, which may include code 1430. Further, audio I/O 1424 may be coupled to the second bus 1420. Note that other architectures are possible. For example, instead of the point-to-point architecture of Figure 9, a system may implement a multi-drop bus or other similar architecture.
[0086] Referring now to Figure 10, shown is a block diagram of a third system 1500 according to an embodiment of the present invention. Like elements in Figure 9 and Figure 10 bear like reference numerals, and certain aspects of Figure 9 have been omitted from Figure 10 to avoid obscuring other aspects of Figure 10.
[0087] Figure 10 shows that processing elements 1470 and 1480 may include integrated memory and I/O control logic ("CL") 1472 and 1482, respectively. In at least one embodiment, CL 1472 and 1482 may include memory controller hub logic (IMC) such as that described above in connection with Figures 8, 9, and 10. In addition, CL 1472 and 1482 may also include I/O control logic. Figure 10 shows that not only are memories 1442 and 1444 coupled to CL 1472 and 1482, but also that I/O devices 1514 are coupled to control logic 1472 and 1482. Legacy I/O devices 1515 are coupled to chipset 1490.
[0088] Referring now to Figure 11, shown is a block diagram of a SoC 1600 according to an embodiment of the present invention. Similar elements in Figure 12 bear like reference numerals. Also, dashed-line boxes are optional features on more advanced SoCs. In Figure 11, interconnect unit 1602 is coupled to: an application processor 1610 that includes a set of one or more cores 1702A-N and a shared cache unit 1706; a system agent unit 1710; a bus controller unit 1716; an integrated memory controller unit 1714; a set of one or more media processors 1620, which may include integrated graphics logic 1708, an image processor 1624 for providing still and/or video camera functionality, an audio processor 1626 for providing hardware audio acceleration, and a video processor 1628 for providing video encoding/decoding acceleration; a static random access memory (SRAM) unit 1630; a direct memory access (DMA) unit 1632; and a display unit 1640 for coupling to one or more external displays.
[0089] The various embodiments of the mechanisms disclosed herein can be implemented in hardware, software, firmware, or combinations of these implementations. Embodiments of the invention can be implemented as computer programs or program code executable on a programmable system, the programmable system including at least one processor, a storage system (including volatile and non-volatile memories and / or storage elements), at least one input device, and at least one output device.
[0090] Program code can be applied to input data to perform the functions described herein and generate output information. The output information can be applied to one or more output devices in a known manner. For the purposes of this application, the processing system includes any system having a processor such as, for example, a digital signal processor (DSP), a microcontroller, an application-specific integrated circuit (ASIC), or a microprocessor.
[0091] The program code can be implemented in a high-level procedural or object-oriented programming language to communicate with the processing system. The program code can also be implemented in assembly language or machine language if desired. In fact, the mechanisms described herein are not limited in scope to any particular programming language. In any case, the language can be a compiled or an interpreted language.
[0092] One or more aspects of at least one embodiment can be implemented by representative instructions stored on a machine-readable medium, which represent various logic within a processor and which, when read by a machine, cause the machine to fabricate logic to perform the techniques described herein. Such representations, known as "IP cores," can be stored on a tangible machine-readable medium and supplied to various customers or manufacturing facilities to be loaded into the fabrication machines that actually make the logic or processor.
[0093] Such machine-readable storage media may include, but are not limited to, non-volatile physical arrangements of particles manufactured or formed by a machine or device, including storage media such as: hard disks; any other type of disk including floppy disks, optical disks, CD-ROMs, rewritable CD-RWs, and magneto-optical disks; semiconductor devices such as read-only memory (ROM); random access memory (RAM) such as dynamic random access memory (DRAM) and static random access memory (SRAM); erasable programmable read-only memory (EPROM); flash memory; electrically erasable programmable read-only memory (EEPROM); magnetic cards or optical cards; or any other type of medium suitable for storing electronic instructions.
[0094] Accordingly, embodiments of the present invention also include non-transitory, tangible machine-readable media containing instructions in a vector-friendly instruction format or containing design data, such as hardware description language (HDL) data, that defines the structures, circuits, apparatuses, processors, and/or system features described herein. Such embodiments may also be referred to as program products.
[0095] In some cases, an instruction converter can be used to convert an instruction from a source instruction set to a target instruction set. For example, the instruction converter can translate (e.g., using static binary translation, or dynamic binary translation including dynamic compilation), morph, emulate, or otherwise convert an instruction into one or more other instructions to be processed by the core. The instruction converter can be implemented in software, hardware, firmware, or a combination thereof. The instruction converter can be on the processor, off the processor, or partly on and partly off the processor.
[0096] Figure 13 is a block diagram contrasting the use of a software instruction converter to convert binary instructions in a source instruction set into binary instructions in a target instruction set, according to an embodiment of the present invention. In the illustrated embodiment, the instruction converter is a software instruction converter, but alternatively it can be implemented in software, firmware, hardware, or various combinations thereof. Figure 13 shows that a program in a high-level language 1802 can be compiled using an x86 compiler 1804 to generate x86 binary code 1806 that can be natively executed by a processor 1816 with at least one x86 instruction set core (assuming that some of the compiled instructions are in the vector-friendly instruction format). The processor 1816 with at least one x86 instruction set core represents any processor that can perform substantially the same functions as an Intel processor with at least one x86 instruction set core by compatibly executing or otherwise processing (1) a substantial portion of the instruction set of the Intel x86 instruction set core or (2) object code versions of applications or other software targeted to run on an Intel processor with at least one x86 instruction set core, in order to achieve substantially the same result as an Intel processor with at least one x86 instruction set core. The x86 compiler 1804 represents a compiler that generates x86 binary code 1806 (e.g., object code) that can, with or without additional linkage processing, be executed on the processor 1816 with at least one x86 instruction set core.
Similarly, Figure 13 shows that the program in the high-level language 1802 can be compiled using an alternative instruction set compiler 1808 to generate alternative instruction set binary code 1810 that can be natively executed by a processor 1814 without at least one x86 instruction set core (e.g., a processor with cores that execute the MIPS instruction set of MIPS Technologies, Inc., Sunnyvale, California, and/or the ARM instruction set of ARM Holdings, Inc., Sunnyvale, California). The instruction converter 1812 is used to convert the x86 binary code 1806 into code that can be natively executed by the processor 1814 without an x86 instruction set core. This converted code is unlikely to be identical to the alternative instruction set binary code 1810, because an instruction converter capable of producing identical code would be difficult to build; however, the converted code will accomplish the same general operation and will be made up of instructions from the alternative instruction set. Thus, the instruction converter 1812 represents software, firmware, hardware, or a combination thereof that, through emulation, simulation, or any other process, allows a processor or other electronic device that does not have an x86 instruction set processor or core to execute the x86 binary code 1806.
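The idea of converting each source instruction into one or more target instructions can be illustrated with a minimal, table-driven sketch of static translation. Everything in it is an assumption made for illustration: the mnemonics `INC`, `SWAP`, `ADDI`, and `MOV`, the scratch register `tmp`, and the names `TRANSLATION_TABLE` and `translate` are hypothetical and are not taken from this application or from any real instruction set.

```python
# Hypothetical source-to-target mapping, for illustration only.
# Each source mnemonic expands to one or more target instruction templates;
# {0}, {1}, ... are placeholders for the source operands.
TRANSLATION_TABLE = {
    "INC": ["ADDI {0}, {0}, 1"],
    "SWAP": ["MOV tmp, {0}", "MOV {0}, {1}", "MOV {1}, tmp"],
}


def translate(program):
    """Statically translate a list of source instructions into target instructions."""
    translated = []
    for line in program:
        mnemonic, _, rest = line.partition(" ")
        operands = [op.strip() for op in rest.split(",")] if rest else []
        templates = TRANSLATION_TABLE.get(mnemonic)
        if templates is None:
            # A real converter would fall back to emulation or report an error.
            raise ValueError(f"no translation for {mnemonic!r}")
        translated.extend(t.format(*operands) for t in templates)
    return translated


if __name__ == "__main__":
    for instruction in translate(["INC r1", "SWAP r1, r2"]):
        print(instruction)
```

In this sketch one source instruction (`SWAP`) becomes three target instructions, which mirrors the statement that the converted code is made up of instructions from the alternative instruction set while accomplishing the same general operation.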
[0097] Certain operations of the instructions in the vector-friendly instruction format disclosed herein may be performed by hardware components and may be embodied in machine-executable instructions that are used to cause, or at least result in, a circuit or other hardware component programmed with the instructions performing the operations. The circuit may include a general-purpose or special-purpose processor, or a logic circuit, to name just a few examples. The operations may also optionally be performed by a combination of hardware and software. Execution logic and/or a processor may include specific or particular circuitry or other logic responsive to a machine instruction, or to one or more control signals derived from the machine instruction, to store an instruction-specified result operand. For example, embodiments of the instructions disclosed herein may be executed in one or more of the systems of Figures 8-12, and embodiments of the instructions in the vector-friendly instruction format may be stored in program code to be executed in those systems. Additionally, the processing elements of these figures may utilize one of the pipelines and/or architectures (e.g., the in-order and out-of-order architectures) detailed herein. For example, the decode unit of the in-order architecture may decode the instructions and pass the decoded instructions to a vector or scalar unit, etc.
[0098] The foregoing description is intended to illustrate preferred embodiments of the invention. Based on the above discussion, it should also be apparent to those skilled in the art that, in this rapidly developing field where further progress is difficult to foresee, modifications can be made to the invention in terms of arrangement and detail without departing from the principles of the invention falling within the scope of the appended claims and their equivalents. For example, one or more operations of the method may be combined or further separated.
[0099] Alternative embodiments
[0100] Although embodiments of native execution of vector-friendly instruction formats have been described, alternative embodiments of the invention may execute vector-friendly instruction formats via an emulation layer running on a processor executing a different instruction set (e.g., a processor executing the MIPS instruction set of MIPS Technologies, Inc., Sunnyvale, California, or a processor executing the ARM instruction set of ARM Holdings, Inc., Sunnyvale, California). Similarly, although the flowcharts in the accompanying drawings illustrate a particular sequence of operations for certain embodiments of the invention, it should be understood that this sequence is exemplary (e.g., alternative embodiments may perform operations in a different order, combine certain operations, overlap certain operations, etc.).
[0101] In the foregoing description, numerous specific details have been set forth for the purpose of explanation in order to provide a thorough understanding of embodiments of the invention. However, it will be apparent to those skilled in the art that one or more other embodiments may be practiced without some of these specific details. The specific embodiments described are provided not to limit the invention but to illustrate embodiments thereof. The scope of the invention is not determined by the specific examples provided, but only by the appended claims.
[0102] It should be understood that the above description is intended to be illustrative and not restrictive. Many other embodiments will be apparent to those skilled in the art upon reading and understanding the above description. Therefore, the scope of the invention should be determined with reference to the appended claims, along with the full scope of equivalents to which such claims are entitled.
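As an illustration of the addition this application describes (an unsigned addition of two operands and the overflow flag, in which the overflow flag is updated and the other arithmetic flags are left unchanged), a minimal software model might look like the following. It is a sketch under stated assumptions, not the claimed hardware: the 64-bit operand width, the `Flags` class, the function names `adox`, `adcx`, and `add_two_chains`, and the least-significant-word-first ordering of multi-word values are all hypothetical choices made for the example.

```python
from dataclasses import dataclass

MASK64 = (1 << 64) - 1  # assumed 64-bit operand width


@dataclass
class Flags:
    """Hypothetical flag register model that tracks only CF and OF."""
    cf: int = 0
    of: int = 0


def adox(dest, src, flags):
    """Unsigned add of dest, src, and OF. Only OF is updated; CF is untouched."""
    total = (dest & MASK64) + (src & MASK64) + flags.of
    flags.of = total >> 64  # carry-out of the top bit, 0 or 1
    return total & MASK64


def adcx(dest, src, flags):
    """Unsigned add of dest, src, and CF. Only CF is updated; OF is untouched."""
    total = (dest & MASK64) + (src & MASK64) + flags.cf
    flags.cf = total >> 64
    return total & MASK64


def add_two_chains(a, b, c, d):
    """Compute a+b on the CF chain and c+d on the OF chain, interleaved.

    Each argument is a list of 64-bit words, least significant word first
    (an assumption of this sketch).
    """
    flags = Flags()
    sum_ab, sum_cd = [], []
    for wa, wb, wc, wd in zip(a, b, c, d):
        sum_ab.append(adcx(wa, wb, flags))  # uses and updates CF only
        sum_cd.append(adox(wc, wd, flags))  # uses and updates OF only
    return sum_ab, sum_cd, flags
```

Because each function reads and writes a different flag, the two additions in the loop body do not depend on each other, which is the property that lets two carry chains proceed independently.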
Claims
1. An apparatus for instruction processing, comprising: a decoder circuit to decode an instruction, the instruction identifying a location of a first operand and a location of a second operand; and an execution circuit to execute the decoded instruction to perform an unsigned addition of only the first operand, the second operand, and an overflow flag to generate a result, and to store the result.
2. The apparatus of claim 1, wherein the identified location of the first operand is a register.
3. The apparatus of claim 1 or 2, wherein the execution circuit is to update the overflow flag.
4. The apparatus of claim 3, wherein the overflow flag is a flag in a flag register.
5. The apparatus of claim 4, wherein the execution circuit is not to modify other arithmetic flags in the flag register.
6. The apparatus of claim 1, wherein the result is to be stored in the identified location of the first operand.
7. A method for instruction processing, comprising: decoding an instruction, the instruction identifying a location of a first operand and a location of a second operand; and executing the decoded instruction to perform an unsigned addition of only the first operand, the second operand, and an overflow flag to generate a result, and storing the result.
8. The method of claim 7, wherein the identified location of the first operand is a register.
9. The method of claim 7 or 8, wherein the executing comprises updating the overflow flag.
10. The method of claim 9, wherein the overflow flag is a flag in a flag register.
11. The method of claim 10, wherein the executing does not modify other arithmetic flags in the flag register.
12. The method of claim 7, wherein the result is stored in the identified location of the first operand.
13. A system for instruction processing, comprising: a memory to store an instruction; and a processor coupled to the memory, the processor comprising: a decoder circuit to decode the instruction, the instruction identifying a location of a first operand and a location of a second operand; and an execution circuit to execute the decoded instruction to perform an unsigned addition of only the first operand, the second operand, and an overflow flag to generate a result, and to store the result.
14. The system of claim 13, wherein the identified location of the first operand is a register.
15. The system of claim 13 or 14, wherein the execution circuit is to update the overflow flag.
16. The system of claim 15, wherein the overflow flag is a flag in a flag register.
17. The system of claim 16, wherein the execution circuit is not to modify other arithmetic flags in the flag register.
18. The system of claim 13, wherein the result is to be stored in the identified location of the first operand.
19. A computer-readable medium comprising instructions stored thereon which, when executed, cause a computing device to perform the method of any one of claims 7-12.