Vector Extraction and Merge Instruction

JP2025522516A5 · Pending · Publication Date: 2026-06-15 · ARM LTD

Patent Information

Authority / Receiving Office: JP · JP
Patent Type: Applications
Current Assignee / Owner: ARM LTD
Filing Date: 2023-06-14
Publication Date: 2026-06-15

Application Information

Patent Timeline

14 Jun 2023: Application
15 Jun 2026: Publication (JP2025522516A5)

IPC: G06F9/28; G06F17/16

CPC: G06F9/30036; G06F9/30032; G06F9/30018; G06F9/30038; G06F9/30098; G06F9/30109; G06F9/30145

AI Tagging

Application Domain

Operational speed enhancement; Register arrangements

AI Technical Summary

Technical Problem

Existing data processing systems face inefficiencies in processing vector instructions, particularly when parts of vectors are dependent on each other, leading to increased instruction fetching and decoding overhead.

Method used

A data processing apparatus and method use a decoder circuit and a processing circuit to execute vector extraction and merge instructions. The beats of a vector instruction can be processed in parallel or staggered, with information propagated between beats, and beat status information tracks progress. This allows flexible microarchitecture implementations and efficient scaling across different performance and energy points.

Benefits of technology

This approach reduces overhead and improves code density by allowing flexible microarchitecture designs that can adapt to various power and resource constraints, enabling efficient execution of vector instructions with dependent parts while maintaining compliance with the instruction set architecture.

✦ Generated by Eureka AI based on patent content.

Abstract

An apparatus, a method, and a medium are provided. The apparatus includes a decoder circuit that generates a control signal in response to a vector extraction and merge instruction specifying a control parameter, a first vector register, a second vector register, and a destination vector register. The apparatus also includes a processing circuit that executes a plurality of beats of processing in response to the control signal, each beat including processing corresponding to at least a portion of the first vector register and the destination vector register. The processing for the K-th beat includes extracting the bits specified by the control parameter from the K-th portion of the first vector register, concatenating them with further bits, and storing the result in the K-th portion of the destination register. The further bits are extracted from the first portion of the second vector register for the first portion, and otherwise from the (K - 1)-th portion of the first vector register.

Description

Technical Field

[0001] The present technique relates to an apparatus, a method of operating the apparatus, and a computer-readable medium storing computer-readable code for manufacturing the apparatus.

Summary of the Invention

[0002] Some data processing systems support the processing of vector instructions where the source operand or result value of the instruction is a vector containing multiple parts. By supporting the processing of several separate parts of the vector in response to a single instruction, code density can be improved and the overhead of instruction fetching and decoding can be reduced. Sometimes, it is desirable to execute vector instructions where the parts of the vector are dependent on each other.

[0003] According to some configurations, an apparatus is provided, comprising: a plurality of vector registers; a decoder circuit that generates control signals in response to a vector extraction and merge instruction, the instruction specifying a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and a processing circuit that executes a plurality of beats of processing in response to the control signals, each beat including combinational processing corresponding to at least a portion of the first source vector register and the destination vector register. The processing circuit is configured to set beat status information indicating which beats of the vector extraction and merge instruction have completed, and to suppress the beats that the beat status information indicates as completed. For the K-th beat, corresponding to the K-th portion of each of the specified registers, the combinational processing includes: extracting the bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more additional bits, and storing the result of the concatenation in the K-th portion of the destination register; and, when the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination register, to be processed in the (K + 1)-th beat of the plurality of beats. For the first portion of the specified registers, the one or more additional bits are extracted from the first portion of the second source vector register. For each portion other than the first portion, the one or more additional bits are carried from the (K - 1)-th portion of the first source vector register.

[0004] According to some configurations, a method is provided for operating an apparatus including a plurality of vector registers, a decoder circuit, and a processing circuit. The method includes: using the decoder circuit to generate a control signal in response to a vector extraction and merge instruction, the instruction specifying a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and using the processing circuit to execute a plurality of beats of processing in response to the control signal, each beat including combinational processing corresponding to at least a portion of the first source vector register and the destination vector register, to set beat status information indicating which beats of the vector extraction and merge instruction have completed, and to suppress the beats that the beat status information indicates as completed. For the K-th beat, corresponding to the K-th portion of each of the specified registers, the combinational processing includes: extracting the bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more additional bits, and storing the result of the concatenation in the K-th portion of the destination register; and, when the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination register, to be processed in the (K + 1)-th beat of the plurality of beats. For the first portion of the specified registers, the one or more additional bits are extracted from the first portion of the second source vector register. For each portion other than the first portion, the one or more additional bits are carried from the (K - 1)-th portion of the first source vector register.

[0005] According to some configurations, a computer-readable medium is provided that stores computer-readable code for manufacturing an apparatus, the apparatus comprising: a plurality of vector registers; a decoder circuit that generates a control signal in response to a vector extraction and merge instruction, the instruction specifying a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and a processing circuit that executes a plurality of beats of processing in response to the control signal, each beat including combinational processing corresponding to at least a portion of the first source vector register and the destination vector register. The processing circuit is configured to set beat status information indicating which beats of the vector extraction and merge instruction have completed, and to suppress the beats that the beat status information indicates as completed. For the K-th beat, corresponding to the K-th portion of each of the specified registers, the combinational processing includes: extracting the bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more further bits, and storing the result of the concatenation in the K-th portion of the destination register; and, when the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination register, to be processed in the (K + 1)-th beat of the plurality of beats. For the first portion of the specified registers, the one or more further bits are extracted from the first portion of the second source vector register. For each portion other than the first portion, the one or more further bits are carried from the (K - 1)-th portion of the first source vector register.

[0006] In some configurations, the computer-readable medium is a non-transitory computer-readable medium.

[0007] According to some configurations, a computer program is provided for controlling a host data processing device to provide an instruction execution environment, the computer program comprising: register logic having a plurality of vector registers; decoder logic that generates a control signal in response to a vector extraction and merge instruction, the instruction specifying a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and processing logic that executes a plurality of beats of processing in response to the control signal, each beat including combinational processing corresponding to at least a portion of the first source vector register and the destination vector register. The processing logic is configured to set beat status information indicating which beats of the vector extraction and merge instruction have completed, and to suppress the beats that the beat status information indicates as completed. For the K-th beat, corresponding to the K-th portion of each of the specified registers, the combinational processing includes: extracting the bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more additional bits, and storing the result of the concatenation in the K-th portion of the destination register; and, when the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination register, to be processed in the (K + 1)-th beat of the plurality of beats. For the first portion of the specified registers, the one or more additional bits are extracted from the first portion of the second source vector register. For each portion other than the first portion, the one or more additional bits are carried from the (K - 1)-th portion of the first source vector register.

[0008] In some configurations, the computer program is recorded on a non-transitory computer-readable medium.

[0009] This technique is further described by way of example only with reference to those configurations shown in the accompanying drawings.

Brief Description of the Drawings

[0010] Figures 1 to 20

Mode for Carrying Out the Invention

[0011] Software written according to a given instruction set architecture can be executed within a range of various different data processing apparatuses having different hardware implementations. As long as the result expected by the architecture is given when a given set of instructions is executed, a particular implementation can freely change its microarchitecture design in any way that achieves this architecture compliance. For example, in some applications, energy efficiency may be more important than performance, and thus the microarchitecture design of the processing circuitry provided to execute instructions from the instruction set architecture can be designed to consume as little energy as possible, even at the expense of performance. Other applications may view performance as a more important criterion than energy efficiency, and thus may include more complex hardware structures that allow for a greater throughput of instructions but may consume more power. Thus, it may be desirable to design an instruction set architecture to support scaling across a range of different energy or performance points.

[0012] In some configurations, an apparatus is provided that includes a plurality of vector registers and a decoder circuit that generates control signals in response to a vector extraction and merge instruction. The vector extraction and merge instruction specifies a control parameter and specifies, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register. The apparatus also includes a processing circuit that executes a plurality of beats of processing in response to the control signals. Each beat includes combination processing corresponding to at least a portion of the first source vector register and the destination vector register. The processing circuit is configured to set beat status information indicating which beats of the vector extraction and merge instruction have completed, and to suppress the beats that the beat status information indicates as completed. For the K-th beat, corresponding to the K-th portion of each of the specified registers, the combination processing includes extracting the bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more additional bits, and storing the result of the concatenation in the K-th portion of the destination register. When the K-th portion is not the last portion of the specified registers, the combination processing also includes carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination register, to be processed in the (K + 1)-th beat of the plurality of beats. For the first portion of the specified registers, the one or more additional bits are extracted from the first portion of the second source vector register. For each portion other than the first portion, the one or more additional bits are carried from the (K - 1)-th portion of the first source vector register.

[0013] This configuration enables a microarchitecture that supports vector instructions to scale more efficiently for different performance and energy points. By providing beat status information that tracks the completed beats of two or more vector instructions, this gives a particular microarchitecture implementation the freedom to vary the amount of overlap in the execution of different vector instructions, and as a result, it is possible to execute the respective beats of different vector instructions in parallel with each other while still tracking the progress of each of the partially executed instructions. Some microarchitecture implementations may choose not to overlap the execution of each vector instruction at all, such that all the beats of one vector instruction are completed before the next instruction starts. Other microarchitectures may shift the execution of consecutive vector instructions such that a first subset of the beats of the second vector instruction is executed in parallel with a second subset of the beats from the first vector instruction.

[0014] The vector extraction and merge instructions are instructions of an instruction set architecture that are interpreted by a decoder circuit. The instruction set architecture forms a complete set of instructions that can be used by a programmer or compiler to instruct a processing circuit to execute operations. As described, as long as the processing circuit conforms to the instruction set architecture, the actual implementation of the microarchitecture, i.e., the physical arrangement of the circuits and logic blocks that make up the processing circuit, can vary from implementation to implementation. Some microarchitecture implementations can process all of the portions of a vector in parallel, and other implementations can process one or more portions of a vector at a time. Some vector instructions may be suitable for such flexibility. For example, a vector instruction that supports element-by-element addition of multiple elements of two source vectors can be divided into a plurality of scalar additions each corresponding to an element of the vector. However, instructions in which data propagates between different elements or between different portions (which may include multiple elements of a vector), i.e., instructions in which different portions are dependent on each other, may not be so easily adapted to such flexibility of the microarchitecture implementation.
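The independence of element-wise instructions can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names and the 32-bit element width are assumptions chosen for illustration. Because no data flows between beats, processing the vector in one beat or in several gives the same result.

```python
ELEM_MASK = 0xFFFFFFFF  # assumed 32-bit elements, for illustration only

def add_beat(a_part, b_part):
    """Process one beat: independent element-wise additions with wraparound."""
    return [(x + y) & ELEM_MASK for x, y in zip(a_part, b_part)]

def vector_add(a, b, elems_per_beat):
    """Add two vectors beat by beat; the beats share no state."""
    out = []
    for start in range(0, len(a), elems_per_beat):
        stop = start + elems_per_beat
        out.extend(add_beat(a[start:stop], b[start:stop]))
    return out

a = [1, 2, 3, 0xFFFFFFFF]
b = [10, 20, 30, 1]
# One beat covering the whole vector and four single-element beats agree.
assert vector_add(a, b, 4) == vector_add(a, b, 1) == [11, 22, 33, 0]
```

An instruction whose portions depend on each other cannot be split this simply, which is the case the following paragraphs address.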

[0015] The vector extraction and merge instruction is one such instruction. In the vector extraction and merge instruction, one or more bits from a first source vector register are concatenated with one or more bits from a second source vector register. The inventor recognized that a vector extraction and merge instruction that provides such flexibility in the microarchitecture can be implemented by providing a processing circuit configured to process one or more beats (corresponding to one or more parts of a specified vector register) in parallel or shifted and carry at least one bit between processed beats (i.e., from one part to another). As a result, the processing circuit does not consider each beat to be truly independent of the other beats. Instead, certain information can be propagated from one processed beat to another. In particular, the vector extraction and merge instruction specifies, as input, control parameters and a plurality of vector registers. The plurality of vector registers includes a first source vector register, a second source vector register, and a destination vector register. The control parameter indicates the number of bits to be extracted from the first source vector register during each beat of the process and can be specified explicitly within the instruction as a parameter passed to the decoder circuit or implicitly within the instruction as having a fixed value. For example, an instruction set architecture can define one or more vector extraction and merge instructions, each of which implicitly defines a fixed control parameter. The control parameter may be an indication value and thus may indirectly specify the number of bits to extract.

[0016] The combination processing defined in this way causes bits to propagate between beats. In the first beat (first portion), one or more additional bits of the second source vector register are concatenated with one or more bits (specified by the control parameter) extracted from the first source vector register. Bits of the first source vector register from the first beat (K = 1) are then carried (propagated) to the second beat (K = 2), where they are concatenated with one or more bits of the first source vector register belonging to that beat. This process is then repeated, with one or more bits of the K-th beat carried to the (K + 1)-th beat. A carry is generated when the K-th portion is not the last portion of the specified registers. In some configurations, no carry is generated for the last portion of the specified vector registers. In some alternative configurations, a carry is generated for at least one bit of the last portion of the first source vector register. It will be appreciated that the ordering of the beats can be independent of the ordering of the bits within the vector register. In one configuration, the first beat (K = 1) can correspond to the least significant set of bits of the vector register and the last beat can correspond to the most significant set of bits of the vector register. However, in some alternative configurations, the first beat (K = 1) can correspond to the most significant set of bits of the vector register and the last beat can correspond to the least significant set of bits of the vector register.
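The carry chain between beats can be sketched as follows. This is a minimal Python model under stated assumptions, not the patented circuit: the names `beat` and `run_beats`, the 32-bit portion width, and the bit layout (extracted bits are the low N - M bits of the source portion and land in the high positions of the destination portion; the first beat corresponds to the least significant portion) are all illustrative choices among the configurations the text allows.

```python
def beat(src1_part, further_bits, n_bits, m):
    """One beat: returns (destination portion, bits carried to the next beat).

    Assumed layout: the extracted bits are the low n_bits - m bits of the
    source portion and are placed in the high positions of the destination
    portion; the m further bits are placed in the low positions.
    """
    keep = n_bits - m
    extracted = src1_part & ((1 << keep) - 1)
    dest_part = (extracted << m) | further_bits
    carry_out = src1_part >> keep  # top m bits, not stored in this portion
    return dest_part, carry_out

def run_beats(src1, src2, n_bits, m):
    """Chain the beats: beat 1 takes bits from src2, later beats take the carry."""
    further = src2[0] >> (n_bits - m)
    dest = []
    for part in src1:  # beat K = 1, 2, ...
        dest_part, further = beat(part, further, n_bits, m)
        dest.append(dest_part)
    return dest

src1 = [0x11223344, 0x55667788, 0x99AABBCC, 0xDDEEFF00]
src2 = [0xA1B2C3D4, 0, 0, 0]
print([hex(p) for p in run_beats(src1, src2, 32, 8)])
# prints ['0x223344a1', '0x66778811', '0xaabbcc55', '0xeeff0099']
```

Each beat needs only its own source portion and the bits handed over from the previous beat, so an implementation could run the beats one at a time or several per cycle.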

[0017] In this way, the apparatus provides a processing circuit that enables the implementation of vector extraction and merge instructions whose microarchitecture implementation can be changed while still enabling compliance with the instruction set architecture, thereby providing a flexible implementation that can be adapted based on power constraints and circuit size requirements.

[0018] In some configurations, the decoder circuit is responsive to a vector extraction and merge instruction that specifies a scalar register. The plurality of beats includes a currently executing subset of one or more beats, which excludes the completed beats. In response to the control signal, the processing circuit stores at least one item of carry data in the scalar register, and the at least one item of carry data includes one or more bits that are carried between the currently executing subset of one or more beats and a further subset of one or more beats. The currently executing subset includes one or more beats of the plurality of beats and excludes the further subset of at least one beat. In such a configuration, the scalar register is used to carry at least one item of carry data between the currently executing subset of beats and one or more further subsets of one or more beats. The scalar register can be explicitly specified as one of a plurality of scalar registers, for example, as a parameter of the vector extraction and merge instruction. Alternatively, the processing circuit can comprise a specific carry register that is implicitly defined for the vector extraction and merge instruction.

[0019] A carry register can be used to propagate carry data within or from a currently executing subset of beats. In some configurations, for a first beat of a currently executing set of one or more beats, when the beat status information before execution of a vector extraction and merge instruction indicates that at least one beat should be suppressed, in response to a control signal, the processing circuit fetches one or more additional bits from a scalar register. A subset of one or more beats of the processing is executed in sequence using one or more bits of information propagated from a first subset of beats to a next subset of beats. During execution, the processing circuit reads control information to determine which beats include the first beat of a currently executing subset of one or more beats. If one or more beats of the processing have been previously executed, the control information indicates that these one or more beats should be suppressed. Thus, the processing circuit can presume that carry data is available within the scalar register and extracts one or more additional bits from the carry data within the scalar register.
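A resumable execution of this kind can be sketched in Python. Everything here is a hypothetical model rather than the patent's implementation: the `State` class, the representation of beat status as a count of completed beats, the `max_beats` limit standing in for an interruption, and the choice to hold only the M carried bits in the scalar register are assumptions for illustration.

```python
from dataclasses import dataclass

@dataclass
class State:
    beats_done: int = 0  # beat status, modelled as a count of completed beats
    carry: int = 0       # scalar register holding the carry data

def execute(src1, src2, dest, state, n_bits, m, max_beats):
    """Run up to max_beats not-yet-completed beats; return True when all are done."""
    keep = n_bits - m
    if state.beats_done == 0:
        further = src2[0] >> keep  # no beats suppressed: start from src2
    else:
        further = state.carry      # earlier beats suppressed: use carry data
    k = state.beats_done
    end = min(len(src1), k + max_beats)
    while k < end:
        dest[k] = ((src1[k] & ((1 << keep) - 1)) << m) | further
        further = src1[k] >> keep
        k += 1
    state.carry = further          # save carry data for a later subset of beats
    state.beats_done = k
    return k == len(src1)

src1 = [0x11223344, 0x55667788, 0x99AABBCC, 0xDDEEFF00]
src2 = [0xA1B2C3D4, 0, 0, 0]
full, split = [0] * 4, [0] * 4
execute(src1, src2, full, State(), 32, 8, max_beats=4)
st = State()
execute(src1, src2, split, st, 32, 8, max_beats=1)  # stops after one beat
execute(src1, src2, split, st, 32, 8, max_beats=3)  # resumes from the carry data
assert full == split
```

The resumed call skips the completed beat and takes its further bits from the saved carry instead of from the second source register, so the split run matches the uninterrupted one.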

[0020] The data included in the carry data can take various forms. In some configurations, the one or more bits to be carried comprise all the bits of a portion of the first source vector register, and taking the one or more additional bits from the scalar register includes taking the last subset of bits from the scalar register. As a result, the extraction of the one or more additional bits follows the same pattern regardless of whether the bits come from the scalar register or from the second source vector register, which simplifies the implementation. In some configurations, the one or more bits to be carried include the last set of M bits from a portion of the first source vector register, stored in a temporary set of bit positions within the scalar register, and taking the one or more additional bits from the scalar register includes taking bits from the temporary set of bit positions of the scalar register. As a result, fewer bits need to be carried within the scalar register. In some configurations, the last subset of bits is the most significant subset of bits, so that data propagates from the most significant bits of the (K - 1)-th portion to the K-th portion of the vector register. In alternative implementations, the data can be propagated in the opposite direction, and in such configurations, the last subset of bits is the least significant subset of bits.

[0021] In some configurations, concatenating the extracted bits includes storing the extracted bits in a first consecutive set of bit positions of the K-th portion of the destination register and storing the one or more additional bits in a second consecutive set of bit positions of the K-th portion of the destination register. In some configurations, the union of the first set of bit positions and the second set of bit positions comprises all the bit positions of the K-th portion of the destination register. In some configurations, the first consecutive set of bit positions and the second consecutive set of bit positions are non-overlapping. As a result, every bit position within the K-th portion of the destination register is defined as either one of the one or more additional bits or one of the extracted bits.

[0022] The ordering of the first consecutive set of bit positions and the second consecutive set of bit positions can be implementation-dependent. In some configurations, the first consecutive set of bit positions is the most significant set of bit positions of the K-th portion of the destination register, and the second consecutive set of bit positions is the least significant set of bit positions of the K-th portion of the destination register. Alternatively, the order of processing of the specified vector can be reversed. Thus, in some configurations, the first consecutive set of bit positions is the least significant set of bit positions of the K-th portion of the destination register, and the second consecutive set of bit positions is the most significant set of bit positions of the K-th portion of the destination register.

[0023] In some configurations, the extracted bits are taken from consecutive bit positions of the K-th portion of the first source vector register. The consecutive bit positions are specified by the control parameter and can be defined, for example, by a first bit position and a second bit position, or by a first bit position and the number of bits to be extracted.

[0024] In some configurations, the consecutive bit positions are the set of least significant consecutive bit positions of the K-th portion of the first source vector register. In such a configuration, the control parameter is only required to specify the number of consecutive bit positions to be extracted. The number of consecutive bit positions to be extracted may be specified as an immediate value or may be held in a register specified in the vector extraction and merge instruction. In an alternative configuration, the consecutive bit positions are the set of most significant consecutive bit positions of the K-th portion. In some configurations, only a subset of the possible numbers of consecutive bit positions to be extracted may be supported. For example, some configurations may support only consecutive bit positions that are 8, 16, or 24 bits in length. Thus, in such a configuration, the control parameter may indirectly specify the number of consecutive bit positions to be extracted by selecting one of the supported lengths. Such a configuration reduces the number of bits required to represent the control parameter.
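As a small illustration of an indirectly specified length, the Python sketch below maps an encoded control value to one of the supported lengths of 8, 16, or 24 bits and extracts that many least significant bits. The 2-bit encoding and the names `SUPPORTED_LENGTHS`, `decode_length`, and `extract_low_bits` are hypothetical; the text does not define an encoding.

```python
SUPPORTED_LENGTHS = {0b00: 8, 0b01: 16, 0b10: 24}  # hypothetical 2-bit encoding

def decode_length(control):
    """Map an encoded control parameter to the number of bits to extract."""
    if control not in SUPPORTED_LENGTHS:
        raise ValueError(f"unsupported control encoding: {control!r}")
    return SUPPORTED_LENGTHS[control]

def extract_low_bits(portion, control):
    """Extract the least significant consecutive bits selected by control."""
    length = decode_length(control)
    return portion & ((1 << length) - 1)

assert extract_low_bits(0x11223344, 0b00) == 0x44
assert extract_low_bits(0x11223344, 0b01) == 0x3344
assert extract_low_bits(0x11223344, 0b10) == 0x223344
```

Under this assumed encoding, two bits select among the three lengths, whereas an arbitrary bit count within a 32-bit portion would need at least five bits.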

[0025] In some configurations, each portion of each of the specified registers is an N-bit portion, the control parameter indicates a shift distance M specifying a number of bits, the one or more additional bits comprise M bits, and the bits extracted from the K-th portion of the first source vector register comprise N-M bits. As a result, the vector extraction and merge instruction combines M bits from the first portion of the second source vector register with N-M bits of the first portion of the first source vector register to form the first portion of the destination register. Further, for each subsequent portion, the vector extraction and merge instruction combines M bits from the (K-1)-th portion of the first source vector register with N-M bits of the K-th portion of the first source vector register to form the K-th portion of the destination register. In other words, M bits of each portion of the first source vector register are shifted so as to be stored in the next portion of the destination vector register.

[0026] For the first portion of the specified registers, the one or more additional bits can be selected in various ways. In some configurations, each N-bit portion is divided into a plurality of elements, the shift distance corresponds to an integer number of elements, and, for the first portion of the specified registers, the one or more additional bits comprise the most significant subset of the elements of the first portion of the second source vector register. As a result, the vector extraction and merge instruction takes the most significant subset of the elements of the first portion of the second source vector register and concatenates it with bits of the first source vector register to generate the result vector register.
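
The combination described in paragraphs [0025] and [0026] can be sketched as a small behavioral model. This is only one of the configurations the description allows: it assumes that the first portion is the least significant portion, that the N-M extracted bits are the least significant bits of each source portion and land in the most significant positions of the destination portion, and that the M additional bits come from the most significant end of the donor portion. The function name `extract_and_merge` and the list-of-integers representation are illustrative assumptions.

```python
def extract_and_merge(src1, src2, n_bits, m_bits):
    """Model one assumed configuration of the extraction and merge operation.

    src1, src2: lists of unsigned integers, one per N-bit portion,
    with index 0 being the first (least significant) portion.
    """
    keep_mask = (1 << (n_bits - m_bits)) - 1
    dest = []
    for k, portion in enumerate(src1):
        # N-M least significant bits extracted from the K-th portion of src1,
        # placed in the most significant positions of the destination portion.
        extracted = (portion & keep_mask) << m_bits
        # M additional bits: top of the previous src1 portion or, for the
        # first portion, top of the first portion of src2.
        donor = src2[0] if k == 0 else src1[k - 1]
        additional = donor >> (n_bits - m_bits)
        dest.append(extracted | additional)
    return dest


# Two 32-bit portions, shift distance of one 8-bit element.
src1 = [0x11223344, 0x55667788]
src2 = [0xAABBCCDD, 0x00000000]
dest = extract_and_merge(src1, src2, n_bits=32, m_bits=8)
# dest == [0x223344AA, 0x66778811]
```

In this model the byte 0x11 that leaves the top of the first portion of `src1` reappears at the bottom of the second destination portion, which is the "shifted into the next portion" behavior the text describes.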

[0027] Alternatively, in some configurations, each N-bit portion is divided into a plurality of elements, the shift distance corresponds to an integer number of elements, and, for the first portion of the specified registers, the one or more additional bits comprise a least significant subset of the elements of the first portion of the second source vector register, excluding the least significant element. For some use cases, it may be beneficial to apply vector extraction and merge instructions repeatedly to generate a sequence of vectors, each shifted by only some bits (or some elements) relative to the previous one. For example, when implementing a finite impulse response filter, it may be necessary to generate a sequence of vectors in which each vector is shifted by only a single element from the previous vector in the sequence. Starting from an initial vector, e.g., in the second source vector register, the vector extraction and merge instruction enables such a sequence of shifted vectors to be generated one element at a time. In such cases, rather than retaining both the first and second source vector registers, the previous destination register can be used as the second source vector register. In that situation, the bits required for the one or more additional bits have already been shifted by one or more bit positions away from the most significant element. Thus, by selecting as the one or more additional bits, for the first portion of the specified registers, a least significant subset of the elements excluding the least significant element, the vector extraction and merge instruction can be adapted to the case where the second source vector register holds the result of a previous vector extraction and merge instruction. In some configurations, the element width can be controlled by a width parameter of the vector extraction and merge instruction. In some configurations, the control parameter can indicate both which bits are to be extracted and the element width. In such configurations, the number of bits required to encode the parameter is reduced in situations where only a limited number of combinations of the element width and the number and position of the bits to be extracted are supported.

[0028] Rather than specifying a separate vector register for each of the first source vector register, the second source vector register, and the destination vector register, in some configurations, the destination vector register is the second source vector register. By reusing the second source vector register as the destination register, the register requirements and encoding space necessary for vector extraction and merge instructions are reduced.

[0029] As described, vector extraction and merge instructions can be flexibly implemented using hardware capable of executing one or more of a plurality of beats of processing in a given cycle. In some configurations, the processing circuit is configured to process at least two of the plurality of beats in parallel. The hardware provision of such a processing circuit may be sufficient only to process at least two beats, and the processing circuit may be configured to process the beats of adjacent instructions in parallel with processing at least two of the plurality of beats. Alternatively, the processing circuit may be sufficient to process all of the beats of the plurality of beats in parallel.

[0030] In some configurations, the processing circuit comprises hardware that is insufficient to execute all of the plurality of beats of a given vector instruction in parallel. Thus, the processing circuit may execute a second subset of the beats of a given vector instruction after completing the first subset. The first and second subsets may include a single beat or multiple beats, depending on the processor implementation.

[0031] In some configurations, the processing circuit is configured to process all of the plurality of beats of a given vector instruction in parallel. A processing circuit having such hardware can still generate and use the beat status information specified above, but the beat status information typically indicates that there were no completed beats. Thus, by defining the beat status information, the architecture can support a variety of different implementations.

[0032] In some configurations, the decoder circuit is responsive to a memory data transfer instruction that is adjacent, in program counter order, to the vector extraction and merge instruction and that specifies a memory address and a transfer register among the plurality of vector registers, to generate a data transfer control signal. The apparatus further includes a data control circuit that, in response to the data transfer control signal, executes a plurality of beats of a memory data transfer process, each beat including executing a data transfer for a corresponding portion of the transfer register, sets beat status information indicating which beats of the memory data transfer instruction have been completed, and suppresses the beats of the memory data transfer instruction that the beat status information indicates as completed. When the transfer register is one of the specified registers, a first subset of the plurality of beats of the memory data transfer process, corresponding to a first subset of the portions of the transfer register, is executed in parallel with a second subset of the plurality of beats of the processing performed in response to the vector extraction and merge instruction, corresponding to a second subset of the portions of the transfer register. The first subset of beats and the second subset of beats can include the same number of beats or different numbers of beats. For example, in some configurations, the apparatus may be provided with enough hardware to perform memory data transfer operations for a plurality of portions (corresponding to a plurality of beats of the memory data transfer process), but only enough hardware to perform a single beat of processing for the vector extraction and merge instruction. Alternatively, the apparatus may be provided with enough hardware to perform memory data transfer operations for half of the portions and enough hardware to perform beats of processing for the vector extraction and merge instruction for half of the vector length. In each of these situations, there is no overlap between the data and hardware used for the first subset of beats and those used for the second subset of beats. Therefore, by providing a processing apparatus capable of executing the first subset of beats and the second subset of beats in parallel, a greater instruction throughput can be achieved.

[0033] In some configurations, the control parameter is specified as an immediate value in the vector extraction and merge instructions. In some alternative configurations, the control parameter may be specified as a register in which the control parameter is defined.

[0034] In some configurations, the first part of the specified register is the least significant part of the specified register, and the last part of the specified register is the most significant part of the specified register. In an alternative configuration, the first part of the specified register is the most significant part of the specified register, and the last part of the specified register is the least significant part of the specified register. In this way, a circuit may be provided in the processing device to perform vector extraction and merge instructions by shifting one or more additional bits extracted from a second source vector register to a destination register from the least significant end or the most significant end according to a specific implementation choice.

[0035] The concepts described herein may be embodied in computer-readable code for the manufacture of an apparatus embodying the described concepts. For example, the computer-readable code may be used in one or more stages of a semiconductor design and manufacturing process, including an Electronic Design Automation (EDA) stage, to manufacture an integrated circuit comprising the apparatus embodying the concepts. The computer-readable code described above may additionally or alternatively enable the definition, modeling, simulation, verification, and / or testing of an apparatus embodying the concepts described herein.

[0036] For example, the computer-readable code for manufacturing an apparatus embodying the concepts described herein can be embodied in code that defines a Hardware Description Language (HDL) representation of the concepts. For example, the code can define a Register-Transfer-Level (RTL) abstraction of one or more logic circuits for defining the apparatus that embodies the concepts. The code can use Verilog, SystemVerilog, Chisel, or Very High-Speed Integrated Circuit Hardware Description Language (VHDL), as well as an intermediate representation such as FIRRTL, to define the HDL representation of one or more logic circuits that embody the apparatus. The computer-readable code can provide a definition for embodying the concepts using system-level modeling languages such as SystemC and SystemVerilog or other behavioral representations of the concepts that can be interpreted by a computer to enable simulation, functional and / or formal verification, and testing of the concepts.

[0037] Additionally or alternatively, the computer-readable code can define a low-level description of integrated circuit elements that embody the concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. One or more netlists or other computer-readable representations of the integrated circuit elements can be generated by applying one or more logic synthesis processes to the RTL representation to generate a definition for use in manufacturing the apparatus that embodies the present invention. Alternatively or additionally, one or more logic synthesis processes can generate a bitstream from the computer-readable code for configuring a Field Programmable Gate Array (FPGA) to embody the described concepts. The FPGA can be deployed for purposes of verification and testing of the concepts prior to manufacturing in an integrated circuit, or the FPGA can be deployed directly in a product.

[0038] The computer-readable code can include a mixture of code representations for manufacturing an apparatus, including, for example, one or more mixtures of RTL representations, netlist representations, or other computer-readable definitions used in semiconductor design and manufacturing processes for manufacturing an apparatus embodying the present invention. Alternatively or additionally, the concept can be defined in combination with a computer-readable definition used in semiconductor design and manufacturing processes for manufacturing an apparatus and computer-readable code that defines instructions to be executed by the apparatus once manufactured.

[0039] Such computer-readable code can be disposed on any well-known transient computer-readable medium (such as wired or wireless transmission of code over a network), or a non-transient computer-readable medium such as a semiconductor, magnetic disk, or optical disk. An integrated circuit manufactured using the computer-readable code comprises components such as a central processing unit, a graphics processing unit, a neural processing unit, a digital signal processor, or one or more of other components that individually or collectively embody the concept.

[0040] Here, specific configurations of the present invention will be described with reference to the accompanying drawings.

[0041] Figure 1 schematically shows an example of a data processing apparatus 2 that supports the processing of vector instructions. This is a simplified diagram for ease of explanation, and it will be understood that in reality, the apparatus may have many elements not shown in Figure 1 for the sake of brevity. The apparatus 2 includes a processing circuit 4 for performing data processing in response to instructions decoded by an instruction decoder 6. Program instructions are fetched from a memory system 8, decoded by the instruction decoder, and generate control signals to control the processing circuit 4 to process the instructions in a manner defined by the architecture. For example, the decoder 6 may interpret the opcode of the decoded instruction and any additional control fields of the instruction to generate control signals that activate appropriate hardware units in the processing circuit 4 to execute operations such as arithmetic operations, load / store operations, or logical operations. The apparatus has a set of registers 10 for storing data values to be processed by the processing circuit 4 and control information for configuring the operation of the processing circuit. In response to an arithmetic or logical instruction, the processing circuit 4 reads operands from the registers 10 and writes the result of the instruction back to the registers 10. In response to a load / store instruction, data values are transferred between the registers 10 and the memory system 8 via the processing circuit. The memory system 8 may include one or more levels of data cache and main memory.

[0042] The registers 10 include a scalar register file 12 that includes a plurality of scalar registers for storing scalar values that each contain a single data element. Some of the instructions supported by the instruction decoder 6 and the processing circuit 4 are scalar instructions that process scalar operands read from the scalar registers 12 and generate scalar results that are written back to the scalar registers.

[0043] The registers 10 also include a vector register file 14 that includes several vector registers for storing vector values, each of which contains a plurality of data elements. In response to a vector instruction, instruction decoder 6 controls processing circuit 4 to execute several lanes of vector processing on the respective elements of a vector operand read from one of the vector registers 14, to generate either a scalar result to be written to the scalar registers 12 or a further vector result to be written to a vector register 14. Some vector instructions may generate a vector result from one or more scalar operands, or may perform additional scalar operations on scalar operands in the scalar register file as well as lanes of vector processing on vector operands read from vector register file 14. Thus, some instructions may be mixed scalar-vector instructions in which at least one of the one or more source registers and destination registers of the instruction is a vector register 14 and another of the one or more source registers and destination registers is a scalar register 12. Vector instructions may also include vector load / store instructions that transfer data values between a vector register 14 and a location in memory system 8. The load / store instructions may include contiguous load / store instructions in which the locations in memory correspond to a contiguous range of addresses, or scatter / gather type vector load / store instructions that specify several discrete addresses and control processing circuit 4 to load data from each of those addresses into respective elements of a vector register or to store data from respective elements of a vector register to the discrete addresses.

[0044] Processing circuit 4 can support the processing of vectors with a range of different data element sizes. For example, a 128-bit vector register 14 can be divided into sixteen 8-bit data elements, eight 16-bit data elements, four 32-bit data elements, or two 64-bit data elements. A control register within the registers 10 may specify the current data element size being used, or alternatively, the data element size may be a parameter of a given vector instruction being executed.
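
As a small illustration of the partitioning described above, the following sketch splits a 128-bit value into elements of a chosen size. The function name `split_elements` and the use of a Python integer to stand in for a vector register are assumptions made for illustration only.

```python
def split_elements(vector: int, elem_bits: int, vec_bits: int = 128):
    """Split a vec_bits-wide value into elements, least significant first."""
    if vec_bits % elem_bits:
        raise ValueError("element size must divide the vector width")
    mask = (1 << elem_bits) - 1
    return [(vector >> off) & mask for off in range(0, vec_bits, elem_bits)]


value = 0x0F0E0D0C0B0A09080706050403020100
# Number of elements obtained for each element size named in the text.
counts = {bits: len(split_elements(value, bits)) for bits in (8, 16, 32, 64)}
```

The same 128 bits yield 16, 8, 4, or 2 elements depending only on the element size passed in, which mirrors the element size being a control-register setting or an instruction parameter rather than a property of the register itself.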

[0045] The registers 10 also include several control registers for controlling the operation of processing circuit 4. For example, these can include a program counter register 16 for storing a program counter address indicating the address of the instruction corresponding to the current execution point being processed, a link register 18 for storing a return address to which processing is directed following the processing of a function call, a stack pointer register 20 for indicating a position within memory system 8 of a stack data structure, and a beat status register 22 for storing beat status information to be described in more detail below. These are only some of the types of control information that can be stored, and it will be understood that in practice a given instruction set architecture may store many other control parameters as defined by the architecture. For example, a control register can specify the full width of a vector register, or the current data element size being used for a given instance of vector processing.

[0046] The processing circuit 4 may include several distinct hardware blocks for processing different classes of instructions. For example, load / store instructions that interact with the memory system 8 may be processed by a dedicated load / store unit, while arithmetic or logical instructions can be processed by an arithmetic logic unit (ALU). The ALU itself may be further divided into a multiply-accumulate unit (MAC) for performing operations including multiplication, and additional units for processing other types of ALU operations. To process floating-point instructions, a floating-point unit may also be provided. Pure scalar instructions that do not include vector processing can be processed by a separate hardware block compared to vector instructions, or the same hardware block can be reused.

[0047] In some applications such as digital signal processing (DSP), there may be approximately equal numbers of ALU and load / store instructions, and thus some large blocks such as the MAC can be left idle for a significant amount of time. This inefficiency can be exacerbated on a vector architecture since the execution resources scale with the number of vector lanes to obtain higher performance. On smaller processors (e.g., single-issue, in-order cores), the area overhead of a fully scaled-out vector pipeline can become very large. One way to minimize the area impact while better utilizing the available execution resources is to overlap the execution of instructions, as shown in FIG. 2. In this example, three vector instructions include a load instruction VLDR, a multiply instruction VMUL, and a shift instruction VSHR, and these instructions can all be executed simultaneously even if there are data dependencies between them. This is because element 1 of VMUL depends only on element 1 of Q1 and not on the entire Q1 register, and thus the execution of VMUL can start before the execution of VLDR is complete. By overlapping the instructions, expensive blocks such as multipliers can be kept active for more time.

[0048] Thus, it may be desirable for a microarchitecture implementation to allow overlapping execution of vector instructions. However, if the architecture assumes a fixed amount of instruction overlap, a microarchitecture implementation can provide high efficiency if it actually matches the amount of instruction overlap assumed by the architecture, but it can cause problems when scaled to different microarchitectures that use different overlaps or no overlap at all.

[0049] Instead, the architecture can support a range of different overlaps, as shown in the example of FIG. 3. The execution of a vector instruction is divided into parts called "beats", where each beat corresponds to the processing of a portion of a vector of a given size. A beat is an atomic part of a vector instruction that is either fully executed or not executed at all, and cannot be partially executed. The size of the portion of the vector processed in one beat is defined by the architecture and can be any fraction of the vector. In the example of FIG. 3, a beat is defined as the processing corresponding to one quarter of the vector width, and thus there are four beats per vector instruction. Clearly, this is just an example, and other architectures can use a different number of beats, such as 2 or 8. The portion of the vector corresponding to one beat may be the same size as the data element size of the vector being processed, larger, or smaller. Thus, even if the element size varies per implementation or at run time between different instructions, a beat is a specific fixed width of vector processing. If the portion of the vector being processed in one beat contains multiple data elements, a carry signal can be disabled at the boundary between each element to ensure that each element is processed independently. If the portion of the vector processed in one beat corresponds to only a part of an element and the hardware is not sufficient to compute several beats in parallel, the carry output generated in one beat of processing can be input as a carry input to the next beat of processing so that the results of two beats together form a data element.
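
The two carry cases in this paragraph can be sketched with a behavioral model of a beat-wise vector addition. The sketch assumes a 128-bit vector processed in four 32-bit beats, one beat at a time; the function name `beatwise_add` and the integer representation of registers are illustrative assumptions, not part of the description.

```python
def beatwise_add(a: int, b: int, elem_bits: int, vec_bits: int = 128, beats: int = 4) -> int:
    """Add two vectors one beat at a time, handling element boundaries."""
    beat_bits = vec_bits // beats
    result = 0
    carry = 0
    for beat in range(beats):
        lo = beat * beat_bits
        if elem_bits <= beat_bits:
            # Several whole elements per beat: the carry is cut at every
            # element boundary so that elements stay independent.
            mask = (1 << elem_bits) - 1
            for off in range(lo, lo + beat_bits, elem_bits):
                total = ((a >> off) & mask) + ((b >> off) & mask)
                result |= (total & mask) << off
        else:
            # One element spans several beats: the carry out of this beat
            # feeds the next beat, unless a new element starts here.
            if lo % elem_bits == 0:
                carry = 0
            mask = (1 << beat_bits) - 1
            total = ((a >> lo) & mask) + ((b >> lo) & mask) + carry
            result |= (total & mask) << lo
            carry = total >> beat_bits
    return result


# 8-bit elements: 0xFF + 0x01 wraps inside element 0 and does not leak upward.
wrapped = beatwise_add(0xFF, 0x01, elem_bits=8)
# 64-bit elements: the carry out of beat 0 is consumed by beat 1.
chained = beatwise_add(0xFFFFFFFF, 0x01, elem_bits=64)
```

With 8-bit elements the overflow is discarded at the element boundary, while with 64-bit elements the two beats together form one element and the carry propagates from the first beat into the second.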

[0050] As shown in FIG. 3, different microarchitecture implementations of the processing circuit 4 can execute different numbers of beats in one "tick" of the abstract architecture clock. Here, a "tick" corresponds to a unit of progression of the architectural state (e.g., in a simple architecture, each tick can correspond to an instance of updating all architectural state associated with the execution of an instruction, including updating the program counter to point to the next instruction). It will be understood by those skilled in the art that known microarchitecture techniques such as pipelining may mean that a single tick requires multiple clock cycles to execute at the hardware level, and indeed that a single clock cycle at the hardware level may process multiple parts of multiple instructions. However, such microarchitecture techniques are not visible to software because a tick is atomic at the architectural level. For the sake of brevity, such microarchitecture is ignored in the further description of the present disclosure.

[0051] As shown in the lower example of FIG. 3, some implementations can schedule all four beats of a vector instruction within the same tick by providing sufficient hardware resources to process all beats in parallel within one tick. This may be suitable for higher-performance implementations. In this case, since the entire instruction can be completed in one tick, there is no need for overlap between instructions at the architectural level.

[0052] On the other hand, a more area-efficient implementation can provide narrower processing units that can only process two beats per tick. As shown in the middle example of FIG. 3, instruction execution can then be overlapped, with the first and second beats of a second vector instruction executed in parallel with the third and fourth beats of the first instruction. These instructions are executed on different execution units within the processing circuit (e.g., in FIG. 3, the first instruction is a load instruction executed using a load / store unit, and the second instruction is a multiply-accumulate instruction executed using a MAC unit).

[0053] Even more energy / area-efficient implementations can provide hardware units that are narrower and can process only a single beat at a time, in which case one beat can be processed per tick and instruction execution can overlap and be shifted by only one beat, as shown in the upper example of Figure 3 (which is the same as the example shown in Figure 2 above).
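
The three implementation points described for Figure 3 can be reproduced with a simple scheduling sketch. It assumes four beats per instruction, that one new instruction starts on each tick, and that overlap is never blocked by a resource conflict; the function name `beat_schedule` and the instruction labels A, B, C are illustrative.

```python
def beat_schedule(instructions, beats_per_tick, beats_per_instruction=4):
    """Return {tick: [beat labels]}, assuming one new instruction starts per tick."""
    schedule = {}
    for index, name in enumerate(instructions):
        for beat in range(beats_per_instruction):
            # Instruction `index` starts at tick index + 1 and then advances
            # by beats_per_tick beats on each subsequent tick.
            tick = 1 + index + beat // beats_per_tick
            schedule.setdefault(tick, []).append(f"{name}{beat + 1}")
    return schedule


one_beat = beat_schedule("ABC", beats_per_tick=1)
two_beats = beat_schedule("ABC", beats_per_tick=2)
four_beats = beat_schedule("ABC", beats_per_tick=4)
# one_beat[3]   == ["A3", "B2", "C1"]              (staggered by one beat)
# two_beats[2]  == ["A3", "A4", "B1", "B2"]        (half-instruction overlap)
# four_beats[1] == ["A1", "A2", "A3", "A4"]        (no overlap needed)
```

The same instruction stream produces all three overlap patterns; only the number of beats the hardware can process per tick changes.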

[0054] The overlaps shown in Figure 3 are only some examples, and it will be understood that other implementations are possible. For example, some implementations of the processing circuit 4 can support dual issue of multiple instructions in parallel in the same tick, thereby improving instruction throughput. In this case, two or more vector instructions that start together in one cycle can have some beats that overlap with two or more vector instructions that start in the next cycle.

[0055] Not only can the amount of overlap be varied for each implementation to scale to different performance points, but the amount of overlap between vector instructions can also vary at runtime between different instances of the execution of vector instructions within a program. Thus, the processing circuit 4 can be provided with a beat control circuit 30, as shown in Figure 1, to control the timing at which a given instruction is executed relative to a previous instruction. This gives the microarchitecture the freedom to choose not to overlap instructions in certain corner cases that are more difficult to implement, or depending on the resources available to the instructions. For example, if there are consecutive instructions of a given type (e.g., multiply-accumulate) that require the same resources and all available MAC or ALU resources are already being used by another instruction, there may not be enough free resources to start the execution of the next instruction, and thus, instead of overlapping, the issuance of the second instruction can wait until the first instruction is completed.

[0056] Allowing different amounts of overlapped execution of vector instructions can enable more efficient use of hardware resources across a range of performance points, but it can introduce some complexity in handling exceptions, debug events, or other events that trigger a suspension of the current execution thread. For example, in the example shown in FIG. 2, if an exception occurs on the fourth tick, the register file contains partial updates from some instructions. One way to handle this is to treat the partial updates as speculative state that can be reverted if an exception occurs, but this can increase the amount of hardware required, as it may be necessary to buffer store requests for storing data in the memory system 8 until they are committed and to provide additional registers in the hardware to track the speculative state. Another approach is to disable exceptions from being taken in the middle of vector instructions altogether and to delay taking an exception until the oldest uncompleted instruction completes, but an increase in exception handling latency may not be desirable, and if the exception is a precise fault, such behavior can violate architectural guarantees associated with the exception.

[0057] Instead, as shown in FIG. 4, the beat status register 22 can be used to record beat status values that track which beats of groups of adjacent instructions have completed at the time of an exception, debug event, or other event that leads to an interruption of the current thread. By exposing the overlapping nature of execution to the architecture, this can help reduce the complexity of the microarchitecture and increase power and area efficiency.

[0058] In the example of FIG. 4, the beat status information tracks the completed beats of a group of three vector instructions A, B, C, where instruction A corresponds to the oldest uncompleted vector instruction, instruction B is the next vector instruction after instruction A, and instruction C is the next vector instruction after instruction B. The notation Ax refers to the x-th beat of instruction A, where x is between 1 and 4 in the case of a 4-beat vector implementation. For example, A2 is the second beat of instruction A. Figure 4 shows an example where three instructions are tracked using beat status information. In other examples that allow more instructions to be partially completed at a given point, the beat status information can track more instructions. For example, when dual issue is supported, it may be desirable to show the beat progress for more than three instructions. Each value of the beat status field is assigned to a given combination of completed beats. For example, the beat status value 0011 indicates that the first and second beats of instruction A and the first beat of instruction B have been completed. The specific mapping of each encoded value of the beat status information to a specific set of beats for each instruction group is arbitrary and can be changed. The beat status value 0000 in this example indicates that there are no incomplete instructions and thus no completed beats of incomplete instructions. This can occur, for example, when the processor executes scalar instructions.
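
The mapping from beat status values to completed beats can be sketched as a lookup table. Only the encodings that this description states explicitly are listed; the remaining rows of FIG. 4 are not reproduced here and are deliberately left out rather than guessed. The names `BEAT_STATUS` and `completed_beats` are hypothetical.

```python
# Encodings stated explicitly in the description; other FIG. 4 rows are
# intentionally omitted (an incomplete table, not the full encoding).
BEAT_STATUS = {
    0b0000: (),
    0b0010: ("A1", "A2"),
    0b0011: ("A1", "A2", "B1"),
    0b0110: ("A1", "A2", "A3", "B1", "B2"),
    0b0111: ("A1", "A2", "A3", "B1", "B2", "C1"),
}


def completed_beats(status: int):
    """Return the completed beats for a status value listed in the table."""
    if status not in BEAT_STATUS:
        raise KeyError(f"encoding {status:04b} is not listed in this sketch")
    return BEAT_STATUS[status]


example = completed_beats(0b0011)  # ("A1", "A2", "B1")
```

Because the description notes that the mapping is arbitrary and can be changed, a table of this kind is a natural representation: only the table contents, not the lookup logic, would differ between encodings.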

[0059] Figure 5 shows some examples of beat status information recorded at the point when there is a suspension of the current execution thread. In the upper example of Figure 5, the vector instruction is executed at 1 beat per tick, and a debug event or exception occurs at the 4th tick. Thus, at this point, the first 3 beats of instruction A, the first 2 beats of instruction B, and the first beat of instruction C have already been completed, but beats A4, B3, C2, D1 have not yet been executed. Therefore, the beat status information will have the value 0111, which, according to the example of Figure 4, indicates that beats A1, A2, A3, B1, B2, and C1 have already been completed.

[0060] Similarly, in the lower example of FIG. 5, the instructions being executed were such that instructions B and C could not be overlapped (for example, because they required the use of the same hardware unit), and thus, in this case, instructions C and D had not yet started at the time of the debug event or exception. This time, the exception occurring at tick 4 triggers the recording of beat status information 0110, indicating that beats A1, A2, A3, B1, and B2 have already completed, but C1 has not.

[0061] Similarly, in the example of 2 beats per tick in FIG. 3, if an exception occurs at tick 2, only beats A1 and A2 are completed, and the beat status value is 0010. The values 0001 and 0010 of the beat status information indicate that only one instruction, A, is partially completed at the time of the exception. Note, however, that the beat status information still indicates which beats of a group of multiple instructions are completed, since it identifies that none of the beats of the next two instructions B and C have been completed.

[0062] In the example of 4 beats per tick in FIG. 3, since each instruction is completed within one tick, there are no partially completed instructions at the time of the exception, and thus the beat status value is 0000 regardless of when the exception occurs.

[0063] When a debug event or exception occurs, the return address is set to the current value of program counter 16 representing the address of the oldest uncompleted instruction. Thus, in both examples of FIG. 5, the return address will be set to the address of instruction A. The return address can be stored in various locations, including the position on the stack relative to the value of the stack pointer register, or within the return address register.

[0064] As shown in FIG. 6, thereby, in response to a return request from an event (e.g., upon return from a debug mode or an exception handler), the processor can resume processing from a point determined based on the return address and the beat status information in the beat status register 22. The return request from an event can be made by a debugger in the case of a debug event, or can be made by an exception handler in the case of an exception event. Following the return request from an event, the fetch of the instruction to be processed is resumed from the address indicated by the return address, which in this case corresponds to instruction A. Instructions B, C, and D follow (this example corresponds to the example at the top of FIG. 5). However, in the first few cycles after the return, the beats already indicated as completed by the beat status information are suppressed. The processor can suppress these beats by ensuring that the corresponding processing operations are not performed at all (e.g., suppressing requests to load or store data, or disabling the ALU or MAC). Alternatively, the operation can still be executed in the case of an ALU operation, but the processor can suppress writing the result of the operation so as not to affect the register state (i.e., suppressing updates to a part of the destination vector register). When reaching the fourth tick, the pipeline reaches the point where a debug event or an exception occurred previously, and the processing continues as normal. Therefore, during the first few cycles after an exception return, the processor cannot perform any useful work and essentially only refetches a plurality of instructions that were being executed when the original exception or debug event occurred. 
However, since exception return latency is often not critical for some applications, this can be a good trade-off for reducing the latency when taking an exception. It also helps reduce the amount of architectural state that needs to be stored upon an exception, since there is no need to speculatively store the results of incomplete instructions. This approach also enables handling of exceptions that are precise faults caused by a beat of a vector instruction.
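The resume behaviour described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function name `replay_after_return`, the dictionary of completed beat counts, and the staggered one-beat-per-tick schedule (instruction B starting one tick after A, and so on) are illustrative choices modelled on the upper example of Figure 5, not part of the described apparatus.

```python
def replay_after_return(instrs, completed, beats_per_instr=4, ticks=4):
    """Refetch from the oldest incomplete instruction and suppress finished beats.

    instrs: instruction names in program order, starting at the return address.
    completed: number of beats already completed per instruction, as recorded
               in the beat status information (hypothetical representation).
    Returns, for each tick, the list of beats that actually execute.
    """
    log = []
    for tick in range(1, ticks + 1):
        executed = []
        for index, name in enumerate(instrs):
            beat = tick - index  # staggered issue: one beat per tick per instruction
            if not 1 <= beat <= beats_per_instr:
                continue  # instruction not yet started or already finished
            if beat <= completed.get(name, 0):
                continue  # beat already completed before the event: suppressed
            executed.append(f"{name}{beat}")
        log.append(executed)
    return log


# Upper example of Figure 5: A1-A3, B1-B2 and C1 completed before the event.
log = replay_after_return(["A", "B", "C", "D"], {"A": 3, "B": 2, "C": 1})
print(log)  # [[], [], [], ['A4', 'B3', 'C2', 'D1']]
```

In this model the first three ticks after the return perform no useful work, and the fourth tick picks up exactly the beats that were outstanding when the event occurred.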

[0065] In some cases, in response to the occurrence of a debug event or an exception, it is possible to set beat status information indicating the completed beats of a group of instructions. However, in some implementations, it may be easier to update the beat status register every time an instruction completes, regardless of whether an exception has occurred. Thus, if an exception occurs in a subsequent tick, the beat status register 22 already indicates the beats of the instruction group that have already completed.

[0066] Figure 4 shows an example of the encoding of beat status information. Another possibility is to provide the beat status information as a bitmap that includes several bits, each corresponding to one beat of one of the groups of instructions A, B, C, etc. Each bit is set to 1 if the corresponding beat has completed and set to 0 if the corresponding beat has not completed (or vice versa). However, in practice, a beat after a given instruction cannot complete if the previous beat has not yet completed, so it is not necessary to provide a bit for each beat. As in the example of Figure 4, it may be more efficient to assign a specific encoding of a smaller bit field to a specific combination of completed beats.
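To make the comparison between the two representations concrete, the sketch below contrasts a per-beat bitmap with the compact encoding. It lists only the encoded values that the description states explicitly (0000, 0010, 0011, 0110, 0111); the remaining entries of Figure 4 are not reproduced here. The names `KNOWN_ENCODINGS` and `to_bitmap`, and the tuple-of-counts representation, are hypothetical.

```python
# Beat status values spelled out in the text, mapped to the number of
# completed beats of instructions (A, B, C). Other values are omitted
# because the text does not state them.
KNOWN_ENCODINGS = {
    "0000": (0, 0, 0),  # no partially completed instructions
    "0010": (2, 0, 0),  # A1, A2
    "0011": (2, 1, 0),  # A1, A2, B1
    "0110": (3, 2, 0),  # A1, A2, A3, B1, B2
    "0111": (3, 2, 1),  # A1, A2, A3, B1, B2, C1
}


def to_bitmap(counts, beats_per_instr=4):
    """One bit per beat (A1..A4, B1..B4, C1..C4), 1 meaning completed.

    Because beats of one instruction complete in order, the set bits of
    each instruction always form a prefix, which is why a bitmap carries
    redundant information.
    """
    return "".join("1" * c + "0" * (beats_per_instr - c) for c in counts)


bitmap = to_bitmap(KNOWN_ENCODINGS["0111"])
print(bitmap, len(bitmap))  # 111011001000 12
```

The bitmap needs 12 bits for three 4-beat instructions, whereas the encoded field in the example of Figure 4 uses 4 bits.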

[0067] Figure 7 schematically shows the details of the apparatus 30 arranged according to various configurations of the present technique. In particular, the apparatus 30 is provided with a decoder circuit 38, a processing circuit 40, and a set of registers 32. The registers 32 include one or more scalar registers 34 and one or more vector registers 36. The decoder circuit is configured to receive instructions (e.g., based on program code generated by a programmer or compiler) and interpret the instructions based on an instruction set architecture. In particular, the decoder circuit is configured to interpret a vector extraction and merge instruction that specifies a first source vector register 44, a second source vector register 46, a destination vector register 54, and a control parameter 43. When the decoder circuit receives a vector extraction and merge instruction, it generates a control signal that causes the processing circuit 40 to perform a vector extraction and merge process. In response to the control signal, the processing circuit 40 performs the vector extraction and merge process by executing one or more beats 48 of a plurality of beats of processing. Each beat of processing corresponds to at least a portion of each of the first source vector register 44 and the destination vector register 54. The processing circuit 40 is configured to execute one or more beats of processing corresponding to one or more portions 48 of the first source vector register 44 and one or more portions 49 of the second source vector register 46 to generate one or more portions 50 stored in the destination vector register 54. For the Kth beat of the plurality of beats of processing, the processing circuit 40 is configured to extract one or more bits from the Kth portion 48 of the first source vector register 44 and concatenate those bits with one or more additional bits.
If the Kth beat is the first beat of the plurality of beats, the one or more additional bits are extracted from the first portion 49 (the Kth portion when K = 1) of the second source vector register 46. If the Kth beat is a beat other than the first beat (K > 1), the one or more additional bits are carry bits 52 carried from the (K - 1)th beat, which corresponds to the (K - 1)th portion of the first source vector register 44. When the Kth beat is not the last beat of the plurality of beats, the processing circuit 40 is configured to output one or more bits as carry data for use in the (K + 1)th beat.

[0068] FIG. 8 schematically shows details of a processing device 60 arranged according to some configurations of the present technique. In particular, the processing device 60 is provided with a register 62, a decoder circuit 68, a processing circuit 70, and a data control circuit 72. The register 62 includes a plurality of scalar registers 64 and a plurality of vector registers 66. The decoder circuit 68 is configured to generate control signals in response to instructions that form part of an instruction set architecture. The control signals are passed (routed) to the processing circuit 70 and the data control circuit 72. The processing circuit 70 is configured to execute a plurality of processing beats in response to vector extraction and merge instructions. Details of the processing circuit are the same as the details of the processing circuit 40 referred to in FIG. 7. The data control circuit 72 executes a plurality of beats of memory transfer processing for data transfer instructions, under the control of the data control signals generated by the decoder circuit 68. For a given tick, the device 60 is configured to execute a plurality of beats including a first subset of the plurality of beats of memory transfer processing executed by the data control circuit 72 and a second subset of the plurality of processing beats executed by the processing circuit 70 in response to vector extraction and merge instructions. The device 60 is configured to execute the first subset and the second subset while referring to non-overlapping portions of the same vector register 66.

[0069] FIG. 9 is a diagram schematically showing details of a vector extraction and merge instruction according to some configurations of the present technique. The vector extraction and merge instruction specifies a first source vector register, a second source vector register, a destination vector register, and a control parameter M. In the illustrated example, the processing circuit performs two beats of processing, each corresponding to an N-bit portion of each of the first source vector register, the second source vector register, and the destination vector register. The first source vector register includes a first N-bit portion 82. The first N-bit portion 82 includes a most significant M bits 84 and a least significant N-M bits 86. For the first beat of processing, which corresponds to the first portion of the first source vector register 82, the first portion of the second source vector register 88, and the first portion of the destination vector register 102, the processing circuit extracts the N-M bits 86 of the first portion of the first source vector register 82 and is configured to concatenate the extracted N-M bits with the M bits (one or more additional bits) 90 extracted from the first portion of the second source vector register 88. Specifically, the N-M bits 86 extracted from the first portion of the first source vector register 82 are stored as the most significant N-M bits 98 of the first portion of the destination vector register 102. The M bits 90 extracted from the first portion of the second source vector register 88 are stored as the least significant M bits 100 of the first portion of the destination vector register 102. The processing circuit is further configured to carry the most significant M bits 84 of the first portion of the first source vector register 82 as carry bits 96.
The carry bits may be carried directly between beats of processing executed in parallel, or may be output to a scalar register configured to carry bits between beats of processing that are not executed in parallel. In the second beat of processing, the M bits 96 carried from the first portion of the first source vector register 82 are stored as the least significant M bits 94 of the second portion of the destination vector register 104. During the second beat, the processing circuit also extracts the least significant N-M bits 95 of a second N-bit portion of the first source vector register 80 and stores them as the most significant N-M bits 92 of the second portion of the destination vector register 104. In this way, the processing circuit supports vector extraction and merge instructions over multiple beats. In this example, the control parameter indicates the number M of bits 84 (one or more additional bits) to be carried between portions. In other examples, the control parameter can indicate the number of bits 86 to be extracted from the first portion of the first source vector register and stored in the first portion of the destination vector register.
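As a minimal sketch of the behaviour described for FIG. 9, the following Python models each N-bit portion as an integer. The function name `extract_and_merge` and its parameters are hypothetical. The sketch also assumes that the M additional bits used in the first beat are the most significant M bits of the first portion of the second source vector register, as in the use case of Figure 10; the description of FIG. 9 itself does not fix that position.

```python
def extract_and_merge(src1_parts, src2_first_part, n, m):
    """Model a vector extraction and merge over N-bit portions.

    src1_parts: N-bit integers, least significant portion first.
    src2_first_part: first N-bit portion of the second source register.
    n: portion width in bits; m: control parameter (bits carried onward).
    """
    low_mask = (1 << (n - m)) - 1
    # Assumption: the first beat takes the top M bits of the second source.
    additional = src2_first_part >> (n - m)
    dest_parts = []
    for part in src1_parts:
        kept = part & low_mask  # least significant N-M bits of this portion
        # Kept bits become the most significant N-M bits of the destination
        # portion; the additional bits fill the least significant M bits.
        dest_parts.append((kept << m) | additional)
        # The most significant M bits are carried to the next beat.
        additional = part >> (n - m)
    return dest_parts


result = extract_and_merge([0x33332222, 0x55554444], 0x11110000, n=32, m=16)
print([hex(p) for p in result])  # ['0x22221111', '0x44443333']
```

Whether the carried bits travel through a scalar register or directly between beats executed in parallel does not change the result in this model; only the final register contents are represented.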

[0070] Figures 10-12 schematically show the bits extracted from a first portion of a second source vector register according to various configurations of the present technique. A particular use case of the vector extraction and merge instructions is to generate vectors that are not aligned to 32-bit boundaries. In particular, some devices are configured to load data aligned to 32-bit boundaries. Thus, it is relatively easy to generate a vector of data values that are shifted by only 32 bits. However, generating data that is not aligned to the 32-bit boundary may not be possible using only load instructions, or may incur a performance penalty such that it may be preferable to use aligned loads. One technique for generating data that is not aligned to the 32-bit boundary requires that a shift be performed.

[0071] Figure 10 schematically shows a case where the data stored in the specified register is 16-bit data. The illustrated first source vector register is divided into four parts, each containing 4 bytes (32 bits). The data stored in the first and second source vector registers corresponds to different parts of the same data set. The data stored in the second source vector register is offset by 32 bits from the data loaded into the first source vector register. In the case of 16-bit data, according to the aforementioned use case, it is desirable to generate a vector shifted by only 16 bits. In such a situation, the one or more additional bits extracted from the second source vector register are bytes 2 and 3 (bits 16 to 31) of the first part of the second source vector register. Combining these bits, extracted from the illustrated part of the second source vector register, with the shifted data stored in the first source vector register generates data within the destination vector register that is not aligned to the 32-bit boundary.

[0072] Figure 11 schematically shows the part of the second source vector register that needs to be extracted to perform such a shift for 8-bit data. In particular, bytes 1, 2, and 3 of the second source vector register are extracted as one or more additional bits in the first beat of the process to generate a set of data that is 24 bits out of alignment with the 32-bit boundary. Bytes 2 and 3 of the second source vector register are extracted as one or more additional bits in the first beat of the process to generate a set of data that is 16 bits out of alignment with the 32-bit boundary. Byte 3 of the second vector register is extracted as one or more additional bits in the first beat of the process to generate a set of data that is 8 bits out of alignment with the 32-bit boundary. In this way, it is possible to generate a sequence of vectors having data elements that are not aligned to the 32-bit boundary.
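The byte selections listed for Figure 11 follow a simple pattern, which can be sketched as below. The helper name `bytes_from_second_source` is hypothetical, and the sketch assumes 32-bit parts and shifts that are whole multiples of 8 bits, as in the 8-bit data example.

```python
def bytes_from_second_source(shift_bits, part_bits=32):
    """Byte indices taken from the first part of the second source register.

    Assumes the additional bits are the most significant `shift_bits` bits
    of the part, with byte 0 as the least significant byte.
    """
    total_bytes = part_bits // 8
    taken = shift_bits // 8
    return list(range(total_bytes - taken, total_bytes))


selections = {shift: bytes_from_second_source(shift) for shift in (24, 16, 8)}
print(selections)  # {24: [1, 2, 3], 16: [2, 3], 8: [3]}
```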

[0073] Figure 12 schematically shows the portion of the second source vector register that needs to be extracted to perform such a shift on 8-bit data when the destination data vector is the second source data vector. In the illustrated example, a sequence of three vector extraction and merge instructions is applied. Each of the vector extraction and merge instructions specifies, as a control parameter, a different number of bits to shift the first source vector register. As in the example of Figure 10, the data stored in the second source vector register is offset by 32 bits from the data loaded into the first source vector register. In the illustrated example, one or more additional bits extracted from the second source vector register comprise the least significant set of bits of the bytes of the first portion of the second source vector register, excluding the least significant byte. In the first vector extraction and merge instruction, a 24-bit (3-byte) shift is defined as the control parameter. As a result, the bytes extracted from the second source vector register are bytes 3, 2, and 1. These are concatenated by the processing circuit during a plurality of processing beats to produce, as the content of the destination vector register for the first vector extraction and merge instruction, a vector of values that are offset by 24 bits from alignment at the 32-bit boundary. In the second vector extraction and merge instruction, a 16-bit (2-byte) shift is defined as the control parameter, and the destination vector register of the first vector extraction and merge instruction is used as the second source vector register. As a result, the bytes extracted from the second source vector register for the second instruction are bytes 3 and 2. 
These are concatenated by the processing circuit during a plurality of processing beats to produce, as the content of the destination vector register for the second vector extraction and merge instruction, a vector of values that are offset by 16 bits from alignment at the 32-bit boundary. In the third vector extraction and merge instruction, an 8-bit (1-byte) shift is defined as the control parameter, and the destination vector register of the second vector extraction and merge instruction is used as the second source vector register. As a result, the byte extracted from the second source vector register is byte 3. This byte is concatenated by the processing circuit during a plurality of processing beats to produce, as the content of the destination vector register for the third vector extraction and merge instruction, a vector of values that are offset by 8 bits from alignment at the 32-bit boundary.

[0074] Figures 13 through 17 schematically illustrate a sequence of operations performed by a processing circuit in response to vector extraction and merge instructions. For purposes of explanation, the elements of the vector register are selected for a use case in which the vector extraction and merge instructions produce a vector that is not aligned to a 32-bit boundary. Examples of this use case are selected for purely illustrative purposes, and it will be readily apparent to one of ordinary skill in the art that the techniques described herein do not require any relationship between the content of the first source vector register and the content of the second source vector register. In particular, for the general vector extraction and merge instructions described herein, it will be apparent that the vector stored in the first source vector register can be any first vector that is either loaded from memory or, for example, generated as a result of one or more other operations. Similarly, the second source vector stored in the second source vector register can be any second vector, and in some use cases, the programmer may elect to select the first and second vectors such that there is some overlap between the elements present in the first source vector register and the second source vector register. In other use cases, the programmer may choose to select the first and second source vectors such that there is no overlap between the elements present in the first source vector register and the second source vector register.

[0075] FIG. 13 schematically shows a sequence of operations executed by a processing circuit in response to a vector extraction and merge instruction that specifies a first source vector register 110, a second source vector register 112, a destination register 114, a scalar register, and control information. Each of the first source vector register 110, the second source vector register 112, and the destination register 114 is arranged as a plurality of parts to be processed in processing a plurality of beats. In the illustrated example, the processing circuit is configured to execute processing of a single beat for a given tick. Each part includes two elements, and the control information specifies that a shift corresponding to a single element is to be executed. For purposes of illustration only, the first source vector register and the second source vector register are shown as 128-bit vector registers and include a set of numbered data items. In particular, the first source vector register includes data items 9 down to 2, and the second source vector register includes data items 7 down to 0. Thus, the first source vector register and the second source vector register include 16-bit data items loaded from addresses in memory aligned to 32-bit boundaries. In the first beat of processing, the processing circuit extracts the least significant element (data item 2) of the first part of the first source vector register 110(D). The extracted least significant element of the first part of the first source vector register 110(D) is concatenated with the most significant element (data item 1) of the first part of the second source vector register 112(D), and the result of the concatenation is stored as the first part of the destination vector register 114(D). 
During the first beat of processing, the most significant element (data item 3) of the first part of the first source vector register 110(D) is extracted as carry data 116 and stored in the scalar register as the most significant element. During the second beat of processing, the processing circuit extracts the least significant element (data item 4) of the second part of the first source vector register 110(C). The extracted least significant element of the second part of the first source vector register 110(C) is concatenated with the carry data 116 stored in the most significant element (data item 3) of the scalar register, and the result of the concatenation is stored in the second part of the destination vector register 114(C). During the second beat of the process, the processing circuit also extracts the most significant element (data item 5) of the second part of the first source vector register 110(C) as carry data 118 to be stored in the scalar register as the most significant element. During the third beat of the process, the processing circuit extracts the least significant element (data item 6) of the third part of the first source vector register 110(B). The extracted least significant element of the third part of the first source vector register 110(B) is concatenated with the carry data 118 stored in the most significant element (data item 5) of the scalar register, and the result of the concatenation is stored in the third part of the destination vector register 114(B). During the third beat of the process, the processing circuit also extracts the most significant element (data item 7) of the third part of the first source vector register 110(B) as carry data 120 to be stored in the scalar register as the most significant element. During the fourth beat of the process, the processing circuit extracts the least significant element (data item 8) of the fourth part of the first source vector register 110(A).
The extracted least significant element of the fourth part of the first source vector register 110(A) is concatenated with the carry data 120 stored in the most significant element (data item 7) of the scalar register, and the result of the concatenation is stored in the fourth (last) part of the destination vector register 114(A). In some alternative configurations, during the fourth beat of the process, the processing circuit also extracts the most significant element of the fourth part of the first source vector register 110(A) as carry data to be stored in the scalar register as the most significant element. This carry data remains stored in the scalar register following the execution of the vector extraction and merge instructions. The value within the unused element (the least significant element shown in FIG. 13) of the scalar register is arbitrary. In some examples, this element can be set to a dummy value such as 0. In other examples, this element can be set to the value of an adjacent element from the current part of the first source vector register.
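The four beats of FIG. 13 can be traced with a short element-level model. This is a sketch only: registers are represented as Python lists of the numbered data items with the least significant element first, the function name `extract_and_merge_elements` is hypothetical, and the scalar register is reduced to the carried element.

```python
def extract_and_merge_elements(src1, src2, elems_per_part, shift):
    """Element-level model; lists hold the least significant element first."""
    keep = elems_per_part - shift  # elements kept from each part of src1
    dest = []
    scalar_carry = None  # stands in for the carry held in the scalar register
    for k in range(len(src1) // elems_per_part):
        part = src1[k * elems_per_part:(k + 1) * elems_per_part]
        if k == 0:
            # First beat: most significant element(s) of src2's first part.
            additional = src2[keep:elems_per_part]
        else:
            # Later beats: carry data saved by the previous beat.
            additional = scalar_carry
        # Additional elements occupy the low positions of the destination
        # part; the kept elements of src1 sit above them.
        dest.extend(additional + part[:keep])
        scalar_carry = part[keep:]  # most significant element(s) carried on
    return dest


src1 = [2, 3, 4, 5, 6, 7, 8, 9]  # data items 9 down to 2
src2 = [0, 1, 2, 3, 4, 5, 6, 7]  # data items 7 down to 0
dest = extract_and_merge_elements(src1, src2, elems_per_part=2, shift=1)
print(dest)  # [1, 2, 3, 4, 5, 6, 7, 8]
```

The destination holds data items 8 down to 1, that is, a vector of 16-bit items that starts 16 bits away from a 32-bit boundary.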

[0076] FIG. 14 schematically shows a sequence of operations executed by a processing circuit in response to a vector extraction and merge instruction that specifies a first source vector register 140, a second source vector register 142, a destination register 144, a scalar register, and control information. As in FIG. 13, each of the first source vector register and the second source vector register includes data items extracted from a region of memory aligned to a 32-bit boundary. In contrast to FIG. 13, each of the data items stored in the elements of the first and second source vector registers is an 8-bit data item. Each of the first source vector register 140, the second source vector register 142, and the destination register 144 is arranged as a plurality of parts to be processed in a plurality of beats of processing. In the illustrated example, the processing circuit is configured to execute processing of a single beat for a given tick. Each part includes four elements, and the control information specifies that a shift corresponding to two of the elements is to be executed. In the processing of the first beat, the processing circuit extracts the two least significant elements (data items 5 and 4) of the first part of the first source vector register 140 (D). The two extracted least significant elements of the first part of the first source vector register 140 (D) are concatenated with the two most significant elements (data items 3 and 2) of the first part of the second source vector register 142 (D), and the result of the concatenation is stored as the first part of the destination vector register 144 (D). During the first beat of processing, the first part (data items 7 to 4) of the first source vector register 140 (D) is extracted as carry data 146 and stored in the scalar register. 
During the second beat of processing, the processing circuit extracts the two least significant elements (data items 9 and 8) of the second part of the first source vector register 140 (C). The two extracted least significant elements of the second part of the first source vector register 140 (C) are concatenated with the two most significant elements (data items 7 and 6) of the carry data 146 stored in the scalar register, and the result of the concatenation is stored in the second part of the destination vector register 144 (C). During the second beat of the process, the processing circuit also extracts the second part (data items 11 to 8) of the first source vector register 140(C) as carry data 148 to be stored in the scalar register. During the third beat of the process, the processing circuit extracts the two least significant elements (data items 13 and 12) of the third part of the first source vector register 140(B). The two extracted least significant elements of the third part of the first source vector register 140(B) are concatenated with the two most significant elements (data items 11 and 10) of the carry data 148 stored in the scalar register, and the result of the concatenation is stored in the third part of the destination vector register 144(B). During the third beat of the process, the processing circuit also extracts the third part (data items 15 to 12) of the first source vector register 140(B) as carry data 150 to be stored in the scalar register. During the fourth beat of the process, the processing circuit extracts the two least significant elements (data items 17 and 16) of the fourth part of the first source vector register 140(A).
The two extracted least significant elements of the fourth part of the first source vector register 140(A) are concatenated with the two most significant elements (data items 15 and 14) of the carry data 150 stored in the scalar register, and the result of the concatenation is stored in the fourth (last) part of the destination vector register 144(A). In some alternative configurations, during the fourth beat of the process, the processing circuit also extracts the fourth part of the first source vector register 140(A) as carry data to be stored in the scalar register. This carry data remains stored in the scalar register following the execution of the vector extraction and merge instructions.

[0077] Figure 15 schematically shows a sequence of operations executed by a processing circuit in response to vector extraction and merge instructions according to an alternative implementation form. Figure 15 differs from Figure 14 in that for each of the first beat of processing, the second beat of processing, and the third beat of processing, the data extracted from the corresponding part of the first source vector register 160 and stored as carry data in the scalar register are the two most significant elements of the corresponding part and are stored as the two least significant elements of the scalar register. In particular, the operations are different from those described in relation to Figure 14 as follows. In the first beat of processing, the processing circuit extracts the two most significant elements (data items 7 and 6) of the first part of the first source vector register 160(D) and stores them as carry data 166 in the two least significant elements of the scalar register. In the second beat of processing, one or more further bits of data are extracted from the two least significant elements of the scalar register, and the processing circuit extracts the two most significant elements (data items 11 and 10) of the second part of the first source vector register 160(C) and stores them as carry data 168 in the two least significant elements of the scalar register. In the third beat of processing, one or more further bits of data are extracted from the two least significant elements of the scalar register, and the processing circuit extracts the two most significant elements (data items 15 and 14) of the third part of the first source vector register 160(B) and stores them as carry data 170 in the two least significant elements of the scalar register. In the fourth beat of processing, one or more further bits of data are extracted from the two least significant elements of the scalar register. 
The position of the carry data within the scalar register is arbitrary, and Figures 14 and 15 show two possibilities, but it will be understood that other configurations are possible.
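The two scalar-register layouts of Figures 14 and 15 can be compared in a small model to show that they produce the same destination vector. The function name `merge_with_scalar` and the `layout` labels are hypothetical, registers are lists with the least significant element first, and the zero padding of unused scalar elements is an assumption (the description leaves those values arbitrary).

```python
def merge_with_scalar(src1, src2, layout, elems_per_part=4, shift=2):
    """Model the carry path through a scalar register for two layouts.

    layout "fig14": the whole part is saved; the carry is read from the
                    most significant elements of the scalar register.
    layout "fig15": only the most significant elements are saved, placed in
                    the least significant elements of the scalar register.
    """
    keep = elems_per_part - shift
    dest, scalar = [], None
    for k in range(len(src1) // elems_per_part):
        part = src1[k * elems_per_part:(k + 1) * elems_per_part]
        if k == 0:
            additional = src2[keep:elems_per_part]  # top elements of src2 part
        elif layout == "fig14":
            additional = scalar[keep:]   # top elements of the saved part
        else:
            additional = scalar[:shift]  # bottom elements of the scalar
        dest.extend(additional + part[:keep])
        if layout == "fig14":
            scalar = list(part)
        else:
            scalar = part[keep:] + [0] * keep  # padding value is arbitrary
    return dest


src1 = list(range(4, 20))  # 8-bit data items 19 down to 4
src2 = list(range(0, 16))  # 8-bit data items 15 down to 0
out14 = merge_with_scalar(src1, src2, "fig14")
out15 = merge_with_scalar(src1, src2, "fig15")
print(out14 == out15, out14)
```

Both layouts yield data items 17 down to 2 in the destination, which matches the beat-by-beat description of Figure 14.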

[0078] FIG. 16 schematically shows a sequence of operations executed by a processing circuit in response to a vector extraction and merge instruction that specifies a first source vector register 180, a second source vector register 182, a destination register 184, a scalar register, and control information. Each of the first source vector register 180, the second source vector register 182, and the destination register 184 is arranged as a plurality of parts to be processed in processing of a plurality of beats. FIG. 16 is different from FIGS. 15 and 14 in that the processing circuit includes hardware capable of executing two beats out of a plurality of beats of processing for a given tick. In other words, two of the beats are executed in parallel. Each part of the first source vector register 180 and the second source vector register 182 includes four 8-bit elements, and the control information specifies that shifts corresponding to two elements are to be executed. In response to the first tick, the processing circuit executes the first and second beats of processing corresponding to the two least significant parts of the first source vector register 180 (C), 180 (D). The processing circuit is configured to extract two most significant elements (data items 3 and 2) from the least significant part of the second source vector register 182 (D) as one or more additional bits. The one or more additional bits are concatenated with the two least significant elements (data items 5 and 4) of the least significant part of the first source vector register 180 (D). The result of the concatenation is stored in the least significant part of the destination vector register 184 (D). The two most significant elements (data items 7 and 6) of the least significant part of the first source vector register 180 (D) are carried to be used in the second beat. 
Since the second beat is executed in parallel with (at the same tick as) the first beat, the two most significant elements (data items 7 and 6) of the least significant part of the first source vector register 180 (D) are carried as one or more additional bits to be concatenated with the two least significant elements (data items 9 and 8) of the second part of the first source vector register 180 (C). The result of the concatenation is stored in the second part of the destination vector register 184(C). The processing circuit is also configured to store the two most significant elements (data items 11 and 10) from the second part of the first source vector register 180(C) in the two least significant elements of the scalar register 188 to be carried for processing during the next tick. The processing circuit is also configured to set status information indicating that processing has been completed for the first and second beats of the processing to be executed in response to the vector extraction and merge instructions.

[0079] During the second tick of the process, the processing circuit can determine from the status information that processing has been completed for the first and second beats. Accordingly, the processing circuit starts from the third beat, which corresponds to the third part 180(B) of the first source vector register. The processing circuit extracts the two least significant elements (data items 13 and 12) from the third part 180(B) of the first source vector register and concatenates these elements with one or more additional bits. Since the processing circuit can determine that the beats being processed do not include the first beat (the least significant part), the one or more additional bits are taken from the scalar register 188. In particular, the one or more additional bits are the two least significant elements (data items 11 and 10) of the scalar register 188, which are concatenated with the two least significant elements (data items 13 and 12) of the third part 180(B) of the first source vector register, and the result of the concatenation is stored in the third part 184(B) of the destination vector register. The processing circuit also extracts the two most significant elements (data items 15 and 14) of the third part 180(B) of the first source vector register to be carried to the fourth beat. Since the processing circuit can execute two beats in a given tick, beats 3 and 4 are executed in parallel and the carried data does not need to be stored in the scalar register 188. Rather, the two most significant elements (data items 15 and 14) of the third part 180(B) of the first source vector register are carried directly as the one or more additional bits to be used in the fourth beat.
During the fourth beat, the two least significant elements (data items 17 and 16) of the fourth (most significant) part 180(A) of the first source vector register are extracted and concatenated with the one or more additional bits carried from the third part 180(B) of the first source vector register. The result of the concatenation is stored in the fourth (most significant) part 184(A) of the destination vector register.
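The two-tick sequence of FIG. 16 can be traced with a small software model. The following is a minimal sketch rather than the circuit itself: the function name `run_tick`, the `state` dictionary standing in for the status information and the scalar register, and the choice to list parts and elements least significant first are all assumptions made for illustration.

```python
def run_tick(src1, src2, dst, state, shift, beats_per_tick=2):
    """Run one tick of up to `beats_per_tick` beats.

    Parts and elements are listed least significant first. `state` is a
    hypothetical dict holding "done" (number of completed beats, standing in
    for the status information) and "scalar" (the scalar register contents).
    """
    keep = len(src1[0]) - shift          # elements taken from the current part
    first = state["done"]                # first beat not yet completed
    last = min(first + beats_per_tick, len(src1))
    carry = None
    for k in range(first, last):
        if k == 0:
            further = src2[0][keep:]             # top elements of second source
        elif k == first:
            further = state["scalar"][:shift]    # carried across ticks
        else:
            further = carry                      # carried within the tick
        dst[k] = further + src1[k][:keep]        # further bits land at the bottom
        carry = src1[k][keep:]                   # top elements feed the next beat
    if last < len(src1):
        state["scalar"][:shift] = carry          # needed again in the next tick
    state["done"] = last


# Data items numbered as in FIG. 16; only part D of the second source is read.
src1 = [[4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]]
src2 = [[0, 1, 2, 3]]
dst = [None] * 4
state = {"done": 0, "scalar": [0, 0, 0, 0]}
run_tick(src1, src2, dst, state, shift=2)   # first tick: beats 1 and 2
run_tick(src1, src2, dst, state, shift=2)   # second tick: beats 3 and 4
```

In this model the carry between beats 1 and 2 and between beats 3 and 4 never touches the scalar register; only the carry that crosses the tick boundary (data items 11 and 10) is written to it.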

[0080] In some alternative configurations, during the fourth beat of the process, the processing circuit also extracts the fourth part 180(A) of the first source vector register as carry data to be stored in the scalar register 188. This carry data remains stored in the scalar register 188 after the vector extraction and merge instruction has been executed and can potentially be used as part of a further instruction.

[0081] FIG. 17 schematically shows an alternative configuration in which the sequence of operations is executed by a processing circuit in response to a vector extraction and merge instruction that specifies a first source vector register 240, a second source vector register 242, a destination vector register 244, a scalar register, and control information. FIG. 17 differs from FIGS. 14 to 16 in that the direction of the extraction and merge is reversed. In particular, the vector extraction and merge instruction is executed starting from the most significant part of the specified registers, rather than from the least significant part. In the illustrated example, the processing circuit is configured to execute a single beat of processing in a given tick. Each part includes four elements, and the control information specifies that a shift corresponding to one element is to be executed. In the first beat (which in this case corresponds to the most significant part of the specified registers), the processing circuit extracts the three most significant elements (data items 15 to 13) of the first (most significant) part 240(A) of the first source vector register. The three extracted elements are concatenated with the least significant element (data item 16) of the first (most significant) part 242(A) of the second source vector register, and the result of the concatenation is stored as the first part 244(A) of the destination vector register. During the first beat, the first part 240(A) (data items 15 to 12) of the first source vector register is extracted as carry data 246 and stored in the scalar register. During the second beat, the processing circuit extracts the three most significant elements (data items 11 to 9) of the second part 240(B) of the first source vector register.
The three extracted elements of the second part 240(B) are concatenated with the least significant element (data item 12) of the carry data 246 stored in the scalar register, and the result of the concatenation is stored in the second part 244(B) of the destination vector register. During the second beat, the processing circuit also extracts the second part 240(B) (data items 11 to 8) of the first source vector register as carry data 248 to be stored in the scalar register. During the third beat, the processing circuit extracts the three most significant elements (data items 7 to 5) of the third part 240(C) of the first source vector register. The three extracted elements of the third part 240(C) are concatenated with the least significant element (data item 8) of the carry data 248 stored in the scalar register, and the result of the concatenation is stored in the third part 244(C) of the destination vector register. During the third beat, the processing circuit also extracts the third part 240(C) (data items 7 to 4) of the first source vector register as carry data 250 to be stored in the scalar register. During the fourth beat, the processing circuit extracts the three most significant elements (data items 3 to 1) of the fourth (least significant) part 240(D) of the first source vector register. The three extracted elements of the fourth part 240(D) are concatenated with the least significant element (data item 4) of the carry data 250 stored in the scalar register, and the result of the concatenation is stored in the fourth (least significant) part 244(D) of the destination vector register.
In some alternative configurations, during the fourth beat, the processing circuit also extracts the fourth part 240(D) of the first source vector register as carry data to be stored in the scalar register. This carry data remains stored in the scalar register after the vector extraction and merge instruction has been executed. As in the preceding figures, the position of the carried data elements within the scalar register and the values of the unused elements within the scalar register are arbitrary. Various other configurations are possible, for example, storing the elements to be carried in the most significant elements of the scalar register and setting the unused elements to 0.
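The reversed ordering of FIG. 17 can be sketched in the same way. This is an illustrative model only: the function name `extract_merge_reversed` is hypothetical, parts and elements are again listed least significant first, and the sketch follows the alternative configuration in which the last part is also left in the scalar register.

```python
def extract_merge_reversed(src1, src2, shift):
    """One beat per tick, starting from the most significant part.

    Parts and elements are listed least significant first, so the first
    beat handles src1[-1]. Returns the destination parts and the final
    contents of the (modelled) scalar register.
    """
    parts = len(src1)
    dst = [None] * parts
    scalar = None
    for beat, p in enumerate(reversed(range(parts))):
        if beat == 0:
            further = src2[p][:shift]    # bottom elements of the second source
        else:
            further = scalar[:shift]     # bottom elements of the carried part
        dst[p] = src1[p][shift:] + further   # further bits land at the top
        scalar = list(src1[p])           # whole part is carried (246, 248, 250)
    return dst, scalar


# Data items numbered as in FIG. 17; only part A of the second source is read.
src1 = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
src2 = [None, None, None, [16, 17, 18, 19]]
dst, scalar = extract_merge_reversed(src1, src2, shift=1)
```

Read most significant first, the destination parts hold data items 16 to 13, 12 to 9, 8 to 5, and 4 to 1, matching the walkthrough above.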

[0082] FIG. 18 schematically shows a sequence of steps executed by a processing circuit in response to a vector extraction and merge instruction. The flow starts at step S170, where it is determined whether a vector extraction and merge instruction specifying a first source vector register, a second source vector register, a destination vector register, and control parameters has been received by the decoder circuit. If not, the flow remains at step S170. If it is determined in step S170 that the decoder circuit has received a vector extraction and merge instruction, the decoder circuit generates a control signal based on the vector extraction and merge instruction. The flow then proceeds to step S172, where, in response to the control signal, a value K is set based on status information. If the status information indicates that no beat of processing has been executed, K is set to indicate the first beat of the processing. On the other hand, if the status information indicates that one or more initial beats of the plurality of beats have been completed, K is set to indicate the first uncompleted beat of the plurality of beats. The flow then proceeds to step S174, where the processing circuit extracts the bits specified by the control parameters from the K-th portion of the first source vector register. Next, the flow proceeds to step S176, where it is determined whether the K-th portion is the first portion. If so, the flow proceeds to step S178, where the processing circuit extracts one or more further bits from the first portion of the second source vector register (as indicated by the control parameters), and the flow then proceeds to step S182. On the other hand, if it is determined in step S176 that the K-th portion is not the first portion, the flow proceeds to step S180, where the one or more further bits are obtained as bits carried from the (K-1)-th portion of the first source vector register.
The carry may be, for example, an internal carry within the processing circuit if the processing circuit is provided with sufficient hardware to execute more than one beat per tick. Alternatively, the carried data may be taken from a scalar register in which the one or more further bits were stored as part of a preceding beat of the vector extraction and merge instruction. The flow then proceeds to step S182, where the one or more extracted bits are concatenated with the one or more further bits. Next, the flow proceeds to step S184, where the result of the concatenation is stored in the K-th portion of the destination vector register. Next, the flow proceeds to step S186, where it is determined whether the K-th portion is the last portion of the first source vector register. If so, the flow returns to step S170. If it is determined in step S186 that the K-th portion is not the last portion, the flow proceeds to step S188, where at least one bit of the K-th portion of the first source vector register that is not stored in the destination vector register is carried to be processed in the (K+1)-th beat. The carry may be, for example, an internal carry within the processing circuit if the processing circuit is provided with sufficient hardware to execute a plurality of beats per tick. Alternatively, the carry may be performed by storing the at least one bit of the K-th portion of the first source vector register in a scalar register specified in the vector extraction and merge instruction. Next, the flow proceeds to step S190, where K is incremented, and the flow then returns to step S174.
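The loop of steps S172 to S190 can be sketched at the bit level as follows. This is an illustration under stated assumptions, not the claimed implementation: each portion is modelled as an N-bit integer, the control parameter is taken to be a shift of M bits with the further bits placed in the least significant positions (as in FIGS. 14 to 16), the carry is always routed through a modelled scalar register, and the names `extract_merge`, `state`, and `max_beats` (used here to imitate an interruption between beats) are hypothetical.

```python
def extract_merge(src1, src2, dst, state, n, m, max_beats=None):
    """Execute beats of the FIG. 18 flow on lists of N-bit integer portions.

    `state` is a hypothetical dict with "completed" (the status information,
    as a count of finished beats) and "scalar" (the carried M bits).
    """
    low_mask = (1 << (n - m)) - 1
    k = state["completed"]                        # S172: first uncompleted beat
    stop = len(src1) if max_beats is None else min(len(src1), k + max_beats)
    while k < stop:
        extracted = src1[k] & low_mask            # S174: low N-M bits
        if k == 0:                                # S176: first portion?
            further = src2[0] >> (n - m)          # S178: top M bits of source 2
        else:
            further = state["scalar"]             # S180: bits carried from K-1
        dst[k] = (extracted << m) | further       # S182 and S184
        if k < len(src1) - 1:                     # S186: not the last portion
            state["scalar"] = src1[k] >> (n - m)  # S188: carry top M bits
        k += 1                                    # S190
    state["completed"] = k


# Four 32-bit portions of 8-bit elements, shift of two elements (16 bits).
src1 = [0x07060504, 0x0B0A0908, 0x0F0E0D0C, 0x13121110]
src2 = [0x03020100]
dst = [0, 0, 0, 0]
state = {"completed": 0, "scalar": 0}
extract_merge(src1, src2, dst, state, n=32, m=16, max_beats=1)  # interrupted
extract_merge(src1, src2, dst, state, n=32, m=16)               # resumed
```

The second call starts at the second beat because the recorded status shows the first beat as completed, and it obtains its further bits from the modelled scalar register rather than from the second source vector register.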

[0083] The sequence of steps in FIG. 18 has been described with K being incremented sequentially. However, if sufficient hardware is provided to execute multiple beats per tick, the steps corresponding to each beat (each value of K) executed within the same tick are executed in parallel. For example, when beats K and K+1 are executed in parallel, step S174 extracts the bits specified by the control parameters from the (K+1)-th portion of the first source vector register in parallel with extracting the bits specified by the control parameters from the K-th portion of the first source vector register. Next, the one or more further bits for each of the K-th and (K+1)-th beats are obtained in parallel. Potentially, if the K-th portion is the first portion, the one or more further bits for the K-th portion are extracted from the second source vector register in parallel with the one or more further bits for the (K+1)-th portion being extracted from the K-th portion of the first source vector register. The concatenation step S182 is executed in parallel for the K-th and (K+1)-th portions, and the storage step S184 is likewise executed in parallel for the K-th and (K+1)-th portions. The determination in step S186 as to whether the last portion has been reached is made based on the highest value of K being processed. If, based on this determination, the flow continues, the process proceeds to step S188, and a carry is extracted from the (K+1)-th portion for processing in a subsequent tick. Those skilled in the art will understand that any number of beats of processing can be executed in parallel depending on the details of the hardware provided.

[0084] FIG. 19 schematically shows a non-transitory computer-readable medium containing computer-readable code for manufacturing a data processing apparatus according to various configurations of the present technique. The manufacturing is performed based on the computer-readable code 1002 stored in the non-transitory computer-readable medium 1000. The computer-readable code can be used in one or more stages of a semiconductor design and manufacturing process including an electronic design automation (EDA) stage to manufacture an integrated circuit comprising an apparatus embodying the concept. The manufacturing process includes applying the computer-readable code 1002 directly to one or more programmable hardware units such as a field programmable gate array (FPGA) to configure the FPGA to embody the above-described configuration, or facilitating the manufacture of an apparatus implemented as one or more integrated circuits or an apparatus embodying the above-described configuration. As an example, the manufactured design 1004 comprises an apparatus 30 having a register 32, a decoder circuit 38, and a processing circuit 40 as described in connection with FIG. 7. However, the manufactured design may correspond to any of the circuits shown in FIGS. 1, 7, and 8 capable of implementing vector extraction and merge instructions as described in connection with FIGS. 9-18.

[0085] FIG. 20 illustrates an example of a simulator implementation that can be used. The foregoing examples implement the present invention in terms of an apparatus and a method for operating specific processing hardware that supports the technique, but it is also possible to provide an instruction execution environment according to the examples described herein that is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software-based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation runs on a host processor 515, which optionally runs a host operating system 510, and supports a simulator program 505. In some configurations, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple different instruction execution environments may be provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations that execute at a reasonable speed, but such an approach may be justified in certain situations, such as when it is desired to execute native code for another processor for reasons of compatibility or reuse. For example, the simulator implementation may provide an instruction execution environment having additional functionality not supported by the host processor hardware, or may provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in "Some Efficient Architecture Simulation Techniques", Robert Bedichek, Winter 1990, USENIX Conference, pp. 53-63.

[0086] To the extent that examples have been described above with reference to specific hardware constructs or features, in a simulated implementation, equivalent functionality may be provided by suitable software constructs or features. For example, a particular circuit may be provided as computer program logic in a simulated implementation. Similarly, memory hardware such as registers or caches may be provided as software data structures in a simulated implementation. In configurations where one or more of the hardware elements referred to in the examples above are present in the host hardware, some simulated implementations may use the host hardware if appropriate.

[0087] The simulator program 505 may be stored on a computer-readable storage medium (which may be a non-transitory medium) and provides a virtual hardware interface (instruction execution environment) to the target code 500 (which may include applications, operating systems, and hypervisors). The virtual hardware interface is the same as the hardware interface of the hardware architecture modeled by the simulator program 505. Thus, the program instructions of the target code 500 can be executed from within the instruction execution environment using the simulator program 505. As a result, a host computer 515 that does not actually have the hardware features of the device 30 discussed above can emulate these features. The simulator program may include register logic 532 that emulates the operation of the registers 32, decoder circuit logic 538 that emulates the operation of the decoder circuit 38, and processing logic 540 that emulates the operation of the processing circuit 40. Additionally, the simulator program may include logic for implementing any of the circuits shown in FIGS. 1, 7, and 8 that are capable of implementing vector extraction and merge instructions as described in connection with FIGS. 9-18. Thus, the techniques described herein may be implemented in software by the simulator program 505 in the example of FIG. 20.
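The division of the simulator program into register logic, decoder logic, and processing logic might be organized along the following lines. This is a minimal sketch only: the class names mirror the roles of elements 532, 538, and 540 but are otherwise invented, the textual mnemonic `VEXTMERGE` and its operand syntax are hypothetical (the application does not define an assembly syntax), and the processing logic evaluates every beat in a single call, using the least-significant-first arrangement of FIGS. 14 to 16.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterLogic:
    """Software stand-in for the registers (cf. register logic 532)."""
    vectors: dict = field(default_factory=dict)   # name -> list of parts


@dataclass
class ControlSignal:
    dst: str
    src1: str
    src2: str
    shift: int                                    # shift distance in elements


class DecoderLogic:
    """Cf. decoder circuit logic 538; the text syntax below is invented."""

    def decode(self, text):
        op, operands = text.split(maxsplit=1)
        if op != "VEXTMERGE":                     # hypothetical mnemonic
            raise ValueError(f"unsupported instruction: {op}")
        dst, src1, src2, shift = (s.strip() for s in operands.split(","))
        return ControlSignal(dst, src1, src2, int(shift.lstrip("#")))


class ProcessingLogic:
    """Cf. processing logic 540; all beats are evaluated in one call."""

    def execute(self, regs, sig):
        s1, s2 = regs.vectors[sig.src1], regs.vectors[sig.src2]
        keep = len(s1[0]) - sig.shift
        result = []
        for k, part in enumerate(s1):
            donor = s2[0] if k == 0 else s1[k - 1]   # source of further bits
            result.append(donor[keep:] + part[:keep])
        regs.vectors[sig.dst] = result            # written only after all reads


# Parts and elements are listed least significant first.
regs = RegisterLogic({
    "q0": [[4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15], [16, 17, 18, 19]],
    "q1": [[0, 1, 2, 3], [0, 0, 0, 0], [0, 0, 0, 0], [0, 0, 0, 0]],
})
signal = DecoderLogic().decode("VEXTMERGE q2, q0, q1, #2")
ProcessingLogic().execute(regs, signal)
```

Because the result is assembled before the destination is written, the sketch also tolerates the destination register being the same register as the second source.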

[0088] In summary, a processing apparatus, method, and medium are provided. The apparatus includes a decoder circuit that generates a control signal in response to a vector extraction and merge instruction that specifies control parameters, a first vector register, a second vector register, and a destination vector register. The apparatus includes a processing circuit that executes a plurality of beats of processing in response to the control signal, where each beat includes processing corresponding to a portion of at least the first vector register and the destination vector register. For the K-th beat, the processing includes extracting bits specified by the control parameters from the K-th portion of the first vector register, concatenating the bits with further bits, and storing the result in the K-th portion of the destination vector register. The further bits are, for the first portion, extracted from the first portion of the second vector register, and otherwise carried from the (K-1)-th portion of the first vector register.
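For N-bit portions and a shift distance of M bits, the per-beat processing summarized above can be written compactly. The following is an illustrative formalization that assumes the arrangement of FIGS. 14 to 16 (processing from the least significant portion, with the further bits placed in the least significant M bit positions); the symbols $S^{(1)}_K$, $S^{(2)}_K$, $D_K$, and $F_K$ for the K-th portions of the first source, second source, and destination registers and for the further bits are introduced here and do not appear in the application.

```latex
\begin{aligned}
D_K &= \bigl(S^{(1)}_K \bmod 2^{\,N-M}\bigr)\cdot 2^{M} + F_K, \\[4pt]
F_K &=
\begin{cases}
\left\lfloor S^{(2)}_{1} / 2^{\,N-M} \right\rfloor, & K = 1, \\[4pt]
\left\lfloor S^{(1)}_{K-1} / 2^{\,N-M} \right\rfloor, & K > 1.
\end{cases}
\end{aligned}
```

With N = 32 and M = 16, as in FIG. 16, the first destination portion takes its upper half from data items 5 and 4 of the first source and its lower half from data items 3 and 2 of the second source.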

[0089] In this application, the term "configured to..." is used to mean that an element of an apparatus has a configuration capable of performing a defined operation. In this context, "configuration" means an arrangement or interconnection of hardware or software. For example, the apparatus may have dedicated hardware that provides a defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not mean that an apparatus element needs to be modified in any way to provide a defined operation.

[0090] Although exemplary configurations have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to configurations that are identical thereto, and that various changes, additions, and modifications can be made by those skilled in the art without departing from the scope and spirit of the invention as defined in the appended claims. For example, various combinations of the features of the dependent claims can be made with the features of the independent claims without departing from the scope of the invention.

Claims

1. An apparatus comprising: a plurality of vector registers; a decoder circuit that generates a control signal in response to a vector extraction and merge instruction, wherein the vector extraction and merge instruction specifies a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and a processing circuit that performs a plurality of beats of processing in response to the control signal, wherein each beat includes combining processing corresponding to a portion of at least the first source vector register and the destination vector register, and the processing circuit is configured to set beat status information indicating which beats of the vector extraction and merge instruction have been completed and to suppress the beats of the vector extraction and merge instruction that the beat status information indicates as completed; wherein the combining processing for the K-th beat, corresponding to the K-th portion of each of the specified registers, includes: extracting bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more further bits, and storing the result of the concatenation in the K-th portion of the destination vector register; and, if the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination vector register to be processed in the (K+1)-th beat of the plurality of beats; wherein, for the first portion of the specified registers, the one or more further bits are extracted from the first portion of the second source vector register; and wherein, for each portion of the specified registers other than the first portion, the one or more further bits are carried from the (K-1)-th portion of the first source vector register.

2. The apparatus according to claim 1, wherein the decoder circuit is responsive to the vector extraction and merge instruction specifying a scalar register; the plurality of beats includes a currently executing subset of one or more beats, the currently executing subset excluding the completed beats; and the processing circuit stores, in response to the control signal, at least one item of carry data in the scalar register, the at least one item of carry data including one or more bits that are carried between the currently executing subset of one or more beats and a further subset of one or more beats of the plurality of beats.

3. The apparatus according to claim 2, wherein the processing circuit, in response to the control signal, takes out one or more additional bits from the scalar register when the beat status information prior to the execution of the vector extraction and merge instruction indicates that at least one beat should be suppressed for a first beat of the currently executing set of one or more beats.

4. The apparatus according to claim 3, wherein the one or more bits to be carried comprise all the bits of the portion of the first source vector register, and taking the one or more further bits from the scalar register includes taking the last subset of bits from the scalar register.

5. The apparatus according to claim 3, wherein the one or more bits to be carried include the last set of M bits from the portion of the first source vector register, stored in a temporary set of bit positions in the scalar register, and taking the one or more further bits from the scalar register includes taking bits from the temporary set of bit positions of the scalar register.

6. The apparatus according to any one of claims 1 to 5, wherein concatenating the extracted bits includes storing the extracted bits in a first consecutive set of bit positions in the K-th portion of the destination vector register, and storing one or more further bits in a second consecutive set of bit positions in the K-th portion of the destination vector register.

7. The apparatus according to claim 6, wherein the first consecutive set of bit positions and the second consecutive set of bit positions are non-overlapping bit positions.

8. The apparatus according to claim 6, wherein the first consecutive set of bit positions is the most significant set of bit positions in the K-th portion of the destination vector register, and the second consecutive set of bit positions is the least significant set of bit positions in the K-th portion of the destination vector register.

9. The apparatus according to claim 6, wherein the first consecutive set of bit positions is the least significant set of bit positions of the K-th portion of the destination vector register, and the second consecutive set of bit positions is the most significant set of bit positions of the K-th portion of the destination vector register.

10. The apparatus according to any one of claims 1 to 5, wherein the extracted bits are extracted from consecutive bit positions of the K-th portion of the first source vector register.

11. The apparatus according to claim 10, wherein the consecutive bit positions are a set of the least significant consecutive bit positions of the K-th portion of the first source vector register.

12. The apparatus according to any one of claims 1 to 5, wherein each portion of the specified registers is an N-bit portion; the control parameter indicates a shift distance M specifying a number of bits; the one or more further bits comprise M bits; and the bits extracted from the K-th portion of the first source vector register comprise N minus M bits.

13. The apparatus according to claim 12, wherein each N-bit portion is divided into a plurality of elements; the shift distance corresponds to an integer number of elements; and, for the first portion of the specified registers, the one or more further bits include the least significant subset of the elements of the first portion of the second source vector register, excluding the least significant element.

14. The apparatus according to claim 12, wherein each N-bit portion is divided into a plurality of elements; the shift distance corresponds to an integer number of elements; and, for the first portion of the specified registers, the one or more further bits include the most significant subset of the elements of the first portion of the second source vector register.

15. The apparatus according to any one of claims 1 to 5, wherein the destination vector register is the second source vector register.

16. The apparatus according to any one of claims 1 to 5, wherein the processing circuit is configured to process at least two of the plurality of beats in parallel.

17. The apparatus according to any one of claims 1 to 5, wherein the processing circuit includes hardware that is insufficient to execute all of the plurality of beats of a given vector instruction in parallel.

18. The apparatus according to any one of claims 1 to 5, wherein the processing circuit is configured to process all of the plurality of beats of a given vector instruction in parallel.

19. The apparatus according to any one of claims 1 to 5, wherein the decoder circuit, in response to a memory data transfer instruction that is adjacent to the vector extraction and merge instruction in program counter order and that specifies a memory address and a transfer register among the plurality of vector registers, generates a data transfer control signal; the apparatus further includes a data control circuit that, in response to the data transfer control signal, executes a plurality of beats of memory data transfer processing, each beat of which performs a data transfer to a corresponding portion of the transfer register, sets beat status information indicating which beats of the memory data transfer instruction have been completed, and suppresses the beats of the memory data transfer instruction that the beat status information indicates as completed; and, when the transfer register is one of the specified registers, the processing circuit is configured to execute a first subset of the memory data transfer processing corresponding to a first subset of the portions of the transfer register in parallel with executing, in response to the vector extraction and merge instruction, a second subset of the processing corresponding to a second subset of the portions of the transfer register.

20. The apparatus according to any one of claims 1 to 5, wherein the control parameter is specified as an immediate value in the vector extraction and merge instruction.

21. The apparatus according to any one of claims 1 to 5, wherein the first portion of the designated register is the lowest portion of the designated register, and the last portion of the designated register is the highest portion of the designated register.

22. A method of operating an apparatus comprising a plurality of vector registers, a decoder circuit, and a processing circuit, the method comprising: generating, using the decoder circuit, a control signal in response to a vector extraction and merge instruction, wherein the vector extraction and merge instruction specifies a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and, using the processing circuit, executing a plurality of beats of processing in response to the control signal, wherein each beat includes combining processing corresponding to a portion of at least the first source vector register and the destination vector register, setting beat status information indicating which beats of the vector extraction and merge instruction have been completed, and suppressing the beats of the vector extraction and merge instruction that the beat status information indicates as completed; wherein the combining processing for the K-th beat, corresponding to the K-th portion of each of the specified registers, includes: extracting bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more further bits, and storing the result of the concatenation in the K-th portion of the destination vector register; and, if the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination vector register to be processed in the (K+1)-th beat of the plurality of beats; wherein, for the first portion of the specified registers, the one or more further bits are extracted from the first portion of the second source vector register; and wherein, for each portion of the specified registers other than the first portion, the one or more further bits are carried from the (K-1)-th portion of the first source vector register.

23. A computer-readable medium storing computer-readable code for the manufacture of an apparatus, the apparatus comprising: a plurality of vector registers; a decoder circuit that generates a control signal in response to a vector extraction and merge instruction, wherein the vector extraction and merge instruction specifies a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and a processing circuit that performs a plurality of beats of processing in response to the control signal, wherein each beat includes combining processing corresponding to a portion of at least the first source vector register and the destination vector register, and the processing circuit is configured to set beat status information indicating which beats of the vector extraction and merge instruction have been completed and to suppress the beats of the vector extraction and merge instruction that the beat status information indicates as completed; wherein the combining processing for the K-th beat, corresponding to the K-th portion of each of the specified registers, includes: extracting bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more further bits, and storing the result of the concatenation in the K-th portion of the destination vector register; and, if the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination vector register to be processed in the (K+1)-th beat of the plurality of beats; wherein, for the first portion of the specified registers, the one or more further bits are extracted from the first portion of the second source vector register; and wherein, for each portion of the specified registers other than the first portion, the one or more further bits are carried from the (K-1)-th portion of the first source vector register.

24. A computer program for controlling a host data processing device to provide an instruction execution environment, the computer program comprising:
register logic having a plurality of vector registers;
decoder logic that generates a control signal in response to a vector extraction and merge instruction, wherein the vector extraction and merge instruction specifies a control parameter and, as specified registers among the plurality of vector registers, a first source vector register, a second source vector register, and a destination vector register; and
processing logic that performs a plurality of beats of processing in response to the control signal, wherein each beat includes a combination process corresponding to at least a portion of the first source vector register and the destination vector register, and the processing logic is configured to set beat status information indicating which beats of the vector extraction and merge instruction have been completed, and to suppress the beats of the vector extraction and merge instruction that the beat status information indicates as completed,
wherein the combination process for the K-th beat, corresponding to a respective K-th portion of the specified registers, comprises:
extracting bits specified by the control parameter from the K-th portion of the first source vector register, concatenating the extracted bits with one or more further bits, and storing the result of the concatenation in the K-th portion of the destination vector register; and
when the K-th portion is not the last portion of the specified registers, carrying at least one bit of the K-th portion of the first source vector register that is not stored in the destination vector register, to be processed in the (K+1)-th beat of the plurality of beats,
wherein, for the first portion of the specified registers, the one or more further bits are extracted from the first portion of the second source vector register, and
for each K-th portion of the specified registers other than the first portion, the one or more further bits are carried from the (K-1)-th portion of the first source vector register.
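
The combination process and the beat status handling recited in claims 23 and 24 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the claimed implementation. The 32-bit portion width, the reading of the control parameter as a bit count n, the choice of the top n bits of the second source's first portion as the further bits for the first beat, and the names extract_merge_beat, run_instruction and stop_after are all hypothetical. The sketch also re-reads the carried bits from the first source register at beat K instead of holding them in carry storage, which gives the same result only if the destination register is distinct from the first source register.

```python
PORTION_BITS = 32                      # assumed width of one portion (one beat)
MASK = (1 << PORTION_BITS) - 1


def extract_merge_beat(k, src1, src2, dst, n):
    """Perform the combination process for beat k (hypothetical model)."""
    assert 0 < n < PORTION_BITS
    # Bits selected by the control parameter: the low (PORTION_BITS - n) bits
    # of portion k of the first source register.
    extracted = src1[k] & ((1 << (PORTION_BITS - n)) - 1)
    if k == 0:
        # First portion: further bits come from the first portion of the
        # second source register (assumed here to be its top n bits).
        further = src2[0] >> (PORTION_BITS - n)
    else:
        # Other portions: further bits are the top n bits of portion k-1 of
        # the first source register, i.e. the bits beat k-1 did not store.
        further = src1[k - 1] >> (PORTION_BITS - n)
    # Concatenate and store in portion k of the destination register.
    dst[k] = ((extracted << n) | further) & MASK


def run_instruction(src1, src2, dst, n, beat_status=0, stop_after=None):
    """Run the beats in order, suppressing beats flagged in beat_status.

    stop_after models an interruption once that many beats have executed.
    Returns the updated beat status bitmask (bit k set = beat k completed).
    """
    executed = 0
    for k in range(len(src1)):
        if beat_status & (1 << k):
            continue                   # beat already completed: suppress it
        if stop_after is not None and executed == stop_after:
            break                      # interrupted before this beat
        extract_merge_beat(k, src1, src2, dst, n)
        beat_status |= 1 << k
        executed += 1
    return beat_status


src1 = [0x80000001, 0x00000002, 0xF0000000, 0x00000004]
src2 = [0xA0000000, 0, 0, 0]
dst = [0] * 4
status = run_instruction(src1, src2, dst, 4, stop_after=2)        # beats 0 and 1 run
status = run_instruction(src1, src2, dst, 4, beat_status=status)  # beats 2 and 3 run
print([hex(v) for v in dst])  # ['0x1a', '0x28', '0x0', '0x4f']
```

In this model the four beats together behave like a left shift of the whole first source register by n bits, with the vacated low bits filled from the second source register. Resuming with the saved beat status executes only the beats that had not completed.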