Scalar-vector heterogeneous pipelined interaction chip and interaction method
By using a scalar-vector heterogeneous pipeline interaction chip, the preparation and execution of scalar operands for vector instructions are decoupled, which solves the problem of vector instructions occupying resources in the scalar pipeline and improves the execution efficiency of scalar instructions and the overall system performance.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANDONG UNIV
- Filing Date
- 2026-06-01
- Publication Date
- 2026-06-30
AI Technical Summary
In existing coprocessor designs based on the RISC-V vector instruction set, vector instructions consume resources in the scalar pipeline, resulting in low execution efficiency of scalar instructions and increased execution latency of vector instructions, which cannot fully utilize the parallelism of vector processing units.
A scalar-vector heterogeneous pipeline interaction chip decouples the preparation of scalar operands for vector instructions from the issue and execution of those instructions. A cross-queue module realizes the decoupled interaction between the scalar and vector pipelines, preventing vector instructions from occupying the issue channel of the scalar pipeline and improving the execution efficiency of vector instructions.
It significantly improves the execution efficiency of scalar threads and the overall system instruction throughput, shortens the execution path of vector instructions, reduces startup latency, and improves the utilization of vector functional units.
Figure CN122308924A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the field of pipelined interaction chips, and particularly relates to a scalar-vector heterogeneous pipelined interaction chip and interaction method.
Background Technology
[0002] The statements in this section are merely background information related to the present invention and do not necessarily constitute prior art.
[0003] Currently, coprocessor implementations based on the RISC-V (Reduced Instruction Set Computer five) vector instruction set typically integrate the scalar pipeline and the vector processing units tightly. The basic process is as follows: When a vector instruction enters the scalar core pipeline, it undergoes preliminary parsing during the decoding stage. After preliminary decoding, the instruction enters the issue queue to await issue. Once all scalar operands are ready, the instruction and its required scalar operands are sent to the vector coprocessor via a dedicated communication interface. Upon receiving the instruction and scalar operands, the vector coprocessor starts its internal detailed decoding process. Based on the detailed decoding results, the vector instruction is typically broken down into one or more micro-operations to match the actual execution capabilities of the vector processing unit. These micro-operations are then dispatched to different functional units within the vector pipeline (such as vector arithmetic logic units, vector load/store units, or mask register units) for execution.
[0004] Existing coprocessors based on the RISC-V vector instruction set have the following problems: (1) Vector instructions must complete the entire fetch, decode, operand preparation and issue process in the scalar core pipeline, just like ordinary scalar instructions. They therefore compete with scalar instructions for pipeline resources, occupying and blocking the issue channel of the scalar pipeline; (2) The core computation of a vector instruction occurs in a dedicated vector coprocessor, yet before reaching the coprocessor the instruction must pass through all stages of the scalar pipeline. These stages do not contribute to the final result of the vector instruction, so the instruction waits unnecessarily in the scalar pipeline and its execution latency increases.
Summary of the Invention
[0005] To address the technical problems mentioned above, this invention provides a scalar-vector heterogeneous pipelined interaction chip and method. By separating the preparation of scalar operands for vector instructions from the issuance and execution of the vector instructions themselves, the scalar pipeline can continuously issue subsequent scalar instructions without waiting for the slow preparation or execution of vector operands, thereby significantly improving the execution efficiency of scalar threads and the instruction throughput of the overall system.
[0006] To achieve the above objectives, the present invention adopts the following technical solution: The first aspect of the present invention provides a scalar-vector heterogeneous pipelined interaction chip.
[0007] A scalar-vector heterogeneous pipelined interaction chip includes: a scalar processor, a vector coprocessor, and a cross-queue module that communicates with the scalar processor and the vector coprocessor; the scalar processor has a scalar pipeline, and the vector coprocessor has a vector pipeline. The scalar processor decodes the fetched vector instructions and divides them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for execution in the vector pipeline. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module. The cross-queue module allocates an entry for each vector instruction that requires scalar operands. When the vector pipeline needs scalar operands, the vector coprocessor retrieves them from the corresponding entries in the cross-queue module. After the vector instruction is executed, its execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.
[0008] In one implementation, the scalar processor includes an instruction fetch module, a scalar decoding module, a scalar issue module, a scalar execution module, and a scalar register file; The instruction fetch module is used to fetch instructions and transmit them to the scalar decoding module. The scalar decoding module is used to decode the instructions and determine whether they are vector instructions. If not, they directly enter the scalar execution module. If they are, one data stream assigns an instruction ID to each vector instruction and further determines whether it requires scalar operands. If scalar operands are required, the scalar register numbers in the scalar register file that are needed are decoded and sent to the cross-queue module along with the vector instruction ID. The other data stream allows the vector instructions to directly enter the vector coprocessor.
[0009] In one implementation, the vector coprocessor includes a vector micro-instruction splitting module, a vector dispatching module, a vector register renaming module, and a vector reordering cache module. The vector micro-instruction splitting module splits vector instructions into micro-instructions and transmits the splitting results to the vector dispatching module. In the vector dispatching module, two source registers and one destination register are renamed and then forwarded to the vector register renaming module for processing. After completing the renaming, the vector register renaming module returns the renaming result to the vector dispatching module, which then forwards the renaming result and instruction ID information to the vector reordering cache module for entry allocation.
[0010] As one implementation, after the vector instruction is submitted in the vector reordering cache, the cross-queue module is also used to perform a write-back operation on the scalar operand.
[0011] As one implementation, after the vector coprocessor completes the forwarding operation, the vector instruction is further decoded. This decoding operation decodes the vector instruction into a specific operation type.
[0012] As one implementation method, vector instruction decoding produces specific operation types including arithmetic instructions, memory access instructions, and mask instructions.
[0013] As one implementation method, after decoding is completed, all relevant information about the instruction and the decoding result are sent to the vector issue queue module. The vector issue queue module allocates an entry for each vector instruction that enters it to store all source operand information of the vector instruction.
[0014] As one implementation method, once all operands are ready, the vector instruction is in a ready-to-issue state. When the vector instruction is the oldest in the issue queue, it is issued to the execution unit for computation.
[0015] In one implementation, the entry stores the vector instruction ID, scalar register number, and scalar operand.
[0016] A second aspect of the present invention provides a scalar-vector heterogeneous pipeline interaction method.
[0017] A scalar-vector heterogeneous pipeline interaction method, based on the aforementioned scalar-vector heterogeneous pipeline interaction chip, comprises the following: The scalar processor decodes the fetched vector instructions and splits them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for execution in the vector pipeline. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module, which then allocates an entry for each vector instruction that requires scalar operands. When the vector pipeline needs scalar operands, the vector coprocessor retrieves them from the corresponding entries in the cross-queue module. After the vector instruction is executed, the execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.
[0018] The beneficial effects of this invention are: The scalar-vector heterogeneous pipeline interaction chip of the present invention consists of a scalar processor, a vector coprocessor, and a cross-queue module that communicates with both. The cross-queue module realizes the decoupled interaction between the scalar and vector pipelines, preventing vector instructions from occupying the issue channel of the scalar pipeline and thus affecting the efficiency of scalar instruction execution. Furthermore, after the scalar result of the vector instruction is calculated, the entry in the vector issue queue can be released and the result written to the cross-queue module, preventing this vector instruction from occupying the entries of other vector instructions for a long time, and enabling the issue matrix to receive new vector instructions as quickly as possible.
[0019] Advantages of additional aspects of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Attached Figure Description
[0020] The accompanying drawings, which form part of this invention, are used to provide a further understanding of the invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not constitute an improper limitation of the invention.
[0021] Figure 1 is a schematic diagram of the structure of a scalar-vector heterogeneous pipelined interaction chip according to an embodiment of the present invention; Figure 2 is a flowchart of the scalar-vector heterogeneous pipeline interaction method according to an embodiment of the present invention; Figure 3 is a timing diagram of the cross queue for scalar-vector heterogeneous pipeline interaction instructions in an embodiment of the present invention; Figure 4 is a timing diagram of releasing scalar operands according to an embodiment of the present invention.
Detailed Implementation
[0022] The present invention will be further described below with reference to the accompanying drawings and embodiments.
[0023] It should be noted that the following detailed description is illustrative and intended to provide further explanation of the invention. Unless otherwise specified, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains.
[0024] It should be noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the scope of exemplary embodiments according to the invention. As used herein, the singular form is intended to include the plural form as well, unless the context clearly indicates otherwise. Furthermore, it should be understood that when the terms "comprising" and / or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and / or combinations thereof.
[0025] In the implementation of a coprocessor based on the RISC-V vector instruction set, the initial decoding stage is primarily responsible for identifying the scalar register operands involved in the instruction and determining the major category of the vector instruction (such as load/store, arithmetic operations, or mask operations), while extracting basic control information such as vector length and vector data type. After the initial decoding is complete, while the instruction is waiting in the issue queue, the issue logic fetches the scalar operands required by the instruction, such as the base address register, stride value, or immediate value. This process may involve accessing the register file and performing data dependency checks to ensure that all scalar operands are ready and the instruction can be safely issued.
[0026] All scalar operands are sent to the vector coprocessor. Upon receiving the instruction and scalar operands, the vector coprocessor, during its decoding phase, further parses the instruction into lower-level control signals. This includes determining the micro-operation sequence and the location of computational elements, generating memory address sequences, and configuring the data path for the computation unit. Since RISC-V vector instructions support variable-length vector operations, this stage also requires configuring the processor's runtime state based on the vector type (vtype) and vector length (vl). Based on the detailed decoding results, vector instructions are typically broken down into one or more micro-operations (μops) to match the actual execution capabilities of the vector processing unit. For example, long vectors may be divided into multiple chunks for batch processing. These micro-operations are then dispatched to different functional units in the vector pipeline (such as the vector arithmetic logic unit, vector load/store unit, or mask register unit) for execution, where they perform the actual data-parallel computation.
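The chunking of a long vector into micro-operations mentioned above can be illustrated with a minimal sketch. The function name split_into_uops and the parameter elems_per_uop (the number of elements one micro-operation can process) are assumptions made for illustration; the patent does not specify how chunk sizes are chosen.

```python
def split_into_uops(vl: int, elems_per_uop: int) -> list[tuple[int, int]]:
    """Return one (start_element, element_count) pair per micro-operation.

    vl            -- vector length of the instruction, in elements
    elems_per_uop -- hypothetical number of elements one micro-op can process
    """
    if vl < 0 or elems_per_uop <= 0:
        raise ValueError("vl must be >= 0 and elems_per_uop must be > 0")
    # Walk the vector in fixed-size steps; the last chunk may be shorter.
    return [
        (start, min(elems_per_uop, vl - start))
        for start in range(0, vl, elems_per_uop)
    ]
```

For example, a 10-element vector with 4 elements per micro-operation yields three chunks, the last of which covers only 2 elements.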
[0027] In existing RISC-V vector instruction set-based coprocessor implementations, scalar processors typically have limited issue slots and issue bandwidth. A vector instruction may remain in the issue queue while its scalar operands are not ready (e.g., instructions preceding the scalar registers it depends on have not yet been written back). During this time, it occupies an issue position, potentially preventing subsequent ready scalar instructions from being issued, even if these instructions are unrelated to the current vector instruction. This resource occupancy directly reduces scalar instruction throughput and disrupts the smoothness of the scalar pipeline.
[0028] Vector instructions typically require multiple scalar operands, such as the base address register for memory operations, the stride value, and the vector length (vl). Preparing these operands may involve accessing the register file, performing forwarding to resolve data dependencies, and even waiting for long-latency operations (such as cache-missing load instructions) to complete. All of these actions occur within the scalar pipeline, further extending the time vector instructions are "stuck" in the scalar phase and amplifying interference with the scalar instruction flow. This design tightly couples scalar performance with vector performance, causing them to influence each other. When there are intensive scalar computations in the program, the issuance of vector instructions may be delayed; conversely, when vector instructions wait for operands, it slows down the execution of scalar threads. This mutual constraint is detrimental to building efficient and predictable heterogeneous computing systems.
[0029] Existing coprocessor implementations based on the RISC-V vector instruction set also suffer from the problem of unnecessary waiting for vector instructions in the scalar pipeline. This problem focuses on the execution efficiency of the vector instructions themselves. Current methods force vector instructions to undergo a waiting period in the scalar pipeline that is "useless" for their own execution. The core computation of vector instructions occurs in a dedicated vector coprocessor. However, before reaching the coprocessor, it must go through all stages of the scalar pipeline, such as integer ALU operations and branch prediction, which do not contribute to the final result of the vector instruction and are purely overhead. This increases the execution latency of vector instructions.
[0030] For short vector operations or latency-sensitive applications, this "prelude" overhead accounts for a considerable proportion of the total execution time, severely undermining the performance gains brought by introducing vector extensions. Ideally, vector instructions should be able to leave the scalar pipeline as quickly as possible and enter their dedicated, highly parallelized execution units to minimize startup latency. While a vector instruction is waiting for its scalar operand in the scalar pipeline, its subsequent vector operations (such as address calculations, element accesses, and data operations) cannot start at all. This is a serialization bottleneck that fails to fully utilize the potential independence and parallelism of vector units. More advanced designs consider overlapping or decoupling "scalar operand preparation" from "vector operation startup."
[0031] Existing coprocessor implementations based on the RISC-V vector instruction set still suffer from inefficiencies caused by sequential issue and spurious dependencies, a key bottleneck in the internal execution mechanism of vector coprocessors. While the sequential issue strategy is simple, it cannot effectively handle the complex instruction sequences in modern programs. Sequential issue means that the vector coprocessor must strictly follow the instruction sequence, dispatching all micro-operations (μops) of the previous instruction before processing the next. If the previous instruction is a long-latency operation (such as a large-scale vector memory load), even if the subsequent instruction is an independent, immediately executable vector arithmetic operation, it must block and wait, resulting in idle computing units and severely reduced hardware utilization. These spurious dependencies are very common in code, especially in unoptimized code or compiler-generated code. The sequential issue mechanism cannot distinguish between true dependencies (RAW, real data stream) and false dependencies (WAR / WAW, just register name reuse), and processes them all serially in the most conservative way. This greatly limits the exploitation of instruction-level parallelism (ILP), resulting in low issue efficiency of vector instruction streams. Multiple vector functional units cannot be fully utilized, ultimately making it difficult to achieve the peak performance of the vector coprocessor.
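The distinction between true and false dependencies can be made concrete with a short sketch. The VInstr record and the classify_hazards function are hypothetical names used only for illustration; the classification rules themselves (RAW, WAR, WAW) are the standard definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VInstr:
    dst: str                 # architectural destination register name
    srcs: tuple[str, ...]    # architectural source register names


def classify_hazards(older: VInstr, younger: VInstr) -> set[str]:
    """Classify the register hazards between two instructions in program order."""
    hazards = set()
    if older.dst in younger.srcs:
        hazards.add("RAW")   # true dependency: younger reads what older writes
    if younger.dst in older.srcs:
        hazards.add("WAR")   # false dependency: younger overwrites an older source
    if younger.dst == older.dst:
        hazards.add("WAW")   # false dependency: both write the same register name
    return hazards
```

Only RAW hazards carry data between instructions. WAR and WAW hazards arise from register name reuse and disappear once the destination is given a fresh physical register, which is why the register renaming described later matters.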
[0032] As shown in Figure 1 and Figure 2, the scalar-vector heterogeneous pipelined interaction chip of this invention includes: a scalar processor, a vector coprocessor, and a cross-queue module that communicates with the scalar processor and the vector coprocessor; the scalar processor is provided with a scalar pipeline, and the vector coprocessor is provided with a vector pipeline. The scalar processor decodes the fetched vector instructions and splits them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for execution in the vector pipeline. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module. The cross-queue module allocates an entry for each vector instruction that requires scalar operands. When the vector pipeline needs scalar operands, the vector coprocessor retrieves them from the corresponding entries in the cross-queue module. After the vector instruction is executed, its execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.
[0033] It should be noted here that the table entries store the vector instruction ID, scalar register number, and scalar operand.
[0034] In specific implementation, the scalar processor includes an instruction fetch module, a scalar decoding module, a scalar issue module, a scalar execution module, and a scalar register file. The instruction fetch module is used to fetch instructions and transmit them to the scalar decoding module. The scalar decoding module is used to decode the instructions and determine whether they are vector instructions. If not, they directly enter the scalar execution module. If they are, one data stream assigns an instruction ID to each vector instruction and further determines whether it requires scalar operands. If scalar operands are required, the scalar register numbers in the scalar register file that are needed are decoded and sent to the cross-queue module along with the vector instruction ID. Another data stream allows the vector instructions to directly enter the vector coprocessor.
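The two-stream split performed at scalar decode can be sketched as follows. This is a simplified model under stated assumptions: the Decoded record, the route function, and the use of a simple counter for instruction IDs are all hypothetical, and the mnemonics in the usage note are ordinary RISC-V vector instructions chosen only as examples.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Decoded:
    opcode: str
    is_vector: bool
    scalar_regs: tuple[int, ...] = ()   # scalar register numbers the instruction reads


def route(instrs):
    """Route decoded instructions to the three destinations described above."""
    next_iid = count()                  # assumed: IDs are handed out sequentially
    to_scalar, to_vector, to_cross_queue = [], [], []
    for ins in instrs:
        if not ins.is_vector:
            to_scalar.append(ins.opcode)           # scalar path, unchanged
            continue
        iid = next(next_iid)                       # every vector instruction gets an ID
        to_vector.append((iid, ins.opcode))        # stream 1: straight to the coprocessor
        if ins.scalar_regs:
            # stream 2: only instructions that need scalar operands reach the cross queue
            to_cross_queue.append((iid, ins.scalar_regs))
    return to_scalar, to_vector, to_cross_queue
```

With the program add, vadd.vv, vadd.vx (reads x5), vle32.v (reads x10), all three vector instructions go to the coprocessor, but only the last two produce a cross-queue request.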
[0035] In its implementation, the vector coprocessor includes a vector micro-instruction splitting module, a vector dispatching module, a vector register renaming module, and a vector reordering cache module. The vector micro-instruction splitting module splits vector instructions into micro-instructions and then transmits them to the vector dispatching module. In the vector dispatching module, two source registers and one destination register are renamed, and the renaming is forwarded to the vector register renaming module for processing. After renaming, the vector register renaming module returns the renaming result to the vector dispatching module, which then forwards the renaming result and instruction ID information to the vector reordering cache module for entry allocation.
[0036] After the vector instruction is submitted in the vector reordering cache, the cross-queue module is also used to perform write-back operations on scalar operands.
[0037] After the vector coprocessor completes the forwarding operation, it further decodes the vector instructions, which decodes the vector instructions into specific operation types. These specific operation types include, but are not limited to, arithmetic instructions, memory access instructions, and masking instructions.
[0038] After decoding is complete, all relevant information about the instruction and the decoding result are sent to the vector issue queue module. The vector issue queue module allocates an entry for each vector instruction that enters it to store all source operand information of that vector instruction.
[0039] Once all operands are ready, the vector instruction is in a ready-to-issue state. When the vector instruction is the oldest in the issue queue, it is issued to the execution unit for computation.
[0040] Each vector instruction is assigned a unique instruction ID at the decode stage of the scalar pipeline and then undergoes a simple decoding pass in a module called vec_dec. This module performs the initial decoding of the instruction's basic information: whether it is a vector configuration instruction (type vset), whether it is a vector memory access instruction (a vector load of type vld or a vector store of type vst), and, most importantly, whether it needs scalar operands. If it does, the scalar register number is also decoded.
[0041] After the vector instruction is initially decoded in the vector pre-decoder module, the output of this module is divided into two data streams. One data stream continues in the scalar pipeline, carrying the ID of the vector instruction and the register numbers of the scalar operands required by the instruction. This data stream enters the scalar issue module, where it is processed and used as a channel to send operand fetch requests to the scalar register file. After fetching, the fetched data and the instruction ID are sent to a module called the cross queue. The core of this module is to act as an interaction module between the scalar and vector pipelines. In this module, an entry is allocated for each vector instruction that requires scalar operands. Each entry stores the ID of the vector instruction, the scalar register number, and the scalar operand. When a vector instruction that requires scalar operands is issued by the vector coprocessor's issue module, it retrieves the required operands from the corresponding entry in this cross queue.
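A minimal sketch of the cross queue as described above: each entry holds the instruction ID, the scalar register number, and the scalar operand. Keying entries by instruction ID in a dictionary, the method names allocate, fill and fetch, and the use of None for an operand that has not yet arrived are assumptions for illustration, not details from the patent.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CrossQueueEntry:
    iid: int                        # vector instruction ID
    reg_num: int                    # scalar register number
    operand: Optional[int] = None   # scalar operand; None until the scalar side supplies it


class CrossQueue:
    def __init__(self) -> None:
        self.entries: dict[int, CrossQueueEntry] = {}

    def allocate(self, iid: int, reg_num: int) -> None:
        """Reserve an entry for a vector instruction that needs a scalar operand."""
        self.entries[iid] = CrossQueueEntry(iid, reg_num)

    def fill(self, iid: int, value: int) -> None:
        """Scalar side: store the operand read from the scalar register file."""
        self.entries[iid].operand = value

    def fetch(self, iid: int) -> Optional[int]:
        """Vector side: return the operand, or None if it is not ready yet."""
        return self.entries[iid].operand
```

The vector side can therefore ask for an operand at any time without stalling the scalar pipeline; it simply sees None until the scalar issue module has read the register file and called fill.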
[0042] In the other data stream, vector instructions enter the vector coprocessor after initial decoding in the vector pre-decoder module. They first enter the vector micro-instruction splitting module, where they are split into micro-instructions, and then enter the vector dispatch module. In the vector dispatch module, two source registers and one destination register are renamed. This operation is forwarded to the vector register renaming module for processing. After renaming, the result is returned to the vector dispatch module, which forwards the renaming result, along with the instruction ID and other relevant information, to the vector reordering cache module for entry allocation. After forwarding, the vector instructions undergo further decoding into specific operation types, such as arithmetic instructions, memory access instructions, and mask instructions.
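The renaming of two source registers and one destination register can be sketched with a conventional map table and free list. The patent does not describe the renaming scheme, so the RenameTable class, its free-list policy, and the register counts are all assumptions; reclaiming old physical registers is omitted.

```python
class RenameTable:
    def __init__(self, num_arch: int, num_phys: int) -> None:
        # Initially each architectural register maps to the same-numbered physical one.
        self.map = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, src0: int, src1: int, dst: int) -> tuple[int, int, int]:
        """Rename two sources and one destination for a single micro-instruction."""
        # Sources are looked up BEFORE the destination is remapped, so an
        # instruction that reads and writes the same register sees the old value.
        p_src0 = self.map[src0]
        p_src1 = self.map[src1]
        if not self.free:
            raise RuntimeError("no free physical register; dispatch would stall")
        p_dst = self.free.pop(0)
        self.map[dst] = p_dst
        return p_src0, p_src1, p_dst
```

Because every destination receives a fresh physical register, two instructions that write the same architectural register no longer conflict, which removes WAR and WAW hazards.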
[0043] After decoding is complete, all relevant information about the instruction and the decoding result are sent to the vector issue queue module. The vector issue queue module allocates an entry for each vector instruction entering the module. This entry stores all source operand information for the vector instruction, as shown in Table 1. Here, src0_rdy, src1_rdy, dst_rdy, and v0_rdy indicate whether each operand is ready, and src0, src1, dst, and v0 store the corresponding operands. When all operands are ready, the vector instruction is in a ready-to-issue state. When it is also the oldest instruction in the issue queue, it is issued to the execution unit for computation.
[0044] Table 1. Information on all source operands of a vector instruction
[0045] The fields in Table 1 are:
- src0_rdy, src0: ready signal and data of source register 0
- src1_rdy, src1: ready signal and data of source register 1
- dst_rdy, dst: ready signal and data of the destination register
- v0_rdy, v0: ready signal and data of the mask register
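The issue condition built on these fields can be sketched as follows. The IssueEntry class and select_for_issue function are hypothetical names. The sketch follows the text literally in requiring all four ready bits; a design in which dst_rdy instead marks a completed scalar result would leave it out of the check.

```python
from dataclasses import dataclass


@dataclass
class IssueEntry:
    iid: int
    src0_rdy: bool = False
    src1_rdy: bool = False
    dst_rdy: bool = False
    v0_rdy: bool = False

    def all_ready(self) -> bool:
        return self.src0_rdy and self.src1_rdy and self.dst_rdy and self.v0_rdy


def select_for_issue(queue: list[IssueEntry]):
    """Issue the oldest entry if, and only if, all of its operands are ready.

    `queue` is ordered oldest first. Returns the issued entry or None.
    """
    if queue and queue[0].all_ready():
        return queue.pop(0)
    return None
```

In this model a younger ready instruction still waits behind an older one that is not ready, matching the oldest-first rule stated above.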
[0046] Because vector instructions may require scalar operands, the states of src0_rdy and src1_rdy may depend on the state of the instruction's entry in the cross queue. In the vector issue queue module, when a source operand of the vector instruction in an entry is a scalar operand, the entry checks the state of the corresponding scalar operand in the cross queue every clock cycle. If the operand is ready, the corresponding rdy bit in the entry is pulled high, and the operand in the cross queue is written to the corresponding src0 or src1.
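The per-cycle check can be sketched for a single source operand. For brevity the cross queue is modeled as a dictionary from instruction ID to operand value, with None meaning "not ready"; this representation, the src0_is_scalar flag, and the function name poll_cross_queue are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IssueEntry:
    iid: int
    src0_is_scalar: bool = False    # assumed flag: src0 comes from the scalar side
    src0_rdy: bool = False
    src0: Optional[int] = None


def poll_cross_queue(entry: IssueEntry, cross_queue: dict[int, Optional[int]]) -> None:
    """Model one clock cycle of an issue-queue entry watching the cross queue."""
    if entry.src0_is_scalar and not entry.src0_rdy:
        value = cross_queue.get(entry.iid)
        if value is not None:
            entry.src0 = value        # copy the operand into the entry
            entry.src0_rdy = True     # pull the ready bit high
```

Calling poll_cross_queue once per simulated cycle leaves the entry unchanged until the scalar side has deposited the operand, after which the ready bit is set in that same cycle.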
[0047] Meanwhile, the result of a vector instruction may also need to be written back to the scalar register file, which can also be done through this cross queue. When the vector instruction is executed, the result is returned to the vector issue queue, that is, the scalar result is written to the corresponding entry in dst, and the corresponding dst_rdy is pulled high. In the next cycle, the vector issue queue returns this result to the cross queue module and writes it to the corresponding cross queue entry. After the vector instruction is committed in the vector reordering cache, the cross queue performs the write-back operation of the scalar operand.
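The three-step return path for a scalar result (issue queue, then cross queue, then scalar register file after commit) can be sketched as a small cycle-stepped model. The ScalarResultPath class, its tick method, and the committed flag are hypothetical; only the ordering of the steps is taken from the text.

```python
class ScalarResultPath:
    """Tracks one vector instruction whose result goes to scalar register `rd`."""

    def __init__(self, iid: int, rd: int) -> None:
        self.iid, self.rd = iid, rd
        self.dst = None                 # dst field of the issue-queue entry
        self.dst_rdy = False
        self.cross_queue_value = None   # result held in the cross-queue entry
        self.committed = False          # set when the reordering cache commits

    def execute_done(self, result: int) -> None:
        # Step 1: the execution unit writes the result back to the issue queue.
        self.dst, self.dst_rdy = result, True

    def tick(self, regfile: dict[int, int]) -> None:
        # Step 3: after commit, the cross queue writes the scalar register file.
        if self.committed and self.cross_queue_value is not None:
            regfile[self.rd] = self.cross_queue_value
        # Step 2: one cycle after execution, the issue queue hands the result
        # to the cross queue and its own entry can be released.
        if self.dst_rdy:
            self.cross_queue_value = self.dst
            self.dst_rdy = False
```

The register file is updated only after both conditions hold: the result has reached the cross queue and the instruction has committed.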
[0048] This cross-queue architecture enables decoupled interaction between the scalar and vector pipelines. It acts as a relay station that prevents vector instructions from occupying the scalar pipeline's issue channel, which would otherwise reduce the efficiency of scalar instruction execution. Furthermore, once the scalar result of a vector instruction has been computed, its entry in the vector issue queue can be released and the result written to the cross queue. This prevents the vector instruction from occupying entries needed by other vector instructions for an extended period and allows the issue matrix to receive new vector instructions as quickly as possible.
[0049] This invention achieves efficient decoupling of scalar and vector pipelines, significantly improving overall performance: By separating the preparation of scalar operands for vector instructions from the issuance and execution of the vector instructions themselves, this architecture avoids vector instructions occupying valuable issue slots and register read ports of the scalar core for extended periods. The scalar pipeline can continuously issue subsequent scalar instructions without waiting for the slow preparation or execution of vector operands, thus significantly improving the execution efficiency of scalar threads and the overall system instruction throughput (IPC). After initial decoding, the main body of a vector instruction can immediately enter the vector coprocessor for subsequent micro-operation splitting and renaming preparations, without waiting for all scalar operands to be ready in the scalar pipeline. This "early start" mechanism effectively shortens the execution path of vector instructions and reduces their startup latency.
[0050] This invention optimizes the execution efficiency of vector instructions through asynchronous operations and fine-grained state management: The cross queue acts as an asynchronous interaction hub, allowing the scalar pipeline to prepare operands for multiple vector instructions in parallel and temporarily store them in the queue. Vector execution units do not need to block and wait; they can directly retrieve ready operands from the queue when needed, achieving overlapping execution of scalar data supply and vector computation, which greatly improves the utilization rate of vector functional units.
[0051] Each entry in the vector issue module has a fine-grained operand readiness bit (*_rdy) for each operand, linked to the state of the cross queue. This design allows the vector issue logic to track the readiness of each operand precisely, and thus supports a dataflow-triggered mechanism in which an instruction can issue as soon as all of its operands are available, rather than the traditional in-order issue. This lays the foundation for a later implementation of more advanced out-of-order issue.
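The readiness tracking described above can be sketched as a small behavioral model. This is only an illustrative sketch, not the circuit of the invention: the names `IssueEntry`, `scalar_src`, `snoop_cross_queue` and `select_ready` are hypothetical, and the cross queue is reduced to a list of Reg_vld bits indexed by isid.

```python
from dataclasses import dataclass, field

@dataclass
class IssueEntry:
    iid: int
    # One readiness bit per source operand, e.g. {"vs2_rdy": True, "rs1_rdy": False}.
    rdy: dict = field(default_factory=dict)
    # Hypothetical link: which cross-queue entry (isid) feeds each scalar operand.
    scalar_src: dict = field(default_factory=dict)

    def ready(self) -> bool:
        # The entry may issue only when every operand bit is set.
        return all(self.rdy.values())

def snoop_cross_queue(entries, reg_vld):
    """Set the *_rdy bit of each scalar operand whose cross-queue entry is valid."""
    for entry in entries:
        for name, isid in entry.scalar_src.items():
            if reg_vld[isid]:
                entry.rdy[name] = True

def select_ready(entries):
    """Return the iids of all entries whose operands are all available."""
    return [entry.iid for entry in entries if entry.ready()]

entries = [
    IssueEntry(iid=0, rdy={"vs2_rdy": True, "rs1_rdy": False}, scalar_src={"rs1_rdy": 0}),
    IssueEntry(iid=1, rdy={"vs2_rdy": True, "rs1_rdy": False}, scalar_src={"rs1_rdy": 1}),
]
reg_vld = [False, True]  # the scalar operand for isid 1 happens to arrive first
snoop_cross_queue(entries, reg_vld)
ready_now = select_ready(entries)
```

In this model the younger instruction (iid 1) becomes ready before the older one, which is the situation a dataflow-triggered issue policy can exploit. Whether a ready instruction is actually issued ahead of older ones depends on the selection policy, which this sketch does not model.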
[0052] This invention eliminates the need to replicate an entire scalar physical register file, or to build a complex forwarding network, within the vector coprocessor in order to obtain scalar operands; instead, data exchange is performed solely through a centrally managed cross queue. This significantly simplifies the hardware architecture of the vector coprocessor, reducing design complexity and power consumption.
[0053] The scalar results of vector instructions are also temporarily stored in and committed through the cross queue, enabling ordered and non-blocking interaction between vector results and the scalar pipeline. An entry in the vector issue module can be released immediately after its computation result is written to the cross queue, without waiting for the entire vector instruction to be committed on the scalar side. This shortens the turnaround time of the vector issue queue, improves the utilization of the issue matrix, and increases the rate at which new vector instructions can be accepted.
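The early release of issue entries can be illustrated with a minimal model. It is a sketch under simplifying assumptions: `VectorIssueQueue`, `accept` and `complete` are hypothetical names, the cross queue is reduced to a dictionary from isid to scalar result, and commit on the scalar side is not modeled.

```python
class VectorIssueQueue:
    """Toy issue queue whose entries are freed when the result reaches the cross queue."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = {}  # iid -> isid of the cross-queue entry that receives the result

    def accept(self, iid, isid):
        """Try to take a new vector instruction; fail when every entry is occupied."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots[iid] = isid
        return True

    def complete(self, iid, result, cross_queue):
        """Write the scalar result to the cross queue and free the issue entry at once."""
        isid = self.slots.pop(iid)
        cross_queue[isid] = result

cross_queue = {}  # isid -> scalar result awaiting commit on the scalar side
viq = VectorIssueQueue(capacity=1)
first_accepted = viq.accept(iid=7, isid=0)
blocked = viq.accept(iid=8, isid=1)      # queue is full while iid 7 is in flight
viq.complete(7, 42, cross_queue)         # result parked in the cross queue
accepted_after = viq.accept(iid=8, isid=1)
```

The second instruction is accepted as soon as the first result is parked in the cross queue, even though nothing has yet been committed on the scalar side.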
[0054] The cross queue participates in the decoding and renaming phases of the scalar processor; the specific timing waveforms are shown in Figure 3. When the scalar decoding module recognizes a specific type of vector instruction, i.e., an instruction that needs to access a scalar register, it issues an allocation request to the cross-queue module. The cross-queue module allocates at most two free entries per cycle based on the input pointer in_ptr. After a successful allocation, the cross queue returns the index of the allocated entry as the isid (entry index in the cross queue) to the decoding and renaming module, writes the instruction's iid (instruction ID) to the entry, and initializes Reg_vld (the register's valid signal) to 0. Once the isid has been allocated and the decoding and renaming module has completed renaming, the scalar part of the instruction, namely the GPR (General Purpose Register) read operation together with the allocated isid, is sent to the scalar issue queue. For the vector part, the vector opcode and the allocated isid are sent to the vector micro-instruction splitting module of the vector coprocessor for micro-instruction splitting, after which the micro-instructions enter the vector dispatch module.
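The allocation behavior can be sketched as a cycle-level model. The sketch assumes a circular queue that stalls further requests when the entry at in_ptr is still occupied; the class name `CrossQueue`, the method `allocate` and the default depth are hypothetical, while `in_ptr`, `isid`, `iid` and `Reg_vld` follow the description.

```python
class CrossQueue:
    """Behavioral sketch of entry allocation (assumed circular, depth is illustrative)."""

    def __init__(self, depth=8):
        self.depth = depth
        self.entry_vld = [False] * depth   # entry is in use
        self.entry_iid = [None] * depth    # instruction ID stored in the entry
        self.reg_vld = [False] * depth     # scalar operand has been written
        self.in_ptr = 0                    # next entry to allocate

    def allocate(self, iids):
        """Model one cycle: grant at most two requests and return their isids."""
        isids = []
        for iid in iids[:2]:
            if self.entry_vld[self.in_ptr]:
                break  # no free entry at in_ptr, remaining requests must retry
            isid = self.in_ptr
            self.entry_vld[isid] = True
            self.entry_iid[isid] = iid
            self.reg_vld[isid] = False     # operand not yet supplied by the scalar side
            isids.append(isid)
            self.in_ptr = (self.in_ptr + 1) % self.depth
        return isids

cq = CrossQueue(depth=4)
first = cq.allocate([10, 11, 12])   # three requests, only two are granted this cycle
second = cq.allocate([12, 13])      # the stalled request retries in the next cycle
third = cq.allocate([14])           # queue is now full, nothing is granted
```

Each returned isid travels with both halves of the instruction, so the scalar side knows where to write the operand and the vector side knows where to read it.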
[0055] In Figure 3, clk is the clock signal; rstn is the active-low asynchronous reset signal; *_vld is the valid signal of the corresponding register; *_iid is the instruction ID held in the corresponding register; *_isid is the index of the entry allocated to the corresponding register; *_entry_iid is the instruction ID stored in the entry; and *_entry_vld is the valid signal of the entry.
[0056] The scalar portion executes within the scalar pipeline. Since scalar instructions typically execute faster than vector instructions, and dependencies between different scalar instructions can lead to out-of-order completion, the cross-queue module must support out-of-order write-back. After the scalar execution unit completes its GPR read operation, it does not send the data directly to the vector unit. Instead, using the carried isid as an address, it writes the read scalar data to the Reg_data field (the data field of the register) of the corresponding entry in the cross-queue module and sets the Reg_vld bit of that entry to 1. This action signifies that the scalar operands required for the vector instruction are ready. This process is completely independent of the vector pipeline state, achieving decoupling.
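Out-of-order write-back addressed by isid can be sketched as follows. This is an illustrative model only: the class `CrossQueueData` and its methods are hypothetical, and returning `None` for an operand that is not yet ready stands in for the vector side simply not seeing Reg_vld set.

```python
class CrossQueueData:
    """Sketch of the Reg_data / Reg_vld fields, written by isid in any order."""

    def __init__(self, depth=4):
        self.reg_data = [0] * depth
        self.reg_vld = [False] * depth

    def write_back(self, isid, value):
        """Scalar side: store the GPR value in the addressed entry and mark it valid."""
        self.reg_data[isid] = value
        self.reg_vld[isid] = True

    def read(self, isid):
        """Vector side: return the operand if ready, otherwise None."""
        return self.reg_data[isid] if self.reg_vld[isid] else None

cq = CrossQueueData()
cq.write_back(2, 0x40)   # a younger instruction finishes its GPR read first
early = cq.read(0)       # the older operand has not arrived yet
cq.write_back(0, 0x10)   # the older instruction completes later
```

Because each write is addressed by isid, completion order on the scalar side does not matter, and neither write depends on the state of the vector pipeline.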
[0057] A complex vector instruction may be broken down into multiple micro-instructions by the vector micro-instruction splitting module, and these micro-instructions may all need to access the same scalar operand. Therefore, the release of a cross queue entry cannot be determined solely by whether its data has been read once. Instead, when the vector micro-instruction splitting module performs the split, it writes the total number of resulting micro-instructions to the `mop_cnt` (micro-instruction counter) field of the corresponding cross queue entry, addressed by `isid`. Subsequently, whenever a `mop` belonging to that instruction is issued and executed, the vector issue module notifies the cross queue to decrement the corresponding `mop_cnt`. Only when `mop_cnt` reaches 0, indicating that none of the vector micro-operations associated with that entry still requires the scalar operand, does the entry meet the release condition. The release logic checks entries in order according to `out_ptr` (the output pointer of the cross queue). If the entry pointed to by `out_ptr` has `mop_cnt` equal to 0, the entry is reclaimed and `out_ptr` is advanced. At most two entries are released per cycle; the specific timing is shown in Figure 4. In Figure 4, clk is the clock signal; *_vld is the valid signal of the corresponding register; *_mop_num is the micro-instruction number in the corresponding register; *_entry_mop_num is the micro-instruction number in the corresponding entry; *_mop_vld is the valid signal of the micro-instruction; *_entry_data_vld is the valid signal of the data in the entry; *_entry_vld is the valid signal of the entry; *_entry_rls_vld is the release-valid signal of the entry; *_entry_mop_cnt is the micro-instruction counter of the entry; and *_entry_done indicates a released entry.
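The counter-based release rule can be sketched as a cycle-level model. It is a sketch under assumptions: the class `ReleaseModel` and its methods are hypothetical, allocation is folded into `set_mop_cnt` for brevity, and the queue is assumed to be circular; `mop_cnt`, `out_ptr` and the limit of two releases per cycle follow the description.

```python
class ReleaseModel:
    """Sketch of in-order entry release driven by mop_cnt and out_ptr."""

    def __init__(self, depth=4):
        self.depth = depth
        self.entry_vld = [False] * depth
        self.mop_cnt = [0] * depth
        self.out_ptr = 0

    def set_mop_cnt(self, isid, count):
        """Splitting module: record how many micro-ops will use this entry."""
        self.entry_vld[isid] = True   # simplification: allocation is folded in here
        self.mop_cnt[isid] = count

    def mop_issued(self, isid):
        """Issue module: one micro-op of this instruction has issued and executed."""
        assert self.mop_cnt[isid] > 0
        self.mop_cnt[isid] -= 1

    def release_cycle(self):
        """Model one cycle: reclaim at most two entries, strictly in out_ptr order."""
        released = []
        for _ in range(2):
            ptr = self.out_ptr
            if not self.entry_vld[ptr] or self.mop_cnt[ptr] != 0:
                break  # the head entry is still needed, so younger entries must wait
            self.entry_vld[ptr] = False
            released.append(ptr)
            self.out_ptr = (ptr + 1) % self.depth
        return released

rm = ReleaseModel(depth=4)
rm.set_mop_cnt(0, 2)   # instruction split into two micro-ops
rm.set_mop_cnt(1, 1)
rm.set_mop_cnt(2, 1)
rm.mop_issued(1)
rm.mop_issued(2)
blocked = rm.release_cycle()   # entry 0 still has pending micro-ops
rm.mop_issued(0)
rm.mop_issued(0)
cycle_a = rm.release_cycle()   # entries 0 and 1 are reclaimed together
cycle_b = rm.release_cycle()   # entry 2 follows one cycle later
```

Entries 1 and 2 reach a count of zero first but are held until the head entry drains, which reflects the in-order check at `out_ptr`.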
[0058] In one or more embodiments, a scalar-vector heterogeneous pipeline interaction method, based on the aforementioned scalar-vector heterogeneous pipeline interaction chip, includes the following steps. Step 1: The scalar processor decodes the fetched vector instructions and splits them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for execution in its vector pipeline. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module, which allocates an entry for each vector instruction that requires scalar operands. Step 2: When the vector pipeline needs to use scalar operands, the vector coprocessor retrieves the required scalar operands from the corresponding entries in the cross-queue module. After a vector instruction has executed, its execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.
[0059] The above are merely preferred embodiments of the present invention and are not intended to limit the present invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A scalar-vector heterogeneous pipelined interaction chip, characterized in that it comprises: a scalar processor, a vector coprocessor, and a cross-queue module that communicates with the scalar processor and the vector coprocessor; the scalar processor is equipped with a scalar pipeline, and the vector coprocessor is equipped with a vector pipeline. The scalar processor decodes the fetched vector instructions and divides them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for vector pipeline execution within the vector coprocessor. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module. The cross-queue module allocates an entry for each vector instruction that requires scalar operands. When the vector pipeline needs to use scalar operands, the vector coprocessor retrieves the required scalar operands from the corresponding entries in the cross-queue module. After the vector instruction is executed, the vector instruction execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.
2. The scalar-vector heterogeneous pipelined interaction chip as described in claim 1, characterized in that the scalar processor includes an instruction fetch module, a scalar decoding module, a scalar issue module, a scalar execution module, and a scalar register file; the instruction fetch module is used to fetch instructions and transmit them to the scalar decoding module. The scalar decoding module is used to decode the instructions and determine whether they are vector instructions. If not, they directly enter the scalar execution module. If they are, one data stream assigns an instruction ID to each vector instruction and further determines whether it requires scalar operands; if scalar operands are required, the numbers of the required scalar registers in the scalar register file are decoded and sent to the cross-queue module along with the vector instruction ID. The other data stream allows the vector instructions to directly enter the vector coprocessor.
3. The scalar-vector heterogeneous pipelined interaction chip as described in claim 1, characterized in that the vector coprocessor includes a vector micro-instruction splitting module, a vector dispatch module, a vector register renaming module, and a vector reordering cache module. The vector micro-instruction splitting module is used to split vector instructions into micro-instructions and then transmit them to the vector dispatch module. The vector dispatch module forwards the two source registers and one destination register to be renamed to the vector register renaming module for processing. After completing the renaming, the vector register renaming module returns the renaming result to the vector dispatch module. The vector dispatch module then forwards the renaming result and instruction ID information to the vector reordering cache module for entry allocation.
4. The scalar-vector heterogeneous pipelined interaction chip as described in claim 3, characterized in that, after the vector instruction is committed in the vector reordering cache, the cross-queue module is also used to perform write-back operations on scalar operands.
5. The scalar-vector heterogeneous pipelined interaction chip as described in claim 3, characterized in that, after the vector coprocessor completes the forwarding operation, it further decodes the vector instructions into specific operation types.
6. The scalar-vector heterogeneous pipelined interaction chip as described in claim 5, characterized in that the specific operation types produced by decoding the vector instructions include arithmetic instructions, memory access instructions, and mask instructions.
7. The scalar-vector heterogeneous pipelined interaction chip as described in claim 5, characterized in that, after decoding is complete, all relevant information about the instruction and the decoding result are sent to the vector issue queue module. The vector issue queue module allocates an entry for each vector instruction that enters it to store all source operand information of that vector instruction.
8. The scalar-vector heterogeneous pipelined interaction chip as described in claim 7, characterized in that, once all operands are ready, the vector instruction is in a ready-to-issue state. When the vector instruction is the oldest in the issue queue, it is issued to the execution unit for computation.
9. The scalar-vector heterogeneous pipelined interaction chip as described in claim 1, characterized in that the entry stores the vector instruction ID, scalar register number, and scalar operand.
10. A scalar-vector heterogeneous pipeline interaction method, characterized in that it is based on the scalar-vector heterogeneous pipelined interaction chip as described in any one of claims 1-9, and comprises: the scalar processor decodes the fetched vector instructions and splits them into two data streams. One data stream sends the vector instructions directly to the vector coprocessor for vector pipeline execution within the vector coprocessor. The other data stream sends the vector instruction ID and the decoded scalar register number to the cross-queue module, which then allocates an entry for each vector instruction that requires scalar operands. When the vector pipeline needs to use scalar operands, the vector coprocessor retrieves the required scalar operands from the corresponding entries in the cross-queue module. After the vector instruction is executed, the execution result is returned to the cross-queue module via the vector issue queue and written to the corresponding entry, thereby achieving decoupled interaction between the scalar pipeline and the vector pipeline.