Vector processor system and method based on self-routing logic
The vector processor system, which utilizes self-routing logic, enables dynamic reconfiguration and bubble-free scheduling of vector instructions across multiple execution units. This addresses the performance bottleneck of vector processors during the execution of non-matrix operators, thereby improving instruction-level parallelism and computational efficiency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING TSINGMICRO INTELLIGENT TECH CO LTD
- Filing Date
- 2026-02-14
- Publication Date
- 2026-06-19
AI Technical Summary
Existing vector processors have limited instruction-level parallelism during the execution of non-matrix operators. The lack of reconfigurable data paths between vector execution units leads to performance loss under static scheduling and makes it difficult to maintain stability when dealing with complex data dependencies or irregular computation patterns.
A vector processor system based on self-routing logic is adopted. Through the combination of vector execution units, arbitration and routing modules and vector register files, the dynamic reconstruction of vector instructions among multiple execution units and bubble-free chained pipeline scheduling are realized. Data dependencies are dynamically determined and real-time memory access addresses are generated, avoiding the performance loss of static interconnect structure.
It improves the instruction-level parallelism and data consistency of non-matrix operators, significantly reduces pipeline bubbles, enhances overall computational efficiency and energy efficiency ratio, and ensures the stability and continuity of the chain pipeline structure in dynamic operating scenarios.
Figure CN122240184A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of vector processor microarchitecture design, and in particular to a vector processor system and method based on self-routing logic.
Background Technology
[0002] With the widespread application of large-scale artificial intelligence models (such as large language models and multimodal models), the inference and training processes have placed higher demands on computing power density, parallel processing capabilities, and energy efficiency. While existing general-purpose processors (CPUs) and standard vector processors have relatively mature instruction and hardware optimization capabilities for matrix multiplication operators, they are still generally limited by microarchitectural bottlenecks such as insufficient instruction-level parallelism (ILP), fixed data paths, and low execution unit utilization for the large number of non-matrix computations in AI workloads, such as normalization, non-matrix dependent operations in attention mechanisms, activation function operations, and large-scale element-wise operations. These bottlenecks make it difficult to meet the high throughput performance requirements of high-concurrency scenarios.
[0003] The existing NeuralScale-VME technology builds a vector execution engine based on the RISC-V Vector extension. It accelerates element-wise operations through the SIMD (Single Instruction Multiple Data) mechanism and sets up dedicated functional units for complex arithmetic operations commonly found in activation functions to improve the execution efficiency of such operations. However, this type of computing architecture mainly relies on static scheduling, and lacks a data routing mechanism that can be reconfigured at runtime between execution units, making it impossible to dynamically organize instruction execution paths according to the characteristics of real-time data streams. For non-matrix operators with complex control dependencies or irregular data access patterns, this architecture is significantly limited in terms of improving instruction-level parallelism and overall processing efficiency.
[0004] Another existing technology uses CGRA (Coarse-Grained Reconfigurable Architecture) as its computing architecture, which achieves instruction-level parallelism and dataflow parallelism of loops by configuring multiple processing elements (PEs). Although CGRA has certain advantages in reconfigurability, its hardware configuration latency is relatively long, and it lacks an instruction over-issue capability during the execution phase. This means that switching between different loops requires reloading the configuration and waiting for instruction issuance, which can easily generate a large number of execution bubbles in the pipeline, resulting in a decrease in overall throughput and limiting system efficiency.
[0005] Therefore, there is an urgent need for a new vector processor microarchitecture that features reconfigurable data paths, supports fine-grained dynamic scheduling, and can efficiently organize the parallel operation of heterogeneous execution units.
[0006] This section is intended to provide background or context for the embodiments of the invention set forth in the claims. The description herein is not admitted to be prior art merely because it is included in this section.
Summary of the Invention
[0007] This invention provides a vector processor system based on self-routing logic to achieve dynamic reconstruction of data paths for vector instructions among multiple vector execution units and bubble-free chained pipeline scheduling, thereby improving instruction-level parallelism, data consistency, and overall computational efficiency during the execution of non-matrix operators.
[0008] The vector processor system based on self-routing logic includes: a vector execution unit, an arbitration and routing module, and a vector register file. The vector execution unit is used to obtain the corresponding first vector instruction, determine the instruction state of the first vector instruction according to the state machine corresponding to the first vector instruction, and send the second vector instruction whose instruction state is waiting for arbitration, together with its corresponding arbitration request instruction, to the arbitration and routing module. The arbitration and routing module is used to arbitrate each of the second vector instructions based on the arbitration request instruction to determine the execution permission of each of the second vector instructions, and to generate the memory access address of the vector register file based on the second vector instruction that has obtained the execution permission. The vector execution unit obtains read data from the vector register file based on the memory access address to perform vector calculation.
[0009] In some embodiments, the system further includes: a decoding and scheduling module, configured to decode each of the acquired first vector instructions to obtain decoding information; and to send each of the first vector instructions to the instruction queue in the corresponding vector execution unit according to the operation type field in the decoding information.
[0010] In some embodiments, the arbitration and routing module includes: an arbiter, configured to receive the second vector instruction and its corresponding arbitration request instruction; determine, based on the source register field and the target register field in the second vector instruction, whether there is a data dependency relationship between the second vector instruction and a third vector instruction that is in a waiting execution state or an execution state; and, if there is no data dependency relationship, determine that the second vector instruction has obtained the execution permission.
[0011] In some embodiments, the arbitration and routing module includes a self-routing logic module, which, after determining that the second vector instruction has obtained the execution permission, generates a memory access address for the vector register file based on the source register field and the target register field in the second vector instruction, and sends the memory access address to the vector register file through a first multiplexer.
[0012] In some embodiments, the memory access address includes a read address; the first multiplexer selects the corresponding source vector register in the vector register file according to the read address, reads the read data in the source vector register, and sends the read data to the input buffer in the vector execution unit.
[0013] In some embodiments, the vector execution unit further includes: a vector array, used to obtain the read data from the input buffer, perform vector calculations based on the read data, and output the calculation results to the output buffer.
[0014] In some embodiments, the memory access address further includes a write address; the vector execution unit stores the calculation result into the vector register file through a second multiplexer; the second multiplexer selects the corresponding target vector register in the vector register file according to the write address, reads the calculation result from the output buffer, and writes it into the target vector register.
[0015] In some embodiments, when any of the vector execution units is in a paused state, the vector execution unit is further configured to send the back pressure signal up level by level to the vector execution units that have a data dependency relationship with it; after receiving the back pressure signal, each of the vector execution units performs a latching operation.
[0016] This invention also provides a vector processor method based on self-routing logic, which enables dynamic reconstruction of data paths for vector instructions among multiple vector execution units and bubble-free chained pipeline scheduling, thereby improving instruction-level parallelism, data consistency, and overall computational efficiency during the execution of non-matrix operators.
[0017] This vector processor method based on self-routing logic includes: The vector execution unit obtains the corresponding first vector instruction, determines the instruction state of the first vector instruction based on the state machine corresponding to the first vector instruction, and sends the second vector instruction whose instruction state is waiting for arbitration and its corresponding arbitration request instruction to the arbitration and routing module. The arbitration and routing module arbitrates each of the second vector instructions based on the arbitration request instruction to determine the execution permission of each of the second vector instructions; it generates the memory access address of the vector register file based on the second vector instruction that has obtained the execution permission; and the vector execution unit retrieves read data from the vector register file based on the memory access address to perform vector calculation.
[0018] This invention also provides a chip, including the above-described vector processor system based on self-routing logic.
[0019] Specifically, this invention is applicable to wafer-level chips, wherein the wafer-level chip can be configured with multiple computing cores, some or all of which include the matrix multiplication and addition operation unit of this invention; the chip can be used in scenarios such as AI large model training and high-performance scientific computing, and by integrating a high-density MAC array at the wafer-level scale, the advantages of this invention, such as multi-data type fusion, low power consumption, and high parallelism, are realized.
[0020] This invention also provides a board card including the above-mentioned chip.
[0021] This invention also provides an electronic device including the aforementioned board card.
[0022] The vector processor system and method based on self-routing logic provided in this invention uses a state machine within the vector execution unit to finely manage vector instructions. This allows vector instructions in the waiting-for-arbitration state to initiate arbitration requests to the arbitration and routing module in advance, thereby reducing pause time during instruction delivery and improving instruction throughput. The arbitration and routing module dynamically determines data dependencies based on the arbitration request instructions and generates real-time read / write addresses for the vector register file without requiring pre-configuration of the interconnect structure by the compiler. This enables data paths between vector execution units to be reconstructed in real-time based on the data dependencies of vector instructions, avoiding performance losses caused by traditional static scheduling methods when handling complex data dependencies or irregular computation patterns. Accessing the vector register file based on the real-time generated memory access addresses allows the vector execution unit to obtain operands with lower memory access latency, thus accelerating the vector computation process. This vector processor system significantly reduces pipeline bubbles and improves instruction-level parallelism while maintaining instruction execution correctness. This makes the chained pipeline structure more stable and continuous in dynamic operating scenarios, thereby improving the execution efficiency of non-matrix vector operators and the overall system energy efficiency ratio.
Attached Figure Description
[0023] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort. In the drawings:
Figure 1 is a schematic diagram of the workflow of a vector processor system based on self-routing logic in an embodiment of the present invention;
Figure 2 is a schematic diagram of the execution pipeline of a vector processor system based on self-routing logic in an embodiment of the present invention;
Figure 3 is a schematic diagram of the state machine of the instruction queue in an embodiment of the present invention;
Figure 4 is a schematic diagram of the fully interconnected crossbar network in an embodiment of the present invention;
Figure 5 is a schematic diagram of multiple pipeline switching in an embodiment of the present invention;
Figure 6 is a schematic diagram of multi-vector instruction pipeline execution in an embodiment of the present invention;
Figure 7 is a schematic diagram of the pipelined backpressure logic of a vector processor system based on self-routing logic in an embodiment of the present invention.
Detailed Implementation
[0024] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain the present invention, but are not intended to limit the present invention. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of this application can be arbitrarily combined with each other. The acquisition, storage, use, and processing of data in the technical solutions of this application all comply with relevant laws and regulations. The user information in the embodiments of this application is obtained through legal and compliant means, and the acquisition, storage, use, and processing of user information have been authorized and agreed upon by the customer.
[0025] To facilitate understanding of the technical solution provided in this application, the relevant content of the technical solution in this application will be explained below.
[0026] To address the technical challenges of existing vector processors, such as limited instruction-level parallelism during non-matrix operator execution, lack of reconfigurable data paths between vector execution units (VEUs), reliance on static compilation strategies for vector instruction scheduling, frequent pipeline stalls caused by register read / write conflicts, and the difficulty in maintaining stability of multi-vector execution unit chained pipeline structures under operating conditions, this application proposes a vector processor system based on self-routing logic.
[0027] This vector processor system comprises vector execution units, an arbitration and routing module, and a vector register file (VRF). Dynamic scheduling and real-time data routing mechanisms are introduced during the instruction issue phase. The vector processor system can automatically determine the existence of data dependencies based on the register access fields of vector instructions, and after successful arbitration, the self-routing logic generates real-time read / write addresses for the vector register file. This eliminates the need for pre-compiled static interconnect configurations, allowing for the construction of arbitrary chained execution pipelines during runtime. High-throughput vector computation is achieved through a high-bandwidth execution path composed of input buffers, vector arrays, and output buffers. Combining global back-pressure and hold mechanisms, when any vector execution unit enters a paused state due to processing latency, throughput mismatch, or memory miss, the back-pressure signal can be passed up the pipeline stages, freezing the relevant pipeline states. This ensures the stability and consistency of the chained pipeline across vector execution units under dynamic operating conditions.
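The back-pressure and hold behavior described above can be illustrated with a small cycle-level sketch. It is only an illustration under stated assumptions: the chained pipeline is modeled as a linear list of stages (index 0 is the most upstream unit), the function names `hold_mask` and `advance` are hypothetical, and the choice to let units downstream of the stall keep draining (receiving an empty slot) is an assumption that the text does not spell out.

```python
from typing import List, Optional, Tuple


def hold_mask(stalled: List[bool]) -> List[bool]:
    """Return hold[i] == True if stage i, or any stage downstream of it, is stalled.

    Walking from the last stage back to the first models the back-pressure
    signal being passed upstream level by level.
    """
    hold = [False] * len(stalled)
    pressure = False
    for i in reversed(range(len(stalled))):
        pressure = pressure or stalled[i]
        hold[i] = pressure
    return hold


def advance(
    stages: List[Optional[str]],
    incoming: Optional[str],
    stalled: List[bool],
) -> Tuple[List[Optional[str]], bool]:
    """Advance the chain by one cycle; held stages latch their contents."""
    hold = hold_mask(stalled)
    nxt = list(stages)
    for i in range(len(stages)):
        if hold[i]:
            continue                 # latched: keep current contents
        if i == 0:
            nxt[i] = incoming        # first stage accepts new work
        elif hold[i - 1]:
            nxt[i] = None            # upstream is frozen, nothing arrives (assumption)
        else:
            nxt[i] = stages[i - 1]   # normal hand-off from the upstream stage
    accepted = not hold[0]           # was the incoming item consumed this cycle?
    return nxt, accepted


if __name__ == "__main__":
    chain = ["c", "b", "a"]          # stage 2 holds the oldest item
    print(advance(chain, "d", [False, False, False]))
    print(advance(chain, "d", [False, True, False]))
```

In the second call the middle stage is stalled, so it and the stage upstream of it keep their contents and the new item is not accepted, which is the state freeze the paragraph describes.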
[0028] This invention provides a vector processor system based on self-routing logic. As shown in Figure 1, the vector processor system includes: a vector execution unit, an arbitration and routing module, and a vector register file.
[0029] The vector execution unit is used to acquire the corresponding first vector instruction, determine the instruction state of the first vector instruction based on the state machine corresponding to the first vector instruction, and send the second vector instruction with the instruction state of waiting for arbitration and its corresponding arbitration request instruction to the arbitration and routing module.
[0030] The arbitration and routing module arbitrates each second vector instruction based on the arbitration request instruction to determine the execution authority of each second vector instruction. Based on the second vector instruction that has obtained execution authority, it generates the memory access address for the vector register file. The vector execution unit then retrieves data from the vector register file based on the memory access address and performs vector calculations.
[0031] According to the above embodiments, the vector execution unit's internal state machine performs fine-grained management of vector instructions, enabling vector instructions in the waiting-for-arbitration state to initiate arbitration requests to the arbitration and routing module in advance. This reduces pause time during instruction delivery and improves instruction throughput. The arbitration and routing module dynamically determines data dependencies based on the arbitration request instructions and generates real-time read / write addresses for the vector register file without requiring pre-configuration of the interconnect structure by the compiler. This allows data paths between vector execution units to be reconstructed in real-time based on the data dependencies of vector instructions, avoiding performance losses caused by traditional static scheduling methods when handling complex data dependencies or irregular computation patterns. Accessing the vector register file based on the real-time generated memory access addresses allows the vector execution unit to obtain operands with lower memory access latency, thereby accelerating the vector computation process. This vector processor system significantly reduces pipeline bubbles and improves instruction-level parallelism while maintaining instruction execution correctness. This makes the chained pipeline structure more stable and continuous in dynamic operating scenarios, thereby improving the execution efficiency of non-matrix vector operators and the overall system energy efficiency ratio.
[0032] In some embodiments, as shown in Figure 1, the vector processor system based on self-routing logic also includes a decoding and dispatch module, which decodes each acquired first vector instruction to obtain decoding information. Based on the operation type field in the decoding information, each first vector instruction is sent to the instruction queue in the corresponding vector execution unit.
[0033] In embodiments of the present invention, as shown in Figures 1 to 2, in the decoding and dispatching stage of the vector processor, the decoding and dispatch module decodes multiple vector instructions obtained from the central processing unit (CPU) to generate corresponding decoding information.
[0034] The vector instructions of this invention adopt an instruction format that conforms to the RISC-V instruction set architecture standard, as shown in Table 1 below. The decoding information includes: the first source vector register (vs1) field, the second source vector register (vs2) field, the destination vector register (vd) field, the vector mask register (vm) field, the operation type (opcode) field, and the function field.
[0035] Table 1
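Table 1 is not reproduced in this text. As a rough illustration of the fields listed in paragraph [0034], the sketch below extracts them from a 32-bit instruction word. The bit positions are an assumption taken from the standard vector arithmetic instruction layout of the RISC-V "V" extension, and the split of the "function field" into `funct6` and `funct3` is likewise an assumption; the patent's own Table 1 may differ. The names `VectorFields` and `decode_vector_fields` are hypothetical.

```python
from typing import NamedTuple


class VectorFields(NamedTuple):
    opcode: int  # bits [6:0], operation type
    vd: int      # bits [11:7], destination vector register
    funct3: int  # bits [14:12], function field (operand category)
    vs1: int     # bits [19:15], first source vector register
    vs2: int     # bits [24:20], second source vector register
    vm: int      # bit 25, vector mask control
    funct6: int  # bits [31:26], function field (operation selector)


def decode_vector_fields(word: int) -> VectorFields:
    """Split a 32-bit instruction word into its register and control fields."""
    return VectorFields(
        opcode=word & 0x7F,
        vd=(word >> 7) & 0x1F,
        funct3=(word >> 12) & 0x7,
        vs1=(word >> 15) & 0x1F,
        vs2=(word >> 20) & 0x1F,
        vm=(word >> 25) & 0x1,
        funct6=(word >> 26) & 0x3F,
    )


if __name__ == "__main__":
    # 0x02110057 is used here as a sample word: opcode 0x57, vd=0, vs1=2, vs2=1, vm=1.
    print(decode_vector_fields(0x02110057))
```

The five-bit register fields match the 32-entry vector register file described later in the text.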
[0036] As shown in Figure 1, each vector execution unit has its own independent instruction queue. After decoding the vector instructions, the decoding and dispatch module encapsulates the vector instructions according to the operation type (opcode) field in the decoding information, and sends each encapsulated vector instruction to the instruction queue of the corresponding vector execution unit. The operation type field includes: vector addition (Vector ADD, vadd), vector multiplication (Vector Multiply, vmul), vector comparison (Vector Compare, vcmp), and vector type conversion (Vector CVT, vcvt). The type of vector execution unit corresponds one-to-one with the instruction operation type. The vector execution unit types include: vector addition execution unit (Vector ADD), vector multiplication execution unit (Vector MUL), vector comparison execution unit (Vector CMP), and vector type conversion execution unit (Vector CVT).
[0037] According to the above embodiments, standardized RISC-V vector instruction decoding information is generated during the decoding stage, and vector instructions are precisely assigned to the independent instruction queues of the corresponding vector execution units according to the operation type. This makes the organization of the vector instruction stream clearer and the scheduling path simpler, thereby effectively improving the utilization efficiency of execution resources. Through the operation type-based instruction routing mechanism, structural resource contention between different types of vector instructions in the same execution unit is avoided, reducing execution conflicts and improving instruction-level parallelism. Combined with the independent instruction queues set within each vector execution unit, different types of vector instructions can achieve higher instruction throughput while maintaining non-interference in data paths, ultimately significantly improving the overall execution efficiency of the vector processor in non-matrix operation scenarios.
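The operation-type-based dispatch described above can be sketched as follows. This is a minimal software model, not the hardware design: the `VectorInstr` record, the `UNIT_FOR_OP` table, and the `dispatch` function are hypothetical names, and instructions are identified by mnemonic rather than by an encoded opcode field for readability.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, Dict, Iterable


@dataclass(frozen=True)
class VectorInstr:
    op: str   # operation type: "vadd", "vmul", "vcmp" or "vcvt"
    vd: int   # destination vector register
    vs1: int  # first source vector register
    vs2: int  # second source vector register


# One execution unit type per operation type, as stated in the text.
UNIT_FOR_OP: Dict[str, str] = {
    "vadd": "Vector ADD",
    "vmul": "Vector MUL",
    "vcmp": "Vector CMP",
    "vcvt": "Vector CVT",
}


def dispatch(instrs: Iterable[VectorInstr]) -> Dict[str, Deque[VectorInstr]]:
    """Place each decoded instruction in the queue of its matching unit."""
    queues: Dict[str, Deque[VectorInstr]] = {u: deque() for u in UNIT_FOR_OP.values()}
    for instr in instrs:
        queues[UNIT_FOR_OP[instr.op]].append(instr)  # program order kept per queue
    return queues


if __name__ == "__main__":
    program = [
        VectorInstr("vadd", 0, 1, 2),
        VectorInstr("vmul", 3, 0, 4),
        VectorInstr("vadd", 5, 6, 7),
    ]
    for unit, queue in dispatch(program).items():
        print(unit, [i.op for i in queue])
```

Because each queue only ever holds one operation type, instructions of different types never compete for the same unit, which is the structural-contention point the paragraph makes.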
[0038] In some embodiments, as shown in Figure 3, the instruction queue contains a per-instruction finite state machine (FSM) for managing each instruction. This state machine stores and controls the instruction states of multiple vector instructions. The instruction states include: Waiting Arbiter, Arbiter Done – Waiting Processing, Processing, and Processing Done – Waiting Exit.
[0039] As shown in Figures 1 to 3, during the instruction issue phase of the vector processor, after receiving the vector instructions sent by the decoding and dispatch module, the instruction queue determines the current instruction state according to the state machine entry corresponding to each vector instruction, and sends the vector instructions in the waiting-for-arbitration state to the arbiter for data dependency judgment.
[0040] According to the above embodiments, each vector instruction is configured with an independent state machine in the instruction queue for fine-grained control between four instruction states. This allows multiple vector instructions to be executed continuously within the same vector execution unit in a bubble-free pipeline manner. This instruction-by-instruction state management mechanism allows vector instructions to enter the instruction queue in advance for over-issuance before obtaining execution privileges, avoiding the problem of insufficient instruction delivery bandwidth due to limitations in the CPU's issue width. By allowing vector instructions to enter the instruction queue in advance while waiting for arbitration, this invention effectively reduces pipeline idle cycles caused by the issue rate being lower than the execution rate, significantly improving the utilization and instruction throughput of the vector execution unit, thereby improving the execution performance of the vector processor in high-concurrency non-matrix operation scenarios.
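A minimal sketch of the per-instruction state machine follows, assuming the four states advance strictly in the order listed. The event names (`grant`, `start`, `finish`), the `InstructionQueue` class, and in-order retirement from the head of the queue are all assumptions made for illustration; the text names only the states.

```python
from enum import Enum, auto
from typing import List, Tuple


class InstrState(Enum):
    WAITING_ARBITER = auto()
    ARBITER_DONE_WAITING_PROCESSING = auto()
    PROCESSING = auto()
    PROCESSING_DONE_WAITING_EXIT = auto()


# (current state, event) -> next state; event names are hypothetical.
TRANSITIONS = {
    (InstrState.WAITING_ARBITER, "grant"): InstrState.ARBITER_DONE_WAITING_PROCESSING,
    (InstrState.ARBITER_DONE_WAITING_PROCESSING, "start"): InstrState.PROCESSING,
    (InstrState.PROCESSING, "finish"): InstrState.PROCESSING_DONE_WAITING_EXIT,
}


class InstructionQueue:
    def __init__(self) -> None:
        self.entries: List[Tuple[str, InstrState]] = []

    def push(self, name: str) -> None:
        """Accept an instruction ahead of time; it starts out waiting for the arbiter."""
        self.entries.append((name, InstrState.WAITING_ARBITER))

    def signal(self, name: str, event: str) -> None:
        """Apply an event to one entry; events that do not fit its state are ignored."""
        self.entries = [
            (n, TRANSITIONS.get((s, event), s) if n == name else s)
            for n, s in self.entries
        ]

    def arbitration_requests(self) -> List[str]:
        """Entries that should currently be raising an arbitration request."""
        return [n for n, s in self.entries if s is InstrState.WAITING_ARBITER]

    def retire(self) -> List[str]:
        """Pop finished instructions from the head of the queue, in order (assumption)."""
        retired = []
        while self.entries and self.entries[0][1] is InstrState.PROCESSING_DONE_WAITING_EXIT:
            retired.append(self.entries.pop(0)[0])
        return retired


if __name__ == "__main__":
    q = InstructionQueue()
    q.push("i0")
    q.push("i1")                      # queued early, before i0 has even been granted
    q.signal("i0", "grant")
    print(q.arbitration_requests())   # only i1 is still requesting arbitration
```

Keeping a separate state per entry is what lets a later instruction sit in the queue and request arbitration while an earlier one is still being processed.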
[0041] In some embodiments, as shown in Figure 1, the arbitration and routing module includes an arbiter, used to receive the second vector instruction and its corresponding arbitration request instruction. Based on the source register field and destination register field in the second vector instruction, it determines whether there is a data dependency between the second vector instruction and a third vector instruction that is in a waiting execution state or an execution state. If there is no data dependency, the second vector instruction is granted execution permission. Here, the second vector instruction is a vector instruction in a waiting-for-arbitration state, and the third vector instruction is a vector instruction in a waiting execution state or an execution state.
[0042] In embodiments of the present invention, as shown in Figures 1 to 2, during the arbitration and routing phase of the vector processor, the arbiter receives vector instructions awaiting arbitration and their corresponding arbitration request instructions from the instruction queues of each vector execution unit. Upon receiving an arbitration request instruction, the arbiter arbitrates the execution permission of the vector instruction. Based on the first source vector register field, the second source vector register field, and the target vector register field of the vector instruction, the arbiter determines whether the registers accessed by the vector instruction have data dependencies with other preceding vector instructions that are either awaiting execution or in execution. These data dependencies include: Read After Write (RAW), Write After Read (WAR), and Write After Write (WAW).
[0043] In this context, RAW indicates that the preceding vector instruction performs a write operation on a register, while the following vector instruction reads that register. WAR indicates that the preceding vector instruction reads a register, while the following vector instruction writes to the same register. WAW indicates that two vector instructions sequentially write to the same register.
[0044] When the arbiter determines that there is no data dependency between the vector instruction in the waiting-for-arbitration state and the preceding vector instruction, the arbiter grants execution permission to that vector instruction and sends it to the self-routing logic module to generate a real-time memory access address.
[0045] When the arbiter determines that a vector instruction in the waiting-for-arbitration state has a data dependency with its preceding vector instruction, the arbiter withholds execution permission. The vector instruction must wait for the preceding vector instruction to complete its read or write operation before it is sent to the self-routing logic module. This ensures the correct read and write order of the vector register file and avoids operand read errors or write overwrite errors.
[0046] For example, suppose vector instruction 1 is: `vadd v0, v1, v2`; and vector instruction 2 is: `vsub v3, v1, v0`. Here, vector instruction 1 performs element-by-element addition on corresponding elements of vector registers v1 and v2, and writes the result to the target vector register v0. Vector instruction 2 performs element-by-element subtraction between corresponding elements of vector register v1 and vector register v0, and writes the result to vector register v3.
[0047] Since the second source operand of vector instruction 2 and the destination operand of vector instruction 1 are both vector register v0, the two vector instructions form a read-after-write (RAW) dependency on register v0: vector instruction 1 writes v0 and vector instruction 2 then reads it. Therefore, vector instruction 2 cannot obtain execution permission until vector instruction 1 completes its write operation on vector register v0. Vector instruction 2 waits at the arbiter for vector instruction 1 to complete its write operation before it can begin execution, thus ensuring the correctness of the source operand read and the consistency of the register data.
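The dependency check can be sketched with the definitions of RAW, WAR, and WAW given in paragraph [0043]. Under those definitions, the `vadd` / `vsub` pair above is classified as RAW, since the earlier instruction writes v0 and the later one reads it. The `VectorInstr` record and the functions `hazards` and `grant` are hypothetical names; the sketch compares register indices only and does not model timing.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class VectorInstr:
    op: str
    vd: int   # destination vector register
    vs1: int  # first source vector register
    vs2: int  # second source vector register


def hazards(earlier: VectorInstr, later: VectorInstr) -> Set[str]:
    """Classify register dependencies of `later` on `earlier`."""
    found: Set[str] = set()
    if earlier.vd in (later.vs1, later.vs2):
        found.add("RAW")  # earlier writes a register that later reads
    if later.vd in (earlier.vs1, earlier.vs2):
        found.add("WAR")  # earlier reads a register that later writes
    if later.vd == earlier.vd:
        found.add("WAW")  # both write the same register
    return found


def grant(candidate: VectorInstr, in_flight: Iterable[VectorInstr]) -> bool:
    """Grant execution permission only if no in-flight instruction conflicts."""
    return all(not hazards(earlier, candidate) for earlier in in_flight)


if __name__ == "__main__":
    i1 = VectorInstr("vadd", vd=0, vs1=1, vs2=2)  # vadd v0, v1, v2
    i2 = VectorInstr("vsub", vd=3, vs1=1, vs2=0)  # vsub v3, v1, v0
    print(hazards(i1, i2))    # dependency of instruction 2 on instruction 1
    print(grant(i2, [i1]))    # while instruction 1 is still in flight
    print(grant(i2, []))      # after instruction 1 has completed
```

In this model, an instruction leaves `in_flight` once its read or write has completed, at which point a waiting instruction can be granted on the next arbitration attempt.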
[0048] According to the above embodiments, the arbiter determines data dependencies from the source register fields and the target register field of the vector instruction. This allows it to identify and avoid erroneous reads or write overwrites caused by disordered register read / write order before the vector instruction enters the execution stage. The arbiter allows the vector instruction to enter the self-routing logic module for execution only when there is no data conflict. When a data dependency exists, the scheduling of the vector instruction is automatically delayed, ensuring that the vector register file maintains the correctness and consistency of data access even in high-concurrency read / write scenarios. By dynamically resolving register dependencies at runtime, this invention significantly improves the correctness and stability of pipeline instruction scheduling, reduces pipeline rollback, pauses, or structural bubbles caused by data conflicts, and enables the chained pipeline composed of multiple vector execution units to run continuously and stably under dynamic execution conditions, thereby effectively improving the computational efficiency and execution performance of the vector processor system.
[0049] In some embodiments, as shown in Figure 1, the arbitration and routing module also includes a self-routing logic module. After determining that the second vector instruction has obtained execution permission, the self-routing logic module generates the memory access address of the vector register file based on the source register field and the destination register field in the second vector instruction, and sends the memory access address to the vector register file through a first multiplexer. The first multiplexer is a 32-to-1 multiplexer.
[0050] In embodiments of the present invention, as shown in Figure 4, the fully interconnected crossbar network includes: a self-routing logic module, multiple N-to-1 multiplexers (MUX N-to-1), a vector register file, and multiple 32-to-1 multiplexers (MUX 32-to-1). The vector register file used in this invention consists of 32 vector registers, denoted as Vector Register (VREG) 0, Vector Register 1, ..., Vector Register 31. Each vector register in the vector register file stores a set of vector data. The vector data has a length of 32 and a bit width of DLEN.
[0051] The write end of each vector register is connected to the output end of the corresponding N-to-1 multiplexer. The read end of each vector register is connected to the input ends of 32 32-to-1 multiplexers, so that any vector execution unit can select the required data from any vector register in the vector register file through the 32-to-1 multiplexer, thereby realizing full interconnection access between the vector register file and multiple vector execution units.
[0052] The inputs of each N-to-1 multiplexer are connected to the outputs of multiple vector execution units. The outputs of each 32-to-1 multiplexer are connected to the inputs of the corresponding vector execution unit.
[0053] As shown in Figures 1 to 4, during the arbitration and routing phase of the vector processor, after the self-routing logic module determines that a vector instruction in a waiting-for-arbitration state has obtained execution permission, the self-routing logic module reads the first source vector register field, the second source vector register field, and the target vector register field of the vector instruction, and generates a read register select signal and a write register select signal, respectively. The read register select signal carries the selection code indicating the real-time read address of the vector register file, and the write register select signal carries the selection code indicating the real-time write address of the vector register file.
[0054] The self-routing logic module outputs the read register select signal to the corresponding 32-to-1 multiplexer to select the source register from which data is read. The self-routing logic module also outputs the write register select signal to the corresponding N-to-1 multiplexer to select the target register to which data is written.
[0055] According to the above embodiments, by constructing a fully interconnected crossbar network between the vector register file and multiple vector execution units, dynamic routing of register read/write paths is achieved. This allows any vector execution unit to access any vector register in the vector register file within a single clock cycle, improving the flexibility and parallelism of the data path. The self-routing logic module generates the read register select signal and the write register select signal in real time based on the register fields in the vector instruction. This allows the register access path to be established automatically at runtime without relying on a preset interconnect configuration, avoiding the resource conflicts and scheduling constraints caused by static interconnect structures. Through the combination of N-to-1 multiplexers and 32-to-1 multiplexers, this vector processor system achieves non-blocking access to the vector register file under high-concurrency conditions, improving the continuity of pipelined vector instruction execution and overall throughput, thereby enhancing the execution performance and energy efficiency of the vector processor in non-matrix operator scenarios.
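As a rough software analogy of the read side of this crossbar, the sketch below derives select signals from the register fields and uses them to pick operands out of a 32-entry register file. The function names and the dictionary layout of the select signals are hypothetical, and the selection code is assumed to be a plain register index; the specification does not state its encoding.

```python
NUM_VREGS = 32  # the vector register file holds VREG 0 .. VREG 31


def make_select_signals(src1: int, src2: int, dst: int) -> dict:
    """Derive read and write select signals from the register fields
    of an instruction that has obtained execution permission."""
    for idx in (src1, src2, dst):
        if not 0 <= idx < NUM_VREGS:
            raise ValueError(f"register index {idx} out of range")
    return {"read_sel": (src1, src2), "write_sel": dst}


def mux_32_to_1(vrf: list, sel: int) -> list:
    """Model of one 32-to-1 multiplexer: forward the selected register."""
    return vrf[sel]


def fetch_operands(vrf: list, signals: dict) -> tuple:
    """Read both source operands through the read-side multiplexers."""
    sel1, sel2 = signals["read_sel"]
    return mux_32_to_1(vrf, sel1), mux_32_to_1(vrf, sel2)
```

Because the select signals are recomputed for every granted instruction, no interconnect configuration has to be fixed in advance, which is the property the paragraph above attributes to the self-routing logic.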
[0056] In some embodiments, the memory access address includes a read address. The first multiplexer selects the corresponding source vector register in the vector register file according to the read address, reads the read data in the source vector register, and sends the read data to the input buffer in the vector execution unit.
[0057] In this embodiment, as shown in Figures 1 to 4, during the operand fetch stage of the vector processor, the 32-to-1 multiplexer selects the corresponding source vector registers in the vector register file based on the read addresses of the first and second source vector registers carried in the read register select signal, and reads the first and second source operands from them. The 32-to-1 multiplexer then sends the first and second source operands as read data to the input buffer of the corresponding vector execution unit.
[0058] According to the above embodiments, by using a 32-to-1 multiplexer to select and read any source vector register from the vector register file based on the real-time read address, the vector execution unit can obtain the required operands within a single clock cycle. This avoids access conflicts and additional latency caused by fixed interconnection methods, thereby forming a high-bandwidth, low-latency data path. This mechanism improves the continuity of vector instruction pipelined execution and overall computational throughput, enhancing the data access efficiency of the vector processor in high-concurrency scenarios.
[0059] In some embodiments, as shown in Figure 1, the vector execution unit also includes a vector array, which is used to obtain read data from the input buffer, perform vector calculations based on the read data, and output the calculation results to the output buffer.
[0060] In embodiments of the present invention, as shown in Figures 1 to 2, during the execution phase of the vector processor, the vector array reads data from the input buffer and performs the corresponding vector computation based on the first and second source operands. After completing the vector computation, the vector array outputs the computation result to the output buffer.
[0061] According to the above embodiments, before the vector instruction enters the execution stage, the input buffer temporarily stores the source operands read from the vector register file, decoupling the read operation from the subsequent vector calculation. This prevents the vector execution unit from idling because of register access latency or read port contention, thereby improving the continuity and instruction throughput of the vector calculation pipeline. The output buffer temporarily stores the calculation results of the vector array, decoupling write-back to the vector register file from the vector calculation. This reduces blocking of the vector calculation pipeline caused by write port conflicts or write-back latency, enabling multiple vector execution units to maintain stable chained pipeline operation even in a high-concurrency environment.
[0062] In some embodiments, the memory access address further includes a write address. The vector execution unit stores the computation result into the vector register file through a second multiplexer. The second multiplexer selects the corresponding target vector register in the vector register file according to the write address, reads the computation result from the output buffer, and writes it into the target vector register. The second multiplexer is an N-to-1 multiplexer.
[0063] In this embodiment of the invention, during the write-back phase of the vector processor, the N-to-1 multiplexer selects the corresponding target vector register in the vector register file according to the write address of the target vector register carried in the write register selection signal, and writes the calculation result of the output buffer into the target vector register to complete the write-back operation of the vector calculation result.
[0064] According to the above embodiments, by using an N-to-1 multiplexer to select the target vector register based on the real-time write address and complete the write-back of the calculation result, the write path of the vector register file has dynamic reconfigurability, avoiding write port conflicts caused by fixed interconnections, thereby improving the bandwidth utilization and pipeline continuity of the write-back stage.
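The write-back path can be sketched in the same spirit. Here each target register is assumed to have its own N-to-1 multiplexer whose inputs are the output buffers of the N execution units; the function name write_back and the mapping write_selects are hypothetical illustrations, not terms from the specification.

```python
def write_back(vrf: list, output_buffers: list, write_selects: dict) -> list:
    """Return the register file contents after one write-back step.

    write_selects maps a target register index to the index of the
    execution unit whose output buffer that register's N-to-1
    multiplexer selects. Registers not listed keep their old value.
    """
    new_vrf = list(vrf)  # leave the caller's register file untouched
    for dst, unit in write_selects.items():
        if not 0 <= dst < len(vrf):
            raise ValueError(f"target register {dst} out of range")
        new_vrf[dst] = output_buffers[unit]
    return new_vrf
```

Keying the mapping by target register means two units cannot be routed to the same register in one step of this model, which loosely mirrors the role of arbitration in avoiding write conflicts.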
[0065] In some embodiments, as shown in Figure 5, the fully interconnected crossbar network constructed based on the aforementioned embodiments can support the dynamic combination and switching of different vector execution units to realize the chained pipelined execution of vector instructions among multiple vector execution units.
[0066] For example, in the first pipeline (Pipeline 0), the vector addition execution unit first reads the source operand from the vector register file and performs vector addition, writing the addition result to vector register 0. Next, the vector multiplication execution unit reads the addition result from vector register 0 as its source operand, performs vector multiplication, and writes the multiplication result to vector register 1. Subsequently, the vector type conversion execution unit retrieves the source operand from vector register 1, performs vector type conversion, and writes the converted target operand to vector register 2. Finally, the vector comparison execution unit reads the operand from vector register 2, performs vector comparison, and outputs the comparison result, thus achieving the complete execution of multiple vector instruction chains.
[0067] In Pipeline 1, the vector multiplication execution unit first reads the source operand from the vector register file and performs the vector multiplication operation, writing the multiplication result to vector register 0. Next, the vector addition execution unit reads the multiplication result from vector register 0 as its source operand, performs the vector addition operation, and writes the addition result to vector register 1. Subsequently, the vector comparison execution unit reads the operand from vector register 1, performs the vector comparison operation, and writes the comparison result to vector register 2. Finally, the vector type conversion execution unit retrieves the source operand from vector register 2, performs the vector type conversion operation, and outputs the conversion result, thus achieving the complete execution of multiple vector instruction chains.
[0068] According to the above embodiments, by utilizing the vector register file as a data exchange channel between vector execution units, different types of vector execution units can be connected in series as needed during runtime to form a chain-like pipeline, thereby realizing the sequential pipelined execution of vector instructions between heterogeneous vector execution units. This mechanism achieves dynamic reconstruction of vector execution paths without requiring a fixed interconnect structure, enabling multiple types of vector operators to be processed continuously in a pipelined manner, improving instruction-level parallelism and overall execution efficiency during AI operator execution.
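The two example pipelines can be mimicked in software by treating the vector register file as the only channel between operations. The sketch below is an illustrative assumption throughout: the operation table OPS, the function run_chain, the input registers 10 to 13, and the choice of "greater than" for comparison and float-to-integer truncation for type conversion are all hypothetical.

```python
# Hypothetical element-wise operations standing in for the execution units.
OPS = {
    "add": lambda a, b: [x + y for x, y in zip(a, b)],
    "mul": lambda a, b: [x * y for x, y in zip(a, b)],
    "cvt": lambda a: [int(x) for x in a],             # assumed: truncate to int
    "cmp": lambda a, b: [x > y for x, y in zip(a, b)],  # assumed: greater-than
}


def run_chain(vrf: dict, chain: list):
    """Execute a chain of (op, source_registers, target_register) steps.

    Each step reads its operands from the register file and, unless the
    target is None, writes its result back so the next unit can read it.
    """
    vrf = dict(vrf)
    result = None
    for op, srcs, dst in chain:
        result = OPS[op](*(vrf[s] for s in srcs))
        if dst is not None:
            vrf[dst] = result
    return result, vrf


inputs = {10: [1.5, 2.0], 11: [0.5, 1.0], 12: [2.0, 0.5], 13: [3, 3]}
# Pipeline 0: add -> VREG0, mul -> VREG1, convert -> VREG2, compare -> output
pipeline0 = [("add", (10, 11), 0), ("mul", (0, 12), 1),
             ("cvt", (1,), 2), ("cmp", (2, 13), None)]
# Pipeline 1: mul -> VREG0, add -> VREG1, compare -> VREG2, convert -> output
pipeline1 = [("mul", (10, 11), 0), ("add", (0, 12), 1),
             ("cmp", (1, 13), 2), ("cvt", (2,), None)]
```

Reordering the units only requires a different chain list, not a different interconnect, which is the software counterpart of the dynamic reconstruction described above.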
[0069] In some embodiments, as shown in Figure 6, the multi-instruction pipelined execution process in this invention is illustrated using a vector addition execution unit as an example. In Figure 6, the hardware pipeline, that is, the addition operation pipeline, consists of multiple pipeline stages within the vector addition execution unit. The entry labeled Processing Instruction gives the number (Inst X) of the vector instruction being processed by each pipeline stage within the same clock cycle.
[0070] In this embodiment of the invention, each vector addition instruction needs to go through four addition pipeline stages to complete the operation. Taking the continuously issued addition instructions inst0, inst1, inst2, inst3, and inst4 as an example, their flow through the pipeline in successive clock cycles is as follows. As shown in Figure 6, at time T0, the vector addition instruction inst0 enters addition pipeline stage 0. At time T1, inst0 enters addition pipeline stage 1 and inst1 enters addition pipeline stage 0. At time T2, inst0 enters addition pipeline stage 2, inst1 enters addition pipeline stage 1, and inst2 enters addition pipeline stage 0.
[0071] At time T3, the vector addition instruction inst0 enters addition pipeline stage 3, the vector addition instruction inst1 enters addition pipeline stage 2, the vector addition instruction inst2 enters addition pipeline stage 1, and the vector addition instruction inst3 enters addition pipeline stage 0. At time T4, the vector addition instruction inst0 completes execution and exits the pipeline.
[0072] According to the above embodiments, the addition execution unit can receive new vector instructions in each clock cycle, enabling multiple vector instructions to be executed continuously and without bubbles within the same addition execution unit, thereby significantly improving instruction throughput and execution efficiency.
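The stage occupancy walked through above follows a simple rule: at time T, stage s holds instruction T - s, provided that instruction exists. A minimal sketch, with the hypothetical function name pipeline_occupancy and the assumption that one instruction is issued every cycle with no stalls:

```python
def pipeline_occupancy(num_instrs: int, num_stages: int = 4) -> list:
    """Return, for each clock cycle, the instruction index held by each
    pipeline stage (None for an empty stage)."""
    total_cycles = num_instrs + num_stages - 1
    table = []
    for t in range(total_cycles):
        row = []
        for stage in range(num_stages):
            instr = t - stage  # instruction issued at cycle t - stage
            row.append(instr if 0 <= instr < num_instrs else None)
        table.append(row)
    return table


# Five back-to-back additions through a four-stage addition pipeline.
for t, row in enumerate(pipeline_occupancy(5)):
    print(f"T{t}: {row}")
```

From T3 onward every stage is occupied until the instruction stream runs out, so five instructions finish in 5 + 4 - 1 = 8 cycles rather than 5 × 4 = 20.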
[0073] In some embodiments, as shown in Figure 7, when any vector execution unit is in a paused state, the vector execution unit is also used to send a backpressure signal upward, level by level, to the vector execution units that have a data dependency relationship with it. After receiving the backpressure signal, each vector execution unit performs a latching operation.
[0074] In this embodiment of the invention, the case in which the vector comparison execution unit is unable to continue processing data and enters a paused state is taken as an example.
[0075] When the vector comparison execution unit pauses, it sends a backpressure signal to its upper-level vector register (VREG). Upon receiving the backpressure signal, the upper-level vector register enters a hold state. At the same time, the vector comparison execution unit sends a backpressure signal to its corresponding self-routing logic module, which forwards this signal to the vector register file scoreboard. Upon receiving the backpressure signal, the vector register file scoreboard sends a backpressure signal to the self-routing logic module of the upper-level vector type conversion execution unit, which has a data dependency with the vector comparison execution unit, causing that vector type conversion execution unit to enter a hold state.
[0076] Upon receiving the backpressure signal, the vector type conversion execution unit in turn passes the backpressure signal to its upper-level vector register, causing that vector register to enter a hold state. Through the vector register file scoreboard and the self-routing logic modules at each level, the backpressure signal is passed upward level by level, causing the upper-level vector multiplication execution unit, the vector addition execution unit, and their corresponding upper-level vector registers to enter a hold state in sequence.
[0077] When the vector register file scoreboard receives the backpressure signal, it performs a latch operation on all vector execution units that have data dependencies with the vector comparison execution unit. Vector execution units in the hold state stop receiving new vector instructions and stop triggering read/write operations. The vector registers corresponding to each vector execution unit retain their internal data, and the read and write addresses of each vector register stop updating during the latching period.
[0078] According to the above embodiments, the level-by-level upward backpressure propagation mechanism, implemented jointly by the vector register file scoreboard and the self-routing logic modules, can promptly freeze all upper-level vector execution units that have data dependencies on a lower-level vector execution unit, together with their corresponding vector registers, when that lower-level unit enters a paused state. This maintains the consistency of the data path and avoids out-of-order data, write-after-write overwrites, or write-after-read errors caused by continuing to schedule instructions. At the same time, each level of vector execution unit stops triggering register read/write access while in the hold state, reducing pipeline conflicts and resource contention. This allows the vector computing pipeline to operate stably under complex dependency conditions, further improving the operational reliability and instruction execution correctness of the vector processor.
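The set of units that end up in the hold state can be pictured as everything reachable upstream from the paused unit in the data dependency graph. The sketch below assumes such a graph is available as a plain mapping; the function name propagate_backpressure, the producers mapping, and the unit labels are hypothetical, and the per-register hold state and signal timing are not modeled.

```python
def propagate_backpressure(producers: dict, stalled_unit: str) -> set:
    """Return every unit that must enter the hold state.

    producers maps a unit to the list of upper-level units whose results
    it consumes. The signal is passed upward level by level until no
    further dependent unit remains.
    """
    held = set()
    frontier = [stalled_unit]
    while frontier:
        unit = frontier.pop()
        for upstream in producers.get(unit, []):
            if upstream not in held:
                held.add(upstream)
                frontier.append(upstream)
    return held


# Chain from the example: add -> mul -> type conversion -> compare.
chain = {"cmp": ["cvt"], "cvt": ["mul"], "mul": ["add"], "add": []}
```

With this chain, a pause in the comparison unit holds the type conversion, multiplication, and addition units, while a pause in the multiplication unit holds only the addition unit.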
[0079] This application provides a vector processor method based on self-routing logic, applied to the aforementioned vector processor device based on self-routing logic. This vector processor method based on self-routing logic is based on the same inventive concept as the vector processor device based on self-routing logic in one embodiment of this application, and the principle of solving the problem is similar. Therefore, the implementation of the vector processor method based on self-routing logic is the same as the implementation of the vector processor device based on self-routing logic in one embodiment of this application, and repeated details will not be described again. As used below, the terms "unit" or "module" can refer to a combination of software and / or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, hardware implementation, or a combination of software and hardware, is also possible and contemplated.
[0080] This vector processor method based on self-routing logic includes: The vector execution unit obtains the corresponding first vector instruction and determines the instruction state of the first vector instruction based on the state machine corresponding to the first vector instruction. It then sends the second vector instruction, whose instruction state is "waiting for arbitration," and its corresponding arbitration request instruction to the arbitration and routing module.
[0081] The arbitration and routing module arbitrates each second vector instruction based on the arbitration request instruction to determine the execution authority of each second vector instruction. Based on the second vector instruction that has obtained execution authority, it generates the memory access address for the vector register file. The vector execution unit retrieves data from the vector register file based on the memory access address and performs vector calculations.
[0082] In some embodiments, the vector processor method based on self-routing logic further includes: a decoding and scheduling module decoding each acquired first vector instruction to obtain decoding information. Based on the operation type field in the decoding information, each first vector instruction is sent to the instruction queue in the corresponding Vector Execution Unit (VEU).
[0083] In some embodiments, the vector processor method based on self-routing logic further includes: an arbitrator receiving a second vector instruction and its corresponding arbitration request instruction. Based on the source register field and destination register field in the second vector instruction, it is determined whether there is a data dependency between the second vector instruction and a third vector instruction that is in a waiting execution state or an execution state. If there is no data dependency, it is determined that the second vector instruction has execution permission.
[0084] In some embodiments, the vector processor method based on self-routing logic further includes: after determining that the second vector instruction has obtained execution permission, the self-routing logic module is used to generate a memory access address of the vector register file according to the source register field and the target register field in the second vector instruction, and send the memory access address to the vector register file through a first multiplexer.
[0085] In some embodiments, the memory access address includes a read address. The first multiplexer selects the corresponding source vector register in the vector register file according to the read address, reads the read data in the source vector register, and sends the read data to the input buffer in the vector execution unit.
[0086] In some embodiments, the vector processor method based on self-routing logic further includes: a vector array for acquiring read data from an input buffer, performing vector computation based on the read data, and outputting the computation result to an output buffer.
[0087] In some embodiments, the memory access address further includes a write address. The vector execution unit stores the computation result into the vector register file through a second multiplexer. The second multiplexer selects the corresponding target vector register in the vector register file according to the write address, reads the computation result from the output buffer, and writes it into the target vector register.
[0088] In some embodiments, when any vector execution unit is in a paused state, the vector execution unit is further configured to send the backpressure signal step by step upwards to the vector execution units with which it has a data dependency. Each vector execution unit performs a latching operation upon receiving the backpressure signal.
[0089] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
[0090] This invention is described with reference to flowcharts and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each process and/or block in the flowcharts and/or block diagrams, and combinations of processes and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0091] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including an instruction device that implements the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0092] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more processes of the flowchart and/or one or more blocks of the block diagram.
[0093] In the description of this specification, the references to terms such as "an embodiment," "a specific embodiment," "some embodiments," "for example," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.
[0094] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.
Claims
1. A vector processor system based on self-routing logic, characterized in that it comprises: a vector execution unit, an arbitration and routing module, and a vector register file; the vector execution unit is used to obtain the corresponding first vector instruction, to determine the instruction state of the first vector instruction according to the state machine corresponding to the first vector instruction, and to send the second vector instruction, whose instruction state is waiting for arbitration, and its corresponding arbitration request instruction to the arbitration and routing module; the arbitration and routing module is used to arbitrate each of the second vector instructions based on the arbitration request instruction, to determine the execution permission of each of the second vector instructions, and to generate a memory access address of the vector register file based on the second vector instruction that has obtained the execution permission; the vector execution unit retrieves read data from the vector register file based on the memory access address and performs vector calculations.
2. The system according to claim 1, characterized in that it further comprises: a decoding and scheduling module, used to decode each of the acquired first vector instructions to obtain decoding information, and, based on the operation type field in the decoding information, to send each of the first vector instructions to the instruction queue in the corresponding vector execution unit.
3. The system according to claim 1, characterized in that the arbitration and routing module includes: an arbitrator, used to receive the second vector instruction and its corresponding arbitration request instruction; to determine, based on the source register field and the target register field in the second vector instruction, whether there is a data dependency relationship between the second vector instruction and a third vector instruction that is in a waiting execution state or an execution state; and, if there is no data dependency relationship, to determine that the second vector instruction has obtained the execution permission.
4. The system according to claim 2 or 3, characterized in that the arbitration and routing module includes a self-routing logic module; after determining that the second vector instruction has obtained the execution permission, the self-routing logic module is used to generate a memory access address of the vector register file according to the source register field and the target register field in the second vector instruction, and to send the memory access address to the vector register file through a first multiplexer.
5. The system according to claim 4, characterized in that the memory access address includes a read address; the first multiplexer selects the corresponding source vector register in the vector register file according to the read address, reads the read data in the source vector register, and sends the read data to the input buffer in the vector execution unit.
6. The system according to claim 5, characterized in that the vector execution unit further includes a vector array, used to obtain the read data from the input buffer, perform vector calculations based on the read data, and output the calculation results to the output buffer.
7. The system according to claim 6, characterized in that the memory access address also includes a write address; the vector execution unit stores the calculation result into the vector register file through a second multiplexer; the second multiplexer selects the corresponding target vector register in the vector register file according to the write address, reads the calculation result from the output buffer, and writes it into the target vector register.
8. The system according to claim 1, characterized in that, when any of the vector execution units is in a paused state, the vector execution unit is also used to send a backpressure signal upward, level by level, to the vector execution units that have a data dependency relationship with it; after receiving the backpressure signal, each of the vector execution units performs a latching operation.
9. A vector processor method based on self-routing logic, characterized in that it comprises: the vector execution unit obtains the corresponding first vector instruction, determines the instruction state of the first vector instruction based on the state machine corresponding to the first vector instruction, and sends the second vector instruction, whose instruction state is waiting for arbitration, and its corresponding arbitration request instruction to the arbitration and routing module; the arbitration and routing module arbitrates each of the second vector instructions based on the arbitration request instruction to determine the execution permission of each of the second vector instructions, and generates a memory access address of the vector register file based on the second vector instruction that has obtained the execution permission; the vector execution unit retrieves read data from the vector register file based on the memory access address and performs vector calculations.
10. A chip, characterized in that it includes the vector processor system based on self-routing logic as described in any one of claims 1 to 8.
11. A circuit board, characterized in that it includes the chip as described in claim 10.
12. An electronic device, characterized in that it includes the circuit board as described in claim 11.