Vector processor system and method

CN122240185A · Pending · Publication Date: 2026-06-19 · BEIJING TSINGMICRO INTELLIGENT TECH CO LTD


Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications(China)
Current Assignee / Owner: BEIJING TSINGMICRO INTELLIGENT TECH CO LTD
Filing Date: 2026-02-14
Publication Date: 2026-06-19

Application Information

CN122240185A

IPC: G06F9/302; G06F9/30; G06F9/38; G06F9/34

AI Tagging

Technical Efficacy Phrases

Improve execution efficiency; improve system throughput capacity


AI Technical Summary

Technical Problem

Existing vector processors determine data dependencies too conservatively in long vector chain execution scenarios, which leads to low pipeline utilization and reduced execution efficiency. In particular, when multiple similar vector instructions are issued consecutively, execution bubbles and idle resources readily arise.

Method used

A dependency determination mechanism at the data segment level of vector registers is adopted. A status recording unit centrally records the access status of the vector registers, and an arbitrator determines read-after-write data dependencies (a read of a register that an earlier instruction is still writing) during the vector instruction issuance stage. This allows a vector instruction to obtain execution permission early, as soon as its dependency conditions are met. An address generation unit then generates real-time read and write addresses for segmented access.

Benefits of technology

It improves the utilization rate of vector execution units and the overall execution efficiency and throughput of vector processors, reduces waiting and pipeline bubbles caused by conservative dependency judgments, and enhances the resource utilization and computational efficiency of the system.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a vector processor system and method. The system includes: a status recording unit for storing a first identifier in the corresponding vector register status information table; a vector execution unit for sending a first vector instruction and its corresponding arbitration request instruction to an arbitrator; the arbitrator, for determining, based on the arbitration request instruction and the status information table of each vector register, whether the first vector instruction has a read-after-write (RAW) data dependency, that is, whether it would read a data segment that an earlier instruction has not yet written, and, if not, determining that the first vector instruction has obtained execution permission; an address generation unit for determining the real-time access address of the data segment to be accessed, based on the first vector instruction that has obtained execution permission and the vector configuration parameters; and the vector execution unit, which retrieves data from the vector register file based on the real-time access address and performs vector processing. This invention realizes data segment-level dependency determination in long vector chain execution scenarios, improving the utilization rate of the vector execution unit and the overall execution efficiency and throughput of the vector processor.


Description

Technical Field

[0001] This invention relates to the field of vector processor microarchitecture design technology, and in particular to a vector processor system and method.

Background Technology

[0002] Vector processors are typically used in high-throughput processing scenarios involving large-scale vector data. When dealing with applications such as large models, it is often necessary to perform similar operations on a massive number of vector elements. This inevitably leads to complex data dependencies, including dependencies between vector registers and those related to mask registers. Furthermore, due to the limitation on the length of vectors that a single vector instruction can process, in practice, multiple similar vector instructions are usually needed to process all vector elements in batches. In these scenarios, if the data dependency determination and scheduling mechanism is poorly designed, it can easily introduce additional waiting during vector instruction issuance and execution, thereby reducing pipeline utilization and affecting overall execution efficiency. Therefore, how to implement an efficient and low-latency data dependency determination and scheduling mechanism in long vector chain execution scenarios has become one of the urgent technical problems to be solved in vector processor design.

[0003] To address the aforementioned issues, existing technologies primarily employ two typical technical solutions. One solution is based on a scoreboard-based dependency analysis mechanism. This mechanism maintains a read / write status table for register resources and checks the source and destination registers involved in each vector instruction to be issued. When a data dependency such as read-after-write, write-after-read, or write-after-write is detected, the issuance of the vector instruction is blocked until the read / write status of the relevant registers is cleared. The other solution utilizes a register renaming and out-of-order (OoO) scheduling mechanism. This eliminates data dependencies by mapping logical vector registers to physical vector registers and, combined with a vector instruction readiness status tracking mechanism, achieves dynamic scheduling and execution of vector instructions when dependency conditions are met.

[0004] However, the two existing technical solutions mentioned above still have certain limitations in long vector processing scenarios. Scoreboard-based solutions typically perform dependency determination at the register-wide level, requiring all read and write operations related to that register to be completed before subsequent vector instructions can be issued. This fails to fully utilize the chained pipelined execution and segmented result generation characteristics of vector processors, easily leading to idle execution units and insufficient parallelism. On the other hand, while register renaming and out-of-order execution solutions can alleviate the limitations caused by data dependencies to some extent, when the vector register length is large, they require the introduction of larger physical vector register files, register mapping tables, and corresponding bypass networks and scheduling logic, significantly increasing hardware area and implementation complexity.

[0005] Furthermore, for scenarios in which multiple similar vector instructions are issued consecutively, some existing solutions adopt a CGRA (Coarse-Grained Reconfigurable Architecture) as the computing architecture. By configuring multiple processing elements (PEs), instruction-level parallelism and dataflow parallelism of the loop body are achieved. Although a CGRA has certain advantages in reconfigurability, its hardware configuration latency is relatively long, and it lacks an instruction over-issue capability during the execution phase. Switching between different loop bodies therefore requires reloading the configuration and waiting for instruction issuance, which can easily generate a large number of execution bubbles in the pipeline, reducing overall throughput and limiting system operating efficiency.

[0006] Therefore, there is an urgent need for a new vector processor microarchitecture that features reconfigurable data paths, supports fine-grained dynamic scheduling, and can efficiently organize the parallel operation of heterogeneous execution units.

[0007] This section is intended to provide background or context for the embodiments of the invention set forth in the claims. The description herein is not an admission that it is prior art simply because it is included in this section.

Summary of the Invention

[0008] This invention provides a vector processor system for implementing data segment-level dependency determination in long vector chain execution scenarios. This avoids pipeline bubbles and resource idleness caused by conservative arbitration at the entire vector register granularity, thereby improving the utilization rate of vector execution units and the overall execution efficiency and throughput of the vector processor.

[0009] The vector processor system includes: a status recording unit, a vector execution unit, a vector register file, an arbitrator, and an address generation unit. The status recording unit is used to receive the first identifier of the first vector instruction and store the first identifier in the corresponding vector register status information table. The vector execution unit is used to send the first vector instruction and its corresponding arbitration request instruction to the arbitrator. The arbitrator is used to determine, based on the arbitration request instruction and the status information table of each vector register, whether the first vector instruction has a read-after-write (RAW) data dependency, that is, whether it would read a data segment that an earlier instruction has not yet written; if not, it determines that the first vector instruction has obtained execution permission. The address generation unit is used to determine the real-time read address and / or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution permission, based on the source vector register field, the target vector register field, and the vector configuration parameters of that instruction. The vector execution unit performs vector processing by acquiring read data and / or write data from the vector register file based on the real-time read address and / or real-time write address.

[0010] In some embodiments, the vector register status information table includes: a read instruction table, a write instruction table, a read data segment table, and a write data segment table. The read instruction table and the write instruction table respectively store the first identifiers of the vector instructions that read or write the vector register. The read data segment table and the write data segment table respectively store the second identifier corresponding to the data segment of the vector register that is being read or written. The data segment is determined based on the bit width of the vector register and the data bit width of the vector execution unit.

[0011] In some embodiments, the vector register status information table further includes: a read vector register status table, used to store whether the vector register is in a read state; and a write vector register status table, used to store whether the vector register is in a write state.
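As a rough illustration of the tables in [0010] and [0011], the status information kept for each vector register could be modeled as below. This is a minimal Python sketch, not the patented hardware: the class name `VState` is borrowed from the detailed description, while the field names, the use of lists and sets, and the file size of 32 registers are assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class VState:
    """Status information for one vector register (field names are illustrative)."""
    read_insts: list = field(default_factory=list)   # read instruction table (first identifiers)
    write_insts: list = field(default_factory=list)  # write instruction table (first identifiers)
    read_segs: set = field(default_factory=set)      # read data segment table (second identifiers)
    write_segs: set = field(default_factory=set)     # write data segment table (second identifiers)
    reading: bool = False                            # read vector register status
    writing: bool = False                            # write vector register status


# Assumption: 32 vector registers, named v0..v31.
scoreboard = {f"v{i}": VState() for i in range(32)}

# Example: instruction 7 starts writing v3 and has produced data segment 0 so far.
scoreboard["v3"].write_insts.append(7)
scoreboard["v3"].write_segs.add(0)
scoreboard["v3"].writing = True
```

Each register gets its own independent tables, so recording an access to one register leaves the others untouched.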

[0012] In some embodiments, the arbitrator includes a read dependency processing unit, configured to: determine whether a second vector instruction exists in the write instruction table corresponding to the source vector register to be accessed by the first vector instruction; if not, determine that the first vector instruction has no read-after-write (RAW) data dependency, that is, it does not read data that an earlier instruction has yet to write; if the write instruction table contains only one second vector instruction, determine whether the second identifier exists in the write data segment table corresponding to the source vector register; and, if the second identifier exists, determine that the first vector instruction has no RAW data dependency.
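The decision made by the read dependency processing unit in [0012] can be sketched as a small Python function. The function and parameter names are hypothetical, and the handling of cases the text leaves open (more than one pending writer, or a segment that has not been written yet) is an assumption: the sketch conservatively reports a dependency there.

```python
def read_must_wait(write_insts, write_segs, segment):
    """Return True if a read of `segment` in a source register has to wait.

    write_insts: identifiers in the register's write instruction table
    write_segs:  segment identifiers in the register's write data segment table
    segment:     identifier of the data segment the reader wants next
    """
    if not write_insts:
        return False                      # no pending writer: no dependency
    if len(write_insts) == 1:
        # A single pending writer: the read may go ahead once the wanted
        # segment is recorded in the write data segment table.
        return segment not in write_segs
    return True                           # assumption: several pending writers, wait


# The single writer (instruction 7) has produced segments 0 and 1 so far.
print(read_must_wait([7], {0, 1}, 1))     # segment 1 is ready
print(read_must_wait([7], {0, 1}, 2))     # segment 2 is not written yet
```

Under these assumptions a dependent reader can follow the writer segment by segment instead of waiting for the whole register.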

[0013] In some embodiments, the arbitrator includes a write dependency processing unit, configured to: determine whether there is a second vector instruction preceding the first vector instruction in the read instruction table corresponding to the target vector register to be accessed by the first vector instruction; if not, determine that the first vector instruction has no write-after-read (WAR) data dependency, that is, it does not overwrite data that an earlier instruction has yet to read; if there is only one second vector instruction in the read instruction table, determine whether there is a second identifier in the read data segment table corresponding to the target vector register; and, if there is a second identifier, determine that the first vector instruction has no WAR data dependency.

[0014] In some embodiments, the write dependency processing unit is further configured to determine whether there is a second vector instruction preceding the first vector instruction in the write instruction table corresponding to the target vector register to be accessed by the first vector instruction; if not, it is determined that the first vector instruction does not have a write-after-write data dependency relationship.
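The two checks performed by the write dependency processing unit in [0013] and [0014] can be combined into one predicate, sketched below in Python. All names are hypothetical. The sketch assumes the tables passed in contain only instructions that precede the writer, and it conservatively blocks the write in the cases the text does not spell out (several earlier readers, or a segment the single reader has not read yet).

```python
def write_may_proceed(read_insts, read_segs, write_insts, segment):
    """Return True if a write to `segment` of the target register may proceed.

    read_insts:  earlier instructions in the register's read instruction table
    read_segs:   segment identifiers in the register's read data segment table
    write_insts: earlier instructions in the register's write instruction table
    segment:     identifier of the data segment about to be written
    """
    # Check against earlier readers of the target register.
    if not read_insts:
        readers_ok = True
    elif len(read_insts) == 1:
        readers_ok = segment in read_segs     # the reader is already done with it
    else:
        readers_ok = False                    # assumption: several readers, wait

    # Check against earlier writers of the target register.
    writers_ok = not write_insts

    return readers_ok and writers_ok


# One earlier reader (instruction 4) has consumed segments 0 and 1.
print(write_may_proceed([4], {0, 1}, [], 1))   # segment 1 can be overwritten
print(write_may_proceed([4], {0, 1}, [], 3))   # segment 3 is still unread
```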

[0015] In some embodiments, the address generation unit includes: a real-time read address generation unit, a real-time write address generation unit, a read address queue, and a write address queue. The read address queue and the write address queue receive the first vector instruction that has obtained execution permission. The real-time read address generation unit fetches the first vector instruction from the read address queue once the second vector instruction has completed the read operation on its last data segment; generates the real-time read address based on the first vector instruction and its vector length, resource dependency relationship, and vector register grouping multiple; and updates the read status of the vector register in the read vector register status table. The real-time write address generation unit fetches the first vector instruction from the write address queue once the second vector instruction has completed the write operation on its last data segment; generates the real-time write address based on the first vector instruction and its vector length, resource dependency relationship, and vector register grouping multiple; and updates the write status of the vector register in the write vector register status table.
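One way to picture the queue-driven address generation in [0015] is the Python sketch below. It is an illustration under assumptions, not the patented circuit: addresses are modeled as (register index, segment index) pairs, a register group is assumed to occupy consecutive registers, exactly one address is issued per cycle, and the function names and the tuple layout of queue entries are invented for the example.

```python
from collections import deque
from math import ceil


def segment_addresses(base_reg, vl_bits, vlen_bits, dlen_bits):
    """List the (register, segment) pairs covering vl_bits of data from base_reg."""
    segs_per_reg = vlen_bits // dlen_bits
    total_segs = ceil(vl_bits / dlen_bits)
    return [(base_reg + s // segs_per_reg, s % segs_per_reg)
            for s in range(total_segs)]


def drain(queue, vlen_bits, dlen_bits):
    """Issue one address per cycle; the next queued instruction starts in the
    cycle right after the previous instruction's last data segment."""
    schedule, cycle = [], 0
    while queue:
        inst_id, base_reg, vl_bits = queue.popleft()
        for reg, seg in segment_addresses(base_reg, vl_bits, vlen_bits, dlen_bits):
            schedule.append((cycle, inst_id, reg, seg))
            cycle += 1
    return schedule


# Two granted instructions wait in the read address queue (VLEN=512, DLEN=128).
read_queue = deque([(1, 0, 512), (2, 4, 512)])
schedule = drain(read_queue, vlen_bits=512, dlen_bits=128)
```

In this model instruction 2 receives its first address in the cycle immediately after instruction 1's last segment, so the port never idles between the two.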

[0016] In some embodiments, the vector execution unit includes: an input buffer, a vector array, and an output buffer. The vector register file receives the real-time read address, obtains read data from the vector register corresponding to the real-time read address, and outputs the read data to the input buffer. The vector array obtains the read data from the input buffer, performs vector operation processing on it, and outputs the operation result to the output buffer. If the write dependency processing unit determines that the first vector instruction has neither a write-after-read (WAR) data dependency (a write to a register that an earlier instruction is still reading) nor a write-after-write (WAW) data dependency, the output buffer writes the operation result into the vector register corresponding to the real-time write address. If the write dependency processing unit determines that either dependency exists, the output buffer back-pressures the vector array so that the vector array pauses vector operation processing.
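The back-pressure behavior of the output buffer in [0016] can be illustrated with a small cycle-level model in Python. The single-entry buffer depth, the set of blocked cycles standing in for the write dependency processing unit's decision, and all names are assumptions made for the sketch.

```python
def run(num_results, blocked_cycles):
    """Simulate the vector array feeding a single-entry output buffer.

    blocked_cycles: cycles in which the write is not allowed (a write
                    dependency is assumed to be outstanding in those cycles).
    Returns (writes, stalls): writes is a list of (cycle, result index),
    stalls counts cycles in which the vector array had to pause.
    """
    writes, stalls = [], 0
    pending = None            # result held in the output buffer, if any
    produced, cycle = 0, 0
    while len(writes) < num_results:
        blocked = cycle in blocked_cycles
        if pending is not None and not blocked:
            writes.append((cycle, pending))   # write back to the register file
            pending = None
        if pending is None:
            if produced < num_results:
                pending = produced            # array delivers the next result
                produced += 1
        elif produced < num_results:
            stalls += 1                       # buffer full: array is back-pressured
        cycle += 1
    return writes, stalls


writes, stalls = run(num_results=3, blocked_cycles={1, 2})
```

With writes blocked in cycles 1 and 2, the array pauses for two cycles and the three results are written in cycles 3, 4, and 5; with no blocked cycles there are no stalls.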

[0017] In some embodiments, the system further includes: a decoding and scheduling unit; the vector execution unit includes: an instruction queue; The decoding and scheduling unit is used to decode the acquired first vector instruction to obtain decoding information; send the first vector instruction to the instruction queue of the corresponding vector execution unit according to the operation type field in the decoding information; and send the first identifier of the first vector instruction to the status recording unit based on the source register field and the target register field in the decoding information.

[0018] This invention also provides a vector processor method to achieve data segment granularity dependency determination in long vector chain execution scenarios, avoiding pipeline bubbles and resource idleness caused by conservative arbitration of the entire vector register granularity, and improving the utilization rate of vector execution units and the overall execution efficiency and throughput of vector processors.

[0019] This vector processor method includes the following steps. The status recording unit receives the first identifier of the first vector instruction and stores the first identifier in the corresponding vector register status information table. The vector execution unit sends the first vector instruction and its corresponding arbitration request instruction to the arbitrator. Based on the arbitration request instruction and the status information table of each vector register, the arbitrator determines whether the first vector instruction has a read-after-write (RAW) data dependency, that is, whether it would read a data segment that an earlier instruction has not yet written; if not, it determines that the first vector instruction has obtained execution permission. The address generation unit determines the real-time read address and / or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution permission, based on the source vector register field, the target vector register field, and the vector configuration parameters. The vector execution unit obtains read data and / or write data from the vector register file based on the real-time read address and / or real-time write address and performs vector processing.

[0020] This invention also provides a chip including the vector processor system described above.

[0021] Specifically, this invention is applicable to wafer-level chips, wherein the wafer-level chip can be configured with multiple computing cores, some or all of which include the matrix multiplication and addition operation unit of this invention; the chip can be used in scenarios such as AI large model training and high-performance scientific computing, and by integrating a high-density MAC array at the wafer-level scale, the advantages of this invention, such as multi-data type fusion, low power consumption, and high parallelism, are realized.

[0022] This invention also provides a board card including the above-mentioned chip.

[0023] This invention also provides an electronic device including the aforementioned board card.

[0024] The vector processor system and method provided in this invention have the following features. The vector processor system places a status recording unit on the vector register file side to centrally record the access status of each vector register and the identifiers of the vector instructions accessing it. During the vector instruction issuance phase, an arbitrator uses this status information to determine read-after-write (RAW) data dependencies between vector instructions, that is, reads of data segments that an earlier instruction has not yet written. A vector instruction therefore obtains execution permission as soon as its dependency conditions are met, which reduces the waiting caused by conservative dependency determination. An address generation unit generates real-time read addresses and / or real-time write addresses for data segments based on the register fields of the vector instructions that have obtained execution permission and the vector configuration parameters. This allows the vector execution unit to access the vector register file segment by segment in a pipelined manner, in parallel with the computation process. The result is fewer pauses and less switching overhead during instruction issuance and execution in long vector and high-throughput computation scenarios, a steadier supply of work to the execution unit and better resource utilization, and higher overall execution efficiency and system throughput of the vector processor without a significant increase in hardware complexity.

Attached Figure Description

[0025] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. In the drawings:
Figure 1 is a schematic diagram of the vector processor system in an embodiment of the present invention;
Figure 2 is a schematic diagram of the vector register structure in an embodiment of the present invention;
Figure 3 is a schematic diagram of the execution pipeline of the vector processor system in an embodiment of the present invention;
Figure 4 is a schematic diagram of the structure of the vector state recording unit in an embodiment of the present invention;
Figure 5 is a schematic diagram of the state machine of the instruction queue in an embodiment of the present invention;
Figure 6 is a schematic diagram of the vector instruction execution flow in one embodiment of the present invention;
Figure 7 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 8 is a schematic diagram of the arbitrator execution process in an embodiment of the present invention;
Figure 9 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 10 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 11 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 12 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 13 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention;
Figure 14 is a schematic diagram of the vector instruction execution flow in another embodiment of the present invention.

Detailed Implementation

[0026] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be further described in detail below with reference to the accompanying drawings. Here, the illustrative embodiments and their descriptions are used to explain the present invention, but are not intended to limit the present invention. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of this application can be arbitrarily combined with each other. The acquisition, storage, use, and processing of data in the technical solutions of this application all comply with relevant laws and regulations. The user information in the embodiments of this application is obtained through legal and compliant means, and the acquisition, storage, use, and processing of user information have been authorized and agreed upon by the customer.

[0027] To facilitate understanding of the technical solution provided in this application, the relevant content of the technical solution in this application will be explained below.

[0028] In long vector chain execution scenarios, existing vector processors typically determine data dependencies at the granularity of the entire vector register: a subsequent vector instruction may start only after the preceding vector instruction has completed its read / write operations on all data segments of the vector register. As a result, data dependencies are resolved late, vector execution units sit idle for more cycles, and pipeline bubbles readily form when switching between multiple vector instructions of the same type. To overcome these problems, this application provides a vector processor system. A status recording unit built at the granularity of vector register data segments tracks the read / write instruction identifiers of each vector register and the access status of its data segments. Based on the vector register status information table, an arbitrator determines read-after-write, write-after-read, and write-after-write dependencies at data segment granularity. A subsequent vector instruction accessing the same vector register can thus start its read operation early, even when the preceding vector instruction has written only some of the data segments.

[0029] By combining the read address queue and write address queue in the address generation unit, data dependency arbitration and resource dependency arbitration are decoupled, so that vector instructions whose data dependencies have been resolved can enter the address generation process earlier. Once the preceding vector instruction completes the read / write operation on its last data segment and releases the corresponding port resources, execution can switch directly to the next vector instruction. Combined with the pipelined processing and backpressure control mechanism of the vector execution unit, this achieves asynchronous decoupling of read / write dependencies and a continuous supply of vector instructions. It improves the utilization rate of the vector execution unit, reduces pipeline bubbles generated during vector instruction switching, and enhances the overall execution efficiency and system throughput of the vector processor in long vector and high-throughput computing scenarios.

[0030] As shown in Figure 1, the present invention provides a vector processor system, including: a status recording unit (Vector Register File Scoreboard, VScoreboard), a vector execution unit (VEU), a vector register file (VRF), an arbitrator (Vector Register File Arbiter, VRF Arbiter), and an address generation unit (Vector Register File Address Generator, VRFAddrgen).

[0031] The status recording unit is used to receive the first identifier of the first vector instruction and store the first identifier in the corresponding vector register state information table (VState).

[0032] The vector execution unit is used to send the first vector instruction and its corresponding arbitration request instruction to the arbitrator (step ② in the figure).

[0033] The arbitrator is used to determine, based on the arbitration request instruction and the status information table of each vector register, whether the first vector instruction has a read-after-write (RAW) data dependency, that is, whether it would read a data segment that an earlier instruction has not yet written. If no such dependency exists, the first vector instruction is granted execution permission.

[0034] The address generation unit is used to determine the real-time read address and / or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution permission, based on the source vector register field, the target vector register field, and the vector configuration parameters.

[0035] The vector execution unit retrieves read and / or write data from the vector register file based on the real-time read address and / or real-time write address to perform vector processing.

[0036] According to the above embodiments, the vector processor system places a status recording unit on the vector register file side to centrally record the access status of each vector register and the identifiers of the vector instructions accessing it. During the vector instruction issuance phase, an arbitrator uses this status information to determine read-after-write (RAW) data dependencies between vector instructions, that is, reads of data segments that an earlier instruction has not yet written. A vector instruction therefore obtains execution permission as soon as its dependency conditions are met, which reduces the waiting caused by conservative dependency determination. The address generation unit generates real-time read addresses and / or real-time write addresses for data segments based on the register fields of the vector instructions that have obtained execution permission and the vector configuration parameters. This allows the vector execution unit to access the vector register file segment by segment in a pipelined manner, in parallel with the computation process. The result is fewer pauses and less switching overhead during instruction issuance and execution in long vector and high-throughput computation scenarios, a steadier supply of work to the execution unit and better resource utilization, and higher overall execution efficiency and system throughput of the vector processor without a significant increase in hardware complexity.

[0037] In this embodiment of the invention, a chained pipelined execution mechanism is employed to support the RVV (RISC-V Vector Extension) instruction set. As shown in Figure 2, the bit width VLEN of the vector register is greater than the data path bit width DLEN that the vector execution unit can process in a single clock cycle. Based on this structure, the vector data in a vector register must be processed in segments by the vector execution unit across multiple clock cycles. In each clock cycle, the vector execution unit reads only a segment of data of length DLEN from the vector register of bit width VLEN and performs operations on that segment. Each storage area of length DLEN in the vector register is defined as a data segment (Field).

[0038] Therefore, a vector register contains VLEN / DLEN data segments, and it takes VLEN / DLEN clock cycles to complete a read or write operation on all data segments of that vector register. Given this segmented access pattern, the read / write status used to determine data dependencies need not be recorded at the granularity of the entire vector register. Instead, the read / write status of each data segment is recorded individually, and the data dependencies between vector instructions are determined at data segment granularity. This provides the foundation for the fine-grained dependency arbitration and pipelined parallel execution described later.
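To make the benefit of segment-level tracking concrete, the idealized Python model below compares the cycle count of a dependency chain under the two granularities. It is a back-of-the-envelope sketch under stated assumptions: one data segment is processed per cycle, each instruction in the chain depends on the previous one and runs on its own execution unit, a consumer can trail its producer by exactly one cycle, and there are no other stalls. The function name and the example values of VLEN and DLEN are illustrative.

```python
def chain_cycles(vlen_bits, dlen_bits, chain_len, segment_level):
    """Cycles to finish a chain of dependent vector instructions (idealized)."""
    n_segments = vlen_bits // dlen_bits          # VLEN / DLEN data segments
    if segment_level:
        # Each consumer starts one cycle behind its producer.
        return n_segments + (chain_len - 1)
    # Register-level tracking: each consumer waits for the whole register.
    return n_segments * chain_len


whole_register = chain_cycles(4096, 512, chain_len=4, segment_level=False)
per_segment = chain_cycles(4096, 512, chain_len=4, segment_level=True)
print(whole_register, per_segment)
```

With VLEN = 4096 and DLEN = 512 there are 8 data segments per register, so a chain of four dependent instructions takes 32 cycles in the register-level model and 11 cycles in the segment-level model.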

[0039] In some embodiments, as shown in Figure 1, the vector processor system also includes a decoding and scheduling unit, and the vector execution unit includes an instruction queue.

[0040] The decoding and scheduling unit decodes the acquired first vector instruction to obtain decoding information. Based on the operation type field in the decoding information, the first vector instruction is sent to the instruction queue of the corresponding vector execution unit (corresponding to step ① in the diagram). Based on the source register field and destination register field in the decoding information, the first identifier of the first vector instruction is sent to the status recording unit (corresponding to step ① in the diagram).

[0041] In embodiments of the present invention, as shown in Figure 3, in the decoding and scheduling stage of the vector processor, the decoding and scheduling unit decodes multiple vector instructions obtained from the central processing unit (CPU) to generate decoding information corresponding to each vector instruction.

[0042] The vector instructions of this invention adopt an instruction format conforming to the RISC-V instruction set architecture standard, as shown in Table 1 below. The decoded information includes: the first source vector register (Vector Source Register 1, vs1) field, the second source vector register (Vector Source Register 2, vs2) field, the destination vector register (Vector Destination Register, vd) field, the vector mask register (Vector Mask Register, vm) field, the operation type (opcode) field, and the function field.

[0045] Table 1

[0046] As shown in Figure 1, each vector execution unit has its own independent instruction queue (Inst0 to InstN). After decoding the vector instructions, the decoding and scheduling unit encapsulates each vector instruction according to the operation type (opcode) field in the decoding information and sends it to the corresponding instruction queue. The operation types of the vector instructions include: vector addition (Vector ADD, VADD), vector multiplication (Vector Multiply, VMUL), vector comparison (Vector Compare, VCMP), and vector type conversion (Vector CVT, VCVT). The types of the vector execution units correspond one-to-one with these operation types and include: vector addition execution unit (Vector ADD), vector multiplication execution unit (Vector MUL), vector comparison execution unit (Vector CMP), and vector type conversion execution unit (Vector CVT).
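
As a rough software analogy of this opcode-based dispatch, the sketch below keeps one queue per operation type and appends each decoded instruction to the matching queue. The helper names `make_queues` and `dispatch`, and the use of plain integers as instruction identifiers, are assumptions for illustration only.

```python
from collections import deque

# Operation types named in the paragraph above; one execution unit type per entry.
OPERATION_TYPES = ("VADD", "VMUL", "VCMP", "VCVT")


def make_queues():
    """Create one independent instruction queue per vector execution unit type."""
    return {op: deque() for op in OPERATION_TYPES}


def dispatch(queues, instr_id, opcode):
    """Append the instruction to the queue of the execution unit matching its opcode."""
    if opcode not in queues:
        raise ValueError(f"no execution unit for operation type {opcode!r}")
    queues[opcode].append(instr_id)


queues = make_queues()
for instr_id, opcode in enumerate(["VADD", "VMUL", "VADD", "VCVT"]):
    dispatch(queues, instr_id, opcode)
```

Instructions of the same type keep their program order inside a queue, while different types land in different queues and can be scheduled in parallel.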

[0047] In addition, the decoding and scheduling unit determines the vector registers involved in the vector instruction based on the source vector register field, the target vector register field, and the vector mask register field in the vector instruction, and writes the instruction identifier of the vector instruction into the vector register status information table of the corresponding vector register.

[0048] According to the above embodiments, by uniformly decoding vector instructions and extracting key information such as source registers, target registers, and mask registers during the decoding and scheduling stages, vector instructions can be distributed to the instruction queues of corresponding vector execution units based on the operation type field, supporting parallel scheduling and ordered execution of different types of vector operations. Furthermore, by writing the identifier of the vector instruction into the status information table of the corresponding vector register during the instruction issuance stage, advance registration and centralized management of vector register access relationships are achieved. This provides accurate and timely status information for subsequent data dependency determination by the arbitrator, reducing scheduling delays caused by unclear instruction type matching or vector register access relationships, improving the accuracy of vector instruction scheduling and the continuity of the execution pipeline, and increasing the utilization rate of vector execution units and the overall execution efficiency of vector processors in high-throughput computing scenarios.

[0049] In some embodiments, as shown in Figure 4, the vector register status information table includes: a read instruction table (Read ID List), a write instruction table (Write ID List), a read data segment table (Reading Field ID), and a write data segment table (Writing Field ID).

[0050] The read instruction table is used to store the first identifier of each vector instruction that reads the vector register, and the write instruction table is used to store the first identifier of each vector instruction that writes the vector register.

[0051] The read data segment table stores the second identifier corresponding to the data segment of the vector register that is currently being read, and the write data segment table stores the second identifier corresponding to the data segment of the vector register that is currently being written. The data segment is determined based on the bit width of the vector register and the data bit width of the vector execution unit.

[0052] In this embodiment of the invention, the vector register status information table differs from the traditional scoreboard or reservation station approach, which records read-write dependencies centered on vector instructions. Instead, it manages the status of each vector register in the vector register file.

[0053] The vector register status information table is centered on each vector register. It records the instruction identifiers of the issued vector instructions that read or write the vector register. It also records whether the vector register is being read or written in the current clock cycle, as well as the identifier of the data segment being read or written.

[0054] For each vector register, a corresponding vector register status information table is set in the status recording unit. In this embodiment of the invention, a total of 32 vector register status information tables are set, each corresponding to one of the 32 vector registers.

[0055] The read instruction table records the instruction identifiers of all vector instructions that have been issued and require read operations on the vector register. The read instruction table is implemented as a bit vector whose bit width equals the size of the instruction window. Each bit corresponds to the instruction identifier of one vector instruction; that is, the number of bits matches the number of vector instructions. When a vector instruction to be issued needs to perform a read operation on the vector register, the bit corresponding to its instruction identifier in the read instruction table is set to a valid state, indicating that the vector instruction has been issued to the corresponding vector execution unit but has not yet begun actual execution. After the issued vector instruction completes its read operation on the vector register, the bit corresponding to its instruction identifier in the read instruction table is cleared to zero, indicating that the vector instruction no longer holds a read access relationship with the vector register.

[0056] The write instruction table records the instruction identifiers of all vector instructions that have been issued and require write operations to the vector register. The write instruction table is implemented as a bit vector, with a bit width identical to the instruction window. Each bit in the bit vector corresponds to the instruction identifier of one vector instruction; that is, the number of bits matches the number of vector instructions. When a vector instruction to be issued needs to perform a write operation on the vector register, the bit position corresponding to the vector instruction identifier in the write instruction table is set to a valid state, indicating that the vector instruction has been issued to the corresponding vector execution unit but has not yet begun actual execution. After the issued vector instruction completes its write operation on the vector register, the bit corresponding to the vector instruction identifier in the write instruction table is cleared to zero, indicating that the vector instruction no longer occupies the write access relationship of the vector register.

[0057] The read data segment table records the segment number identifiers of the data segments being read from the vector register during the current clock cycle, indicating the data segment position corresponding to the current read operation. The write data segment table records the segment number identifiers of the data segments being written to the vector register during the current clock cycle, indicating the data segment position corresponding to the current write operation.
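
The four tables can be pictured as one small record per vector register. The sketch below is illustrative only: the class name `VRegStatus`, its method names, and the use of Python integers as bit vectors are assumptions, not part of the described hardware. It also assumes that a bit is cleared when the corresponding read or write access completes.

```python
from dataclasses import dataclass

NUM_VREGS = 32  # the embodiment above uses 32 vector registers


@dataclass
class VRegStatus:
    """Hypothetical software model of one vector register status information table."""
    read_ids: int = 0       # read instruction table: bit i set => instruction i has a pending read
    write_ids: int = 0      # write instruction table: bit i set => instruction i has a pending write
    reading_field: int = 0  # read data segment table: segment currently being read
    writing_field: int = 0  # write data segment table: segment currently being written

    def issue(self, instr_id: int, reads: bool, writes: bool) -> None:
        """Register an issued instruction that reads and/or writes this register."""
        if reads:
            self.read_ids |= 1 << instr_id
        if writes:
            self.write_ids |= 1 << instr_id

    def finish_read(self, instr_id: int) -> None:
        """Clear the read bit once the instruction has read all segments."""
        self.read_ids &= ~(1 << instr_id)

    def finish_write(self, instr_id: int) -> None:
        """Clear the write bit once the instruction has written all segments."""
        self.write_ids &= ~(1 << instr_id)


status_tables = [VRegStatus() for _ in range(NUM_VREGS)]
status_tables[0].issue(instr_id=2, reads=False, writes=True)  # instruction 2 will write v0
status_tables[0].issue(instr_id=3, reads=True, writes=False)  # instruction 3 will read v0
```

Keeping the record per register rather than per instruction means the arbitrator only needs to look up the tables of the registers an instruction touches.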

[0058] According to the above embodiments, the vector register status information table records the access status of vector registers at both register granularity and data segment granularity by providing a read instruction table, a write instruction table, a read data segment table, and a write data segment table. This allows the vector processor system to know which in-flight vector instructions read or write each vector register, and to monitor the progress of each read and write operation at the data segment level in real time. This provides fine-grained and timing-accurate status information for the arbitrator to determine data dependencies such as write-after-read, read-after-write, and write-after-write. It also supports the early release of the corresponding data dependencies once some data segments have been read or written, reducing the waiting and pipeline bubbles caused by conservative dependency determination at whole-register granularity. This improves the parallelism and resource utilization of the vector execution unit and enhances the overall execution efficiency of the vector processor in long vector and high-throughput computing scenarios.

[0059] In some embodiments, as shown in Figure 4, the vector register status information table also includes a read vector register status table (Vector Register is reading) and a write vector register status table (Vector Register is writing). The read vector register status table stores whether the vector register is in a reading state.

[0060] The write vector register status table stores whether the vector register is in a writing state.

[0061] In this embodiment of the invention, the read vector register status table has a bit width of 1 bit, which is used to indicate whether the corresponding vector register is in the state of being read by a certain vector instruction in the current clock cycle. When the bit is in a valid state, it indicates that the vector instruction has entered the execution stage and is performing a read operation on the vector register.

[0062] The write vector register status table is also 1 bit wide, used to indicate whether the corresponding vector register is being written to by a vector instruction in the current clock cycle. When the bit is valid, it indicates that the vector instruction has entered the execution stage and is performing a write operation on the vector register.

[0065] According to the above embodiments, by setting up read vector register status tables and write vector register status tables, it is possible to indicate at the clock cycle granularity whether each vector register is currently in a read or write state. This provides a basis for the arbitrator to determine whether the register port resources are occupied and whether subsequent vector instructions can initiate corresponding read and write accesses, avoiding resource conflicts and reducing unnecessary waiting, thereby improving the continuous supply capability of the vector execution unit and enhancing the overall system execution efficiency.

[0066] In some embodiments, as shown in Figure 5, the instruction queue contains a finite state machine (FSM) entry for each instruction. The state machine stores and controls the instruction states of multiple vector instructions. The instruction states include: Waiting Arbiter, Arbiter Done – Waiting Processing, Processing, and Processing Done – Waiting Exit.
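
The four instruction states can be modelled as a small enumeration. The sketch below assumes a strictly linear progression through the states in the order listed, with the entry leaving the queue after the last state; the actual transitions are defined by Figure 5 and may differ. The names `InstrState` and `advance` are hypothetical.

```python
from enum import Enum


class InstrState(Enum):
    WAITING_ARBITER = "Waiting Arbiter"
    ARBITER_DONE = "Arbiter Done - Waiting Processing"
    PROCESSING = "Processing"
    PROCESSING_DONE = "Processing Done - Waiting Exit"


# Assumed linear order; None means the entry leaves the instruction queue.
_NEXT = {
    InstrState.WAITING_ARBITER: InstrState.ARBITER_DONE,
    InstrState.ARBITER_DONE: InstrState.PROCESSING,
    InstrState.PROCESSING: InstrState.PROCESSING_DONE,
    InstrState.PROCESSING_DONE: None,
}


def advance(state: InstrState):
    """Return the state that follows `state`, or None when the instruction exits."""
    return _NEXT[state]
```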

[0067] As shown in Figure 3, in the instruction issue phase of the vector processor, after receiving the vector instructions sent by the decoding and scheduling unit, the instruction queue determines the current instruction state according to the state machine entry corresponding to each vector instruction, and sends the vector instructions in the Waiting Arbiter state to the arbitrator for data dependency arbitration.

[0068] Based on the aforementioned status recording unit, the arbitrator can arbitrate the data dependencies of vector registers to be accessed by vector instructions that are in a pending arbitration state at the data segment granularity.

[0069] In embodiments of the present invention, as shown in Figure 1 and Figure 3, during the arbitration phase of the vector processor, the arbitrator receives vector instructions in the waiting-for-arbitration state and their corresponding arbitration requests from the instruction queues of the vector execution units. Upon receiving an arbitration request, the arbitrator determines, based on the vector register status information table in the status recording unit, whether the vector register to be accessed by the vector instruction has a data dependency on other preceding vector instructions that are either waiting to execute or executing. These data dependencies include: Read After Write (RAW), Write After Read (WAR), and Write After Write (WAW). A Read After Write dependency indicates that a preceding vector instruction performs a write operation on a certain vector register, while a subsequent vector instruction needs to read that vector register. A Write After Read dependency indicates that a preceding vector instruction needs to read a certain vector register, while a subsequent vector instruction performs a write operation on the same vector register. A Write After Write dependency indicates that two vector instructions sequentially perform write operations on the same register.
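
The three dependency types defined above can be checked mechanically from the registers each instruction reads and writes. The following sketch works at whole-register granularity only and is meant purely to illustrate the definitions; the function name `classify_hazards` and its argument layout are assumptions.

```python
def classify_hazards(earlier_reads, earlier_write, later_reads, later_write):
    """Return the dependency kinds between an earlier and a later instruction.

    earlier_reads / later_reads: sets of source register numbers.
    earlier_write / later_write: destination register number, or None.
    """
    found = set()
    if earlier_write is not None and earlier_write in later_reads:
        found.add("RAW")  # later instruction reads what the earlier one writes
    if later_write is not None and later_write in earlier_reads:
        found.add("WAR")  # later instruction overwrites what the earlier one reads
    if later_write is not None and later_write == earlier_write:
        found.add("WAW")  # both instructions write the same register
    return found
```

For instance, an earlier instruction that writes v0 followed by a later instruction that reads v0 yields a RAW dependency.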

[0070] In some embodiments, as shown in Figure 1, the arbitrator includes a read dependency handler. The read dependency handler determines whether the write instruction table corresponding to the source vector register to be accessed by the first vector instruction contains a second vector instruction that precedes the first vector instruction. If not, it determines that the first vector instruction has no read-after-write (RAW) data dependency. If the write instruction table contains only one such second vector instruction, it determines whether a second identifier exists in the write data segment table corresponding to the source vector register. If the second identifier exists, it determines that the first vector instruction has no read-after-write data dependency. In this application, the first vector instruction is denoted vector instruction A, and the second vector instruction is denoted vector instruction B.

[0071] In this embodiment of the invention, the arbitrator arbitrates the data dependencies between vector instructions that are in a pending arbitration state based on the vector dependency advance arbitration mechanism, so as to support the early release of the corresponding data dependencies when the corresponding conditions are met.

[0072] The read dependency processing unit determines whether the write instruction table corresponding to the source vector register to be accessed by vector instruction A, which is requesting arbitration, contains a vector instruction preceding vector instruction A. If the write instruction table contains no vector instruction preceding vector instruction A, the read-after-write (RAW) data dependency is released; that is, vector instruction A has no read-after-write data dependency.

[0073] If the write instruction table contains only one vector instruction B preceding vector instruction A, the write data segment table corresponding to the source vector register is further read, and it is determined whether the recorded data segment identifier is non-zero. A non-zero data segment identifier in the write data segment table indicates that vector instruction B has completed the write operation of at least one data segment. At this point, vector instruction A can safely read at least the first data segment, so the read-after-write data dependency is released; that is, vector instruction A has no read-after-write data dependency.

[0074] In all other cases, it is determined that a read-after-write data dependency still exists between the vector instructions and cannot be released.
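
The three cases handled by the read dependency processing unit can be sketched as a single predicate over the write instruction table and the write data segment table of the source register. This is a minimal sketch under stated assumptions: instruction identifiers are small integers where a smaller identifier means an older instruction (no wrap-around), the tables are plain integers, and the names `older_ids` and `read_dependency_released` are hypothetical.

```python
def older_ids(id_mask: int, instr_id: int) -> list[int]:
    """Identifiers set in id_mask that precede instr_id (assumption: smaller ID = older)."""
    return [i for i in range(instr_id) if (id_mask >> i) & 1]


def read_dependency_released(write_ids: int, writing_field: int, instr_id: int) -> bool:
    """Decide whether instruction `instr_id` may start reading a source register.

    write_ids:     write instruction table of the source register (bit vector).
    writing_field: write data segment table of the source register.
    """
    older_writers = older_ids(write_ids, instr_id)
    if not older_writers:
        return True   # no preceding writer: nothing to wait for
    if len(older_writers) == 1 and writing_field != 0:
        return True   # the single preceding writer has finished at least field0
    return False      # all other cases: the dependency is kept
```

Note that the sketch only decides when the reader may start; it relies on both instructions advancing one data segment per cycle so that the reader stays behind the writer.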

[0075] Taking a read-after-write (RAW) data dependency as an example, suppose a preceding vector instruction A performs a write operation on vector register v0, and a subsequent vector instruction B performs a read operation on vector register v0. Under the conservative dependency determination method of the prior art, vector instruction B is usually allowed to start reading vector register v0 only after vector instruction A has completed the write operation on all data segments of vector register v0.

[0076] In this example, vector instruction A writes to vector register v0 in data segment order, for example, writing field0, field1, field2, and field3 in sequence. Because vector registers use segmented access and pipelining, once vector instruction A has completed writing the first data segment field0, vector instruction B can safely begin reading field0 without waiting for vector instruction A to complete writing field1, field2, and field3. This achieves advance arbitration of the read-after-write data dependency.

[0077] Figure 6 shows the vector instruction execution flow under the traditional conservative data dependency determination mechanism.

[0078] Suppose there are only three vector instructions, each with a data dependency on the previous one. The first vector instruction A loads data and writes it to vector register v0, the second vector instruction B reads data from vector register v0, performs vector addition, and writes the result to the target vector register v1, and the third vector instruction C reads vector register v1 and stores the result produced by vector instruction B to memory.

[0079] During clock cycles T0 to T3, the vector loading unit (VLOAD Unit) sequentially loads and writes data segments field0 to field3 of vector register v0. During this phase, because vector register v0 is being written, there is a read-after-write (RAW) data dependency on vector register v0. Therefore, the vector addition unit (VADD Unit), which depends on vector register v0, cannot start executing. The vector addition unit must wait for the vector loading unit to complete the write operation on all data segments of vector register v0 before its data dependency on vector register v0 is released.

[0080] During clock cycles T4 to T7, once the vector addition unit has determined that all data segments of vector register v0 have been written, it performs the addition sequentially on each data segment (field0 to field3) of vector register v0 and writes the result to the target vector register v1. During this phase, because vector register v1 is being written, the vector storage unit (VSTORE Unit), which depends on vector register v1, cannot begin execution. The vector storage unit must wait for the vector addition unit to complete the write operation on all data segments of vector register v1 before its data dependency on vector register v1 is released.

[0081] Within clock cycles T8 to T11, once the vector storage unit has determined that all data segments of vector register v1 have been written, the vector storage unit begins to sequentially perform read operations on each data segment field0 to field3 of vector register v1 and writes the corresponding data back to memory, thereby completing the storage process of this set of vector instructions.

[0082] Figure 7 shows the vector instruction execution flow under the vector dependency advance arbitration mechanism of the present invention.

[0083] During clock cycle T0, the vector loading unit writes data to field0 of vector register v0. At this time, because vector register v0 is being written and no data segment has been completed, there is a read-after-write (RAW) data dependency on vector register v0. Therefore, the vector addition unit, which depends on vector register v0, cannot start execution yet.

[0084] Within clock cycle T1, the vector loading unit has completed writing data to data segment field0 of vector register v0 and begins writing data to data segment field1. At this time, the arbitrator detects that data segment field0 has been written completely based on the write data segment table corresponding to vector register v0. Therefore, it does not need to wait for vector instruction A to complete writing all data segments of vector register v0, which allows the vector addition unit to start and read data segment field0 of vector register v0 for operation, and write the corresponding operation result to data segment field0 of vector register v1.

[0085] Within clock cycle T2, the vector loading unit has completed writing data to data segment field1 of vector register v0 and begins writing data to data segment field2. At this time, the vector addition unit reads data segment field1 of vector register v0 and performs the addition operation, writing the result to data segment field1 of vector register v1. Simultaneously, the read-after-write (RAW) data dependency of the vector storage unit on vector register v1 is released, and the vector storage unit begins reading data segment field0 of vector register v1 and performing the corresponding storage operation.

[0086] Within clock cycle T3, the vector loading unit has completed writing data to data segment field2 of vector register v0 and begins writing data to data segment field3. The vector addition unit reads data segment field2 of vector register v0, performs the operation, and writes the result to data segment field2 of vector register v1. Simultaneously, the vector storage unit reads data segment field1 of vector register v1 and performs the corresponding storage operation.

[0087] Within clock cycle T4, the vector loading unit completes the writing of data segment field3 of vector register v0. Simultaneously, the vector addition unit reads data segment field3 of vector register v0, performs the operation, and writes the result into data segment field3 of vector register v1. The vector storage unit reads data segment field2 of vector register v1 and performs the storage operation.

[0088] Within clock cycle T5, the vector addition unit completes the write operation to data segment field3 of vector register v1. Simultaneously, the vector storage unit reads data segment field3 of vector register v1 and performs the corresponding storage operation. Thus, the loading, computation, and storage operations for each data segment of this vector register are completed sequentially under a chained pipelined execution mechanism.

[0089] According to the above embodiments, this invention introduces a vector dependency advance arbitration mechanism at data segment granularity, so that instruction sequences that originally had to execute one after another can overlap in a chained pipelined structure. In the example above, the last operation completes in clock cycle T5 instead of clock cycle T11, that is, in 6 clock cycles instead of 12. When vector instructions are executed in a pipelined manner, a preceding vector instruction writes each completed data segment to the target vector register before it has finished writing all data segments. A subsequent vector instruction does not need to wait for the preceding instruction to finish entirely; it reads the completed data segments from its source vector register in order and enters the execution pipeline, so that different vector instructions overlap their reads, writes, and computation at the data segment level. This significantly reduces the waiting and pipeline bubbles caused by conservative dependency determination, improves the continuous supply capacity and resource utilization of the vector execution units, and thus enhances the instruction-level parallelism of the vector processor and its overall execution efficiency in long vector and high-throughput computing scenarios.
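
The two timelines walked through above (Figure 6 and Figure 7) can be reproduced with a toy schedule. The sketch assumes three dependent instructions (load, add, store), four data segments per register, one segment per instruction per cycle, and that each dependent instruction starts exactly one cycle after its predecessor in the chained case. The function names are hypothetical.

```python
def schedule(n_instr: int, n_fields: int, chained: bool) -> dict[int, list[tuple[int, int]]]:
    """Map each clock cycle index to the (instruction, field) pairs active in that cycle."""
    # Chained: instruction k starts one cycle after instruction k-1.
    # Conservative: instruction k starts only after instruction k-1 has handled every field.
    start_gap = 1 if chained else n_fields
    table: dict[int, list[tuple[int, int]]] = {}
    for k in range(n_instr):
        for field in range(n_fields):
            table.setdefault(k * start_gap + field, []).append((k, field))
    return table


def last_cycle(table: dict[int, list[tuple[int, int]]]) -> int:
    """Index of the final busy clock cycle (0 means T0)."""
    return max(table)


conservative = schedule(n_instr=3, n_fields=4, chained=False)
chained = schedule(n_instr=3, n_fields=4, chained=True)
```

Under these assumptions the conservative schedule ends in cycle T11 and the chained schedule ends in cycle T5, and `chained[2]` shows the load on field2, the add on field1, and the store on field0 all active in T2.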

[0090] In some embodiments, the arbitrator includes a write dependency handler. The write dependency handler determines whether the read instruction table corresponding to the target vector register to be accessed by the first vector instruction contains a second vector instruction that precedes the first vector instruction. If not, it determines that the first vector instruction has no write-after-read (WAR) data dependency. If the read instruction table contains only one such second vector instruction, it determines whether a second identifier exists in the read data segment table corresponding to the target vector register. If the second identifier exists, it determines that the first vector instruction has no write-after-read data dependency.

[0091] In this embodiment of the invention, the write dependency processing unit determines whether the read instruction table corresponding to the target vector register to be accessed by vector instruction A, which is requesting arbitration, contains a vector instruction preceding vector instruction A. If the read instruction table contains no vector instruction preceding vector instruction A, the write-after-read (WAR) data dependency is released; that is, vector instruction A has no write-after-read data dependency.

[0092] If the read instruction table contains only one vector instruction B preceding vector instruction A, the read data segment table corresponding to the target vector register is further read, and it is determined whether the recorded data segment identifier is non-zero. A non-zero data segment identifier in the read data segment table indicates that vector instruction B has completed the read operation of at least one data segment. At this point, vector instruction A can safely write at least the first data segment, so the write-after-read data dependency is released; that is, vector instruction A has no write-after-read data dependency.

[0093] In all other cases, it is determined that a write-after-read data dependency still exists between the vector instructions and cannot be released.

[0094] In some embodiments, the write dependency processing unit is further configured to determine whether a second vector instruction exists in the write instruction table corresponding to the target vector register to be accessed by the first vector instruction. If not, it is determined that the first vector instruction does not have a write-after-write data dependency.

[0095] In this embodiment of the invention, since the vector register file has only one write port, the write dependency processing unit determines whether the write instruction table corresponding to the target vector register to be accessed by vector instruction A, which is requesting arbitration, contains a vector instruction preceding vector instruction A. If the write instruction table contains no vector instruction preceding vector instruction A, the write-after-write data dependency is released; that is, vector instruction A has no write-after-write data dependency.

[0096] In all other cases except those described above, it is determined that there is still a write-after-write data dependency between vector instructions, and the write-after-write data dependency cannot be removed.
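
The two checks performed by the write dependency processing unit can be combined into one predicate over the tables of the target register. As before, this is only a sketch: it assumes that a smaller instruction identifier means an older instruction, models the tables as plain integers, and uses the hypothetical names `older_ids` and `write_dependency_released`.

```python
def older_ids(id_mask: int, instr_id: int) -> list[int]:
    """Identifiers set in id_mask that precede instr_id (assumption: smaller ID = older)."""
    return [i for i in range(instr_id) if (id_mask >> i) & 1]


def write_dependency_released(read_ids: int, reading_field: int,
                              write_ids: int, instr_id: int) -> bool:
    """Decide whether instruction `instr_id` may start writing its target register.

    read_ids / write_ids: read and write instruction tables of the target register.
    reading_field:        read data segment table of the target register.
    """
    if older_ids(write_ids, instr_id):
        return False  # a preceding writer exists; with one write port it is never released early
    older_readers = older_ids(read_ids, instr_id)
    if not older_readers:
        return True   # no preceding reader or writer
    # Exactly one preceding reader that has already read at least field0.
    return len(older_readers) == 1 and reading_field != 0
```

Unlike the reader-side check, a preceding writer always blocks here, which mirrors the single write port constraint described above.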

[0097] According to the above embodiments, the arbitrator, by providing a write dependency processing unit, determines the access relationship between vector instructions at vector register and data segment granularity for both write-after-read (WAR) and write-after-write (WAW) data dependencies. This allows the vector processor to accurately identify whether a preceding vector instruction is reading or writing the target vector register and, in conjunction with the read data segment table, to determine the execution progress of the preceding vector instruction at the data segment level. Thus, when the conditions for advance arbitration of vector dependencies are met, subsequent vector instructions are allowed to start execution earlier, reducing the waiting and pipeline bubbles introduced by conservative determination at whole-register granularity. Under the hardware constraint of a single write port, checking the write instruction table ensures that write accesses are mutually exclusive and avoids write conflicts. This improves the flexibility of vector instruction scheduling and the continuity of the execution pipeline while ensuring data consistency and execution correctness, and also enhances the resource utilization of the vector execution unit and the overall system execution efficiency.

[0098] In some embodiments, as shown in Figure 8, because the execution of a vector instruction usually takes multiple clock cycles, there is a pipeline delay of several stages between the arrival of the source operands at the input port of the vector execution unit and the arrival of the corresponding result at its output port. If the vector instruction that initiates an arbitration request were granted execution permission only after all of the above data dependencies had been resolved, the pipeline of the vector execution unit would incur a large delay.

[0099] Based on this, the present invention adopts an asynchronous read-write dependency resolution mechanism: when the arbitrator determines that there is no read-after-write (RAW) data dependency between the vector instruction in the waiting arbitration state and the preceding vector instructions, it determines that the vector instruction has execution permission and sends it to the address generation unit to generate the corresponding real-time memory access address.

[0100] The vector execution unit (ALU) retrieves the source operands from the vector register file based on the real-time memory access address and performs vector operations to generate the result. When the result reaches the ALU's output port, if the arbitrator determines that there is no write-after-read (WAR) or write-after-write (WAW) data dependency between the vector instruction and the preceding vector instructions, the ALU is allowed to output the result normally and write it to the vector register file. If the arbitrator determines that a write-after-read data dependency and / or a write-after-write data dependency still exists, it sends a backpressure control signal to the ALU to stall its execution pipeline, causing the ALU to suspend data processing until these data dependencies are resolved.

[0101] In some embodiments, within the RVV instruction set, the vector length (VL) represents the number of elements that a vector instruction needs to process. The vector length is constrained by hardware resources, with an upper limit of the maximum vector length VLMAX. However, in practical applications, the number of vector elements to be processed is usually greater than VLMAX. Therefore, a strip-mining mechanism is used to divide the vector data to be processed into multiple data blocks, and in each loop iteration vector instructions of the same operation type are executed to process the VL elements of one data block. Thus, in a vector processor, each vector execution unit typically receives multiple vector instructions of the same operation type. These vector instructions may have not only data dependencies but also resource dependencies, for example on vector register access ports.
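
Strip-mining can be sketched as a loop that hands out at most VLMAX elements per iteration. The sketch below uses the RVV relation VLMAX = LMUL × VLEN / SEW, where SEW is the element width and LMUL the register group multiplier, restricted here to integer LMUL. The simple `min` rule for choosing VL is one permitted choice and is an assumption of this sketch, as are the function names and the example sizes.

```python
def vlmax(vlen_bits: int, sew_bits: int, lmul: int = 1) -> int:
    """Maximum vector length in elements: LMUL * VLEN / SEW (integer LMUL only in this sketch)."""
    return lmul * vlen_bits // sew_bits


def strip_mine(n_elements: int, max_vl: int):
    """Yield (start_index, vl) for each loop iteration, with vl = min(remaining, max_vl)."""
    start = 0
    while start < n_elements:
        vl = min(n_elements - start, max_vl)
        yield start, vl
        start += vl


# Assumed example: 10 elements of 32 bits on a 128-bit register, so VLMAX = 4.
blocks = list(strip_mine(n_elements=10, max_vl=vlmax(vlen_bits=128, sew_bits=32)))
```

Each yielded block corresponds to one loop iteration, and every iteration issues the same sequence of vector instruction types, which is why each execution unit sees a stream of same-type instructions.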

[0102] Based on the chained pipelining execution mechanism adopted in this invention, when a vector instruction needs to access multiple vector registers, it can release the corresponding data dependencies at the data segment level after completing the access to the first data segment of the first vector register. However, unlike data dependencies, the corresponding read port resources can only be released after the vector instruction has completed reading the data segments of all source operands, and the corresponding write port resources can only be released after the vector instruction has completed writing the data segments of all destination operands. Therefore, the release of data dependencies and the release of resource dependencies do not occur synchronously in time.

[0103] Furthermore, in a vector processor, the read and write addresses of vector registers by vector instructions in each clock cycle do not directly use the original addresses indicated by the source vector register field and the destination vector register field. Instead, the corresponding real-time access address needs to be calculated based on parameters such as the original address, vector length (VL), vector register grouping multiple (LMUL), and element bit width (SEW).
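
One possible way to derive the per-cycle access addresses is sketched below in Python. The application does not give the exact formula, so this is only an assumption: each address is modeled as a pair (register index, data segment index), the number of segments touched is derived from VL and SEW, and only integer LMUL values are handled. The names `segment_addresses`, `vlen`, and `dlen` are hypothetical.

```python
def segment_addresses(base_reg: int, vl: int, sew: int, lmul: int,
                      vlen: int, dlen: int) -> list[tuple[int, int]]:
    """Return the (register, segment) pairs accessed, one per clock cycle.

    Assumptions (not from the source): vlen is a multiple of dlen, lmul is a
    positive integer, and all sizes are given in bits.
    """
    if vlen % dlen != 0:
        raise ValueError("vlen must be a multiple of dlen")
    segs_per_reg = vlen // dlen
    needed = -(-(vl * sew) // dlen)           # ceil(vl * sew / dlen)
    total = min(needed, lmul * segs_per_reg)  # never leave the register group
    return [(base_reg + k // segs_per_reg, k % segs_per_reg)
            for k in range(total)]
```

With VLEN = 4 × DLEN and LMUL = 2, a full-length instruction starting at v0 visits v0 segments 0 to 3 and then v1 segments 0 to 3, which matches the register grouping used in the later example.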

[0104] Therefore, there is an objective address generation delay between when a vector instruction obtains execution permission and when the corresponding real-time read / write address is generated. If a new data dependency arbitration request is initiated only after the resource dependency is completely resolved, or if resource dependency arbitration is performed simultaneously with data dependency arbitration, additional pipeline bubbles can easily be introduced when switching between multiple vector instructions, thereby reducing the continuous supply capacity of the execution unit and affecting the system resource utilization.

[0105] To solve the above problems, as shown in Figure 1, the address generation unit includes: a real-time read address generator (rt raddr gen), a real-time write address generator (rt waddr gen), a read address queue (read addr fifo), and a write address queue (write addr fifo).

[0106] The read address queue and write address queue are used to receive the first vector instruction that has been granted execution privileges.

[0107] After the read operation on the last data segment of the second vector instruction is completed, the real-time read address generation unit retrieves the first vector instruction from the read address queue. It generates a real-time read address based on the first vector instruction, its vector length (VL), resource dependencies, and the vector register grouping multiple (LMUL), and updates the read status of the vector register in the read vector register status table.

[0108] The real-time write address generation unit retrieves the first vector instruction from the write address queue after the write operation of the last data segment of the second vector instruction is completed. It generates a real-time write address based on the first vector instruction, its vector length, resource dependencies, and vector register grouping multiple, and updates the write status of the vector registers to the write vector register status table.

[0109] In this embodiment of the invention, a multi-instruction look-ahead dependency-resolution caching mechanism is employed to asynchronously decouple data dependency arbitration from resource dependency arbitration. The read address queue caches vector instructions whose write-after-read data dependency has been resolved, i.e., vector instructions that have obtained execution permission. The write address queue caches vector instructions whose read-after-write and write-after-write data dependencies have been resolved. Vector instructions whose data dependencies have been resolved do not need to wait for the corresponding resource dependencies to be resolved; they are pushed directly into the corresponding address queue for caching.

[0110] During the address generation (AddrGen) phase of the vector processor, the arbiter sends the vector instruction with execution privileges to the address generation unit. The address generation unit distributes the vector instruction to the read address queue and / or write address queue based on its operation type. When the preceding vector instruction in the read or write address queue completes its read or write operation on the last data segment, the corresponding vector instruction is popped from the address queue, effectively releasing the corresponding register port resources—that is, resolving resource dependencies. Subsequently, the real-time read address generation unit and / or the real-time write address generation unit directly generate the corresponding real-time read address or real-time write address for the next vector instruction in the address queue, enabling bubble-free continuous execution when switching between multiple vector instructions.
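
The queue behavior described in this paragraph can be modeled with a few lines of Python. This is a behavioral sketch only, not the hardware structure of the application: the class name `AddressQueue` and its methods are hypothetical, and the instruction at the head of the queue is assumed to be the one that currently owns the register port.

```python
from collections import deque

class AddressQueue:
    """Hypothetical model of the read (or write) address queue."""

    def __init__(self):
        self._fifo = deque()

    def push(self, inst):
        # Called as soon as the data dependency of inst is resolved,
        # without waiting for the port to become free.
        self._fifo.append(inst)

    def owner(self):
        # The head entry is assumed to own the register port.
        return self._fifo[0] if self._fifo else None

    def release(self):
        # Called when the owner finishes its last data segment: the port
        # is released and the next cached instruction, if any, takes over.
        self._fifo.popleft()
        return self.owner()
```

Because the next instruction is already cached in the queue when `release` is called, address generation for it can start in the following cycle, which is the bubble-free switching the paragraph refers to.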

[0111] The real-time read address generation unit and the real-time write address generation unit generate real-time read addresses and / or real-time write addresses based on the resource dependencies, vector instructions and their vector lengths and vector register grouping multiples within the current clock cycle, and update the read and write status of the vector registers to the read vector register status table and / or write vector register status table.

[0112] For example, taking a scenario where LMUL = 2 (i.e., two vector registers cascaded) as an example, the bit width of each vector register is VLEN = 4 × DLEN, meaning that one vector register includes 4 data segments. The arbiter consecutively issues two vector addition instructions of the same operation type. Vector instruction inst0 is vadd.vs v2, v0, s0, and vector instruction inst1 is vadd.vs v2, v0, s1. Both of these vector addition instructions use vector register v0 as the source register and vector register v2 as the destination register.

[0113] With LMUL = 2 configured, the source register group is formed by cascading vector register v0 and vector register v1 to form a vector register group with a bit width of 8×DLEN, and the destination register group is formed by cascading vector register v2 and vector register v3 to form a vector register group with a bit width of 8×DLEN.

[0114] The vector execution unit reads or writes a data segment of length DLEN from the vector register in each clock cycle. The vector execution unit processes vector addition instructions in a pipelined manner, with an execution latency of 3 clock cycles. This means that from the arrival of the source operand at the input of the vector addition unit, the process goes through three stages: data fetching, computation, and result output. The calculated result arrives at the output of the vector addition unit in the third clock cycle.

[0115] As shown in Figure 9, "input" indicates that the source operands read by the vector addition unit from the source vector register group are sent to the input terminal of the vector addition unit, and "output" indicates that the vector addition unit outputs the completed operation result from its output terminal.

[0116] In stage 0, the write-after-read, read-after-write, and write-after-write dependencies of the vector instruction inst0 are all removed, and inst0 is sent to the read address queue and write address queue, respectively. Subsequently, the read address generation unit generates the real-time read address of the first data segment v0_0 of the source vector register v0 corresponding to the vector instruction inst0. The vector register file reads the source operand corresponding to data segment v0_0 from the source vector register v0 based on this real-time read address, and sends the source operand to the input of the vector addition unit through the read port. The vector addition unit then performs pipelined operations on the source operand.

[0117] As shown in Figure 10, in stage 1, the vector instruction inst0 is reading the source operand corresponding to data segment v0_3 in the source vector register v0. Simultaneously, the vector addition unit has completed the operation on the source operand in data segment v0_0 and writes the result to data segment v2_0 of the target vector register v2. At this point, the write-after-read data dependency of the vector instruction inst1 has been released. However, since the vector instruction inst0 still occupies the read port of the vector register file and its read operation has not yet finished, the corresponding port resource dependency has not been released. Therefore, the vector instruction inst1 is cached in the read address queue and waits for the read port resource to be released before initiating a read access.

[0118] As shown in Figure 11, in stage 2, the vector instruction inst0 is reading the source operand corresponding to the last data segment v1_3 in the source vector register v1. Simultaneously, the vector addition unit has completed the operation on the source operand corresponding to data segment v1_0 and writes the result to data segment v3_0 of the target vector register v3. At this point, the read-after-write and write-after-write data dependencies of the vector instruction inst1 have been resolved. However, since the vector instruction inst0 still occupies the write port of the vector register file and its write operation has not yet finished, the corresponding write port resource dependency has not been resolved. Therefore, the instruction information of the vector instruction inst1 is pushed into the write address queue for caching and waits for the write port resource to be released before initiating a write access.

[0119] As shown in Figure 12, in stage 3, the vector instruction inst0 has completed all read operations on the source vector register group, the read address queue pops the vector instruction inst0, and the corresponding vector register file read port resources are released. Subsequently, the vector instruction inst1 begins to read the source operand corresponding to the first data segment v0_0 in the source vector register v0 according to the access address generated by the real-time read address generation unit. At the same time, the vector addition unit has completed the operation on the source operand corresponding to data segment v1_1 and writes the result into data segment v3_1 of the target vector register v3.

[0120] As shown in Figure 13, in stage 4, the vector instruction inst1 is reading the source operand corresponding to data segment v0_2 in the source vector register v0. At the same time, the vector addition unit has completed the operation on the source operand in data segment v1_3 and writes the result into data segment v3_3 of the target vector register v3.

[0121] As shown in Figure 14, in stage 5, the vector instruction inst1 reads the source operand corresponding to data segment v0_3 in the source vector register v0. At this time, the vector instruction inst0 has completed all write operations on the target vector register group, the write address queue pops the vector instruction inst0, and the corresponding vector register file write port resource is released. Simultaneously, the vector addition unit has completed the operation of inst1 on the source operand corresponding to data segment v0_0 and writes the result to data segment v2_0 of the target vector register v2.
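
The stages walked through above can be reproduced with a small cycle-level schedule in Python. The sketch rests on assumptions that the application does not state explicitly: reads are issued back to back with one data segment per cycle, each result is written exactly 3 cycles after its operand is read, and stages 0 to 5 correspond to cycles 0, 3, 7, 8, 10, and 11. The names `timeline` and `seg_name` are hypothetical.

```python
SEGS_PER_REG = 4                # VLEN = 4 x DLEN
LMUL = 2                        # two cascaded registers per group
LATENCY = 3                     # cycles from operand read to result write
SEGS = LMUL * SEGS_PER_REG      # data segments per instruction

def seg_name(base_reg: int, k: int) -> str:
    """Name of the k-th data segment of the group starting at base_reg."""
    return f"v{base_reg + k // SEGS_PER_REG}_{k % SEGS_PER_REG}"

def timeline(n_insts: int = 2, src: int = 0, dst: int = 2) -> dict:
    """Map each cycle to (read, write); each is (inst, segment) or None."""
    events = {}
    for i in range(n_insts):
        for k in range(SEGS):
            read_cycle = i * SEGS + k          # reads never leave a gap
            write_cycle = read_cycle + LATENCY
            events.setdefault(read_cycle, [None, None])[0] = (f"inst{i}", seg_name(src, k))
            events.setdefault(write_cycle, [None, None])[1] = (f"inst{i}", seg_name(dst, k))
    return {cycle: tuple(pair) for cycle, pair in sorted(events.items())}
```

Under these assumptions, cycle 8 shows inst1 reading v0_0 while inst0 is still writing v3_1, and cycle 11 shows inst1 reading v0_3 while its own first result is written to v2_0, so neither the read port nor the write port idles when execution switches from inst0 to inst1.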

[0122] According to the above embodiments, by setting up read address queues and write address queues in the address generation unit, and cooperating with real-time read address generation units and real-time write address generation units, vector instructions that have obtained execution permissions can enter the address generation process in advance after data dependencies are resolved, without waiting for port resource dependencies to be resolved simultaneously. This achieves decoupling of data dependency arbitration and resource dependency arbitration. When the previous vector instruction completes the read / write of the last data segment and releases the corresponding port resources, the address generation unit can immediately generate a real-time memory access address for the next vector instruction in the address queue and start the access. This avoids pipeline bubbles during instruction switching, improves the continuous supply capability and port resource utilization of the vector execution unit, and thus improves the overall execution efficiency of the vector processor in long vector and high-throughput computing scenarios.

[0123] In some embodiments, the vector processor system of the present invention can be a coarse-grained reconfigurable vector processor, but the present invention is not limited thereto. This coarse-grained reconfigurable vector processor requires that multiple vector execution units can be flexibly configured into different vector pipeline structures to adapt to the needs of different operators or computation modes.

[0124] This invention introduces a vector dependency advance arbitration mechanism and a multi-instruction advance dependency caching mechanism. Based on the scheduling result of the current instruction stream, multiple vector execution units are dynamically connected to form a computing array that matches the instruction stream. This achieves the execution effect of continuous instruction stream supply and continuous data stream flow between multiple vector processing units.

[0125] Meanwhile, the address generation unit switches the real-time memory access address of each vector processing unit to the vector register file in real time, and uses the vector register file as the data exchange medium between each vector processing unit. This enables flexible configuration of the data path between multiple vector processing units, thereby supporting dynamic reconfiguration of the vector pipeline structure and improving the adaptability of the coarse-grained reconfigurable vector processor to different computing tasks.

[0126] In some embodiments, as shown in Figure 1, the vector execution unit includes an input buffer, a vector array, and an output buffer. The vector register file is used to receive real-time read addresses, obtain read data from the vector registers corresponding to the real-time read addresses, and output the read data to the input buffer.

[0127] The vector array is used to retrieve read data from the input buffer, perform vector operations based on the read data, and output the results to the output buffer.

[0128] The output buffer is used to write the operation result into the vector register corresponding to the real-time write address if the write dependency processing unit determines that the first vector instruction does not have a read-after-write data dependency or a write-after-write data dependency. If the write dependency processing unit determines that the first vector instruction has a read-after-write data dependency or a write-after-write data dependency, it back-pressures the vector array to cause the vector array to pause vector operation processing.

[0129] In embodiments of the present invention, as shown in Figure 1 and Figure 2, in the operand fetch stage of the vector processor, the corresponding source operand is read from the 32 vector registers in the vector register file according to the real-time read address sent by the address generation unit, and the source operand is sent to the input buffer of the vector execution unit.

[0130] During the execution phase of the vector processor, the vector array reads data from the input buffer and performs corresponding vector operations based on the source operands. After completing the vector operations, the vector array outputs the results to the output buffer.

[0131] During the write-back phase of the vector processor, if the write dependency processing unit determines that the read-after-write data dependency and write-after-write data dependency of the vector instruction have been resolved, that is, there is no read-after-write data dependency and write-after-write data dependency between the vector instruction and the preceding vector instruction, then the operation result is written to the corresponding target vector register in the vector register file according to the real-time write address sent by the address generation unit, so as to complete the write-back operation of the operation result.

[0132] If the write dependency processing unit determines that the read-after-write data dependency and / or write-after-write data dependency of the vector instruction has not been resolved (i.e., there is still a read-after-write data dependency and / or write-after-write data dependency between the vector instruction and its preceding vector instruction), the arbitrator sends a backpressure control signal to the vector execution unit to implement backpressure control on its execution pipeline, causing the vector execution unit to suspend data processing. After the aforementioned read-after-write data dependency and write-after-write data dependency are resolved, the operation result is then written to the corresponding target vector register.
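
The write-back gating of the two preceding paragraphs can be sketched as a simple cycle loop in Python. This is only an illustrative model: the function name `run_write_back` and the parameter `dep_clear_cycle` (the first cycle in which the read-after-write and write-after-write dependencies are both resolved) are assumptions, and one result is written per cycle once the dependencies are clear.

```python
def run_write_back(results, dep_clear_cycle):
    """Write results to a modeled register file, stalling while blocked.

    results: list of (write_address, value) pairs waiting in the output buffer.
    Returns (register_file, stall_cycles).
    """
    register_file = {}
    stall_cycles = 0
    cycle = 0
    pending = list(results)
    while pending:
        if cycle < dep_clear_cycle:
            # Dependency still present: backpressure, nothing is written
            # and the vector array holds its pipeline.
            stall_cycles += 1
        else:
            addr, value = pending.pop(0)
            register_file[addr] = value
        cycle += 1
    return register_file, stall_cycles
```

When the dependencies are already resolved on arrival (`dep_clear_cycle = 0`), no stall cycle is counted and the results are written in consecutive cycles.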

[0133] According to the above embodiments, by setting up an input buffer, a vector array, and an output buffer within the vector execution unit, and cooperating with the real-time read / write address generation mechanism of the vector register file, the acquisition, processing, and write-back of vector data can be performed continuously in a pipelined manner, thereby improving the parallelism and throughput of vector operations. The output buffer, combined with the judgment result of the write dependency processing unit, promptly completes the write-back when the write dependency is resolved, and implements backpressure control on the vector array when the write dependency is not yet resolved, avoiding write port conflicts or data overwriting errors. While ensuring data consistency and execution correctness, it reduces unnecessary waiting and pipeline bubbles, improves the continuous supply capacity and hardware resource utilization of the vector execution unit, and thus enhances the overall execution efficiency of the vector processor in long vector and high-throughput computing scenarios.

[0134] This application provides a vector processor method applied to the aforementioned vector processor system. The vector processor method and the vector processor system in one embodiment of this application are based on the same inventive concept and solve the problem on similar principles. Therefore, the implementation of the vector processor method is the same as that of the vector processor system in one embodiment of this application, and repeated details are not described again. The terms "unit" or "module" used below can refer to a combination of software and / or hardware that implements a predetermined function. Although the system described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

[0135] This invention provides a vector processor method comprising: the status recording unit receives the first identifier of the first vector instruction and stores the first identifier in the corresponding vector register status information table.

[0136] The vector execution unit sends the first vector instruction and its corresponding arbitration request instruction to the arbitrator.

[0137] The arbitrator, based on the arbitration request instruction and the status information tables of each vector register, determines whether the first vector instruction has a write-after-read data dependency. If not, it determines that the first vector instruction has execution privileges.

[0138] The address generation unit determines the real-time read address and / or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution privileges, based on the source vector register field, the target vector register field, and the vector configuration parameters. The vector execution unit retrieves read data and / or write data from the vector register file based on the real-time read address and / or real-time write address and performs vector processing.

[0139] In some embodiments, the vector register status information table includes: a read instruction table, a write instruction table, a read data segment table, and a write data segment table.

[0140] The read instruction table and write instruction table store the first identifier of the vector register to be read or written, respectively.

[0141] The read data segment table and write data segment table store the second identifier corresponding to the data segment being read or written by the vector register, respectively. The data segment is determined based on the bit width of the vector register and the data bit width of the vector execution unit.
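
A possible in-memory layout of these four tables is sketched below in Python. The field names, the use of one record per vector register, and the default sizes (32 registers, VLEN = 512 bits, DLEN = 128 bits) are assumptions chosen for illustration; the application only specifies what each table stores.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class RegStatus:
    """Hypothetical per-register record combining the four tables."""
    read_insts: list = field(default_factory=list)   # read instruction table (first identifiers)
    write_insts: list = field(default_factory=list)  # write instruction table (first identifiers)
    read_seg: Optional[int] = None    # read data segment table (second identifier)
    write_seg: Optional[int] = None   # write data segment table (second identifier)

def make_status_table(num_regs: int = 32, vlen: int = 512, dlen: int = 128):
    """Return (table, segments_per_register) for the assumed bit widths."""
    if vlen % dlen != 0:
        raise ValueError("vlen must be a multiple of dlen")
    table = {reg: RegStatus() for reg in range(num_regs)}
    return table, vlen // dlen
```

The number of data segments per register follows directly from the two bit widths, for example 512 / 128 = 4 segments, which bounds the values the segment identifiers can take.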

[0142] In some embodiments, the vector register status information table further includes: a read vector register status table, which stores whether the vector register is in a read state.

[0143] A write vector register status table, which stores whether the vector register is in a write state.

[0144] In some embodiments, the arbitrator includes: a read dependency processing unit, which determines whether a second vector instruction exists in the write instruction table corresponding to the source vector register to be accessed by the first vector instruction, and if so, whether the second vector instruction precedes the first vector instruction. If not, it determines that the first vector instruction does not have a write-after-read data dependency. If the write instruction table contains only one second vector instruction, it determines whether a second identifier exists in the write data segment table corresponding to the source vector register. If the second identifier exists, it determines that the first vector instruction does not have a write-after-read data dependency.

[0145] In some embodiments, the arbitrator includes: a write dependency processing unit, which determines whether a second vector instruction exists in the read instruction table corresponding to the target vector register to be accessed by the first vector instruction, and if so, whether the second vector instruction precedes the first vector instruction. If not, it determines that the first vector instruction does not have a read-after-write data dependency. If the read instruction table contains only one second vector instruction, it determines whether a second identifier exists in the read data segment table corresponding to the target vector register. If the second identifier exists, it determines that the first vector instruction does not have a read-after-write data dependency.

[0146] In some embodiments, the write dependency processing unit further determines whether a second vector instruction exists in the write instruction table corresponding to the target vector register to be accessed by the first vector instruction. If not, it is determined that the first vector instruction does not have a write-after-write data dependency.
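
The checks performed by the read dependency processing unit and the write dependency processing unit can be summarized in the following Python sketch. It makes several assumptions that are not stated in the application: instruction identifiers increase in program order, each table is a dictionary keyed by register number, a missing or `None` entry in a data segment table means that no second identifier exists, and every case not explicitly cleared by the text is treated as a dependency. The function names are hypothetical, and the dependency names follow the terminology of this application.

```python
def read_dependency_exists(inst_id, src_reg, write_inst_table, write_seg_table):
    """Write-after-read check on a source register, as named in the text."""
    earlier = [w for w in write_inst_table.get(src_reg, []) if w < inst_id]
    if not earlier:
        return False                       # no preceding writer
    if len(earlier) == 1 and write_seg_table.get(src_reg) is not None:
        return False                       # sole writer already writing segments
    return True                            # assumed: everything else blocks

def write_dependency_exists(inst_id, dst_reg, read_inst_table, read_seg_table,
                            write_inst_table):
    """Read-after-write and write-after-write checks on a target register."""
    readers = [r for r in read_inst_table.get(dst_reg, []) if r < inst_id]
    read_blocked = bool(readers) and not (
        len(readers) == 1 and read_seg_table.get(dst_reg) is not None
    )
    writers = [w for w in write_inst_table.get(dst_reg, []) if w < inst_id]
    return read_blocked or bool(writers)   # any preceding writer blocks
```

The segment-table test is what enables chaining at data segment granularity: once the single preceding instruction has started to access segments of the register, the later instruction is no longer held back by it.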

[0147] In some embodiments, the address generation unit includes a read address queue and a write address queue, which receive the first vector instruction that has been granted execution privileges.

[0148] After the read operation on the last data segment of the second vector instruction is completed, the real-time read address generation unit retrieves the first vector instruction from the read address queue. Based on the first vector instruction, its vector length, resource dependencies, and vector register grouping multiple, it generates a real-time read address and updates the read status of the vector register in the read vector register status table.

[0149] After the write operation on the last data segment of the second vector instruction is completed, the real-time write address generation unit retrieves the first vector instruction from the write address queue. Based on the first vector instruction, its vector length, resource dependencies, and vector register grouping multiple, it generates a real-time write address and updates the write status of the vector register in the write vector register status table.

[0150] In some embodiments, the vector execution unit includes an input buffer, a vector array, and an output buffer. The vector register file receives the real-time read address, retrieves the read data from the vector register corresponding to the real-time read address, and outputs the read data to the input buffer.

[0151] The vector array reads data from the input buffer, performs vector operations based on the read data, and outputs the results to the output buffer.

[0152] If the write dependency processing unit determines that the first vector instruction does not have a read-after-write data dependency or a write-after-write data dependency, the output buffer writes the operation result into the vector register corresponding to the real-time write address. If the write dependency processing unit determines that the first vector instruction has a read-after-write data dependency or a write-after-write data dependency, the output buffer back-pressures the vector array to cause the vector array to pause vector operation processing.

[0153] In some embodiments, the vector execution unit includes an instruction queue, and the method further includes the following step.

[0154] The decoding and scheduling unit decodes the acquired first vector instruction to obtain decoding information. Based on the operation type field in the decoding information, the first vector instruction is sent to the instruction queue of the corresponding vector execution unit. Based on the source register field and destination register field in the decoding information, the first identifier of the first vector instruction is sent to the status recording unit.
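
The dispatch step performed by the decoding and scheduling unit can be illustrated with the following Python sketch. The dictionary keys (`id`, `op_type`, `src_regs`, `dst_reg`) and the function name `dispatch` are hypothetical, and the status recording unit is reduced here to two per-register lists.

```python
from collections import defaultdict, deque

def dispatch(decoded, inst_queues, read_inst_table, write_inst_table):
    """Route one decoded vector instruction (all field names are assumed)."""
    # The operation type field selects the instruction queue of the
    # matching vector execution unit.
    inst_queues[decoded["op_type"]].append(decoded["id"])
    # The source and destination register fields determine where the first
    # identifier is recorded by the status recording unit.
    for reg in decoded["src_regs"]:
        read_inst_table[reg].append(decoded["id"])
    write_inst_table[decoded["dst_reg"]].append(decoded["id"])

inst_queues = defaultdict(deque)
read_inst_table = defaultdict(list)
write_inst_table = defaultdict(list)
dispatch({"id": 0, "op_type": "vadd", "src_regs": [0], "dst_reg": 2},
         inst_queues, read_inst_table, write_inst_table)
dispatch({"id": 1, "op_type": "vadd", "src_regs": [0], "dst_reg": 2},
         inst_queues, read_inst_table, write_inst_table)
```

After these two calls, both instructions sit in the same instruction queue, and the tables for v0 and v2 list them in program order, which is the information the arbitrator later consults.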

[0155] Those skilled in the art will understand that embodiments of the present invention can be provided as methods, systems, or computer program products. Therefore, the present invention can take the form of a completely hardware embodiment, a completely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention can take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.

[0156] This invention is described with reference to flowchart illustrations and / or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and / or block of the flowchart illustrations and / or block diagrams, and combinations of flows and / or blocks in the flowchart illustrations and / or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, special-purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, produce a device for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0157] These computer program instructions may also be stored in a computer-readable storage medium that can direct a computer or other programmable data processing device to function in a particular manner, such that the instructions stored in the computer-readable storage medium produce an article of manufacture including instruction means that implement the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0158] These computer program instructions may also be loaded onto a computer or other programmable data processing equipment to cause a series of operational steps to be performed on the computer or other programmable equipment to produce a computer-implemented process, such that the instructions that execute on the computer or other programmable equipment provide steps for implementing the functions specified in one or more flows of the flowchart and / or one or more blocks of the block diagram.

[0159] In the description of this specification, the references to terms such as "an embodiment," "a specific embodiment," "some embodiments," "for example," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples.

[0160] The specific embodiments described above further illustrate the purpose, technical solution, and beneficial effects of the present invention. It should be understood that the above descriptions are merely specific embodiments of the present invention and are not intended to limit the scope of protection of the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

Claims

1. A vector processor system, characterized in that it comprises: a status recording unit, a vector execution unit, a vector register file, an arbitrator, and an address generation unit; the status recording unit is used to receive the first identifier of the first vector instruction and store the first identifier in the corresponding vector register status information table; the vector execution unit is used to send the first vector instruction and its corresponding arbitration request instruction to the arbitrator; the arbitrator is used to determine, based on the arbitration request instruction and the status information tables of each vector register, whether the first vector instruction has a write-after-read data dependency relationship, and if not, to determine that the first vector instruction has execution permission; the address generation unit is used to determine the real-time read address and / or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution permission, based on the source vector register field, the target vector register field, and the vector configuration parameters of the first vector instruction that has obtained execution permission; the vector execution unit performs vector processing by acquiring read data and / or write data from the vector register file based on the real-time read address and / or real-time write address.

2. The system according to claim 1, characterized in that the vector register status information table includes: a read instruction table, a write instruction table, a read data segment table, and a write data segment table; the read instruction table and the write instruction table are respectively used to store the first identifier for reading or writing to the vector register; the read data segment table and the write data segment table are respectively used to store the second identifier corresponding to the data segment being read or written by the vector register; wherein the data segment is determined based on the bit width of the vector register and the data bit width of the vector execution unit.

3. The system according to claim 2, characterized in that the vector register status information table also includes: a read vector register status table, used to store whether the vector register is in a read state; and a write vector register status table, used to store whether the vector register is in a write state.

4. The system according to claim 2, characterized in that the arbitrator includes a read dependency processing unit, configured to: determine whether a second vector instruction precedes the first vector instruction in the write instruction table corresponding to the source vector register to be accessed by the first vector instruction; if not, determine that the first vector instruction does not have the write-after-read data dependency relationship; if the write instruction table contains only one second vector instruction, determine whether the second identifier exists in the write data segment table corresponding to the source vector register; and, if the second identifier exists, determine that the first vector instruction does not have the write-after-read data dependency relationship.
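
The decision sequence of claim 4 can be sketched in a few lines. This is a minimal model under stated assumptions, not the claimed circuit: first identifiers are assumed to be integers that increase in issue order (so "precedes" means "has a smaller identifier"), the second identifier is modelled as a segment index, and the branch the claim leaves open (more than one earlier writer, or the segment not yet recorded) is assumed to mean that the dependency exists. `RegisterStatus` and `read_may_proceed` are hypothetical names.

```python
from dataclasses import dataclass, field


@dataclass
class RegisterStatus:
    """Hypothetical per-register slice of the status information table."""
    write_instructions: list[int] = field(default_factory=list)  # first identifiers
    write_segments: set[int] = field(default_factory=set)        # second identifiers


def read_may_proceed(first_id: int, segment_id: int, source: RegisterStatus) -> bool:
    """Return True when the read of `segment_id` has no pending dependency."""
    # Assumption: a smaller identifier means the instruction was issued earlier.
    earlier_writers = [i for i in source.write_instructions if i < first_id]
    if not earlier_writers:
        return True  # no preceding writer: no dependency
    if len(earlier_writers) == 1:
        # A single preceding writer: proceed once the segment is recorded
        # in the write data segment table.
        return segment_id in source.write_segments
    # Assumption: the claim is silent here, so the read is held back.
    return False


status = RegisterStatus(write_instructions=[3], write_segments={0})
print(read_may_proceed(5, 0, status))  # True
print(read_may_proceed(5, 1, status))  # False
```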

5. The system according to claim 2, characterized in that the arbitrator includes a write dependency processing unit, configured to: determine whether a second vector instruction precedes the first vector instruction in the read instruction table corresponding to the target vector register to be accessed by the first vector instruction; if not, determine that the first vector instruction does not have a read-after-write data dependency relationship; if the read instruction table contains only one second vector instruction, determine whether the second identifier exists in the read data segment table corresponding to the target vector register; and, if the second identifier exists, determine that the first vector instruction does not have the read-after-write data dependency relationship.

6. The system according to claim 5, characterized in that the write dependency processing unit is further configured to determine whether a second vector instruction precedes the first vector instruction in the write instruction table corresponding to the target vector register to be accessed by the first vector instruction; and, if not, to determine that the first vector instruction does not have a write-after-write data dependency relationship.
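
The two write-side checks of claims 5 and 6 can be combined in one hedged sketch. The same assumptions apply as for any software model of these claims: identifiers are taken to be integers increasing in issue order, the second identifier is a segment index, and any case the claims do not decide is treated as a dependency. The names `TargetStatus` and `write_blockers` are hypothetical, and the dependency labels follow the wording of the claims.

```python
from dataclasses import dataclass, field


@dataclass
class TargetStatus:
    """Hypothetical status tables of one target vector register."""
    read_instructions: list[int] = field(default_factory=list)
    read_segments: set[int] = field(default_factory=set)
    write_instructions: list[int] = field(default_factory=list)


def write_blockers(first_id: int, segment_id: int, target: TargetStatus) -> set[str]:
    """Return the dependency relationships that still block the write."""
    blockers: set[str] = set()

    # Claim 5: look for preceding readers of the target register.
    earlier_readers = [i for i in target.read_instructions if i < first_id]
    if earlier_readers:
        single_reader_done = (
            len(earlier_readers) == 1 and segment_id in target.read_segments
        )
        if not single_reader_done:
            blockers.add("read-after-write")  # assumption for the undecided branch

    # Claim 6: look for preceding writers of the target register.
    if any(i < first_id for i in target.write_instructions):
        blockers.add("write-after-write")  # assumption: a preceding writer blocks

    return blockers


print(write_blockers(5, 0, TargetStatus([3], {0}, [])))  # set()
print(write_blockers(5, 1, TargetStatus([3], {0}, [])))  # {'read-after-write'}
```

Returning a set rather than a single boolean keeps the two relationships separate, which matches the way claim 8 refers to them individually.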

7. The system according to claim 4, characterized in that the address generation unit includes: a real-time read address generation unit, a real-time write address generation unit, a read address queue, and a write address queue; the read address queue and the write address queue are used to receive the first vector instruction that has obtained execution permission; the real-time read address generation unit is used to obtain the first vector instruction from the read address queue after the read operation on the last data segment of the second vector instruction is completed, generate the real-time read address based on the first vector instruction and its vector length, resource dependency relationship, and vector register grouping multiple, and update the read status of the vector register in the read vector register status table; the real-time write address generation unit is used to obtain the first vector instruction from the write address queue after the write operation on the last data segment of the second vector instruction is completed, generate the real-time write address based on the first vector instruction and its vector length, resource dependency relationship, and vector register grouping multiple, and update the write status of the vector register in the write vector register status table.
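
A rough sketch of how per-segment addresses might be enumerated from the parameters named in claim 7 follows. The claims do not define the address format, so this sketch assumes an address is a (register index, segment index) pair, that the vector length is given in elements, and that the grouping multiple is an integer number of consecutive registers treated as one group. The resource dependency input is not modelled. All names are hypothetical.

```python
from typing import Iterator


def segment_addresses(
    base_register: int,
    vector_length: int,    # assumed to be a count of elements
    element_bits: int,
    register_bits: int,
    datapath_bits: int,
    group_multiple: int,   # assumed: consecutive registers forming one group
) -> Iterator[tuple[int, int]]:
    """Yield (register index, segment index) for each segment to access."""
    if register_bits % datapath_bits != 0:
        raise ValueError("register width must be a multiple of the data path width")
    total_bits = vector_length * element_bits
    if total_bits > group_multiple * register_bits:
        raise ValueError("vector does not fit in the register group")

    segments_per_register = register_bits // datapath_bits
    needed = -(-total_bits // datapath_bits)  # ceiling division
    for n in range(needed):
        # Walk the group register by register, segment by segment.
        yield base_register + n // segments_per_register, n % segments_per_register


# 32 elements of 32 bits in 512-bit registers, 256-bit data path, group of 2.
print(list(segment_addresses(8, 32, 32, 512, 256, 2)))
# [(8, 0), (8, 1), (9, 0), (9, 1)]
```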

8. The system according to claim 6, characterized in that the vector execution unit includes: an input buffer, a vector array, and an output buffer; the vector register file is used to receive the real-time read address, obtain read data from the vector register corresponding to the real-time read address, and output the read data to the input buffer; the vector array is used to obtain the read data from the input buffer, perform vector operation processing based on the read data, and output the operation result to the output buffer; the output buffer is used to write the operation result into the vector register corresponding to the real-time write address if the write dependency processing unit determines that the first vector instruction has neither the read-after-write data dependency relationship nor the write-after-write data dependency relationship; and, if the write dependency processing unit determines that the first vector instruction has the read-after-write data dependency relationship or the write-after-write data dependency relationship, the output buffer applies back-pressure to the vector array so that the vector array pauses vector operation processing.
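
The output buffer behaviour in claim 8 can be pictured with a small software model. This is only a sketch: the register file is assumed to be a dictionary keyed by write address, results are held in a queue until they may be written, and the back-pressure signal is modelled as the boolean return value of a hypothetical `drain` method.

```python
from collections import deque


class OutputBuffer:
    """Hypothetical model of the output buffer in claim 8."""

    def __init__(self) -> None:
        self.pending: deque = deque()  # (write address, result) pairs
        self.register_file: dict = {}  # assumed stand-in for the vector register file

    def push(self, address, result) -> None:
        """Accept one operation result from the vector array."""
        self.pending.append((address, result))

    def drain(self, has_write_dependency: bool) -> bool:
        """Write back pending results; return True when the array must pause."""
        if has_write_dependency:
            return True  # back-pressure: nothing is written this cycle
        while self.pending:
            address, result = self.pending.popleft()
            self.register_file[address] = result
        return False


buf = OutputBuffer()
buf.push((8, 0), 42)
print(buf.drain(has_write_dependency=True))   # True, the array pauses
print(buf.drain(has_write_dependency=False))  # False, the result is written
print(buf.register_file)                      # {(8, 0): 42}
```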

9. The system according to claim 1, characterized in that it also includes a decoding and scheduling unit, and the vector execution unit includes an instruction queue; the decoding and scheduling unit is used to: decode the acquired first vector instruction to obtain decoding information; send the first vector instruction to the instruction queue of the corresponding vector execution unit according to the operation type field in the decoding information; and send the first identifier of the first vector instruction to the status recording unit based on the source register field and the target register field in the decoding information.

10. A vector processor method, characterized in that it comprises: a status recording unit receives a first identifier of a first vector instruction and stores the first identifier in the corresponding vector register status information table; a vector execution unit sends the first vector instruction and its corresponding arbitration request instruction to an arbitrator; the arbitrator determines, based on the arbitration request instruction and the status information table of each vector register, whether the first vector instruction has a write-after-read data dependency relationship, and, if it does not, determines that the first vector instruction has execution permission; an address generation unit determines the real-time read address and/or real-time write address of the data segment to be accessed by the first vector instruction that has obtained execution permission, based on the source vector register field, the target vector register field, and the vector configuration parameters; and the vector execution unit obtains read data and/or write data from the vector register file based on the real-time read address and/or real-time write address and performs vector processing.

11. A chip, characterized in that it comprises the vector processor system according to any one of claims 1 to 9.

12. A circuit board, characterized in that it comprises the chip according to claim 11.

13. An electronic device, characterized in that it comprises the circuit board according to claim 12.