Processing method and apparatus of a vector memory access instruction
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- FALCON TECHNOLOGY (GUANGZHOU) CO LTD
- Filing Date
- 2026-04-01
- Publication Date
- 2026-06-30

Figure CN122308910A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of vector instruction processing technology, and more specifically, to a method and apparatus for processing vector memory access instructions.
Background Technology
[0002] To enhance data-level parallelism, modern processors widely employ vector instruction sets. Vector instructions typically operate on multiple data elements and have long execution cycles. If they were executed sequentially in a scalar pipeline, they would severely clog the scalar pipeline (i.e., scalar instructions would wait for vector instructions to complete before continuing execution), reducing overall performance. To address this, existing technologies have developed architectures that decouple scalar and vector pipelines. Vector instructions are pre-decoded in the scalar pipeline and sent to the vector processor, while the scalar pipeline continues executing subsequent instructions. The vector instructions, on the other hand, are executed independently in the vector processor. This decoupling effectively prevents vector instructions from blocking the scalar pipeline.
[0003] However, vector memory access instructions (such as load and store instructions) typically cannot be committed in advance because they require read / write authentication (i.e., permission checks) for each element to ensure the legality of memory access and precise exception handling. This is because it is impossible to determine whether illegal access exists before authentication is complete. If the instruction is committed before the check is complete, it will be impossible to restore the precise exception state if a permission error is subsequently discovered. Therefore, in existing solutions, vector memory access instructions still have to wait in the scalar pipeline for all elements to be authenticated and executed before they can be committed. This causes the scalar pipeline to pause for a long time at this instruction, becoming a performance bottleneck.
[0004] In view of this, we propose a method and apparatus for processing vector memory access instructions that decouple address authentication from execution and accelerate the authentication process, so that the scalar pipeline is released as early as possible while precise exception handling is preserved and overall system performance is improved.
Summary of the Invention
[0005] The purpose of this invention is to provide a method and apparatus for processing vector memory access instructions, so as to solve the problems mentioned in the background art.
[0006] To address the aforementioned technical problems, one objective of this invention is to provide a method for processing vector memory access instructions, applicable to a processor comprising a scalar pipeline and a vector processor. The scalar pipeline includes, in order, an instruction fetch stage, a decode stage, an execution stage, a memory access stage, and a write-back stage. The method comprises the following steps:
S1. Instruction issuance step: In the scalar pipeline, when a vector memory access instruction flows through any of the decode, execution, memory access, and write-back stages, it is issued to the vector processor. At the same time, it continues to flow in the scalar pipeline as a null instruction until it reaches the write-back stage. In the write-back stage, the scalar pipeline is paused, waiting for the vector processor to return the authentication result.
S2. Address authentication step: The vector processor performs read / write authentication on each memory address involved in the vector memory access instruction in element order to obtain the authentication result. If all elements are authenticated successfully, the vector processor sends a commit permission instruction to the scalar pipeline, and the scalar pipeline completes the commit of the vector memory access instruction in the write-back stage and releases the scalar pipeline resources. If any element fails authentication, address authentication is stopped immediately, the location information of the failed element is recorded in the exception location register, and an exception signal is sent to the scalar pipeline.
S3. Vector execution step: The vector processor executes the memory access operation of the vector memory access instruction. The vector execution step overlaps at least partially in time with the address authentication step and the scalar commit step, and the execution progress of the vector execution step is constrained by the progress of the address authentication step.
[0007] Preferably, in the instruction issuance step, continuing to flow in the scalar pipeline as a null instruction until the write-back stage is reached, and pausing the scalar pipeline in the write-back stage to wait for the vector processor to return the authentication result, includes the following steps: In the scalar pipeline, the vector memory access instruction is transformed into a null instruction, meaning that in subsequent pipeline stages no execution unit performs any substantive operation on it. The null instruction carries only the instruction identifier and sequence information and is passed stage by stage along the scalar pipeline until the write-back stage is reached, while subsequent vector memory access instructions are allowed to continue flowing into the pipeline. If the pause controller in the write-back stage detects that the vector memory access instruction is present in the scalar pipeline as a null instruction and that the authentication result has not yet been returned by the vector processor, it generates a pause signal to prevent the instruction from being committed. At the same time, it no longer allows new instructions to enter the write-back stage and applies backpressure stage by stage back to the instruction fetch stage, until the commit permission instruction is received or an exception is reported.
[0008] Preferably, the authentication result is either authentication passed or authentication failed, where authentication failure includes page fault, permission violation, and address misalignment. If authentication failure is detected for any element, authentication of subsequent elements is stopped immediately, the position information of the failed element is recorded in the exception location register, and an exception signal is sent to the scalar pipeline.
[0009] Preferably, in the vector execution step, the execution progress is constrained by the progress of the address authentication step, and the memory access operation is constrained based on the state of the vector memory access instruction in the scalar pipeline, including: when the instruction is in a speculative state, any memory access operation is prohibited; when the instruction is in a non-speculative state but has not yet been committed in the scalar pipeline, memory access operations are allowed only on elements that have passed authentication; once the instruction has been committed in the scalar pipeline, memory access operations on all remaining elements are allowed.
[0010] Preferably, the address authentication step adopts a block-based batch authentication method, specifically including the following steps: Get the memory address of the element currently to be authenticated; Determine the memory block containing the address based on the preset memory block size; Calculate all consecutive elements whose addresses fall within the memory block in the vector memory access instruction; All elements within the storage block are grouped together, and read / write authentication of all elements in the group is performed at once. If successful, all elements are marked as authenticated successfully; if unsuccessful, only the first element is reported as having failed authentication.
[0011] Preferably, for vector memory access instructions of contiguous address type, the method for calculating all contiguous elements whose addresses fall within the memory block in the vector memory access instruction is as follows: After each address authentication, starting from the address of the element to be authenticated, the element size is incremented sequentially until the boundary of the storage block is reached, thus determining the sequence of elements falling within the same storage block. Except for the first and last address authentication steps, all intermediate address authentication steps query the complete storage block.
[0012] Preferably, for vector memory access instructions of the step address type, calculating all consecutive elements whose addresses fall within the memory block in the vector memory access instruction includes the following steps: A complete address generation unit is used to calculate the complete address of the current element to be authenticated; then N simplified address generation units are used to calculate the low-order addresses of the first to Nth consecutive elements after the current element to be authenticated. Based on the consistency between the carry information generated by the complete address generation unit when calculating the address and the carry information generated by each of the simplified address generation units, it is determined whether the subsequent element is located in the same storage block as the current element. Select the last element that falls within the same block, and use the high-order bits of the first element's address and the low-order bits of the selected element's address to concatenate the base address for address authentication of the next element.
[0013] Preferably, for vector memory access instructions of index address type, when calculating all consecutive elements whose addresses fall within the memory block in the vector memory access instruction, determining whether multiple consecutive elements fall within the same memory block includes the following steps: A complete address generation unit calculates the complete address of the element to be authenticated from the base address and the index value of the current element; N simplified address generation units then calculate the low-order addresses of the subsequent 1st to Nth elements from the base address and their respective index values. A high-order comparator compares the high-order index bits and the low-order carry of each subsequent element with the high-order index bits and the low-order carry of the current element. If the high-order index bits do not match or the low-order carry does not match, the element is considered not to be in the same storage block as the element being authenticated. If both match, the element is considered to be in the same storage block as the element being authenticated.
[0014] A second objective of this invention is to provide a processing apparatus for vector memory access instructions, which applies the processing method for vector memory access instructions described in any one of the above, comprising: An instruction dispatch unit, located in the scalar pipeline, used to dispatch the vector memory access instruction to the vector processor when the instruction flows through any pipeline stage between the decode stage and the write-back stage, to control the instruction to continue flowing in the scalar pipeline as a null instruction, and to block the scalar pipeline when the instruction reaches the write-back stage. An authentication unit, located in the vector processor, used to receive vector memory access instructions issued by the instruction dispatch unit, perform read / write authentication on each element address of the instruction in sequence, and send a commit permission instruction to the scalar pipeline after all elements have passed the check. An execution unit, located in the vector processor and decoupled from the authentication unit, used to execute the memory access operation of the vector memory access instruction. A state control unit, used to control the start / stop and execution range of the execution unit based on the speculative state of the vector memory access instruction in the scalar pipeline.
[0015] Compared with the prior art, the beneficial effects of the present invention are as follows: 1. Vector memory access instructions can be issued early to the vector processor during the decode, execution, or memory access stage of the scalar pipeline. After being issued, the instruction is converted into a null instruction in the scalar pipeline and continues to flow; no execution unit performs any substantive operation on it, and only sequence information is passed along. Vector memory access instructions can therefore be dispatched without waiting for the scalar pipeline to complete all stages, and subsequent scalar instructions can enter the pipeline immediately afterward. This avoids the blocking caused by vector memory access instructions occupying the pipeline for a long time and improves the utilization of the scalar pipeline. Furthermore, before the instruction is committed, authenticated elements are allowed to execute while unauthenticated elements are held back. Data access to legitimate elements thus proceeds during the authentication process and overlaps with it in time, hiding the authentication delay. Once all elements are authenticated, the instruction is committed and the remaining elements execute without further constraint, reducing the overall memory access delay.
[0016] 2. To address the large number of elements in a vector memory access instruction and the high overhead of element-by-element authentication, the invention identifies consecutive elements whose addresses fall within the same memory block and merges them into a single permission query, reducing the number of queries. For the three access modes (contiguous addresses, step addresses, and index addresses), methods for identifying the elements within a memory block are provided. By using hardware techniques such as carry prediction and simplified address generation units, multiple elements can be determined to be in the same block within a single cycle, improving authentication throughput and allowing the scalar pipeline to be released sooner.
Attached Figure Description
[0017] Figure 1 is the overall flowchart of Example 1; Figure 2 is a data path diagram for determining which elements of a step address type vector memory access instruction fall within the same memory block in Example 1; Figure 3 is a memory access data path diagram multiplexed between step address type and index address type vector memory access instructions in Example 1.
Detailed Implementation
[0018] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.
[0019] Example 1: As shown in Figure 1, one objective of this invention is to provide a method for processing vector memory access instructions, applied in a processor including a scalar pipeline and a vector processor. The scalar pipeline includes, in order, an instruction fetch stage, a decode stage, an execution stage, a memory access stage, and a write-back stage. The method comprises the following steps:
S1. Instruction issuance step: In the scalar pipeline, when a vector memory access instruction flows through any of the decode, execution, memory access, and write-back stages, it is issued to the vector processor, where:
If the instruction is issued during the decode stage, vector processing starts earliest and the overlap between vector processing and the scalar pipeline is maximized.
If it is issued during the execution stage, the operands are usually ready (because the register read has completed before the execution stage), but issuance is later than in the decode stage and the overlap period is slightly shorter.
If it is issued during the memory access stage, vector processing starts latest, but operands already prepared in the scalar pipeline (such as base address values) can be reused, which suits scenarios that are not sensitive to startup delay.
The instruction is preferably issued during the decode stage. After the issuance timing is determined, the instruction dispatch unit packages the information of the vector memory access instruction and sends it to the vector processor via a dedicated bus. The information packet includes at least: opcode, base address register value (or index), vector length, stride / index information, and a unique instruction identifier for subsequent response matching.
[0020] After the instruction is dispatched, it continues to flow in the scalar pipeline as a null instruction until it reaches the write-back stage. This allows vector memory access instructions to be dispatched without waiting for the scalar pipeline to complete all stages, thereby reducing the startup latency of vector operations. The steps include: In the scalar pipeline, the vector memory access instruction is transformed into a null instruction. That is, in subsequent pipeline stages no execution unit performs any substantive operation on it: no computation is performed on it in the execution stage, and no memory access is performed for it in the memory access stage. The null instruction carries only the instruction identifier and sequence information and is passed stage by stage along the scalar pipeline so that program order is preserved. At the same time, subsequent vector memory access instructions are allowed to continue flowing into the pipeline, which improves throughput, and no instruction skips a stage. Stage-by-stage transmission means that the null instruction passes through the decode, execution, and memory access stages in turn until it reaches the write-back stage.
[0021] However, a vector memory access instruction cannot be committed early in the scalar pipeline, because it must wait until address authentication of all elements has passed. If the scalar pipeline were held at this instruction from an earlier stage until authentication completed, subsequent unrelated scalar instructions could not enter the pipeline, and throughput would decrease. Therefore, the scalar pipeline is paused only when the null instruction reaches the write-back stage, where it waits for the vector processor to return the authentication result. This includes the following steps: When the pause controller in the write-back stage detects that the vector memory access instruction is present in the scalar pipeline as a null instruction and that the authentication result has not yet been returned by the vector processor, it generates a pause signal to prevent the instruction from being committed. At the same time, it pauses the preceding pipeline stages through the pipeline backpressure mechanism (that is, new instructions are no longer allowed to enter the write-back stage, and backpressure propagates stage by stage, pausing the memory access, execution, decode, and instruction fetch stages in sequence) until the commit permission instruction is received or an exception is reported.
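The write-back pause and backpressure described above can be illustrated with a minimal behavioral sketch in Python. It is not the claimed hardware: the names `Slot`, `writeback_stall`, and `step` are hypothetical, the pipeline is reduced to four slots (decode, execution, memory access, write-back), and backpressure is simplified so that all stages freeze in the same cycle instead of stage by stage.

```python
from dataclasses import dataclass
from typing import List, Optional, Set, Tuple


@dataclass
class Slot:
    instr_id: int          # unique instruction identifier carried by the slot
    is_null: bool = False  # True for a dispatched vector memory access instruction


def writeback_stall(slot: Optional[Slot], auth_done: Set[int]) -> bool:
    """Stall when a null instruction sits in write-back without an authentication result."""
    return slot is not None and slot.is_null and slot.instr_id not in auth_done


def step(stages: List[Optional[Slot]], incoming: Optional[Slot],
         auth_done: Set[int]) -> Tuple[List[Optional[Slot]], Optional[Slot], bool]:
    """Advance [decode, execution, memory access, write-back] by one cycle.

    Returns (new_stages, committed_slot, incoming_accepted).
    Assumption: on a stall every stage holds its contents (simplified backpressure).
    """
    if writeback_stall(stages[-1], auth_done):
        return list(stages), None, False
    committed = stages[-1]
    return [incoming] + list(stages[:-1]), committed, True
```

In this model, `auth_done` stands in for the set of instruction identifiers for which the vector processor has already returned a commit permission; the exception path is left out.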
[0022] S2. Address authentication step: The vector processor performs read / write authentication on each memory address involved in the vector memory access instruction in element order to obtain the authentication result, specifically including the following steps:
The memory address of each element is calculated sequentially in ascending order of element index (0, 1, 2, …, VL-1). The address generation unit receives the base address, the stride / index value, and the current element index to calculate the complete memory address. For contiguous and step address modes, the memory address is generated recursively; for index address mode, the index value is read from the vector register file and combined with the base address to form the memory address. Following element order is beneficial because, when an exception occurs, the first faulting element can be located precisely and recorded in the vstart register. Furthermore, because authentication is decoupled from the execution progress of the vector processor, the authenticated pointer accurately reflects the range of elements that have been confirmed to be safe.
The memory address generated for each element is then sent to the processor's Memory Management Unit (MMU) and Physical Memory Protection unit (PMP) for permission lookup. The MMU is responsible for translating virtual addresses to physical addresses and for page-level authentication (read / write / execute permissions, user / kernel mode, etc.); the PMP provides access control for physical address regions (such as isolation between the secure and non-secure worlds). The lookup typically involves the following steps:
TLB lookup: The address translation and permission information is first looked up in the Translation Lookaside Buffer (TLB). On a hit, the result is returned quickly.
Page table traversal: On a TLB miss, a hardware page table walk is initiated (or a software refill is performed) to load page table entries from memory and obtain the physical address and permission bits.
Access verification: The access type (read / write) required by the instruction is compared with the permission bits in the page table entry / PMP entry, and the address is checked for validity (no page fault) and alignment.
The authentication result is then output, which is either authentication passed (the address is valid and has the required read / write permissions) or authentication failed. Authentication failure includes page fault (the virtual address is not mapped to a physical page frame), permission violation (the address mapping exists, but the permissions are insufficient), and address misalignment (the address is not aligned to the element size). If authentication failure is detected for any element, authentication of subsequent elements is stopped immediately, the position information of the failed element is recorded in the exception location register, and an exception signal is sent to the scalar pipeline.
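As an illustration of element-order authentication with early stop, the following Python sketch models the three failure types named above. It rests on several assumptions that are not in the source: the dictionary `page_perms` (page number to permission string) is a hypothetical stand-in for the combined MMU / PMP lookup, the page size is fixed at 4KB, and the alignment check is performed before the mapping check.

```python
from enum import Enum
from typing import Dict, List, Optional, Tuple


class Auth(Enum):
    PASS = 0
    PAGE_FAULT = 1
    PERMISSION_VIOLATION = 2
    MISALIGNED = 3


def check(addr: int, elem_size: int, is_write: bool,
          page_perms: Dict[int, str], page_size: int = 4096) -> Auth:
    """Authenticate one element address (check order is an assumption)."""
    if addr % elem_size != 0:
        return Auth.MISALIGNED
    perms = page_perms.get(addr // page_size)
    if perms is None:                       # no mapping for this page
        return Auth.PAGE_FAULT
    if ("w" if is_write else "r") not in perms:
        return Auth.PERMISSION_VIOLATION
    return Auth.PASS


def authenticate(addrs: List[int], elem_size: int, is_write: bool,
                 page_perms: Dict[int, str]) -> Tuple[Auth, Optional[int], int]:
    """Check elements in ascending index order and stop at the first failure.

    Returns (result, failed_index, authenticated_count). failed_index models
    the value that would be written to the exception location register.
    """
    for i, addr in enumerate(addrs):
        result = check(addr, elem_size, is_write, page_perms)
        if result is not Auth.PASS:
            return result, i, i
    return Auth.PASS, None, len(addrs)
```

The returned `authenticated_count` plays the role of the authenticated pointer: every element below it has been confirmed safe.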
[0023] If all elements are authenticated successfully, the vector processor sends a commit permission instruction to the scalar pipeline. The scalar pipeline then completes the commit of the vector memory access instruction in the write-back stage and releases scalar pipeline resources (such as reorder buffer entries and physical registers). After the commit, the instruction is considered complete at the architectural level.
[0024] S3. Vector execution step: The vector processor executes the memory access operation of the vector memory access instruction. The vector execution step overlaps at least partially in time with the address authentication step and the scalar commit step, and the execution progress of the vector execution step is constrained by the progress of the address authentication step.
[0025] If vector execution were allowed to proceed completely freely, two problems would arise. With execution ahead of authentication, accessing an element that has not yet passed authentication results in unauthorized access if that element's authentication subsequently fails, causing irrecoverable damage. With execution ahead of commit, if an unrecoverable error occurs before the instruction is committed in the scalar pipeline, the error cannot be attributed to the specific instruction and program point, compromising exception precision. Therefore, the execution progress of the vector execution step is constrained by the progress of the address authentication step, and the memory access operation is constrained based on the state of the vector memory access instruction in the scalar pipeline, as follows:
When the instruction is in a speculative state, any memory access operation is prohibited. If an instruction on a speculative path actually performs a memory access and is later canceled because of a branch misprediction, the memory access that has already occurred (especially a write) is irreversible and may corrupt the program state. Even read operations may disturb the system by triggering exceptions. Therefore, the speculative state of each vector instruction is tracked, and the execution unit is notified through a speculative signal. When speculative=1, all memory access requests of the execution unit are blocked, the address generation unit is disabled, and no cache access is initiated, so that execution is completely prohibited in the speculative state.
When the instruction is in a non-speculative state but has not yet been committed in the scalar pipeline, memory access operations are allowed only on elements that have passed authentication. In other words, the instruction is no longer speculative but has not yet been committed, so only the authenticated part may execute. This ensures that if authentication of a subsequent element fails, the elements already executed are all legal and no illegal access has occurred. At the same time, since the instruction has not been committed, even if an unrecoverable error occurs during execution, the system can raise the exception at the correct program point.
Once the instruction has been committed in the scalar pipeline, memory access operations on all remaining elements are allowed. Commit means that the instruction is complete at the architectural level, and subsequent execution does not affect program correctness (if an unrecoverable error does occur, it directly triggers a machine-level exception). At this point, authentication of all elements must already have completed (because commit requires that all authentications pass), so execution can proceed without constraints.
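The three-state gating rule can be condensed into a small Python function. This is a sketch under stated assumptions: `executable_limit` and its parameters are hypothetical names, and `authenticated` is taken to be the count of leading elements that have already passed authentication.

```python
def executable_limit(speculative: bool, committed: bool,
                     authenticated: int, vl: int) -> int:
    """Return how many leading elements the execution unit may access.

    speculative   -- instruction is still on a speculative path
    committed     -- instruction has been committed in the scalar pipeline
    authenticated -- number of leading elements that passed authentication
    vl            -- vector length (total number of elements)
    """
    if speculative:
        return 0                  # speculative=1: all memory access is blocked
    if committed:
        return vl                 # commit implies every element was authenticated
    return min(authenticated, vl)  # non-speculative, not yet committed
```

For example, a non-speculative, uncommitted instruction with 5 of 8 elements authenticated may access only elements 0 to 4.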
[0026] The second objective of this invention is to provide a processing apparatus for vector memory access instructions, which applies the processing method for vector memory access instructions described in any of the above, including: An instruction dispatch unit, located in the scalar pipeline, used to dispatch the vector memory access instruction to the vector processor when the instruction flows through any pipeline stage between the decode stage and the write-back stage, to control the instruction to continue flowing in the scalar pipeline as a null instruction, and to block the scalar pipeline when the instruction reaches the write-back stage. An authentication unit, located in the vector processor, used to receive vector memory access instructions issued by the instruction dispatch unit, perform read / write authentication on each element address of the instruction in sequence, and send a commit permission instruction to the scalar pipeline after all elements have passed the check. An execution unit, located in the vector processor and decoupled from the authentication unit, used to execute the memory access operation of the vector memory access instruction. A state control unit, used to control the start / stop and execution range of the execution unit based on the speculative state of the vector memory access instruction in the scalar pipeline.
[0027] Example 2: Memory attributes in a processor are typically managed in "blocks" of a certain size. For example, page tables work in 4KB units, and PMPs in 32B / 64B units. This means that consecutive addresses within the same memory block have the same access permissions. Therefore, to speed up the address authentication step S2, multiple elements within a block can be merged into a single query. This example differs from Example 1 in that the address authentication step adopts a batch authentication method based on storage blocks, which specifically includes the following steps: obtain the memory address of the element currently to be authenticated; determine the memory block containing that address based on the preset memory block size; calculate all consecutive elements of the vector memory access instruction whose addresses fall within that memory block; group all elements within the storage block together and perform read / write authentication for the whole group at once. If authentication succeeds, all elements in the group are marked as authenticated; if it fails, only the first element of the group is reported as having failed authentication. The above process is repeated until all elements are authenticated.
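The block-based batch loop can be sketched in Python as follows. The function name `batch_authenticate` and the callback `block_ok` (a stand-in for one permission query on a whole block) are hypothetical, and the sketch assumes that no element straddles a block boundary.

```python
from typing import Callable, List, Optional, Tuple


def batch_authenticate(addrs: List[int], block_size: int,
                       block_ok: Callable[[int], bool]) -> Tuple[Optional[int], int]:
    """Authenticate elements in groups that share one storage block.

    addrs    -- element addresses in element order
    block_ok -- hypothetical permission query, called once per group with
                the block base address
    Returns (failed_index, query_count); failed_index is None on success,
    otherwise the index of the first element of the failing group.
    """
    i = 0
    queries = 0
    while i < len(addrs):
        block = addrs[i] // block_size
        j = i + 1
        # Extend the group over consecutive elements in the same block.
        while j < len(addrs) and addrs[j] // block_size == block:
            j += 1
        queries += 1
        if not block_ok(block * block_size):
            return i, queries
        i = j
    return None, queries
```

With six 8-byte elements starting at address 0 and 32B blocks, two queries replace six element-wise checks.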
[0028] Because address authentication is based on storage blocks, the sequence of addresses used for permission queries differs from the memory access address pattern of the original elements. The specific address calculation is as follows: Firstly, for vector memory access instructions of contiguous address type (such as unit-stride accesses in the RISC-V vector extension), all contiguous elements whose addresses fall within the memory block are calculated as follows: After each address authentication, starting from the address of the next element to be authenticated, the address is incremented by the element size repeatedly until the boundary of the storage block is reached, which determines the sequence of elements falling within the same storage block. Except for the first and last address authentication steps, all intermediate address authentication steps query a complete storage block.
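For the contiguous case, the query sequence can be computed directly. The sketch below uses a hypothetical function `contiguous_queries` and assumes that the base address is aligned to the element size and that the block size is a multiple of the element size, so no element straddles a block boundary.

```python
from typing import List, Tuple


def contiguous_queries(base: int, elem_size: int, vl: int,
                       block_size: int) -> List[Tuple[int, int]]:
    """Return one (start_address, element_count) pair per permission query.

    Assumes base % elem_size == 0 and block_size % elem_size == 0.
    """
    queries = []
    addr = base
    remaining = vl
    while remaining > 0:
        room = block_size - (addr % block_size)   # bytes left up to the block boundary
        n = min(remaining, room // elem_size)     # elements that fit in this block
        queries.append((addr, n))
        addr += n * elem_size
        remaining -= n
    return queries
```

For 18 four-byte elements starting at address 16 with 32B blocks, the result is a partial first query, one full-block query, and a partial last query, matching the statement that only the first and last steps may cover less than a complete block.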
[0029] Secondly, for vector memory access instructions of step address type (such as strided accesses in the RISC-V vector extension), calculating all consecutive elements whose addresses fall within the memory block, and determining whether multiple consecutive elements fall within the same memory block, includes the following:
When the step size is greater than or equal to the storage block size, each accessed element falls in a different storage block and batch authentication is impossible. In this case, the address authentication steps are performed sequentially in element order.
When the step size is smaller than the storage block size, multiple consecutively accessed elements may fall within the same storage block, and batch authentication of those elements improves efficiency. The following approaches can be used:
Approach 1: Starting from the address of the element to be authenticated, advance element by element until the boundary of the storage block is reached, which determines the sequence of elements falling within the same storage block. Equivalently, divide the remaining address capacity in the storage block by the step size to obtain the number of elements falling within the same storage block. Then set the address for the next authentication to the next consecutive address according to the storage block alignment. For example, if the block size is 32B and the current element offset is 4B, the remaining capacity of the block is 28B. If the step size is 5B, the block can hold at most 28 ÷ 5 = 5 elements (the remaining 3B cannot hold the next complete element). These 5 elements can be merged into one storage block query for a single permission check.
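The division in Approach 1 can be written as a one-line computation. The sketch below uses a hypothetical function `elements_in_block` and, like the worked example, treats each element as occupying one full step within the block. Returning at least 1 is an added assumption, so that the current element is always authenticated even when less than one step of capacity remains or the step is not smaller than the block.

```python
def elements_in_block(offset: int, step: int, block_size: int) -> int:
    """Number of consecutive strided elements merged into one block query.

    offset     -- byte offset of the current element inside its storage block
    step       -- step size (stride) in bytes, must be positive
    block_size -- storage block size in bytes
    """
    remaining = block_size - offset      # capacity left up to the block boundary
    # Floor division mirrors the text: remaining capacity divided by step size.
    return max(1, remaining // step)
```

With a 32B block, a 4B offset, and a 5B step, this gives 28 // 5 = 5, as in the example above.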
[0030] Alternatively, within Approach 1, a lookup table indexed by address offset and step size can be used: the number of elements that fit in a storage block is pre-calculated for each combination of offset and step size, the number of elements falling in the same storage block is read from the table, and that many elements are merged to perform the address authentication step. Approach 2: a complete address generation unit calculates the complete address of the current element to be authenticated, and N simplified address generation units calculate the low-order addresses of the first to Nth consecutive elements following the current element. Based on whether the carry information generated by the complete address generation unit when calculating the address is consistent with the carry information generated by each simplified address generation unit, it is determined whether each subsequent element is located in the same storage block as the current element. As shown in the example of Figure 2, assume a storage block is 32B and N is set to 3; the complete address generation unit is the complete AGU, and the three simplified address generation units are incomplete AGU0, incomplete AGU1, and incomplete AGU2. The address of the first element being queried is calculated by the complete address generation unit, and the low-order addresses of the following N consecutive elements are calculated by the N simplified address generation units. The address matching logic then checks whether the carry out of each low-order address matches the carry generated when calculating the first element, which determines whether the corresponding element falls within the same block as the first element. Finally, the last element that falls within the same block is selected, and the high-order bits of the first element's address are concatenated with the low-order bits of the selected element's address to form the base address for the address authentication of the next element.
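One possible software model of Approach 2 is sketched below. Figure 2 is not reproduced here, so the data path is an assumption: the sketch supposes a running base address (the last authenticated element), a full-width add for the complete AGU, and narrow adds that keep only the low-order bits plus the carry for the simplified AGUs. The names `stride_same_block`, `BLOCK_BITS`, and `LOW_MASK` are hypothetical.

```python
BLOCK_BITS = 5                      # 32-byte storage block (assumed)
LOW_MASK = (1 << BLOCK_BITS) - 1


def stride_same_block(prev_addr, step, n=3):
    """Model one complete AGU plus n simplified AGUs for a stepped access.

    prev_addr: running base address (address of the previous element).
    Returns (current address, count of following elements in the same block,
    address of the last element in that block).
    """
    assert 0 < step <= LOW_MASK      # batching only applies when step < block size
    prev_low = prev_addr & LOW_MASK
    cur_addr = prev_addr + step                      # complete AGU: full-width add
    carry_full = (prev_low + step) >> BLOCK_BITS     # carry out of the low bits
    high = cur_addr >> BLOCK_BITS
    last_low = cur_addr & LOW_MASK
    matched = 0
    for k in range(1, n + 1):                        # simplified AGU number k - 1
        low_sum = prev_low + (k + 1) * step          # narrow add; carry may exceed 1
        if (low_sum >> BLOCK_BITS) != carry_full:
            break                                    # element left the block
        matched = k
        last_low = low_sum & LOW_MASK
    # High bits of the first element joined with low bits of the selected one.
    last_addr = (high << BLOCK_BITS) | last_low
    return cur_addr, matched, last_addr


print(stride_same_block(1023, 5))
print(stride_same_block(1040, 5))
```

The returned last address can serve as the running base for the next authentication step, mirroring the concatenation described above. Only the complete AGU needs a full-width adder in this model, which is where the area saving would come from.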
[0031] Thirdly, for vector memory access instructions based on index values, it is difficult to determine directly which elements fall in the same storage block. The same method of one complete AGU plus N incomplete AGUs is therefore adopted. When a given element is authenticated, the complete AGU calculates (base address + index), and at the same time the N incomplete AGUs are used to determine whether the addresses of the next N consecutive elements fall in the same storage block as the address calculated by the complete AGU. First, it is determined whether the high-order bits of each of the subsequent N indexes are consistent with the high-order bits of the index of the element being authenticated. At the same time, the low-order bits of the base address and of the index are added to determine whether the resulting carry is consistent with the low-order carry of the element being authenticated.
[0032] For vector memory access instructions of the index address type (such as indexed accesses in the RISC-V vector extension), when calculating all consecutive elements whose addresses fall within the storage block, determining whether multiple consecutive elements fall within the same storage block includes the following steps. A complete address generation unit calculates the complete address of the element to be authenticated from the base address and the index value of the current element; N simplified address generation units then calculate the low-order addresses from the base address and the index values of the subsequent 1st to Nth elements respectively. A high-order comparator compares the high-order index bits of each subsequent element with the high-order index bits of the current element, and the low-order carry of each subsequent element is compared with the low-order carry of the current element. If the high-order index bits match and the low-order carry matches, the element is considered to be in the same storage block as the element being authenticated. If the high-order index bits or the low-order carry do not match, the element is generally considered not to be in the same block, with one exception: when the high-order bits differ by exactly 1 and the difference in the low-order carry exactly compensates for it, the element also falls in the same storage block.
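The index-type comparison, including the compensating case, can be modeled as in the following sketch. It is a software illustration under stated assumptions, not the circuit of the figures: the names `index_same_block`, `BLOCK_BITS`, and `LOW_MASK` are hypothetical, a 32-byte block is assumed, and index values are assumed to be non-negative byte offsets.

```python
BLOCK_BITS = 5                      # 32-byte storage block (assumed)
LOW_MASK = (1 << BLOCK_BITS) - 1


def index_same_block(base, cur_idx, next_idxs):
    """For each following index, decide whether base + index lies in the same
    storage block as base + cur_idx, using only high-order index bits and the
    carry out of the low-order add."""
    base_low = base & LOW_MASK
    cur_hi = cur_idx >> BLOCK_BITS
    cur_carry = (base_low + (cur_idx & LOW_MASK)) >> BLOCK_BITS
    flags = []
    for idx in next_idxs:
        hi = idx >> BLOCK_BITS                                # high-order comparator input
        carry = (base_low + (idx & LOW_MASK)) >> BLOCK_BITS   # simplified AGU carry
        exact = hi == cur_hi and carry == cur_carry
        # High bits differ by 1 and the carry difference cancels it out.
        compensated = (
            (hi - cur_hi == 1 and cur_carry - carry == 1)
            or (cur_hi - hi == 1 and carry - cur_carry == 1)
        )
        flags.append(exact or compensated)
    return flags


# base low bits are 28; the current index 8 produces a carry into the block number.
print(index_same_block(0x101C, 8, [4, 35, 40, 3]))
```

In the example, index 35 has different high-order bits from index 8 but produces no carry, so the two differences cancel and the element is still reported as being in the same block. A batching scheme limited to consecutive elements would take only the leading run of true flags.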
[0033] As shown in Figure 3, when vector memory access instructions of the index address type and vector memory access instructions of the step address type perform the address authentication step, the method of one complete address generation unit plus N simplified address generation units is implemented in the same way. That is, the complete address generation unit is the complete AGU, and the three simplified address generation units are incomplete AGU0, incomplete AGU1, and incomplete AGU2. The same circuit can therefore be reused in hardware, requiring only that each multiplexer in Figure 2 gain an additional index-offset source. The other difference is that the base address for index memory access is always the instruction base address. This improves element read / write authentication for vector memory access instructions of both the index address type and the step address type at a small area cost, thereby freeing the scalar pipeline sooner.
[0034] The foregoing has shown and described the basic principles, main features, and advantages of the present invention. Those skilled in the art should understand that the present invention is not limited to the above embodiments. The embodiments and descriptions in the specification are merely preferred examples and are not intended to limit the invention. Various changes and modifications can be made to the invention without departing from its spirit and scope, and all such changes and modifications fall within the scope of the present invention as claimed. The scope of protection of the present invention is defined by the appended claims and their equivalents.
Claims
1. A method for processing vector memory access instructions, applied in a processor including a scalar pipeline and a vector processor, wherein the scalar pipeline includes, in order, an instruction fetch stage, a decode stage, an execution stage, a memory access stage, and a write-back stage, characterized in that the method includes the following steps: S1. Instruction issuance step: in the scalar pipeline, when a vector memory access instruction flows through any of the decode, execution, memory access, and write-back stages, it is issued to the vector processor; at the same time, it continues to flow in the scalar pipeline as a null instruction until it reaches the write-back stage, where the scalar pipeline is paused to wait for the vector processor to return the authentication result. S2. Address authentication step: the vector processor performs read / write authentication on each memory address involved in the vector memory access instruction in element order to obtain the authentication result; if all elements are authenticated successfully, the vector processor sends a commit permission instruction to the scalar pipeline, and the scalar pipeline ends the pause, commits the vector memory access instruction in the write-back stage, and releases the scalar pipeline resources; if any element fails authentication, the address authentication is stopped immediately, the location information of the failed element is recorded in the exception location register, and an exception signal is sent to the scalar pipeline. S3. Vector execution step: the vector processor executes the memory access operation of the vector memory access instruction; the vector execution step overlaps at least partially in time with the address authentication step and the scalar commit step, and the execution progress of the vector execution step is constrained by the progress of the address authentication step.
2. The method for processing vector memory access instructions according to claim 1, characterized in that: in the instruction issuance step, continuing to flow in the scalar pipeline as a null instruction until the write-back stage is reached, and pausing the scalar pipeline in the write-back stage to wait for the vector processor to return the authentication result, includes the following steps: in the scalar pipeline, the vector memory access instruction is transformed into a null instruction, meaning that in subsequent pipeline stages each execution unit performs no substantive operation on it; the null instruction carries only the instruction identifier and sequence information and is passed stage by stage down the scalar pipeline, while subsequent vector memory access instructions are allowed to continue flowing into the pipeline, until the write-back stage is reached; if the pause controller in the write-back stage detects that the vector memory access instruction is present in the scalar pipeline as a null instruction and the authentication result returned by the vector processor has not yet been received, it generates a pause signal to prevent the instruction from being committed; at the same time, no new instruction is allowed to enter the write-back stage, and the stall propagates backward stage by stage to the instruction fetch stage until the commit permission instruction is received or an exception is reported.
3. The method for processing vector memory access instructions according to claim 2, characterized in that: the authentication results include authentication success and authentication failure; authentication failure includes page fault, permission violation, and address misalignment; if an authentication failure is detected for any element, authentication of subsequent elements is immediately stopped, the location information of the failed element is recorded in the exception location register, and an exception signal is sent to the scalar pipeline.
4. The method for processing vector memory access instructions according to claim 1, characterized in that: in the vector execution step, the execution progress of the vector execution step being constrained by the progress of the address authentication step means that the memory access operation is constrained based on the state of the vector memory access instruction in the scalar pipeline, including: when the instruction is in a speculative state, any memory access operation is prohibited; when the instruction is in a non-speculative state but has not yet been committed in the scalar pipeline, memory access operations are allowed only on those elements that have been authenticated and have passed; once the instruction has been committed in the scalar pipeline, memory access operations on all remaining elements are allowed.
5. The method for processing vector memory access instructions according to claim 1, characterized in that: the address authentication step adopts a batch authentication method based on storage blocks, which specifically includes the following steps: obtaining the memory address of the element currently to be authenticated; determining the memory block containing the address based on the preset memory block size; calculating all consecutive elements whose addresses fall within the memory block in the vector memory access instruction; grouping all elements within the memory block together and performing read / write authentication of all elements in the group at once; if successful, all elements are marked as authenticated successfully; if unsuccessful, only the first element is reported as having failed authentication.
6. The method for processing vector memory access instructions according to claim 5, characterized in that: For vector memory access instructions of contiguous address type, the method for calculating all contiguous elements whose addresses fall within the memory block in the vector memory access instruction is as follows: After each address authentication, starting from the address of the element to be authenticated, the element size is incremented sequentially until the boundary of the storage block is reached, thus determining the sequence of elements falling within the same storage block. Except for the first and last address authentication steps, all intermediate address authentication steps query the complete storage block.
7. The method for processing vector memory access instructions according to claim 6, characterized in that: For vector memory access instructions of the step address type, calculating all consecutive elements whose addresses fall within the memory block in the vector memory access instruction includes the following steps: A complete address generation unit is used to calculate the complete address of the current element to be authenticated; then N simplified address generation units are used to calculate the low-order addresses of the first to Nth consecutive elements after the current element to be authenticated. Based on the consistency between the carry information generated by the complete address generation unit when calculating the address and the carry information generated by each of the simplified address generation units, it is determined whether the subsequent element is located in the same storage block as the current element. Select the last element that falls within the same block, and use the high-order bits of the first element's address and the low-order bits of the selected element's address to concatenate the base address of the next element.
8. The method for processing vector memory access instructions according to claim 6, characterized in that: for vector memory access instructions of the index address type, when calculating all consecutive elements whose addresses fall within the memory block in the vector memory access instruction, determining whether multiple consecutive elements fall within the same memory block includes the following steps: a complete address generation unit calculates the complete address of the element to be authenticated from the base address and the index value of the current element; N simplified address generation units then calculate the low-order addresses from the base address and the index values of the subsequent 1st to Nth elements respectively; a high-order comparator compares the high-order index bits of each subsequent element with the high-order index bits of the current element, and the low-order carry of each subsequent element is compared with the low-order carry of the current element; if the high-order index bits do not match or the low-order carry does not match, the element is considered not to be in the same block as the element being authenticated; if the high-order index bits match and the low-order carry matches, the element is considered to be in the same storage block as the element being authenticated.
9. A processing apparatus for vector memory access instructions, applied to the processing method for vector memory access instructions according to any one of claims 1-8, characterized in that it includes: an instruction dispatch unit, located in the scalar pipeline, used to dispatch the vector memory access instruction to the vector processor when the vector memory access instruction flows through any pipeline stage from the decode stage to the write-back stage, to control the instruction to continue flowing in the scalar pipeline as a null instruction, and to block the scalar pipeline when the instruction reaches the write-back stage; an authentication unit, located in the vector processor, used to receive vector memory access instructions issued by the instruction dispatch unit, perform read / write authentication on each element address of the instruction in sequence, and send a commit permission instruction to the scalar pipeline after all elements have passed the check; an execution unit, located in the vector processor and decoupled from the authentication unit, used to execute the memory access operation of the vector memory access instruction; and a state control unit, used to control the start / stop and execution range of the execution unit based on the speculative state of the vector memory access instruction in the scalar pipeline.