Memory access unit, memory access instruction execution method and chip
By dividing the execution of data fetch instructions into multiple subtasks that are processed in parallel, the invention addresses the timing delay and excessive circuit area incurred when store instructions pass data to fetch instructions in a processor pipeline, thereby improving processor performance and clock frequency.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- BEIJING VCORE TECH CO LTD
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
AI Technical Summary
In processor pipeline design, passing data from store instructions to fetch instructions with matching addresses causes timing delays, excessive circuit layout area, and a limit on how far the processor clock frequency can be raised.
The execution process of data fetch instructions is divided into multiple subtasks, and these subtasks are executed in parallel through multiple data fetch pipelines. The forward operation is split across the data fetch subtasks, and the execution process of data storage instructions is processed in parallel using multiple data storage pipelines.
This distributes the processing load, reduces timing latency, reduces circuit area, and improves the processor's clock frequency and overall performance.
Figure CN122240187A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of computer technology, and in particular to a memory access unit, a method for executing memory access instructions, and a chip.
Background Technology
[0002] In processor pipeline design, when a fetch instruction is issued and executed, in order to pass data from the store instruction to the address-dependent fetch instruction, address dependency checks need to be performed on all preceding store instructions in the program sequence. Since the smallest granularity of memory access operations is byte, dependency judgment must be completed at the byte level. That is, for each byte position of the fetch instruction, its address is compared one by one to see if it overlaps with the address of any preceding store instruction. This process requires fully connected address comparisons of the fetch instruction with all preceding store instructions, involving a large number of parallel comparator circuits to identify byte units with matching addresses. Afterward, the corresponding data needs to be extracted from the matching store instructions, and data selection and concatenation operations need to be performed to generate the complete data value required by the fetch instruction.
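The byte-level, fully connected comparison described above can be sketched as follows. This is only an illustrative software model under stated assumptions: the function names `byte_overlap_mask` and `dependency_check` and the `(addr, size)` tuple representation of a store are hypothetical and do not come from the specification. The sketch is written in Python for readability; in hardware each comparison would be a parallel comparator.

```python
def byte_overlap_mask(load_addr, load_size, store_addr, store_size):
    """Return a bit mask over the load's bytes.

    Bit i is set when byte i of the load falls inside the address
    range written by the store.
    """
    mask = 0
    for i in range(load_size):
        byte_addr = load_addr + i
        if store_addr <= byte_addr < store_addr + store_size:
            mask |= 1 << i
    return mask


def dependency_check(load_addr, load_size, older_stores):
    """Compare the load against every older store (fully connected compare).

    older_stores is a list of (addr, size) tuples, assumed to contain only
    stores that precede the load in program order.
    """
    return [byte_overlap_mask(load_addr, load_size, a, s) for a, s in older_stores]
```

The cost the background section refers to is visible here: the work grows with the number of load bytes multiplied by the number of older stores, and all of it must finish before data selection can begin.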
[0003] The above operations introduce significant combinational logic latency in hardware implementation, especially in high-concurrency pipeline scenarios where the complexity of address comparisons and data transfer paths increases dramatically. This latency often becomes a critical path in processor timing, directly limiting clock frequency increases and making it difficult to further optimize overall processor performance. Furthermore, centralized processing requires all comparisons and data operations to be completed within a single pipeline stage, resulting in increased circuit area and power consumption, further exacerbating the design challenges.
[0004] Therefore, overcoming bottlenecks such as timing delays, excessive circuit layout area, and limited processor clock frequency in the transfer of data from a store instruction to an address-matching fetch instruction is an urgent technical problem to be solved.
Summary of the Invention
[0005] This invention provides a memory access unit, a method for executing memory access instructions, and a chip to solve the above-mentioned defects in the prior art, so as to effectively overcome the bottlenecks such as timing delay, excessive circuit layout area, and limited processor clock frequency increase in the process of transferring data from the store instruction to the address-matched fetch instruction.
[0006] This invention provides a memory access unit, comprising: Multiple data fetch pipelines; each data fetch pipeline executes one of multiple data fetch subtasks in parallel; wherein, the multiple data fetch subtasks are obtained by dividing the complete execution process of the data fetch instruction; the forward operation is split into each data fetch subtask; the forward operation is the process by which the store instruction directly passes data to the address-related data fetch instruction.
[0007] According to the memory access unit provided by the present invention, the multiple data fetch pipelines include a first data fetch pipeline and a second data fetch pipeline; the first data fetch pipeline is used to execute a first data fetch subtask; the first data fetch subtask includes: pre-decoding the storage queue index in the memory access unit; obtaining the data fetch instruction; determining, based on the result of the pre-decoding, the storage queue index of the target data storage instruction that precedes the data fetch instruction in program order; generating a data forwarding mask of the data fetch instruction based on the address information, opcode, and storage queue index of the target data storage instruction, so that the second data fetch pipeline performs a forward operation corresponding to the data fetch instruction based on the data forwarding mask; generating a first virtual address of the data fetch instruction; sending the first virtual address to the translation lookaside buffer (TLB) to request that the TLB convert the first virtual address into a first physical address of the hardware storage unit accessed by the data fetch instruction; and initiating a cache read request to the data cache according to the first virtual address.
[0008] According to the memory access unit provided by the present invention, the second data fetch pipeline is used to execute a second data fetch subtask; the second data fetch subtask includes: obtaining the first physical address from the translation lookaside buffer (TLB); sending the first physical address and the data forwarding mask to the storage queue to trigger a first forward operation in the storage queue; sending the first physical address and the data forwarding mask to the commit storage instruction buffer to trigger a second forward operation, from the storage instruction to the data fetch instruction, in the commit storage instruction buffer; determining whether the cache read request hits based on the first physical address; and, if the cache read request hits, obtaining a hit vector from the data cache.
[0009] According to the memory access unit provided by the present invention, the multiple data fetch pipelines further include a third data fetch pipeline; the third data fetch pipeline is used to execute a third data fetch subtask; the third data fetch subtask includes: obtaining the forward data of the data fetch instruction based on the return data of the first forward operation and the return data of the second forward operation; if the cache read request hits, obtaining the cache data corresponding to the first virtual address from the data cache according to the hit vector; obtaining the read data of the data fetch instruction based on the forward data and the cache data; writing the read data of the data fetch instruction and the first physical address into the data fetch queue of the memory access unit, so as to return the execution result of the data fetch instruction through the data fetch queue.
[0010] According to the memory access unit provided by the present invention, obtaining the read data of the data fetch instruction based on the forward data and the cache data includes: if the cache read request misses, obtaining the backfill data information of the data cache through the data fetch queue; and, using a data selection mask, obtaining the read data of the data fetch instruction based on the backfill data information of the data cache and the forward data; wherein the data selection mask is a bit vector, and the value of each bit indicates whether the forward data at the corresponding position in the data fetch queue is valid.
[0011] According to the memory access unit provided by the present invention, the step of obtaining the read data of the fetch instruction based on the data selection mask, the backfill data information of the data cache, and the forward data includes: performing an inversion operation on the data selection mask to obtain a fill data mask; for each data unit in the backfill data information, if the address of the fetch instruction matches the address of the data unit, extracting fill data from the data unit based on the fill data mask; and obtaining the read data of the fetch instruction based on the forward data and the fill data.
[0012] According to the memory access unit provided by the present invention, obtaining the read data of the data fetch instruction based on the forward data and the cache data includes: if the cache read request misses but the forward data hits completely, that is, it covers every byte required by the data fetch instruction, obtaining the read data of the data fetch instruction from the forward data.
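The mask-based merge described in the preceding paragraphs (invert the data selection mask to obtain a fill data mask, take fill bytes from the backfill data, and skip the backfill entirely on a complete forward hit) can be sketched as follows. This is a minimal model under assumptions: the function name `merge_read_data`, the per-byte list representation, and the convention that bit i of the mask corresponds to byte i are all hypothetical choices made for illustration.

```python
def merge_read_data(forward_bytes, select_mask, refill_bytes=None):
    """Combine forward data with cache backfill data, byte by byte.

    forward_bytes / refill_bytes: lists of byte values, one per load byte.
    select_mask: bit i set means forward_bytes[i] is valid.
    """
    n = len(forward_bytes)
    full = (1 << n) - 1
    if select_mask == full:
        # Complete forward hit: the backfill data is not needed.
        return list(forward_bytes)
    if refill_bytes is None:
        raise ValueError("backfill data required for bytes not covered by forwarding")
    fill_mask = ~select_mask & full  # inverted data selection mask
    out = []
    for i in range(n):
        if (fill_mask >> i) & 1:
            out.append(refill_bytes[i])   # byte comes from the backfill data
        else:
            out.append(forward_bytes[i])  # byte comes from the forward data
    return out
```

For example, `merge_read_data([0xAA, 0xBB, 0, 0], 0b0011, [1, 2, 3, 4])` takes the two low bytes from the forward data and the two high bytes from the backfill data.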
[0013] According to the memory access unit provided by the present invention, the memory access unit includes: multiple data storage pipelines; each data storage pipeline executes one of multiple data storage subtasks in parallel; wherein the multiple data storage subtasks are obtained by dividing the complete execution process of the data storage instruction.
[0014] According to the memory access unit provided by the present invention, the multiple data storage pipelines include a first data storage pipeline; the first data storage pipeline is used to execute a first data storage subtask; the first data storage subtask includes: obtaining a data storage instruction; generating a second virtual address of the data storage instruction; and sending the second virtual address to the translation lookaside buffer (TLB) to request that the TLB convert the second virtual address into a second physical address of the hardware storage unit to be accessed by the data storage instruction.
[0015] According to the memory access unit provided by the present invention, the multiple data storage pipelines further include a second data storage pipeline; the second data storage pipeline is used to execute a second data storage subtask; the second data storage subtask includes: obtaining the second physical address from the translation lookaside buffer (TLB); writing the second physical address and the storage data of the data storage instruction into the storage queue; and marking the data storage instruction in the storage queue as written back.
[0016] According to the memory access unit provided by the present invention, the multiple data storage pipelines further include a third data storage pipeline; the third data storage pipeline is used to execute a third data storage subtask; the third data storage subtask includes: performing a violation check on the passing of data from the data storage instruction to data fetch instructions; wherein the violation check determines whether the data storage instruction has failed to pass data to a data fetch instruction that has the same address, follows the data storage instruction in program order, and has already been written back; if the violation check finds such an abnormal situation, the data fetch pipeline corresponding to the violation check is triggered to re-execute.
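The violation check can be illustrated with a small software model. Everything below is a sketch under assumptions rather than the claimed circuit: the `Load` record, its `forwarded_from` field, and the use of a monotonically increasing `rob_id` (with no wrap-around) to represent program order are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Load:
    rob_id: int          # program-order position (assumed not to wrap)
    addr: int
    size: int
    written_back: bool
    # rob_ids of the stores that passed data to this load (hypothetical bookkeeping)
    forwarded_from: set = field(default_factory=set)


def find_violations(store_rob_id, store_addr, store_size, loads):
    """Return the loads that must be re-executed for this store.

    A load is in violation when it follows the store in program order,
    overlaps the store's bytes, has already written back, and never
    received the store's data.
    """
    bad = []
    for ld in loads:
        younger = ld.rob_id > store_rob_id
        overlap = ld.addr < store_addr + store_size and store_addr < ld.addr + ld.size
        missed = store_rob_id not in ld.forwarded_from
        if younger and overlap and ld.written_back and missed:
            bad.append(ld)
    return bad
```

A load that is older than the store, touches a different address, or has not yet written back is never reported, which mirrors the three conditions listed in the paragraph above.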
[0017] The present invention also provides a method for executing memory access instructions, applied to the memory access unit described above, the method comprising: obtaining a target data fetch instruction; obtaining multiple data fetch subtasks based on the complete execution process of the target data fetch instruction; and using each data fetch pipeline to execute one of the multiple data fetch subtasks in parallel.
[0018] The present invention also provides a chip including the memory access unit as described above.
[0019] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the execution method of memory access instructions as described above.
[0020] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the execution method of memory access instructions as described above.
[0021] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the execution method of memory access instructions as described above.
[0022] The memory access unit, memory access instruction execution method, and chip provided by this invention effectively distribute the processing load by dividing the complete execution process of the data fetch instruction into multiple data fetch subtasks and distributing these subtasks to multiple data fetch pipelines for parallel execution. Furthermore, the forward operation itself is also broken down across these data fetch subtasks. For example, operations such as address generation, forwarding judgment, and merging of fetched data are no longer performed serially or concentrated in a single complex stage, but are decomposed and processed in parallel by different pipelines.
Description of the Drawings
[0023] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.
[0024] Figure 1 is a schematic diagram of the memory access unit provided by the present invention.
[0025] Figure 2 is a flowchart of the execution method of the memory access instruction provided by the present invention.
[0026] Figure 3 is a schematic diagram of the memory access pipeline architecture provided by the present invention.
[0027] Figure 4 is a schematic diagram of the process of obtaining the read data of the data fetch instruction based on the backfill data information of the data cache and the forward data, as provided by the present invention.
[0028] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention.
Detailed Implementation
[0029] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.
[0030] The memory access unit provided by the present invention is described below with reference to Figure 1.
[0031] As shown in Figure 1, the memory access unit 100 includes multiple data fetch pipelines: a first data fetch pipeline, a second data fetch pipeline, and a third data fetch pipeline.
[0032] Each data fetch pipeline executes one of multiple data fetch subtasks in parallel.
[0033] Multiple data fetch subtasks are obtained by dividing the complete execution process of the data fetch instruction; the forward operation is split into each data fetch subtask; the forward operation is the process by which the store instruction directly passes data to the address-related data fetch instruction.
[0034] The memory access unit 100 also includes multiple data storage pipelines: a first data storage pipeline, a second data storage pipeline, a third data storage pipeline, and a fourth data storage pipeline.
[0035] Each data storage pipeline executes one of multiple data storage subtasks in parallel. These data storage subtasks are obtained by dividing the complete execution process of the data storage instruction.
[0036] The formats of memory access instructions are not exactly the same for different instruction sets. Common instruction set fetch instructions include four types: fetch byte (lb), fetch half word (lh), fetch word (lw), and fetch double word (ld); and store instructions include four types: store byte (sb), store half word (sh), store word (sw), and store double word (sd).
[0037] For example, in RISC-V (the fifth-generation reduced instruction set computer architecture), the fetch instructions include the lb instruction described below.
[0038] The instruction `lb rd, offset(rs1)`, with semantics `x[rd] = sext(M[x[rs1] + sext(offset)][7:0])`, has the format shown in Table 1 below. Here, `rd` is the identifier of the destination register, `rs1` is the identifier of the base address register, and `offset` is the offset. This instruction performs a byte load: it reads one byte from address `x[rs1] + sign-extend(offset)`, sign-extends it, and writes the result to `x[rd]`.
[0039] Table 1
[0040] For example, in RISC-V, the store instructions include the sb instruction described below.
[0041] The instruction `sb rs2, offset(rs1)`, with semantics `M[x[rs1] + sext(offset)] = x[rs2][7:0]`, has the format shown in Table 2 below. This instruction performs a byte store: it writes the lowest byte of register `x[rs2]` to memory address `x[rs1] + sign-extend(offset)`.
[0042] Table 2
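The semantics of `lb` and `sb` given above can be sketched as a small software model. The sketch assumes 64-bit registers (RV64), a 12-bit signed offset as in the RISC-V base instruction formats, and a Python dictionary standing in for byte-addressable memory in which unwritten bytes read as zero. The helper names `sext`, `lb`, and `sb` as Python functions, and the fact that register `x0` is not forced to zero, are simplifications made for illustration.

```python
XLEN_MASK = 0xFFFFFFFFFFFFFFFF  # assumed 64-bit registers


def sext(value, bits):
    """Sign-extend the low `bits` bits of value to a Python int."""
    value &= (1 << bits) - 1
    sign = 1 << (bits - 1)
    return (value ^ sign) - sign


def lb(x, mem, rd, rs1, offset):
    """x[rd] = sext(M[x[rs1] + sext(offset)][7:0])."""
    addr = (x[rs1] + sext(offset, 12)) & XLEN_MASK
    x[rd] = sext(mem.get(addr, 0), 8) & XLEN_MASK


def sb(x, mem, rs2, rs1, offset):
    """M[x[rs1] + sext(offset)] = x[rs2][7:0]."""
    addr = (x[rs1] + sext(offset, 12)) & XLEN_MASK
    mem[addr] = x[rs2] & 0xFF
```

Storing the value `0x1F0` with `sb` writes only the byte `0xF0`; loading it back with `lb` sign-extends that byte, so the destination register holds `0xFFFFFFFFFFFFFFF0`.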
[0043] The memory access unit is a processor component whose function is to handle all memory access requests issued by the processor, including fetch instructions and store instructions. The memory access unit is responsible for managing the flow of data from the processor to memory and from memory to the processor, ensuring the correctness and efficiency of data access.
[0044] The data fetch pipeline refers to a series of processing stages within the memory access unit used to execute data fetch instructions.
[0045] A data fetch subtask is a smaller, more detailed unit of operation obtained by logically dividing the complete execution process of a data fetch instruction. A complete data fetch instruction may contain multiple complex steps. Decomposing it into multiple subtasks allows these subtasks to be scheduled individually or executed in parallel on different processing resources.
[0046] Forwarding is an optimization technique aimed at reducing pipeline stalls caused by data dependencies. When forwarding occurs, a store instruction passes its data directly to a subsequent address-dependent fetch instruction before writing the data into memory, thus avoiding the fetch instruction waiting for the data to be loaded from memory.
[0047] Parallel execution refers to the simultaneous execution of multiple operations or tasks within the same time period. In memory access units, executing multiple data fetch subtasks in parallel can improve the processing capacity and efficiency of the memory access unit and shorten the overall instruction execution time.
[0048] By dividing the complete execution process of the data retrieval instruction into multiple data retrieval subtasks, processing resources can be better allocated, the execution efficiency of each part can be optimized, and a foundation for distributed processing can be provided.
[0049] In one implementation, the memory access unit can be configured with two or more separate data fetch pipelines, each executing one data fetch subtask, with different data fetch pipelines executing different data fetch subtasks. This multi-pipeline design enables the memory access unit to process multiple data fetch instructions simultaneously, thereby improving the overall memory access throughput.
[0050] Each data fetch pipeline executes one of multiple data fetch subtasks in parallel. When a data fetch instruction enters a memory access unit, its complete execution process is decomposed into multiple logically independent data fetch subtasks. These subtasks are then assigned to different data fetch pipelines for parallel processing.
[0051] The forward operation is the process by which a store instruction directly passes data to a fetch instruction that accesses the same memory address. When a fetch instruction requires data that was previously written by a store instruction, and both instructions access the same memory address, the forward operation allows the store instruction to provide the data directly to the fetch instruction before the data is actually written to memory. This direct transfer avoids the fetch instruction waiting for data to be loaded from memory, thus reducing the execution latency of the fetch instruction.
[0052] In a specific implementation, as shown in Figure 3, the forward operation is broken down across the data fetch subtasks.
[0053] In related technologies, the forwarding operation completes all address comparisons and data selections within a single pipeline stage, which leads to increased latency in that stage.
[0054] In the embodiments provided by this invention, the forward operation is split across the data fetch subtasks. For example, the first data fetch pipeline performs pre-decoding of the storage queue index, thereby identifying potential forwarding sources before the physical address is determined, and generates a data forwarding mask. The second data fetch pipeline then retrieves the specific forward data from the storage queue or the commit storage instruction buffer based on the data forwarding mask. This splitting allows the stages of the forward operation to be executed in parallel or in a pipelined manner in different pipelines, avoiding the complex computation and long latency of a single stage.
[0055] In some embodiments, as shown in Figure 1, the multiple data fetch pipelines include a first data fetch pipeline and a second data fetch pipeline.
[0056] The first data fetch pipeline is used to execute the first data fetch subtask. The first data fetch subtask includes: pre-decoding the storage queue index in the memory access unit; obtaining the fetch instruction; determining, based on the pre-decoding result, the storage queue index of the target store instruction that precedes the fetch instruction in program order; generating a data forwarding mask for the fetch instruction based on the address information, opcode, and storage queue index of the target store instruction, so that the second data fetch pipeline can perform the forward operation corresponding to the fetch instruction based on the data forwarding mask; generating a first virtual address for the fetch instruction; sending the first virtual address to the translation lookaside buffer (TLB) to request that the TLB convert the first virtual address into the first physical address of the hardware memory unit accessed by the fetch instruction; and initiating a cache read request to the data cache based on the first virtual address.
[0057] The data forwarding mask is a binary bit string used to indicate the range of bytes of data that a store instruction needs to forward to a fetch instruction.
[0058] In the specific implementation process, the index of the storage queue of the target storage instruction before the program order of the data fetch instruction can be determined in various ways based on the result of pre-decoding.
[0059] Program order refers to the order in which instructions appear in the program's instruction stream, regardless of the order in which an out-of-order processor actually executes them. It is a key concept used in memory consistency models and out-of-order execution processors to determine data forwarding conditions.
[0060] In some embodiments, since fetch instructions and store instructions are still in program order in the pipeline stage before they enter the instruction issue queue, and since the store queue index is pre-decoded, the store queue index of the nearest target store instruction preceding each fetch instruction can be recorded when that fetch instruction enters the issue queue.
[0061] In some embodiments, the target store instruction that precedes the fetch instruction can be selected based on the reorder buffer number of the fetch instruction and the reorder buffer number of each store instruction in the store queue. The store queue index of the target store instruction can then be obtained from the result of the store queue index pre-decoding. The reorder buffer number is the position of the instruction in the reorder buffer, and it uniquely identifies the instruction's position in program order.
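Selecting the target store instruction by reorder buffer number can be sketched as follows. This is an illustrative model only: it assumes that a smaller reorder buffer number always means an older instruction (wrap-around of the reorder buffer is ignored), and the function names `nearest_older_store` and `forwarding_candidates` and the `(sq_index, rob_id)` tuples are hypothetical.

```python
def nearest_older_store(load_rob_id, store_queue):
    """Return the store queue index of the nearest older store.

    store_queue: list of (sq_index, rob_id) for valid entries.
    The result is the youngest store that is still older than the load,
    or None when no such store exists.
    """
    best = None
    for sq_index, rob_id in store_queue:
        if rob_id < load_rob_id and (best is None or rob_id > best[1]):
            best = (sq_index, rob_id)
    return None if best is None else best[0]


def forwarding_candidates(load_rob_id, store_queue):
    """Bit vector over store queue indices.

    Bit i is set when entry i is older than the load and is therefore
    a potential forwarding source.
    """
    mask = 0
    for sq_index, rob_id in store_queue:
        if rob_id < load_rob_id:
            mask |= 1 << sq_index
    return mask
```

Because neither function needs an address, this step can run before the physical address of the fetch instruction is known, which is the property the first data fetch pipeline relies on.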
[0062] The forward operation proceeds as follows: the store instructions eligible for forwarding are determined based on the data forwarding mask. The physical address of the fetch instruction is compared against those store instructions, generating a physical-address-match hit vector. The hit vector is then used to generate the per-byte forwarding results from the store queue and the commit store instruction buffer. The forwarded data is returned to the third data fetch pipeline in the cycle after the forwarding request is generated. In a specific implementation, as shown in Figure 3, the data fetch instruction issued by the data fetch instruction reservation station can be obtained.
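The generation of the physical-address-match hit vector can be sketched as below. The sketch assumes that addresses are compared at 8-byte-aligned granularity and that the candidate mask has one bit per store queue entry; both are assumptions for illustration, as is the function name `pa_hit_vector`.

```python
def pa_hit_vector(load_pa, candidate_mask, store_pas):
    """Compare the load's physical address with the candidate stores only.

    store_pas[i] is the physical address of store queue entry i, or None
    when the entry has no valid address yet. Bit i of the result is set
    when entry i is a candidate and lies in the same 8-byte block.
    """
    hits = 0
    for i, pa in enumerate(store_pas):
        is_candidate = (candidate_mask >> i) & 1
        if is_candidate and pa is not None and (pa >> 3) == (load_pa >> 3):
            hits |= 1 << i
    return hits
```

Entries excluded by the candidate mask never contribute a hit, so the comparison effort in this stage is limited to the stores that the earlier stage marked as potential forwarding sources.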
[0063] The first fetch pipeline is a hardware or logic unit responsible for executing the early stages of fetch instruction processing. Its main responsibility is to prepare the necessary information, such as address information and potential forward source information, for subsequent memory access and forwarding operations.
[0064] In practice, pre-decoding can be implemented in various ways, and is not limited to the description in this specification.
[0065] The location information obtained through pre-decoding, such as the index or pointer of an entry in the storage queue, enables the second data retrieval pipeline to directly utilize this information in subsequent stages without performing time-consuming full address comparisons, thereby accelerating the execution of forward operations.
[0066] In the embodiments provided by this invention, the storage queue index pre-decoding is performed in the first data fetch pipeline. This allows the data forwarding mask to be generated in the first data fetch pipeline; the data forwarding mask is then passed to the second data fetch pipeline, where the forward operation corresponding to the data fetch instruction is executed; and the forward data of the data fetch instruction is returned in the third data fetch pipeline. Because the forward operation is split across the data fetch subtasks, each stage of the forward operation can be executed in parallel or in a pipelined manner in different pipelines, effectively avoiding complex calculations and long delays in a single stage.
[0067] The first virtual address is the logical address of the memory to be accessed by the fetch instruction, which can be generated by the CPU's instruction decoding unit or address generation unit.
[0068] The translation lookaside buffer (TLB) acts as a cache that stores mappings between virtual and physical addresses. Sending the first virtual address to the TLB retrieves the corresponding physical address: the TLB searches its stored mapping entries in parallel. If a match is found, the corresponding first physical address is returned quickly; otherwise, a page table walk is triggered to obtain the physical address.
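The hit and miss behavior of the TLB can be sketched with a dictionary-based model. The 4 KiB page size, the `TLB` class, and the use of a plain dictionary in place of the page table walk are assumptions for illustration; a real TLB has a bounded number of entries and a replacement policy, which this sketch omits.

```python
PAGE_SHIFT = 12  # assumed 4 KiB pages


class TLB:
    def __init__(self, page_table):
        self.entries = {}             # virtual page number -> physical page number
        self.page_table = page_table  # stand-in for the page table walk

    def translate(self, vaddr):
        """Return (physical address, hit flag) for a virtual address."""
        vpn = vaddr >> PAGE_SHIFT
        offset = vaddr & ((1 << PAGE_SHIFT) - 1)
        hit = vpn in self.entries
        if not hit:
            # Miss: walk the page table and install the mapping.
            self.entries[vpn] = self.page_table[vpn]
        return (self.entries[vpn] << PAGE_SHIFT) | offset, hit
```

The first translation of an address in a page misses and installs the mapping; later translations in the same page hit.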
[0069] The first virtual address is sent to the data cache to attempt to retrieve the required data from the data cache. The data cache is a small-capacity, high-speed memory located between the CPU and main memory, used to store recently or frequently accessed data.
[0070] In some embodiments, the data cache provides two 64-bit read ports and one write port with the same width as the data cache line for the fetch and store pipelines, as well as a data refill port with the same width as the data cache line. The width of the data refill port is determined by the bus width between the data cache and the L2 cache. For example, the data refill port width is 512 bits.
[0071] Initiating a read request through the first virtual address allows for parallel detection of whether the data already exists in the data cache while waiting for the physical address translation result, thereby reducing memory access latency.
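One common way to overlap the cache lookup with address translation is to index the cache with virtual address bits and compare tags with the physical address once it arrives. The sketch below illustrates that idea only; the specification does not state the cache organization, so the 64-byte line size, the 64 sets, the four ways, and the function names `read_set` and `way_hit_vector` are all assumptions.

```python
LINE_SHIFT = 6  # assumed 64-byte cache lines
SET_BITS = 6    # assumed 64 sets, so the index fits inside a 4 KiB page offset


def read_set(cache_tags, vaddr):
    """Step 1: use virtual address bits to read the tags of every way in one set."""
    index = (vaddr >> LINE_SHIFT) & ((1 << SET_BITS) - 1)
    return cache_tags[index]


def way_hit_vector(set_tags, paddr):
    """Step 2: compare each way's tag with the physical tag.

    This runs once translation has finished; bit w of the result is set
    when way w hits.
    """
    ptag = paddr >> (LINE_SHIFT + SET_BITS)
    vec = 0
    for way, tag in enumerate(set_tags):
        if tag is not None and tag == ptag:
            vec |= 1 << way
    return vec
```

Under these assumptions the index bits lie entirely within the page offset, so they are identical in the virtual and physical address and the set can be read before the TLB responds.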
[0072] In practice, the second data fetch pipeline is used to execute the second data fetch subtask. The second data fetch subtask includes: obtaining the first physical address from the translation lookaside buffer (TLB); sending the first physical address and the data forwarding mask to the storage queue to trigger the first forward operation in the storage queue, wherein the data forwarding mask is generated based on the address information, opcode, and position information of each storage instruction in the storage queue; sending the first physical address and the data forwarding mask to the commit storage instruction buffer to trigger the second forward operation, from the storage instruction to the fetch instruction, in the commit storage instruction buffer; determining whether the cache read request hits based on the first physical address; and, if the cache read request hits, obtaining the hit vector from the data cache.
[0073] The store queue temporarily holds store instructions, and their data, that have not yet been written to main memory. When the physical address of a fetch instruction and its data forwarding mask are sent to the store queue, the queue checks whether it holds any address-related store instructions that precede the fetch instruction in program order. If so, the relevant store instruction in the queue forwards part or all of its data to the fetch instruction. For example, the store queue may include comparison logic that compares the received first physical address with the physical address of each store instruction in the queue and uses the data forwarding mask to determine the specific data bytes that need to be forwarded.
[0074] The commit store instruction buffer stores store instructions that have been executed and committed but have not yet been written to main memory or the data cache. Sending the first physical address and a data forwarding mask to this buffer can trigger another layer of data forwarding. For example, the commit store instruction buffer can maintain a list of committed store instructions and contain corresponding address comparison and data selection logic to provide the latest data in response to forwarding requests from fetch instructions.
[0075] In some embodiments, the multiple data retrieval pipelines further include a third data retrieval pipeline, which is used to execute a third data retrieval subtask. The third data retrieval subtask includes: Based on the return data of the first and second forward operations, the forward data of the fetch instruction is obtained; if the cache read request hits, the cache data corresponding to the first virtual address is retrieved from the data cache according to the hit vector; based on the forward data and the cache data, the read data of the fetch instruction is obtained; the read data of the fetch instruction and the first physical address are written to the fetch queue of the memory access unit so that the execution result of the fetch instruction can be returned through the fetch queue; the fetch queue writes the read data back to the common data bus or sends it to the functional unit to perform subsequent operations.
[0076] The third data fetch pipeline is the hardware execution stage in the memory access unit used to process data integration of fetch instructions and write back the results. It can be implemented as an independent hardware module.
[0077] In practice, the third data fetch pipeline receives the result of the first forwarding operation from the store queue and the result of the second forwarding operation from the commit store instruction buffer, and selects or merges the latest forwarded data from them. Specifically, valid bytes from the different forwarding sources can be combined using the data forwarding mask and data merging logic.
[0078] In a specific implementation, the data for each 8-byte unit can be selected according to the data cache access result. Forwarded data has higher priority than data read from the data cache. Among the store instructions that precede the fetch instruction in program order, the closer a store instruction is to the fetch instruction, the higher its priority.
[0079] When a cache read request hits, the third fetch pipeline uses the hit vector provided by the data cache (indicating which cache line was hit) and the address information of the fetch instruction to read the corresponding data from the data cache's data storage area. This can be achieved through the data cache's read port and address decoder. When the hit signal is valid, the data cache outputs the data at the corresponding address to the third fetch pipeline.
[0080] After the forwarded data and any cache data have been obtained, the final data selection can be performed. In practice, forwarded data has higher priority because it represents the most recent write. Therefore, if valid forwarded data exists, it is used first; otherwise, if the cache read request hits, the cache data is used. This can be implemented with a priority encoder and a data selector that determine the final data source from the validity flag of the forwarded data and the cache hit flag.
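The selection rule above can be illustrated with a per-byte multiplexer. The sketch assumes an 8-byte access and one validity bit per byte; the function name `select_read_data` and its parameters are hypothetical and chosen only for this example.

```python
def select_read_data(fwd_data, fwd_mask, cache_data, cache_hit, width=8):
    """Per-byte select: forwarded bytes win, remaining bytes come from the cache.

    Returns (data, complete). complete is False when the cache missed and
    some byte has no forwarded source, so the result must wait for backfill.
    """
    full_mask = (1 << width) - 1
    if not cache_hit and fwd_mask != full_mask:
        return None, False
    out = bytearray(width)
    for i in range(width):
        if (fwd_mask >> i) & 1:
            out[i] = fwd_data[i]    # forwarded data has the higher priority
        else:
            out[i] = cache_data[i]  # only reached when the cache hit
    return bytes(out), True
```

Note that the sketch also returns a result when the cache misses but every byte has been forwarded, since no cache byte is needed in that case.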
[0081] The final fetch instruction's read data and the physical address accessed by that instruction are stored in the fetch queue within the memory access unit. The fetch queue is a buffer that temporarily stores the results of completed fetch instructions for subsequent execution units or submission logic to read. This is achieved through the fetch queue's write port and control logic. After the third fetch pipeline completes data integration, the result is written to the corresponding entry in the fetch queue, and the entry is marked as "write-back".
[0082] In some embodiments, if the cache read request misses, the backfill data information of the data cache is obtained through the fetch queue; using the data forwarding mask, the read data of the fetch instruction is obtained based on the backfill data information of the data cache and the forwarded data.
[0083] When a cache read request from a memory access unit fails to hit the data cache, it means that the required data is not in the current cache level. In this case, the data cache will send a data request to the next level of memory (such as main memory) and fill the data cache with the retrieved data.
[0084] To ensure that fetch instructions can promptly obtain the backfilled data, the embodiments provided in this invention utilize a fetch queue to receive backfilled data information from the data cache. The memory access unit can be designed with a dedicated backfilled data interface. After the data cache completes data backfilling, this interface directly pushes the backfilled data block and its corresponding address range to a specific buffer in the fetch queue. The control logic within the fetch queue identifies this backfilled data and associates it with the fetch instructions awaiting processing.
[0085] In a specific implementation, the data selection mask can be inverted to obtain the padding data mask; for each data unit in the backfill data information, if the address of the fetch instruction matches the address of the data unit, the padding data is extracted from the data unit based on the padding data mask; and the read data of the fetch instruction is obtained based on the forwarded data and the padding data.
[0086] In some embodiments, if the cache read request misses but the forwarded data has been completely obtained, the read data of the fetch instruction is obtained from the forwarded data alone. This lets the fetch instruction use the latest forwarded data first and return its result earlier.
[0087] As shown in Figure 4, when a data cache miss occurs, a refill operation is triggered to retrieve the required data from lower-level storage (such as an L2 cache, another lower-level cache, or main memory) and fill the data cache. During this process, the fetch queue continuously monitors the refill results of the data cache.
[0088] The data selection mask fwdmask is a bit vector. A bit value of 1 indicates that the forwarded data at the corresponding position in the fetch queue is valid; a bit value of 0 indicates that the forwarded data at that position is invalid and that backfill data is required.
[0089] As shown in Figure 4, inverting the data selection mask with an inverter generates the padding data mask. The padding data mask indicates which data in the backfill data information the fetch queue needs to extract.
[0090] In Figure 4, the address fully associative mask enable signal cam_mask_wen is used for address comparison. Only when the address of the cache-missing line recorded in the fetch queue is the same as the address of the backfill data does the fetch queue obtain the backfill data, under control of the asserted cam_mask_wen signal. This keeps the backfill accurate and relevant and avoids unnecessary data transfers and operations.
[0091] As shown in Figure 4, the address fully associative mask enable signal and the padding data mask are the two inputs of an AND gate, whose output is the write enable signal wen. The write enable signal controls the merging of the forwarded data and the backfill data. When the write enable signal is valid, the fetch queue merges the forwarded data and the backfill data according to the padding data mask, and the merged data is output as the final read data of the fetch instruction.
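The inverter and AND gate described for Figure 4 can be modeled as follows. The signal names fwdmask, cam_mask_wen, and wen come from the text; the function name `merge_backfill`, the 8-byte width, and the one-bit-per-byte interpretation of the masks are assumptions made for this sketch.

```python
def merge_backfill(entry_addr, fwd_mask, fwd_data, refill_addr, refill_data, width=8):
    """Merge backfill bytes into a fetch queue entry that missed the cache.

    Returns (wen, merged_data), where wen is the per-byte write enable.
    """
    full = (1 << width) - 1
    fill_mask = ~fwd_mask & full              # inverter: padding data mask
    cam_mask_wen = entry_addr == refill_addr  # address comparison (CAM match)
    wen = fill_mask if cam_mask_wen else 0    # AND gate, applied to every bit
    merged = bytearray(fwd_data)
    for i in range(width):
        if (wen >> i) & 1:
            merged[i] = refill_data[i]        # take this byte from the backfill data
    return wen, bytes(merged)
```

Bytes whose fwdmask bit is 1 are never overwritten, so forwarded data keeps its priority over backfill data even when the addresses match.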
[0092] In the embodiments provided by this invention, when a cache read request misses, the fetch queue obtains the backfill data information from the data cache. Combined with the data selection mask, the forwarded data and the backfill data are merged, so that the fetch instruction can still obtain complete and correct read data when the cache misses. On a miss, the fetch queue acts as the intermediary channel that receives the backfill data, avoiding waiting delays caused by missing data. The data selection mask identifies the priority of the data sources, so that forwarded data takes precedence over the corresponding parts of the backfill data. This reduces the resource consumption and latency of the CAM port used for fully associative address comparison, lowers the consumption of processor hardware resources, reduces the impact on processor clock speed, and returns results earlier.
[0093] The memory access unit is configured with multiple data storage pipelines, which are independent hardware paths that process store instructions at different stages.
[0094] In some embodiments, the multiple storage pipelines include a first storage pipeline, which is used to execute a first storage subtask.
[0095] The first storage subtask includes: obtaining the store instruction; generating the second virtual address of the store instruction; and sending the second virtual address to the translation lookaside buffer (TLB) to request that it translate the second virtual address into the second physical address of the hardware memory unit to be accessed by the store instruction.
[0096] In a specific implementation, as shown in Figure 3, the store instruction can be obtained from the store instruction reservation station, which issues it.
[0097] The second virtual address refers to the logical address of the memory that the store instruction will access, calculated based on the operands and addressing mode of the store instruction.
[0098] In some embodiments, the multiple storage pipelines further include a second storage pipeline for executing a second storage subtask.
[0099] The second storage subtask includes: obtaining the second physical address from the translation lookaside buffer (TLB); writing the second physical address and the store data of the store instruction into the store queue; and marking the store instruction in the store queue as write-back.
[0100] The storage queue acts as an internal buffer within the processor, temporarily storing data entries to be committed to the memory system (such as data cache or main memory). In practice, the second storage pipeline can have a dedicated write port for writing the second physical address and the stored data into a new entry in the storage queue. Alternatively, the second storage pipeline can prepare a storage queue entry containing the second physical address and the stored data and notify the storage queue controller via a control signal to atomically add it to the queue.
[0101] After the second physical address and the store data have been successfully written to the store queue, the status flag of the store instruction in the store queue is updated. This status flag indicates that the store instruction is ready to be submitted to the memory hierarchy for the actual write-back operation.
[0102] In some embodiments, the multiple storage pipelines further include a third storage pipeline for executing a third storage subtask.
[0103] The third storage subtask includes: performing a violation check on the forwarding of data from the store instruction to fetch instructions. The violation check detects whether the store instruction failed to forward its data to a fetch instruction that has the same address, follows the store instruction in program order, and has already been written back. If the violation check finds such an abnormal situation, the data fetch pipeline corresponding to the violation check is triggered to re-execute.
[0104] The third data storage subtask can be implemented using hardware logic circuitry that monitors the status of the data storage queue and the data retrieval queue, and performs necessary comparison operations. Alternatively, the third data storage subtask can be implemented using a microcode or firmware-controlled logic unit that executes a predefined violation checking algorithm within a specific clock cycle.
[0105] If the violation check detects that a store instruction failed to forward data to a fetch instruction that has the same address and a later position in program order, and that fetch instruction has already been written back, the corresponding data fetch pipeline is re-executed. The purpose is to correct the data-flow error promptly and ensure that the processor's execution results are correct. Re-executing the fetch instruction that failed the violation check interrupts the error propagation chain.
[0106] In practice, the data fetch pipeline corresponding to the violation check can be re-executed in various ways, which are not limited by the description in this specification.
[0107] For example, when the violation check detects an anomaly, the fetch instruction reservation station is notified to reissue the fetch instruction.
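The violation check can be sketched as a scan over the fetch queue when a store instruction reaches the third storage pipeline. This is a behavioral illustration only. The `LoadEntry` fields, the use of a monotonically increasing sequence number for program order, and the `forwarded_from` bookkeeping are assumptions introduced for this example and do not appear in the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadEntry:
    seq: int                       # program-order number (assumed: larger = later)
    addr: int                      # physical address accessed by the fetch
    written_back: bool             # the fetch result has already been returned
    forwarded_from: Optional[int]  # seq of the store that supplied its data, if any


def find_violations(store_seq, store_addr, fetch_queue):
    """Return the seq numbers of fetch instructions that must be re-executed."""
    replay = []
    for ld in fetch_queue:
        later = ld.seq > store_seq          # follows the store in program order
        same_addr = ld.addr == store_addr
        # The fetch missed this store's data if it received nothing, or only
        # data from a store older than this one.
        missed = ld.forwarded_from is None or ld.forwarded_from < store_seq
        if later and same_addr and ld.written_back and missed:
            replay.append(ld.seq)
    return replay
```

The returned list corresponds to the fetch instructions that would be reissued, for example by notifying the reservation station as described above.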
[0108] In some embodiments, the memory access unit further includes a fourth storage pipeline. The fourth storage pipeline is used to: notify the reorder buffer that the store instruction can be legally committed; and write the store instruction back.
[0109] The method for executing memory access instructions provided by this invention is described below with reference to Figure 2. The method is applied to the memory access unit shown in Figure 1.
[0110] As shown in Figure 2, the method for executing a memory access instruction includes the following steps: Step 201: Obtain the target fetch instruction.
[0111] Step 202: Based on the complete execution process of the target fetch instruction, obtain multiple data fetch subtasks; use each data fetch pipeline to execute one of the multiple data fetch subtasks in parallel.
[0112] In a specific implementation, executing one of the multiple data fetch subtasks in parallel on each data fetch pipeline includes: executing the first data fetch subtask using the first data fetch pipeline.
[0113] The first data fetch subtask includes: pre-decoding the store queue index of the memory access unit to obtain the position information of each store instruction in the store queue, so that the second data fetch pipeline can perform the forwarding operation corresponding to the target fetch instruction based on that position information; generating the first virtual address of the target fetch instruction; sending the first virtual address to the translation lookaside buffer (TLB) to request that it translate the first virtual address into the first physical address of the hardware memory unit accessed by the target fetch instruction; and initiating a cache read request to the data cache according to the first virtual address.
[0114] The second data fetch subtask is executed using the second data fetch pipeline. The second data fetch subtask includes: obtaining the first physical address from the translation lookaside buffer (TLB); sending the first physical address and the data forwarding mask to the store queue to trigger the first forwarding operation in the store queue, wherein the data forwarding mask is generated based on the address information, the opcode, and the position information of each store instruction in the store queue; sending the first physical address and the data forwarding mask to the commit store instruction buffer to trigger the second forwarding operation, from the store instructions in the commit store instruction buffer to the target fetch instruction; determining whether the cache read request hits based on the first physical address; and, if the cache read request hits, obtaining the hit vector from the data cache.
[0115] The third data retrieval subtask is executed using the third data retrieval pipeline.
[0116] The third data fetch subtask includes: obtaining the forwarded data of the target fetch instruction based on the return data of the first forwarding operation and the return data of the second forwarding operation; if the cache read request hits, obtaining the cache data corresponding to the first virtual address from the data cache according to the hit vector; obtaining the read data of the target fetch instruction based on the forwarded data and the cache data; and writing the read data of the target fetch instruction and the first physical address into the fetch queue of the memory access unit, so that the execution result of the target fetch instruction is returned through the fetch queue.
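One way to read "executed in parallel" in the steps above is that the three data fetch pipelines act as successive stages, so that different fetch instructions occupy different subtasks in the same cycle. The sketch below illustrates that overlap under this assumption, for an ideal pipeline with no stalls or replays; the function name `schedule` and the stage labels S1 to S3 are hypothetical.

```python
def schedule(instructions, stages=("S1", "S2", "S3")):
    """Return {cycle: [(instruction, stage), ...]} for a stall-free pipeline.

    Instruction k enters the first stage in cycle k and advances one
    stage per cycle.
    """
    timeline = {}
    for issue_cycle, instr in enumerate(instructions):
        for depth, stage in enumerate(stages):
            timeline.setdefault(issue_cycle + depth, []).append((instr, stage))
    return timeline


if __name__ == "__main__":
    for cycle, work in sorted(schedule(["ld0", "ld1", "ld2"]).items()):
        print(cycle, work)
```

In cycle 2 of this example all three pipelines are busy at once, each with a different fetch instruction, which is the situation in which dividing the work into shorter subtasks helps the clock period.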
[0117] In some embodiments, a target store instruction is obtained; multiple data storage subtasks are obtained based on the complete execution process of the target store instruction; and one of the multiple data storage subtasks is executed in parallel using each data storage pipeline.
[0118] In a specific implementation, executing one of the multiple data storage subtasks in parallel on each data storage pipeline includes: executing the first storage subtask using the first storage pipeline.
[0119] The first storage subtask includes: generating the second virtual address of the target store instruction; and sending the second virtual address to the translation lookaside buffer (TLB) to request that it translate the second virtual address into the second physical address of the hardware memory unit to be accessed by the target store instruction.
[0120] The second storage subtask is executed using the second storage pipeline.
[0121] The second storage subtask includes: obtaining the second physical address from the translation lookaside buffer (TLB); writing the second physical address and the store data of the target store instruction into the store queue; and marking the target store instruction in the store queue as being in the write-back state.
[0122] The third storage subtask is executed using the third storage pipeline.
[0123] The third storage subtask includes: performing a violation check on the forwarding of data from the target store instruction to fetch instructions. The violation check detects whether the target store instruction failed to forward its data to a fetch instruction that has the same address, follows the target store instruction in program order, and has already been written back. If the violation check finds such an abnormal situation, the data fetch pipeline corresponding to the violation check is triggered to re-execute.
[0124] For the detailed implementation of the above steps, please refer to the description of Figure 1 above; it is not repeated here.
[0125] Figure 5 is a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540. The processor 510, communications interface 520, and memory 530 communicate with each other via the communication bus 540. The processor 510 can invoke logical instructions in the memory 530 to execute a memory access instruction execution method. This method includes: obtaining a target data fetch instruction; obtaining multiple data fetch subtasks based on the complete execution process of the target data fetch instruction; and executing one of the multiple data fetch subtasks in parallel using each data fetch pipeline.
[0126] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.
[0127] In another aspect, the present invention also provides a computer program product. The computer program product includes a computer program stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the memory access instruction execution method provided above, the method including: obtaining a target data fetch instruction; obtaining multiple data fetch subtasks according to the complete execution process of the target data fetch instruction; and executing one of the multiple data fetch subtasks in parallel using each data fetch pipeline.
[0128] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements an execution method for executing memory access instructions provided by the above methods, the method comprising: obtaining a target data fetch instruction; obtaining multiple data fetch subtasks based on the complete execution process of the target data fetch instruction; and executing one of the multiple data fetch subtasks in parallel using each data fetch pipeline.
[0129] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.
[0130] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.
[0131] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.
Claims
1. A memory access unit, characterized in that, The memory access unit includes: multiple data fetch pipelines; Each data retrieval pipeline executes one of multiple data retrieval subtasks in parallel; The multiple data retrieval subtasks are obtained by dividing the complete execution process of the data retrieval instruction; the forward operation is split into each data retrieval subtask; the forward operation is the process by which the data storage instruction directly passes the data to the address-related data retrieval instruction.
2. The memory access unit according to claim 1, characterized in that, the multiple data fetch pipelines include a first data fetch pipeline and a second data fetch pipeline; the first data fetch pipeline is used to execute a first data fetch subtask; the first data fetch subtask includes: pre-decoding the store queue index of the memory access unit; obtaining the fetch instruction; based on the result of the pre-decoding, determining the store queue index of the target store instruction that precedes the fetch instruction in program order; based on the address information and opcode of the fetch instruction and the store queue index of the target store instruction, generating a data forwarding mask for the fetch instruction, so that the second data fetch pipeline can perform the forwarding operation corresponding to the fetch instruction based on the data forwarding mask; generating the first virtual address of the fetch instruction; sending the first virtual address to the translation lookaside buffer (TLB) to request that it translate the first virtual address into the first physical address of the hardware memory unit accessed by the fetch instruction; and initiating a cache read request to the data cache based on the first virtual address.
3. The memory access unit according to claim 2, characterized in that, the second data fetch pipeline is used to execute a second data fetch subtask; the second data fetch subtask includes: obtaining the first physical address from the translation lookaside buffer (TLB); sending the first physical address and the data forwarding mask to the store queue to trigger the first forwarding operation in the store queue; sending the first physical address and the data forwarding mask to the commit store instruction buffer to trigger a second forwarding operation, from the store instructions in the commit store instruction buffer to the fetch instruction; determining, based on the first physical address, whether the cache read request hits; and, if the cache read request hits, obtaining the hit vector from the data cache.
4. The memory access unit according to claim 3, characterized in that, The multiple data retrieval pipelines also include a third data retrieval pipeline; The third data retrieval pipeline is used to execute the third data retrieval subtask; The third data retrieval subtask includes: Based on the return data of the first forwarding operation and the return data of the second forwarding operation, the forwarding data of the data retrieval instruction is obtained; If the cache read request hits, the cache data corresponding to the first virtual address is retrieved from the data cache according to the hit vector; Based on the preceding data and the cached data, the read data of the data retrieval instruction is obtained; The read data of the fetch instruction and the first physical address are written to the fetch queue of the memory access unit so that the execution result of the fetch instruction can be returned through the fetch queue.
5. The memory access unit according to claim 4, characterized in that, the step of obtaining the read data of the fetch instruction based on the forwarded data and the cache data includes: if the cache read request misses, obtaining the backfill data information of the data cache through the fetch queue; and, using a data selection mask, obtaining the read data of the fetch instruction based on the backfill data information of the data cache and the forwarded data; wherein the data selection mask is a bit vector, and the value of each bit indicates whether the forwarded data at the corresponding position in the fetch queue is valid.
6. The memory access unit according to claim 5, characterized in that, The step of using the data selection mask to obtain the read data for the data retrieval instruction based on the backfill data information of the data cache and the forward data includes: Invert the data selection mask to obtain the padding data mask; For each data unit in the backfill data information, if the address of the data retrieval instruction matches the address of the data unit, the fill data is extracted from the data unit based on the fill data mask; The read data of the data retrieval instruction is obtained based on the preceding data and the filling data.
7. The memory access unit according to claim 4, characterized in that, the step of obtaining the read data of the fetch instruction based on the forwarded data and the cache data includes: if the cache read request misses but the forwarded data has been completely obtained, obtaining the read data of the fetch instruction based on the forwarded data.
8. The memory access unit according to any one of claims 1 to 7, characterized in that, the memory access unit includes: multiple data storage pipelines; each data storage pipeline executes one of multiple data storage subtasks in parallel; the multiple data storage subtasks are obtained by dividing the complete execution process of the store instruction.
9. The memory access unit according to claim 8, characterized in that, the multiple data storage pipelines include a first storage pipeline; the first storage pipeline is used to execute a first storage subtask; the first storage subtask includes: obtaining the store instruction; generating the second virtual address of the store instruction; and sending the second virtual address to the translation lookaside buffer (TLB) to request that it translate the second virtual address into the second physical address of the hardware memory unit to be accessed by the store instruction.
10. The memory access unit according to claim 9, characterized in that, the multiple data storage pipelines also include a second storage pipeline; the second storage pipeline is used to execute a second storage subtask; the second storage subtask includes: obtaining the second physical address from the translation lookaside buffer (TLB); writing the second physical address and the store data of the store instruction into the store queue; and marking the store instruction in the store queue as write-back.
11. The memory access unit according to claim 10, characterized in that, the multiple data storage pipelines also include a third storage pipeline; the third storage pipeline is used to execute a third storage subtask; the third storage subtask includes: performing a violation check on the forwarding of data from the store instruction to fetch instructions, wherein the violation check detects whether the store instruction failed to forward its data to a fetch instruction that has the same address, follows the store instruction in program order, and has already been written back; and, if the violation check detects such an anomaly, triggering the data fetch pipeline corresponding to the violation check to re-execute.
12. A method for executing a memory access instruction, characterized in that, Applied to the memory access unit as described in any one of claims 1 to 11, the method comprises: Obtain the target data retrieval instruction; Based on the complete execution process of the target data retrieval instruction, multiple data retrieval subtasks are obtained; one of the multiple data retrieval subtasks is executed in parallel using each data retrieval pipeline.
13. A chip, characterized in that, Includes the memory access unit as described in any one of claims 1 to 11.