Apparatus and method for handling memory load requests
By combining address generation and buffering circuits, address proximity conditions are identified, and unnecessary load request forwarding is suppressed, thus solving the problem of low memory load request processing efficiency and achieving more efficient memory load processing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ARM LTD
- Filing Date
- 2020-10-07
- Publication Date
- 2026-06-16
Abstract
Description
Background Technology
[0001] This technology relates to the field of data processing. More specifically, this invention relates to the processing of access requests.
[0002] The device can perform data processing operations that use data stored in a memory system. To access data items in the memory system, the device's data processing circuitry is arranged to generate access requests indicating the data items to be accessed. For example, these access requests may be load requests. In some cases, other operations of the device may require the result of a load request to proceed. Therefore, it is desirable to be able to process load requests quickly to reduce the amount of time the device waits for data items to be retrieved. This is a particularly important issue in the context of load operations, as write operations may be able to send write requests to the memory system and proceed without waiting for the storage operation to be fully completed. Similarly, other data processing operations performed by the device, such as those performed on data stored in registers in the processing circuitry, may be able to operate without waiting for a response from the memory system. Therefore, it would be advantageous to provide techniques that can efficiently process access requests.
Summary of the Invention
[0003] In one exemplary arrangement, an apparatus is provided for performing a data processing operation including loading data items from a memory system. The apparatus includes: an address generation circuit for generating an address for a load request; a pending load buffer circuit for buffering a load request received from the address generation circuit before the load request is executed to retrieve data items using the address of the load request; a load processing circuit for retrieving, in response to the load request, a series of data items from the memory system, the series including the data item identified by the load request; a coalescing circuit for forwarding the load request buffered in the pending load buffer circuit to the load processing circuit, and arranged to determine, for a set of one or more subsequent load requests buffered in the pending load buffer circuit, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the set of one or more subsequent load requests are included within the series of data items, and wherein the coalescing circuit suppresses the forwarding of the set of one or more subsequent load requests in response to the address proximity condition being satisfied; and a de-coalescing circuit configured to receive the series of data items retrieved by the load processing circuit and to return the data item identified by the load request as the result of the load request, wherein the de-coalescing circuit, in response to the address proximity condition being satisfied, returns, for each subsequent load request in the set of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
[0004] In another exemplary arrangement, a method is provided for operating an apparatus for performing a data processing operation, the data processing operation including loading data items from a memory system, the method comprising: generating an address for a load request; buffering the load request in a pending load buffer circuit before a load processing circuit executes the load request to retrieve data items using the address of the load request; forwarding the load request buffered in the pending load buffer circuit to the load processing circuit to retrieve from the memory system a series of data items including the data item identified by the load request; determining, for a set of one or more subsequent load requests buffered in the pending load buffer circuit, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the set of one or more subsequent load requests are included within the series of data items; in response to the address proximity condition being satisfied, suppressing forwarding of the set of one or more subsequent load requests to the load processing circuit; in response to the load request, retrieving from the memory system the series of data items including the data item identified by the load request; receiving the retrieved series of data items and returning the data item identified by the load request as the result of the load request; and in response to the address proximity condition being satisfied, for each subsequent load request in the set of one or more subsequent load requests, returning one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
[0005] In yet another exemplary arrangement, an apparatus is provided for performing a data processing operation including loading data items from a memory system. The apparatus includes: means for generating an address for a load request; means for buffering the load request before a means for processing the load executes the load request to retrieve data items using the address of the load request; means for forwarding a load request buffered in the means for buffering to the means for processing the load, wherein the means for processing the load retrieves, in response to the load request, a series of data items from the memory system including the data item identified by the load request; means for determining, for a set of one or more subsequent load requests buffered in the means for buffering, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the set of one or more subsequent load requests are included within the series of data items; means for suppressing forwarding of the set of one or more subsequent load requests to the means for processing the load in response to the address proximity condition being satisfied; means for receiving the retrieved series of data items and for returning the data item identified by the load request as the result of the load request; and means for returning, in response to the address proximity condition being satisfied and for each subsequent load request in the set of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
Attached Figure Description
[0006] The present technology will be further described by way of illustration only, with reference to examples of the present technology shown in the accompanying drawings, wherein:
[0007] Figure 1A illustrates an instruction sequence, including both access instructions and execution instructions, received by a prior art processor, in a scenario where a cache miss for an access instruction causes the pipeline to stall due to some subsequent execution instructions. Figure 1B shows the same sequence of instructions received by a processor according to some implementations, wherein the prioritized execution of access instructions and the deferral of execution instructions allow for greater instruction execution progress;
[0008] Figure 2 schematically illustrates a data processing apparatus in some embodiments;
[0009] Figure 3 illustrates the sequence of steps taken in a processor according to some exemplary embodiments;
[0010] Figure 4 schematically illustrates a data processing apparatus in some embodiments;
[0011] Figure 5 schematically illustrates a data processing apparatus in some embodiments;
[0012] Figure 6 schematically illustrates collision detection circuitry provided in some embodiments;
[0013] Figure 7 shows the sequence of steps taken by a collision detection circuit according to some exemplary embodiments;
[0014] Figures 8A and 8B show data dependency graphs of an exemplary instruction sequence, in which Figure 8A is an "access" data dependency graph and Figure 8B is an "execution" data dependency graph;
[0015] Figure 9 schematically shows a data processing apparatus according to some embodiments;
[0016] Figure 10 is a flowchart illustrating the sequence of steps taken according to some implementation schemes;
[0017] Figure 11 is a flowchart illustrating the sequence of steps taken according to some implementation schemes;
[0018] Figure 12 schematically shows a data processing apparatus according to some embodiments;
[0019] Figure 13A shows an example of the contents of a traversal table according to some implementation schemes;
[0020] Figure 13B shows exemplary content of the final writer table according to some implementation schemes;
[0021] Figures 14A and 14B schematically show an instruction tag storage device according to some embodiments, together with some exemplary content;
[0022] Figure 15 schematically illustrates the instruction cache hierarchy associated with the micro-operations cache according to some implementation schemes;
[0023] Figure 16 schematically illustrates an apparatus according to some exemplary specific implementations;
[0024] Figure 17 schematically illustrates the apparatus of FIG1 in some exemplary specific embodiments, together with a worked example of its operation;
[0025] Figure 18 schematically illustrates an apparatus according to some exemplary specific implementations;
[0026] Figure 19 illustrates worked examples of the trial proximity check and the address proximity check according to some exemplary specific implementations;
[0027] Figure 20 schematically illustrates the contents of a pending load buffer at different stages in a worked example according to some exemplary specific implementations;
[0028] Figure 21 is a flowchart illustrating a method for performing data processing operations according to some exemplary specific implementations;
[0029] Figure 22 schematically illustrates a sequence of instructions including a first instruction according to some embodiments, the first instruction defining each instruction in a set of subsequent instructions as either an execution instruction or an access instruction;
[0030] Figure 23 schematically illustrates decoding circuitry in some implementation schemes;
[0031] Figure 24 schematically illustrates a data processing apparatus including a micro-operation cache in some embodiments;
[0032] Figure 25 schematically illustrates a data processing apparatus including a register set in some embodiments;
[0033] Figures 26A to 26C schematically illustrate three versions of the instructions according to the present technology in some exemplary embodiments;
[0034] Figure 27 schematically illustrates decoding circuitry in some implementation schemes;
[0035] Figure 28 is a flowchart illustrating a series of steps taken by the decoding circuit according to some implementation schemes;
[0036] Figure 29 is a flowchart illustrating a series of steps taken by the decoding circuit according to some implementation schemes; and
[0037] Figure 30 schematically shows a simulator implementation that can be used in some implementation schemes.
Detailed Implementation
[0038] In apparatuses used to perform data processing operations, where at least some of those operations involve accessing data items in a memory system, it may be desirable to provide a mechanism that can efficiently handle multiple access requests. For example, since it is common for other operations to depend on the result of a load request, by providing an efficient means of handling load requests, the incidence of delays or stalls in the apparatus while it waits for the result of a load request from the memory system can be reduced.
[0039] As used herein, the term memory system refers to main memory together with any intermediate cache hierarchy that may be implemented to cache copies of data items in main memory.
[0040] The device can generate a load request with an indication of the data item to be acquired. This indication is typically the memory address of the data item, directly indicating the memory location in memory corresponding to the data item. However, the load request can also indirectly indicate the data item. For example, the load request can specify a register storing the memory address corresponding to the data item to be acquired. Alternatively, the load request can indicate a register and an offset, wherein the memory address corresponding to the data item to be retrieved is determined by applying (e.g., adding) the offset to the memory address stored in the register. Therefore, the load request initially generated by the device may not directly identify the memory address corresponding to the requested data item.
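The three ways of indicating a data item described above (a direct address, a register, or a register plus an offset) can be sketched as follows. This is a minimal illustration only: the `LoadRequest` fields, the `generate_address` function and the register name `x1` are hypothetical and do not come from the described apparatus.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LoadRequest:
    """A load request as first generated (all field names are hypothetical)."""
    address: Optional[int] = None   # direct memory address, if given
    base_reg: Optional[str] = None  # register holding a memory address
    offset: int = 0                 # offset applied to the register value


def generate_address(req: LoadRequest, registers: dict[str, int]) -> int:
    """Resolve a load request to the memory address of its data item."""
    if req.address is not None:
        return req.address
    if req.base_reg is None:
        raise ValueError("load request names neither an address nor a register")
    # Apply (here: add) the offset to the address stored in the register.
    return registers[req.base_reg] + req.offset


registers = {"x1": 0x1000}
direct = generate_address(LoadRequest(address=0x2040), registers)
indirect = generate_address(LoadRequest(base_reg="x1"), registers)
with_offset = generate_address(LoadRequest(base_reg="x1", offset=0x18), registers)
```

Only after this step do all three requests carry a concrete address that can be compared with the addresses of other load requests.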
[0041] In at least one exemplary embodiment, an apparatus is provided for performing a data processing operation including loading data items from a memory system. The apparatus includes: an address generation circuit for generating addresses for load requests; a pending load buffer circuit for buffering load requests received from the address generation circuit before the load requests are executed to retrieve data items using the addresses of the load requests; a load processing circuit for retrieving, in response to a load request, a series of data items from the memory system, the series including the data item identified by the load request; a coalescing circuit for forwarding the load request buffered in the pending load buffer circuit to the load processing circuit, and arranged to determine, for a group of one or more subsequent load requests buffered in the pending load buffer circuit, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included in the series of data items, and wherein the coalescing circuit suppresses the forwarding of the group of one or more subsequent load requests in response to the address proximity condition being satisfied; and a de-coalescing circuit configured to receive the series of data items retrieved by the load processing circuit and to return the data item identified by the load request as the result of the load request, wherein the de-coalescing circuit, in response to the address proximity condition being satisfied, returns, for each subsequent load request in the group of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
[0042] Therefore, according to the technology described herein, an apparatus for performing data processing operations is provided, the apparatus including an address generation circuit for generating an address for a load request. The apparatus is arranged to perform data processing operations, including data processing operations for loading data items from a memory system. The load request generated by the apparatus does not necessarily directly specify the address of the requested data item. That is, the memory address may need to be derived from the load request. The address generation circuit performs this function, generating an address corresponding to the load request. After the address is generated, comparisons can be performed between the addresses of the corresponding load requests, such as identifying the load request corresponding to a nearby data item, thereby enabling an advantageous scheme for processing the load request.
[0043] The device also includes load processing circuitry for retrieving a data item identified by the load request from the memory system. According to the techniques described herein, the load processing circuitry is arranged not only to retrieve the data item identified by the load request, but also to retrieve a series of data items including the data item identified by the load request. This may be because the interface between the device and the memory system is configured to transfer multiple data items at once. For example, this may occur when the device includes a vector processor arranged to handle operations related to multiple inputs or outputs in response to a single instruction and capable of handling scalar operations, thereby performing the operation relative to a single data item. Therefore, when processing a scalar load request, the device may be arranged to pull from a series of data items and then determine which data item in the series is the requested data item. In other examples, the processor is a scalar processor, and the interface between the device and the L1 cache from which the device receives data items is arranged to transfer the entire cache line to the device in response to a load request that identifies a memory location in the cache line.
[0044] The apparatus can be arranged to retrieve a series of data items in response to each load request generated by the apparatus, discarding data items not identified by the load request. However, since the load processing circuitry is arranged to retrieve a series of data items in response to a load request according to the techniques described herein, the apparatus is arranged to utilize this bandwidth when processing load requests. Therefore, if there are two or more load requests with addresses such that the series of data items retrieved by the load processing circuitry contains all the data items identified by the two or more load requests, the apparatus can utilize the additional data items in the series of data items instead of discarding them. In this way, the apparatus is arranged to process load requests in parallel to reduce the number of retrieval operations performed by the load processing circuitry, and thus provides a more efficient way to process load requests. Further details on how the techniques described herein achieve these effects will be described below.
[0045] According to the technology described herein, the device includes a pending load buffer circuit for implementing a pending load buffer. The pending load buffer circuit is arranged to buffer load requests from the address generation circuit before the load requests are executed to retrieve data items using the addresses of those load requests. The pending load buffer circuit receives load requests whose addresses have been generated by the address generation circuit. Then, before the aforementioned load processing circuit executes the load requests to retrieve data items, the pending load buffer circuit provides a storage area for these load requests. Because load requests can be generated at the device at a rate that differs from (and may be higher than) the rate at which the load processing circuit can process load requests, it is advantageous to provide functionality for buffering pending load requests that have not yet been processed. Furthermore, this provides an opportunity to examine co-pending load requests and determine whether any of these load requests have an address proximity that can be exploited according to the technology.
[0046] According to the technology described herein, a coalescing circuit is provided for forwarding load requests buffered in a pending load buffer circuit to a load processing circuit. Therefore, one function of this coalescing circuit is to enable the load processing circuit to receive load requests to be processed from the pending load buffer circuit.
[0047] In some exemplary embodiments, forwarding a load request to the load processing circuit involves passing the load request to the load processing circuit and deleting the load request from the pending load buffer circuit. In this way, the coalescing circuit ensures that once a load request is propagated to the load processing circuit, that load request leaves the buffer, thus providing more space in the pending load buffer circuit for new load requests to be stored. However, in some exemplary embodiments, the coalescing circuit is arranged to leave the load request in the pending load buffer circuit when forwarding it, and can therefore be considered to provide a copy of the load request to the load processing circuit. This means that while the load request is being processed by the load processing circuit, a record of the load request is maintained in the pending load buffer circuit, making it easier to track load requests as they move through the device and easier to return a load request to the pending load buffer circuit if needed (e.g., if an interruption prevents the load processing circuit from completing the load request).
[0048] The load request forwarded to the load processing circuitry can be a load request located at a defined position within the pending load buffer circuitry, such as the load request at the head position, i.e. the load request that has been stored in the pending load buffer circuitry for the longest time. This helps avoid situations where load requests remain in the pending load buffer circuitry unprocessed for extended periods. In an alternative exemplary embodiment, the coalescing circuitry may examine the contents of the pending load buffer circuitry in some other way to determine which of the buffered load requests is the next load request to be forwarded to the load processing circuitry.
[0049] In addition to forwarding load requests to the load processing circuitry, the coalescing circuitry is also configured to determine whether an address proximity condition is met for a group of one or more subsequent load requests buffered in the pending load buffer circuitry. These subsequent load requests are other load requests in the pending load buffer circuitry that have not been forwarded to the load processing circuitry. Therefore, when the pending load buffer circuitry contains multiple load requests specifying a proximal region of memory, the coalescing circuitry is able to identify a group of load requests drawn from these multiple load requests, thereby determining a group of subsequent load requests specifying a region of memory that is close to the region specified by the load request being forwarded.
[0050] Address proximity conditions can take many forms, but in some exemplary implementations, they are based on a simple numerical comparison between the addresses specified by the load requests. Alternatively or additionally, the address proximity condition can be determined by identifying that the memory locations specified by the load request and by the group of one or more subsequent load requests are in the same cache line. Thus, in some examples, the series of data items is a cache line, and the address proximity condition is satisfied when the data item identified by the load request and all data items identified by the group of one or more subsequent load requests are included within that cache line.
[0051] The address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included within the series of data items. In other words, the address proximity condition is satisfied when the load request causes the load processing circuitry to retrieve from memory a series of data items that will contain the data items identified by both the load request and the group of one or more subsequent load requests. For example, if the load processing circuitry is configured to retrieve a cache line containing the data item identified by a load request forwarded to the load processing circuitry, the address proximity condition is satisfied when the data items specified by the group of one or more subsequent load requests correspond to the same cache line.
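For the cache-line case, the condition can be sketched as a containment test. This is a minimal sketch under stated assumptions: the 64-byte `LINE_SIZE`, the representation of a subsequent load request as an `(address, size)` pair, and all function names are illustrative choices, not details of the described apparatus.

```python
LINE_SIZE = 64  # bytes per cache line; an assumed value


def line_base(address: int) -> int:
    """Address of the first byte of the cache line containing `address`."""
    return address - (address % LINE_SIZE)


def item_in_line(address: int, size: int, base: int) -> bool:
    """True if the whole data item [address, address + size) lies in the line at `base`."""
    return base <= address and address + size <= base + LINE_SIZE


def address_proximity_satisfied(load_address: int,
                                subsequent: list[tuple[int, int]]) -> bool:
    """Check that every subsequent (address, size) item falls inside the line
    that would be retrieved for the forwarded load request."""
    base = line_base(load_address)
    return all(item_in_line(addr, size, base) for addr, size in subsequent)


same_line = address_proximity_satisfied(0x1008, [(0x1010, 8), (0x1038, 8)])
straddles = address_proximity_satisfied(0x1008, [(0x103C, 8)])
next_line = address_proximity_satisfied(0x1008, [(0x1040, 4)])
```

The second call fails because the 8-byte item starting at `0x103C` runs past the end of the line, so not all of its data would be in the retrieved series.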
[0052] Therefore, according to the technique described herein, the device is able to suppress the forwarding of a group of one or more subsequent load requests when it determines that the group satisfies the address proximity condition relative to the load request. Instead, since the series of data items retrieved by the load processing circuitry in response to the load request will contain the data items identified by the group of one or more subsequent load requests, the device is arranged to use these data items to service the group of one or more subsequent load requests, without requiring the load processing circuitry to retrieve a series of data items for each load request.
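One possible forwarding pass over the pending load buffer can be sketched as follows. It assumes the variant in which forwarded and coalesced requests are removed from the buffer, selects the oldest request as the one to forward, and uses a same-line comparison as the address proximity condition; `forward_next`, the `(request_id, address)` tuples and the 64-byte `LINE_SIZE` are hypothetical.

```python
from collections import deque

LINE_SIZE = 64  # bytes per retrieved series of data items; an assumed value


def forward_next(pending: deque[tuple[int, int]]):
    """Forward the oldest buffered load request and coalesce matching ones.

    `pending` holds (request_id, address) pairs, oldest first. Returns the
    forwarded request and the list of subsequent requests whose forwarding
    is suppressed; both are removed from the buffer.
    """
    head = pending.popleft()
    line = head[1] // LINE_SIZE
    suppressed = [req for req in pending if req[1] // LINE_SIZE == line]
    remaining = [req for req in pending if req[1] // LINE_SIZE != line]
    pending.clear()
    pending.extend(remaining)
    return head, suppressed


pending = deque([(0, 0x1008), (1, 0x2000), (2, 0x1030), (3, 0x1040)])
forwarded, suppressed = forward_next(pending)
```

In this example only request 0 reaches the load processing stage; request 2 is serviced from the same retrieved line, while requests 1 and 3 stay buffered for a later pass.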
[0053] The device also includes a de-coalescing circuit that receives a series of data items retrieved by the load processing circuit and returns the data item identified by the load request as the result of the load request. Therefore, when the address proximity condition is not met, based on the load request and a series of data items containing the data item identified by the load request, the de-coalescing circuit can determine which data item in the series of data items is the one identified by the load request and return that data item as the result of the load request.
[0054] However, when the address proximity condition is met, the de-coalescing circuit is configured to return one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests for each subsequent load request in the group of one or more load requests. That is, in addition to determining the data items in the series that are identified by the load requests, the de-coalescing circuit is also configured to determine, for each of the one or more subsequent load requests, additional data items in the series that correspond to those one or more subsequent load requests.
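The de-coalescing step can be sketched as slicing the retrieved series of data items once per load request. The `decoalesce` function, the `(request_id, address, size)` tuples and the byte-string representation of the retrieved line are assumptions made for illustration only.

```python
def decoalesce(line_base: int, line_data: bytes,
               requests: list[tuple[int, int, int]]) -> dict[int, bytes]:
    """Return the data item for each (request_id, address, size) request.

    `line_data` is the series of data items retrieved for the forwarded load
    request, starting at address `line_base`. The first entry of `requests`
    would be the forwarded load request, the rest the coalesced ones.
    """
    results = {}
    for req_id, address, size in requests:
        start = address - line_base
        if start < 0 or start + size > len(line_data):
            raise ValueError(f"request {req_id} is not contained in the series")
        results[req_id] = line_data[start:start + size]
    return results


line_data = bytes(range(64))  # stand-in contents for one retrieved line
results = decoalesce(0x1000, line_data, [(0, 0x1008, 4), (2, 0x1030, 2)])
```

When the address proximity condition is not met, `requests` contains only the forwarded load request and a single data item is returned.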
[0055] By identifying a group of subsequent load requests that meet the address proximity condition, suppressing the forwarding of that group to the load processing circuitry, and instead using the result obtained for the load request, the device can process load requests more efficiently. This is because coalescing subsequent load requests means that no separate retrieval operation needs to be performed for them by the load processing circuitry, so the number of load processing operations performed by the load processing circuitry is reduced, allowing the load processing circuitry to process the remaining load requests more quickly.
[0056] As described above, in some exemplary embodiments, the apparatus can be arranged such that the load processing circuitry retrieves cache lines from the memory system as a series of data items. This provides a useful embodiment of the techniques discussed herein because data in the memory system can be arranged in cache lines, and thus a one-time retrieval of the entire cache line can be performed quickly. Furthermore, the addressing of memory locations allows for easy determination of whether two memory addresses correspond to the same cache line, thereby enabling rapid checking of address proximity conditions to identify whether one or more subsequent load requests are included within that same cache line.
[0057] To allow the address proximity condition to be determined quickly, in some examples the condition is satisfied when the absolute difference between the address used for the load request and the address used for each of the one or more subsequent load requests in the group is less than a predetermined threshold. This predetermined threshold can be based on the size of the series of data items retrieved by the load processing circuitry. Therefore, to perform the address proximity check, the coalescing circuitry can perform a simple numerical calculation on the addresses specified by the load requests. For example, for each of the one or more subsequent load requests in the group, the coalescing circuitry can subtract the address of the subsequent load request from the address of the load request, evaluate the magnitude of the subtraction result, and determine that the address proximity condition is met if the magnitude is less than the predetermined threshold for each of the one or more subsequent load requests.
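The subtract-and-compare check described above can be sketched directly. The function name `within_threshold` and the threshold value of 64 used in the example are assumptions; the text says only that the threshold can be based on the size of the retrieved series of data items.

```python
def within_threshold(load_address: int,
                     subsequent_addresses: list[int],
                     threshold: int) -> bool:
    """True if every subsequent address differs from the load request's
    address by a magnitude strictly less than `threshold`."""
    return all(abs(load_address - addr) < threshold
               for addr in subsequent_addresses)


close = within_threshold(0x1008, [0x1010, 0x1030], threshold=64)
too_far = within_threshold(0x1008, [0x1048], threshold=64)
```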
[0058] Therefore, in some examples, the coalescing circuit is arranged to: determine whether a trial proximity condition is met for the group of one or more subsequent load requests buffered in the pending load buffer circuit before determining whether the address proximity condition is met, wherein the coalescing circuit forwards the load request to the load processing circuit and temporarily suppresses the forwarding of the group of one or more subsequent load requests in response to the meeting of the trial proximity condition, and wherein the coalescing circuit stops temporarily suppressing the forwarding of the group of one or more subsequent load requests in response to the failure to meet the address proximity condition.
[0059] While the determination of whether a group of one or more subsequent load requests corresponds to the same set of data items can be performed in one stage, in some implementations, the coalescing circuitry can be arranged to perform a trial proximity check to determine whether a trial proximity condition is met. Based on the result of the trial proximity check, the coalescing circuitry can determine an initial indication of whether the address proximity condition will be met. In some exemplary implementations, this is achieved by comparing a portion of the address of a load request with a portion of the address of each of the group of one or more subsequent load requests. For example, by comparing the first portion of the address to determine whether the trial proximity condition is met, the coalescing circuitry can determine that if the trial proximity condition is met, the address proximity condition may also be met. Therefore, the coalescing circuitry can temporarily suppress the forwarding of the group of one or more subsequent load requests before checking the address proximity condition. This approach allows for a rapid, preliminary determination of whether a load request can be coalesced, as only a portion of the address needs to be considered.
[0060] Therefore, in some implementations, a trial proximity condition is satisfied when the first portion of all addresses of the group of one or more subsequent load requests matches the first portion of the address of the load request, and an address proximity condition is satisfied when the trial proximity condition is satisfied and when the second portion of all addresses of the group of one or more subsequent load requests matches the second portion of the address of the load request.
[0061] After determining that the trial proximity condition is met, the coalescing circuit then determines whether the address proximity condition is met. If the coalescing circuit determines that the address proximity condition is not met, the coalescing circuit can be configured to stop temporarily suppressing the forwarding of the group of one or more subsequent load requests, because it has identified that this group cannot be coalesced with the load request that will be forwarded to the load processing circuit next. However, by temporarily suppressing these requests in response to the satisfaction of the trial proximity condition, the coalescing circuit can continue to process the next load requests without waiting for the address proximity condition to be evaluated.
[0062] A trial proximity condition can be based on a first portion of all addresses in the group of one or more subsequent load requests and a first portion of the address of the load request, such that the trial proximity condition is satisfied when all these first portions match. Similarly, an address proximity condition is satisfied when the trial proximity condition is met and when the second portions of all addresses in the group of one or more subsequent load requests match the second portion of the address of the load request. Therefore, comparisons between load requests can be performed in stages, where different portions of the address are considered in each stage.
[0063] In some exemplary implementations, the first part includes fewer address bits than the second part. In this way, a quick trial proximity condition can be implemented as a preliminary indication of address proximity, which is later refined to give an accurate result regarding whether the address proximity condition is met. When checking the trial proximity condition or the address proximity condition, a third part of the address may be excluded from the comparison. This may be because the third part indicates the position of the requested data item within the set of data items, and therefore, regardless of the value of the third part, if the first parts match and the second parts match, all data items identified by the one or more subsequent load requests in that group are included within that set of data items.
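The staged comparison described above can be illustrated with a short sketch. The bit widths, field positions, and function names below are assumptions chosen for illustration; the description only states that the first part may have fewer bits than the second and that the third part (the position within the set of data items) is not compared.

```python
from typing import Sequence

# Hypothetical 32-bit address layout (an assumption for illustration only):
#   bits [5:0]    third part: position of the item within a 64-byte set
#   bits [11:6]   first part: 6 bits, compared in the quick trial check
#   bits [31:12]  second part: 20 bits, compared in the full check
OFFSET_BITS = 6
FIRST_BITS = 6


def first_part(addr: int) -> int:
    """Extract the small field used by the trial proximity check."""
    return (addr >> OFFSET_BITS) & ((1 << FIRST_BITS) - 1)


def second_part(addr: int) -> int:
    """Extract the remaining upper bits used to confirm the match."""
    return addr >> (OFFSET_BITS + FIRST_BITS)


def trial_proximity(load_addr: int, subsequent: Sequence[int]) -> bool:
    """Stage 1: only the first parts are compared."""
    return all(first_part(a) == first_part(load_addr) for a in subsequent)


def address_proximity(load_addr: int, subsequent: Sequence[int]) -> bool:
    """Stage 2: the trial condition holds and the second parts also match."""
    return trial_proximity(load_addr, subsequent) and all(
        second_part(a) == second_part(load_addr) for a in subsequent
    )
```

With this layout, addresses 0x1040 and 0x2040 pass the trial check (their first parts are equal) but fail the full check, which is the case in which a temporarily suppressed request would be released again.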
[0064] While the de-coalescing circuitry can identify a set of subsequent requests that have been coalesced with a load request in various ways (including by referring to the pending load buffer circuitry), in some exemplary embodiments, the de-coalescing circuitry receives an indication from the coalescing circuitry that coalescing has occurred. According to such exemplary embodiments, the coalescing circuitry provides the de-coalescing circuitry with a coalescing request indication in response to the address proximity condition being met, the coalescing request indication identifying the load request and the set of one or more subsequent load requests; the de-coalescing circuitry, in response to the coalescing request indication, identifies the one or more additional data items based on that indication. This provides the de-coalescing circuitry with information that can be used to determine how to de-coalesce the load request. The reason for this is that, on receiving a series of data items retrieved by the load processing circuitry for a load request, the de-coalescing circuitry may need to know whether to output only the data items in the series that correspond to the load request, or whether the load request has been coalesced with a set of subsequent load requests, in which case it also needs to output the additional data items identified by that set of subsequent load requests.
[0065] Therefore, in response to a coalescing request indication, the de-coalescing circuit identifies the one or more additional data items that it may subsequently output as the results of the subsequent load requests in that group.
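As a minimal sketch of how the indication passed from the coalescing circuit might be used, the following models the retrieved series of data items as a 64-byte block and extracts one result per request. The class names, the 64-byte size, and the use of a request identifier are all assumptions made for illustration, not details taken from the description.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

LINE_BYTES = 64  # assumed size of the retrieved series of data items


@dataclass(frozen=True)
class Load:
    req_id: int  # hypothetical tag used to match results to requests
    addr: int
    size: int    # number of bytes requested


@dataclass(frozen=True)
class CoalescedIndication:
    load: Load                      # the request actually forwarded
    subsequent: Tuple[Load, ...]    # requests coalesced with it (may be empty)


def decoalesce(line: bytes, ind: CoalescedIndication) -> Dict[int, bytes]:
    """Return the data item for the load and for each coalesced request."""
    results = {}
    for req in (ind.load, *ind.subsequent):
        offset = req.addr % LINE_BYTES  # position within the retrieved series
        results[req.req_id] = line[offset:offset + req.size]
    return results
```

When `subsequent` is empty, only the data item for the forwarded load request is returned, which corresponds to the uncoalesced case.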
[0066] In some exemplary embodiments, the pending load buffer circuitry includes a FIFO buffer, wherein the load request is the oldest load request in the FIFO buffer, and wherein the group of one or more subsequent load requests are newer load requests in the FIFO buffer. Thus, the pending load buffer circuitry may include a first-in, first-out (FIFO) buffer, whereby a load request is added to the buffer at the tail and proceeds to the head of the buffer. The coalescing circuit is arranged to operate on the load request at the head of the buffer, which is the oldest request in the FIFO buffer. The group of one or more subsequent load requests has therefore been in the FIFO buffer for a shorter period and consists of newer load requests. Some exemplary embodiments of the technology described herein use FIFO buffers because this ensures that load requests do not remain in the pending load buffer circuitry for too long, and because FIFO buffers can represent an efficient way of providing a buffer with minimal overhead in terms of required storage and the operations used to manage the buffer.
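The FIFO arrangement can be sketched as follows. This is a simplified software model under stated assumptions: addresses stand in for load requests, a 64-byte series is assumed, and coalesced requests are removed from the queue outright rather than being left in place and marked, which is a simplification of the buffering described here.

```python
from collections import deque
from typing import Deque, List, Tuple

LINE_BYTES = 64  # assumed size of the series of data items


def same_series(a: int, b: int) -> bool:
    """True if both addresses fall within the same 64-byte series."""
    return a // LINE_BYTES == b // LINE_BYTES


def forward_next(fifo: Deque[int]) -> Tuple[int, List[int]]:
    """Forward the oldest request and coalesce newer requests with it.

    Returns the forwarded address and the addresses whose forwarding
    is suppressed because the same retrieval covers them.
    """
    head = fifo.popleft()  # oldest request sits at the head
    coalesced = [a for a in fifo if same_series(a, head)]
    remaining = [a for a in fifo if not same_series(a, head)]
    fifo.clear()
    fifo.extend(remaining)  # relative order of the other requests is kept
    return head, coalesced
```

For example, with pending addresses 0x1040, 0x2000, 0x1050, 0x3000, 0x1078, forwarding the head coalesces 0x1050 and 0x1078 and leaves 0x2000 and 0x3000 pending in their original order.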
[0067] When the address proximity condition is met, the subsequent load requests in this group are coalesced with the load request that will be forwarded to the load processing circuitry. This means that these subsequent load requests are processed earlier than they would have been had they not been coalesced, which changes the order in which load requests are processed. If the only accesses to the memory system are load requests, reordering the load requests does not cause any problems regarding memory consistency because the data items are not changed. However, the device may operate within a system in which data items in the memory system can be modified. In that case, the order in which operations are performed does matter, because whether the load request is executed before or after the modification of the data item specified by the load request can affect the result of the load request.
[0068] Therefore, in some exemplary embodiments of the technology described herein, the apparatus further includes: a hazard detection circuit for detecting actions related to the modification of the series of data items, and in response to detecting such actions, causing a memory consistency operation to be performed to ensure that the retrieval of the series of data items by the load processing circuit and the modification of the series of data items are performed in the order specified by the memory consistency protocol. The hazard detection circuit can thus identify when another device requests the ability to modify the series of data items, or when another operation in the process performed by the device will modify the series of data items. As used herein, modification of a series of data items refers to modifying at least one data item in the series, and therefore an action related to the modification of the series of data items can be, for example, a request by another device that performs data processing operations to write one of the data items. In this case, it is important to ensure that the devices maintain a consistent and coherent view of the order in which requests are executed. Another example of an action related to the modification of the series of data items is a write request issued in the device by the same process as the load request. It is important to execute load and write requests in the correct (program-defined) order, because otherwise, if requests related to the same data items are processed in the wrong order, the program may produce unexpected or incorrect results.
[0069] To address this issue and ensure consistent memory access, the hazard detection circuit, in response to detecting an action related to a modification of the series of data items, causes a memory consistency operation to be performed. This ensures that the retrieval of the series of data items by the load processing circuit and the modification of the series of data items occur in the order specified by the memory consistency protocol. The memory consistency protocol defines the expected order of operations for retrieving and modifying the series of data items, and based on this protocol, the hazard detection circuit is configured to cause the memory consistency operation to be performed. This can involve the hazard detection circuit performing the memory consistency operation itself or causing it to be performed elsewhere. In this way, the hazard detection circuit can detect hazards and take action to prevent memory consistency problems from occurring.
[0070] An example of a memory consistency operation that can be performed involves resuming the load request in the pending load buffer circuit and preventing the load request from being forwarded to the load processing circuit until the modification of the series of data items has been completed. Therefore, in some embodiments, the memory consistency operation includes: resuming the load request in the pending load buffer circuit and preventing the load request from being forwarded to the load processing circuit until the modification of the series of data items has been completed; and preventing the de-coalescing circuit from returning data items from the series of data items as a result of the load request when the series of data items is retrieved before the operation modifying the series of data items has been completed.
[0071] This causes the load request to be reissued from the pending load buffer circuit. Therefore, in response to an action related to modification, the modification is permitted to proceed while the load request is resumed, so that the retrieval of the data items is performed after the modification of that series of data items. In this way, a consistent approach to handling data hazards can be achieved, and thus the accuracy of the data processing operations performed by the device can be maintained.
[0072] The hazard detection circuit may detect a hazard after the load processing circuit has already retrieved the series of data items identified by the load request. In keeping with the scheme of resuming the load request in the pending load buffer circuit, if the series of data items identified by the load request has been retrieved from the memory system before the operation modifying the series of data items has been completed, the memory consistency operation involves preventing the de-coalescing circuit from returning data items from the series of data items as the result of the load request, thereby ensuring that the result of the load request corresponds to the series of data items retrieved after the operation modifying the series of data items has been completed.
[0073] In some exemplary embodiments, resuming a load request in the pending load buffer involves adding the load request to the pending load buffer. Specifically, this might be the case in an exemplary embodiment where forwarding a load request from the pending load buffer circuit to the load processing circuit involves removing the load request from the pending load buffer circuit. Therefore, to resume the load request, it is re-added to the pending load buffer circuit. The load request can be added in the same way as a load request for which the address generation circuit has just generated an address, or it can be added in a different way. For example, it might be desirable to speed up the passage of the load request through the pending load buffer circuit when the load request is re-added because of a hazard. Therefore, the load request can be added at a location in the pending load buffer circuit that results in the load request being forwarded to the load processing circuit again sooner than if it were added again from the address generation circuit. This approach avoids situations where the load request is delayed for too long, as this could cause stalls or delays in the processing operations of the device.
[0074] In some exemplary embodiments, the action related to the modification of the series of data items may be a write notification issued by another device. The device for performing the data processing operation may be only a core or central processing unit (CPU), while a wider device or system may include more than one core or CPU. Therefore, the other device may be another core or CPU. If the other device attempts to write to the series of data items, it may issue a write notification to indicate its request to perform the write operation. Thus, in some exemplary embodiments, the action related to the modification of the series of data items is a write notification issued by another device, and the device, in response to detecting the write notification, delays sending an acknowledgment for the write notification until after the load processing circuitry has retrieved the series of data items, wherein the acknowledgment signals permission to continue modifying the series of data items. Therefore, the device is arranged to respond to a write notification with an acknowledgment signaling to the other device that it can continue with the modification. The write notification and acknowledgment may be implemented on the core and transmitted on an interconnect that provides an interface between the core and the memory system. To ensure memory consistency, the memory consistency operation in the exemplary embodiments using the aforementioned write notification involves delaying the sending of the acknowledgment in response to detecting the write notification. Because the other device waits for the acknowledgment before it continues modifying the series of data items, the device can ensure that the load processing circuitry retrieves the series of data items before the other device modifies them. Therefore, this method provides a means to ensure memory consistency even if an external device attempts to modify data being loaded.
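The delayed acknowledgment can be modeled with a small sketch. The class, its method names, and the use of an integer to identify a series of data items are hypothetical; the sketch only shows the ordering property described above, namely that an acknowledgment for a series that is currently being retrieved is withheld until the retrieval completes.

```python
from typing import Dict, List, Set, Tuple


class HazardDetector:
    """Toy model: defer write acknowledgments while a load is outstanding."""

    def __init__(self) -> None:
        self.in_flight: Set[int] = set()           # series being retrieved
        self.deferred: Dict[int, List[str]] = {}   # series -> waiting writers
        self.acks: List[Tuple[str, int]] = []      # acknowledgments sent, in order

    def load_forwarded(self, series: int) -> None:
        """Record that a load for this series was sent to load processing."""
        self.in_flight.add(series)

    def write_notification(self, writer: str, series: int) -> None:
        """Acknowledge immediately unless a load of the series is outstanding."""
        if series in self.in_flight:
            self.deferred.setdefault(series, []).append(writer)
        else:
            self.acks.append((writer, series))

    def load_retrieved(self, series: int) -> None:
        """Retrieval finished: release any acknowledgments that were held back."""
        self.in_flight.discard(series)
        for writer in self.deferred.pop(series, []):
            self.acks.append((writer, series))
```

In this model the other device is assumed to wait for its acknowledgment before writing, so the retrieval of the series is ordered before the modification.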
[0075] To increase the chance of finding a group of one or more subsequent load requests in the pending load buffer circuit that meets the address proximity condition, it may be desirable to provide a pending load buffer circuit that stores a large number of pending load requests. By storing more pending load requests, the frequency with which load requests forwarded to the load processing circuit are coalesced with subsequent load requests can be increased, because there are more load requests in the pending load buffer circuit that can match the load request. In some exemplary embodiments, the device includes an out-of-order processor that is arranged to execute instructions in an order different from the order in which the instructions are received. Such an out-of-order processor may be better able to accommodate the reordering of load requests that occurs when load requests are coalesced and that may occur during the address generation phase. The out-of-order processor may be arranged to operate with a larger instruction window to increase the number of load requests in the pending load buffer circuit, thereby increasing the chance that a given load request can be coalesced with a group of one or more subsequent load requests.
[0076] Compared to other operations that a device can perform, load operations are particularly likely to cause delays or stalls in a data processing device. This is because, in order to perform subsequent operations, certain other operations may need to be performed first, such that if those operations have not yet been completed, the device must wait until the results of those operations are obtained. For certain types of operations, such as calculations involving operands in registers within the processor, the operations can be performed easily without waiting for external devices such as the memory system. For write operations, these operations can be performed by instructing the memory system to perform the write, where the write does not actually need to be completed before the processor can proceed to the next operation. However, for load operations, the processor needs to wait for the result to be retrieved from the memory system, which may take a considerable amount of time. In some exemplary embodiments, the device includes an out-of-order processor for performing data processing operations. An out-of-order processor that also performs data processing operations other than loading data items from the memory system can be arranged such that data processing operations including loading data items from the memory system take precedence over those other data processing operations.
[0077] One type of out-of-order processor that can be arranged to operate according to the techniques described herein is the Decoupled Access Execution (DAE) processor. In a DAE processor, instructions are categorized into “access” instructions and “execution” instructions based on their dependencies. Specifically, this classification is based on identifying load instructions and the chains or graphs of instructions that lead to each load instruction, linked by their data dependencies. If an instruction is determined to be a load instruction, or to be necessary for executing a load instruction (because it provides an operand of the load instruction), the instruction is designated as an access instruction. Likewise, any instruction that provides a source operand of an instruction that is itself an access instruction is also designated as an access instruction, thus building up such a chain or data graph. Otherwise, the instruction is considered an execution instruction. The DAE processor is then arranged to process these two types of instructions as separate instruction streams in separate execution circuits. Notably, the execution of access instructions takes precedence over the execution of execution instructions in an effort to allow load instructions to begin execution as soon as possible, thus “hiding” the associated latency to the greatest extent possible if a memory access is required (i.e., typically when a cache miss occurs). Such techniques are described in more detail elsewhere in this document. The load coalescing technique of the present invention can be applied as part of how the DAE processor handles load requests, because load instructions can be identified before the results of the load requests are needed. Therefore, a larger instruction window of load requests can exist, from which load requests are selected for coalescing, thereby enabling the coalescing technique described herein to be applied even more efficiently.
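The access/execution classification can be sketched as a backward walk over data dependencies starting from each load. The instruction encoding (name, opcode, destination register, source registers) and the function name are assumptions made for this illustration; a real DAE processor or compiler would work on its own instruction representation.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical instruction form: (name, opcode, destination, sources)
Instr = Tuple[str, str, Optional[str], Sequence[str]]


def classify(instrs: List[Instr]) -> Dict[str, str]:
    """Label each instruction 'access' or 'execute'.

    An instruction is 'access' if it is a load or if, through a chain of
    data dependencies, it produces an operand that a load needs.
    """
    producer: Dict[str, int] = {}   # register -> index of its latest writer
    deps: List[List[int]] = []      # per instruction: indices it depends on
    for i, (_, _, dest, srcs) in enumerate(instrs):
        deps.append([producer[s] for s in srcs if s in producer])
        if dest is not None:
            producer[dest] = i

    access = set()
    stack = [i for i, ins in enumerate(instrs) if ins[1] == "load"]
    while stack:                    # walk backwards from every load
        i = stack.pop()
        if i in access:
            continue
        access.add(i)
        stack.extend(deps[i])

    return {ins[0]: ("access" if i in access else "execute")
            for i, ins in enumerate(instrs)}
```

In a small program where an add computes an address, a load uses it, and a multiply consumes the loaded value, the add and the load are labeled as access instructions and the multiply as an execution instruction.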
[0078] In some exemplary implementations, to continuously track the status of load requests, a pending load buffer circuit is arranged to store a status indicator for each load request buffered in the pending load buffer circuit. This status indicator may take one of several states to indicate which stage of the process of handling the load request is underway. By storing status indicators, an efficient way is provided to continuously track load requests propagating through the device without having to implement dedicated circuitry to perform this tracking.
[0079] A status indicator for a load request received from the address generation circuitry can initially be set to indicate a valid state. The status indicator can be implemented as a series of bits stored with the load request, where the values of these bits indicate the state. In response to the address proximity condition being met, the status indicator corresponding to each of the one or more subsequent load requests in that group is set to indicate an invalid state, and the coalescing circuitry is arranged to suppress the forwarding of any load request whose status indicator indicates the invalid state. In this way, the status indicator provides a mechanism to prevent the load processing circuitry from retrieving a series of data items based on the subsequent load requests, because these subsequent requests will be coalesced. Therefore, when subsequent load requests are identified for coalescing, they can be left in the pending load buffer circuitry while still ensuring that they do not unnecessarily delay the operation of the load processing circuitry. The status indicator can also be used for more complex behaviors of the processing device, such as resuming load requests, performing two-phase proximity checks, and handling hazard detection as discussed above.
[0080] Therefore, in some exemplary embodiments, a pending load buffer circuit is arranged to store a status indicator for each load request in the buffered load requests, wherein the status indicator for the load request received from the address generation circuit is initially set to indicate a valid state, wherein in response to the address proximity condition being met, the status indicator corresponding to one or more subsequent load requests in the group is set to indicate an invalid state, and the coalescing circuit suppresses the forwarding of an invalid load request in response to an invalid load request in the pending load buffer circuit.
[0081] In an exemplary implementation using the trial proximity condition discussed above, a status indicator can be set to a hold state for the one or more subsequent load requests in response to the satisfaction of the trial proximity condition. Satisfaction of the trial proximity condition indicates that the address proximity condition may be satisfied, and therefore that the group of subsequent load requests may be suitable for coalescing. Therefore, while it determines whether the address proximity condition is satisfied, the coalescing circuitry can temporarily suppress the forwarding of that group of one or more subsequent load requests by suppressing the forwarding of load requests in the hold state. If it is later determined that the address proximity condition is not satisfied for that group of subsequent load requests, the status indicator can be reset to a valid state so that the coalescing circuitry can forward these load requests to the load processing circuitry, or coalesce these requests with another request that will be forwarded to the load processing circuitry next. Therefore, in some such implementations, in response to the satisfaction of the trial proximity condition, the status indicator for the group of one or more subsequent load requests is set to indicate a hold state, wherein the coalescing circuit temporarily suppresses the forwarding of load requests with a hold state in response to a load request with a hold state in the pending load buffer circuit, and wherein in response to the failure to satisfy the address proximity condition, the status indicator corresponding to the group of one or more subsequent load requests is reset to a valid state.
[0082] In an exemplary implementation that resumes a load request in the pending load buffer circuit in response to a detected hazard, when a load request is forwarded, the status indicator of the load request is set to indicate an in-flight state. This in-flight state indicates that the load request has been sent from the pending load buffer circuit, and therefore the coalescing circuit does not need to consider the in-flight load request for forwarding to the load processing circuit or for coalescing with another load request. When the de-coalescing circuit has returned the result for the in-flight load request, the status indicator for that load request is set to an invalid state, thereby indicating that the load request does not need to be forwarded to the load processing circuit and can be removed from the pending load buffer circuit. In this implementation, if a hazard is detected and a load request is to be resumed in the pending load buffer circuit, this can be achieved by resetting the status indicator corresponding to the load request to a valid state so that the coalescing circuit will consider the load request for forwarding. Therefore, in some such implementations, when a load request is forwarded, the status indicator of the load request is set to indicate an in-flight state, wherein in response to the de-coalescing circuit returning a data item identified by the load request as a result of the load request, the status indicator corresponding to the load request is set to an invalid state, and wherein resuming the load request in the pending load buffer includes resetting the status indicator corresponding to the load request to a valid state.
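The status indicator states described in the preceding paragraphs (valid, hold, in-flight, invalid) can be summarized as a small state machine. The event names and the table-driven form are assumptions for illustration; the description itself only defines the states and when they change.

```python
from enum import Enum


class State(Enum):
    VALID = "valid"          # eligible for forwarding or coalescing
    HOLD = "hold"            # forwarding temporarily suppressed (two-stage check)
    IN_FLIGHT = "in-flight"  # forwarded, result not yet returned
    INVALID = "invalid"      # coalesced or completed; never forwarded again


# (current state, hypothetical event name) -> next state
TRANSITIONS = {
    (State.VALID, "trial_met"): State.HOLD,
    (State.HOLD, "proximity_met"): State.INVALID,
    (State.HOLD, "proximity_failed"): State.VALID,
    (State.VALID, "proximity_met"): State.INVALID,   # single-stage check
    (State.VALID, "forwarded"): State.IN_FLIGHT,
    (State.IN_FLIGHT, "result_returned"): State.INVALID,
    (State.IN_FLIGHT, "hazard"): State.VALID,         # request is resumed
}


def step(state: State, event: str) -> State:
    """Apply one event to a status indicator, rejecting undefined moves."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"no transition from {state.value} on {event}") from None


def may_forward(state: State) -> bool:
    """Only requests in the valid state are candidates for forwarding."""
    return state is State.VALID
```

A table of this kind makes it easy to see that the hold and invalid states both suppress forwarding, and that only a hazard or a failed address proximity check returns a request to the valid state.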
[0083] In some implementations, the device also includes an out-of-order processor for performing data processing operations.
[0084] In some implementations, the data processing operations also include data processing operations other than loading data items from the memory system.
[0085] The device is arranged such that data processing operations, including loading data items from the memory system, take precedence over data processing operations other than loading data items from the memory system.
[0086] In some implementations, the out-of-order processor is a decoupled access execution processor.
[0087] In at least one exemplary embodiment, a method is provided for operating an apparatus for performing a data processing operation, the data processing operation including loading data items from a memory system, the method comprising: generating addresses for load requests; buffering the load requests in a pending load buffer circuit before a load processing circuit executes the load requests to retrieve data items using the addresses of the load requests; forwarding the load requests buffered in the pending load buffer circuit to the load processing circuit to retrieve from the memory system a series of data items including data items identified by the load requests; determining, for a set of one or more subsequent load requests buffered in the pending load buffer circuit, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the set of one or more subsequent load requests are included in the series of data items; in response to satisfying the address proximity condition, suppressing forwarding of the set of one or more subsequent load requests to the load processing circuit; in response to the load request, retrieving from the memory system the series of data items including the data items identified by the load request; receiving the retrieved series of data items and returning the data items identified by the load request as the result of the load request; and in response to satisfying the address proximity condition, for each subsequent load request in the set of one or more subsequent load requests, returning one or more additional data items from the series of data items identified by the one or more subsequent load requests as the result of the one or more subsequent load requests.
[0088] In at least one exemplary embodiment, an apparatus is provided for performing a data processing operation including loading data items from a memory system. The apparatus includes: means for generating addresses for load requests; means for buffering the load requests before means for processing the load executes the load requests to retrieve data items using the addresses of the load requests; means for forwarding load requests buffered in the means for buffering the load requests to the means for processing the load to retrieve a series of data items from the memory system including data items identified by the load requests, wherein the means for processing the load retrieves the series of data items from the memory system including data items identified by the load requests in response to the load requests; means for determining, for a group of one or more subsequent load requests buffered in the means for buffering, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included in the series of data items; means for suppressing forwarding of the group of one or more subsequent load requests to the means for processing the load in response to satisfying the address proximity condition; means for receiving the retrieved series of data items and for returning the data item identified by the load request as a result of the load request; and means for: in response to satisfying the address proximity condition, returning one or more additional data items identified by the one or more subsequent load requests from the series of data items as a result of the one or more subsequent load requests for each subsequent load request in the group of one or more subsequent load requests.
[0089] Specific implementations will now be described with reference to the drawings.
[0090] Figure 1A illustrates a scenario that provides context in which the inventive technique is particularly relevant. The figure (on the left) shows a sequence of instructions comprising both access (A1, A2, etc.) and execution (E1, E2, etc.) instructions, received by a prior art processor in a single, interwoven stream as shown. Assume that access instruction A3 is a "load" instruction that provides the information required by execution instruction E1. If access instruction A3 triggers a cache miss, the sequence of execution instructions starting with E1 will be stalled until the memory system delivers the requested data. This has the additional disadvantage that the entire pipeline is subsequently filled with stalled execution instructions, limiting its look-ahead execution depth and ultimately forcing it to stall.
[0091] Figure 1B illustrates the technique applied in the same context, in which the processor receives the same sequence of instructions, including both access (A1, A2, etc.) and execution (E1, E2, etc.) instructions. Here, the pipeline shown is the processor's "access" pipeline, and it can be seen that by prioritizing access instructions (processed in the access pipeline shown) and deferring execution instructions (E1-E3, which can be processed by the processor's "execution" pipeline (not shown)), the processor as a whole is able to execute more instructions during the time it spends waiting for the data required by execution instruction E1. In fact, if the access portion of the program encounters another potentially long-latency event (e.g., a cache miss) while executing A6, the cost of that "miss" can be offset, for example, by initiating the data access before the point at which the event would normally occur. It should be noted that the execution instructions are placed in a temporary storage area or cache ("E cache" in the figure), where they are designed to remain passive for several cycles until the data arrives.
[0092] Figure 2 schematically illustrates a data processing apparatus 100 according to some embodiments. A single set of "front-end" circuitry for fetching and decoding an ordered sequence of instructions to be executed by the data processing apparatus is provided, the set including a fetch circuit 101 and a decoding circuit 102. The decoded instructions are passed to an issue circuit 103. According to the present technology, the issue circuit 103 is arranged to identify a marker associated with at least some of the instructions in the received ordered sequence of instructions. Specifically, the issue circuit 103 issues an instruction to an access execution circuit 104 for execution in response to identifying an "access" marker associated with the instruction. Conversely, instructions without an "access" marker are directed to an execution circuit 105 for execution. While there may be two explicit types of markers in some examples, in the illustrated example there is only an access marker. Therefore, instructions with this marker are directed to the access execution circuit 104, and any instructions without an access marker are directed to the execution circuit 105. The access marker is associated with all access-related instructions that determine at least one feature for a load operation to retrieve a data value based on a specified memory address. Figures 8A and 8B and the associated description illustrate the definition of access instructions according to this technology, wherein it can be seen that the access dependency graph includes all instructions leading to the terminal node representing the load instruction.
[0093] The access execution circuit 104 includes an execution section 106, which may be arranged, for example, in a pipelined manner. It should be understood that the schematic illustration of Figure 2 is at a relatively high level of abstraction, in order to provide an overview of the general principles of the construction of the data processing apparatus 100. Note in particular, however, that load operations defined by the load instructions executed by the execution section 106 are delegated to the load unit 107 within the access execution circuit 104. A load operation first accesses the L1 cache 108, which also forms part of the access execution circuit 104, and (if the access misses) may propagate further to the L2 cache 109 (and possibly further into the memory system). Data values returned from the memory system and/or cache hierarchy enter the L1 cache 108, and data values returned from the cache subsystem are placed in the decoupled access buffer 110, which forms part of the access execution circuit 104. These values may also be provided to the registers 111, which the execution section 106 accesses as part of its data processing operations.
[0094] Instructions without an "access" marker are issued by the issue circuit 103 to the execution circuit 105. The received instructions are temporarily held in the instruction cache 112, allowing them to be delayed while the access instructions executed in parallel in the access execution circuit 104 are prioritized. The decoupled access buffer 110 is arranged to send certain signals about its contents to the execution circuit 105. Therefore, when a data item retrieved from memory via a load operation is available in the decoupled access buffer 110, this fact can be signaled to the execution section 113 of the execution circuit 105, which can then use the value when executing a specific instruction. The execution section 113 can also use the values held in the registers 111 and, conversely, its own data processing operations can cause updates to the contents of the registers 111. Given that the data processing of the execution circuit 105 depends on the processing performed by the access execution circuit 104, another feature of the exemplary embodiment shown in Figure 2 is the provision of a low-power state control circuit 114 for the execution circuit 105. When the decoupled access buffer 110 is drained (becomes empty), the low-power state control circuit 114 receives a notification from the decoupled access buffer 110 and, in response, causes the execution circuit 105 to enter an inactive low-power (or lower-frequency) state. Conversely, when the decoupled access buffer 110 has content again, the low-power state control circuit 114 can cause the execution circuit 105 to become active again (i.e., fully powered or operating at a higher frequency than before) and resume instruction execution. 
While the execution circuit 105 could be woken in this way as soon as any content is present in the decoupled access buffer 110, in the example of Figure 2 the decoupled access buffer 110 signals the low-power state control circuit 114 only when its content reaches a predetermined threshold (i.e., a minimum occupancy). This improves the power saving obtained by operating the execution circuit 105 in this manner, since the execution circuit is woken to continue instruction execution only when a sufficient number of data values is available in the decoupled access buffer 110. The specific level of this threshold can be set as an implementation detail according to system requirements.
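The wake and sleep behaviour described for the low-power state control circuit 114 can be pictured with a small software model. The following is only an illustrative sketch, not the circuit itself: the class name, the method names, and the `wake_threshold` parameter are hypothetical, and the decoupled access buffer 110 is reduced to a simple queue.

```python
from collections import deque


class ExecuteCircuitPowerControl:
    """Toy model of low-power control driven by decoupled access buffer occupancy."""

    def __init__(self, wake_threshold: int = 4):
        # wake_threshold is a hypothetical tuning value: minimum buffer fill to wake.
        self.wake_threshold = wake_threshold
        self.dab = deque()    # stands in for the decoupled access buffer 110
        self.active = False   # the execute circuitry starts in the low-power state

    def load_returned(self, value):
        """A load result arrives from the cache hierarchy."""
        self.dab.append(value)
        if not self.active and len(self.dab) >= self.wake_threshold:
            self.active = True    # enough data buffered: wake the execute circuitry

    def consume(self):
        """The execute circuitry takes one value and sleeps once the buffer drains."""
        if not self.active:
            raise RuntimeError("execute circuitry is in the low-power state")
        value = self.dab.popleft()
        if not self.dab:
            self.active = False   # buffer empty: return to the low-power state
        return value


ctrl = ExecuteCircuitPowerControl(wake_threshold=2)
ctrl.load_returned(10)
awake_after_one = ctrl.active    # below the threshold, still asleep
ctrl.load_returned(20)
awake_after_two = ctrl.active    # threshold reached, now awake
first = ctrl.consume()
awake_mid_drain = ctrl.active    # one value left, stays awake
second = ctrl.consume()
awake_when_empty = ctrl.active   # drained, back to low power
```

Waking at a threshold but sleeping only when the buffer is empty gives the model a simple hysteresis, so the execute side is not woken for a single value and then immediately put back to sleep.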
[0095] Figure 3 is a flowchart illustrating a sequence of steps taken according to a method of some embodiments. Specifically, Figure 3 illustrates how instructions within a received instruction sequence are processed according to the present technique. The process begins at step 150, where the next instruction in the received instruction sequence is considered. At step 151, it is determined whether the instruction has a first type (“access”) marker. If it does, the process continues to step 152, where the instruction is issued to the access execution circuitry. Then, at step 153, execution of the instruction is prioritized in the out-of-order instruction execution performed by the access execution circuitry. Then, at step 154, when one or more data values retrieved from the memory system are returned, the value or values are stored in a decoupled access buffer. However, if at step 151 it is found that the instruction does not have a first type marker, the process continues to step 155, where the instruction is issued to the execute execution circuitry. Then, at step 156, as part of the out-of-order instruction execution performed by the execute execution circuitry, the instruction is delayed (e.g., held in an instruction cache or buffer). At step 157, execution of the instruction is initiated once the data value it requires is available in the decoupled access buffer. For example, the presence of the required data value in the decoupled access buffer can be used to trigger execution of the instruction held in the execute execution circuitry. It should be noted that the dashed path from step 154 to step 157 indicates that step 157 depends on an action arising on the other path; it is not a true step in the illustrated procedural flow.
[0096] Figure 4 schematically shows an exemplary data processing apparatus 200 according to some embodiments. A single set of front-end circuitry 201 includes an instruction cache 202, a fetch circuit 203, a decode circuit 204, and a splitter 205. The received ordered sequence of instructions thus arrives at the splitter 205, which is arranged to direct each instruction to either the access circuit 210 or the execution circuit 220 depending on whether the instruction has an associated “access” marker. When an instruction is passed to the access circuit 210, the first stages of this access circuit are shown as a micro-operation cache 211, a renaming circuit 212, an issue queue 213, and a register fetch stage 214. These pipelined components are arranged in a manner familiar to those skilled in the art and are not described in detail herein. Depending on the nature of the instruction, corresponding signals are then passed from the register fetch stage 214 to the integer ALU unit 216, the load unit 218, and/or the branch resolution (BR) unit 219. It should therefore be understood that the access circuit 210 is capable of executing various types of instructions (it is not limited to load instructions) and indeed, as with Figure 2, the specific execution units 216 and 218 shown are merely examples, and other execution units may be provided. The result of branch resolution (as determined in BR unit 219) is passed back to the fetch circuit 203 and the micro-operation cache 211. It should also be noted that the data processing apparatus 200 is further provided with a branch prediction unit 230, which interacts with the contents of the micro-operation cache 211 and indicates to the fetch circuit 203 when it predicts that a branch will be taken and that the corresponding jump in the fetched instructions should be made.
[0097] The integer ALU unit 216 performs its data processing operations on values held in the registers of the access circuit 210, which may have been retrieved from the cache/memory system. The load unit 218 retrieves these values from memory (through load operations). Figure 4 illustrates the interaction of the load unit 218 with the L1 cache 231, which forms part of a cache hierarchy including the L2 cache 232. Additional cache levels leading to a memory system (not shown) may also be provided. Values returned from the cache/memory system are stored in a decoupled access (DA) buffer 234. With regard to the aforementioned data processing operations of the integer ALU unit 216, the results are also fed into an access result cache 236, to which the integer ALU unit 216 has access as part of performing its data processing operations. Modifications made to these values are passed to a commit queue 238 before being applied to the registers 240 of the data processing apparatus 200.
[0098] Instructions in the received ordered instruction sequence that have no access marker are passed from the splitter 205 to the execution circuit 220. Specifically, these instructions are first received in the X-scheduling cache 250, where they are held so that their execution is delayed relative to the access instructions prioritized in the access circuit 210. Execution instructions can be held in the X-scheduling cache 250 in a compact pre-execution form (i.e., not fully unpacked and expanded as they will be when eventually executed), allowing the cache 250 to be compactly provided. The execution circuit 220 includes a reservation station 252 that enables the execution circuit to manage its own out-of-order instruction execution and, in particular, to track instruction dependencies and operand availability. The execution circuit 220 also includes two integer ALU units 253 and two floating-point units (FPUs) 254, as well as two store units 255. When executing its instructions, the execution circuit 220 is therefore arranged such that the values required by the ALUs 253 and FPUs 254 are received from the reservation station 252 and the results of the data processing performed by these units are passed back to the reservation station 252. The execution circuit 220 also includes a branch resolution (BR) unit 258 which, like the BR unit 219 of the access circuit 210, signals to the fetch circuit 203 of the front-end circuitry 201.
[0099] The reservation station 252 passes result values to the commit queue 238 of the access circuit 210 in order to update register values. Data values to be written to memory are passed from the reservation station 252 to the store units 255. Store transactions initiated by the store units 255 are temporarily buffered in the store buffer 256 of the access circuit 210. This allows write data to be buffered until the store is "committed". It also provides a window of opportunity to identify cases where the address of a store transaction matches the address of a load that has already brought a value into the decoupled access buffer 234. Identifying such updates, which could lead to data hazards (i.e., conflicts between newer loads and older stores), allows remedial action to be taken; this feature is discussed in more detail below with reference to Figure 5.
[0100] Figure 5 schematically illustrates a data processing apparatus 300 in some exemplary embodiments. The data processing apparatus 300 includes a front-end circuit 301, which itself includes a fetch circuit 302, a decoding circuit 303, and an issue circuit 304. In the manner discussed above, the issue circuit 304 identifies certain instructions (those having associated "access" markers) and issues them to the access circuit 305, while issuing other instructions to the execution circuit 306. Figure 5 is based on the example of Figure 4, but not all components are shown, purely for the sake of clarity. Instructions received by the access circuit 305 enter its execution pipeline 307, which performs various data processing operations, some of which involve interaction with the registers 308, and some of which cause the load unit 309 to initiate load transactions with the cache/memory system, of which Figure 5 shows only the L1 cache 310. Data values returned from the cache/memory hierarchy are passed to the decoupled access buffer 311, and thereafter some of these values can cause updates to the values held in the registers 308 (e.g., via stages such as the commit queue; see Figure 4). The particular focus of Figure 5 is the provision of the conflict detection unit 312, which is described in more detail below. The execution circuit 306 includes an execution cache 313, a reservation station 314, ALU/FPU units 315, and a store unit 316. The execution circuit 306 operates in substantially the same manner as described above for the execution circuit 220 of Figure 4, and that description is not repeated here for the sake of brevity. 
The conflict detection unit 312 also interacts with the store unit 316 and is specifically arranged to identify situations that can lead to data hazards, namely conflicts between newer load instructions, prioritized in the execution performed by the access circuit 305, and older store instructions executed by the execution circuit 306. A data hazard can occur when such a load and store involve the same memory address: in program order the store should be performed before the load, but this order has been disrupted by the prioritization of load operations according to this technique. Furthermore, because the store and load are performed in separate, largely independent execution units (i.e., the execution circuit 306 and the access circuit 305), the mechanisms that would typically prevent such data hazards within a single out-of-order processing unit may be insufficient.
[0101] The specific manner in which the conflict detection unit 312 operates is described in more detail below with reference to Figures 6 and 7. When such a data hazard is identified, however, the conflict detection unit is arranged to signal it to various parts of the data processing apparatus 300, including the access execution pipeline 307 and the execution cache 313 / reservation station 314, because it will be necessary to flush various instructions from the corresponding pipelines and to re-execute some instructions. In some embodiments, the data processing apparatus 300 is arranged such that only the load instruction and any subsequent instructions are flushed, but in simpler embodiments, for ease of implementation, a full flush of the corresponding pipelines is triggered when such a data hazard condition is identified.
[0102] Figure 6 schematically illustrates the operation of the conflict detection circuit 312 with respect to the contents of the store unit 316 and the decoupled access buffer 311. The store unit 316 holds entries for store transactions that are still "in flight," i.e., not yet committed. Various information can be associated with each entry; relevant to this discussion (and as shown in the example of Figure 6), this information includes the value to be stored, the address at which the value will be stored, and an instruction identifier (provided here by the reorder buffer (ROB) ID). The decoupled access buffer 311 likewise holds various information for each of its entries; in the example of Figure 6, this is shown as the address of the retrieved value, the value itself, and the instruction identifier associated with the load operation, which in the example of Figure 6 is also the ROB ID. As is known to those skilled in the art, out-of-order execution pipelines, such as those provided by the access circuit 210 and execution circuit 220 in the example of Figure 4, use such reorder buffers and ROB IDs to keep track of the program order of the instructions they execute, so that even though execution is out of order, the effects of the instructions can be correctly ordered when the results are committed. The conflict detection circuit 312 is arranged to monitor the respective contents of the store unit 316 and the decoupled access buffer 311 and to identify instances of address matches between entries. This can be performed, for example, by cyclically checking the entries in one of them (e.g., DAB 311), taking the address of each entry in turn and checking for a matching entry in the other (e.g., store unit 316). 
When a pair of entries with matching addresses is found, the relative order of the corresponding instructions is determined by reference to their ROB IDs, and when a data hazard (“conflict”) is thereby identified, the conflict detection circuit causes a flush. This flush can be a full pipeline flush or a partial pipeline flush.
[0103] Figure 7 illustrates a sequence of steps according to which the conflict detection circuit 312 can operate. At step 400, the next entry in the decoupled access buffer is checked. At step 401, it is determined whether an entry with the same address exists in the store unit of the execution circuitry. If not, the flow returns to step 400 to check the next entry in the decoupled access buffer. However, if a matching address exists, the flow continues to step 402, where it is determined whether the store unit entry precedes (in program order) the load that brought the value into the decoupled access buffer (DAB). If this is not the case, the flow returns to step 400 to check the next entry in the decoupled access buffer. However, if this is the case, a conflict condition has been identified, and at step 403 the load instruction itself is squashed and any subsequent instructions in the access and execution circuitry are flushed, to avoid the erroneous side effects of the load and store operations having been performed in the wrong order. In other embodiments, the execution pipeline is flushed selectively, such that only instructions that directly or indirectly depend on the squashed load instruction are also squashed, while the rest of the pipeline remains unchanged. The flow then returns to step 400.
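The check performed in steps 400 to 403 can be sketched in software as a scan of the decoupled access buffer against the in-flight stores. This is a minimal illustration under stated assumptions rather than the circuit's actual design: the entry classes and `find_conflicts` are hypothetical names, and program order is assumed to be given by comparing ROB IDs directly (a smaller ID is older), ignoring wrap-around of the reorder buffer.

```python
from dataclasses import dataclass


@dataclass
class StoreEntry:   # in-flight (uncommitted) store held by the store unit
    rob_id: int
    address: int
    value: int


@dataclass
class DabEntry:     # value already brought into the decoupled access buffer
    rob_id: int
    address: int
    value: int


def find_conflicts(dab, store_unit):
    """Return ROB IDs of loads that ran ahead of an older store to the same address."""
    conflicts = []
    for load in dab:                            # step 400: next DAB entry
        for store in store_unit:                # step 401: same address in store unit?
            if store.address != load.address:
                continue
            if store.rob_id < load.rob_id:      # step 402: store is older (assumed ordering)
                conflicts.append(load.rob_id)   # step 403: this load must be squashed
                break
    return conflicts


dab = [DabEntry(rob_id=12, address=0x100, value=7),
       DabEntry(rob_id=5, address=0x200, value=9)]
stores = [StoreEntry(rob_id=8, address=0x100, value=1),   # older than load 12: hazard
          StoreEntry(rob_id=9, address=0x200, value=2)]   # newer than load 5: no hazard
conflicts = find_conflicts(dab, stores)
print(conflicts)   # [12]
```

The function only reports which loads are affected; whether the response is a full or a partial flush is left to the caller, mirroring the two embodiments described above.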
[0104] The following is an exemplary sequence of instructions that the data processing device can receive and execute:
[0105] I1: [E] SUB x10, x11, x10
[0106] I2: [A] ADD x1, x2, x3
[0107] I3: [A] SUB x4, x1, x2
[0108] I4: [E] CLZ x13, x2
[0109] I5: [A] LSL x4, x4, #1
[0110] I6: [E] MADD x14, x10, x11, x13
[0111] I7: [A] ADD x8, x3, x9
[0112] I8: [E] CMP x14, #39
[0113] I9: [A] EOR x5, x4, x6, LSR #5
[0114] I10: [A] LDR d0, [x5, x8, LSL #3]
[0115] I11: [E] FMSUB d1, d2, d0, d3
[0116] I12: [E] FCSEL d2, d1, d5, GT
[0117] I13: [E] STR d2, [x12], #4
[0118] Given the instruction sequence shown above, and taking load instructions (LDR) to be the "instructions of a predetermined type," the marking of the instructions depends on an analysis of the data dependencies between them. These dependencies are illustrated graphically in Figures 8A and 8B. Here, any instruction that provides a value for a source operand of a load instruction is considered an "access" instruction, and any instruction that provides a value for a source operand of an "access" instruction is itself considered an access instruction. Instructions not marked as access instructions are considered execution instructions, because they are found not to be part of the access data dependency graph. Therefore, as shown in Figure 8A, instructions I10, I7, I9, I5, I3, and I2 in the access data dependency graph are given the access marker "A". The remaining instructions (which do not directly or indirectly feed a load) are marked as execution (E): I13, I12, I11, I8, I6, I4, and I1. The present technique relates to identifying such data dependency graphs in the sequences of instructions received by a data processing device. Specifically, as described in more detail below with reference to the following figures, it provides an apparatus and method that allow a data processing device to elaborate such access data dependency graphs and to mark their constituent instructions with "access" markers online, i.e., on the fly as the data processing device receives and executes the instructions.
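The definition of an access instruction given above can be checked against the example sequence with a short offline pass. The sketch below is an illustration of the definition only, not the online hardware mechanism described later: it walks the sequence backwards, keeping the set of registers whose producers are still being sought. The split of each instruction into destination and source registers is an assumed reading of the listing (with `nzcv` standing for the condition flags), and `tag_access` is a hypothetical helper name.

```python
# (identifier, mnemonic, destination registers, source registers) for I1..I13.
# The operand split is an assumed reading of the listing; "nzcv" is the flags register.
PROGRAM = [
    ("I1", "SUB", ["x10"], ["x11", "x10"]),
    ("I2", "ADD", ["x1"], ["x2", "x3"]),
    ("I3", "SUB", ["x4"], ["x1", "x2"]),
    ("I4", "CLZ", ["x13"], ["x2"]),
    ("I5", "LSL", ["x4"], ["x4"]),
    ("I6", "MADD", ["x14"], ["x10", "x11", "x13"]),
    ("I7", "ADD", ["x8"], ["x3", "x9"]),
    ("I8", "CMP", ["nzcv"], ["x14"]),
    ("I9", "EOR", ["x5"], ["x4", "x6"]),
    ("I10", "LDR", ["d0"], ["x5", "x8"]),
    ("I11", "FMSUB", ["d1"], ["d2", "d0", "d3"]),
    ("I12", "FCSEL", ["d2"], ["d1", "d5", "nzcv"]),
    ("I13", "STR", ["x12"], ["d2", "x12"]),   # post-index store also writes x12
]


def tag_access(program, load_mnemonics=("LDR",)):
    """Return the identifiers of instructions on a dependency path to a load."""
    needed = set()   # registers whose most recent earlier writer feeds a load
    access = set()
    for ident, op, dsts, srcs in reversed(program):
        if op in load_mnemonics or needed.intersection(dsts):
            access.add(ident)
            needed.difference_update(dsts)   # this instruction is the producer sought
            needed.update(srcs)              # its own inputs now need producers
    return access


access = tag_access(PROGRAM)
tags = {ident: ("A" if ident in access else "E") for ident, _, _, _ in PROGRAM}
print(sorted(access, key=lambda ident: int(ident[1:])))
# ['I2', 'I3', 'I5', 'I7', 'I9', 'I10']
```

Removing the destination registers before adding the sources matters for instructions such as I5, which reads and writes x4: the search for the producer of x4 must continue past I5 to I3.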
[0119] Figure 9 schematically illustrates a data processing apparatus according to some embodiments. As shown, the data processing apparatus 500 includes a fetch/decode stage 501 that fetches and decodes instructions in a sequence of instructions to be executed by the data processing apparatus. These decoded instructions are stored in an instruction store 502, from which a remapper circuit 503 accesses them and performs any required remapping of the registers they specify. After the remapper stage, instructions are passed to either an issue circuit 504 or an issue circuit 505. The issue circuit 504 issues instructions for execution by an execution circuit 506, while the issue circuit 505 issues instructions for execution by an execution circuit 507. These parallel pipelines converge in a final commit stage 508. A register writer storage 510, accessed by the remapper circuit 503, is also provided, in which entries are created by the remapper circuit 503. An entry 511 in the register writer storage 510 includes an instruction indicator and a register indicator. Specifically, for each instruction encountered by the remapper circuit 503, the remapper circuit creates an entry in the register writer storage 510 that indicates the instruction and its destination register (i.e., the register whose contents are written by the instruction). It should be noted that, for an out-of-order processor, the registers referred to in entries 511 of the register writer storage 510 are physical registers (in which case the remapper 503 is a renaming stage arranged to manage the mapping between the architectural registers named in the instructions and the physical registers of the data processing apparatus). Conversely, for an in-order processor, the registers referred to in entries 511 of the register writer storage 510 can be architectural registers (i.e., as named in the instructions). 
It should be noted that some instructions may have more than one destination register, and therefore multiple entries 511 may be created in the register writer storage 510 in response to a single instruction.
[0120] The data processing apparatus 500 of Figure 9 also includes an instruction tagging queue 512, an instruction tagging circuit 513, and an instruction tag storage 514. The instruction tag storage 514 is provided in association with the instruction store 502, such that instructions in the instruction store 502 can be associated (or not) with tags stored in the instruction tag storage 514. Indeed, in some embodiments, the instruction store 502 and the instruction tag storage 514 can be combined into a single storage unit, in which tags are stored in direct association with instructions. In other embodiments, however, the instruction store 502 is absent, and the instruction tag storage 514 operates by receiving tags generated by the instruction tagging circuit 513 and providing these tags directly to the front end of the processor (Figure 15B, described below, schematically illustrates this type of implementation). The instruction tagging circuit 513 operates by taking the identifier of the next instruction queued in the instruction tagging queue 512 and writing to the instruction tag storage 514 to indicate that the instruction is "tagged". For example, where instructions are to be classified as "access" or "execute" as described above, the tag indicates that the instruction is a defined "access" instruction (while untagged instructions are interpreted as "execute" instructions).
[0121] In addition to storing tags in the instruction tag storage 514, the instruction tagging circuit 513 also determines whether an instruction has any producer instructions. Producer instructions are those instructions that generate at least one source operand of the instruction. Therefore, based on the source register or registers specified for the current instruction, the instruction tagging circuit 513 refers to the register writer storage 510 to determine whether it holds any entries indicating that register or those registers. When it does, the corresponding instruction identifier from each such entry in the register writer storage 510 is added to the instruction tagging queue 512. In this way, a data dependency chain or data dependency graph leading to an instruction of a predetermined type (in this exemplary embodiment, a load instruction) can be identified, and each instruction in that chain or graph can be tagged. Note also the path from the remapper 503 to the instruction tagging queue 512. This is used to initiate the process, by inserting the identifier of any load instruction encountered into the instruction tagging queue. The instruction tagging circuit 513 therefore receives instruction identifiers from the instruction tagging queue 512 that were either written into the queue in a previous iteration (in which the instruction tagging circuit 513 identified one or more producer instructions in the register writer storage 510 and caused them to be added to the queue) or inserted into the queue by the remapper 503 when a load instruction was encountered.
[0122] Figure 10 is a flowchart illustrating a sequence of steps taken according to a method of some embodiments, which specifically describes the operation of a component such as the remapper 503 in the data processing apparatus 500 of the Figure 9 example. The process can be considered to begin at step 550, where, for the next instruction encountered in the sequence of instructions being executed by the data processing apparatus, it is determined whether the instruction writes to a destination register. If not, the process loops back on itself to consider the next instruction in the sequence. However, if it does, the process continues to step 551, where an entry associating the destination register with the instruction is created in the register writer storage. Next, at step 552, it is determined whether the instruction is of the predetermined type, for example whether it is a load instruction. If not, the process returns to step 550. If it is, the process continues to step 553, where the instruction (i.e., its identifier) is added to the instruction tagging queue. The process then returns to step 550.
[0123] Figure 11 is a flowchart illustrating a sequence of steps taken according to some embodiments, specifically the steps performed to tag instructions, such as can be performed by the instruction tagging circuit 513 of the data processing apparatus 500 of Figure 9. The process can be considered to begin at step 600, in which the next instruction is received from the instruction tagging queue. Then, at step 601, an entry is created in the instruction tag storage, thereby "tagging" the instruction; the association between the instruction and its tag forms the entry in the instruction tag storage. Then, at step 602, it is determined whether the instruction has one or more producer instructions, i.e., whether at least one source operand of the instruction is given by the contents of a register that has been written by another instruction. With reference to Figure 9 above, this can be determined, for example, by reference to the register writer storage 510 and the entries stored therein. If the instruction does not have any producer instructions, or if a producer is not available in the instruction store 502, or if a producer instruction has already been tagged, the process returns to step 600 to process the next instruction in the instruction tagging queue. However, when one or more producer instructions are identified, these producer instructions are added to the instruction tagging queue at step 603, and the process then returns to step 600.
[0124] Figure 12 schematically shows a data processing apparatus 700 according to some embodiments. The fetch circuit 701 and decode circuit 702 operate to retrieve from memory the sequence of instructions to be executed and to decode them. The decoded instructions, which may be further subdivided into micro-operations, populate a micro-operation cache 703. The next pipeline stage is a renaming circuit 704. The data processing apparatus 700 is arranged to perform out-of-order instruction execution, and therefore renames architectural registers to physical registers to support this. Thereafter, depending on whether a given instruction is marked "A" (i.e., an access instruction) or "E" (i.e., an execute instruction), it is passed to one of the two execution pipelines shown. The "execute" pipeline is schematically represented in Figure 12 by an issue stage 705, a register read stage 706, an execution stage 707, and a completion stage 708. The final commit stage 709 is shared with the other pipeline. The other, "access" pipeline is schematically represented in Figure 12 by an issue stage 710, a register read stage 711, an execution stage 712, a memory access stage 713, and a completion stage 714. It should be noted that the access pipeline thus also has a memory access stage 713 in parallel with the execution stage 712. Access instructions marked "A" and processed by the access pipeline take precedence over the instructions executed by the execute pipeline. The data processing apparatus 700 can therefore be a decoupled access-execute processor of the type described above with reference to Figures 1 to 7. The self-tagging capability thus allows this decoupled access-execute processor to receive an untagged instruction stream and to add the tags on the fly.
[0125] Figure 12 also shows two storage units, both of which are accessible to the renaming circuit 704. The first is the register writer storage 720, and the second is the associated instruction storage 725. For each instruction processed by the renaming stage 704, if the instruction generates a result value to be stored in a register, a physical register is assigned as the destination register, and a new mapping is made between the architectural register (specified in the instruction) and the physical register. The renaming stage 704 also records the identifier of the instruction responsible for writing to that physical register in an entry 721 of the register writer storage 720. Some instructions may have more than one destination register, and therefore multiple mappings can be generated in the renaming stage 704; when a single instruction is responsible for writing to several physical registers, several corresponding entries are generated in the register writer storage 720. When source operand registers are renamed, the renaming stage 704 queries the register writer storage 720 in order to create content for the associated instruction storage 725. Identifying the instructions that write to the source operand registers of the current instruction enables the renaming stage 704 to associate these "producer" instructions with the current instruction. The information obtained from the register writer storage thus reveals the one or more "producer" instructions that write to the source operand registers of the current instruction, and each entry 726 in the associated instruction storage 725 consequently provides a list of the other instructions that produce at least one data value consumed by the current instruction.
[0126] The data processing apparatus 700 also includes an instruction tagging queue 730, which is preceded by a write buffer 731. Providing the write buffer 731 accommodates potential differences in operating speed between the renaming stage 704, the instruction taggers 732, and the instruction tagging queue 730. When the renaming stage 704 encounters an instruction of the predetermined type (in this example, a load instruction), it inserts an identifier for that load instruction into the write buffer 731. This is the mechanism that initiates the elaboration of a data dependency graph, since the load instruction (in this example) is the terminal node of the graph. The instruction taggers 732 receive instruction identifiers from the instruction tagging queue 730. In the example shown, four parallel instruction taggers are provided, each receiving instruction identifiers from the instruction tagging queue 730. For each instruction identifier retrieved from the instruction tagging queue 730 by one of the instruction taggers 732, an indication is written into the access/execute (A/E) tag cache 733; in this exemplary embodiment, the location in the cache corresponds to the instruction identifier, and a bit is written to indicate that the instruction is tagged as an access instruction. The instruction tagger also uses the current instruction identifier to look up the associated instruction storage 725 and, when a corresponding entry is found, reads the one or more instruction identifiers designated as producers in that entry. The identifiers of these producer instructions are sent to the instruction tagging queue 730, via the write buffer 731, for processing.
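The interplay of the register writer storage 720, the associated instruction storage 725, and the instruction tagging queue 730 can be followed in a simplified software model applied to the example sequence I1 to I13. This is a sketch under several assumptions, not the apparatus itself: register renaming is omitted (architectural names are used, which is sufficient here because producers are captured as each instruction passes the renaming stage), the write buffer 731 and the four parallel taggers are collapsed into a single queue that is drained immediately, and `tag_online` and the operand split in `PROGRAM` are hypothetical.

```python
from collections import deque

# (identifier, mnemonic, destination registers, source registers); operand split assumed.
PROGRAM = [
    ("I1", "SUB", ["x10"], ["x11", "x10"]),
    ("I2", "ADD", ["x1"], ["x2", "x3"]),
    ("I3", "SUB", ["x4"], ["x1", "x2"]),
    ("I4", "CLZ", ["x13"], ["x2"]),
    ("I5", "LSL", ["x4"], ["x4"]),
    ("I6", "MADD", ["x14"], ["x10", "x11", "x13"]),
    ("I7", "ADD", ["x8"], ["x3", "x9"]),
    ("I8", "CMP", ["nzcv"], ["x14"]),
    ("I9", "EOR", ["x5"], ["x4", "x6"]),
    ("I10", "LDR", ["d0"], ["x5", "x8"]),
    ("I11", "FMSUB", ["d1"], ["d2", "d0", "d3"]),
    ("I12", "FCSEL", ["d2"], ["d1", "d5", "nzcv"]),
    ("I13", "STR", ["x12"], ["d2", "x12"]),
]


def tag_online(program, load_mnemonics=("LDR",)):
    last_writer = {}   # register -> instruction that last wrote it (cf. storage 720)
    traversal = {}     # instruction -> its producer instructions (cf. storage 725)
    queue = deque()    # instruction tagging queue (cf. queue 730)
    tagged = set()     # instructions tagged as access (cf. A/E tag cache 733)

    for ident, op, dsts, srcs in program:
        # Renaming stage: look up the writers of the sources before recording
        # this instruction as the new writer of its destinations.
        producers = [last_writer[reg] for reg in srcs if reg in last_writer]
        if producers:
            traversal[ident] = producers
        for reg in dsts:
            last_writer[reg] = ident
        if op in load_mnemonics:
            queue.append(ident)   # a load seeds elaboration of the dependency graph

        # Tagger: tag each queued instruction, then queue its untagged producers.
        while queue:
            current = queue.popleft()
            if current in tagged:
                continue
            tagged.add(current)
            queue.extend(p for p in traversal.get(current, []) if p not in tagged)

    return tagged, traversal


tagged, traversal = tag_online(PROGRAM)
```

In this model the traversal entries for I3, I5, I9, and I10 name I2, I3, I5, and the pair I7 and I9 respectively as producers, and the tagged set comes out as I2, I3, I5, I7, I9, and I10. Looking up the source registers before updating the last writer is what lets I5, which both reads and writes x4, record I3 rather than itself as its producer.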
[0127] Figure 13A shows example content of an associated instruction storage, such as the associated instruction storage 725 in the example of Figure 12, a structure referred to here as a "traversal table". The content of the traversal table corresponds to the exemplary instruction sequence described above, whose data dependency graph is shown in Figure 8A and Figure 8B. Thus (compare Figure 8A): I2 is listed as a producer instruction for instruction I3; I3 is a producer instruction for instruction I5; I5 is a producer instruction for instruction I9; and instructions I7 and I9 are producer instructions for instruction I10. It should be noted that instruction I10 is a load instruction and is therefore the terminal node of the data dependency graph.
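The backward walk from a load through the traversal table can be sketched as a simple worklist algorithm, using the table content given above. The function name and the use of a Python set in place of the A / E tag cache are assumptions of this sketch, which models only the ordering of the look-ups and not the four parallel taggers or the write buffer.

```python
from collections import deque

# Traversal table content as described for Figure 13A: consumer -> producers.
TRAVERSAL_TABLE = {
    "I3": ["I2"],
    "I5": ["I3"],
    "I9": ["I5"],
    "I10": ["I7", "I9"],
}


def tag_access_instructions(load_ids, traversal_table):
    """Tag each load and, transitively, every producer feeding it."""
    tagged = set()           # stands in for the A / E tag cache
    queue = deque(load_ids)  # stands in for the instruction tagging queue
    while queue:
        instr = queue.popleft()
        if instr in tagged:
            continue  # already tagged: no need to walk its producers again
        tagged.add(instr)
        # Producers of the current instruction are queued for tagging in turn.
        queue.extend(traversal_table.get(instr, []))
    return tagged


access = tag_access_instructions(["I10"], TRAVERSAL_TABLE)
```

Starting from the load I10, the walk tags I10, I7, I9, I5, I3 and I2 as access instructions.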
[0128] Figure 13B shows example content of a register writer storage, such as the register writer storage 720 of Figure 12, referred to in this figure as the "last writer table". It should be understood that the specific physical registers to which the architectural registers specified in the instructions are mapped depend on the specific configuration of the renaming stage and on the availability of physical registers when those instructions are encountered. Figure 13B is therefore merely a snapshot of specific, exemplary content of the last writer table. As can be seen from Figure 13B, at the sampling point shown, instruction I5 is the "last writer" for physical register 26, and instruction I9 is the last writer for physical register 28. Physical registers 25 and 27 may currently be mapped to architectural registers, but there are currently no valid "last writer" instructions for them, so they are marked with "-".
[0129] Figure 14A and Figure 14B show exemplary configurations of the instruction tag storage, together with some exemplary content for each. In the exemplary implementation of Figure 14A, the instruction tag storage is arranged to store entries associating an instruction identifier with a tag and an "unprocessed" indicator. Therefore, for any given instruction, it can be determined whether a corresponding entry exists in the instruction tag storage and, specifically, whether the instruction has been tagged. The unprocessed flag is used to prevent certain instructions from being placed in the instruction tagging queue. For example, in the implementation of Figure 12, this prevents the renaming stage 704 from placing into the write buffer 731 loads that have already been processed by the tagging unit (and which, being so tagged, do not need to trigger a new data dependency graph elaboration). Additionally, it should be noted that the entry for I20 in Figure 14A has an unprocessed flag instead of an "A" tag. The unprocessed flag can be stored in association with instructions that are known a priori not to be access instructions, such as branch instructions and stores without register write-back. It should be noted that if the "access" tag is set, it is not actually necessary to set the "unprocessed" flag explicitly, because setting the access tag also prevents the instruction from being added to the instruction tagging queue.
[0130] Figure 14B shows an alternative embodiment of the instruction tag storage, referred to in this example as an A / E cache, which can correspond to the A / E tag cache 733 in the example of Figure 12. This is a particularly compact structure that needs to store only a limited amount of information, because instruction identifiers are mapped to specific cache locations, and a bit is stored at each such location to indicate that the instruction mapped to that location is tagged as an access instruction. An "unprocessed" bit can be indicated in a similar way, stored at the location to which the instruction in question is mapped. It should be noted that some instructions are marked as unprocessed rather than "access", such as branch instructions and stores without register write-back, as in the example above.
[0131] Figure 15A and Figure 15B schematically illustrate configurations that allow tagging information to be removed from the instruction tag storage but retained for future use. Figure 15A illustrates an example in which the decoding stage 800 passes decoded instructions to the micro-operations cache 801, where the cache entries themselves are provided with tagging information, and where this tagging information is access / execution tagging information (e.g., a bit indicating whether the instruction is an access instruction). The remapper stage 802 retrieves the instructions from this cache. This configuration also includes two levels of instruction cache hierarchy 803 and 804, which are likewise arranged to store instruction information with associated tagging information. Therefore, if an instruction is evicted from the micro-operations cache, its associated tagging information can be sent to these further levels of the instruction cache hierarchy, meaning that this information can later be loaded back into the micro-operations cache without having to perform the tagging process (and the data dependency graph elaboration) again.
[0132] Figure 15B schematically illustrates the front-end circuitry 810 preceding the processor's execution pipeline 811. Specifically, this is an implementation in which no micro-operations cache exists, but an associated instruction tag cache 812 (which receives tag information from tagging circuitry such as that described above) provides this tag information directly to the processor's front-end circuitry 810, which associates it with the instructions passing through. If the instruction tag cache 812 becomes full and entries are evicted, these entries can be sent to further levels of instruction cache 814 and 815. When the same instruction is encountered again, the tag information can be carried into the processor along with the instruction, so it is not necessary to regenerate the tag information (and repeat the data dependency graph elaboration).
[0133] Figure 16 schematically illustrates the apparatus 10 in some embodiments. Apparatus 10 includes various components related to the processing of load requests in a data processing apparatus, which may be, for example, one of the data processing apparatuses described above with reference to Figure 2, Figure 4 and / or Figure 5. As can be seen in Figure 16, only those components related to the processing of load requests are shown; those skilled in the art will be familiar with the general context of such an apparatus and can incorporate such load processing circuitry into the examples of Figure 2, Figure 4 and / or Figure 5. Figure 16 shows an incoming load request received by address generation circuitry 118, which is configured to generate the address required for the corresponding load request. It should be noted that the address generation circuitry can be a dedicated arrangement of circuitry for generating addresses, or it can be provided by a conventional arithmetic logic unit (ALU) capable of performing the integer arithmetic involved in address generation. In the latter case, in one embodiment, the decoder can divide the load instruction into two separate micro-operations: one micro-operation performs the address calculation derived from the specific addressing mode used, and the other performs the actual access specified by the "load" instruction. In this case, once the address generation micro-operation has been performed, the "resolved" address of the "load" instruction is written to the pending load buffer (PLB) shown in Figure 16. It should also be noted that requests may arrive at this buffer out of sequence. 
Load requests arrive at the pending load buffer (PLB) circuit 120, which in this example is arranged as a FIFO buffer. In the illustration of Figure 16, individual pending load requests can therefore be considered to enter at the top of PLB 120 and to progress through the positions shown, eventually exiting and being passed to coalescing circuit 130. Of course, physical movement of entries in the FIFO typically does not occur; instead, this positional progression is handled by reference to the identifier of each entry. In one of its functions, coalescing circuit 130 forwards load requests from PLB 120 to load processing circuit 140, allowing these load requests to be executed and the corresponding data items that are their subject to be retrieved from the memory system. Figure 16 explicitly shows only the L1 cache 160 of the memory system. However, the coalescing circuit 130 also plays another role in the system, namely determining, on the basis of the pending loads held in PLB 120, whether at least two pending load requests relate to memory addresses that are sufficiently close to each other that load processing efficiency can be improved by coalescing them into a single load request. This sufficient proximity of the corresponding memory addresses is referred to herein as existing when the "address proximity condition" is met. Although this address proximity condition may be defined differently depending on the specific implementation of this technology, in the example of Figure 16 it is defined with reference to L1 cache 160, specifically with reference to its cache line size. In other words, coalescing circuit 130 examines the memory addresses of the pending loads buffered in PLB 120 and determines whether at least two pending load requests relate to the same cache line. 
If so, those at least two pending load requests are coalesced by coalescing circuit 130. Part of this action of coalescing circuit 130 includes suppressing the forwarding of all but one of the at least two pending load requests found to satisfy the "address proximity condition" defined by the cache line size. The feedback path from coalescing circuit 130 to PLB 120 is illustrated schematically. Furthermore, coalescing circuit 130 also generates a corresponding signal that is transmitted to de-coalescing circuit 150. When load processing circuit 140 has triggered a retrieval of data from the memory system (e.g., from L1 cache 160), the data is passed to de-coalescing circuit 150. By receiving the signal from the coalescing circuit 130, the de-coalescing circuit 150 knows that not only should the data item specified by the load request executed by the load processing circuit 140 be passed on as a requested data item, but the data item specified by at least one other pending load request (whose forwarding to the load processing circuit 140 was suppressed by the coalescing circuit 130) should also be extracted and passed on as at least one other requested data item. For example, where a cache line's worth of data is returned from the L1 cache 160, the de-coalescing circuit 150 extracts the multiple data items to be returned from that cache line.
[0134] Figure 17 schematically shows the apparatus 10 of Figure 16 with worked examples of inputs, processing, and outputs superimposed. Address generation circuitry 118 is shown receiving a load request that identifies the architectural register R20 as holding the address from which the load should occur. Address generation circuitry 118 then determines that architectural register R20 (currently) corresponds to physical address "21" and adds a pending load with that address information to PLB 120. PLB 120 is a FIFO buffer structure, so the newest pending load request is added at the first entry (at the top in this figure). It should be noted that PLB 120 is typically fully populated, in that an entry exists at every possible storage location, but a different state can be maintained for each individual entry, as will be discussed in more detail below. Additionally, it should be noted that Figure 17 explicitly shows only a subset of the entries in PLB 120, namely those relevant to this discussion. The snapshot of the contents of PLB 120 in Figure 17 therefore explicitly shows four entries, corresponding to memory address locations 21, 8, 5, and 3 respectively. Additional information or metadata, such as the data access type, format, and access size corresponding to each entry, can also be stored in the pending load buffer. If needed, this metadata can be shared with the de-coalescing circuitry so that it can extract the relevant data items from the data returned from memory.
[0135] The coalescing circuit 130 monitors the contents of PLB 120 and determines which requests will be forwarded to the load processing circuit 140. As the contents of PLB 120 progress, the pending load request accessing address 3 becomes the oldest valid pending load request in PLB 120, and the coalescing circuit 130 forwards this request to the load processing circuit 140 and marks the entry's status indicator as "in flight" (IF). The "in flight" status means that the entry in PLB 120 for this pending load request is typically held in PLB 120 until the load has been processed and the requested data returned, at which point the entry can be marked as invalid. However, other states of entries in PLB 120 are also used to support this technique. The coalescing circuit 130 monitors and compares the memory addresses that are the subject of the pending load requests held in PLB 120, specifically identifying multiple entries in PLB 120 that relate to memory addresses close enough to allow "coalescing" of these load requests. In the example of Figure 17, coalescing circuitry 130 is arranged to determine whether multiple pending load requests in PLB 120 relate to memory addresses within the cache line size used in the memory system, and specifically in L1 cache 160. In the exemplary snapshot shown in Figure 17, coalescing circuit 130 determines that two further pending load requests in PLB 120 (i.e., those accessing memory addresses 5 and 8) satisfy the address proximity condition, because the data items retrieved from memory addresses 3, 5, and 8 lie in the same cache line. Therefore, coalescing circuit 130 marks the pending load requests related to memory addresses 5 and 8 as "invalid" and sends an indication to de-coalescing circuit 150 that these three pending load requests have been grouped together in this manner.
[0136] After the pending load request related to memory address 3 has been forwarded, load processing circuit 140 accesses the memory system (including L1 data cache 160) to perform the required load. The cache line returned from L1 data cache 160 includes multiple data items, including those referenced by memory addresses 3, 5, and 8. The data corresponding to the cache line is passed to (or at least made accessible to) de-coalescing circuit 150. In the absence of a signal from coalescing circuit 130, de-coalescing circuit 150 would extract only the data item corresponding to memory address 3; however, since de-coalescing circuit 150 has received an indication from coalescing circuit 130 that the pending load requests related to memory addresses 3, 5, and 8 have been coalesced, de-coalescing circuit 150 extracts the data items corresponding to all three memory addresses from the returned cache line data. Once de-coalescing circuit 150 has received the required data, coalescing circuit 130, in response, causes the entry corresponding to the pending load request for memory address 3 to be marked as invalid. Therefore, when that entry reaches the head of PLB 120, it is deleted (or at least allowed to be overwritten). Similarly, when the entries corresponding to addresses 5 and 8 reach the head of PLB 120, they are also deleted (or at least allowed to be overwritten). It should be noted that if the handling of the coalesced load request is interrupted, the corresponding entries can be restored: the entry corresponding to memory address 3 is changed from in flight to valid, and the entries corresponding to memory addresses 5 and 8 are changed from invalid to valid.
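The Figure 17 walk-through can be mirrored by a small behavioural model. This is only a sketch under stated assumptions: the cache line size (16 bytes), the treatment of each data item as a single byte, the dummy memory contents, and all function and class names are hypothetical and are not taken from the described apparatus.

```python
from dataclasses import dataclass

LINE_SIZE = 16  # assumed cache line size in bytes; the text does not give one


@dataclass
class Entry:
    address: int
    state: str = "V"  # "V" valid, "IF" in flight, "I" invalid


def line_of(address):
    """Index of the cache line containing the given address."""
    return address // LINE_SIZE


def issue_head(plb):
    """Forward the oldest valid entry and coalesce other entries on its line.

    `plb` is ordered oldest first. Returns (head, coalesced_entries), or
    None when no valid entry is waiting.
    """
    head = next((e for e in plb if e.state == "V"), None)
    if head is None:
        return None
    head.state = "IF"  # forwarded to the load processing circuit
    coalesced = []
    for e in plb:
        same_line = line_of(e.address) == line_of(head.address)
        if e is not head and e.state == "V" and same_line:
            e.state = "I"  # forwarding suppressed
            coalesced.append(e)
    return head, coalesced


def de_coalesce(cache_line, head, coalesced):
    """Extract one data item per load from the single returned cache line."""
    return {e.address: cache_line[e.address % LINE_SIZE] for e in [head, *coalesced]}


plb = [Entry(3), Entry(5), Entry(8), Entry(21)]
head, coalesced = issue_head(plb)
memory = list(range(100, 132))  # dummy data: address a holds the value 100 + a
base = line_of(head.address) * LINE_SIZE
results = de_coalesce(memory[base:base + LINE_SIZE], head, coalesced)
head.state = "I"  # data returned, so the head entry can now be discarded
```

With these assumptions, a single access for address 3 yields results for addresses 3, 5 and 8, while the entry for address 21 stays valid and is issued separately later.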
[0137] Figure 18 schematically illustrates an exemplary implementation based on the principles described with reference to Figure 16 and Figure 17. Various components of the apparatus of Figure 18 have already been discussed with reference to Figure 16 and Figure 17, and these components are identified in Figure 18 by the same reference numerals. Instruction queue 310 is shown feeding register read circuit 320 and address generation circuit 118. Thus, for the sequence of load requests identified in instruction queue 310, the memory addresses associated with these load requests are determined, and a corresponding entry for each load request is added to PLB 120. Figure 18 shows further details of the coalescing circuit 130. In the exemplary embodiment of Figure 18, the coalescing circuit 130 is shown to include a trial proximity check circuit 331, an address proximity check circuit 332, and a hazard detection circuit 333. The trial proximity check circuit 331 and the address proximity check circuit 332 determine, in a two-stage process, whether multiple entries in the PLB 120 relate to addresses that are sufficiently close that their corresponding loads can advantageously be coalesced. Essentially, the trial proximity check circuit 331 performs a coarse comparison, while the address proximity check circuit 332 performs a more precise comparison. The coarser nature of the comparison performed by the trial proximity check circuit 331 allows that check to be performed more quickly, and thus allows the state of the corresponding entries in the PLB 120 to be updated (provisionally) more rapidly. After the initial trial proximity check has been performed (which in some implementations takes one CPU cycle, depending on the PLB size), the "first" pending load request (i.e., the one at the head of PLB 120) is dispatched to the load processing unit 341, which initiates the memory access. 
This "first" pending load request has by then been compared with the subsequent pending load requests in the queue of the FIFO PLB 120. In the exemplary illustration of Figure 18, the memory system accessed by the load processing unit 341 for this purpose includes TLB 342 (including lookup and fault-checking circuitry), L1 cache 160, and the remainder of memory system 365 (accessed in the event of an L1 miss). While this access proceeds, the coalescing circuit updates any other entries in PLB 120 that are potential matches (according to the trial proximity check) to a "hold" state and sends information identifying these candidates to address proximity check circuit 332, which performs a more detailed check to determine whether the remaining address bits of each potential match are the same as those of the outgoing load being processed by the load processing unit 341. When address proximity check 332 does not find a (sufficiently close) match, the coalescing circuit restores the relevant pending load requests in PLB 120 to the valid state. In other words, these load requests then proceed further through PLB 120, where they can be compared for address proximity with other load requests, and (unless they are squashed before reaching the head of the PLB queue) they are passed to the load processing unit 341 when they reach the head of the queue of PLB 120.
[0138] Conversely, if address proximity check 332 finds a match, the coalescing circuitry changes the relevant pending load requests in PLB 120 from held to invalid, and passes information about each "squashed" load to de-coalescing circuitry 150, allowing the desired result to be extracted from the returned cache line accordingly. This information may include: the load ID; its offset within the cache line; the size of the request; and the ID of the outgoing load on which it depends. It should be noted that although the coalescing circuitry spends additional time (e.g., two CPU cycles) performing these actions for the other load requests (those squashed in the example above), this is still less than the typical access time of the L1 cache, meaning that the latency of its operation is effectively hidden. The only case in which this latency is not hidden is when other load requests are temporarily held (due to a trial proximity match) but are subsequently found by the (full) address proximity check not to be a full match.
[0139] Load requests issued by load processing unit 341 access TLB 342 to perform the necessary lookups (for translation from virtual to physical addressing) and to respond appropriately to any faults. It should be noted that various types of L1 cache (e.g., virtually indexed or physically indexed) can be provided, and therefore the access to TLB 342 can occur before or after the L1 cache access. When the L1 data cache access has been performed and the contents of the relevant cache line are returned (as a result of a cache hit or through further access to the remainder of memory system 365), data read and path multiplexing circuitry 343 processes the cache line data and passes its contents to de-coalescing circuitry 150 (or at least makes those contents accessible to it). The de-coalescing circuitry then extracts the required data items (which, when the load request is a coalesced one, comprise multiple data items from one cache line). With reference to the above examples of a decoupled access-execute processor, these data items can be placed in buffer 350 (which may, for example, correspond to the decoupled access buffer 110 of Figure 2, the decoupled access buffer 234 of Figure 4, or the decoupled access buffer 311 of Figure 5), and can also be sent to the result cache 370 and / or the "execute" portion of the full processor (such as the access result cache 236 and execution circuit 220 in the example of Figure 4).
[0140] In Figure 18, the coalescing circuit 130 is also schematically shown to include a hazard detection circuit 333, which forms part of the coherence mechanisms that the apparatus supports in a wider data processing system (of which the apparatus forms part). These coherence mechanisms allow multiple master devices in the system to access and modify data items in shared areas of memory, in a manner that will be generally familiar to those skilled in the art. The hazard detection circuit is arranged to receive write notifications from external devices (e.g., another master device accessing memory shared with the apparatus). These write notifications may, for example, come from snoop requests exchanged in a multi-master system. Thus, when the coalescing circuit passes a load request to the load processing unit 341 (for a specific cache line to be accessed), the hazard detection circuit 333 of the coalescing circuit tracks the access until it is completed, and if a write notification relating to that cache line is received while the access is still in flight, the hazard detection circuit takes remedial action. If the external device waits for an acknowledgment signal before proceeding with its write operation, and the ordering rules dictate that the local load being executed should complete first, the hazard detection circuitry delays sending the corresponding acknowledgment signal until the cache line has been retrieved.
[0141] The hazard detection circuit 333 can also take action with respect to the contents of PLB 120. For example, when an ordering rule dictates that the access notified by the external device should complete before the local load, but the local load has already been processed by the load processing unit (either in its own right or by being coalesced with at least one other load request), the hazard detection circuit restores the entry in the pending load buffer circuit. This can be achieved by changing the entry's "in flight" or "invalid" status back to valid, or by adding the corresponding load request to the pending load buffer again. The hazard detection circuit 333 prevents the load request from being forwarded to the load processing circuit until it is known that the modification indicated by the write notification has been completed. Furthermore, the hazard detection circuit 333 sends a signal to the de-coalescing circuit indicating that the results for the load requests concerned should not be returned.
[0142] Figure 19 illustrates a worked example of the two-stage address comparison check performed in a coalescing circuit such as that shown in the example of Figure 18. Here, the base memory address against which the comparison is performed (i.e., the address of the valid pending load request at the head of the pending load buffer) is assumed to be "261167". It should be noted that this example is given in decimal notation only for readability, and the principle carries over readily to a typical binary implementation. In the first "trial" stage 331, digits [2:1] of this address are compared with those of the other addresses related to pending load requests in the PLB, and the other addresses that also have the digits "16" in these positions are determined to be trial matches. The set of four addresses thus selected causes the status of their pending load requests in the PLB to be updated to "hold". Next, in the second "full check" stage 332, digits [5:3] of the base address are compared with those of the held addresses; in this example, it is determined which of these addresses also have the digits "261". Entries that do not match have their status in the PLB restored to "valid". Entries that do match have their status in the PLB updated to "invalid", as these entries will be coalesced. Therefore, in the example shown, the coalescing request sent to the de-coalescing circuitry indicates a base request to access address 261167, and indicates that the result of this request should also be used to extract the data items at addresses 261162, 261160, and 261163. It should be understood that this address information does not need to be transmitted explicitly, but can be transmitted in a more compact form, such as an indicator of the base load ID and the corresponding offset within the cache line.
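The decimal example above can be reproduced with a short sketch. The helper names are hypothetical, and two of the candidate addresses (935164, which passes the trial check but fails the full check, and 261263, which fails the trial check) are invented here purely to exercise both outcomes; they do not appear in the description.

```python
def digits(address, hi, lo):
    """Decimal digits [hi:lo] of an address, digit 0 being least significant."""
    return (address // 10**lo) % 10 ** (hi - lo + 1)


def two_stage_check(base, candidates):
    # Stage 1 (trial check): cheap comparison on digits [2:1] only.
    held = [a for a in candidates if digits(a, 2, 1) == digits(base, 2, 1)]
    # Stage 2 (full check): compare the remaining digits [5:3] of held entries.
    coalesced = [a for a in held if digits(a, 5, 3) == digits(base, 5, 3)]
    # Held entries that fail the full check go back to the valid state.
    restored = [a for a in held if a not in coalesced]
    return held, coalesced, restored


base = 261167
# 935164 and 261263 are hypothetical addresses added for illustration.
candidates = [261162, 935164, 261160, 261263, 261163]
held, coalesced, restored = two_stage_check(base, candidates)
```

Four addresses are held after the trial stage, three of them (261162, 261160 and 261163) are coalesced with the base request, and the remaining one is restored to valid. In a binary implementation the digit ranges would be replaced by bit fields of the address above the cache line offset.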
[0143] Figure 20 illustrates how the contents of a pending load buffer evolve as an exemplary set of entries is processed, according to some embodiments. Snapshots of the contents of the pending load buffer at eight consecutive time points A-H are shown, together with a series of actions 1000-1006. The buffer entries are shown vertically stacked in the diagram, with the head of the buffer (i.e., holding the oldest entry) at the top and the tail of the buffer (i.e., holding the newest entry) at the bottom. Content A shows a load associated with address 74 that has reached the head of the buffer. This entry is marked invalid (I), and therefore action 1000 discards the invalid entry from the head of the buffer, resulting in content B. Next, action 1001 performs the trial proximity check, whereby the head entry and any entries that might fully match in the address proximity check are marked "hold" (H). It should be noted that in content C this set of pending load requests includes "Load 3", "Load 5", "Load 8", and an additional load in the entry between Load 5 and Load 8. Purely for readability, the entries in Figure 20 other than "Load 3", "Load 5", and "Load 8" are not explicitly named. As a result of the (full) address proximity check performed as action 1002, content D is present in the pending load buffer, where the entry at the head is marked "in flight" (IF) (because it has been forwarded to the load processing circuitry), and the requests that meet the address proximity condition are marked "invalid" (I) because they have been coalesced. It should be noted that the request in the entry between Load 5 and Load 8, which was found not to match in the address proximity check, is reset to "valid" (V), as shown in content D. 
The next action 1003 relates to a received write notification that concerns (and takes precedence over) at least one of the loads coalesced into the request issued for "Load 3". The load is therefore squashed (any results generated in the load processing circuitry are discarded) and restored in the PLB by marking it as valid again, producing content E. A valid pending load request then exists at the head of the queue, so action 1004 initiates the load again (forwarding the request to the load processing circuit), and the entry in the PLB is marked as in flight (content F). On this attempt, "Load 3" completes successfully, with the de-coalescing circuit returning the result of the load request along with the results corresponding to "Load 5" and "Load 8", and action 1005 then marks the "Load 3" request as invalid (see content G). Finally, action 1006 discards the invalid request at the head of the queue, resulting in content H. It should be noted that the entries for "Load 5" and "Load 8" will likewise simply be discarded upon reaching the head of the queue.
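The entry states used in Figure 20 can be summarised as a small state machine. The event names and the transition table below are one reading of paragraphs [0137] to [0143], written as a sketch; they are assumptions for illustration rather than a definition taken from the figure.

```python
from enum import Enum


class State(Enum):
    VALID = "V"
    HOLD = "H"
    IN_FLIGHT = "IF"
    INVALID = "I"


# Permitted transitions, keyed by (event, current state). Event names are
# hypothetical labels for the actions described in the text.
TRANSITIONS = {
    ("trial_match", State.VALID): State.HOLD,      # action 1001
    ("issue", State.HOLD): State.IN_FLIGHT,        # head entry forwarded (1002)
    ("issue", State.VALID): State.IN_FLIGHT,       # re-issue after a squash (1004)
    ("full_match", State.HOLD): State.INVALID,     # coalesced (1002)
    ("full_mismatch", State.HOLD): State.VALID,    # not coalesced (1002)
    ("squash", State.IN_FLIGHT): State.VALID,      # write notification (1003)
    ("squash", State.INVALID): State.VALID,        # restoring a coalesced load
    ("complete", State.IN_FLIGHT): State.INVALID,  # data returned (1005)
}


def step(state, event):
    """Apply one event, rejecting transitions the table does not allow."""
    try:
        return TRANSITIONS[(event, state)]
    except KeyError:
        raise ValueError(f"event {event!r} not allowed in state {state.name}") from None


def replay(events, start=State.VALID):
    """Return the sequence of state letters visited for a list of events."""
    history = [start]
    for event in events:
        history.append(step(history[-1], event))
    return [s.value for s in history]


load3 = replay(["trial_match", "issue", "squash", "issue", "complete"])
between = replay(["trial_match", "full_mismatch"])
load5 = replay(["trial_match", "full_match"])
```

Replaying the events for "Load 3" gives the sequence V, H, IF, V, IF, I, matching contents B to G, while the unnamed entry between Load 5 and Load 8 goes from hold back to valid.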
[0144] Figure 21 shows a sequence of steps taken according to the method of some embodiments. The sequence begins at step 1010, where the address required for the load request is generated. At step 1011, the pending load request is buffered in a pending load buffer circuit. At step 1012, the pending load request (which has reached the head of the queue formed by the pending load buffer) is forwarded to the load processing circuit for execution. Then at step 1013, it is determined whether one or more subsequent load requests in the pending load buffer satisfy the address proximity condition relative to the just-issued load request. When the address proximity condition is satisfied, the process continues to step 1014, where forwarding of the one or more subsequent load requests that satisfy the address proximity condition to the load processing circuit is suppressed. However, if the address proximity condition is not satisfied at step 1013, the process continues to step 1015. The process also continues from step 1014 to step 1015. At step 1015, a set of data items identified by the forwarded load request is retrieved from the memory system. At step 1016, the data item identified by the load request itself is returned as the result of the load request. If the address proximity condition was not met at step 1013, the process continues from step 1016 to step 1018, which completes the sequence of steps. However, if the address proximity condition was met at step 1013, the process continues from step 1016 through step 1017, in which the additional data items corresponding to the one or more subsequent load requests are returned. The process then ends at step 1018.
[0145] Figure 22 shows a sequence of instructions according to some embodiments, comprising a guidance instruction 1100 followed by a plurality of further instructions 1101. The guidance instruction 1100 is provided according to the present technique to give the processor information relating to the subsequent instructions 1101, and in particular to indicate for each subsequent instruction whether it is to be treated as an "execute" instruction or an "access" instruction. The purpose of this classification in the context of a decoupled access-execute processor is described above with reference to the figures illustrating the principles of a decoupled access-execute processor (and examples thereof), the discussion of classifying instructions as "execute" or "access" instructions, and the examples of data dependency graphs. In the example of Figure 22, it can be seen that the guidance instruction 1100 essentially comprises two parts: a first "opcode" part, specifically encoded to identify the instruction to the decoding circuitry of the apparatus as a guidance instruction of this type, and a further part providing information on the classification (access or execute) of the subsequent instructions in the group. Furthermore, in the example of Figure 22, it can be seen that the information in instruction 1100 is given explicitly (i.e., as an immediate value), where 0 indicates an access instruction and 1 indicates an execute instruction. It should be noted that, as mentioned above, the set of instructions 1101 to which these access / execute flags apply may immediately follow the guidance instruction 1100 or, in particular for timing purposes, one or more other instructions (not shown) may be present in the instruction sequence between the guidance instruction 1100 and the first instruction in the set of instructions 1101.
[0146] Figure 23 schematically illustrates a decoding circuit 1110 in some embodiments. This decoding circuit may be, for example, the decoding circuit 102 of Figure 2, the decoding circuit 204 of Figure 4, the decoding circuit 302 of Figure 5, the decoding part of circuit 501 of Figure 9, the decoding circuit 702 of Figure 12, and/or the decoding circuit 800 of Figure 15A. The decoding circuit 1110 is shown in Figure 23 as comprising an opcode recognition circuit 1111, a further decoding operation circuit 1112, and an access/execution flag circuit 1113. Those skilled in the art will understand that the decoding circuit 1110 is shown only at a relatively high level of abstraction, in order to convey the points relevant to the discussion of the present technology. For clarity, therefore, many components of a typical modern decoding circuit are not shown. A sequence of instructions (i.e., fetched instructions) is received by the decoding circuit 1110, and the opcode recognition circuit 1111 identifies the guidance instructions of the present technology by the specific opcode that forms part of those instructions. When one of these instructions is identified, the opcode recognition circuit 1111 signals this to the access/execution flag circuit 1113 and also passes on the access/execution information encoded in the instruction. In the exemplary embodiment of Figure 23, the opcode recognition circuit 1111 is arranged to recognize guidance instructions of the type shown in Figure 22, and therefore the explicit access/execution information provided as part of that instruction is passed directly to the access/execution flag circuit 1113. This explicit access/execution information is stored in a marking buffer 1114, which forms part of the access/execution flag circuit 1113. 
In this way, the relevant flags for the subsequent set of instructions 1101 of Figure 22 are held by the decoding circuit 1110 and then applied when the subsequent instruction sequence is received. The flags are applied by the further decoding operation circuit 1112, which receives the relevant flag for each subsequent instruction. The output of the decoding circuit 1110 is therefore a decoded instruction carrying either an "access" type or an "execute" type flag. It should be noted that the decoding circuit 1110 is arranged with a default flag, which here means that the processor treats instructions as "access" by default (unless an instruction is of a specific type that must, for other reasons, be forwarded to a particular part of the processor, that part being the only one able to execute it).
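The behaviour described for the decoding circuit 1110 can be modeled in a few lines of software. The sketch below is illustrative only: the class name Decoder, the "PILOT" opcode string and the tuple representation of instructions are assumptions made for this example. The bit convention (0 for access, 1 for execute) and the default "access" flag follow the description above.

```python
from collections import deque

ACCESS, EXECUTE = "A", "E"


class Decoder:
    """Toy model of decoding circuit 1110 (names are hypothetical)."""

    def __init__(self):
        # Plays the role of buffer 1114: flags waiting to be applied.
        self.flags = deque()

    def decode(self, instr):
        opcode, payload = instr
        if opcode == "PILOT":
            # Explicit immediate A/E bits: 0 = access, 1 = execute.
            self.flags.extend(EXECUTE if bit else ACCESS for bit in payload)
            return None  # the pilot instruction itself produces no work
        # Apply the next buffered flag, or the default flag "access".
        flag = self.flags.popleft() if self.flags else ACCESS
        return (opcode, flag)


def decode_all(program):
    decoder = Decoder()
    decoded = (decoder.decode(instr) for instr in program)
    return [d for d in decoded if d is not None]
```

For example, a pilot instruction carrying the bits 0, 1, 1 followed by four instructions flags them as A, E, E and then A, the last one falling back to the default because no buffered flag remains.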
[0147] Figure 24 schematically illustrates an apparatus 1120 in some exemplary embodiments. Fetch circuitry 1121 receives instructions retrieved from a memory system and provides these instructions to decoding circuitry 1122, which performs a decoding operation, generally as described above with reference to Figure 23. As shown in Figure 24, the memory system includes (at least) instruction caches 1130 and 1131, in order to avoid as far as possible the full latency associated with retrieving instructions from their original storage locations in memory. Another feature of the apparatus 1120 of Figure 24 is a micro-operation cache 1123, in which micro-operations resulting from instruction decoding are temporarily stored. The issue circuit 1124 retrieves micro-operations from the micro-operation cache 1123 (if stored there) in order to issue them to one of the execution circuit 1125 and the access execution circuit 1126. The access execution circuit 1126 includes a decoupled access buffer 1127, to which the execution circuit 1125 has access. A further part of the memory system 1128 is also shown (i.e., in addition to instruction caches 1130 and 1131, and possibly including, for example, one or more shared caches and/or system caches in front of the actual memory). It should be understood that the execution circuit 1125, access execution circuit 1126, and decoupled access buffer 1127 shown in Figure 24 may be taken to be any of the examples of these components described above with reference to the foregoing figures.
[0148] With reference to the micro-operation cache 1123 of Figure 24, it should be noted that each entry in the micro-operation cache may have an associated additional tag (A or E) indicating whether the entry is to be processed as an access instruction or as an execution instruction (micro-operation). According to the present technology, the decoding circuit 1122 is arranged to apply this tag, based on the classification of the instructions following the guidance instruction, to the decoded instructions (and/or their equivalent micro-operations) stored in the micro-operation cache 1123. Therefore, when the issue circuit 1124 accesses an entry in the micro-operation cache 1123, it can also be provided with the associated classification (tag) information, and the micro-operations can be directed to the appropriate one of the circuits 1125 and 1126 on this basis. Instruction caches 1130 and 1131 may also store the associated classification information (tags), as generated by the decoding circuit 1122 and applied to entries in the micro-operation cache 1123, so that these tags can be preserved when entries are evicted from the micro-operation cache to these illustrated levels of the instruction cache hierarchy.
[0149] Figure 25 schematically illustrates an apparatus 1160 in some exemplary embodiments. Fetch circuitry 1161 receives instructions retrieved from memory system 1168 and provides these instructions to decoding circuitry 1162, which performs a decoding operation, generally as described above with reference to Figure 23. The apparatus also includes a register renaming circuit 1163 that performs a register renaming operation to allow out-of-order instruction execution. The remapped (renamed) instructions are then passed to an issue circuit 1164, which issues instructions to either an execution circuit 1165 or an access execution circuit 1166. The access execution circuit 1166 includes a decoupled access buffer 1167, to which the execution circuit 1165 has access. It should be understood that the execution circuit 1165, access execution circuit 1166, and decoupled access buffer 1167 shown in Figure 25 may be taken to be any of the examples of these components described above with reference to the foregoing figures. Figure 25 also shows three groups of physical registers, 1169, 1170, and 1171. Although these are shown separately in Figure 25, they should be understood as forming one set of physical registers; the grouping shown is logical, not physical. The renaming circuit 1163 selects among the three subsets shown according to whether a given instruction is an "access" instruction or an "execute" instruction. When decoding circuit 1162 encounters a guidance instruction according to the present technology, it generates control signals to modify the renaming operation of register renaming circuit 1163, such that the appropriate register subset is used for the "guided" subsequent instructions, depending on the type of those subsequent instructions. 
Here, registers 1171 hold values generated by an "access" instruction and consumed only by other "access" instructions; registers 1170 hold values generated by an "access" instruction and consumed by an "execute" instruction; and registers 1169 hold values generated by an "execute" instruction and consumed only by other "execute" instructions.
[0150] Figures 26A to 26C schematically illustrate the structure of guidance instructions according to the present technology in some embodiments. In Figure 26A, the instruction is shown as comprising an opcode, format information, and access/execution (A/E) information about one or more subsequent instructions. In the example of Figure 26A, the format information is a single binary bit, where a value of 0 indicates that the AE information is immediate (i.e., plain), such that, in the manner of Figure 22, each bit value in the set of AE information corresponds to one subsequent instruction and indicates its type. Conversely, a format value of 1 indicates that the AE information is compressed. This compression may be fixed and predetermined, so that the decoding circuit needs no additional information from the instruction in order to interpret it (by appropriately decompressing the AE information). Figure 26B shows an example in which the format information comprises two binary bits. As in Figure 26A, a first value (here 00) indicates that the AE information is immediate (i.e., plain), directly indicating the classification of the subsequent instruction sequence. The three other possible values of the bit pair indicate which of three different compression schemes (#1, #2, or #3) has been used to encode the AE information in the instruction. In this example, compression scheme #1 (indicated by format information 01) is run-length encoding (RLE), such that, for example, the pattern "AAAAAAEEEAAAAA" is logically represented in the AE information as {6A, 3E, 5A}. RLE is well known and is not described in more detail here; any known RLE procedure and representation of the AE information can be applied by those skilled in the art. 
In the example of Figure 26B, the format information 10 (compression scheme #2) indicates a compression scheme according to which the pattern "AAAEEAAAAAEEEEEE" is presented as {A, 3, 2, 5, 6} in the AE information. The example of Figure 26B does not make use of the possible compression scheme #3, but in practice more bits (as available) may be used in the instruction encoding to indicate further compression schemes as needed. The example of Figure 26C shows an instruction comprising opcode information, AE information, and format/reordering information. Here the format/reordering information indicates the compression scheme used, if any (as in the examples of Figure 26A and Figure 26B), and additionally indicates (e.g., by a separate single bit) whether a known reordering (a fixed permutation) was applied to the bits before any compression scheme was applied. This reordering can be applied when the instructions are created, or it can be applied subsequently by the compiler if the compiler changes the order of the instructions. The compiler may reorder instructions for any other reason, but it may also do so specifically to group access and execution instructions together, which facilitates compression and thus improves the compression ratio.
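The two compression schemes of the Figure 26B example can be reproduced with a short sketch. The text gives only one example pattern per scheme, so the reading of scheme #2 used here (the type of the first run, followed by the lengths of alternating runs) is an interpretation of that single example, and the function names and the list representation of the AE information are assumptions. Patterns are assumed to be non-empty strings over the letters A and E.

```python
from itertools import groupby


def runs(pattern):
    """Split an A/E pattern into (length, type) runs."""
    return [(len(list(group)), kind) for kind, group in groupby(pattern)]


def encode_scheme1(pattern):
    # Scheme #1 (plain RLE): each run is stored with its type, e.g. "6A".
    return [f"{n}{kind}" for n, kind in runs(pattern)]


def decode_scheme1(tokens):
    return "".join(tok[-1] * int(tok[:-1]) for tok in tokens)


def encode_scheme2(pattern):
    # Scheme #2 (as read from the example): first type, then run lengths.
    r = runs(pattern)
    return [r[0][1]] + [n for n, _ in r]


def decode_scheme2(tokens):
    kind, out = tokens[0], []
    for n in tokens[1:]:
        out.append(kind * n)
        kind = "E" if kind == "A" else "A"  # runs alternate between A and E
    return "".join(out)
```

Applied to the patterns quoted above, encode_scheme1("AAAAAAEEEAAAAA") yields the tokens 6A, 3E, 5A, and encode_scheme2("AAAEEAAAAAEEEEEE") yields A, 3, 2, 5, 6. Because only two types exist, scheme #2 can drop the per-run type and keep just the first one.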
[0151] Figure 27 schematically shows a decoding circuit 1140 in some embodiments. This may be the decoding circuit of any of the examples or figures above. The figure illustrates how an instruction 1141, formatted as shown in Figures 26A to 26C, is received, processed, and decoded by the decoding circuit. Instruction 1141 is received by decoding circuit 1140, and the opcode portion is routed to opcode recognition circuit 1142. Format information (and reordering information, if present) and AE information are routed to decompression / reordering circuit 1142. Opcode recognition circuit 1142 provides control information to decompression / reordering circuit 1142 so that the instruction type, and in particular the nature of its format information, is correctly interpreted (see, for example, the examples of Figures 26A to 26C). The decompression / reordering circuit 1142 then decompresses the AE information (if needed) and performs a reverse reordering (if needed) to generate the required decompressed A/E tags for at least one subsequent instruction. It should be understood that, generally for efficiency reasons, as many subsequent instructions as possible are tagged by a given guidance instruction (within the available encoding space).
[0152] Figure 28 is a flowchart illustrating a sequence of steps taken by a decoding circuit according to the method of some embodiments. This may be the decoding circuit of any of the examples or figures above. At step 1200, the decoding circuit receives the next instruction, and at step 1201, it determines whether the instruction is one of the A/E guidance instructions of the present technology. If not, the flow continues to step 1202, where the decoding circuit performs "normal" decoding of the instruction as needed (this typically being the case for most instructions received by the decoding circuit), so that the apparatus performs its general data processing operations. However, when an A/E guidance instruction is encountered, the flow continues to step 1203, where the AE information is extracted from the instruction. Then, at step 1204, this AE guidance (i.e., the labeling or classification of individual instructions) is applied to the relevant subsequent instructions for the decoupled access execution performed by the apparatus. For more details on this decoupled access execution, refer to any of the examples above. The flow then returns to step 1200.
[0153] Figure 29 is a flowchart illustrating a sequence of steps taken by a decoding circuit according to the method of some embodiments when the guidance instruction may also include compressed AE information. This may be the decoding circuit of any of the examples or figures above. The process can be considered to begin at step 1250, in which the decoding circuit receives the next instruction to be decoded. Then, at step 1251, it is determined whether the instruction is an A/E guidance instruction according to the present technology. As in Figure 28, if it is not, the process continues via step 1252, where the instruction is decoded "normally," and returns to step 1250. If an A/E guidance instruction is encountered, the process continues to step 1253, where it is further determined whether the instruction indicates that the AE information is compressed and/or reordered. If so, the process continues via step 1254, where the AE information is decompressed, it being understood that this may involve decompression and/or reverse reordering, as detailed above with reference to Figures 26A to 26C and Figure 27. The process then proceeds from step 1253 (without compression or reordering) or from step 1254 (with compression or reordering) to step 1255, where the plain (uncompressed, original-order) AE information is extracted. Then, at step 1256, this AE guidance is applied to the relevant subsequent instructions for decoupled access execution. Again, for more details regarding this decoupled access execution, refer to any of the examples above. The process then returns to step 1250.
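Steps 1253 to 1255 can be illustrated with a small round trip through a fixed reordering and run-length compression. The following is a sketch under assumptions: the permutation PERM (even positions first, then odd positions, for eight flags), the dictionary representation of the guidance instruction, and all function names are hypothetical and chosen only to show why a reordering applied before compression can shorten the encoded AE information.

```python
from itertools import groupby

# Hypothetical fixed permutation for 8 flags: even positions, then odd.
PERM = [0, 2, 4, 6, 1, 3, 5, 7]


def reorder(flags, perm=PERM):
    return "".join(flags[i] for i in perm)


def inverse_reorder(flags, perm=PERM):
    out = [None] * len(perm)
    for dst, src in enumerate(perm):
        out[src] = flags[dst]
    return "".join(out)


def rle(flags):
    return [(len(list(group)), kind) for kind, group in groupby(flags)]


def unrle(pairs):
    return "".join(kind * n for n, kind in pairs)


def encode_guidance(flags, compressed, reordered):
    """Build a toy guidance instruction carrying the A/E flags."""
    body = reorder(flags) if reordered else flags
    return {"compressed": compressed, "reordered": reordered,
            "ae": rle(body) if compressed else body}


def extract_ae(instr):
    """Steps 1253-1255: recover the plain AE information."""
    ae = instr["ae"]
    if instr["compressed"]:
        ae = unrle(ae)            # step 1254: decompress
    if instr["reordered"]:
        ae = inverse_reorder(ae)  # step 1254: undo the fixed reordering
    return ae                     # step 1255: plain AE information
```

With these assumptions, the strictly alternating pattern "AEAEAEAE" needs eight runs under plain RLE, but only two runs once the fixed reordering has grouped the access and execution flags together; extract_ae recovers the original order in either case.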
[0154] Figure 30 illustrates a simulator implementation that may be used. While the previously described embodiments implement the invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment according to the embodiments described herein that is implemented using a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software-based implementation of a hardware architecture. Types of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation can run on a host processor 1330 supporting the simulator program 1310, optionally running a host operating system 1320. In some arrangements, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and/or multiple distinct instruction execution environments provided on the same host processor. Historically, powerful processors were required to provide simulator implementations that execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when it is desirable to run code native to another processor for compatibility or reuse reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality that is not supported by the host processor hardware, or provide an instruction execution environment that is typically associated with a different hardware architecture. An overview of simulation is given in "Some Efficient Architecture Simulation Techniques", Robert Bedichek, Winter 1990 USENIX Conference, pp. 53-63.
[0155] To the extent that embodiments have previously been described with reference to specific hardware constructions or features, in a simulated embodiment equivalent functionality may be provided by suitable software constructions or features. For example, specific circuitry may be implemented as computer program logic in a simulated embodiment. Similarly, memory hardware such as registers or cache memory may be implemented as software data structures in a simulated embodiment. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (e.g., host processor 1330), some simulated embodiments may make use of the host hardware, where suitable.
[0156] The simulator program 1310 may be stored on a computer-readable storage medium (which may be a non-transitory medium) and provides to the target code 1300 a program interface (instruction execution environment) that is the same as the application programming interface of the hardware architecture being modeled by the simulator program 1310. Therefore, in such embodiments, the program instructions of the target code 1300, including the novel guidance instructions described above for providing A/E markings, can be executed from within the instruction execution environment using the simulator program 1310, so that a host computer 1330 that does not actually possess the hardware features described above can emulate these features.
[0157] In brief overall summary, apparatuses and methods for data processing are disclosed. When load requests are generated to support data processing operations, these load requests are buffered in a pending load buffer circuit before execution. A coalescing circuit determines, for a first load request, whether a group of one or more subsequent load requests buffered in the pending load buffer circuit satisfies an address proximity condition. The address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included in a series of data items to be retrieved from the memory system in response to the first load request. When the address proximity condition is satisfied, forwarding of the group of one or more subsequent load requests is suppressed. Upon receiving the series of data items retrieved by the load processing circuit, a de-coalescing circuit returns the data item identified by the first load request and, when the address proximity condition is satisfied, returns one or more additional data items for the one or more subsequent load requests.
[0158] In this application, the words "configured to..." are used to mean that an element of an apparatus has a configuration capable of performing the defined operation. In this context, a "configuration" means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware that provides the defined operation, or a processor or other processing device may be programmed to perform the function. "Configured to" does not imply that the element of the apparatus needs to be changed in any way in order to provide the defined operation.
[0159] While exemplary embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it should be understood that the invention is not limited to those precise embodiments, and various changes, additions, and modifications can be made therein by those skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims. For example, features of the dependent claims may be combined with features of the independent claims in various ways without departing from the scope of the invention.
Claims
1. An apparatus for performing a data processing operation, the data processing operation including loading data items from a memory system, the apparatus comprising: an address generation circuit for generating an address for a load request; a pending load buffer circuit configured to buffer the load request received from the address generation circuit before the load request is executed to retrieve a data item using the address of the load request; a load processing circuit that, in response to the load request, retrieves from the memory system a series of data items including the data item identified by the load request; a coalescing circuit configured to forward the load request buffered in the pending load buffer circuit to the load processing circuit, and arranged to determine whether an address proximity condition is satisfied for a group of one or more subsequent load requests buffered in the pending load buffer circuit, wherein the address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included within the series of data items, and wherein the coalescing circuit suppresses the forwarding of the group of one or more subsequent load requests in response to the address proximity condition being satisfied; and a de-coalescing circuit configured to receive the series of data items retrieved by the load processing circuit, and to return the data item identified by the load request as the result of the load request, wherein the de-coalescing circuit, in response to the address proximity condition being satisfied, returns, for each of the subsequent load requests in the group of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
2. The apparatus of claim 1, wherein the series of data items is a cache line, and the address proximity condition is satisfied when the data item identified by the load request and all data items identified by the group of one or more subsequent load requests are included within the cache line.
3. The apparatus of claim 1 or claim 2, wherein the address proximity condition is satisfied when the absolute difference between the address used for the load request and the address used for each of the group of one or more subsequent load requests is less than a predetermined threshold.
4. The apparatus of claim 1, wherein the coalescing circuit is arranged to determine, before determining whether the address proximity condition is satisfied, whether a trial proximity condition is satisfied for the group of one or more subsequent load requests buffered in the pending load buffer circuit, wherein the coalescing circuit, in response to the trial proximity condition being satisfied, forwards the load request to the load processing circuit and temporarily suppresses the forwarding of the group of one or more subsequent load requests, and wherein the coalescing circuit, in response to the address proximity condition not being satisfied, stops temporarily suppressing the forwarding of the group of one or more subsequent load requests.
5. The apparatus of claim 4, wherein the trial proximity condition is satisfied when a first portion of all addresses of the group of one or more subsequent load requests matches a first portion of the address of the load request, and the address proximity condition is satisfied when the trial proximity condition is satisfied and a second portion of all addresses of the group of one or more subsequent load requests matches a second portion of the address of the load request.
6. The apparatus of claim 1 or claim 2, wherein the coalescing circuit, in response to the address proximity condition being satisfied, provides a coalescing request indication to the de-coalescing circuit, the coalescing request indication identifying the load request and the group of one or more subsequent load requests; and the de-coalescing circuit, in response to the coalescing request indication, identifies the one or more additional data items based on the coalescing request indication.
7. The apparatus of claim 1 or claim 2, wherein the pending load buffer circuit includes a FIFO buffer, wherein the load request is the oldest load request in the FIFO buffer, and wherein the group of one or more subsequent load requests consists of newer load requests in the FIFO buffer.
8. The apparatus of claim 1 or claim 2, further comprising: a hazard detection circuit configured to detect an action related to modification of the series of data items and, in response to detecting the action related to the modification of the series of data items, to perform a memory coherence operation to ensure that the retrieval of the series of data items by the load processing circuit and the modification of the series of data items occur in the order specified by a memory coherence protocol.
9. The apparatus of claim 8, wherein the memory coherence operation comprises: resuming the load request in the pending load buffer circuit and preventing the load request from being forwarded to the load processing circuit until the modification of the series of data items has been completed; and, when the series of data items is retrieved before the modification of the series of data items has been completed, preventing the de-coalescing circuit from returning the data item from the series of data items as the result of the load request.
10. The apparatus of claim 9, wherein resuming the load request in the pending load buffer comprises adding the load request to the pending load buffer.
11. The apparatus of claim 8, wherein the action related to the modification of the series of data items is a write notification issued by another apparatus, and wherein the apparatus, in response to detecting the write notification, delays sending an acknowledgment of the write notification until the load processing circuit has retrieved the series of data items, the acknowledgment indicating permission to proceed with the modification of the series of data items.
12. The apparatus of claim 4 or claim 9, wherein the pending load buffer circuit is arranged to store a status indicator for each of the load requests buffered in the pending load buffer circuit; the status indicator for the load request received from the address generation circuit is initially set to indicate a valid state; in response to the address proximity condition being satisfied, the status indicator corresponding to the group of one or more subsequent load requests is set to indicate an invalid state; and the coalescing circuit, in response to an invalid load request in the pending load buffer circuit, suppresses the forwarding of the invalid load request.
13. The apparatus of claim 4, wherein the pending load buffer circuit is arranged to store a status indicator for each of the load requests buffered in the pending load buffer circuit; the status indicator for the load request received from the address generation circuit is initially set to indicate a valid state; in response to the address proximity condition being satisfied, the status indicator corresponding to the group of one or more subsequent load requests is set to indicate an invalid state; the coalescing circuit, in response to an invalid load request in the pending load buffer circuit, suppresses the forwarding of the invalid load request; in response to the trial proximity condition being satisfied, the status indicator corresponding to the group of one or more subsequent load requests is set to indicate a hold state; the coalescing circuit, in response to a load request having the hold state in the pending load buffer circuit, temporarily suppresses the forwarding of the load request having the hold state; and, in response to the address proximity condition not being satisfied, the status indicator corresponding to the group of one or more subsequent load requests is reset to the valid state.
14. The apparatus of claim 9, wherein the pending load buffer circuit is arranged to store a status indicator for each of the load requests buffered in the pending load buffer circuit; the status indicator for the load request received from the address generation circuit is initially set to indicate a valid state; in response to the address proximity condition being satisfied, the status indicator corresponding to the group of one or more subsequent load requests is set to indicate an invalid state; the coalescing circuit, in response to an invalid load request in the pending load buffer circuit, suppresses the forwarding of the invalid load request; when the load request is forwarded, the status indicator of the load request is set to indicate an in-flight state; in response to the de-coalescing circuit returning the data item identified by the load request as the result of the load request, the status indicator corresponding to the load request is set to the invalid state; and resuming the load request in the pending load buffer includes resetting the status indicator corresponding to the load request to the valid state.
15. A method of operating an apparatus for performing a data processing operation, the data processing operation including loading a data item from a memory system, the method comprising: generating an address for a load request; buffering the load request in a pending load buffer circuit before a load processing circuit executes the load request to retrieve the data item using the address of the load request; forwarding the load request buffered in the pending load buffer circuit to the load processing circuit to retrieve from the memory system a series of data items including the data item identified by the load request; determining, for a group of one or more subsequent load requests buffered in the pending load buffer circuit, whether an address proximity condition is satisfied, wherein the address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included within the series of data items; in response to the address proximity condition being satisfied, suppressing the forwarding of the group of one or more subsequent load requests to the load processing circuit; in response to the load request, retrieving from the memory system the series of data items including the data item identified by the load request; receiving the retrieved series of data items and returning the data item identified by the load request as the result of the load request; and, in response to the address proximity condition being satisfied, returning, for each of the subsequent load requests in the group of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.
16. An apparatus for performing a data processing operation, the data processing operation including loading a data item from a memory system, the apparatus comprising: means for generating an address for a load request; means for buffering the load request before means for processing a load executes the load request to retrieve a data item using the address of the load request; means for forwarding the load request buffered in the means for buffering to the means for processing a load to retrieve from the memory system a series of data items including the data item identified by the load request, wherein the means for processing a load, in response to the load request, retrieves from the memory system the series of data items including the data item identified by the load request; means for determining whether an address proximity condition is satisfied for a group of one or more subsequent load requests buffered in the means for buffering, wherein the address proximity condition is satisfied when all data items identified by the group of one or more subsequent load requests are included within the series of data items; means for suppressing, in response to the address proximity condition being satisfied, the forwarding of the group of one or more subsequent load requests to the means for processing a load; means for receiving the retrieved series of data items and for returning the data item identified by the load request as the result of the load request; and means for returning, in response to the address proximity condition being satisfied and for each of the subsequent load requests in the group of one or more subsequent load requests, one or more additional data items from the series of data items, identified by the one or more subsequent load requests, as the result of the one or more subsequent load requests.