Resource allocation adjustment
The integration of resource allocation adjustment circuitry addresses the inflexibility of resource allocation between processing and extension processing circuitry, optimizing resource distribution for improved performance and flexibility in handling varying workloads.
Patent Information
- Authority / Receiving Office: WO · WO
- Patent Type: Applications
- Current Assignee / Owner: ARM LTD
- Filing Date: 2025-10-16
- Publication Date: 2026-06-11
Abstract
Description
[0001] Arm Ref P08378 1
[0002] DYC Ref P130989PCT
[0003] RESOURCE ALLOCATION ADJUSTMENT
[0004] The present technique relates to the field of data processing.
[0005] Data processing operations may be performed by processing circuitry in response to instructions decoded by decoding circuitry.
[0006] At least some examples of the present technique provide an apparatus comprising: decoding circuitry configured to decode instructions; processing circuitry configured to perform data processing operations in response to the instructions decoded by the decoding circuitry; extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment circuitry configured to adjust a resource allocation between the processing circuitry and the extension processing circuitry responsive to a resource adjustment indication.
[0007] At least some examples of the present technique provide a system comprising: the apparatus described above, implemented in at least one packaged chip; at least one system component; and a board, wherein the at least one packaged chip and the at least one system component are assembled on the board.
[0008] At least some examples of the present technique provide a chip-containing product comprising the system described above, wherein the system is assembled on a further board with at least one other product component.
[0009] At least some examples of the present technique provide a computer-readable medium storing computer-readable code for fabrication of an apparatus described above.
[0010] At least some examples provide a computer program for controlling a host data processing apparatus to provide an instruction execution environment for execution of target program code, the computer program comprising: decoding program logic configured to decode instructions; processing program logic configured to perform data processing operations in response to the instructions decoded by the decoding program logic; extension processing program logic configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing program logic; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment program logic configured to adjust a resource allocation between the processing program logic and the extension processing program logic responsive to a resource adjustment indication.
[0012] Further aspects, features and advantages of the present technique will be apparent from the following description of examples, which is to be read in conjunction with the accompanying drawings, in which:
[0013] Figure 1 illustrates an example of an apparatus comprising decoding circuitry, processing circuitry and extension processing circuitry;
[0014] Figure 2 illustrates an example of a processing system;
[0015] Figure 3 illustrates a first example of the extension processing circuitry;
[0016] Figure 4 illustrates a second example of the extension processing circuitry;
[0017] Figure 5 illustrates an example resource allocation before and after adjustment;
[0018] Figure 6 illustrates an example apparatus according to the present techniques;
[0019] Figure 7 illustrates an example method of adjusting a resource allocation;
[0020] Figures 8A, 8B, 8C, and 8D illustrate example software provided control indications;
[0021] Figures 9A, 9B, and 9C illustrate example methods for adjusting a resource allocation based on a software provided control indication;
[0022] Figure 10 illustrates an example apparatus according to some of the present techniques;
[0023] Figure 11 illustrates an example software provided control indication for extension and further extension processing circuitries;
[0024] Figure 12A illustrates an example of differentiating between memory traffic based on assigned PARTID;
[0025] Figure 12B illustrates an example technique for implementing resource allocation control;
[0026] Figure 13 illustrates an example method of adjusting a resource allocation based on determined performance;
[0027] Figure 14 illustrates an example method of tracking performance metrics using a performance counter and adjusting a resource allocation based on the performance counter;
[0028] Figure 15 illustrates an example method of updating a tracker based on encountering a synchronisation instruction;
[0029] Figure 16 illustrates a system and a chip-containing product; and
Figure 17 illustrates a simulation example.
[0030] In the examples discussed below, an apparatus comprises decoding circuitry configured to decode instructions and processing circuitry configured to perform data processing operations in response to the instructions decoded by the decoding circuitry. The apparatus also comprises extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry, and an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry.
[0032] This approach can provide acceleration opportunities for accelerating certain workloads (such as, for example, memory copy, data compression, or encryption tasks), as the extension processing circuitry can free up the processing circuitry to perform other operations while the extension processing circuitry performs its data processing task. However, unlike alternative techniques for accelerating such computations, the data processing task is performed asynchronously using extension processing circuitry accessible via an interface separate from the memory system interface used by the processing circuitry. This means that the extension processing circuitry can be integrated more closely with the processing circuitry than with alternative acceleration techniques using a remote accelerator, graphics processing unit (GPU) or neural processor which is configurable based on the processing circuitry issuing memory store operations to write control data to shared control data structures stored in the memory system. By providing a configuration interface separate from the load / store mechanism used to access memory, the configuration overhead of configuring the data processing on the extension processing circuitry can be reduced, which opens up opportunities for acceleration on shorter computation tasks for which the high performance cost of configuring memory-based control structures used by a remote accelerator would be prohibitive. On the other hand, compared to use of a coprocessor which processes a stream of instructions offloaded by the processing circuitry synchronously, the asynchronous extension processing circuitry can be implemented with reduced circuit area and power cost as there is less need for circuit logic to be expended on result buses, forwarding and hazarding circuit logic, issue queue structures, physical register file read / write ports, etc. which would be used for controlling interaction between respective instructions executed synchronously. 
With asynchronous processing logic, a dedicated hardware pipeline can be constructed with less need for intermediate results of processing to be accessible by any particular software instruction, reducing the circuit area and power costs of implementing the extension processing circuitry.
[0033] The present inventors have recognized that a resource allocation between the processing circuitry and the extension processing circuitry, which may be set at time of design based on expected quality of service and performance requirements for certain workloads, may not be suitable for later use, such as when processing different workloads. This may particularly apply when the characteristic of that different workload is substantially different from that initially used to set the resource allocation at time of design. Indeed, the present inventors have recognized that for some workloads, the processing circuitry and the extension processing circuitry workloads may starve each other of shared resources. In other cases, the processing circuitry may have excess resources but the extension processing circuitry may have inadequate resources, or vice versa. In cases such as these, it would be useful to be able to adjust the allocation of resources between the processing circuitry and the extension processing circuitry.
[0035] Hence, in the examples discussed below, the apparatus comprises resource allocation adjustment circuitry configured to adjust a resource allocation between the processing circuitry and the extension processing circuitry responsive to a resource adjustment indication. Hence, resources allocated between the processing circuitry and the extension processing circuitry can be adjusted. As a result, situations where the processing circuitry and extension processing circuitry starve each other of resources, or where resources for a given processing task are not suitably shared between the processing circuitry and the extension processing circuitry can be avoided. Hence, overall runtime of a processing workload can be reduced. This improves processing performance.
[0036] In some examples, the resource adjustment indication is associated with a given processing task for the extension processing circuitry. Hence, the indication that controls the resource adjustment may be associated with a particular processing task. As a result, the resource allocation can be adjusted in a specific way for a particular processing task, resulting in improved performance and reduced runtime during processing of the particular processing task. The given processing task may be a task to be performed by the extension processing circuitry or a task that is already being performed when the resource allocation adjustment is made. In some cases, the resource adjustment indication is associated with a given processing task assigned to or offloaded to the extension processing circuitry.
[0037] The resource adjustment indication that controls the resource allocation adjustment is not overly limited. In some examples, the resource adjustment indication comprises a software provided control indication. Hence, the present techniques may support an approach where software is able to control the resource allocation adjustment. This can increase the flexibility and control for a software programmer and provide an efficient mechanism for software to control a resource allocation change. Further, this can avoid overhead associated with having hardware logic for providing a resource adjustment indication.
[0038] Again, the software provided control indication is not particularly limited. In some examples, the software provided control indication comprises a stored entry programmable by software, the entry indicative of an adjustment to be made to the resource allocation. Hence, software can efficiently control the resource allocation adjustment, for example based on known or expected workload characteristics or workload profiling. The entry may specify the resource allocation to use for the adjustment, or may trigger use of a predetermined resource allocation. In some cases, the entry may point to a different location where the resource allocation to be used is stored.
[0040] In some examples, the software provided control indication comprises a plurality of stored entries programmable by software, each associated with a resource allocation for a given extension processing task to be performed by the extension processing circuitry. Thus, multiple entries may be set for a variety of extension processing tasks. This provides finer grain control of the resource allocation adjustment, specific to the task the extension processing circuitry is to perform. As a result, a resource allocation better suited to the extension processing workload can be used, thereby increasing extension processing performance.
[0041] In some examples, the apparatus comprises further extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the further extension processing circuitry, and wherein the software provided control indication comprises stored entries programmable by software for each of the extension circuitry and the further extension circuitry. Hence, in cases where multiple extension processing circuitry is provided, entries can be set for each of the extension processing circuitry. This provides finer grain control of the resource allocation for cases where extension processing and further extension processing circuitry is provided. As a result, the likelihood that the resource allocation is optimized for both the extension processing and further extension processing circuitry is increased. This improves processing performance.
[0042] In some examples, the software provided control indication is defined in a software memory data structure and / or one or more registers. For example, a system or configuration register may be used to enable software to directly control a desired resource allocation between the processing circuitry and the extension processing circuitry. Using a register to control the resource allocation adjustment can avoid the hardware overhead associated with a hardware provided resource adjustment indication. The entry may take various forms, and in some examples may be a bit flag and in other examples provides information indicative of the allocation of resources to be set with the adjustment (such as a number, percentage, ratio, etc. of a given resource).
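As an illustration only, the following Python sketch models one way a register-based control indication could combine a bit flag with a percentage of a given resource. The bit layout, the names `encode_control` and `decode_control`, and the buffer-entry resource are all assumptions made for this sketch and are not taken from the description.

```python
# Hypothetical layout of a software-programmable control register:
#   bit 0      : enable flag (0 = keep the design-time default allocation)
#   bits 1..7  : share of the resource given to the extension circuitry, in percent
ENABLE_BIT = 0x1
SHARE_SHIFT = 1
SHARE_MASK = 0x7F


def encode_control(enable: bool, extension_percent: int) -> int:
    """Pack the enable flag and the extension share into one register value."""
    if not 0 <= extension_percent <= 100:
        raise ValueError("extension_percent must be in 0..100")
    return (int(enable) & ENABLE_BIT) | ((extension_percent & SHARE_MASK) << SHARE_SHIFT)


def decode_control(value: int, total_entries: int, default_extension_entries: int):
    """Return (processing_entries, extension_entries) for a resource of total_entries."""
    if not value & ENABLE_BIT:
        # Flag clear: fall back to the assumed design-time split.
        ext = default_extension_entries
    else:
        percent = (value >> SHARE_SHIFT) & SHARE_MASK
        ext = total_entries * min(percent, 100) // 100
    return total_entries - ext, ext
```

With these assumptions, a register value encoding "enabled, 25 percent" splits a 16-entry buffer into 12 entries for the processing circuitry and 4 for the extension processing circuitry, while a cleared flag leaves the default split in place.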
[0043] In some examples, the software provided control indication comprises a hint instruction indicative of a requested resource allocation or a future extension processing task that the extension processing circuitry is to perform. In some implementations, a hint instruction may be ignored in favour of runtime information or to reduce hardware complexity. In some examples, a hint instruction may be a dedicated instruction or may be included as part of an extension start instruction (such as a dedicated instruction defined in an instruction set architecture or a store instruction specifying an address mapped to an extension task launch control register) to launch a processing task on the extension processing circuitry, for example. In cases where the hint instruction is part of an extension start instruction, the hint instruction may take the form of register operands or metadata held in a given memory location.
[0045] A hint instruction may request a specific resource allocation or describe a workload characteristic. Such an approach enables a specific hardware implementation to then determine how to allocate resources based on implementation specific details, such as a number of address generation units or register ports. As an example, a hint instruction may indicate a memory copy (e.g. memcpy) as a workload with streaming memory accesses, and so a memory system may be controlled to allow the corresponding access to bypass certain cache levels, or preferentially evict corresponding data.
[0046] By supporting hint instructions as a way to control the resource allocation adjustment, software complexity can be reduced and portability between implementations can be increased.
[0047] In some examples, a hint instruction may overrule one or more stored entries (e.g. in a configuration register). For example, a hint instruction may indicate a resource allocation for a given processing task or given extension processing circuitry, which may overrule an existing stored entry associated with a resource allocation for that given processing task or the given extension processing circuitry.
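The precedence between a hint and per-task stored entries can be sketched as below. This is a minimal model under stated assumptions: the task names, the percentage values, the `DEFAULT_SHARE` constant and the `honour_hints` switch (standing in for an implementation that chooses to ignore hints) are hypothetical.

```python
from typing import Dict, Optional

# Hypothetical stored entries: task name -> extension share in percent.
stored_entries: Dict[str, int] = {"memcpy": 50, "compress": 30}
DEFAULT_SHARE = 25  # assumed design-time allocation


def select_share(task: str, hint: Optional[int] = None, honour_hints: bool = True) -> int:
    """Pick the extension share for a task: hint first, then stored entry, then default."""
    if hint is not None and honour_hints:
        # A hint overrules any stored entry for the same task.
        return hint
    return stored_entries.get(task, DEFAULT_SHARE)
```

For example, `select_share("memcpy", hint=70)` returns the hinted value, while the same call with `honour_hints=False` falls back to the stored entry for that task.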
[0048] In some examples, the software provided control indication comprises a barrier instruction indicative of a required resource allocation. In some cases, it can be useful for software to provide a stronger prioritization of resource allocation to hardware than a hint instruction, and in cases such as these a barrier instruction may be used. For example, there may be memory ordering requirements between memory traffic associated with the processing circuitry and memory traffic associated with the extension processing circuitry.
[0049] In some examples, the resource allocation adjustment circuitry is configured to differentiate between memory traffic associated with the processing circuitry and memory traffic associated with the extension processing circuitry based on the respective memory traffic having different partition identifiers, and wherein the apparatus comprises partition identifier circuitry configured to assign, to memory traffic associated with the extension processing circuitry, a partition identifier different from the partition identifier used for memory traffic associated with the processing circuitry. Hence, memory traffic associated with the processing circuitry and the extension processing circuitry can be efficiently differentiated. This can support the resource allocation adjustment by making resource allocation adjustments to a given processing or extension processing circuitry’s memory traffic more efficient. For example, it could be used to enforce a memory partitioning scheme (such as the partitioning of caches, interconnect bandwidth and / or memory system bandwidth) or a quality of service scheme. In some examples, this approach can be used with a memory system resource partitioning and monitoring (MPAM) mechanism, which supports memory system partitioning, discussed further below in relation to Figure 12B.
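A simplified software model of partition-identifier tagging is sketched below. The identifier values, the `MemoryRequest` type and the per-partition cache-way limits are assumptions chosen for illustration; they do not reproduce the MPAM programming model.

```python
from dataclasses import dataclass

# Hypothetical identifiers; real values are implementation specific.
PARTID_PROCESSING = 0
PARTID_EXTENSION = 1


@dataclass
class MemoryRequest:
    address: int
    from_extension: bool
    partid: int = -1  # filled in by assign_partid


def assign_partid(req: MemoryRequest) -> MemoryRequest:
    """Tag a request so downstream components can tell the two sources apart."""
    req.partid = PARTID_EXTENSION if req.from_extension else PARTID_PROCESSING
    return req


# Assumed cache-way limit per partition in an 8-way cache.
way_limit = {PARTID_PROCESSING: 6, PARTID_EXTENSION: 2}


def may_allocate(req: MemoryRequest, ways_in_use: dict) -> bool:
    """Allow a new cache allocation only while the partition is under its limit."""
    return ways_in_use.get(req.partid, 0) < way_limit[req.partid]
```

Changing the values in `way_limit` is then one possible form of resource allocation adjustment applied to the tagged traffic.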
[0050] In some examples, the resource adjustment indication comprises a hardware provided control indication. Hence, additionally, or alternatively, to a software provided control indication, a hardware provided control indication may be used.
[0052] In some examples, the hardware provided control indication comprises an indication that the resource allocation is to be adjusted based on performance of the processing circuitry and / or the extension processing circuitry during previous processing. Hence, actual performance can be used to dynamically control the resource allocation adjustment. As a result, the resource allocation adjustment can be responsive to actual processing performance. This can result in an improved resource allocation that is better suited to the performance of the processing circuitry and / or extension processing circuitry. The performance may be based on execution time, power or energy consumption and / or memory bandwidth, for example.
[0053] In some examples, hardware may not have the ability to measure performance or a given performance metric directly. In such cases, a related indicator may be used instead. For example, a cycle count may be used as an indicator of execution time. In another example, energy consumption may be estimated based on the number of page table walks and cache misses. Hence, even in cases where hardware does not have an ability to measure performance directly, indications of performance may still be used to inform a resource allocation adjustment.
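The use of related indicators can be illustrated with the short sketch below. The linear energy model and both weights are assumptions introduced purely for illustration; the description does not specify how such an estimate is formed.

```python
# Assumed per-event energy weights in arbitrary units; not taken from the description.
WEIGHT_PAGE_TABLE_WALK = 5.0
WEIGHT_CACHE_MISS = 2.0


def estimate_energy(page_table_walks: int, cache_misses: int) -> float:
    """Estimate energy as a weighted sum of two countable events."""
    return WEIGHT_PAGE_TABLE_WALK * page_table_walks + WEIGHT_CACHE_MISS * cache_misses


def estimate_time_seconds(cycles: int, clock_hz: float) -> float:
    """Use a cycle count as an indicator of execution time."""
    return cycles / clock_hz
```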
[0054] In some examples, the apparatus comprises resource prediction circuitry configured to generate the resource adjustment indication by predicting a required future resource allocation based on a performance counter for tracking performance metrics of the processing circuitry and / or the extension processing circuitry. Hence, a hardware predictor may be used to adjust the resource allocation. This can result in a resource allocation that is better suited to a given future processing task, and thus can result in increased processing performance and reduced workload runtime.
[0055] In some examples, the resource prediction circuitry is configured to determine the performance metrics during a previous processing period when the extension processing circuitry was performing a processing task asynchronously to the processing circuitry. Hence, previous processing performance during a processing task may be used to inform the resource allocation adjustment. In some cases, the performance metrics are associated with a previous processing task which is the same as a future processing task, and thus the resources can be adjusted based on a known performance of a given processing task.
[0056] The performance metrics are not particularly limited, and may include one or more of a latency, a number of cache misses, a number of page table walks, and a number of processing stalls.
[0057] In some examples, the performance counter is initialized based on software provided information to bias initial values of the performance metrics or provide bounds for resource allocation adjustment. Hence, software can be used to warm up the predictor and thus reduce warm up times, bound quality of service requirements and improve its runtime characteristics. In some examples, the resource predictor state may be tagged with a context identifier identifying a processing context associated with the resource predictor entry (such as an address space identifier, a virtual machine identifier, and / or an exception level identifier). This can reduce warmup times and avoid predictions influencing other processes. This state may be managed in accordance with one or more security mitigations to prevent malicious predictor training.
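A software-biased, bounded and context-tagged predictor could be modelled along the following lines. This is a sketch under assumptions: the class `SharePredictor`, the stall-count comparison, the step size and the tuple used as a context tag are hypothetical and are not the prediction scheme of the description.

```python
class SharePredictor:
    """Predicts the extension share (percent) from observed stall counts."""

    def __init__(self, initial_share: int, lower: int, upper: int, step: int = 5):
        self.lower, self.upper, self.step = lower, upper, step
        # Software-provided bias, clamped to the software-provided bounds.
        self.share = max(lower, min(initial_share, upper))

    def update(self, processing_stalls: int, extension_stalls: int) -> int:
        """Shift the share towards whichever side stalled more in the last period."""
        if extension_stalls > processing_stalls:
            self.share += self.step
        elif processing_stalls > extension_stalls:
            self.share -= self.step
        self.share = max(self.lower, min(self.share, self.upper))
        return self.share


# Predictor state keyed by an assumed context tag, for example
# (address space identifier, virtual machine identifier).
predictors = {}


def predictor_for(context, initial_share=25, lower=10, upper=60):
    """Return the predictor for a context, creating it on first use."""
    if context not in predictors:
        predictors[context] = SharePredictor(initial_share, lower, upper)
    return predictors[context]
```

Keeping one predictor per context tag means that training in one context does not change the prediction used for another, which is the isolation property the paragraph above describes.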
[0059] In some examples, the resource allocation adjustment circuitry is configured to maintain a tracker based on whether, on encountering a synchronization instruction to synchronize operation of the extension processing circuitry and the processing circuitry after a processing task has been offloaded to the extension processing circuitry, the extension processing circuitry had already completed its associated processing task, and wherein the resource adjustment indication comprises the tracker. Hence, resource allocation can be adjusted based on whether the processing circuitry or extension processing circuitry was first to complete its associated processing task. This can provide an efficient and lightweight way to adjust the resource allocation.
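One lightweight form such a tracker could take is a saturating counter, sketched below. The counter width, the update rule and the mapping from counter value to adjustment direction are assumptions for illustration only.

```python
class SyncTracker:
    """Saturating counter updated each time a synchronisation instruction is seen."""

    def __init__(self, limit: int = 3):
        self.limit = limit  # assumed saturation bound
        self.value = 0      # positive: the extension task tends to finish late

    def on_sync(self, extension_done: bool) -> None:
        """Record whether the extension task had completed when the sync was reached."""
        if extension_done:
            self.value = max(self.value - 1, -self.limit)
        else:
            self.value = min(self.value + 1, self.limit)

    def adjustment(self) -> int:
        """+1: give the extension circuitry more, -1: give it less, 0: no change."""
        if self.value == self.limit:
            return 1
        if self.value == -self.limit:
            return -1
        return 0
```

Only acting once the counter saturates is a design choice in this sketch that avoids changing the allocation in response to a single unrepresentative task.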
[0060] In some examples, to adjust the resource allocation, the resource allocation adjustment circuitry is configured to reserve a resource for use by one of the processing circuitry or extension processing circuitry. For example, a prefetcher resource (such as a level one prefetcher) may be reserved for exclusive use by the processing circuitry, and the extension processing circuitry may be restricted to use another prefetcher resource, such as one out of a plurality of level two prefetchers. This would allow the prefetchers to account for different memory access patterns (e.g. sequential accesses by the processing circuitry and pointer chasing by the extension processing circuitry). The deliberate use of level one and two prefetchers could also account for differences in temporal and spatial data locality, as well as different degrees to which data is shared between multiple processor cores. In another example, part of a cache may be partitioned and reserved for use by one of the processing circuitry or extension processing circuitry. A cache partition may later be adjusted as part of a later resource allocation adjustment.
[0061] In some examples, to adjust the resource allocation, the resource allocation adjustment circuitry is configured to adjust a ratio of resource assigned to the processing circuitry and the extension processing circuitry. For example, a number of buffer entries or a number of address generation units assigned to the processing circuitry or the extension processing circuitry may be changed.
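Adjusting a ratio of buffer entries can be sketched as a bounded rebalancing step. The function name `rebalance` and the `min_each` floor, which keeps at least one entry on each side, are assumptions made for this sketch.

```python
def rebalance(total: int, extension_entries: int, step: int, min_each: int = 1):
    """Move `step` entries towards the extension circuitry (negative: away from it).

    Returns (processing_entries, extension_entries).
    """
    ext = extension_entries + step
    # Clamp so that neither side drops below the assumed minimum.
    ext = max(min_each, min(ext, total - min_each))
    return total - ext, ext
```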
[0062] In some examples, to adjust the resource allocation, the resource allocation adjustment circuitry is configured to prioritise a resource for use by the processing circuitry or the extension processing circuitry. For example, a cache eviction policy may be set to favour the processing circuitry or the extension processing circuitry.
[0063] The resources for which the allocation may be adjusted by the resource allocation adjustment circuitry are not particularly limited. In some examples, the resources comprise one or more of memory system resources, memory load / store processing resources, input / output resources and power resources. For example, the resources may be one or more of the following:
[0064] • processing resources, such as:
    o slots in a Load Store Unit;
    o address generation units;
    o address translation units;
    o register ports; and
    o internal buffers;
[0066] • power resources, such as:
    o power budget;
    o power gating to reduce the power consumption of inactive hardware components; and
    o changing the operating voltage and / or frequency (e.g. using DVFC);
[0067] • shared caches, in the form of:
    o reserved capacity;
    o prioritized data eviction; and
    o allocation of prefetchers (use different prefetchers; or have a prefetcher ignore one of processing circuitry or extension processing circuitry);
[0068] • interconnect bandwidth, in the form of:
    o prioritization on communication links; and
    o allocation of buffer slots;
[0069] • main memory bandwidth, in the form of:
    o prioritized DRAM bus bandwidth.
[0070] To partition processing resources (such as CPU internal resources), the CPU may distinguish accesses from extension processing circuitry and accesses from the host with a bit on its data bus, and corresponding bits on specific resources, such as queue slots.
[0071] In some examples, the processing circuitry and the extension processing circuitry may share a private cache private to a processing element comprising the processing circuitry and the extension processing circuitry and not directly accessible to any other processing element of the apparatus. This can allow the processing circuitry and extension processing circuitry to exchange configuration information, status information and / or results of extension tasks faster than if communication of this information had to be performed via a shared cache or main memory without the extension processing circuitry having direct access into the private cache of the processing circuitry.
[0072] In some examples, the processing circuitry and the extension processing circuitry may share translation table walk circuitry configured to control translation table walk operations for obtaining translation table data from a memory system, such as address translation information. Reusing the translation table walk circuitry of the processing circuitry for memory accesses triggered by the extension processing circuitry which miss in a translation lookaside buffer saves circuit area by avoiding the need to duplicate the translation table walk circuitry at both processing circuitry and extension processing circuitry. In some examples, the processing circuitry and extension processing circuitry could also share at least one translation lookaside buffer (TLB) for caching address translation information obtained in a translation table walk operation. However, it is also possible for the extension processing circuitry to have its own dedicated TLB looked up for memory accesses triggered by the extension processing circuitry to identify address translation information. Nevertheless, if the TLB of the extension processing circuitry detects a miss for a given address to be accessed by the extension processing circuitry, the shared translation table walk circuitry associated with the processing circuitry can be used to perform the translation table walk operation to find the missing address translation information.
[0074] Specific examples are now described with reference to the drawings.
[0075] Figure 1 illustrates an example of an apparatus 10. The apparatus 10 may for example be a data processing system such as a system-on-chip or collection of chiplets implementing at least one processor and its memory storage. For example, the components of the apparatus 10 illustrated in Figure 1 may be part of a given processor, e.g. a central processing unit (CPU). While Figure 1 shows the processor as a standalone apparatus (a design for an individual processor core could, for example, be licensed as a separate product from other parts of a wider processing system), as shown in Figure 2 discussed further below the apparatus 10 could also form part of a wider processing system 2 which comprises two or more processors 10 capable of executing respective threads of processing in parallel with each other.
[0076] The apparatus 10 includes decoding circuitry 13 which decodes instructions fetched from an instruction cache or a memory system, and processing circuitry 6 which processes the instructions decoded by the decoding circuitry 13 to perform data processing operations on operands obtained from registers 8 or the memory system, to generate processing results which may be written back to the registers 8 or to the memory system. The processing circuitry 6 comprises a pipeline comprising a number of pipeline stages for performing respective functions in response to the instructions, with the pipeline stages operating in a pipelined manner so that a later pipeline stage can be performing a later stage of processing on an older instruction in parallel with an earlier pipeline stage performing an earlier stage of processing on a younger instruction which appears later in program order than the older instruction. In some instances, it is also possible to perform out-of-order processing where a younger instruction in program order can bypass an older instruction to be executed in an order which differs from the order in which those instructions appear in program order.
[0077] Instructions processed by the pipeline of the processing circuitry 6 may be processed synchronously, such that for a given instruction the access to registers 8 to obtain instruction operands and writeback to registers 8 to write a processing result can be synchronised in timing relative to register read / write operations for other instructions. For synchronously processed arithmetic / logical instructions, a given instruction type may be associated with a certain defined number of cycles required for the instruction to execute, so that if the instruction is dispatched for execution in a given cycle then its result is guaranteed to be available by a certain subsequent cycle. Also, for the synchronously processed instructions, the architectural result of that instruction is made available as part of executing the instruction itself, so commitment of the synchronously processed instruction implies the result of that instruction will be made available for reference by a subsequent instruction. If a given synchronously processed instruction is stalled, then any dependent operations referencing the result of that instruction may also be blocked from being executed (unless a speculation mechanism is provided to predict the result of the synchronously processed instruction to break the chain of dependency).
[0079] The apparatus 10 also has extension processing circuitry 23 to which the decoding circuitry 13 can, in response to an extension task offloading instruction, offload an extension task which is to be performed by the extension processing circuitry 23 asynchronously with respect to other data processing operations performed by the processing circuitry 6. Unlike for synchronously processed instructions, the result of the extension task is not guaranteed to be available once the extension task offloading instruction has been committed. Instead, separate instructions (separate from the offloading instruction) may be decoded to allow querying of whether the extension task is complete and to obtain any results. The extension task offloading instruction is non-blocking in that it can be committed when the extension processing circuitry 23 has accepted the offloaded extension task (or otherwise indicated that it is unavailable to accept the offload extension task), but does not require its commitment to be delayed until the extension task is actually performed. This means that younger instructions in the thread of processing including the extension task offloading instruction can continue to be processed on the processing circuitry 6, while the extension task is performed asynchronously on the extension processing circuitry 23 in the background of ongoing processing on the processing circuitry 6. The instructions processed synchronously on the processing circuitry 6 may, at a later point of program flow, query whether the extension processing circuitry 23 has completed its task and if so obtain any results either directly from extension processing circuitry 23 or from a cache or memory. 
As shown in Figure 1, the apparatus 10 has an extension task offload interface 24 providing a direct configuration path for the decoding circuitry 13 to cause offloading of an extension task to the extension processing circuitry 23, where the direct configuration path is separate from the path by which the processing circuitry 6 issues requests (e.g. coherence transactions) to a memory system to request access to data stored in the memory system. This means the extension processing circuitry 23 can be integrated directly into the regular processing circuitry 6 of a processor, rather than being a remote accelerator accessed via the memory system. In some examples, the extension processing circuitry 23 may have direct access to the register file 8 used by the processing circuitry 6, which can be useful during the handover phase when an extension task is being offloaded to the extension processing circuitry 23 or when the result of the extension task is being transferred back to the processing circuitry 6, to allow parameters of the extension task and results to be shared between the processing circuitry 6 and the extension processing circuitry 23 via the registers 8. It is also possible for such sharing of parameters and results to be via a private cache (e.g. level 1 cache) associated with the processing circuitry 6.
[0081] The type of extension task supported by the extension processing circuitry 23 is not particularly limited, and may for example be associated with a memory copy, compression, or encryption task. By offloading extension tasks to the extension processing circuitry, the extension tasks can be carried out asynchronously using relatively lightweight circuit logic, thereby freeing up CPU execution resource for other operations, providing a significant performance speed up at low additional circuit area cost.
[0082] Figure 1 discussed above shows components of an individual processor 10. However, Figure 2 shows an example of the processor 10 in the wider context of a data processing system 2. The processing system 2 comprises at least one CPU 10 which comprises the decoding circuitry 13, processing circuitry 6, registers 8 and extension processing circuitry 23 (and extension task offload interface 24, although the interface is not explicitly shown in Figure 2) as discussed above. There could also be at least one other CPU 10 which does not include the extension processing circuitry. The CPUs 10 are examples of memory system requesters which access shared memory 110 via an interconnect 106. The memory 110 may also be shared with other types of memory system requester, such as a graphics processing unit (GPU) 100, input / output (I / O) device 102 or remote hardware accelerator 104. The hardware accelerator 104 is coupled to the memory system interconnect 106, remote from the CPU 10. Software executing on a CPU 10 can configure the hardware accelerator 104 to perform a particular class of processing function on data stored in memory 110, by configuring control data structures also stored in the memory 110 which define command queues and / or other parameters for controlling the hardware accelerator 104. Hardware accelerator commands are defined as part of a command set dedicated to a particular hardware implementation of hardware accelerator, rather than being generic ISA instructions in the instruction set supported by the instruction decoding circuitry 13 of a CPU 10. 
As the configuration path between the processing circuitry 6 of a CPU 10 and the hardware accelerator 104 is via memory-based data structures, offloading of operations from CPU 10 to hardware accelerator 104 is much slower than offloading of an extension task from processing circuitry 6 to extension processing circuitry 23, as memory accesses to those structures may contend for bandwidth on the memory system interconnect 106 shared with the other requesters 10, 100, 102. The same applies where the CPU 10 configures the GPU 100 to carry out processing.
[0083] Figures 3 and 4 show two examples of how the extension processing circuitry 23 could be provided in association with a particular processor 10 (CPU).
[0084] In the example of Figure 3, the processor (CPU) 10 is schematically shown to have a pipelined configuration, which for the purposes of brevity and clarity is shown in a conceptual representation here. The illustrated pipeline stages comprise an instruction cache 11, a fetch stage 12, a decode stage 13, a micro-op cache 14, an issue stage 15, and a register access stage 16. A sequence of instructions is retrieved from memory (not shown) and cached in the instruction cache 11. The fetch stage 12 controls which instructions are retrieved as the sequence of instructions and these instructions are then decoded in the decode stage 13. This decoding essentially identifies the type of each instruction, as well as any further operands specified by the instruction, and generates control signals to control the remainder of the apparatus to perform the data processing operation(s) defined by the instruction. Decoding the instructions may comprise splitting an instruction into one or more micro-ops, and these micro-ops can be cached in the micro-op cache 14. The final stage of the pipeline before execution is the issue stage 15, where instructions (or micro-ops) are queued pending the availability of the register values they specify as operands and the corresponding functional unit of the data processing pipeline which will carry out the defined operation. Generally, the data processing operation(s) defined by the instructions are carried out by the functional units that form part of the data processing pipeline, namely the load / store unit 17, and the execute units 18 and 19. These latter execute units may for example be arithmetic logic units (ALUs), floating point units (FPUs), and so on. The functional units that form part of the data processing pipeline perform their data processing operations on data values which are provided from a set of registers (conceptually represented by the register access stage 16 in the figure) and result values of those data processing operations are returned to the set of registers. The load / store unit 17 is provided for the purpose of storing values from the set of registers to the memory system, of which only a level 1 cache 21 and a level 2 cache 22 are shown in the figure. At least the L1 cache 21 is private to the CPU 10 and the L2 cache 22 could be either private or shared with another CPU 10, when part of a wider data processing system. 
The data processing apparatus 10 is also shown to comprise a branch unit 20, which is used to execute branch instructions and which may feed back information about branch outcomes to the fetch stage 12 for use in training a branch predictor provided in the fetch stage 12 for predicting outcomes of branch instructions.
[0086] The processor 10 also comprises extension processing circuitry 23, which is provided to support efficient performance of one or more defined processing tasks. The extension processing circuitry is closely associated with the data processing pipeline and is configured to perform the defined processing task (also referred to herein as a delegated task or extension task) in response to a delegation signal received from the data processing pipeline. The extension processing circuitry 23 is an example of a threadlet extension (TE). The sequence of operations it carries out to perform the defined function can be referred to as a threadlet. The extension processing circuitry 23, although closely associated with the data processing pipeline, is configured to perform the delegated task asynchronously to the data processing operations performed by the data processing pipeline. Threadlets are functions or collections of operations that can be executed asynchronously relative to other CPU activity once launched. The directive or command sent to the extension processing circuitry 23 to initiate the delegated task is generated in response to an extension task offload instruction, such as an extension start instruction defined for this purpose in the instruction set of the data processing pipeline (an XSTART instruction is an example of an extension start instruction). In other examples, the extension start instruction is not defined as a dedicated instruction in an instruction set, but rather could be a store instruction to a particular address range.
[0088] One approach can be that the extension task offloading instruction comprises an instruction which writes to a control register. For example, a set of control registers (e.g. system registers or memory-mapped registers) may be provided which are used as an interface for configuring the extension processing circuitry to perform an offloaded extension task. The control registers may include registers for defining parameters of the extension task (e.g. the address of the low-precision data to be processed by the low-precision computation extension task). The control registers may also include a launch register. In some examples, the task offloading instruction could therefore be an instruction which writes to the launch register, triggering offloading of the extension task to the extension processing circuitry (with the extension processing circuitry also being passed any other control parameters which may have been written to other registers of the set of control registers prior to executing the instruction which writes to the launch register).
[0089] In the case where the task offloading instruction is an instruction which writes to a control register, the instruction type of the control register updating instruction could be a generic instruction type also used for other control register updates not related to extension task offload. For example, the task offloading instruction could be a system register updating instruction, or a store instruction specifying an address mapped to the launch register. The task offloading instruction may therefore be associated with a parameter that distinguishes whether a particular instance of the instruction represents the task offloading instruction or represents a register update instruction or memory access instruction not related to extension task offload. That parameter could be defined in different ways. In some examples, the parameter could be specified as an operand of the task offloading instruction, e.g. a register operand where the instruction references a selected register and the value stored in that register defines the parameter identifying whether the instruction is the task offloading instruction. Alternatively, the parameter may be stored in another control register other than the control register written by the task offloading instruction, where that parameter may have been written to that other control register by an earlier instruction than the task offloading instruction. 
For example, where the set of control registers includes the launch register and one or more parameter registers for defining parameters for controlling the offloaded extension task, the task offloading instruction may specify that the launch register should be updated, but the parameter distinguishing that the particular extension task to be offloaded is the low-precision computation extension task, say (or any other type of extension task supported by the extension processing circuitry), may be in one of the parameter registers and so may not itself be directly specified by the task offloading instruction. Nevertheless, even though the overall sequence of instructions for controlling the offload may involve a sequence of multiple instructions (one or more instructions to set the parameter registers, followed by the instruction writing to the launch register), the final instruction of the sequence which writes to the launch register can be regarded as a task offloading instruction which actually causes the low-precision computation extension task to be offloaded to the extension processing circuitry.
[0091] Hence, it will be appreciated that there are a variety of ways in which the extension task offload interface for controlling offload of extension tasks to the extension processing circuitry can be controlled using instructions decoded by the decoding circuitry.
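The launch-register style of offload described above can be illustrated with a minimal software sketch. The class name, the register names (`TASK_TYPE`, `DATA_ADDR`, `LAUNCH`) and the dictionary-based register block are hypothetical and chosen only for illustration; they are not defined by the present technique.

```python
class ExtensionControlRegisters:
    """Hypothetical register block: parameter registers plus a launch register."""

    LAUNCH = "LAUNCH"

    def __init__(self):
        self.regs = {}
        self.offloaded = []  # tasks handed to the (modelled) extension circuitry

    def write(self, name, value):
        self.regs[name] = value
        if name == self.LAUNCH:
            # Only the write to the launch register triggers the offload;
            # parameters written earlier are passed along with it.
            params = {k: v for k, v in self.regs.items() if k != self.LAUNCH}
            self.offloaded.append(params)


ctrl = ExtensionControlRegisters()
ctrl.write("TASK_TYPE", "low_precision_compute")  # parameter register write
ctrl.write("DATA_ADDR", 0x8000)                   # parameter register write
pending_before_launch = list(ctrl.offloaded)      # still empty at this point
ctrl.write("LAUNCH", 1)                           # the task offloading write
```

In this sketch the first two writes are ordinary register updates, and only the final write acts as the task offloading instruction.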
[0092] Thus, an extension start instruction progresses along the data processing pipeline in the manner that any other CPU instruction would, but when the decoding circuitry 13 identifies the extension start instruction it can signal directly to the extension processing circuitry 23. The close integration of the extension processing circuitry 23 with the data processing pipeline is illustrated by the fact that the extension processing circuitry 23 has direct access to the load / store unit 17, and thus it shares the data processing pipeline’s path to memory (e.g. having access to the private cache 21 of the CPU 10). The extension processing circuitry 23 can also share translation table walk circuitry (not shown in Figure 3) which is used to obtain address translation information from memory. The extension processing circuitry 23 also has access to the set of registers 8 accessed by register access stage 16, such that for example, the extension start instruction can specify one or more registers as operands, and the values from these registers are then passed directly to the extension processing circuitry 23 in association with the command sent to initiate the delegated task. Upon completion of the task, results of the delegated task can be returned to the registers via an extension synchronisation instruction. An extension synchronisation instruction returns to a register one or more values produced by the extension processing circuitry (or reports a status of the extension processing circuitry), and delays operations dependent on such registers until the extension processing task is complete. The extension synchronisation instruction may be a dedicated instruction defined in an instruction set architecture (referred to herein as an XSYNC instruction).
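As a software analogy only, the offload-then-synchronise pattern can be sketched as follows. The function names `xstart` and `xsync` mirror the instruction names, but the Python functions, the single-worker thread pool standing in for the extension processing circuitry, and the memory copy task are illustrative assumptions rather than a description of the hardware.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# Hypothetical stand-in for the extension processing circuitry: one worker
# that runs offloaded tasks in the background.
_extension = ThreadPoolExecutor(max_workers=1)


def xstart(task, *operands) -> Future:
    """Offload a task; returns once the task is accepted, not once it completes."""
    return _extension.submit(task, *operands)


def xsync(handle: Future):
    """Wait for the offloaded task to complete and return its result."""
    return handle.result()


def memory_copy(src):
    # Example extension task: copy a buffer.
    return bytes(src)


handle = xstart(memory_copy, b"payload")  # offload; does not wait for completion
other_work = sum(range(10))               # younger instructions keep executing
copied = xsync(handle)                    # dependent work waits here
```

The point of the sketch is the ordering: work issued between `xstart` and `xsync` proceeds without waiting for the offloaded task.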
[0093] Figure 4 schematically illustrates an alternative implementation of the apparatus 10 according to some examples. This example provides a comparison to the examples of Figure 3, in which examples the extension processing circuitry was closely embedded with the data processing pipeline, to the extent that those instances of extension processing circuitry may be considered to be within the CPU. In the example of Figure 4, the apparatus 10 comprises a CPU 51 and separate extension processing circuitry (threadlet extension) 23 which are not as closely integrated. For example, this is illustrated by the fact that each has its own path to memory, with an L1 cache 53 private to the CPU 51 and an L1 cache 54 private to the threadlet extension 23. They share the L2 cache 55 (which can still be regarded as a private cache of the CPU 51 as this cache may not be shared with any other memory system requester, so a cache coherency protocol implemented by system interconnect 106 may treat the L2 cache as if it is a private cache). Nevertheless, the threadlet extension 23 remains tightly coupled to the CPU 51, and can be launched quickly when an extension start instruction is encountered in the CPU pipeline specifying the function this threadlet extension 23 performs. The threadlet extension 23 can get data directly from CPU registers at the start of its execution. Upon completion, it can return values via an extension synchronisation instruction. Figure 4 also shows the threadlet extension 23 as having its own private TLB 56, in which it can cache currently used address translations. As a preparatory step before or associated with the delegation signal, content from the TLB 57 in the CPU 51 can be copied into the private TLB 56 in order to pre-warm this cache before the threadlet begins operation. If a memory access request issued by the extension circuitry 23 misses in its private TLB 56, a signal may be issued to a memory management unit (MMU) 58 of the CPU 51 which causes translation table walk circuitry 59 of the CPU 51 to obtain address translation information from memory and return the required address translation information to the private TLB 56 of the threadlet extension processing circuitry 23 for use in translating a memory address specified by the memory access request.
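The pre-warming and miss handling of the private TLB 56 can be modelled with a short sketch. The class `PrivateTLB`, the `walk` callback standing in for the MMU 58 and translation table walk circuitry 59, and the page numbers are all hypothetical values used for illustration.

```python
class PrivateTLB:
    """Toy model of a private TLB that falls back to the CPU's table walker."""

    def __init__(self, walk):
        self.entries = {}
        self.walk = walk  # callback standing in for the CPU's MMU / table walk
        self.misses = 0

    def prewarm(self, cpu_tlb):
        # Copy translations from the CPU TLB before the threadlet starts.
        self.entries.update(cpu_tlb)

    def translate(self, vpage):
        if vpage not in self.entries:
            # Miss: ask the CPU-side walker and cache the returned translation.
            self.misses += 1
            self.entries[vpage] = self.walk(vpage)
        return self.entries[vpage]


page_table = {0x1: 0xA1, 0x2: 0xB2, 0x3: 0xC3}  # hypothetical mappings
cpu_tlb = {0x1: 0xA1}                            # what the CPU has cached
tlb = PrivateTLB(walk=page_table.__getitem__)
tlb.prewarm(cpu_tlb)
```

After pre-warming, a lookup of page `0x1` hits without a walk, while the first lookup of page `0x2` triggers one walk and later lookups of that page hit.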
[0095] Figure 5 shows example runtime estimations for a CPU and a threadlet (i.e. TE) in an unmanaged resource allocation case and a managed resource allocation case. For the unmanaged case, where resource allocation adjustment is not performed, it can be seen that the threadlet has a greater runtime caused by a lack of threadlet resources, which holds back the total runtime even if the CPU has a short runtime. By performing a resource allocation adjustment (i.e. the managed case) to adjust the allocation of resources between the threadlet and the CPU by assigning CPU resource to the threadlet, the total runtime can be reduced. This can be seen from the reduction in total runtime from the unmanaged case to the managed case.
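The effect described for Figure 5 can be illustrated numerically under a deliberately simple assumed model, in which the CPU and the threadlet run in parallel, each runtime is inversely proportional to its share of a single resource, and the total runtime is the longer of the two. The model and all numbers below are hypothetical and are not taken from Figure 5.

```python
def total_runtime(cpu_work, te_work, te_share):
    """Total runtime when CPU and threadlet run in parallel on a split resource.

    Assumed model: runtime = work / share, with te_share in (0, 1) and the
    CPU receiving the remainder.
    """
    cpu_runtime = cpu_work / (1.0 - te_share)
    te_runtime = te_work / te_share
    return max(cpu_runtime, te_runtime)


# Hypothetical workload: the threadlet has three times the work of the CPU.
unmanaged = total_runtime(cpu_work=20, te_work=60, te_share=0.5)   # even split
managed = total_runtime(cpu_work=20, te_work=60, te_share=0.75)    # shifted to TE
```

With these assumed values the even split is limited by the threadlet, and moving resource to the threadlet shortens the total even though the CPU's own runtime grows.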
[0096] Figure 6 shows a data processing apparatus 10 according to the present technique. Apparatus 10 includes decoding circuitry 13, processing circuitry 6, extension processing circuitry 23, and an extension task offload interface 24 as discussed above. Apparatus 10 also includes resource allocation adjustment circuitry 60 to adjust a resource allocation between the processing circuitry 6 and the extension processing circuitry 23 responsive to a resource adjustment indication.
[0097] Operation of the resource allocation adjustment circuitry 60 is shown in figure 7. At step 702, a resource adjustment indication is determined. As discussed further below, the resource adjustment indication is not particularly limited, and may in some examples be software provided and / or hardware provided. At step 704 and responsive to the resource adjustment indication, a resource allocation between the processing circuitry 6 and the extension processing circuitry 23 is adjusted. As discussed further below, this adjustment is not particularly limited, and may in some examples include reserving a resource, adjusting an amount or ratio of an assigned resource, or prioritising a resource. In this way, a resource allocation between the processing circuitry 6 and the extension processing circuitry 23 can be adjusted.
[0098] In one example, the resource allocation adjustment circuitry may be configured to monitor outstanding memory requests originating from the processing circuitry and the extension processing circuitry on encounter with a synchronisation instruction, and on future iterations of the code, assign different prioritizations to the processing circuitry and the extension processing circuitry on this basis.
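One possible reading of this example is sketched below. The policy of prioritising whichever side had more outstanding memory requests at the synchronisation point is an assumption made for illustration; the text does not fix a particular policy, and the function name and return values are hypothetical.

```python
def priority_for_next_iteration(cpu_outstanding, ext_outstanding):
    """Choose which side to prioritise on the next iteration of the code.

    Inputs are the outstanding memory request counts sampled when a
    synchronisation instruction is encountered.
    """
    if ext_outstanding > cpu_outstanding:
        return "extension"
    if cpu_outstanding > ext_outstanding:
        return "processing"
    return "equal"


# Hypothetical sample: the extension circuitry still had 12 requests in flight
# at the synchronisation point, the processing circuitry only 3.
decision = priority_for_next_iteration(cpu_outstanding=3, ext_outstanding=12)
```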
[0100] The resource adjustment indication may comprise a software provided control indication. Examples of software provided control indications are shown in figures 8A to 8D.
[0101] Figure 8A shows a stored entry programmable by software. The entry may be defined in a software memory data structure, or one or more registers. This figure shows three stored entries defined in a system control register, each entry associated with a resource allocation for a given extension processing task (i.e. a processing task that is to be performed by the extension processing circuitry 23). In this example, for a memory copy task, the resource allocation is for the extension processing circuitry to be allocated 60% of a given resource (compared to 40% for the CPU for example). The resource being allocated may vary depending on implementation and is not particularly limited. In this example, for a compression task or an encryption task, the extension processing circuitry is to be allocated 20% of a given resource.
[0102] Hence, the resource allocation adjustment circuitry may determine the processing task that the extension processing circuitry is to perform, check the system control register, and responsive to a stored entry associated with the processing task to be performed, adjust the resource allocation between the extension processing circuitry and the processing circuitry. For example, responsive to a determination that a memory copy task is to be performed by the extension processing circuitry, the resource allocation adjustment circuitry may adjust the resource allocation between the extension processing circuitry and the processing circuitry based on the stored entry associated with a memory copy task. As a result, the resource allocation can be adjusted based on the specific task being performed to reduce task runtime.
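The lookup described for Figure 8A can be sketched as a small table keyed by task type. The percentages are those given in the Figure 8A example; the dictionary layout, the function name, and the 50/50 fallback for a task with no stored entry are assumptions made for illustration.

```python
# Extension circuitry share (percent) per task, as in the Figure 8A example.
# The dict representation of the system control register is an assumption.
SYSTEM_CONTROL_ENTRIES = {
    "memory_copy": 60,
    "compression": 20,
    "encryption": 20,
}


def allocation_for(task, current=(50, 50)):
    """Return (extension_pct, processing_pct) for the task to be offloaded.

    If no entry is stored for the task, the current allocation is kept
    (the 50/50 default is a hypothetical starting point).
    """
    ext_pct = SYSTEM_CONTROL_ENTRIES.get(task)
    if ext_pct is None:
        return current
    return (ext_pct, 100 - ext_pct)


memcpy_alloc = allocation_for("memory_copy")
```

For a memory copy task this yields 60% for the extension processing circuitry and 40% for the processing circuitry, matching the example in the text.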
[0103] It will be appreciated that figure 8A shows an example of stored entries, processing tasks and example resource allocation values, but that these may vary depending on implementation. In particular, it will be appreciated that the entries may be defined in one or more registers (such as a register for each given extension processing task), and that the specific values are only shown as an example. Indeed, it will be appreciated that the level and nature of information provided by the stored entry may vary depending on implementation. In some examples, the stored entry is indicative of a percentage / ratio or an absolute value for a resource allocation. In other examples, where the resource relates to prioritisation of bandwidth for example, the stored entry may be a single bit flag to set whether the extension processing circuitry is prioritised over the processing circuitry. The present disclosure is not particularly limited in this respect.
[0104] Figure 8B shows a hint instruction. The hint instruction may be indicative of a requested resource allocation or a future extension processing task that the extension processing circuitry is to perform. For example, resource allocations may be set based on previous characterisation of a given processing task and so a hint instruction may provide an indication that the given processing task is to be performed and provide an indication as to the resource allocation to use. In some cases, the hint instruction may be indicative of a future processing workload characterisation, which can then be used to adjust the resource allocation in a non-task specific way.
[0106] Figure 8C shows an example of a hint instruction, an XSTART instruction. As discussed in relation to figure 3, an XSTART instruction is an example of an extension start instruction, and so the XSTART instruction itself may be used to provide the software provided control indication. For example, the XSTART instruction may specify a register operand or metadata stored at a location in memory to provide the indication to adjust the resource allocation.
[0107] Figure 8D shows a barrier instruction. The barrier instruction, in contrast to a hint instruction, may indicate a required resource allocation (rather than a requested one that may be ignored in the case of a hint instruction). For example, in some cases there may be memory ordering requirements and so a barrier instruction can be used to enforce memory ordering. The barrier instruction, as for the hint instruction and extension start instruction, may be extended to provide an indication of the resource allocation adjustment.
[0108] Figures 9A, 9B, 9C show steps for adjusting a resource allocation based on a software programmable stored entry, hint instruction, and barrier instruction, respectively.
[0109] Figure 9A includes steps 902 and 904. At step 902, a software programmable stored entry indicative of an adjustment to be made to the resource allocation is determined. For example, one or more registers or data structures may be checked. These may be checked periodically or in response to determining a given processing task or instruction. In other examples, an update to the register or data structure may trigger the resource allocation adjustment circuitry to check the register or data structure. At step 904, the resource allocation between the processing circuitry and the extension processing circuitry is adjusted based on the stored entry. As discussed above, the stored entry may itself be indicative of the adjustment to the resource allocation. For example, the stored entry may specify the resource allocation adjustment to use. In other examples, the stored entry may indicate that a resource allocation is to be adjusted based on a bit flag or multibit flag, for example by incrementing or decrementing a given resource allocation or setting a cache eviction policy or prioritisation level for memory traffic.
[0110] Figure 9B includes steps 906 and 908. At step 906, a hint instruction indicative of a requested resource allocation or a future extension processing task is encountered. As discussed, in some implementations, a hint instruction may be ignored, may be bounded by one or more configuration registers, or may supersede or be superseded by other resource allocation indications (such as a stored entry in a system control register or a resource allocation indication generated by the resource prediction circuitry). At step 908, the resource allocation between the processing circuitry and the extension processing circuitry is adjusted based on the hint instruction. In some examples, the hint instruction specifies the resource allocation to be used for the adjustment. In other examples, the resource allocation adjustment circuitry is configured to determine the resource allocation for the adjustment based on the hint instruction. For example, the resource allocation adjustment circuitry may check one or more data structures or registers for the resource allocation to use for the adjustment in response to encountering the hint instruction. Figure 9B also applies to an extension start instruction, such as an XSTART instruction in place of the hint instruction.
[0112] Figure 9C includes steps 910 and 912. At step 910, a barrier instruction indicative of a required resource allocation is encountered. At step 912, the resource allocation between the processing circuitry and the extension processing circuitry is adjusted based on the barrier instruction. In some examples, the barrier instruction specifies the required resource allocation to be used for the adjustment. In other examples, the resource allocation adjustment circuitry is configured to determine the required resource allocation for the adjustment based on the barrier instruction. For example, the resource allocation adjustment circuitry may check one or more data structures or registers for the required resource allocation to use for the adjustment in response to encountering the barrier instruction.
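The difference between a hint (a request that may be ignored or bounded) and a barrier (a requirement) can be sketched as below. The function, its parameters, the use of a percentage share for the extension processing circuitry, and the choice that a barrier is applied without bounding are illustrative assumptions rather than behaviour fixed by the text.

```python
def apply_indication(current, requested, kind, bounds=(0, 100), honour_hints=True):
    """Return the new extension share (percent) after a software indication.

    kind is "hint" (requested allocation) or "barrier" (required allocation).
    """
    if kind == "barrier":
        # A required allocation is applied as specified (assumed unbounded).
        return requested
    if kind == "hint":
        if not honour_hints:
            return current  # the implementation may ignore a hint
        low, high = bounds  # e.g. limits held in configuration registers
        return max(low, min(high, requested))
    raise ValueError(f"unknown indication kind: {kind}")


bounded_hint = apply_indication(40, 95, "hint", bounds=(10, 80))
ignored_hint = apply_indication(40, 95, "hint", honour_hints=False)
barrier = apply_indication(40, 95, "barrier", bounds=(10, 80))
```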
[0113] Figure 10 shows a data processing apparatus 10 according to examples of the present technique. Apparatus 10 includes decoding circuitry 13, processing circuitry 6, extension processing circuitry 23, an extension task offload interface 24, and resource allocation adjustment circuitry 60, as discussed in relation to figure 6. In addition to figure 6, apparatus 10 also includes further extension processing circuitry 62, which is second extension processing circuitry and thus corresponds to extension processing circuitry 23. The extension task offload interface 24 also offloads data processing operations to the further extension processing circuitry 62 responsive to at least one task offloading instruction decoded by the decoding circuitry 13. It will be appreciated that the number of extension processing circuitries may vary depending on implementation.
[0114] In this example, apparatus 10 also includes resource prediction circuitry 64. Resource prediction circuitry 64 is configured to generate a resource adjustment indication by predicting a required future resource allocation based on a performance counter for tracking performance metrics of the processing circuitry 6 and / or the extension processing circuitry 23 (and / or the further extension processing circuitry 62 if provided). For example, the resource prediction circuitry 64 may determine the performance metrics during a previous processing period when the extension processing circuitry 23 was performing a processing task asynchronously to the processing circuitry 6. The performance metrics may be from existing performance monitoring units or from a performance monitoring unit associated with the extension processing circuitry 23. As discussed herein, the performance metrics are not particularly limited and may include a latency, a number of cache misses, a number of page table walks, and a number of processing stalls. In some examples, the performance metrics include a task completion time.
[0115] As discussed herein, the performance counter may be initialised based on software provided information to bias initial values of the performance metrics. This can reduce a warm up time for the resource prediction and thus improve resource prediction. The software provided information may comprise a stored entry in a register or data structure and / or a hint instruction. In some examples, the software provided information may provide bounds for resource allocation adjustment. For example, a configuration register may include one or more entries indicative of a minimum and / or maximum ratio of processing circuitry / extension processing circuitry memory accesses per cycle.
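A minimal sketch of counter-driven prediction with a software-provided starting point and software-provided bounds is given below. The class name, the use of stall counts as the performance metric, the fixed step size, and the rule of moving resource toward the side that stalled more are all assumptions made for illustration; the text leaves the prediction method open.

```python
class ResourcePredictor:
    """Toy predictor for the extension circuitry's share of a resource (percent)."""

    def __init__(self, seed_share=50, min_share=10, max_share=90, step=10):
        self.share = seed_share      # software-provided bias shortens warm-up
        self.min_share = min_share   # software-provided lower bound
        self.max_share = max_share   # software-provided upper bound
        self.step = step

    def update(self, cpu_stalls, ext_stalls):
        """Adjust the share from stall counts observed in the previous period."""
        if ext_stalls > cpu_stalls:
            self.share += self.step
        elif cpu_stalls > ext_stalls:
            self.share -= self.step
        # Keep the prediction within the configured bounds.
        self.share = max(self.min_share, min(self.max_share, self.share))
        return self.share


predictor = ResourcePredictor(seed_share=80)
first = predictor.update(cpu_stalls=0, ext_stalls=5)   # extension starved: raise
second = predictor.update(cpu_stalls=0, ext_stalls=5)  # already at upper bound
third = predictor.update(cpu_stalls=5, ext_stalls=0)   # CPU starved: lower
```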
[0117] In examples where there is more than one extension processing circuitry, such as the example of figure 10, the software provided control indication may comprise stored entries programmable by software for each of the extension processing circuitries. This is shown in figure 11. These entries may be stored in one or more data structures or registers, and each entry may indicate a resource allocation for each of the extension processing circuitry 23 and the further extension processing circuitry 62. In this way, fine grain resource control can be supported in examples with multiple extension processing circuitries.
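One way to picture the per-circuitry stored entries is as a small table keyed by extension processing circuitry. The sketch below assumes each entry is a percentage of a shared resource and that the processing circuitry receives the remainder; this encoding and all names and values are illustrative assumptions, not the format shown in figure 11.

```python
# Hypothetical stored entries: reference numeral of each extension processing
# circuitry mapped to an assumed percentage of a shared resource.
EXTENSION_ENTRIES = {23: 30, 62: 20}


def processing_share_percent(entries: dict) -> int:
    """Return the percentage left for the processing circuitry."""
    total = sum(entries.values())
    if total > 100:
        raise ValueError("stored entries allocate more than 100% of the resource")
    return 100 - total


remaining = processing_share_percent(EXTENSION_ENTRIES)
```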
[0118] The apparatus 10 of figure 10 also includes partition identifier circuitry 66. The partition identifier circuitry 66 is configured to allocate, to memory traffic associated with the extension processing circuitry 23 (and further extension processing circuitry 62 if provided), a different partition identifier to a partition identifier used for memory traffic associated with the processing circuitry. The resource allocation adjustment circuitry 60 is configured to differentiate between memory traffic associated with the processing circuitry and memory traffic associated with the extension processing circuitry based on the respective memory traffic having different partition identifiers. This is shown in figure 12A, where a different PARTID (i.e. partition identifier) is assigned to memory traffic associated with the processing circuitry 6 to the memory traffic associated with the extension processing circuitry 23. It will be appreciated that while apparatus 10 of figure 10 is shown to include both the resource prediction circuitry 64 and the partition identifier circuitry 66, in some examples one or both of these may be omitted from apparatus 10.
[0119] Hence, memory traffic can be efficiently differentiated, and this may support the partitioning of shared resources or enforcement of quality of service requirements. For example, this may support the use of an MPAM mechanism, which itself can enforce partitioning of memory resources, such as caches, interconnect bandwidth, and memory system bandwidth.
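The labelling described above can be sketched as follows. The PARTID values, the MemoryRequest type and the helper functions are hypothetical; the only property taken from the description is that extension processing circuitry traffic carries a different partition identifier from processing circuitry traffic.

```python
from collections import Counter
from dataclasses import dataclass

# Hypothetical PARTID values; the description only requires that they differ.
PARTID_PROCESSING = 0
PARTID_EXTENSION = 1


@dataclass(frozen=True)
class MemoryRequest:
    address: int
    partid: int


def tag_request(address: int, from_extension: bool) -> MemoryRequest:
    """Attach a partition identifier according to the source of the request."""
    partid = PARTID_EXTENSION if from_extension else PARTID_PROCESSING
    return MemoryRequest(address=address, partid=partid)


def count_traffic(requests) -> Counter:
    """Differentiate traffic purely by partition identifier."""
    return Counter(request.partid for request in requests)


requests = [
    tag_request(0x1000, from_extension=False),
    tag_request(0x2000, from_extension=True),
    tag_request(0x2040, from_extension=True),
]
traffic = count_traffic(requests)
```

Because the downstream logic inspects only the identifier, it needs no knowledge of which circuitry issued a given request.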
[0120] An MPAM mechanism will now be described in greater detail with reference to figure 12B, which may be used to implement resource allocation control.
[0121] A processing element 1204 having a processing pipeline 1224 for processing instructions supports an instruction set architecture which provides software with the ability to define, for a given software workload, one or more partition identifiers which distinguish one software workload from another. Such partition identifiers can be specified in memory system requests sent out to the memory system, and propagate through the memory system along with those requests, so that memory system components can identify which software workload a given request relates to.
[0122] In particular, the registers 1226 of the processing element 1204 include a set of partition identifier registers 1228 used to set one or more workload identifiers which are specified by a memory system request sent to the cache 1212 of the processing element 1204 or other parts of the memory system. The processing element 1204 includes circuitry (e.g. the load / store unit 1229) which selects which items of partition identifying information are specified by the memory system request, based on the information stored in the one or more partition identifier registers 1228. The partition identifying information specified by the memory system request may include one or more identifiers which act as a label to distinguish memory system requests issued on behalf of different execution environments (e.g. different software execution environments executed by the processing element 1204). The partition identifying information does not influence which addresses in memory are allowed to be accessed by a particular execution environment, but is used for resource allocation control for regulating the level of performance seen for memory accesses issued by a particular execution environment and / or for control of resource utilisation monitoring so that separate resource utilisation metrics can be gathered for different workloads (execution environments).
[0124] As shown in Figure 12B, a given memory system component (e.g. system cache 1216, interconnect 1214, memory controller of memory 1220, or private cache 1212) could include resource allocation control circuitry 1238 which uses the partition identifying information for selecting resource allocation control settings, e.g. which limit the amount of memory system bandwidth which a particular execution environment is allowed to use, or limit a maximum fraction of cache capacity that a given execution environment is allowed to allocate for its own information. The resource allocation control circuitry 1238 may have access to a number of sets of resource allocation setting information 1239 (each set corresponding to a given value of the partition identifying information), which specifies how to control resource allocations for requests specifying that value of the workload identifying information. For example, the resource allocation settings 1239 may be defined in a memory-based table structure stored in memory 1220 which is accessed at an address determined based on a base address defined in a register of the memory system component 1216, 1214, 1220, 1212. The base address register may be memory-mapped so that the base address can be set by software executing on a CPU 1204 by executing a store instruction specifying an address mapped to the base address register. Alternatively, other configuration interface mechanisms may be provided as a configuration interface 1237 of the memory system component 1216, 1214, 1220, 1212 to allow software to control the resource allocation settings for handling requests with a given value of the partition identifying information.
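A minimal sketch of partition-based resource allocation control is given below, assuming a cache whose settings cap the fraction of lines each partition may hold. The class names, the per-partition bookkeeping and the policy of refusing an allocation at the cap are simplifying assumptions; a real memory system component might instead evict one of the partition's own lines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AllocationSettings:
    """Hypothetical settings selected by partition identifier."""
    max_cache_fraction: float  # cap on the fraction of cache lines held


class CacheAllocationControl:
    def __init__(self, total_lines: int, settings: dict):
        self.total_lines = total_lines
        self.settings = settings      # partition identifier -> settings
        self.lines_held = {}          # partition identifier -> lines in use

    def try_allocate(self, partid: int) -> bool:
        """Allow a new line only while the partition is under its cap."""
        limit = int(self.total_lines * self.settings[partid].max_cache_fraction)
        held = self.lines_held.get(partid, 0)
        if held >= limit:
            return False
        self.lines_held[partid] = held + 1
        return True


# Illustrative configuration: partition 0 may hold 6 of 8 lines, partition 1 may hold 2.
control = CacheAllocationControl(
    total_lines=8,
    settings={0: AllocationSettings(0.75), 1: AllocationSettings(0.25)},
)
extension_results = [control.try_allocate(1) for _ in range(3)]
processing_result = control.try_allocate(0)
```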
[0125] Also, as shown in Figure 12B, at least one instance of resource monitoring circuitry 1230 can have a similar configuration interface 1237 for configuring the resource monitoring circuitry 1230 to maintain resource utilisation metrics for one or more distinct partitions as identified by respective values of the partition identifying information. When a memory system request specifying a given value of the partition identifying information is detected, the resource monitoring circuitry 1230 checks whether that value of the partition identifying information corresponds to a partition for which a partition-specific resource utilisation metric is to be maintained, and if so updates a corresponding one of the partition-specific resource utilisation metrics 1235. The gathered metrics 1235 are made available for access by the system resource management agent 1234 (e.g. software on a CPU or a hardware processor responsible for resource allocation control).
[0127] Hence, the partition identifier control registers 1228 provide a mechanism by which software executing on the CPU 1204 may control labelling of memory access requests to assign partition identifiers to memory access requests, which can be used to control resource allocation and / or gathering of partition-specific resource utilisation metrics at a memory system component within the memory system.
[0128] In some examples, the partition identifying information assigned to a given memory access request may include more than one identifier, e.g.: a partition identifier (PARTID) which is used to control which set of resource allocation settings are applied by resource allocation control circuitry 1238 of a memory system component 1216, 1214, 1220, 1212; and a performance monitoring group identifier (PMG) which is used by resource monitoring circuitry 1230 to select which of several partition-specific resource utilisation metrics are to be updated based on the memory access request.
[0129] In some examples, the PARTID and PMG may be considered independent identifiers, with the resource allocation control circuitry 1238 selecting between resource allocation control settings based on the PARTID (independent of PMG) and the resource monitoring circuitry 1230 selecting which resource utilisation metric to update based on the PMG (independent of PARTID). Alternatively, one of the PARTID and PMG may be regarded as a sub-identifier which distinguishes between different sub-classes of partitions corresponding to a given value of the other identifier. For example, while resource allocation control may be based on PARTID only (independent of PMG), resource monitor selection may be based on the combination of PARTID and PMG (so that partitions having the same PARTID but different PMGs might have different resource utilisation metrics maintained specific to each of those partitions even though the partitions share the same resource allocation settings controlled based on PARTID). The opposite approach is also possible, with resource utilisation monitor selection based on PMG only and resource allocation control based on the combination of PARTID and PMG. Regardless of the particular approach taken, providing multiple identifiers can give more flexibility in providing different granularity of control over resource allocation control compared to resource utilisation monitoring. However, it will be appreciated that providing multiple identifiers is not essential, and other approaches may provide a single identifier used to control both selection of resource allocation settings applied by resource allocation control circuitry 1238 (e.g. caps on maximum cache allocation or maximum bandwidth consumption) and for selection of which resource utilisation metric to update.
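The variant in which resource allocation control uses PARTID only while monitor selection uses the combination of PARTID and PMG can be sketched as below. The MemoryComponent class, its fields and the byte-count metric are hypothetical and are chosen only to illustrate the two different selection keys.

```python
from collections import defaultdict


class MemoryComponent:
    """Hypothetical model of a memory system component with separate
    selection keys for allocation control and utilisation monitoring."""

    def __init__(self, settings_by_partid: dict):
        self.settings_by_partid = settings_by_partid
        self.bytes_by_partition = defaultdict(int)

    def handle(self, partid: int, pmg: int, num_bytes: int) -> str:
        # Allocation settings are selected by PARTID alone.
        settings = self.settings_by_partid[partid]
        # The utilisation metric is selected by the (PARTID, PMG) pair.
        self.bytes_by_partition[(partid, pmg)] += num_bytes
        return settings


component = MemoryComponent({1: "extension-bandwidth-cap"})
first = component.handle(partid=1, pmg=0, num_bytes=64)
second = component.handle(partid=1, pmg=1, num_bytes=128)
```

Both requests share one set of allocation settings, yet their utilisation is accounted separately, which is the finer monitoring granularity described above.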
[0130] One technique for implementing the resource allocation control is to use the Memory Partitioning and Monitoring (MPAM) architecture provided by Arm Limited, with the additional functionality of assigning, to memory traffic associated with the extension processing circuitry, a different partition identifier to a partition identifier used for memory traffic associated with the processing circuitry, as discussed herein. By using a different identifier for extension processing circuitry memory traffic from processing circuitry memory traffic, MPAM mechanisms for resource allocation control based on partition identifier discussed above may be implemented.
[0132] Figure 13 shows steps for adjusting resource allocation based on determined performance. At step 1302, performance of the processing circuitry 6 and / or the extension processing circuitry 23 is determined. This may include tracking the performance metrics using the performance counter as discussed above. At step 1304, the resource allocation between the processing circuitry 6 and the extension processing circuitry 23 is adjusted based on the determined performance.
[0133] Figure 14 shows steps for adjusting resource allocation based on a performance counter. At 1402, a performance counter for tracking performance metrics of the processing circuitry and / or extension processing circuitry is initialised. In some examples this step may be omitted. The initialisation may be based on software provided information, to warm up the prediction logic. For example, a configuration register may be used to set initial resource allocations targeted to known workload characteristics. This resource allocation could then be adjusted over time based on performance metric(s) measured at runtime. Software information may also provide bounds for the predictor to ensure minimum quality of service and / or shape performance characteristics, such as bandwidth utilization.
[0134] At 1404, performance metrics of the processing circuitry and / or the extension processing circuitry are tracked using the performance counter during a processing period when the extension processing circuitry is performing a processing task asynchronously to the processing circuitry. At 1406, the resource allocation for future performance of the processing task is adjusted based on the performance counter. Hence, for future iterations of the processing task, the resource allocation can be set based on known performance metrics of the processing / extension processing circuitry during a previous time when that same processing task was performed.
[0135] In an example using extension start and extension synchronisation instructions, such as XSTART and XSYNC instructions, a resource predictor may be configured to identify a processing task of the extension processing circuitry based on a program counter of the XSTART instruction, and measure the resource utilization by the processing circuitry and the extension processing circuitry from the start of the processing task until a corresponding XSYNC instruction retires. In some cases, existing performance monitoring unit (PMU) event sources may be used as the performance metrics. On a later launch of the same processing task (e.g. XSTART at the same program counter), the resource allocation adjustment circuitry may adjust the resource allocation based on the previous resource utilization.
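A task-keyed predictor of this kind might be sketched as follows, assuming the measured utilisation is a pair of memory access counts and that the next allocation simply mirrors the previously observed ratio. The ResourcePredictor class, its method names and the default share of 0.5 are assumptions made for illustration.

```python
class ResourcePredictor:
    """Hypothetical predictor keyed by the program counter of XSTART."""

    def __init__(self):
        # XSTART program counter -> (processing accesses, extension accesses)
        self.history = {}

    def record(self, xstart_pc: int, processing_accesses: int,
               extension_accesses: int) -> None:
        """Store utilisation measured between XSTART and XSYNC retirement."""
        self.history[xstart_pc] = (processing_accesses, extension_accesses)

    def predict_extension_share(self, xstart_pc: int,
                                default: float = 0.5) -> float:
        """Suggest the extension circuitry's share for a later launch."""
        if xstart_pc not in self.history:
            return default  # no history yet for this task
        processing, extension = self.history[xstart_pc]
        total = processing + extension
        return default if total == 0 else extension / total


predictor = ResourcePredictor()
predictor.record(xstart_pc=0x4000, processing_accesses=100, extension_accesses=300)
known_task_share = predictor.predict_extension_share(0x4000)
unknown_task_share = predictor.predict_extension_share(0x8000)
```

A software provided initial value, as discussed in relation to figure 14, could take the place of the default used here for tasks without history.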
[0136] Figure 15 shows steps for updating a tracker based on encountering a synchronisation instruction. One lightweight way to implement performance-based adjustment of the resource allocation is to increase / decrease the resources allocated to one of the processing circuitry or extension processing circuitry depending on which of these was waiting when a synchronisation instruction is encountered.
[0138] At step 1502, an extension synchronisation instruction is encountered (e.g. as described in relation to figure 3). The extension synchronisation instruction is to synchronise operation of the extension processing circuitry and the processing circuitry after a processing task has been offloaded to the extension processing circuitry (in some cases, on completion of the processing task). In some cases, the extension synchronisation instruction returns to a register one or more values produced by the extension processing circuitry (or reports a status of the extension processing circuitry), and delays operations dependent on such registers until the extension processing task is complete.
[0139] At step 1504, it is determined whether the extension processing circuitry had already completed its associated processing task when the synchronisation instruction was encountered. If a positive determination is made, the process continues to step 1506, where the tracker is updated to indicate that the extension processing circuitry had already completed its processing task. This may indicate that the processing circuitry was under-resourced, and so this step may include increasing the resource allocation for the processing circuitry or adjusting the resource allocation in favour of the processing circuitry. When a negative determination is made at step 1504, the process continues to step 1508, where the tracker is updated to indicate that the extension processing circuitry had not already completed its processing task. This may indicate that the extension processing circuitry was under-resourced, and so this step may include increasing the resource allocation for the extension processing circuitry or adjusting the resource allocation in favour of the extension processing circuitry. In this way, the resource allocation can be efficiently adjusted based on performance.
[0140] In one example, the tracker comprises a counter. In this example, rather than updating the tracker to indicate that the extension processing circuitry had already or had not already completed its processing task at steps 1506 / 1508, the counter may be adjusted in one direction each time the extension processing circuitry had already completed the processing task and adjusted in the opposite direction each time the extension processing circuitry had not already completed its processing task. The current counter value may then be used to control the resource allocation adjustment. Hence, the resource adjustment indication may comprise the counter. For example, this control may be based on either a continuous decrease / increase in resources with successive values of the counter, or based on determining whether a value of the counter satisfies a predetermined threshold. For example, for counter values less than a predetermined threshold, one resource control setting may be selected, and for counter values greater than or equal to the predetermined threshold, a different resource control setting may be selected.
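The counter-based tracker can be sketched as a saturating counter with a threshold. The counter width, threshold, initial value, the choice of which direction corresponds to which outcome, and the setting names are all assumptions for illustration.

```python
class SyncTracker:
    """Hypothetical saturating counter updated at each extension
    synchronisation instruction."""

    def __init__(self, max_value: int = 7, threshold: int = 4, initial: int = 3):
        self.max_value = max_value
        self.threshold = threshold
        self.value = initial

    def on_sync(self, extension_already_done: bool) -> None:
        if extension_already_done:
            # Extension circuitry finished early: move towards the
            # processing circuitry, saturating at zero.
            self.value = max(0, self.value - 1)
        else:
            # Processing circuitry had to wait: move towards the
            # extension circuitry, saturating at the maximum.
            self.value = min(self.max_value, self.value + 1)

    def setting(self) -> str:
        """Select a resource control setting by comparing to the threshold."""
        if self.value >= self.threshold:
            return "favour_extension"
        return "favour_processing"


tracker = SyncTracker()
tracker.on_sync(extension_already_done=False)
after_wait = tracker.setting()
for _ in range(10):
    tracker.on_sync(extension_already_done=True)
after_many_early = tracker.setting()
```

Saturation keeps a long run of one outcome from requiring an equally long run of the other before the setting can change again.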
[0141] Various approaches discussed above may be combined as discussed below. For example, using information from the resource predictor, stored entries, and / or software hints, the resource allocation adjustment circuitry may be configured to determine whether to power-gate the processing circuitry or extension processing circuitry. This can be useful because such information may indicate that a given component may not be required for an extended period of time (e.g. tens of processing cycles), and thus that the given component may be gated off. In other examples, such information may be used to slow the processing circuitry or extension processing circuitry by reducing its operating voltage and / or frequency. This can reduce the overall power consumption and avoid the need for thermal throttling.
[0143] Concepts described herein may be embodied in a system comprising at least one packaged chip. The apparatus described earlier is implemented in the at least one packaged chip (either being implemented in one specific chip of the system, or distributed over more than one packaged chip). The at least one packaged chip is assembled on a board with at least one system component. A chip-containing product may comprise the system assembled on a further board with at least one other product component. The system or the chip-containing product may be assembled into a housing or onto a structural support (such as a frame or blade).
[0144] As shown in Figure 16, one or more packaged chips 400, with the apparatus described above implemented on one chip or distributed over two or more of the chips, are manufactured by a semiconductor chip manufacturer. In some examples, the chip product 400 made by the semiconductor chip manufacturer may be provided as a semiconductor package which comprises a protective casing (e.g. made of metal, plastic, glass or ceramic) containing the semiconductor devices implementing the apparatus described above and connectors, such as lands, balls or pins, for connecting the semiconductor devices to an external environment. Where more than one chip 400 is provided, these could be provided as separate integrated circuits (provided as separate packages), or could be packaged by the semiconductor provider into a multi-chip semiconductor package (e.g. using an interposer, or by using three-dimensional integration to provide a multi-layer chip product comprising two or more vertically stacked integrated circuit layers).
[0145] In some examples, a collection of chiplets (i.e. small modular chips with particular functionality) may itself be referred to as a chip. A chiplet may be packaged individually in a semiconductor package and / or together with other chiplets into a multi-chiplet semiconductor package (e.g. using an interposer, or by using three-dimensional integration to provide a multilayer chiplet product comprising two or more vertically stacked integrated circuit layers).
[0146] The one or more packaged chips 400 are assembled on a board 402 together with at least one system component 404 to provide a system 406. For example, the board may comprise a printed circuit board. The board substrate may be made of any of a variety of materials, e.g. plastic, glass, ceramic, or a flexible substrate material such as paper, plastic or textile material. The at least one system component 404 comprises one or more external components which are not part of the one or more packaged chip(s) 400. For example, the at least one system component 404 could include any one or more of the following: another packaged chip (e.g. provided by a different manufacturer or produced on a different process node), an interface module, a resistor, a capacitor, an inductor, a transformer, a diode, a transistor and / or a sensor.
[0148] A chip-containing product 416 is manufactured comprising the system 406 (including the board 402, the one or more chips 400 and the at least one system component 404) and one or more product components 412. The product components 412 comprise one or more further components which are not part of the system 406. As a non-exhaustive list of examples, the one or more product components 412 could include a user input / output device such as a keypad, touch screen, microphone, loudspeaker, display screen, haptic device, etc.; a wireless communication transmitter / receiver; a sensor; an actuator for actuating mechanical motion; a thermal control device; a further packaged chip; an interface module; a resistor; a capacitor; an inductor; a transformer; a diode; and / or a transistor. The system 406 and one or more product components 412 may be assembled on to a further board 414.
[0149] The board 402 or the further board 414 may be provided on or within a device housing or other structural support (e.g. a frame or blade) to provide a product which can be handled by a user and / or is intended for operational use by a person or company.
[0150] The system 406 or the chip-containing product 416 may be at least one of: an end-user product, a machine, a medical device, a computing or telecommunications infrastructure product, or an automation control system. As a non-exhaustive list of examples, the chip-containing product could be any of the following: a telecommunications device, a mobile phone, a tablet, a laptop, a computer, a server (e.g. a rack server or blade server), an infrastructure device, networking equipment, a vehicle or other automotive product, industrial machinery, consumer device, smart card, credit card, smart glasses, avionics device, robotics device, camera, television, smart television, DVD player, set top box, wearable device, domestic appliance, smart meter, medical device, heating / lighting control device, sensor, and / or a control system for controlling public infrastructure equipment such as smart motorway or traffic lights.
[0151] Concepts described herein may be embodied in computer-readable code for fabrication of an apparatus that embodies the described concepts. For example, the computer-readable code can be used at one or more stages of a semiconductor design and fabrication process, including an electronic design automation (EDA) stage, to fabricate an integrated circuit comprising the apparatus embodying the concepts. The above computer-readable code may additionally or alternatively enable the definition, modelling, simulation, verification and / or testing of an apparatus embodying the concepts described herein.
[0152] For example, the computer-readable code for fabrication of an apparatus embodying the concepts described herein can be embodied in code defining a hardware description language (HDL) representation of the concepts. For example, the code may define a register-transfer-level (RTL) abstraction of one or more logic circuits for defining an apparatus embodying the concepts. The code may define an HDL representation of the one or more logic circuits embodying the apparatus in Verilog, SystemVerilog, Chisel, or VHDL (Very High-Speed Integrated Circuit Hardware Description Language) as well as intermediate representations such as FIRRTL. Computer-readable code may provide definitions embodying the concept using system-level modelling languages such as SystemC and SystemVerilog or other behavioural representations of the concepts that can be interpreted by a computer to enable simulation, functional and / or formal verification, and testing of the concepts.
[0154] Additionally or alternatively, the computer-readable code may define a low-level description of integrated circuit components that embody concepts described herein, such as one or more netlists or integrated circuit layout definitions, including representations such as GDSII. The one or more netlists or other computer-readable representation of integrated circuit components may be generated by applying one or more logic synthesis processes to an RTL representation to generate definitions for use in fabrication of an apparatus embodying the invention. Alternatively or additionally, the one or more logic synthesis processes can generate from the computer-readable code a bitstream to be loaded into a field programmable gate array (FPGA) to configure the FPGA to embody the described concepts. The FPGA may be deployed for the purposes of verification and test of the concepts prior to fabrication in an integrated circuit or the FPGA may be deployed in a product directly.
[0155] The computer-readable code may comprise a mix of code representations for fabrication of an apparatus, for example including a mix of one or more of an RTL representation, a netlist representation, or another computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus embodying the invention. Alternatively or additionally, the concept may be defined in a combination of a computer-readable definition to be used in a semiconductor design and fabrication process to fabricate an apparatus and computer-readable code defining instructions which are to be executed by the defined apparatus once fabricated.
[0156] Such computer-readable code can be disposed in any known transitory computer-readable medium (such as wired or wireless transmission of code over a network) or non-transitory computer-readable medium such as semiconductor, magnetic disk, or optical disc. An integrated circuit fabricated using the computer-readable code may comprise components such as one or more of a central processing unit, graphics processing unit, neural processing unit, digital signal processor or other components that individually or collectively embody the concept.
[0157] Figure 17 illustrates a simulator implementation that may be used. Whilst the earlier described embodiments implement the present invention in terms of apparatus and methods for operating specific processing hardware supporting the techniques concerned, it is also possible to provide an instruction execution environment in accordance with the embodiments described herein which is implemented through the use of a computer program. Such computer programs are often referred to as simulators, insofar as they provide a software based implementation of a hardware architecture. Varieties of simulator computer programs include emulators, virtual machines, models, and binary translators, including dynamic binary translators. Typically, a simulator implementation may run on a host processor 1730, optionally running a host operating system 1722, supporting the simulator program 1710. In some arrangements, there may be multiple layers of simulation between the hardware and the provided instruction execution environment, and / or multiple distinct instruction execution environments provided on the same host processor. Historically, powerful processors have been required to provide simulator implementations which execute at a reasonable speed, but such an approach may be justified in certain circumstances, such as when there is a desire to run code native to another processor for compatibility or re-use reasons. For example, the simulator implementation may provide an instruction execution environment with additional functionality which is not supported by the host processor hardware, or provide an instruction execution environment typically associated with a different hardware architecture. An overview of simulation is given in “Some Efficient Architecture Simulation Techniques”, Robert Bedichek, Winter 1990 USENIX Conference, Pages 53 - 63.
[0159] To the extent that embodiments have previously been described with reference to particular hardware constructs or features, in a simulated embodiment, equivalent functionality may be provided by suitable software constructs or features. For example, particular circuitry may be implemented in a simulated embodiment as computer program logic. Similarly, memory hardware, such as a register or cache, may be implemented in a simulated embodiment as a software data structure. In arrangements where one or more of the hardware elements referenced in the previously described embodiments are present on the host hardware (for example, host processor 1730), some simulated embodiments may make use of the host hardware, where suitable.
[0160] The simulator program 1710 may be stored on a computer-readable storage medium (which may be a non-transitory medium), and provides a program interface (instruction execution environment) to the target code 1700 (which may include applications, operating systems and a hypervisor) which is the same as the interface of the hardware architecture being modelled by the simulator program 1710. Thus, the program instructions of the target code 1700 described above may be executed from within the instruction execution environment using the simulator program 1710, so that a host computer 1730 which does not actually have the hardware features of the apparatus 2 discussed above can emulate these features.
[0161] The simulator program 1710 includes decoding program logic 1712, processing program logic 1714, extension processing program logic 1716, an extension task offload interface 1718, and resource allocation adjustment program logic 1720, which emulate the functionality of the decoding circuitry 13, processing circuitry 6, extension processing circuitry 23, extension task offload interface 24, and resource allocation adjustment circuitry 60 described earlier.
[0162] Some examples are set out in the following clauses:
[0163] 1. An apparatus comprising: decoding circuitry configured to decode instructions; processing circuitry configured to perform data processing operations in response to the instructions decoded by the decoding circuitry; extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment circuitry configured to adjust a resource allocation between the processing circuitry and the extension processing circuitry responsive to a resource adjustment indication.
[0165] 2. The apparatus of clause 1, in which the resource adjustment indication is associated with a given processing task for the extension processing circuitry.
[0166] 3. The apparatus of any preceding clause, in which the resource adjustment indication comprises a software provided control indication.
[0167] 4. The apparatus of clause 3, in which the software provided control indication comprises a stored entry programmable by software, the entry indicative of an adjustment to be made to the resource allocation.
[0168] 5. The apparatus of clause 3 or 4, in which the software provided control indication comprises a plurality of stored entries programmable by software, each associated with a resource allocation for a given extension processing task for the extension processing circuitry.
[0169] 6. The apparatus of any of clauses 3 to 5, comprising further extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the further extension processing circuitry, and wherein the software provided control indication comprises stored entries programmable by software for each of the extension circuitry and the further extension circuitry.
[0170] 7. The apparatus of any of clauses 3 to 6, in which the software provided control indication is defined in a software memory data structure and/or one or more registers.
[0171] 8. The apparatus of any of clauses 3 to 7, in which the software provided control indication comprises a hint instruction indicative of a requested resource allocation or a future extension processing task that the extension processing circuitry is to perform.
[0172] 9. The apparatus of any of clauses 3 to 8, in which the software provided control indication comprises a barrier instruction indicative of a required resource allocation.
[0173] 10. The apparatus of any preceding clause, in which the resource allocation adjustment circuitry is configured to differentiate between memory traffic associated with the processing circuitry and memory traffic associated with the extension processing circuitry based on the respective memory traffic having different partition identifiers, the apparatus comprising partition identifier circuitry configured to assign, to memory traffic associated with the extension processing circuitry, a different partition identifier to a partition identifier used for memory traffic associated with the processing circuitry.
[0175] 11. The apparatus of any preceding clause, in which the resource adjustment indication comprises a hardware provided control indication.
[0176] 12. The apparatus of clause 11, in which the hardware provided control indication comprises an indication that the resource allocation is to be adjusted based on performance of the processing circuitry and/or the extension processing circuitry during previous processing.
[0177] 13. The apparatus of clause 12, comprising resource prediction circuitry configured to generate the resource adjustment indication by predicting a required future resource allocation based on a performance counter for tracking performance metrics of the processing circuitry and/or the extension processing circuitry.
[0178] 14. The apparatus of clause 13, in which the resource prediction circuitry is configured to determine the performance metrics during a previous processing period when the extension processing circuitry was performing a processing task asynchronously to the processing circuitry.
[0179] 15. The apparatus of any of clauses 13 or 14, in which, to determine the performance metrics, the resource prediction circuitry is configured to determine one or more of: a latency, a number of cache misses, a number of page table walks, and a number of processing stalls.
[0180] 16. The apparatus of any of clauses 13 to 15, in which the performance counter is initialized based on software provided information to bias initial values of the performance metrics or provide bounds for resource allocation adjustment.
[0181] 17. The apparatus of any preceding clause, in which the resource allocation adjustment circuitry is configured to maintain a tracker based on whether, on encountering a synchronization instruction to synchronize operation of the extension processing circuitry and the processing circuitry after a processing task has been offloaded to the extension processing circuitry, the extension processing circuitry had already completed its associated processing task, and wherein the resource adjustment indication comprises the tracker.
[0182] 18. The apparatus of any preceding clause, in which to adjust the resource allocation, the resource allocation adjustment circuitry is configured to reserve a resource for use by one of the processing circuitry or extension processing circuitry.
[0183] 19. The apparatus of any preceding clause, in which to adjust the resource allocation, the resource allocation adjustment circuitry is configured to adjust a ratio of resource assigned to the processing circuitry and the extension processing circuitry.
[0184] 20. The apparatus of any preceding clause, in which to adjust the resource allocation, the resource allocation adjustment circuitry is configured to prioritise a resource for use by the processing circuitry or the extension processing circuitry.
[0186] 21. The apparatus of any preceding clause, in which the resource allocation is associated with one or more of: memory system resources, memory load/store processing resources, input/output resources and power resources.
[0187] 22. The apparatus according to any preceding clause, in which the processing circuitry and the extension processing circuitry share a private cache private to a processing element comprising the processing circuitry and the extension processing circuitry and not directly accessible to any other processing element of the apparatus.
[0188] 23. The apparatus according to any preceding clause, in which the processing circuitry and the extension processing circuitry share translation table walk circuitry configured to control translation table walk operations for obtaining translation table data from a memory system.
[0189] 24. A system comprising: the apparatus of any preceding clause, implemented in at least one packaged chip; at least one system component; and a board, wherein the at least one packaged chip and the at least one system component are assembled on the board.
[0190] 25. A chip-containing product comprising the system of clause 24, wherein the system is assembled on a further board with at least one other product component.
[0191] 26. A computer-readable medium storing computer-readable code for fabrication of the apparatus of any of clauses 1 to 23.
[0192] 27. A computer program comprising instructions which, when executed by a host data processing apparatus, control the host data processing apparatus to provide an instruction execution environment for execution of target program code, the computer program comprising: decoding program logic configured to decode instructions; processing program logic configured to perform data processing operations in response to the instructions decoded by the decoding program logic; extension processing program logic configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing program logic; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment program logic configured to adjust a resource allocation between the processing program logic and the extension processing program logic responsive to a resource adjustment indication.
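As a hedged illustration of how the tracker of clause 17 might feed the ratio adjustment of clause 19, the following minimal sketch uses a saturating counter updated at each synchronization instruction. The update policy, the saturation limit, the use of cache ways as the shared resource and all names are assumptions made for illustration only; the clauses above do not prescribe them.

```python
from dataclasses import dataclass


@dataclass
class SyncTracker:
    """Saturating counter updated at each synchronization instruction (assumed policy)."""
    value: int = 0
    limit: int = 3  # hypothetical saturation bound

    def record_sync(self, extension_already_done: bool) -> None:
        # Count down when the extension task had already completed,
        # count up when the processing side had to wait for it.
        step = -1 if extension_already_done else 1
        self.value = max(-self.limit, min(self.limit, self.value + step))


@dataclass
class RatioAllocator:
    """Splits total_ways (hypothetical cache ways) between the two consumers."""
    total_ways: int = 8
    extension_ways: int = 4
    min_ways: int = 1  # each side always keeps at least this many ways

    @property
    def processing_ways(self) -> int:
        return self.total_ways - self.extension_ways

    def adjust(self, tracker: SyncTracker) -> None:
        """Treat the saturated tracker as the resource adjustment indication."""
        if tracker.value >= tracker.limit:
            delta = 1    # extension repeatedly kept the processing side waiting
        elif tracker.value <= -tracker.limit:
            delta = -1   # extension repeatedly finished early
        else:
            return       # no clear trend yet, leave the ratio unchanged
        upper = self.total_ways - self.min_ways
        self.extension_ways = max(self.min_ways, min(upper, self.extension_ways + delta))
        tracker.value = 0  # restart observation after an adjustment


tracker = SyncTracker()
allocator = RatioAllocator()
for done in (False, False, False):
    tracker.record_sync(extension_already_done=done)
allocator.adjust(tracker)
```

A saturating counter is chosen here only because it damps reactions to a single late or early completion; other trackers, thresholds or step sizes would fit the same clauses.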
[0193] In the present application, the words “configured to...” are used to mean that an element of an apparatus has a configuration able to carry out the defined operation. In this context, a “configuration” means an arrangement or manner of interconnection of hardware or software. For example, the apparatus may have dedicated hardware which provides the defined operation, or a processor or other processing device may be programmed to perform the function. “Configured to” does not imply that the apparatus element needs to be changed in any way in order to provide the defined operation.
[0196] In the present application, lists of features preceded with the phrase “at least one of” mean that any one or more of those features can be provided either individually or in combination. For example, “at least one of: [A], [B] and [C]” encompasses any of the following options: A alone (without B or C), B alone (without A or C), C alone (without A or B), A and B in combination (without C), A and C in combination (without B), B and C in combination (without A), or A, B and C in combination.
[0197] Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope of the invention as defined by the appended claims.
Claims
1. An apparatus comprising: decoding circuitry configured to decode instructions; processing circuitry configured to perform data processing operations in response to the instructions decoded by the decoding circuitry; extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment circuitry configured to adjust a resource allocation between the processing circuitry and the extension processing circuitry responsive to a resource adjustment indication.
2. The apparatus of claim 1, in which the resource adjustment indication is associated with a given processing task for the extension processing circuitry.
3. The apparatus of any preceding claim, in which the resource adjustment indication comprises a software provided control indication.
4. The apparatus of claim 3, in which the software provided control indication comprises a stored entry programmable by software, the entry indicative of an adjustment to be made to the resource allocation.
5. The apparatus of claim 3 or 4, in which the software provided control indication comprises a plurality of stored entries programmable by software, each associated with a resource allocation for a given extension processing task for the extension processing circuitry.
6. The apparatus of any of claims 3 to 5, comprising further extension processing circuitry configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing circuitry, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the further extension processing circuitry, and wherein the software provided control indication comprises stored entries programmable by software for each of the extension circuitry and the further extension circuitry.
7. The apparatus of any of claims 3 to 6, in which the software provided control indication is defined in a software memory data structure and/or one or more registers.
8. The apparatus of any of claims 3 to 7, in which the software provided control indication comprises a hint instruction indicative of a requested resource allocation or a future extension processing task that the extension processing circuitry is to perform.
9. The apparatus of any of claims 3 to 8, in which the software provided control indication comprises a barrier instruction indicative of a required resource allocation.
10. The apparatus of any preceding claim, in which the resource allocation adjustment circuitry is configured to differentiate between memory traffic associated with the processing circuitry and memory traffic associated with the extension processing circuitry based on the respective memory traffic having different partition identifiers, the apparatus comprising partition identifier circuitry configured to assign, to memory traffic associated with the extension processing circuitry, a different partition identifier to a partition identifier used for memory traffic associated with the processing circuitry.
11. The apparatus of any preceding claim, in which the resource adjustment indication comprises a hardware provided control indication.
12. The apparatus of claim 11, in which the hardware provided control indication comprises an indication that the resource allocation is to be adjusted based on performance of the processing circuitry and/or the extension processing circuitry during previous processing.
13. The apparatus of claim 12, comprising resource prediction circuitry configured to generate the resource adjustment indication by predicting a required future resource allocation based on a performance counter for tracking performance metrics of the processing circuitry and/or the extension processing circuitry.
14. The apparatus of claim 13, in which the resource prediction circuitry is configured to determine the performance metrics during a previous processing period when the extension processing circuitry was performing a processing task asynchronously to the processing circuitry.
15. The apparatus of any of claims 13 or 14, in which, to determine the performance metrics, the resource prediction circuitry is configured to determine one or more of: a latency, a number of cache misses, a number of page table walks, and a number of processing stalls.
16. The apparatus of any of claims 13 to 15, in which the performance counter is initialized based on software provided information to bias initial values of the performance metrics or provide bounds for resource allocation adjustment.
17. The apparatus of any preceding claim, in which the resource allocation adjustment circuitry is configured to maintain a tracker based on whether, on encountering a synchronization instruction to synchronize operation of the extension processing circuitry and the processing circuitry after a processing task has been offloaded to the extension processing circuitry, the extension processing circuitry had already completed its associated processing task, and wherein the resource adjustment indication comprises the tracker.
18. The apparatus of any preceding claim, in which to adjust the resource allocation, the resource allocation adjustment circuitry is configured to: reserve a resource for use by one of the processing circuitry or extension processing circuitry; adjust a ratio of resource assigned to the processing circuitry and the extension processing circuitry; and/or prioritise a resource for use by the processing circuitry or the extension processing circuitry.
19. The apparatus of any preceding claim, in which the resource allocation is associated with one or more of: memory system resources, memory load/store processing resources, input/output resources and power resources.
20. The apparatus according to any preceding claim, in which the processing circuitry and the extension processing circuitry share a private cache private to a processing element comprising the processing circuitry and the extension processing circuitry and not directly accessible to any other processing element of the apparatus.
21. The apparatus according to any preceding claim, in which the processing circuitry and the extension processing circuitry share translation table walk circuitry configured to control translation table walk operations for obtaining translation table data from a memory system.
22. A system comprising: the apparatus of any preceding claim, implemented in at least one packaged chip; at least one system component; and a board, wherein the at least one packaged chip and the at least one system component are assembled on the board.
23. A chip-containing product comprising the system of claim 22, wherein the system is assembled on a further board with at least one other product component.
24. A computer-readable medium storing computer-readable code for fabrication of the apparatus of any of claims 1 to 21.
25. A computer program comprising instructions which, when executed by a host data processing apparatus, control the host data processing apparatus to provide an instruction execution environment for execution of target program code, the computer program comprising: decoding program logic configured to decode instructions; processing program logic configured to perform data processing operations in response to the instructions decoded by the decoding program logic; extension processing program logic configured to perform other data processing operations asynchronously with respect to data processing operations performed by the processing program logic; an extension task offload interface separate from an interface by which the processing circuitry issues a memory system request to a memory system, wherein the extension task offload interface is responsive to at least one task offloading instruction decoded by the decoding circuitry to offload the other data processing operations to the extension processing circuitry; and resource allocation adjustment program logic configured to adjust a resource allocation between the processing program logic and the extension processing program logic responsive to a resource adjustment indication.