Data processing system

By allowing processing units to be configured into partitions, routing cache communications within each partition over a multicast communication network, and configuring a master cache controller for each partition, the data processing system addresses the insufficient independence between multiple isolated subsystems while allowing flexible resource allocation and efficient data processing.

CN114820271B · Active · Publication Date: 2026-06-16 · ARM LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ARM LTD
Filing Date: 2022-01-28
Publication Date: 2026-06-16

AI Technical Summary

Technical Problem

Existing data processing systems provide insufficient independence and isolation when a single time-shared processing unit performs graphics processing for multiple isolated subsystems, whereas providing a completely separate processing unit for each subsystem fixes the resource partitioning and increases cost and component count.

Method used

Multiple processing units can be configured into different partitions. A multicast communication network routes cache communications within each partition, and a master cache controller issues requests to all caches in the partition simultaneously and receives a single combined response signal, supporting flexible resource allocation and independent operation.

Benefits of technology

It enables flexible, adaptive, and efficient communication between multiple processing units, supports both safety-critical and non-safety-critical operations, and improves the overall flexibility and configurability of the system.

✦ Generated by Eureka AI based on patent content.

Figure CN114820271B_ABST

Abstract

The invention is entitled "Data processing system". The invention discloses a data processing system comprising a plurality of processing units which can be configured as different partitions of processing units. The system comprises a multicast communication network for routing cache communications within a partition of processing units. A cache controller of one of the processing units within the partition can be configured as a master cache controller for a set of caches within the partition. The master cache controller is operable to issue a request to all of the caches in the set of caches simultaneously over the multicast communication network. The multicast communication network is configured to combine response signals from the different processing units within the partition to provide a combined response signal to the master cache controller representing an overall request-response state of the caches to which the request was issued.

Description

Background Technology

[0001] The technology described herein relates to data processing systems, and more specifically to data processing systems comprising multiple processing units, such as multiple graphics processing units (graphics processors) (GPUs).

[0002] It is becoming increasingly common for data processing systems to require graphics processing operations for multiple isolated subsystems. For example, a vehicle may have displays for a main instrument console, additional navigation and/or entertainment screens, and advanced driver assistance systems (ADAS). Each of these systems may need to perform its own graphics processing operations and, for example to meet formal safety requirements, they may need to be able to operate independently of each other.

[0003] One approach to such systems is to provide a single graphics processing unit (GPU) that is time-shared between the different graphics processing functions required. However, time sharing alone may not provide sufficient independence and isolation between the different subsystems that may require graphics processing.

[0004] Alternatively, a completely separate graphics processing unit can be provided for each required graphics processing function. However, this can have negative impacts, for example in terms of the number of processing components and/or the cost required, because the partitioning of resources would then have to be fixed at the time the SoC (System-on-Chip) is created.

[0005] Therefore, the applicant believes that there is still room for improvement in data processing systems that require multiple independent data processing functions, such as graphics processing functions for multiple different displays.

Description of the Drawings

[0006] Implementations of the technology described herein will now be described by way of example only, with reference to the accompanying drawings, wherein:

[0007] Figure 1 schematically illustrates a data processing system according to an embodiment;

[0008] Figure 2 schematically shows further details of the data processing system of Figure 1;

[0009] Figure 3 schematically shows further details of the data processing system of Figures 1 and 2;

[0010] Figure 4 shows schematically and in more detail the components of a graphics processing unit in an embodiment;

[0011] Figure 5 shows schematically and in more detail the components of a graphics processing unit, including an internal cache communication network, in an embodiment;

[0012] Figure 6 schematically shows a partition of multiple processing units arranged in a "master-slave" configuration according to an embodiment;

[0013] Figure 7 schematically shows a communication network within a processing unit that allows cache communications to be passed through the processing unit to other processing units within the partition of multiple processing units to which the processing unit belongs;

[0014] Figure 8 shows the configuration of the communication network of a processing unit when that processing unit is configured as the master processing unit in the partition of multiple processing units of Figure 6;

[0015] Figure 9 shows the configuration of the communication network of a processing unit when that processing unit is configured as an intermediate processing unit in the partition of multiple processing units of Figure 6;

[0016] Figure 10 shows the configuration of the communication network of a processing unit when that processing unit is configured as the last-level slave processing unit in the partition of multiple processing units of Figure 6;

[0017] Figure 11 is a flowchart illustrating an embodiment;

[0018] Figure 12 is a flowchart showing in more detail how cache invalidation requests can be handled according to an embodiment; and

[0019] Figure 13 illustrates a multicast cache communication protocol according to an embodiment.

[0020] Where appropriate, similar reference numerals are used for similar features in all figures.

Detailed Implementation

[0021] A first embodiment of the technology described herein includes a data processing system, the data processing system comprising:

[0022] Multiple processing units, which can be configured as different corresponding partitions of the processing units, wherein each partition includes a set of one or more processing units of the multiple processing units, wherein at least some of the multiple processing units include one or more caches and corresponding cache controllers; and

[0023] A configurable multicast communication network is provided for routing communications to and from multiple caches within corresponding partitions of the processing unit.

[0024] The multicast communication network is configurable such that the cache controller of one of the processing units within a corresponding partition of the plurality of processing units can be configured as the "master" cache controller for the set of the plurality of caches within the partition.

[0025] The master cache controller is operable to simultaneously issue a request to all caches in the set of multiple caches within the partition via the multicast communication network.

[0026] The multicast communication network is configured to combine the respective responses from all of the caches within a processing unit in the partition to which the request was issued, to generate a respective response signal for that processing unit.

[0027] The multicast communication network is further configured to combine the respective response signals from the different processing units within the partition and provide a combined response signal to the master cache controller, the combined response signal representing the overall request-response status of all caches in the set of multiple caches to which the request was issued.

[0028] A second embodiment of the technology described herein includes a method for operating a data processing system, the data processing system comprising:

[0029] Multiple processing units, which can be configured as different corresponding partitions of the processing units, wherein each partition includes a set of one or more processing units among the multiple processing units.

[0030] At least some of the plurality of processing units include one or more caches and corresponding cache controllers;

[0031] And a multicast communication network for routing cache communications to and from multiple caches within a partition of the processing units, the multicast communication network being configurable such that the cache controller of one of the processing units within the partition can be configured as a "master" cache controller for a set of multiple caches within the partition;

[0032] The method includes:

[0033] For a respective partition of multiple processing units:

[0034] The master cache controller for the partition simultaneously issuing a request to all caches in the set of multiple caches within the partition via the multicast communication network;

[0035] The multicast communication network combining the respective responses from all of the caches within a processing unit in the partition to which the request was issued, to generate a respective response signal for that processing unit;

[0036] The multicast communication network also combining the respective response signals from the different processing units within the partition and providing the combined response signal to the master cache controller; and

[0037] The master cache controller using the combined response signal to determine the overall request-response status of all caches in the set of multiple caches to which the request was issued.
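By way of illustration only, the method steps above can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the names `Cache`, `ProcessingUnit` and `multicast`, the `responsive` flag, and the use of a boolean AND to combine responses are all hypothetical choices made for this example and are not taken from the described system.

```python
from dataclasses import dataclass, field


@dataclass
class Cache:
    """Hypothetical cache (e.g. a TLB) that records and answers requests."""
    name: str
    responsive: bool = True                  # assumption: models a cache that has not yet responded
    received: list[str] = field(default_factory=list)

    def receive(self, request: str) -> bool:
        self.received.append(request)
        return self.responsive


@dataclass
class ProcessingUnit:
    """Hypothetical processing unit holding one or more caches."""
    name: str
    caches: list[Cache] = field(default_factory=list)

    def deliver(self, request: str) -> bool:
        # Per-unit combination: build the full list first so that every
        # cache receives the request, then reduce to one response signal.
        responses = [cache.receive(request) for cache in self.caches]
        return all(responses)


def multicast(partition: list[ProcessingUnit], request: str) -> bool:
    """Issue `request` to every cache in the partition and return the
    single combined signal seen by the master cache controller."""
    unit_signals = [unit.deliver(request) for unit in partition]
    # Partition-level combination of the per-unit response signals.
    return all(unit_signals)


partition = [
    ProcessingUnit("unit0", [Cache("tlb0"), Cache("tlb1")]),
    ProcessingUnit("unit1", [Cache("tlb2")]),
]
combined = multicast(partition, "invalidate")
```

In this sketch the caller only ever inspects `combined`; it never needs to know how many caches the partition contains or which unit each cache belongs to.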

[0038] The technology described herein relates to a data processing system comprising multiple processing units (e.g., graphics processing units). The processing units within a group of processing units can be configured as respective "partitions" of processing units within the group, wherein each partition comprises a set or subset of one or more processing units of the group, and wherein (in one embodiment) at least some partitions comprise multiple processing units of the group.

[0039] In this way, processing units within a group can be organized into different processing unit partitions in a flexible and adaptive manner. For example, the organization of the processing units is not fixed when the system is manufactured, so a controller for the group (e.g., a software arbiter) can configure the allocation and organization of processing units into different partitions as needed and, in an embodiment, subsequently change that allocation in use.

[0040] The benefit here is that processing resources can therefore be dynamically partitioned across multiple processing units within a group, which can thus provide a more flexible and “configurable” data processing system.

[0041] For example, in one embodiment, the data processing system may include a group of multiple substantially similar, and in an embodiment functionally equivalent, processing units. The processing units within the group can then be flexibly configured, and in an embodiment reconfigured, as different partitions of processing units as needed, for example according to current processing requirements.

[0042] As will be discussed further below, this means that the data processing system of the technology described herein can provide a system in which a "pool" of processing units can be flexibly and adaptively allocated between, for example, safety-critical and non-safety-critical domains, and to virtual machines in order to provide processing capability for those virtual machines, while supporting both safety-critical and non-safety-critical operations in an efficient manner.

[0043] The processing units within a partition can operate in combination to generate a data processing output, for example, and in an embodiment, in a "master-slave" type arrangement, as will be explained further below. During processing operations using a partition, it may therefore be desirable to transmit various types of communication signals between the processing units.

[0044] To facilitate this, the processing units within a partition are, in one embodiment, connected to each other via communication bridges, enabling them to communicate with each other. Thus, each processing unit can, and in this embodiment does, have a corresponding set of communication interfaces allowing the processing unit to connect to other processing units within the partition.

[0045] Communication bridges between processing units within a partition, together with the internal communication networks (e.g., interconnections) within each processing unit in the partition, can define various communication networks for routing different types of communication between processing units within the partition. Therefore, processing units are configured to collectively form a communication network that allows communication to be routed between processing units within their respective partitions.

[0046] The technology described herein relates in particular to the communication networks used to route "multicast" cache communications to a set of multiple caches within such a data processing system, that is, a system including multiple processing units that can be configured as different partitions of processing units.

[0047] For example, in the technology described herein, at least some of the processing units, and in an embodiment each of the processing units, include one or more respective caches, such that a respective partition of the multiple processing units may include a corresponding set of multiple caches made up of the caches of the processing units within the partition. The set of multiple caches that operates according to the technology described herein, for example the set to which multicast communications can be issued, can therefore be (and typically is) distributed across different processing units within a partition, depending, for example, on which set or subset of processing units the partition contains and on the number and arrangement of caches within the partition.

[0048] The "multicast" cache communication network of the technology described herein can be provided in any suitable manner as needed. For example, in one embodiment, each processing unit has its own internal multicast cache communication network (which may be dedicated to multicast cache communications). When multiple processing units are configured into a respective partition, the multicast cache communication network for the partition can therefore be formed by the respective internal multicast cache communication networks together with appropriate communication bridges interconnecting different (e.g., logically adjacent) processing units within the partition. Thus, the processing units within a partition can be configured (logically) to collectively form a multicast communication network that allows multicast cache communications to be appropriately routed to the processing units within the partition.
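One way to picture a network formed from per-unit internal networks joined by bridges is as a linked chain of units. The Python sketch below assumes, purely for illustration, that the bridges link logically adjacent units into a simple chain and that each unit combines its own local response with the response arriving from the next unit along. The names `Unit`, `local_ok`, `downstream` and `link` are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Unit:
    """Hypothetical processing unit with an internal network and one bridge."""
    name: str
    local_ok: bool                       # combined response of this unit's own caches
    downstream: Optional[Unit] = None    # bridge to the logically adjacent unit, if any

    def combined_response(self) -> bool:
        # The last unit in the chain has nothing downstream to wait for.
        if self.downstream is None:
            downstream_ok = True
        else:
            downstream_ok = self.downstream.combined_response()
        return self.local_ok and downstream_ok


def link(units: list[Unit]) -> Unit:
    """Bridge logically adjacent units into a chain and return its head."""
    for upstream, downstream in zip(units, units[1:]):
        upstream.downstream = downstream
    return units[0]


head = link([Unit("first", True), Unit("middle", True), Unit("last", True)])
chain_ok = head.combined_response()
```

Because each unit only talks to its neighbour, the same per-unit logic works for a partition of any length in this sketch.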

[0049] As will be discussed further below, the applicant recognizes that at some point during a processing operation using a respective partition of processing units, it may be desirable to issue a request simultaneously (e.g., in a "multicast" manner) to all caches in the set of multiple caches within the partition.

[0050] In this regard, the applicant has recognized that, in a data processing system comprising multiple processing units that can be configured as different respective partitions of processing units, the cache communication network should ideally allow such multicast cache communications to be issued, in an embodiment in an efficient manner, across multiple different processing units within a partition (which processing units may, and in an embodiment do, operate in different (independent) clock domains).

[0051] The applicant further recognizes that, in order to achieve the desired overall flexibility and adaptability of the data processing system described above, the multicast cache communication network should also be able to accommodate the different possible partitions (configurations) of the multiple processing units, which may include, for example, different numbers and arrangements of caches, depending on the set or subset of processing units within the partition.

[0052] These requirements can be addressed using a multicast cache communication network for a data processing system according to the technology described herein.

[0053] In particular, and to summarize, the technology described herein provides a multicast communication network for routing cache communications to multiple caches, the network being configurable such that a cache controller of one of the processing units within a respective partition of the multiple processing units can be configured as a "master" cache controller for the set of multiple caches within the partition.

[0054] The cache communication network for the respective partition of the multiple processing units is then configured such that the designated master cache controller is operable to simultaneously issue a request to all caches in the set of multiple caches within the partition. Any responses to the request from the caches in the set can then be provided to the master cache controller via the multicast communication network.

[0055] In the technology described herein, the multicast cache communication network is configured to combine the responses from all of the caches within a processing unit to which the request was issued, thereby generating an overall response signal for the processing unit as a whole. As will be further explained below, a combined response signal is generated in this way for each processing unit, and the response signals of the multiple processing units in the partition are then themselves combined, for example, and in an embodiment, in sequence, to provide an overall combined response signal for the (entire) set of caches to which the request was issued.

[0056] Therefore, the multicast cache communication network of the technology described herein is operable to provide a "combined" cache response signal to the master cache controller, the combined response signal indicating the overall request-response status of all caches in the set of multiple caches to which the request was issued.

[0057] For example, the master cache controller may receive a (first) combined response signal indicating that all caches in the set of multiple caches to which the request was issued have received and acknowledged the request (so that the overall request-response status is that all caches have received and responded to the request).

[0058] In this case, the master cache controller can determine that all caches have received the request signal and, in an embodiment, the master cache controller can subsequently de-assert the request signal (for example, once the action that triggered the request, such as a memory update, has completed).

[0059] Correspondingly, and in one embodiment, the master cache controller may (subsequently) receive a (second) combined response signal indicating that all caches in the set of multiple caches have received and acknowledged the de-assertion of the request signal (so that the overall request-response status is that all caches have completed the request).

[0060] In this case, the master cache controller can determine that all caches have received and acknowledged the de-assertion of the request. It can then proceed with any further processing that was waiting for the request to complete.
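The sequence of a request, a first combined response, the withdrawal of the request, and a second combined response can be sketched as a small state machine. The Python sketch below is illustrative only: the phase names, the event strings and the `step` function are assumptions introduced for this example and do not describe actual signal names or hardware states.

```python
from enum import Enum, auto


class Phase(Enum):
    IDLE = auto()          # no request outstanding
    ASSERTED = auto()      # request issued, awaiting the first combined response
    ACKNOWLEDGED = auto()  # every cache has acknowledged the request
    DEASSERTED = auto()    # request withdrawn, awaiting the second combined response
    COMPLETE = auto()      # every cache has acknowledged the withdrawal


# Hypothetical transition table keyed by (current phase, event).
_TRANSITIONS = {
    (Phase.IDLE, "assert"): Phase.ASSERTED,
    (Phase.ASSERTED, "combined_response"): Phase.ACKNOWLEDGED,
    (Phase.ACKNOWLEDGED, "deassert"): Phase.DEASSERTED,
    (Phase.DEASSERTED, "combined_response"): Phase.COMPLETE,
}


def step(phase: Phase, event: str) -> Phase:
    """Return the next phase, rejecting events that are out of order."""
    try:
        return _TRANSITIONS[(phase, event)]
    except KeyError:
        raise ValueError(
            f"event {event!r} is not valid in phase {phase.name}"
        ) from None


phase = Phase.IDLE
for event in ("assert", "combined_response", "deassert", "combined_response"):
    phase = step(phase, event)
```

Note that the controller in this sketch reacts only to combined responses; no per-cache bookkeeping appears anywhere in the state machine.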

[0061] In this way, the master cache controller can handle such requests and responses at the level of the entire partition, that is, for all caches within the partition, meaning that it does not need to track (or know) which individual caches within the set of multiple caches have received or completed the request. This, in turn, means that the master cache controller does not need to know how many caches are in the set, or how the caches in the set are distributed across the processing units within the partition.

[0062] As will be explained further below, this arrangement is therefore particularly well-suited for use with the data processing systems of the types described above, including multiple processing units that can be configured as different corresponding partitions of processing units.

[0063] Therefore, the technology described herein provides a cache communication network that can effectively implement multicast cache communication within such data processing systems while maintaining the desired overall flexibility and configurability of the data processing system.

[0064] Therefore, the technology described herein offers various benefits compared to other data processing systems.

[0065] A data processing system may include any desired and suitable number of processing units. In an implementation, there may be four or eight processing units, but the data processing system may include more or fewer processing units as needed.

[0066] The processing units can be any suitable and desired form of processing unit. In an embodiment, a processing unit is a processing unit designed to perform a particular form or type of processing operation and, specifically, in an embodiment, is a hardware accelerator for that particular form or type of processing operation. Therefore, the processing unit can be, for example, any suitable and desired form of processing unit or accelerator, such as a video processing unit, a machine learning accelerator, a neural network processing unit, etc.

[0067] In an embodiment, the processing units are graphics processing units (GPUs). In this case, the graphics processing units of the data processing system can include any suitable and desired form of graphics processing unit. The graphics processing units can perform any suitable and desired form of graphics processing, such as rasterization-based rendering, ray tracing, hybrid ray tracing, etc.

[0068] The technology described herein will be described primarily with reference to the case where the processing units are graphics processing units. However, unless the context otherwise requires, the features described herein can be applied in the same or a similar way to other forms of processing unit, and the technology described herein extends to such systems using forms of processing unit other than graphics processing units.

[0069] The processing units can be used for all forms of output that the processing units can produce. Therefore, in the case of a graphics processing unit, this can be used when generating frames for display, render-to-texture outputs, etc. However, the technology described herein can also be used where the graphics processing unit is used to provide other processing operations and outputs, such as ones that may not be associated with a display or image. For example, the technology described herein can also be used in non-graphics use cases, such as ADAS (Advanced Driver Assistance Systems), which may not have a display and may process input data (e.g., sensor data, such as radar data) and/or output data (e.g., vehicle control data) that is not associated with an image. Generally, the technology described herein can be used for any desired graphics processor data processing operation, such as GPGPU (General Purpose GPU) operation.

[0070] As described above, a data processing system comprises a group of multiple (similar) processing units (e.g., multiple graphics processing units).

[0071] The individual processing units may be identical to or different from each other, for example in terms of their processing resources and capabilities. In some embodiments, each of the processing units includes substantially the same types of functional unit, but different processing units may have different numbers of functional units. In some embodiments, each processing unit in the group is substantially functionally equivalent and, in an embodiment, identical.

[0072] In an embodiment, at least some of the processing units are operable in combination with at least one other processing unit of the multiple processing units to generate a data processing output, and in some of these embodiments the processing units are also operable to generate a data processing output on their own (i.e., independently of any other processing unit of the multiple processing units).

[0073] The processing units can themselves be divided into one or more sets ("partitions") of one or more processing units, wherein each set (partition) of one or more processing units is operable to generate a data processing output independently of any other set (partition) of the one or more sets (partitions) of processing units.

[0074] In an embodiment, the processing units can be configured as partitions comprising any desired set or subset of one or more of the available processing units. That is, the multiple processing units can be partitioned according to any desired arrangement of the processing units.
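A partitioning of this kind can be checked with a few lines of Python. The sketch below is illustrative only and rests on assumptions not stated in the text: processing units are identified by integer indices, partitions are taken not to share units, and units may be left unassigned. The function name `validate_partitions` is hypothetical.

```python
def validate_partitions(num_units: int, partitions: list[set[int]]) -> None:
    """Raise ValueError unless every partition is a non-empty set of
    known units and (by assumption) no unit appears in two partitions."""
    known = set(range(num_units))
    assigned: set[int] = set()
    for index, partition in enumerate(partitions):
        if not partition:
            raise ValueError(f"partition {index} is empty")
        unknown = partition - known
        if unknown:
            raise ValueError(f"partition {index} names unknown units {sorted(unknown)}")
        overlap = partition & assigned
        if overlap:
            raise ValueError(f"units {sorted(overlap)} are already in another partition")
        assigned |= partition


# Four units arranged as one two-unit partition and two single-unit partitions.
validate_partitions(4, [{0, 1}, {2}, {3}])
# The same four units arranged as a single partition.
validate_partitions(4, [{0, 1, 2, 3}])
```

Both calls succeed because each arrangement uses only known units and never places a unit in two partitions.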

[0075] Processing units within a given partition can typically be organized and configured in any suitable manner as needed.

[0076] In one implementation, when a processing unit operates in combination with at least one other processing unit among a plurality of processing units to generate the same data processing output, the processing units of the set (partition) operate in a "master-slave" type arrangement, wherein one processing unit of the set (partition) operates as a master (primary) processing unit, which controls the processing operations on one or more other processing units, each acting as a slave (secondary) processing unit.

[0077] For example, in the case where a partition comprises multiple processing units, one of the processing units is, in an embodiment, configured as the "master" processing unit, which provides the software interface to the virtual machine for itself and for a set of one or more slave processing units. This has the advantage that, to any virtual machine using the partition, it still appears as though only a single processing unit exists.

[0078] Therefore, when a processing unit is used as a "master" processing unit, in an embodiment the management unit of the master processing unit (e.g., a "job manager") provides the software interface (e.g., to a driver for the virtual machine in question) for the master processing unit and its corresponding set of linked slave processing units.

[0079] Similarly, in an embodiment, the management unit of the master processing unit is configured to distribute processing tasks across the master processing unit and the slave processing units (although the arrangement is such that, from the software (driver) side, processing tasks are still only directed and sent to a single processing unit).
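The idea that software submits work to a single unit while that unit spreads the work over itself and its slaves can be sketched as follows. This Python sketch is illustrative only: the class name `MasterUnit`, the `submit` method and the round-robin distribution policy are assumptions made for this example, and the text does not specify how tasks are actually distributed.

```python
from itertools import cycle


class MasterUnit:
    """Hypothetical master unit that fronts a set of slave units."""

    def __init__(self, name: str, slaves: list[str]) -> None:
        # The master takes part in processing alongside its slaves.
        self.units = [name, *slaves]

    def submit(self, tasks: list[str]) -> dict[str, list[str]]:
        """Single entry point seen by the driver. Tasks are spread
        round-robin (an assumed policy) over the master and its slaves."""
        assignment: dict[str, list[str]] = {unit: [] for unit in self.units}
        for task, unit in zip(tasks, cycle(self.units)):
            assignment[unit].append(task)
        return assignment


master = MasterUnit("unit0", ["unit1", "unit2"])
assignment = master.submit(["t0", "t1", "t2", "t3"])
```

The driver in this sketch calls only `master.submit`; how many slave units sit behind the master is invisible to it.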

[0080] Correspondingly, when a processing unit operates in slave mode (as a subordinate processing unit controlled by another, master, processing unit), the processing unit is configured accordingly. For example, in one embodiment, any functional units that are redundant in a "slave" processing unit are made "inactive" when the processing unit is configured to operate as a "slave".

[0081] In one embodiment, an arbiter for a group of processing units can reconfigure, in use, the allocation of the processing units in the group to the respective partitions of the group. In this case, in an embodiment, a given processing unit and/or partition is reset and/or powered down (and then restarted) during reconfiguration. Correspondingly, if a virtual machine has access to a partition that is to be reconfigured, in an embodiment the virtual machine is allowed to stop and pause appropriately prior to the reconfiguration.

[0082] Once the processing units have been configured into the respective partitions (subsets) of processing units, virtual machines using the processing units can be allowed to access the partitions of the group's processing units so that those partitions can perform processing operations for the virtual machines.

[0083] A virtual machine that accesses a set of processing units can take any suitable and desired form. For example, a virtual machine can execute one or more applications and / or can be implemented by an application itself. Virtual machines (and applications, for example) can run on any desired and suitable processor, such as one or more (e.g., host) processors (e.g., central processing units) of a data processing system.

[0084] The technology described herein relates specifically to the routing of multicast cache communications within a partition of processing units.

[0085] For example, a typical data processing unit may include one or more caches, for example for storing data locally within the processing unit in order to reduce and/or accelerate external memory accesses. A corresponding cache controller may also be provided in the processing unit for controlling cache operations, for example for managing cache coherency across multiple caches and for handling requests from the caches to retrieve data from external memory. This is the case for at least some of the processing units in the data processing system of the technology described herein (and, in an embodiment, for all of the processing units in the group, which in an embodiment are substantially functionally equivalent, e.g. identical).

[0086] Therefore, in the technology described herein, at least some of the processing units within a group of processing units, and in an embodiment all of the processing units, include one or more caches and corresponding cache controllers.

[0087] In one embodiment, at least some of the processing units, and in an embodiment all of the processing units, include multiple functional units (e.g., but not limited to, one or more execution units (e.g., shader cores), management units (e.g., job managers), caches (e.g., L2 caches) that provide an interface to external memory, tiler units, etc.).

[0088] At least some of the functional units may themselves include address translation caches (e.g., translation lookaside buffers (TLBs)) that store recently used translations from virtual memory addresses to the physical memory addresses used by the functional unit. For example, all memory addresses issued by software and operated on by the functional unit can be "virtual" memory addresses. To access external physical memory, these virtual addresses must therefore be translated into physical memory addresses.
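The role of an address translation cache can be illustrated with a short Python sketch. The sketch assumes, for illustration only, a flat page table held in a dictionary and a 4096-byte page size; the class name `TLB` and its `hits` and `misses` counters are hypothetical and are not taken from the described system.

```python
class TLB:
    """Hypothetical translation lookaside buffer in front of a page table."""

    def __init__(self, page_table: dict[int, int], page_size: int = 4096) -> None:
        self.page_table = page_table          # virtual page number -> physical page number
        self.page_size = page_size            # assumed page size in bytes
        self.entries: dict[int, int] = {}     # recently used translations
        self.hits = 0
        self.misses = 0

    def translate(self, virtual_address: int) -> int:
        page, offset = divmod(virtual_address, self.page_size)
        if page in self.entries:
            self.hits += 1
        else:
            # Miss: fetch the translation from the page table and cache it.
            # An unmapped page raises KeyError in this sketch.
            self.misses += 1
            self.entries[page] = self.page_table[page]
        return self.entries[page] * self.page_size + offset


tlb = TLB({0: 5, 1: 9})
first = tlb.translate(4100)   # virtual page 1, offset 4: miss, then cached
second = tlb.translate(4100)  # same page again: served from the TLB
```

Here the first lookup misses and populates the cache, and the second lookup for the same page is a hit, so the page table is consulted only once.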

[0089] In one implementation, the communication can be routed to a multicast communication network and the multiple cache sets that operate in the manner described herein include a collection of multiple address translation caches (e.g., TLBs), and the corresponding cache controller includes a memory management unit (MMU) that can operate against the MMU, for example in a normal manner, to determine the corresponding virtual-to-physical memory address translation.

[0090] Therefore, in the implementation, the (main) cache controller is a (main) memory management unit (MMU), and the (main) cache controller is operable to issue multicast requests to a set of address translation caches (e.g., TLBs) within the functional units of the processing units within the partition (and in an implementation, to all such caches).

[0091] Therefore, any reference to “cache” or “cache controller” in this document may refer to “address translation cache” (such as TLB) and “memory management unit”, respectively.

[0092] However, while the techniques described herein may find particular utility in this situation, it should be understood that the multicast cache communication network described herein can generally be applied to any suitable set of multiple caches.

[0093] Therefore, in the data processing system of the technology described herein, a respective partition of multiple processing units can, and typically will, include a corresponding set of multiple caches (e.g., TLBs), wherein at least some of the caches (TLBs) in the set reside on different processing units within the partition. Typically, each respective partition can have a different set of multiple caches, made up of the caches of the processing units within that partition, so different partitions may contain different numbers of caches, for example as determined by the processing units within the partition.

[0094] The applicant has recognized that there may be a need to be able to transmit signals simultaneously, in a “multicast” manner, to all caches in the set of multiple caches within a partition of multiple processing units (e.g., to all TLBs within the partition).

[0095] This may be the case in particular when it is desired to invalidate all caches in the set of multiple caches within a partition. For example, in the case of a set of multiple TLBs, this may be needed when a new set of memory address mappings is required. This might occur when the memory address mappings accessed via the caches need to be updated, for example due to a memory allocation failure, and / or when new memory address mappings need to be configured for a new software workload or in response to a system reset.

[0096] In such cases, any cache that may access external memory should, in one implementation, be "locked" to prevent external memory access during the update, and then "invalidated" (flushed) to ensure that any out-of-date cached content, such as memory address mappings in the case of a TLB, is cleared from the cache, so that when the lock is removed, processing can continue using the updated content.

[0097] To do this, in one implementation, a cache invalidation request (signal) is simultaneously transmitted to all caches that need to be invalidated.

[0098] Therefore, in the implementation, the request transmitted by the main cache controller to multiple cache sets within the partition includes a request to invalidate all caches in the multiple cache sets.

[0099] The techniques described herein will therefore be described primarily with reference to the case where the request transmitted by the primary cache controller over the multicast cache communication network is a cache invalidation request.

[0100] However, unless the context otherwise requires, the features described herein can be applied in the same or a similar way to other forms of request that can suitably and desirably be multicast simultaneously to multiple caches within a partition, such as, but not limited to, cache “clear” requests or memory barriers for enforcing ordering.

[0101] As described above, in the techniques described herein, at least some of the processing units, and in one embodiment, each of the processing units, has a corresponding (local) cache controller that is operable to control an associated set of one or more caches of the processing unit.

[0102] Therefore, when a partition contains only a single processing unit, the corresponding (local) cache controller controls the operation of the cache associated with the processing unit.

[0103] However, in the technique described herein, when a partition contains multiple processing units, rather than the separate cache controller of each processing unit in the partition individually controlling the operation of the caches of that processing unit (as would be the case, for example, when the processing units operate independently of each other), the cache controller of one of the processing units in the partition is configured as the “master” cache controller for the partition.

[0104] The primary cache controller can then operate to control the set of multiple caches within the partition, including the caches of its own processing unit (the processing unit whose cache controller is designated as the primary cache controller) and the caches of any other (slave) processing units within the partition. In this case, the cache controller designated as the primary cache controller is active, while the corresponding cache controllers on the other processing units within the partition are, in the implementation, "inactive".

[0105] The set of multiple caches operated in accordance with the techniques described herein may include any suitable set or subset of the caches of the processing units within a respective partition. In one embodiment, the set includes all caches of a certain type from the processing units within the partition, for example, and in one embodiment, all of the address translation caches (TLBs) of the processing units within the partition. However, various other arrangements will be possible.

[0106] In the case where the processing units in the partition are arranged in a “master-slave” configuration, as in the implementation, the master cache controller in one implementation is a cache controller residing on the master processing unit (and for ease of explanation, the following description will primarily refer to the master cache controller as a cache controller on the master processing unit).

[0107] However, it is not necessary for the main cache controller to reside on the main processing unit. For example, the main processing unit could, in principle, configure the cache controller of a slave processing unit as the main cache controller for the partition, with that main cache controller on the slave processing unit then operating under the control of the main processing unit.

[0108] Furthermore, for some partitions of the processing unit, there can be multiple main processing units and / or multiple main cache controllers, for example, each main cache controller controls a corresponding subset of the cache within the partition. In this respect, various arrangements will be possible.

[0109] In the technique described herein, the main cache controller is operable to simultaneously send communications to all caches within a set of multiple caches in its respective partition. This type of communication can also be referred to as "multicast" (or "broadcast") communication.

[0110] This is the type of multicast cache communication specifically addressed by the techniques described herein. For example, as mentioned above, a particular example of this would be when it is desired to invalidate multiple caches (e.g., all caches) within a partition simultaneously. However, various other arrangements will be possible.

[0111] When the primary cache controller issues such a request to all caches in the set of multiple caches within its respective partition, the request is transmitted via the multicast communication network to each of the processing units, and to each cache in the set residing on each of those processing units. In the implementation, the (handshake) protocol for handling such multicast requests is as follows.

[0112] First, the main cache controller activates the request signal (e.g., and in an implementation, by setting the request signal high (e.g., to "1")). The high request signal is transmitted to all caches in multiple cache sets via a multicast communication network.

[0113] When a cache in the set of multiple caches to which the request was issued receives the high request signal, the cache processes the request (e.g., locks transactions and invalidates the cache, in the case of an invalidation request), and after completing the request, the cache is, in the implementation, operable to issue an acknowledgment to the main cache controller. The cache therefore returns an appropriate response signal along the cache communication network to the main cache controller. For example, and in the implementation, when a cache in the set receives the high request signal, the corresponding cache response signal is set high.

[0114] Similarly, when the primary cache controller subsequently de-asserts the request (e.g., after all caches have received and completed the request, such as when all caches have been invalidated in the case of a cache invalidation request, and, in the implementation, after whatever action triggered the request has completed), the request signal is, in the implementation, set low (to "0"). The low request signal is then transmitted via the multicast cache communication network to all caches in the set of multiple caches, and in the implementation, the corresponding cache response signal is set low when the low request signal is received by a cache.

[0115] When a cache receives the low request signal, it processes the de-asserted request accordingly (e.g., by removing the lock) and is then able to continue its normal operation.

[0116] The cache thus in effect sends a second, further signal to the main cache controller to indicate that the cache is no longer responding (or in other words, to indicate that the cache has received and processed the low (de-asserted) request signal).
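
The cache side of the handshake described above can be sketched as follows. This is a minimal Python model under stated assumptions: the class name `CacheEndpoint`, its fields, and the choice to invalidate immediately on receipt are hypothetical simplifications, and the active-high signalling follows the example polarity given in the text.

```python
class CacheEndpoint:
    """Toy model of one cache's side of the request/response handshake."""

    def __init__(self):
        self.locked = False
        self.entries = {"va0": "pa0"}  # hypothetical cached content
        self.response = 0              # response signal driven by this cache

    def on_request(self, request):
        if request == 1 and self.response == 0:
            # Request asserted: lock, invalidate, then acknowledge (response high).
            self.locked = True
            self.entries.clear()
            self.response = 1
        elif request == 0 and self.response == 1:
            # Request de-asserted: remove the lock and drop the response.
            self.locked = False
            self.response = 0
        return self.response
```

A cache modelled this way keeps its response high for as long as the request stays asserted, so the falling edge of the response serves as the second signal that tells the controller the cache has resumed normal operation.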

[0117] However, other suitable handshake protocols can be used. For example, instead of setting the signal high when the request is asserted, as mentioned above, the signal can be set low (and then set high again when the signal is deasserted). However, various other arrangements will be possible.

[0118] The main cache controller can then monitor the request response status of (all) caches in the multiple cache sets, and act accordingly.

[0119] For example, when the main cache controller determines that a request has been received by all caches in the set of multiple caches to which the request was issued, the main cache controller can subsequently de-assert the request signal (e.g., and in an implementation, once any desired action / update has been performed).

[0120] Therefore, when the primary cache controller operates to issue a request to all caches in the set of multiple caches within its respective partition, the primary cache controller, in the implementation, asserts the corresponding request signal and then waits until it is determined that the asserted request signal has been received by all caches in the set before de-asserting the request signal.

[0121] After the cache receives such a request from the main cache controller, the request should then be handled accordingly by the cache (e.g., by invalidating its cache entries, in the case that the request is a cache invalidation request).

[0122] While a request is being completed by the set of multiple caches, a cache may, in one implementation, be locked, so that any processing that uses the cache may have to wait until the request is complete, at which point the cache can be unlocked. For example, in the case of a cache invalidation request triggered by a memory update, processing should not use any cached content (e.g., memory address mappings in the case of a TLB) while that content is out of date. Locking the cache while the request is pending ensures that this will not happen.

[0123] Therefore, after determining that all caches in the set of multiple caches have received the request, the main cache controller should wait until it is determined that all caches have completed the request and are no longer responding to it (so that it is safe to continue processing) before allowing processing to continue, for example by removing any locks.

[0124] To monitor the status of requests, the primary cache controller therefore monitors the request response status of the caches within the partitions. In more conventional data processing systems, this could be accomplished by directly monitoring responses from each individual cache, etc.

[0125] However, in the configurable data processing system described herein, the cache communication protocol should ideally be able to accommodate different partitions containing different numbers of processing units, so that attempting to track responses from each individual cache could require a relatively complex communication network.

[0126] Therefore, in the technique described herein, instead of directly monitoring the response from each individual cache within the set of multiple caches to which the request was issued, the main cache controller monitors a combined response signal that is provided by the multicast cache communication network and that indicates the overall request response status of all caches in the set.

[0127] In the technique described herein, this is accomplished by configuring the multicast cache communication network such that the responses from all caches of a (and each) processing unit to which the request was issued are combined to give a response signal for the processing unit as a whole. Then, in an implementation, the multicast communication network further combines the combined response signals from the respective processing units with one another, such that an overall combined response signal representing the combined responses from all caches of all processing units in the partition is provided to the main cache controller. The overall combined response signal represents the overall request response state of all caches in the set of multiple caches to which the request was issued (e.g., and not the individual request response state of any single cache in that set).

[0128] In other words, in the technique described herein, instead of attempting to track which response signal is associated with which individual cache, the response signals of all caches within each processing unit are combined, and in an implementation this is done at each processing unit in the partition for the response signals of that processing unit itself. These per-unit signals are then combined to give an overall combined response signal for all caches across all processing units, which is then provided to the main cache controller.

[0129] For example, in one implementation, the processing units within a partition of multiple processing units are arranged as a linear sequence of processing units (e.g., in a “daisy-chain” type arrangement), and in another implementation, the multicast cache communication network is arranged such that communication of multiple cache sets passes linearly along the sequence of processing units, for example, such that a processing unit will receive signals (only) from an adjacent processing unit.

[0130] Therefore, in the implementation, the multicast communication network is arranged such that cached communication is routed sequentially to and from processing units within a partition of multiple processing units, such that each processing unit in the partition receives signals from its adjacent processing units in the sequence. This means that in a given sequence of multiple processing units, each processing unit will be connected to only one or two adjacent processing units (depending on its position in the sequence).

[0131] Therefore, the main processing unit can be directly connected to only one of the processing units in its corresponding partition (the first processing unit in the sequence), while the other processing units in the partition are only indirectly connected through intermediate processing units in the sequence.

[0132] In one implementation, the multicast cache communication network is arranged such that each processing unit in the sequence of processing units of a partition combines the response signals from all of its own caches with a corresponding combined response signal provided from an adjacent processing unit in the sequence. Thus, in one implementation, the multicast communication network is configured to combine the responses from all caches of a processing unit to which the request was issued with the combined response signal provided by the previous adjacent processing unit in the sequence. In one implementation, this is done at each processing unit in the sequence to generate a combined response signal for that processing unit, representing the overall request response status of all caches within the processing unit and within any previous processing units in the sequence. The combined response signal from the processing unit is then provided as input to the next adjacent processing unit in the sequence, and so on.

[0133] This means that signals from all caches in each processing unit in the sequence are combined sequentially in one implementation to generate an overall combined response signal, which is then returned to the main cache controller along the processing unit sequence. Therefore, the combined response signal ultimately provided to the main cache controller represents the overall request response status of all caches from all processing units in the processing unit sequence.

[0134] This may be particularly beneficial in the context of the technology described herein, where processing units can be configured as different partitions, such that the main cache controller may have to issue communications to, and monitor communications from, multiple caches that may reside on different processing units, and may have to do so in different partitions with different numbers of caches.

[0135] For example, using combined response signals in this way avoids the need for a primary cache controller to individually track the state of any particular cache in the multiple cache sets that issue the request when such multicast requests are made, and thus simplifies the cache communication network and facilitates improved system flexibility.

[0136] In the implementation, each of the processing units therefore includes logic for combining the cache response signals from its own associated caches with the combined cache response signal from the previous adjacent processing unit in the sequence of processing units, and for providing the resulting combined cache response signal to the next adjacent processing unit in the sequence (and ultimately back to the main cache controller).

[0137] For example, in one implementation, each of the processing units includes first logic operable to perform an AND operation on the cache responses from the corresponding caches on the processing unit itself, together with the combined response signal received from an adjacent processing unit in the sequence. Therefore, when all signals are high, for example because all caches are currently responding to a request (e.g., in one implementation, such that all cache response signals are set high), the AND signal will be set accordingly (e.g., set high), and the setting of the AND signal will indicate that all caches in the current processing unit and all caches in previous processing units in the sequence are currently responding.

[0138] As described above, in an implementation this is done at each processing unit in the sequence, including the main processing unit, such that the main cache controller receives a first combined response signal that is effectively generated from the AND of all cache response signals from all processing units in the sequence.

[0139] Therefore, the first combined response signal received by the primary cache controller indicates whether all caches are currently responding. Thus, when the first combined response signal (the AND signal) is first set, the primary cache controller can determine that all caches have received the request, for example, allowing the request signal to be de-asserted. This ensures that the request remains pending until each cache has acknowledged it.

[0140] In one implementation, each of the processing units further includes second logic operable to perform an OR operation on the cache responses from the corresponding caches on the processing unit itself, together with the combined response signal received from the next processing unit in the chain. Thus, the second logic residing on the main processing unit is operable to provide a second combined response signal (OR signal) indicating whether any cache in the set of multiple caches is currently responding (and thus, in an implementation, indicating whether any cache response signal is currently set high).

[0141] For example, when the OR signal is first set, it means that at least one cache in the set of multiple caches to which the request was issued is currently responding. Correspondingly, when the OR signal goes low (after it has first been set), it means that none of the caches are currently responding, for example because all cache response signals have been de-asserted (in response to the request signal being de-asserted). Therefore, the main cache controller can determine that all caches have received the de-asserted request signal and that it is thus safe to continue processing.
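
The AND and OR combining logic along the chain can be illustrated with a short Python sketch. The function names and the list-of-lists representation are hypothetical. The tie-off values at the far end of the chain (1 for the AND input, 0 for the OR input) are an assumption chosen so that they do not affect the result; the patent does not specify them in this passage.

```python
def combine_unit(cache_responses, and_in, or_in):
    """Combine one processing unit's cache responses with the incoming chain signals."""
    and_out, or_out = and_in, or_in
    for r in cache_responses:
        and_out &= r  # first logic: every cache so far is responding
        or_out |= r   # second logic: at least one cache so far is responding
    return and_out, or_out


def combine_chain(units):
    """units[0] is the master's unit; later entries are further along the chain.

    Each entry is a list of 0/1 response signals from that unit's caches.
    """
    and_sig, or_sig = 1, 0  # assumed tie-off values at the end of the chain
    for responses in reversed(units):  # fold from the far end back to the master
        and_sig, or_sig = combine_unit(responses, and_sig, or_sig)
    return and_sig, or_sig
```

For example, `combine_chain([[1, 0], [1]])` yields an AND signal of 0 and an OR signal of 1: some caches are responding, but not all. The result does not depend on how many units or caches the partition contains.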

[0142] Therefore, in one implementation, each processing unit has associated first logic for generating a first combined response signal, the first combined response signal indicating whether all caches in the set of multiple caches to which the request was issued are currently responding to the request. In another implementation, each processing unit also has associated second logic for generating a second combined response signal, the second combined response signal indicating whether any cache in the set of caches to which the request was issued is currently responding to the request.

[0143] The effect of all this is that the multicast cache communication network is arranged to provide the master cache controller with one or more combined response signals indicating the overall request response status of all caches in the partition. For example, and in an embodiment, the determined overall request response status could be that all caches in the set of multiple caches have received the asserted request (or, correspondingly, that they have not all yet received the request) and / or that the request has been de-asserted for all caches in the set, for example such that all caches have completed the request (or, correspondingly, that they have not all yet completed the request). In an embodiment, as discussed above, two separate combined response signals (e.g., an AND signal and an OR signal) are provided to the master cache controller.

[0144] Therefore, in the implementation, the main cache controller is operable to determine the overall request response status of all caches in the set of multiple caches from the combined response signal(s).

[0145] For example, in an embodiment, the primary cache controller is configured to determine the current overall request response state of all caches to which the request was issued from the first combined response signal and / or the second combined response signal. In one embodiment, the determined overall request response state is one or more of the following: (i) no cache in the set of multiple caches has received the request; (ii) at least some caches in the set have received the request; (iii) all caches in the set to which the request was issued have received the request; (iv) at least some caches in the set have completed the request; and (v) all caches in the set to which the request was issued have completed the request.
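
One possible way to derive states (i) to (v) from the two combined signals is sketched below in Python. The mapping is an interpretation, not something the patent spells out: it assumes active-high signals and assumes the controller only de-asserts the request after the AND signal has been set, so that the current level of the request signal tells the two phases apart. The names `OverallState` and `overall_state` are hypothetical.

```python
from enum import Enum


class OverallState(Enum):
    NONE_RECEIVED = "i"
    SOME_RECEIVED = "ii"
    ALL_RECEIVED = "iii"
    SOME_COMPLETED = "iv"
    ALL_COMPLETED = "v"


def overall_state(request, and_sig, or_sig):
    """Classify the overall state from the request level and the AND/OR signals."""
    if and_sig:
        # Every cache is responding, whether or not the request is still asserted.
        return OverallState.ALL_RECEIVED
    if request:
        # Request still asserted: caches are in the process of acknowledging.
        return OverallState.SOME_RECEIVED if or_sig else OverallState.NONE_RECEIVED
    # Request de-asserted: caches are in the process of dropping their responses.
    return OverallState.SOME_COMPLETED if or_sig else OverallState.ALL_COMPLETED
```

Note that in this sketch an idle system (no request ever issued) reads the same as state (v), so a real controller would also need to remember whether a request is in flight.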

[0146] In the implementation, the main cache controller is operable to determine these states (at least) in this order. For example, in the implementation, the main cache controller first determines whether all caches in the multiple cache sets have received the request. After that point, the request signal can be deasserted. Then, the main cache controller determines whether the deasserted request signal has been received by all caches in the multiple cache sets, for example, such that all caches have completed the request (e.g., processing can resume, or, for example, the next request can be issued, etc.).

[0147] As described above, in the implementation this is done using two combined response signals: a first signal that is an AND combination of the responses from all caches in the set of multiple caches (such that when the AND signal is high, it is determined that all caches have received the request), and a second signal that is an OR combination of the responses from all caches in the set (such that when the OR signal returns low (after it has previously been set), all caches have stopped responding, i.e., it is determined that all caches have completed the request).

[0148] Therefore, in the implementation, the overall request protocol is as follows: when the primary cache controller issues a request, it asserts the request signal; when the first cache responds, the OR signal will be set, but this does not cause any change; instead, the primary cache controller waits for the AND signal to be set before de-asserting the request signal, because the setting of the AND signal confirms that all caches have received the request; then, the primary cache controller waits for the OR signal to fall, because this confirms that all caches have completed the request.
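
The controller side of this protocol can be simulated in a few lines of Python. This is a sketch under stated assumptions: time is modelled as discrete steps, each cache follows the request signal after its own fixed delay, and the function name `run_multicast_request` and the `cache_delays` parameter are hypothetical.

```python
def run_multicast_request(cache_delays):
    """Simulate one multicast request and return a trace of (request, AND, OR).

    cache_delays[i] is the assumed number of extra steps cache i takes to
    follow a change in the request signal.
    """
    n = len(cache_delays)
    responses = [0] * n
    timers = list(cache_delays)
    request = 1  # the controller asserts the request
    trace = []
    while True:
        # Each cache drives its response to match the request after its delay.
        for i in range(n):
            if responses[i] != request:
                if timers[i] == 0:
                    responses[i] = request
                    timers[i] = cache_delays[i]
                else:
                    timers[i] -= 1
        and_sig = int(all(responses))
        or_sig = int(any(responses))
        trace.append((request, and_sig, or_sig))
        if request == 1 and and_sig == 1:
            request = 0   # all caches have received the request: de-assert
        elif request == 0 and or_sig == 0:
            return trace  # all caches have completed: safe to continue


trace = run_multicast_request([0, 2])
```

The controller loop only ever looks at the AND and OR signals, never at the individual entries of `responses`, which mirrors the point that it does not need to know how many caches the partition contains.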

[0149] The benefit of this approach is that the primary cache controller does not need to know how many caches are in a partition or which caches are responding, because it only needs overall knowledge of when all caches have responded and when all caches have completed the request.

[0150] Therefore, the multicast cache communication protocol described above can allow for an improved, for example simplified, multicast communication protocol, in which, in one implementation, the protocol does not need to know which caches are actually responding. This also contributes to increased flexibility, as each processing unit only needs to connect to its adjacent processing units in the sequence of processing units of a partition, such as in a "daisy-chain" arrangement, rather than having to connect directly to every processing unit.

[0151] The aforementioned communication protocol can also support asynchronous communication because it can use a relatively simple handshake protocol, for example, simply relying on a combination of response signals set to high / low, as described above.

[0152] This avoids the need for clock synchronization between the controller and the caches (since the request-response interface is synchronized with the controller). Therefore, this approach is particularly suitable for cache communication across different processing units, where the processing units typically, and in implementations, have independent clock domains. In this case, one or more synchronizers can be provided as needed. For example, each processing unit may include one or more synchronizers (e.g., double flip-flop synchronizers) to handle the clock domain crossings between processing units within a partition. As another example, synchronization can be performed by the main processing unit.
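
A synchronizer of the kind mentioned above is commonly built from two flip-flops in series in the receiving clock domain. The following Python model is only a behavioural sketch of that general technique (it cannot model metastability itself), and the class name `TwoFlopSynchronizer` is hypothetical.

```python
class TwoFlopSynchronizer:
    """Behavioural model of a two-stage flip-flop synchronizer."""

    def __init__(self):
        self.stage1 = 0
        self.stage2 = 0

    def clock(self, async_in):
        # On each receiving-domain clock edge, shift the sample one stage along.
        self.stage2 = self.stage1
        self.stage1 = async_in
        return self.stage2  # only the second stage is seen by downstream logic


sync = TwoFlopSynchronizer()
outputs = [sync.clock(1) for _ in range(3)]
```

Because the level-based handshake holds each signal steady until it is acknowledged, the extra clock of latency added by the second stage does not affect correctness, only timing.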

[0153] However, other suitable multicast communication protocols may be used as needed.

[0154] The multicast cache communication network in the implementation includes at least a first communication line for transmitting request signals to the caches in the set of multiple caches and a second or additional communication line for transmitting response signals from the caches to the main cache controller. In one implementation, the communication network is arranged such that the communication lines for both request and response signals are sequentially connected, for example, in a "daisy-chain" arrangement.

[0155] Request / response signals are transmitted via a communication network. Therefore, the communication network is configured to process such transactions. In some implementations, the communication network is dedicated to such multicast communications, for example, and in some implementations, the communication network is configured only to process such multicast transactions. (In this case, additional communication networks can be provided for other communications, as will be explained further below).

[0156] In some cases, the multicast communication network may be able to route multicast communications to a specified subset (less than all) of the caches of the processing units within a partition. In this scenario, the set of multiple caches to which the main cache controller can issue multicast signals may be a subset (less than all) of all caches within the partition.

[0157] However, in the implementation, the multicast communication network is configured for, and in an implementation only for, "broadcast" communication to all caches within a partition, or at least to all caches of a certain type (e.g., all TLBs from all processing units within the partition).

[0158] In some implementations, as described above, the communication network is configured to handle cache invalidation requests (e.g., and in some implementations, solely for handling cache invalidation requests). For example, providing a dedicated network for cache invalidation requests may be beneficial, for instance, to prevent cache invalidation requests from interfering with other communications within the network.

[0159] A cache invalidation request can be triggered in any suitable manner, as needed. For example, in the case of a set of multiple TLBs, a new set of memory address mappings may be required. This could be after a system reset. Alternatively, it could be in response to a memory allocation failure. As another example, this could be triggered by a new software processing job that requires reconfiguration of the memory mappings. In implementations, a cache invalidation request is a request that invalidates all data in the cache. However, this need not be the case.

[0160] For example, while in some implementations the invalidation request / response signals may be the only communication on the multicast communication network, the network may also include additional channels for providing other types of information. For instance, the network may have a channel for providing identification information along with the cache invalidation request, identifying which specific data is to be invalidated. Therefore, instead of simply invalidating all data in the cache, each cache may identify the specific cache lines storing data corresponding to the identification information, and then invalidate those lines while leaving the others valid. For example, the identification information may identify the address or group of addresses of the data to be invalidated, or the address space or context for which the data is to be invalidated.
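
Selective invalidation based on identification information can be sketched as follows in Python. The example assumes, purely for illustration, that cache lines are keyed by an (address space identifier, virtual page number) pair and that the identification information is an address space identifier; the function name `invalidate` and this keying scheme are hypothetical.

```python
def invalidate(cache_lines, asid=None):
    """Invalidate cached translations in place.

    cache_lines: dict keyed by (asid, vpn) tuples (an assumed layout).
    asid: if None, invalidate everything; otherwise only matching lines.
    """
    if asid is None:
        cache_lines.clear()
        return
    # Collect matching keys first so the dict is not mutated while iterating.
    for key in [k for k in cache_lines if k[0] == asid]:
        del cache_lines[key]


lines = {(1, 0x10): 0xA0, (1, 0x11): 0xA1, (2, 0x10): 0xB0}
invalidate(lines, asid=1)  # only address space 1 is invalidated
```

After the call, `lines` still holds the entry for address space 2, so work in that address space does not have to refill its translations.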

[0161] In one implementation, the multicast cache communication network can itself also be configured for different partitions of the processing units, so as to allow the cache controllers of different processing units within the group to be configured as primary cache controllers for different corresponding partitions.

[0162] Providing a reconfigurable cache communication network can increase the overall flexibility of the data processing system. For example, it avoids the need to fix the communication network during system manufacturing, because the communication network can be configured as needed and, in an implementation, reconfigured during use, such that the requirements of the multicast cache communication network do not restrict the configuration of processing units into corresponding partitions.

[0163] For example, in this approach, which benefits from the techniques described herein, all processing units within the group may be, and in one implementation are, substantially functionally equivalent to each other (and in another implementation are identical to each other), such that any processing unit can be configured as needed to act as a master processing unit or a slave processing unit, with the communication network then being configured accordingly based on the organization of the processing units within the partition.

[0164] For example, as described above, each processing unit in the implementation includes one or more communication bridges for connecting the processing unit to other processing units in its corresponding partition, for example, in a "daisy-chain" arrangement, such that each processing unit is connected (only) to an adjacent processing unit in the chain. Therefore, each processing unit can have a corresponding set of communication interfaces, allowing communication with the respective adjacent processing units over the communication bridges.

[0165] To allow the multicast communication network to be configured for different partitions in the manner described above, each of the processing units may, and in an implementation does, include a set of one or more isolation switches, allowing communication interfaces to be selectively isolated, for example, to prevent communication from said communication interface. This isolation logic can therefore ensure proper behavior at the power domain boundaries of the partitions, for example by coupling the first and last units in the chain to ground or by looping requests and responses back as needed.

[0166] Therefore, in the implementation, each processing unit includes one or more network interfaces for communicating with adjacent processing units in the partition, and each processing unit further includes a set of one or more isolation switches that can be configured to isolate communication from the respective network interfaces.

[0167] For example, in one implementation, the master processing unit is located at a logical end of the sequence of processing units. An isolation switch on one side of the master processing unit can therefore be configured to isolate the master processing unit from communication from that side (because no communication should arrive from that side when the master processing unit is at the end of the sequence). Correspondingly, the final slave processing unit, located at the other end of the sequence, can be isolated at that other end. Any intermediate "slave" processing units located between the master processing unit and the final processing unit can be opened to communicate on both sides, allowing communication to be passed from processing unit to processing unit along the sequence.

[0168] Therefore, each processing unit can be configured as a master processing unit or a slave processing unit in the implementation scheme, and the communication network can be appropriately configured, for example, by setting up relevant isolation switches, based on the position of the processing unit within the processing unit sequence.
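
By way of illustration only, the way isolation switch settings follow from a processing unit's position in a daisy-chained sequence can be sketched as below. The names `UnitLinkConfig`, `upstream_open`, `downstream_open` and `configure_chain` are hypothetical, and the sketch assumes that the first unit in the list acts as the master.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnitLinkConfig:
    """Isolation-switch settings for one processing unit (hypothetical model)."""
    unit_id: int
    role: str               # "master" or "slave"
    upstream_open: bool     # interface facing the master end of the chain
    downstream_open: bool   # interface facing the far end of the chain


def configure_chain(unit_ids: List[int]) -> List[UnitLinkConfig]:
    """Derive switch settings from each unit's position in the sequence.

    Assumption: the first unit is the master. The outward-facing interface
    at each end of the chain is isolated; intermediate units are open on
    both sides so that communication can pass along the chain.
    """
    if not unit_ids:
        raise ValueError("a partition needs at least one processing unit")
    last = len(unit_ids) - 1
    return [
        UnitLinkConfig(
            unit_id=uid,
            role="master" if pos == 0 else "slave",
            upstream_open=pos > 0,
            downstream_open=pos < last,
        )
        for pos, uid in enumerate(unit_ids)
    ]


# A three-unit partition made up of (hypothetical) units 2, 3 and 4.
chain = configure_chain([2, 3, 4])
```

Under these assumptions a single-unit partition ends up with both interfaces isolated, since it is at both ends of its own chain.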

[0169] However, other logical network arrangements can also be used. For example, in another implementation, the processing units in the partition can be connected in a "ring," in which case the interface of each processing unit can be opened at both ends, allowing communication to occur around the ring.

[0170] In the implementation scheme, operation using the techniques described herein can be selectively activated for a given output generated by the processing unit.

[0171] This operation can be controlled, for example, by a (software) driver for the processing unit or by a suitable manager or "arbitrator" that controls access to the processing unit. Therefore, in one embodiment, one or more processing units may be selectively configured to operate in a manner described herein, for example, and in an embodiment, on a per-output basis.

[0172] Of course, other arrangements are possible.

[0173] Subject to the particular requirements of the technology described herein, the processing unit (or units) of the data processing system may otherwise include any or all of the normal components, functional units, and elements that such a processing unit may include.

[0174] Each processing unit may have the same set of functional units, or some or all of the processing units may be different from each other.

[0175] As described above, to promote flexibility, all processing units within a group are, in an implementation, substantially functionally equivalent, for example such that any processing unit within the group can, in principle, be configured as the "master" or a "slave" of the corresponding partition. Therefore, in the implementation, each of the processing units includes the same functional units (although some functional units may be "inactive," for example, when the processing unit operates as a slave).

[0176] Therefore, in the case of a graphics processing unit, for example, each graphics processing unit in the embodiments includes one or more execution units, such as one or more shader (programmable processing) cores. In the embodiments, each graphics processing unit includes multiple shader cores, such as three or four shader cores.

[0177] In one implementation, the graphics processing unit (and therefore the graphics processing system) is a tile-based graphics processing unit, and (e.g., all of) the graphics processing units also include a tiling unit (a tiler or hierarchical tiler).

[0178] The processing unit can be operated to perform processing under the control of a host processor (e.g., a CPU). The host processor can be any suitable and desired host processor of the data processing system. The host processor can, and in one embodiment does, execute applications that may require data processing by the processing unit, and includes and executes an appropriate driver (e.g., including a compiler) for the processing unit, so that it can prepare the commands, instructions, data structures, etc., to be executed and used by the processing unit to perform the desired data processing operations in response to requests for data processing operations from an application executing on the host processor.

[0179] The processing unit in an implementation further includes one or more of the following, and in one implementation all of them: a management unit (e.g., a job manager) that provides the host processor (or virtual machine) (software) interface for the processing unit and is also operable to divide a data processing task assigned to the processing unit into subtasks and to assign the subtasks to the execution unit or units of the processing unit for execution; and a cache (e.g., an L2 cache) that handles data generated by the processing unit and provides an interface to the external (main) system memory of the data processing system.
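
By way of illustration only, the division of a task into subtasks and their assignment to execution units can be sketched as follows. The text does not specify how a job manager divides or distributes work; the fixed chunk size and the round-robin policy used here are assumptions made purely for the example, as are the function names.

```python
from typing import Dict, List


def split_into_subtasks(task_items: List[str], chunk_size: int) -> List[List[str]]:
    """Divide the work items of one task into subtasks of at most chunk_size."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [task_items[i:i + chunk_size]
            for i in range(0, len(task_items), chunk_size)]


def assign_round_robin(subtasks: List[List[str]],
                       num_execution_units: int) -> Dict[int, List[List[str]]]:
    """Assign subtasks to execution units in turn (assumed policy)."""
    if num_execution_units <= 0:
        raise ValueError("at least one execution unit is required")
    assignment: Dict[int, List[List[str]]] = {
        unit: [] for unit in range(num_execution_units)
    }
    for index, subtask in enumerate(subtasks):
        assignment[index % num_execution_units].append(subtask)
    return assignment


# Seven hypothetical work items shared among three execution units.
tiles = [f"tile_{i}" for i in range(7)]
subtasks = split_into_subtasks(tiles, chunk_size=2)
assignment = assign_round_robin(subtasks, num_execution_units=3)
```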

[0180] In one embodiment, at least some of the functional units of the processing unit (e.g., shader cores, management units, L2 caches, etc.) include address translation caches (such as translation lookaside buffers). In another embodiment, the processing unit also includes a memory management unit (MMU) (however, if desired, a suitable memory management unit may also or instead be located outside the processing unit or units).

[0181] As mentioned above, the techniques described herein have particular utility for multicast communication to the address translation caches within the processing units of a partition, and in one embodiment the communication network is configured to handle such communication (and, in one embodiment, is used only for such multicast communication to the address translation caches within the processing units of the partition).

[0182] However, the techniques described herein can generally be applied to any suitable cache system, and to any suitable type or level of cache that can be provided within a processing unit.

[0183] Each processing unit will also include a suitable communication network to provide communication between the various units of the processing unit, such as memory transactions between the execution unit and / or the cache of the processing unit, subtask control operations between the job manager and the execution unit, etc.

[0184] Of course, other arrangements are also possible.

[0185] The communication network described above, with which the techniques described herein are specifically concerned, is configured to handle "multicast" cache communication. In implementations, the communication network is dedicated to such multicast cache communication, and so may not be usable for other types of communication. In this case, and typically, other communication networks will exist within the processing system to allow other types of (e.g., cache) communication.

[0186] For example, in one implementation, a separate cache communication network is provided to support "unicast" cache communication from the master cache controller to any specific (single) cache within a partition. For instance, most of the communication issued by the master cache controller during processing operations may be unicast communication targeting a specific cache.

[0187] In one embodiment, the unicast cache communication network is configured as a switch network that provides multiple communication paths between a master cache controller and any given cache within a processing unit of the partition for which the master cache controller is configured. Therefore, when a communication path is blocked because it is already being used to carry signaling on the network, the switch network can choose another path to route the signal to the target cache. If such a switch network were also used for multicast communication, sending a multicast message could block a significant number (or even all) of the communication paths used to communicate with a given cache, causing other communications to be delayed. This can be avoided by providing a separate network dedicated to multicast communication, as is done in the embodiments of the techniques described herein.
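
By way of illustration only, the blocking effect described above can be sketched with a toy path selector. The function `pick_free_path` and the path names are hypothetical, and the sketch simply assumes that a multicast message sent over the switch network occupies every path to a given cache.

```python
from typing import Iterable, Optional, Set


def pick_free_path(paths: Iterable[str], busy: Set[str]) -> Optional[str]:
    """Return the first path not currently carrying a signal, else None."""
    for path in paths:
        if path not in busy:
            return path
    return None


paths_to_cache = ["path_a", "path_b", "path_c"]   # hypothetical path names

# Unicast case: one path is occupied, so another can be chosen.
unicast_choice = pick_free_path(paths_to_cache, busy={"path_a"})

# Multicast sent over the same switch network: assumed here to occupy
# every path to the cache, so a further unicast message has to wait.
multicast_choice = pick_free_path(paths_to_cache, busy=set(paths_to_cache))
```

With a separate multicast network, the `busy` set seen by unicast traffic would not be filled by multicast messages, which is the motivation given in the text.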

[0188] However, other arrangements are possible, and subject to the specific requirements of the multicast communication network of the technology described herein, the data processing system of the technology described herein may typically include any (other) suitable communication network as needed.

[0189] In addition to the processing units, controllers, etc. necessary for operation in the manner described herein, the data processing system may further include any other suitable and desirable components, elements, units, etc. that the data processing system may include.

[0190] Therefore, a data processing system may include, for example, one or more peripheral devices, such as one or more output devices (e.g., displays, vehicle controllers, etc.) and / or one or more input devices (e.g., human-computer interfaces, vehicle sensors, etc.).

[0191] In one implementation, when the data processing system comprises multiple processing units (which can operate independently or in combination), each processing unit may receive processing instructions for producing a data processing output, for example from a host processor or from a virtual machine executing on a host processor, and execute the received instructions independently. For example, each processing unit in the implementation has associated (task) management circuitry (e.g., a job manager) that provides a suitable software interface to the processing unit when operating in independent mode. In one implementation, at least some of the processing units may also operate in combination, for example, in a master-slave arrangement.

[0192] Virtual machines (host processors) can have access to the same set of one or more peripheral devices, or, for example, a separate set of peripheral devices can be provided for different groups of virtual machines (again, this may be beneficial for safety and / or security purposes).

[0193] The overall data processing system in the implementation scheme includes appropriate (system) memory for storing data used by the processing units during processing and / or storing data generated by the processing units as a result of processing. Different groups of processing units can be configured to connect to the same (system) memory, or separate system memory can be provided for different groups (again, this may be beneficial for security and / or safety purposes).

[0194] Correspondingly, different groups of processing units can be connected to external system memory via the same or different memory interconnects.

[0195] Therefore, in an implementation, the data processing system includes the processing units and one or more host data processing units (processors) (e.g., central processing units) on which one or more virtual machines execute (in one implementation, together with one or more drivers for the processing units).

[0196] In one embodiment, the data processing system and / or processor further includes, and / or communicates with, one or more memories and / or memory devices that store the data described herein and / or store software for performing the processes described herein.

[0197] In one implementation, the various functions of the technology described herein are executed on a single processing platform.

[0198] The techniques described herein can be implemented in any suitable system, such as a properly configured microprocessor-based system. In some implementations, the techniques described herein are implemented in computer- and / or microprocessor-based systems.

[0199] The various functions of the technology described herein can be performed in any desired and suitable manner. For example, the functions of the technology described herein can be implemented in hardware or software as needed. Therefore, for example, unless otherwise specified, the various functional elements, stages, and “devices” of the technology described herein may include suitable one or more processors, one or more controllers, functional units, circuit systems, circuits, processing logic units, microprocessor arrangements, etc., which are operable to perform various functions, such as appropriate dedicated hardware elements (processing circuits / circuit systems) and / or programmable hardware elements (processing circuits / circuit systems), which can be programmed to operate in a desired manner.

[0200] It should also be noted here that, as those skilled in the art will understand, the various functions of the techniques described herein can be duplicated and / or executed in parallel on a given processor. Similarly, various processing stages can share processing circuitry, etc., if desired.

[0201] Subject to any hardware necessary to perform the specific steps or functions described above, the system may otherwise include any one or more or all of the usual functional units, etc., that data processing systems include.

[0202] Those skilled in the art should also understand that all embodiments of the technology described herein may, as appropriate, include any one or more or all of the features described herein in one embodiment.

[0203] The methods described herein can be implemented at least in part using software, such as computer programs. Therefore, it can be seen that, when viewed from another embodiment, the techniques described herein provide: computer software particularly suitable for performing the methods described herein when installed on a data processor; computer program elements including computer software code portions for performing the methods described herein when the program elements are run on the data processor; and a computer program including code suitable for performing all steps of one or more methods described herein when the program is run on a data processing system. The data processor may be a microprocessor system, a programmable FPGA (Field-Programmable Gate Array), etc.

[0204] The techniques described herein also extend to computer software carriers that, when used to operate a graphics processor, renderer, or other system including a data processor, cause said processor, renderer, or system to perform the steps of the methods described herein in conjunction with said data processor. Such computer software carriers can be physical storage media, such as ROM chips, CD-ROMs, RAM, flash memory, or disks, or they can be signals, such as electronic signals transmitted through wires, optical signals, or radio signals, such as signals to satellites.

[0205] It should also be understood that not all steps of the methods described herein need to be performed by computer software. Therefore, viewed from a further broad embodiment, the technology described herein provides computer software, and such software installed on a computer software carrier, for performing at least one step of the methods described herein.

[0206] Therefore, the techniques described herein can be suitably embodied as a computer program product for use with a computer system. Such an embodiment may include a set of computer-readable instructions fixed on a tangible, non-transitory medium, such as a computer-readable medium, for example, a disk, CD-ROM, ROM, RAM, flash memory, or hard disk. It may also include a set of computer-readable instructions that can be transmitted to a computer system via a modem or other interface device, either over a tangible medium (including, but not limited to, optical or analog communication lines) or intangibly using wireless techniques (including, but not limited to, microwave, infrared, or other transmission techniques). This set of computer-readable instructions embodies all or part of the functions previously described herein.

[0207] Those skilled in the art will understand that such computer-readable instructions can be written in a variety of programming languages for use with many computer architectures or operating systems. Furthermore, such instructions can be stored using any current or future memory technology (including, but not limited to, semiconductor, magnetic, or optical technologies), or transmitted using any current or future communication technology (including, but not limited to, optical, infrared, or microwave technologies). It is conceivable that such a computer program product can be distributed as a removable medium with accompanying printed or electronic documentation (e.g., shrink-wrapped software), pre-loaded onto a computer system (e.g., on system ROM or a fixed disk), or distributed from a server or electronic bulletin board over a network (e.g., the Internet or the World Wide Web).

[0208] Implementation schemes of the technologies described herein will now be described.

[0209] Figure 1 shows an implementation of a data processing system in the form of an automotive system-on-a-chip (SoC).

[0210] As shown in Figure 1, the data processing system 1 of this embodiment includes three CPU (central processing unit) clusters: a first "Quality Management" (QM) cluster 2, which includes a CPU 3 running "Quality Management" software (the CPU 3 therefore does not have automotive safety features); a second, "ASIL" (Automotive Safety Integrity Level) (functional safety, "FuSa") cluster 4, which includes a CPU 5, this time running appropriately safety-certified software; and a "Safety Island" cluster 6, which includes a CPU 7 running safety-certified software for configuring the system and handling faults.

[0211] As shown in Figure 1, each CPU cluster also includes its own Generic Interrupt Controller (GIC) 8, 9, 21.

[0212] The system also includes a "graphics processing" cluster 10, which includes a set 11 of graphics processing units ("slices") that are capable of providing processing capabilities to virtual machines running on the QM cluster 2 and the ASIL cluster 4, as discussed further below.

[0213] In this example, the set 11 of graphics processing units includes eight graphics processing units (slices 0-7, where each slice is a graphics processing unit of the set), but of course, other numbers of graphics processing units are possible. As will be discussed further below, in this embodiment, the graphics processing units (GPUs) can operate in various modes, namely as "standalone" GPUs or as one or more linked sets of a primary (master) GPU and one or more secondary (slave) GPUs.

[0214] The graphics processing units 11 also have a management circuit (partition manager) 12 associated with them (as part of the graphics processing cluster 10).

[0215] As shown in Figure 1, the system supports three separate communication bus connections for the graphics processing cluster 10: a first communication bus 18, which can be used, for example, for non-safety-critical traffic and is therefore used by the QM cluster 2; a second bus 19, which can be a safety-critical / secure bus used, for example, for safety-critical traffic and is therefore used by the ASIL cluster 4; and a third bus 20, which can be a safety-critical / secure bus but which also has access restrictions (i.e., it can only be accessed by bus masters with appropriate privileges) and is used only for configuration communication by the safety island 6.

[0216] The system also includes a suitable system cache 13, a DRAM controller 14, interconnects 15 and 16, and a system memory management unit (sMMU) 17 (e.g., providing two-stage address translation that separates the secure and non-secure address spaces and isolates the memory accesses of each virtual machine using the graphics processing cluster 10).

[0217] There may, of course, be functional units, processors, system elements, and components that are not shown in Figure 1.

[0218] The management circuitry (partition manager) 12 for the graphics processing units 11 is operable to configure and set a configurable communication network that sets the communication paths between the different graphics processing units (slices) 11, and also how the graphics processing units communicate with the QM cluster 2 and the ASIL cluster 4 (and specifically, which of buses 18 and 19 is available for communication with the respective graphics processing unit). In particular, it can configure the communication network to distribute the graphics processing units (slices) 11 into two different groups in this embodiment: one group for the QM cluster 2 (and coupled to bus 18 of that cluster), and one group for the ASIL cluster 4 (and coupled to bus 19 of that cluster).

[0219] As well as being able to set up the configurable communication network to divide the graphics processing units into different groups, the management circuit (partition manager) also supports, and can configure, the organization of the graphics processing units in a group into one or more independently assignable partitions (subsets) of the graphics processing units (slices).

[0220] The management circuitry (partition manager) 12 also provides a set of "access windows" in the form of communication interfaces, allowing a virtual machine to access and control a given partition of the graphics processing units. In this embodiment, each such access window includes a set of (communication) registers having a corresponding set of physical addresses that can be used to address those registers.
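
By way of illustration only, the idea of an access window as a set of registers occupying its own range of physical addresses can be sketched as follows. The base address, the per-window stride, and the names `AccessWindow` and `register_address` are all assumptions made for the example; only the count of 16 access windows comes from the embodiment described herein.

```python
from dataclasses import dataclass

# All numeric values below except NUM_WINDOWS are illustrative assumptions,
# not values from the described system.
WINDOW_BASE = 0x4000_0000   # hypothetical base of the access-window region
WINDOW_STRIDE = 0x1_0000    # hypothetical address range reserved per window
NUM_WINDOWS = 16            # the embodiment provides 16 access windows


@dataclass(frozen=True)
class AccessWindow:
    index: int

    def __post_init__(self) -> None:
        if not 0 <= self.index < NUM_WINDOWS:
            raise ValueError(f"access window index out of range: {self.index}")

    @property
    def base_address(self) -> int:
        """Physical address of the first register of this window."""
        return WINDOW_BASE + self.index * WINDOW_STRIDE

    def register_address(self, offset: int) -> int:
        """Physical address of the register at `offset` within this window."""
        if not 0 <= offset < WINDOW_STRIDE:
            raise ValueError(f"register offset out of range: {offset:#x}")
        return self.base_address + offset


window3 = AccessWindow(3)
example_address = window3.register_address(0x8)
```

Giving each window a disjoint address range is one simple way of letting a hypervisor map a single window into a single virtual machine; whether the described system lays its registers out this way is not stated.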

[0221] These access windows also provide the mechanism whereby a virtual machine can communicate with an arbitrator (the arbitrator for the group of graphics processing units that the virtual machine is to use), and specifically provide a mechanism for the virtual machine and the arbitrator to exchange messages, such as the virtual machine requesting processing resources, and the arbitrator controlling the virtual machine's access to a partition of the processing units, and / or signalling when the virtual machine is to relinquish its use of a partition, for example, to allow a different virtual machine to access it. The virtual machine-arbitrator interface is separate from the virtual machine-graphics processing unit partition interface.

[0222] Therefore, the graphics processing cluster 10 effectively provides a set of graphics processing resources, including graphics processing units (slices) 11 and partitions and access windows supported by management circuitry 12, which can be subdivided into multiple (two in this embodiment) graphics processing resource "groups", each containing one or more graphics processing units (slices) and associated with one or more independently assignable partitions and one or more "access windows" of the graphics processing cluster.

[0223] In this embodiment, the management circuit (partition manager) 12 supports dividing the graphics processing units 11 into two distinct groups (one for use by the QM cluster 2 and the other for use by the ASIL cluster 4) and provides a set of 16 access windows for virtual machines to communicate with the partitions of the graphics processing units. Of course, other arrangements are possible.

[0224] In this implementation, the configuration of these graphics processing resources into the corresponding groups is accomplished by the management circuit (partition manager) 12 under the control of a (privileged) controller executing on the safety island 6 and of respective arbitrators executing on the QM cluster 2 and the ASIL cluster 4.

[0225] To support this operation, the management circuitry (partition manager) 12 further includes appropriate configuration interfaces, in some embodiments in the form of appropriate sets of configuration registers, which can be accessed and set by the controller on the safety island 6 and by the arbitrators on the CPU clusters, respectively. The controller and arbitrators can then set their configuration registers accordingly, thereby controlling the management circuitry (partition manager) 12 to configure the graphics processing resources (and, in particular, the configurable communication network that configures the graphics processing resources) accordingly. The management circuitry (partition manager) 12 may also include one or more state machines for this purpose.

[0226] Figure 2 illustrates this, and shows the QM cluster 2, the ASIL (FuSa) cluster 4, and the safety island 6, as well as the (privileged) system controller 30 executing on the safety island 6, the arbitrator 31 executing on the QM cluster 2, and the arbitrator 32 executing on the ASIL (FuSa) cluster 4.

[0227] Arbitrators 31 and 32 are operable to control access by virtual machines running on the respective clusters to the corresponding graphics processing resource groups that have been allocated to those clusters. Arbitrator 32 for the ASIL cluster 4 is configured to operate and support operations in an appropriate safety-critical manner. Arbitrator 31 for the QM cluster 2 does not need to be configured to operate and support safety-critical operations.

[0228] Each arbitrator can operate in association with a corresponding hypervisor to manage the operations of virtual machines running on the cluster (but separately from the hypervisor).

[0229] Figure 2 also shows a set of corresponding virtual machines 33 executing on the QM cluster 2 and a set of virtual machines 34 executing on the ASIL cluster 4. In this example, it is assumed that there are two virtual machines executing on each cluster, but of course, other arrangements are possible. Each cluster correspondingly executes an appropriate graphics processing unit (GPU) driver 35 for each virtual machine it supports.

[0230] Figure 2 also shows the corresponding communication links between the controller 30 and the arbitrators 31 and 32, as well as those from the controller 30, the arbitrators 31 and 32, and the virtual machines 33 and 34 (via the drivers 35) to the management circuitry (partition manager) 12 of the graphics processing cluster 10.

[0231] The controller 30 can assign to each "resource group" one or more of the graphics processing units 11, one or more of the partitions supported by the partition manager 12, and one or more of the access windows supported by the partition manager. Each group is also assigned to a corresponding one of the "cluster" communication buses 18 and 19, depending on whether the group will be used by the QM cluster 2 (in which case it will be assigned to the corresponding QM cluster bus 18) or by the ASIL cluster 4 (in which case it will be assigned to the ASIL bus 19).

[0232] To configure the appropriate groups of graphics processing resources available to the QM cluster 2 and the ASIL cluster 4, the controller 30 on the safety island 6 sets appropriate configuration parameters in the (privilege-restricted) configuration registers of the management circuit (partition manager) 12, in response to which the management circuit 12 configures the communication network for the graphics processing units (slices) 11 accordingly. As shown in Figures 1 and 2, the controller 30 communicates directly with the management circuit (partition manager) 12 via the restricted configuration bus 20.

[0233] As will be understood from the above, in this embodiment of the technology described herein, the graphics processing units and their associated management circuitry can be considered to be divided into three distinct "safety" domains: a "control" safety domain 50, which includes the main configuration control of the management circuitry 12 and is owned and controlled by the "safety island" CPU cluster 6; a "safety critical" domain 51, which includes a set of graphics processing resources used by the "safety critical" ASIL CPU cluster 4; and a second, "non-safety critical" domain 52, which includes a set of graphics processing units, etc., to be used by, and owned by, the QM CPU cluster 2.

[0234] Figure 3 illustrates this, and shows in more detail how the "ownership" of different aspects of the management circuit and of the graphics processing units is arranged between the different domains.

[0235] As shown in Figure 3, the management circuitry (partition manager) 12 includes, in particular, a set of control interfaces (communication interfaces) 53 that can be used to control the management circuitry to configure the groups of graphics processing resources and then to use the resources within the groups. These control (communication) interfaces include corresponding address spaces and register sets that can be addressed by appropriate software executing on the processors (processor clusters).

[0236] These control interfaces first include a “system” interface 54, which includes a set of control registers that can be used, for example, to set system parameters, such as a stream ID for a corresponding access window.

[0237] The system interface 54 can also be used (by the controller 30) to configure fault protection and detection settings (operations), such as enabling the desired fault detection mechanisms (and their interrupts), enabling fault detection for the desired groups, partitions, and graphics processing units, and / or configuring the behavior in the event of a fault (e.g., whether fault reporting is enabled or disabled, whether the current operation should terminate or continue, etc.).

[0238] Next, there is an "assignment" interface 55, which is used by the controller 30 on the safety island CPU cluster 6 to allocate resources (namely graphics processing units (slices), partitions, and access windows) to the appropriate groups and to assign the groups to the appropriate communication buses.

[0239] As shown in Figure 3, these interfaces 54 and 55 of the management circuit are used by, and belong to, the controller 30 on the safety island CPU cluster 6, and are accessed through the corresponding access bus 20 for communication with the safety island CPU cluster 6.

[0240] The management circuit 12 further includes a set of “group” configuration interfaces 56, which can be used by the arbiters for the respective groups to configure the resources within the group, and in particular to configure and set the allocation of graphics processing units and access windows to the respective partitions within the group.

[0241] As shown in Figure 3, each of these group configuration interfaces can be accessed by the corresponding arbiter over the communication bus of the processor cluster on which that arbiter executes.

[0242] In the example shown in Figure 3, it is assumed that groups 0 and 1, partitions 0 and 1, graphics processing units (slices) 0-2 and an appropriate set of access windows have been assigned to ASIL CPU cluster 4, and these resources will therefore be controlled by the corresponding arbiter 32 via ASIL cluster communication bus 19.

[0243] Correspondingly, groups 2-3, partitions 2-3, graphics processing units 3-7, and a suitable set of access windows have been assigned to QM CPU cluster 2, and these resources will therefore be controlled by arbiter 31 via QM cluster bus 20.

[0244] If needed, other distributions of the resources into groups (and therefore between the CPU clusters) could be used.
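The example allocation of Figure 3 described above can be written out as a small lookup table. The following Python sketch is illustrative only: the dictionary layout and the helper name `owner_of_slice` are assumptions made for this example and are not part of the described system; only the group, partition, slice, arbiter and bus numbers are taken from the text.

```python
# Illustrative only: the Figure 3 example allocation as plain data.
# The layout of this table is an assumption; the numbers come from the text.
ALLOCATION = {
    "ASIL CPU cluster 4": {
        "groups": [0, 1],
        "partitions": [0, 1],
        "slices": [0, 1, 2],
        "arbiter": 32,
        "bus": 19,
    },
    "QM CPU cluster 2": {
        "groups": [2, 3],
        "partitions": [2, 3],
        "slices": [3, 4, 5, 6, 7],
        "arbiter": 31,
        "bus": 20,
    },
}


def owner_of_slice(slice_id):
    """Return the CPU cluster that owns a slice (hypothetical helper)."""
    for cluster, resources in ALLOCATION.items():
        if slice_id in resources["slices"]:
            return cluster
    raise KeyError(f"slice {slice_id} is not assigned to any cluster")


# No slice is owned by two clusters at once in this example.
all_slices = [s for res in ALLOCATION.values() for s in res["slices"]]
slices_are_disjoint = len(all_slices) == len(set(all_slices))
```

A different distribution of resources would only change the table contents, not the lookup.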

[0245] In addition to the group configuration interfaces 56, the management circuit also provides a set of partition control interfaces 57, which can be used by the arbiter for the group of graphics processing units to which a partition belongs, in particular to power the partition on and off, to reset the partition, and, as will be discussed further below, to trigger fault detection tests of the partition.

[0246] Finally, the management circuit 12 provides a set of access windows 58 that provide communication and control interfaces allowing a virtual machine to access and control the partitions of the group of graphics processing units to which it has been granted access. As mentioned above, the access windows also provide an appropriate message passing interface for communication between the arbiter and the virtual machine to which the access window belongs.

[0247] Figure 3 also shows the configurable communication network 59 of the management circuitry, which, as described above, can be set up under the control of the controller on the safety island CPU cluster 6 to configure the graphics processing units into appropriate groups and couple them to an appropriate one of the communication buses 19, 20, etc.

[0248] As discussed above, the management circuitry is connected to three separate communication buses that can be used to communicate with the management circuitry and the graphics processing units: the access control bus 20 for communicating with the safety island CPU cluster 6, the bus 19 for communicating with the ASIL CPU cluster 4, and the bus 20 for communicating with the QM CPU cluster 2.

[0249] To further support and facilitate the separation of hardware between different groups of graphics processing units (and therefore different domains), management circuitry 12 is capable of independently powering on and off the respective partitions of graphics processing units, and the individual graphics processing units within those partitions, and, correspondingly, of independently resetting the partitions (and the individual graphics processing units). This is done under the control of the arbiter for the group of graphics processing units in question, via the corresponding partition interface 57.

[0250] On the other hand, as shown in Figure 3, the management circuitry itself is always powered on (and its power is under the control only of the system controller 30 on the safety island CPU cluster 6). Correspondingly, the management circuitry can be reset only by the system controller 30 on the safety island CPU cluster 6. As shown in Figure 3, in this implementation there are two levels of "reset" that can be applied to the management circuitry: a first "reset" that resets all hardware, and a second "recovery reset" that resets all hardware except the error reporting mechanisms (and which can be used, for example, when error recovery requires a reset (e.g., because a unit is unresponsive)).

[0251] Moreover, as shown in Figure 3, each CPU cluster has its own independent interrupt. In this implementation, the management circuitry and each partition of graphics processing units can generate its own independent interrupt. Each interrupt is broadcast to all CPU clusters in the system, and the corresponding interrupt controller of each CPU cluster identifies whether the broadcast interrupt applies to that cluster, i.e. whether it comes from a partition of a group of graphics processing units that the cluster owns (in the case of ASIL CPU cluster 4 and QM CPU cluster 2) or from the management circuitry (in the case of safety island CPU cluster 6).
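The broadcast-then-filter interrupt scheme can be illustrated with a short Python sketch. It is a simplified model under stated assumptions: the `ACCEPTS` table, the tuple encoding of interrupt sources and the `deliver` function are hypothetical names introduced here, and partition ownership simply follows the Figure 3 example.

```python
# Sketch: every interrupt is broadcast; each cluster's controller filters it.
MANAGEMENT = "management circuitry"

# Interrupt sources accepted by each cluster. Partition ownership follows
# the Figure 3 example; the data layout itself is an assumption.
ACCEPTS = {
    "safety island CPU cluster 6": {MANAGEMENT},
    "ASIL CPU cluster 4": {("partition", 0), ("partition", 1)},
    "QM CPU cluster 2": {("partition", 2), ("partition", 3)},
}


def deliver(source):
    """Broadcast an interrupt and return the clusters whose filter accepts it."""
    return [cluster for cluster, accepted in ACCEPTS.items() if source in accepted]


asil_targets = deliver(("partition", 1))
management_targets = deliver(MANAGEMENT)
```

Because the filtering is done per cluster, an interrupt from a partition owned by one cluster is simply ignored by the others.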

[0252] In this implementation, in order to further support the operation of the groups of graphics processing units in separate "safety critical" and "non-safety critical" domains, under the control of the "safety island" domain, the system further supports and uses appropriate fault protection mechanisms for the management circuit 12 and the graphics processing units 11.

[0253] Specifically, the management circuit operates permanently under a high level of fault protection, in this embodiment by always being subject to a fault detection process (monitoring). This is achieved in this embodiment by protecting the management circuit with a dual-core lockstep fault detection mechanism, i.e., the management circuit is instantiated twice, with one instance of the management circuit being used to check the operation of the other instance at all times (and any difference between them being taken as an indication of a fault).
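The idea of checking one instance against a duplicate can be modelled in a few lines of Python. This is a minimal software sketch of the comparison principle only, not the hardware mechanism; the function name `lockstep_step` and the toy instances are assumptions made for illustration.

```python
def lockstep_step(primary, shadow, inputs):
    """Run two instances on the same inputs and flag any mismatch as a fault.

    `primary` and `shadow` are callables standing in for the two instances
    of the protected circuit (hypothetical stand-ins for this sketch).
    """
    out_primary = primary(inputs)
    out_shadow = shadow(inputs)
    fault = out_primary != out_shadow
    return out_primary, fault


def healthy(x):
    return x + 1


def faulty(x):
    return x + 2  # models an instance that has developed a fault


ok_result = lockstep_step(healthy, healthy, 1)
bad_result = lockstep_step(healthy, faulty, 1)
```

Note that a mismatch only indicates that a fault exists; the comparison alone does not say which of the two instances is the faulty one.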

[0254] The graphics processing units (GPUs), on the other hand, are not protected by dual-core lockstep, but are instead protected against faults using built-in self-test (BIST). In this embodiment, BIST can be selectively triggered for a GPU under the control of the arbiter for the group to which the GPU belongs. In particular, as described above, the arbiter can use the partition control interface 57 to trigger BIST fault detection testing of a partition.

[0255] As shown in Figure 3, to support the use of BIST fault detection testing for the graphics processing units, the data processing system also includes a suitably configured BIST unit (circuit) 60. Thus, when the arbiter for a group of graphics processing units indicates that a graphics processing unit should undergo built-in self-test, the test is performed on the graphics processing unit in question by the BIST unit.

[0256] Figure 4 shows in more detail the arrangement and components of each graphics processing unit (slice) 11 in this embodiment.

[0257] As shown in Figure 4, in this embodiment each graphics processing unit (slice) includes one or more execution units, such as a programmable processing (shader) core 500 (SC) and a hierarchical tiler 502 (HT). In this embodiment, each graphics processing unit is tile-based. Different graphics processing units 11 may have different sets of execution units, and there may be more types of execution unit than those shown in Figure 4.

[0258] Each graphics processing unit (GPU) also includes a Level 2 cache 504 (L2), which inputs data to be used in data processing tasks and outputs the resulting data via a cache interface 506. The cache interface 506 is connected to external system memory 116 via a suitable memory interconnect. The GPU may also include a memory management unit (MMU) 508, although this may also or instead be located outside the GPU.

[0259] Each graphics processing unit 11 also includes one or more communication bridges, including a slave bridge 510 for connecting to a master graphics processing unit (the master graphics processing unit can be directly connected or daisy-chained with other slave graphics processing units) and / or a master bridge 512 for connecting to slave graphics processing units. The master bridge 512 is used in master mode to connect one or more slave graphics processing units (via daisy-chaining), and can also be used in slave mode to connect additional daisy-chained slave graphics processing units.

[0260] In this implementation, communication bridges 510 and 512 are implemented to support asynchronous interfaces between graphics processing units (GPUs) because this allows for easier implementation of GPUs, since clocks can then be independent when GPUs are linked.

[0261] Each graphics processing unit (GPU) also includes a job manager 514. This provides the software interface for the GPU 11: it receives tasks (commands and data) from the driver running on the CPU cluster via a task interface 516, divides the tasks given by the driver into subtasks, and assigns the subtasks to the various execution units (shader cores 500, tiler 502) for execution. When the GPU 11 is capable of operating as a master device, the job manager 514 is configured to also control the execution units of linked slave GPUs. Correspondingly, for GPUs 11 capable of slave operation, the job manager 514 can be disabled when the GPU is operating in slave mode.

[0262] As shown in Figure 4, the various functional units of each graphics processing unit are interconnected via an asynchronous communication interconnect 518. This interconnect carries various traffic, such as memory transactions between the execution units and the Level 2 cache 504 (L2), and subtask control traffic between the job manager 514 and the execution units. As shown in Figure 4, the asynchronous interconnect 518 is also connected to the slave and master bridges 510, 512 of the graphics processing unit 11, and includes suitable switches (not shown) that can be activated to enable or disable communication across (via) the bridges 510, 512 to a connected graphics processing unit.

[0263] The different operating modes of the graphics processing unit (standby mode, master mode, and slave mode) are set (enabled and disabled) by appropriately configuring the routing of the asynchronous interconnect 518. Thus, for example, when the graphics processing unit operates in standby mode, the slave and master bridges 510, 512 are both disabled to prevent communication via (across) the bridges. Correspondingly, when the graphics processing unit is used as a master, master bridge 512 is enabled to allow communication with connected graphics processing units. Correspondingly, when the graphics processing unit is used as a slave, slave bridge 510 is enabled to allow communication with connected graphics processing units.
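The relationship between operating mode and bridge state can be summarised as a small mapping. The Python sketch below is an illustration under assumptions: the names `Mode` and `bridge_config` are hypothetical, the text does not state the slave bridge state in master mode (it is assumed disabled here), and the `has_downstream_slave` flag reflects the earlier statement that the master bridge can also be used in slave mode to connect further daisy-chained slaves.

```python
from enum import Enum


class Mode(Enum):
    STANDBY = "standby"
    MASTER = "master"
    SLAVE = "slave"


def bridge_config(mode, has_downstream_slave=False):
    """Return which bridges are enabled for an operating mode (sketch)."""
    if mode is Mode.STANDBY:
        # Both bridges disabled: no communication across the bridges.
        return {"slave_bridge_510": False, "master_bridge_512": False}
    if mode is Mode.MASTER:
        # Assumption: a master has no upstream unit, so bridge 510 is off.
        return {"slave_bridge_510": False, "master_bridge_512": True}
    # Slave mode: bridge 512 is needed only to chain further slaves.
    return {"slave_bridge_510": True, "master_bridge_512": has_downstream_slave}


standby = bridge_config(Mode.STANDBY)
middle_slave = bridge_config(Mode.SLAVE, has_downstream_slave=True)
```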

[0264] In this embodiment, the asynchronous interconnect 518 is reconfigured by the management circuitry (partition manager) 12 via the configuration interface 520 of the graphics processing unit 11. Any route configuration (or reconfiguration) in this embodiment occurs only during a reset of the graphics processing unit.

[0265] Each graphics processing unit 11 also has an associated identifier unit 522, which stores the identifier or identifiers of the (currently enabled) access window assigned to that graphics processing unit. The identifier is provided by management circuitry 12 via the identifier interface 524 of the graphics processing unit. The graphics processing unit can then output the identifier, for example, along with output data from L2 cache 504. The identifier can be used for memory access permission checks: for example, a virtual machine and / or graphics processing unit may be unable to access data associated with another virtual machine and / or graphics processing unit because it does not know the correct identifier for accessing that data.

[0266] Figure 4 thus shows in more detail the operation according to an implementation of the technology described herein; however, it should again be noted that Figure 4 is only schematic and that, for clarity, various components and connections have been omitted from the figure.

[0267] Similarly, the data processing system and / or graphics processing unit of this embodiment may include, as needed, one or more of the features described in US2017 / 0236244, the entire contents of which are incorporated herein by reference, and / or US2019 / 0056955, the entire contents of which are incorporated herein by reference.

[0268] This implementation relates in particular to cache communication, and more specifically to providing an efficient multicast cache communication protocol for simultaneously invalidating all address translation caches (e.g., TLBs) within a partition of multiple processing units. As will be explained further below, this may be required, for example, when memory needs to be updated and a new set of address mappings configured.

[0269] For example, as shown in Figure 5, at least some of the functional units within the graphics processing unit may themselves include an address translation cache (e.g., a TLB) 534, which stores recently used translations of virtual memory addresses to the physical memory addresses used by the functional units. For example, all memory addresses issued by software and operated on by the functional units can be "virtual" memory addresses. To access external physical memory, the virtual addresses must therefore be translated into physical memory addresses.

[0270] In the example shown in Figure 5, which generally corresponds to Figure 4, there are four programmable processing (shader) cores 500 (SC), each with a set of two TLBs 534. The hierarchical tiler 502 (HT) also has a set of two TLBs 534, and the Level 2 cache 504 (L2) and the job manager 514 (JM) each have a corresponding TLB 534. Of course, other arrangements are also possible.

[0271] To facilitate communication between the memory management unit (MMU) 508 and the corresponding TLBs 534 within the functional units of the graphics processing unit 11, various communication networks are provided. Specifically, in the example shown in Figure 5, the graphics processing unit 11 includes two internal communication networks for such cache communication: a “unicast” network 532 operable to route communication from the MMU 508 to any single TLB 534 within the graphics processing unit 11, and a separate “multicast” network 530 operable to route communication between the MMU 508 and all of the TLBs 534 simultaneously.

[0272] When the graphics processing unit 11 operates in combination with other graphics processing units, in a respective partition of multiple processing units, the partition will therefore have an associated set of TLBs, which includes all TLBs 534 in each of the graphics processing units 11 within the partition.

[0273] The cache communication network in this embodiment is therefore extended to allow communication to be transmitted across the different graphics processing units 11 within the partition.

[0274] This is illustrated in Figure 6, which shows an example of a partition of processing units in which a first processing unit 60, on the left side, is configured as the master processing unit of the partition, and in which a number of slave processing units 0…N are connected in sequence to the master processing unit, for example arranged in a daisy chain.

[0275] In this scenario, the cache controller (e.g., MMU 508) of the master processing unit can be designated as the master cache controller for the partition of processing units, with the designated master cache controller operable to send signals across the communication bridges to, and receive signals from, all caches (e.g., TLBs) within the partition of multiple processing units, as shown in Figure 6.

[0276] Therefore, the processing units within a partition are collectively configured to provide a communication network that allows cache communication to be transferred from the master cache controller to all caches (e.g., TLBs) within the partition (including caches (TLBs) residing on the master processing unit itself and any caches (TLBs) residing on another slave processing unit). Thus, a logical cache communication network is provided on the partition, which includes a corresponding cache communication network 532, 530 for each processing unit within the partition and communication bridges interconnecting adjacent processing units.

[0277] In the example shown in Figure 6, the master processing unit 60 is logically located at one end (the left side) of the partition. This means that the master processing unit 60 should not receive any communication from its (left) side. Similarly, the final slave processing unit 64 in the partition, at the other end of the partition, should logically not receive any communication from its (right) side. On the other hand, any intermediate slave processing unit 62 can receive communication from either side and should be configured accordingly.

[0278] To facilitate this, as shown in Figure 7, each processing unit includes a set of isolation switches 70 that are operable to isolate the processing unit from communication.

[0279] Therefore, as shown in Figure 8, when the processing unit is configured as the master processing unit, the isolation switch 70 on the left side can be set to isolate the corresponding network communication interface (because communication should not be received from that side). Correspondingly, as shown in Figure 10, when the processing unit is configured as the final slave processing unit in the partition, the isolation switch 70 on the right side can be set to isolate the corresponding network communication interface (because communication should not be received from that side). On the other hand, as shown in Figure 9, any intermediate processing unit in the partition should be open to communication from both sides, with its isolation switches 70 set accordingly.

[0280] In this way, the processing units can be flexibly configured for different logical arrangements (for different partitions). For example, in the example above, each of the processing units is essentially the same, but can be appropriately configured by setting the relevant isolation switch 70, depending on the location of the processing unit within the partition.
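The position-dependent switch settings of Figures 8 to 10 reduce to a simple rule. The following Python sketch is a minimal illustration; the function name `isolation_switches` and the zero-based position numbering are assumptions, and the single-unit case (both sides isolated) is an inference from the rule rather than something stated in the text.

```python
def isolation_switches(position, partition_size):
    """Return (isolate_left, isolate_right) for a unit in a partition.

    Position 0 is the master processing unit at the left end; the last
    position is the final slave at the right end (hypothetical numbering).
    """
    if not 0 <= position < partition_size:
        raise ValueError("position lies outside the partition")
    isolate_left = position == 0                    # master: Figure 8
    isolate_right = position == partition_size - 1  # final slave: Figure 10
    return isolate_left, isolate_right


# A three-unit partition: master, intermediate slave, final slave.
settings = [isolation_switches(p, 3) for p in range(3)]
```

Since every unit uses the same rule, identical hardware can take any position in the chain simply by changing its switch settings.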

[0281] This implementation relates in particular to facilitating simultaneous multicast cache communication from a designated master cache controller to all caches (e.g., TLBs) within a partition. For example, in the Figure 6 example, the cache controller (MMU) of the master processing unit 60 is active, while the cache controllers (MMUs) of all the slave processing units 62…64 are inactive. However, the caches (e.g., TLBs 534) can be active for all processing units within the partition. Therefore, the master cache controller should be able to communicate with all caches within the partition.

[0282] Figure 7 therefore also schematically shows, on the processing unit, a controller (MMU) 72 for a set of caches 74 (e.g., TLBs 534). The processing unit also includes associated "AND" logic 76 and "OR" logic 78 operable to combine the cache response signals from the set of caches 74 on the processing unit with the corresponding combined cache response signals from the previous adjacent processing unit in the partition, as will be explained further below.

[0283] Figure 11 is a flowchart illustrating this embodiment. In the first step (step 110), the software prepares a partition of processing units, for example in the manner described above. Specifically, the software configures a first processing unit as the master processing unit and configures one or more other processing units as slave processing units that will operate under the control of the master processing unit (step 112).

[0284] As part of this operation, the hardware sets the partition boundaries that will be used for multicast cache communication, for example by setting the isolation switches as described above (step 114).

[0285] Before workload processing is initiated, memory mapping is completed and the MMU (master cache controller) is configured accordingly (step 115). To do this, the software writes page tables into memory and configures the MMU to use those page tables. The hardware then initializes the TLBs and applies the new configuration, for example by invalidating all the TLBs simultaneously to clear any current cache contents (potentially relating to translations from a previous configuration) (step 116).

[0286] Workload processing can then begin (step 117). For example, the software can first write a pointer to the start of the job chain to cause the hardware to begin executing the job chain (step 118). The software then waits (step 119) until an interrupt is signaled (job continuation event). When an interrupt is signaled (job ready event), the job processing ends (step 122).

[0287] During processing, a "memory update request" may occur. This request can be initiated, for example, by hardware if a memory allocation failure occurs (e.g., an unmapped page). For instance, this could be a situation where the heap of (software) allocated memory has run out, making it necessary to increase the allocated memory. Alternatively, a memory update request can be initiated by software, for example, for another (next) workload (e.g., software preparing memory for the next frame or for another application to execute immediately). Various arrangements are possible in this regard.

[0288] When a memory update request is signaled, the software should appropriately update the memory mapping and reconfigure the MMU (step 120). At the same time, the hardware should invalidate the MMU and TLBs for the new configuration (step 121). To this end, the master MMU should send a cache invalidation request to all TLBs within the partition.

[0289] Figure 12 shows in more detail how such memory update requests are handled according to this implementation. Figure 12 therefore explains in more detail steps 115, 116, 120, and 121 of Figure 11. As shown in Figure 12, the protocol when a memory update request is received is as follows.

[0290] As a first step, the software locks (stops) MMU address translation for the address space (there may be multiple address spaces, and each address space can be configured individually) (step 124).

[0291] Locking means that all TLBs and MMUs will stop translating virtual addresses to physical addresses. Locks are needed to ensure that memory can be safely updated, for example, to add new page table entries and / or modify (move / remove) physical memory.

[0292] The software therefore issues a lock request signal. In response, the hardware (MMU) blocks the input translation request buffer (translation stop) (step 125) so that no new translation requests from any TLB in the partition are processed.

[0293] The hardware (MMU) clears all translation entries from the MMU cache (step 126) and then broadcasts a cache invalidation request signal to all TLBs (e.g., by setting the cache invalidation request signal line high (to "1")), which causes the TLBs to invalidate all their translation entries and to lock (i.e., not translate any incoming virtual addresses; block the input transaction interface) (step 127).

[0294] In this implementation, the cache invalidation request travels sequentially through all slices in the configured partition (and the partition boundaries (isolation) confine the request to the partition, so that it does not travel through any slices outside the partition).

[0295] The MMU then waits until the AND tree of all the responses is set. To facilitate this, as shown in Figure 7, each processing unit has associated AND logic 76, which is operable to combine the corresponding responses from all its caches 74 with the combined response signal received from the adjacent previous processing unit in the partition.

[0296] The signal received at the master processing unit will therefore be set high (only) when all TLBs of all processing units to which the signal was transmitted are responding. At this point, the MMU determines that the cache invalidation request has been received by all TLBs, and signals "lock ready" back to the software (step 129).
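The daisy-chained AND combination can be sketched as a fold over the units of the partition. This Python model is illustrative only: the function name `and_chain` and the list-of-lists encoding of response lines are assumptions, and the fold is assumed to start at the final slave and progress towards the master, which is where the combined signal is consumed.

```python
def and_chain(units):
    """Combine cache response bits along a daisy chain using AND logic.

    `units[0]` is the master processing unit and `units[-1]` the final
    slave; each entry is the list of response bits of that unit's caches.
    """
    combined = True  # neutral value at the isolated end of the partition
    for responses in reversed(units):
        # Each unit ANDs its own cache responses with the signal it
        # received from the adjacent unit further down the chain.
        combined = combined and all(responses)
    return combined


all_responding = and_chain([[True, True], [True, True], [True]])
one_missing = and_chain([[True, True], [True, False], [True]])
```

A single non-responding TLB anywhere in the partition keeps the signal at the master low, which is why a high level can be taken as confirmation that every TLB has seen the request.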

[0297] While the MMU is locked (for the address space), the L2 cache (and any other caches that use translated physical addresses) is cleared (invalidated) to ensure that the GPU currently holds no physically addressed cached data from external memory (step 130). External memory can now be safely modified (step 131). For example, page tables can be written to and modified (with new memory mappings). Similarly, physical memory regions can be deleted, added, or moved.

[0298] After updating the external memory, the software configures the MMU (with register write) to update the page pointer (step 132). At this point, the software can unlock the MMU and initiate the unlock routine to allow the system to resume normal operation (step 133).

[0299] This accordingly signals the hardware (MMU) to enable the input translation request buffer (translation enable) (step 134).

[0300] The hardware (MMU) can then deassert the cache invalidation request (e.g., by setting the cache invalidation request signal line low) to remove the locks from the TLBs and allow them to translate virtual addresses again (step 135). This "invalidate off" request travels through all slices in the configured partition.

[0301] The MMU waits until the OR tree of all the responses is set low, at which point it determines that all TLBs have received the request. To facilitate this, as shown in Figure 7, each processing unit has associated OR logic 78, which is operable to combine the corresponding responses from all its caches 74 with the combined response signal received from the adjacent previous processing unit in the partition. Therefore, after the request has been deasserted, the deasserted signal is transmitted to all caches, and when a cache receives the deasserted request, its cache response signal is deasserted accordingly. Once the response has been deasserted for all caches, the request is complete, and processing can safely resume, with the caches unlocked. The MMU signals "unlock ready" back to the software (step 137), and the memory update is completed (step 138).
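Taken together, the lock and unlock halves of the Figure 12 protocol can be sketched as a small Python model. This is a simplified, zero-latency illustration, not the hardware implementation: the class names `Tlb` and `MasterMmu`, their attributes and the dummy translation entries are assumptions, and the software-side steps between lock and unlock (clearing the L2 cache, modifying memory, updating the page pointer) are omitted.

```python
class Tlb:
    """Toy TLB with a single response line (illustrative only)."""

    def __init__(self):
        self.entries = {0x1000: 0x8000}  # dummy virtual -> physical entry
        self.locked = False
        self.response = False

    def on_request(self, asserted):
        if asserted:
            self.entries.clear()  # invalidate all translation entries
        self.locked = asserted    # locked while the request line is high
        self.response = asserted  # response line follows the request line


class MasterMmu:
    def __init__(self, partition):
        self.partition = partition     # list of units, each a list of Tlb
        self.cache = {0x1000: 0x8000}  # the MMU's own translation cache
        self.accepting_requests = True

    def _tlbs(self):
        return [tlb for unit in self.partition for tlb in unit]

    def lock(self):
        self.accepting_requests = False   # step 125: block request buffer
        self.cache.clear()                # step 126: clear MMU cache
        for tlb in self._tlbs():          # step 127: broadcast request
            tlb.on_request(True)
        # "Lock ready" once the AND of all responses is high.
        return all(tlb.response for tlb in self._tlbs())

    def unlock(self):
        self.accepting_requests = True    # step 134: enable request buffer
        for tlb in self._tlbs():          # step 135: deassert request
            tlb.on_request(False)
        # "Unlock ready" once the OR of all responses is low.
        return not any(tlb.response for tlb in self._tlbs())


partition = [[Tlb(), Tlb()], [Tlb()]]  # master unit plus one slave unit
mmu = MasterMmu(partition)
lock_ready = mmu.lock()
state_while_locked = [(t.locked, len(t.entries)) for u in partition for t in u]
unlock_ready = mmu.unlock()
```

In real hardware the responses arrive with different delays, which is exactly why the controller has to wait on the combined signals rather than assume completion.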

[0302] Figure 13 is a timing diagram illustrating the overall multicast cache communication handshake protocol according to this embodiment over multiple clock cycles t0…t10. The handshake scheme is as follows.

[0303] First, the master cache controller broadcasts the request to all caches in the system by setting the request signal high (at t1).

[0304] The request signal is transmitted to all caches in the system via the multicast cache communication network for the partition (so that the request propagates sequentially along the graphics processing units). When a cache receives the active request signal, it responds to the request by setting its corresponding invalidation response line high.

[0305] For example, Figure 13 shows a first cache responding at t2 (cache_0_response) and a second cache responding at t4 (cache_1_response).

[0306] The response signals from the caches are accumulated by the "AND" and "OR" logic on each graphics processing unit within the partition as described above, thereby generating combined "AND" and "OR" response signals that are provided to the master cache controller. A positive edge on the "AND" signal thus indicates to the controller that all caches have now seen the request. The controller waits until the "AND" signal is set before deasserting the request. In this way, it can be ensured that all caches within the partition have received the request.

[0307] Once all caches have received the request, the controller can deassert the request (at t6). The negative edge of the "OR" signal then indicates that all caches have completed the request, thus completing the request-response handshake between the controller and the caches. If necessary, the controller can now initiate the next request.
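The handshake can be reproduced in a cycle-by-cycle Python sketch. The request timing (asserted at t1, deasserted at t6) and the rising response times (t2 and t4) follow the description of Figure 13; the times at which the responses fall are not given in the text, so the model assumes each cache's response line is simply the request line delayed by that cache's latency. The function name `simulate` and its parameters are hypothetical.

```python
def simulate(cycles=11, assert_at=1, deassert_at=6, delays=(1, 3)):
    """Return per-cycle (request, responses, and_signal, or_signal) tuples.

    Assumption: each cache's response is the request line delayed by a
    fixed per-cache latency (1 and 3 cycles for the two caches here).
    """
    trace = []
    for t in range(cycles):
        request = assert_at <= t < deassert_at
        responses = tuple(assert_at <= t - d < deassert_at for d in delays)
        trace.append((request, responses, all(responses), any(responses)))
    return trace


trace = simulate()
# First cycle in which the combined AND signal is high.
and_rise = next(t for t, row in enumerate(trace) if row[2])
# First cycle after that in which the combined OR signal is low again.
or_fall = next(t for t in range(and_rise, len(trace)) if not trace[t][3])
```

Under these assumptions the AND signal rises at t4, when the slower cache responds, and the OR signal falls at t9, when the slower cache finally releases its response, at which point a new request could be issued.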

[0308] The specific embodiments described above are presented for illustrative and descriptive purposes only. They are not intended to be exhaustive or to limit the technology described herein to the precise forms disclosed. Many modifications and variations are possible in accordance with the teachings above. The described embodiments were chosen to best explain the principles of the technology described herein and its practical application, thereby enabling others skilled in the art to best utilize the technology described herein in various embodiments and with various modifications suitable for the particular intended use. The scope of the invention is intended to be defined by the appended claims.

Claims

1. A data processing system, the data processing system comprising: a plurality of processing units configurable as different respective partitions of processing units, wherein each partition includes a set of one or more processing units of the plurality of processing units, and wherein at least some of the plurality of processing units include one or more caches and a corresponding cache controller; and a configurable multicast communication network for routing communications to and from a plurality of caches within a respective partition of processing units; wherein the multicast communication network is configurable such that the cache controller of one of the processing units within a respective partition of the plurality of processing units can be configured as the master cache controller for the set of the plurality of caches within the partition; wherein the master cache controller is operable to simultaneously issue a request to all caches in the set of the plurality of caches within the partition via the multicast communication network; wherein the multicast communication network is configured to combine the respective responses from all caches of a processing unit in the partition to which the request was issued, to generate a respective response signal for that processing unit; and wherein the multicast communication network is further configured to combine the respective response signals from the different processing units within the partition and to provide a combined response signal to the master cache controller, the combined response signal representing the overall request-response status of all caches in the set of the plurality of caches to which the request was issued.

2. The data processing system according to claim 1, wherein the processing units within a partition of the plurality of processing units are arranged as a sequence of processing units, and the multicast communication network is arranged such that cache communication to and from the processing units in the partition is routed sequentially, such that each processing unit in the partition receives signals from its adjacent processing unit in the sequence of processing units.

3. The data processing system of claim 2, wherein the multicast communication network is configured to, for each processing unit in the sequence of processing units, combine the corresponding responses of all caches from the processing unit to which the request is published with a corresponding combined response signal provided by the preceding adjacent processing unit in the sequence, thereby generating a total combined response signal representing the overall request-response status of all caches in the processing unit and in any preceding processing unit in the sequence, the total combined response signal being provided to the next adjacent processing unit in the sequence.

4. The data processing system of claim 3, wherein each processing unit has associated first logic for generating a first combined response signal, the first combined response signal indicating whether all of the plurality of caches to which the request was issued are currently responding to the request.

5. The data processing system of claim 4, wherein each processing unit has associated second logic for generating a second combined response signal, the second combined response signal indicating whether any of the plurality of caches to which the request was issued is currently responding to the request.
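
The sequential combination of claim 3, together with the "all" and "any" signals of claims 4 and 5, can be illustrated with a short sketch. This is only one reading of the claims, not the patented circuit: it assumes the "all" signal is a logical AND and the "any" signal a logical OR of per-cache "responding" flags, chained from unit to unit. The names `ProcessingUnit`, `cache_responding` and `combined_response` are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ProcessingUnit:
    # Hypothetical model: one flag per cache in this unit, True while that
    # cache is currently responding to the multicast request.
    cache_responding: List[bool]

    def combine(self, all_in: bool, any_in: bool) -> Tuple[bool, bool]:
        """Merge this unit's caches with the signals from the preceding unit."""
        # Assumed "first logic": AND over local caches and the incoming signal.
        all_out = all_in and all(self.cache_responding)
        # Assumed "second logic": OR over local caches and the incoming signal.
        any_out = any_in or any(self.cache_responding)
        return all_out, any_out


def combined_response(sequence: Iterable[ProcessingUnit]) -> Tuple[bool, bool]:
    """Walk the sequence toward the master and return (all, any)."""
    # Identity values for AND and OR, so the first unit sees neutral inputs.
    all_sig, any_sig = True, False
    for unit in sequence:
        all_sig, any_sig = unit.combine(all_sig, any_sig)
    return all_sig, any_sig


partition = [
    ProcessingUnit([True, True]),
    ProcessingUnit([True, False]),
    ProcessingUnit([True]),
]
print(combined_response(partition))
```

In this sketch a unit with no caches passes the incoming signals through unchanged, because `all([])` is `True` and `any([])` is `False` in Python.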

6. The data processing system of claim 5, wherein the master cache controller is configured to determine, based on the first combined response signal and/or the second combined response signal, the current overall request-response state of all caches in the set of the plurality of caches to which the request was issued, the overall request-response state being one or more of the following: (i) none of the caches in the set of the plurality of caches to which the request was issued have received the request; (ii) at least some of the caches in the set of the plurality of caches to which the request was issued have received the request; (iii) all of the caches in the set of the plurality of caches to which the request was issued have received the request; (iv) at least some of the caches in the set of the plurality of caches to which the request was issued have completed the request; and (v) all of the caches in the set of the plurality of caches to which the request was issued have completed the request.
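
How a controller might derive states (i) to (v) from the two combined signals can be sketched as follows. The claims do not spell out the decoding, so this rests on two loudly stated assumptions: a cache asserts "responding" from the moment it receives the request until it completes it, and every cache is still responding when the last one receives the request, so that the "all" signal is asserted at some point. The names `OverallState` and `MasterTracker` are hypothetical.

```python
from enum import Enum


class OverallState(Enum):
    NONE_RECEIVED = "i"
    SOME_RECEIVED = "ii"
    ALL_RECEIVED = "iii"
    SOME_COMPLETED = "iv"
    ALL_COMPLETED = "v"


class MasterTracker:
    """Hypothetical tracker: decodes (all, any) samples for one request."""

    def __init__(self) -> None:
        # Remembers whether the "all" signal has ever been asserted, which
        # distinguishes the ramp-up phase from the ramp-down phase.
        self.seen_all = False

    def update(self, all_sig: bool, any_sig: bool) -> OverallState:
        if all_sig:
            self.seen_all = True
            return OverallState.ALL_RECEIVED
        if any_sig:
            # Some, but not all, caches are responding.
            if self.seen_all:
                return OverallState.SOME_COMPLETED
            return OverallState.SOME_RECEIVED
        # No cache is responding.
        if self.seen_all:
            return OverallState.ALL_COMPLETED
        return OverallState.NONE_RECEIVED


tracker = MasterTracker()
samples = [(False, False), (False, True), (True, True), (False, True), (False, False)]
states = [tracker.update(a, b) for a, b in samples]
print([s.value for s in states])
```

The `seen_all` flag is a design choice of this sketch: without some memory, "any asserted, all not asserted" cannot be told apart between states (ii) and (iv).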

7. The data processing system of claim 1, wherein the request issued by the master cache controller to the set of the plurality of caches within the partition is a request to clear and/or invalidate all of the caches in the set of the plurality of caches, or wherein the request includes a memory barrier.

8. The data processing system of claim 1, wherein the master cache controller is operable to issue requests, using the multicast communication network, to a set of a plurality of caches that includes a set of address translation caches, and wherein the master cache controller comprises a corresponding memory management unit.

9. The data processing system of claim 1, wherein the multicast communication network is reconfigurable for different partitions of processing units, so as to allow the cache controllers of different processing units of the plurality of processing units to be configured as the master cache controllers for different respective partitions.

10. The data processing system of claim 9, wherein each processing unit includes one or more network interfaces for communication from adjacent processing units in the partition, and wherein each processing unit further includes a set of one or more isolating switches that can be configured to isolate communication from the respective network interface to configure the communication network.
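
The role of the isolating switches in shaping the network can be illustrated with a minimal sketch. It assumes, purely for illustration, that the processing units form a single linear chain, that each partition is a contiguous run of units, and that one switch sits on each link between neighbours; none of these details are fixed by the claims. The function name `isolation_settings` is hypothetical.

```python
from typing import List


def isolation_settings(partition_of: List[int]) -> List[bool]:
    """Return one flag per link between unit i and unit i+1.

    partition_of[i] is the partition identifier assigned to unit i.
    A flag is True when the link must be isolated, which in this sketch
    is the case whenever the two neighbours belong to different partitions.
    """
    return [
        partition_of[i] != partition_of[i + 1]
        for i in range(len(partition_of) - 1)
    ]


# Six units split into three partitions: {0, 1}, {2, 3, 4} and {5}.
links = isolation_settings([0, 0, 1, 1, 1, 2])
print(links)
```

Re-partitioning then amounts to calling the function with a new assignment and applying the resulting switch settings, so that cache communications of one partition never reach a neighbouring partition.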

11. A method for operating a data processing system, the data processing system comprising: a plurality of processing units configurable as different respective partitions of processing units, wherein each partition includes a set of one or more processing units of the plurality of processing units, and wherein at least some of the plurality of processing units include one or more caches and corresponding cache controllers; and a multicast communication network for routing cache communications to and from a plurality of caches within a partition of processing units, wherein the multicast communication network is configurable such that the cache controller of one of the processing units within the partition can be configured as a master cache controller for a set of a plurality of caches within the partition; the method comprising, for a respective partition of the plurality of processing units: the master cache controller for the partition issuing a request simultaneously to all caches in the set of the plurality of caches within the partition via the multicast communication network; the multicast communication network combining, for a processing unit in the partition, the respective responses from all caches of that processing unit to which the request was issued, to generate a respective response signal for the processing unit; the multicast communication network further combining the respective response signals from the different processing units within the partition and providing a combined response signal to the master cache controller; and the master cache controller using the combined response signal to determine the overall request-response status of all caches in the set of the plurality of caches to which the request was issued.

12. The method of claim 11, wherein the processing units within a respective partition of the plurality of processing units are arranged as a sequence of processing units, and the method includes sequentially routing cache communications to and from the processing units in the partition, such that each processing unit in the partition receives signals from an adjacent processing unit in the sequence of processing units.

13. The method of claim 12, comprising: for each processing unit in the sequence of processing units, combining the respective responses of all caches of the processing unit to which the request was issued with the respective combined response signal provided by the preceding adjacent processing unit in the sequence, to generate a total combined response signal representing the overall request-response status of all caches in the processing unit and in any preceding processing unit in the sequence, and then providing the total combined response signal to the next adjacent processing unit in the sequence.

14. The method of claim 13, wherein each processing unit has associated first logic for generating a first combined response signal, the first combined response signal indicating whether all of the plurality of caches to which the request was issued are currently responding to the request.

15. The method of claim 14, wherein each processing unit has associated second logic for generating a second combined response signal, the second combined response signal indicating whether any of the plurality of caches to which the request was issued is currently responding to the request.

16. The method of claim 15, further comprising the master cache controller determining, based on the first combined response signal and/or the second combined response signal, a current overall request-response state of all caches in the set of the plurality of caches to which the request was issued, the overall request-response state being one or more of the following: (i) none of the caches in the set of the plurality of caches to which the request was issued have received the request; (ii) at least some of the caches in the set of the plurality of caches to which the request was issued have received the request; (iii) all of the caches in the set of the plurality of caches to which the request was issued have received the request; (iv) at least some of the caches in the set of the plurality of caches to which the request was issued have completed the request; and (v) all of the caches in the set of the plurality of caches to which the request was issued have completed the request.

17. The method of claim 11, wherein the request issued by the master cache controller to the set of the plurality of caches within the partition is a request to clear and/or invalidate all of the caches in the set of the plurality of caches, or wherein the request includes a memory barrier.

18. The method of claim 11, wherein the master cache controller is operable to issue requests, using the multicast communication network, to a set of a plurality of caches that includes a set of address translation caches, and wherein the master cache controller comprises a corresponding memory management unit.

19. The method of claim 11, wherein each processing unit includes one or more network interfaces for communication from adjacent processing units in the partition, and wherein each processing unit further includes a set of one or more isolating switches configurable to isolate communication from the respective network interface, the method comprising: configuring the multicast communication network for the partition of processing units by setting one or more of the isolating switches of the processing units in the partition according to the logical arrangement of the processing units in the partition.

20. A computer program product comprising computer software code which, when executed on one or more data processors, performs a method for operating a data processing system, the data processing system comprising: a plurality of processing units configurable as different respective partitions of processing units, wherein each partition includes a set of one or more processing units of the plurality of processing units, and wherein at least some of the plurality of processing units include one or more caches and corresponding cache controllers; and a multicast communication network for routing cache communications to and from a plurality of caches within a partition of processing units, wherein the multicast communication network is configurable such that the cache controller of one of the processing units within the partition can be configured as a master cache controller for a set of a plurality of caches within the partition; the method comprising, for a respective partition of the plurality of processing units: the master cache controller for the partition issuing a request simultaneously to all caches in the set of the plurality of caches within the partition via the multicast communication network; the multicast communication network combining, for a processing unit in the partition, the respective responses from all caches of that processing unit to which the request was issued, to generate a respective response signal for the processing unit; the multicast communication network further combining the respective response signals from the different processing units within the partition and providing a combined response signal to the master cache controller; and the master cache controller using the combined response signal to determine the overall request-response status of all caches in the set of the plurality of caches to which the request was issued.