Methods, systems, and media for coherent aggregation of rays

By introducing a hierarchical acceleration structure and instance transformation cache into the ray tracing system, coherent aggregation of rays is achieved, solving the problem of low rendering efficiency of ray tracing systems on resource-constrained devices in the prior art, and improving the efficiency of intersection testing and memory usage.

CN116309718B · Active · Publication Date: 2026-06-19 · IMAGINATION TECH LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
IMAGINATION TECH LTD
Filing Date
2021-08-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing ray tracing systems struggle to perform efficient intersection tests when rendering images in real time, especially on resource-constrained mobile devices. This results in excessively high processing power and storage requirements, making it difficult to achieve real-time rendering.

Method used

The method uses a hierarchical acceleration structure together with an instance transformation cache. Rays are coherently aggregated into groups that are tested against the same node, which reduces memory accesses and improves intersection testing efficiency. The instance transformation cache stores multiple instance transformations during intersection testing, so that a group of rays sharing the same transformation can be processed together.

Benefits of technology

It improves the intersection testing efficiency of ray tracing systems, reduces memory access overhead, and is suitable for real-time image rendering on resource-constrained devices.

✦ Generated by Eureka AI based on patent content.

Figure CN116309718B_ABST

Abstract

This invention relates to methods, systems, and media for coherent aggregation of rays. A system and method for coherent aggregation of rays in a ray tracing system are provided. The ray tracing system uses a hierarchical acceleration structure comprising multiple nodes, including upper-level nodes and lower-level nodes. For each instance where one of the lower-level nodes is a child node of one of the upper-level nodes, an instance transformation is defined, specifying the relationship between a first coordinate system of the upper-level node and a second coordinate system of the instance of the lower-level node. The system provides an instance transformation cache for storing multiple instance transformations during intersection testing.

Description

[0001] This application is a divisional application of application number 202110953226.3, entitled "Coherent Aggregation for Ray Tracing", filed on August 19, 2021.

Technical Field

[0002] This invention relates to coherent aggregation for ray tracing.

Background Technology

[0003] Ray tracing systems can simulate how rays (e.g., rays of light) interact with a scene. For example, ray tracing techniques can be used in graphics rendering systems configured to generate images from a 3D scene description. These images can be photorealistic or achieve other goals. For instance, animated films can be created using 3D rendering techniques. A 3D scene description typically includes data defining the geometry within the scene. This geometric data is usually defined by primitives, which are typically triangular primitives, but can sometimes be other shapes such as other polygons, lines, or points.

[0004] Ray tracing mimics the natural interaction of light with objects in a scene, and complex rendering features can be naturally generated from ray-traced 3D scenes. Ray tracing can be relatively easily parallelized at the pixel-by-pixel level because pixels are typically independent of each other. However, in cases such as ambient occlusion, reflections, and caustics, the distribution and varying positions and directions of rays in a 3D scene make it difficult to pipeline the processing involved in ray tracing. Ray tracing allows for the rendering of realistic images, but it typically requires high levels of processing power and large working memory, making it potentially difficult to implement for real-time rendering (e.g., for gaming applications), especially on devices with strict limitations on silicon area, cost, and power consumption, such as mobile devices (e.g., smartphones, tablets, laptops, etc.).

[0005] At a very broad level, ray tracing involves: (i) identifying intersections between rays in the scene and geometry (e.g., primitives), and (ii) performing certain processes (e.g., by executing a shader procedure) in response to identifying the intersections to determine how the intersections contribute to the image being rendered. The execution of the shader procedure may cause additional rays to be emitted into the scene. These additional rays may be referred to as “secondary rays.”

[0006] Identifying intersections between rays and geometry in a scene involves a great deal of processing. In a very simple approach, every ray can be tested against every primitive in the scene, and the closest intersection can then be identified once all intersection hits have been determined. This approach is impractical for scenes that may have millions or billions of primitives, where the number of rays to process could also be in the millions. Therefore, ray tracing systems typically use acceleration structures that characterize the geometry in a scene in a way that reduces the work required for intersection testing. However, even with existing acceleration structures, it is difficult to perform intersection testing at rates suitable for real-time rendering (e.g., for gaming applications), especially on devices with strict limitations on silicon area, cost, and power consumption, such as mobile devices (e.g., smartphones, tablets, laptops, etc.).

[0007] Modern ray tracing architectures typically use acceleration structures based on bounding volume hierarchies, in particular bounding box hierarchies. Primitives are grouped together into bounding boxes that surround them. These bounding boxes are then grouped together to form larger bounding boxes that surround them. Intersection testing becomes cheaper because, if a ray misses a bounding box, it does not need to be tested against any of that bounding box's child nodes.
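The pruning property described in this paragraph can be illustrated with a minimal sketch. It is not taken from the patent: the `Node` class, the `ray_hits_box` slab test, the `traverse` function, and all coordinates are assumptions chosen for illustration, and the primitive test itself is omitted (a leaf is reported as a candidate when its box is hit).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class Node:
    lo: Vec3                                  # minimum corner of the bounding box
    hi: Vec3                                  # maximum corner of the bounding box
    children: List["Node"] = field(default_factory=list)
    primitive: Optional[str] = None           # label standing in for primitive data

def ray_hits_box(origin, direction, lo, hi, t_min=0.0, t_max=float("inf")):
    """Slab test: narrow the ray's parameter range by each axis-aligned slab."""
    for o, d, a, b in zip(origin, direction, lo, hi):
        if d == 0.0:
            # Ray is parallel to this slab: it must start inside it.
            if o < a or o > b:
                return False
            continue
        t0, t1 = (a - o) / d, (b - o) / d
        if t0 > t1:
            t0, t1 = t1, t0
        t_min, t_max = max(t_min, t0), min(t_max, t1)
        if t_min > t_max:
            return False
    return True

def traverse(node, origin, direction, stats):
    """Return candidate primitives; a missed box prunes its whole subtree."""
    stats["box_tests"] += 1
    if not ray_hits_box(origin, direction, node.lo, node.hi):
        return []
    if node.primitive is not None:
        return [node.primitive]
    hits = []
    for child in node.children:
        hits += traverse(child, origin, direction, stats)
    return hits

# Hypothetical seven-node hierarchy: a root, two boxes, four leaves.
left = Node((0, 0, 0), (4, 4, 4), [
    Node((0, 0, 0), (2, 2, 2), primitive="A"),
    Node((2, 2, 2), (4, 4, 4), primitive="B"),
])
right = Node((6, 6, 6), (10, 10, 10), [
    Node((6, 6, 6), (8, 8, 8), primitive="C"),
    Node((8, 8, 8), (10, 10, 10), primitive="D"),
])
root = Node((0, 0, 0), (10, 10, 10), [left, right])

stats = {"box_tests": 0}
candidates = traverse(root, (-1.0, 1.0, 1.0), (1.0, 0.0, 0.0), stats)
```

In this made-up scene the ray misses the right-hand box, so the two leaves beneath it are never tested: five box tests are performed instead of seven.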

[0008] In a typical hierarchical approach, two types of acceleration structures can be identified: bottom-level acceleration structures (BLAS) and top-level acceleration structures (TLAS). A BLAS groups primitives together, meaning that a BLAS has leaf nodes that are object primitives (typically triangles, but other geometries are possible). The top level of a BLAS is a single root node. For example, a BLAS can be used to describe a single object in a scene. A TLAS describes the higher levels of the scene, starting from a root node at the top level and terminating, at its lowest level, in BLASs.

[0009] Intersection testing is performed by traversing the hierarchy. If a given ray “hits” a bounding box (node), the ray needs to be tested against every child node of that bounding box (node). This continues down the hierarchy until the ray misses all of the child nodes of a node, or hits at least one primitive. Testing a ray against a node requires retrieving from memory (i) a description of the ray (typically defined by its origin and direction) and (ii) a description of the node's geometry (bounding box coordinates or primitive coordinates).

Summary of the Invention

[0010] This summary is provided to introduce, in a simplified form, a series of concepts further described below in the detailed description. This summary is not intended to identify key or essential features of the claimed subject matter, nor is it intended to limit the scope of the claimed subject matter.

[0011] A system and method are provided for coherent aggregation of rays in a ray tracing system. The ray tracing system uses a hierarchical acceleration structure comprising multiple nodes, including upper-level nodes and lower-level nodes. For each instance where one of the lower-level nodes is a child node of one of the upper-level nodes, an instance transformation is defined, specifying the relationship between a first coordinate system of the upper-level node and a second coordinate system of the instance of the lower-level node. The system provides an instance transformation cache for storing multiple instance transformations during intersection testing.

[0012] Based on one aspect, a coherent aggregation method is provided.

[0013] Each lower-level node can be a descendant (child node, grandchild node, etc.) of at least one of the upper-level nodes. A lower-level node can include a root lower-level node. A root lower-level node can have a parent node that is also an upper-level node, where all nodes above that parent node in the hierarchical structure (i.e., its ancestor nodes, such as grandparent nodes) are upper-level nodes. A root lower-level node can have at least one child node that is also a lower-level node, where all nodes below that child node in the hierarchical structure are lower-level nodes.

[0014] There can exist at least one root lower-level node that is a descendant (e.g., a grandchild node) of two or more upper-level nodes. That is, the root lower-level node can be instantiated twice (or more) by two (or more) different upper-level nodes. Alternatively or additionally, there can exist at least one root lower-level node that is instantiated twice (or more) by a single upper-level node.

[0015] The first coordinate system can be the global coordinate system (also known as "world space"). The second coordinate system can be the local coordinate system associated with BLAS. The geometric information of all descendant nodes of a given root node can be defined in the same local coordinate system.

[0016] The method may further include retrieving geometric information of the selected lower-level nodes before submitting the selected ray group for intersection testing. The method may also include retrieving ray information for the selected ray group. Retrieving geometric information may include retrieving the geometric information from memory. Retrieving ray information may include retrieving the ray information from a ray storage device. Retrieving instance transformations may include retrieving the instance transformations from memory. Submitting the selected group may include transforming the ray information using instance transformations.

[0017] Instance transformations can be defined for the root node and all its descendant nodes. The root node and its descendants can form a BLAS and represent a model of an object. Objects are typically rigid objects, so instance transformations apply equally to all parts of the object.

[0018] The ray information for each ray can include its position and direction in the global coordinate system. The direction is the ray's orientation. The position can be the ray's origin. Ray information can also include the ray's minimum and maximum path lengths.

[0019] The geometric information for each upper-level node may include boundary volumes, such as bounding boxes, for example, axis-aligned bounding boxes. A boundary volume (or bounding box) can be the volume that surrounds all child nodes of the node in question. The geometric information for each lower-level node may include boundary volumes (similar to those of upper-level nodes), or it may include descriptions of one or more geometric primitives. Primitives can be geometric shapes, such as triangles.

[0020] When an instance transformation is not found in the instance transformation cache, retrieving the instance transformation may include: requesting (724) the instance transformation; monitoring whether the instance transformation has been returned; and after detecting that the instance transformation has been returned, continuing to submit the selected ray group for the intersection test.

[0021] Requesting an instance transformation may include requesting the instance transformation from memory (optionally via an acceleration structure cache). The request may be satisfied when the requested instance transformation (from memory, optionally via an acceleration structure cache) is returned.

[0022] The method may go on to request a second instance transformation while waiting for the request for a first instance transformation to be satisfied. Requests may be satisfied (i.e., instance transformations may be returned) in an order different from the order in which they were made. For example, the method may include requesting a first instance transformation, followed by requesting a second instance transformation; monitoring whether these instance transformations have been returned; detecting that the second instance transformation has been returned; submitting a ray group associated with the second instance transformation for an intersection test; subsequently detecting that the first instance transformation has been returned; and submitting a ray group associated with the first instance transformation for an intersection test.
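The out-of-order sequence in this paragraph can be sketched in software as follows. This is only an illustration under stated assumptions: `TransformFetcher`, `submit_ready_groups`, and the identifiers `"first"` and `"second"` are hypothetical names, and the `complete` method simply stands in for memory returning a requested transformation.

```python
from collections import deque

class TransformFetcher:
    """Hypothetical stand-in for memory; the caller decides the return order."""

    def __init__(self):
        self.pending = set()       # requested but not yet returned
        self.returned = deque()    # returned but not yet consumed

    def request(self, instance_id):
        self.pending.add(instance_id)

    def complete(self, instance_id):
        # Models memory returning the transformation for this instance.
        self.pending.remove(instance_id)
        self.returned.append(instance_id)

    def poll(self):
        return self.returned.popleft() if self.returned else None

def submit_ready_groups(fetcher, waiting, submitted):
    """Submit every waiting ray group whose transformation has been returned."""
    instance_id = fetcher.poll()
    while instance_id is not None:
        for group in waiting.pop(instance_id, []):
            submitted.append((instance_id, group))
        instance_id = fetcher.poll()

fetcher = TransformFetcher()
waiting = {"first": [["ray0", "ray1"]], "second": [["ray2"]]}
submitted = []

fetcher.request("first")
fetcher.request("second")           # issued without waiting for "first"
fetcher.complete("second")          # the second request is satisfied first
submit_ready_groups(fetcher, waiting, submitted)
fetcher.complete("first")
submit_ready_groups(fetcher, waiting, submitted)
```

The ray group waiting on the second transformation is submitted before the group waiting on the first, so a slow request does not stall later ones.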

[0023] An intersection testing method is also provided, including the coherence aggregation method described above, the method further including performing an intersection test on each ray of the selected ray group for the instance of the lower-level node.

[0024] A ray tracing method is also provided, which includes the intersection testing method and further includes calling a shader program to calculate the effect of an intersection between a ray and a (primitive) node.

[0025] According to another aspect, a system for coherent aggregation of rays in a ray tracing system is provided.

[0026] The coherence aggregation unit can be configured to retrieve the geometric information of the selected lower-level nodes for testing. The system may also include a scheduler unit configured to retrieve ray information of the selected ray group from a ray storage device. The system can be implemented in a fixed-function circuit system.

[0027] The system may further include an instance transformation unit configured to use instance transformation to transform ray information, wherein the coherence aggregation unit is configured to submit the ray and associated instance transformation to the instance transformation unit when a selected ray group is submitted for intersection testing.

[0028] If the system further includes a scheduler unit, then the instance transformation unit may be a component of the scheduler unit.

[0029] When an instance transformation is not found in the instance transformation cache, the coherence aggregation unit can be configured to retrieve the instance transformation by: requesting the instance transformation; monitoring whether the instance transformation has been returned; and, after detecting that the instance transformation has been returned, continuing to submit the selected ray group for intersection testing.

[0030] The coherence aggregation unit can be configured to submit the selected ray group to the scheduler unit (see below) for intersection testing.

[0031] The system may also include one or more tester units configured to perform intersection tests.

[0032] The nodes in the acceleration structure can include primitive nodes and bounding box nodes. The tester units can include: one or more box tester units for performing intersection tests on bounding box nodes; and one or more primitive tester units for performing intersection tests on primitive nodes.

[0033] The instance transformation cache may include content-addressable memory (CAM) and random access memory (RAM).

[0034] The CAM can be a component of the coherence aggregation unit. The system may also include a scheduler unit, wherein RAM and an optional instance transformation unit are components of the scheduler unit.

[0035] CAM can be configured to store a reference counter for each instance transform, which records the number of ray groups currently being tested that reference that instance transform.

[0036] The coherence aggregation unit can be configured to increment the reference counter when a node (and associated ray group) using the corresponding instance transformation is submitted for intersection testing. It can also be configured to decrement the reference counter when intersection testing is completed for a node (and ray group) using the instance transformation.

[0037] CAM can be configured to store a validity flag for each instance transformation in the instance transformation cache, indicating whether the instance transformation is currently valid.

[0038] The ray storage device and memory can be provided in separate hardware units. The ray storage device can be local to the coherence aggregation unit. The memory can be external to the coherence aggregation unit. (It can also be external to the scheduler unit and one or more tester units.) An acceleration structure cache can act as an intermediary between the coherence aggregation unit and the memory.

[0039] When storing an instance transformation in the instance transformation cache, the coherence aggregation unit can be configured to store it at an index location whose validity flag indicates that the location is currently invalid. If the validity flags indicate that all index locations are currently valid, the coherence aggregation unit can be configured to store the instance transformation at an index location whose reference counter indicates that the instance transformation held there is not referenced by any ray group currently being tested.
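The validity flags, reference counters, and replacement rule described in the preceding paragraphs can be modeled in software roughly as follows. This is a sketch under stated assumptions, not the hardware design: the class and method names are hypothetical, the tag array stands in for the CAM and the data array for the RAM, and returning `None` when every entry is still referenced is an assumed behavior that the text above does not specify.

```python
class InstanceTransformCache:
    """Software model of an instance transformation cache (names are illustrative)."""

    def __init__(self, num_entries):
        self.tags = [None] * num_entries      # CAM side: instance id per index
        self.valid = [False] * num_entries    # CAM side: validity flag per index
        self.refcount = [0] * num_entries     # CAM side: ray groups in flight
        self.data = [None] * num_entries      # RAM side: the transformation itself

    def lookup(self, instance_id):
        """Return the index holding this instance's transformation, or None."""
        for index, tag in enumerate(self.tags):
            if self.valid[index] and tag == instance_id:
                return index
        return None

    def allocate(self, instance_id, transform):
        """Prefer an invalid index; otherwise reuse a valid, unreferenced one."""
        for index in range(len(self.tags)):
            if not self.valid[index]:
                return self._fill(index, instance_id, transform)
        for index in range(len(self.tags)):
            if self.refcount[index] == 0:
                return self._fill(index, instance_id, transform)
        return None  # assumption: caller retries once some group has finished

    def _fill(self, index, instance_id, transform):
        self.tags[index] = instance_id
        self.valid[index] = True
        self.refcount[index] = 0
        self.data[index] = transform
        return index

    def acquire(self, index):
        """A ray group using this transformation was submitted for testing."""
        self.refcount[index] += 1

    def release(self, index):
        """Intersection testing finished for a ray group using this entry."""
        self.refcount[index] -= 1

cache = InstanceTransformCache(num_entries=2)
slot_a = cache.allocate("A", "transform A")
cache.acquire(slot_a)
slot_b = cache.allocate("B", "transform B")
cache.acquire(slot_b)
blocked = cache.allocate("C", "transform C")   # both entries are still referenced
cache.release(slot_b)
slot_c = cache.allocate("C", "transform C")    # reuses the entry that held "B"
```

Because an entry is only overwritten when its counter is zero, a transformation cannot be evicted while a ray group that depends on it is still being tested.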

[0040] A graphics processing system is also provided, which is configured to perform the methods described above.

[0041] A graphics processing system is also provided, including a system for coherence aggregation as outlined above.

[0042] Coherent aggregation systems, ray tracing systems, or graphics processing systems can be implemented in hardware on integrated circuits.

[0043] According to another aspect, a method is provided for manufacturing the system or graphics processing system as outlined above using an integrated circuit manufacturing system.

[0044] A method is also provided for manufacturing the coherent aggregation system, ray tracing system, or graphics processing system as described above using an integrated circuit manufacturing system, the method comprising: processing a computer-readable description of the coherent aggregation system, ray tracing system, or graphics processing system using a layout processing system to generate a circuit layout description of an integrated circuit embodying the coherent aggregation system, ray tracing system, or graphics processing system; and manufacturing the coherent aggregation system, ray tracing system, or graphics processing system according to the circuit layout description using an integrated circuit generation system.

[0045] Computer-readable code is also provided, configured to cause the methods outlined above to be performed when the code is run; and a computer-readable storage medium on which the computer-readable code is encoded. The storage medium may be a non-transitory computer-readable storage medium. When the computer-readable code is executed on a computer system, it can cause the computer system to perform any of the methods described herein.

[0046] A computer-readable storage medium is also provided, on which an integrated circuit definition dataset is encoded, wherein when the integrated circuit definition dataset is processed in an integrated circuit manufacturing system, the integrated circuit manufacturing system is configured to manufacture a graphics processing system as described above.

[0047] A non-transitory computer-readable storage medium is also provided, on which a computer-readable description of the coherent aggregation system, ray tracing system, or graphics processing system as described above is stored, wherein when processed in an integrated circuit manufacturing system, the computer-readable description causes the integrated circuit manufacturing system to manufacture an integrated circuit embodying the coherent aggregation system, ray tracing system, or graphics processing system.

[0048] A non-transitory computer-readable storage medium is also provided, on which a computer-readable description of a coherent aggregation system, ray tracing system, or graphics processing system as outlined above is stored. When processed in an integrated circuit manufacturing system, the computer-readable description causes the integrated circuit manufacturing system to: process the computer-readable description of the coherent aggregation system, ray tracing system, or graphics processing system using a layout processing system to generate a circuit layout description of an integrated circuit embodying the coherent aggregation system, ray tracing system, or graphics processing system; and manufacture the coherent aggregation system, ray tracing system, or graphics processing system according to the circuit layout description using an integrated circuit generation system.

[0049] An integrated circuit manufacturing system is also provided, which is configured to manufacture a graphics processing system as described above.

[0050] The integrated circuit manufacturing system may include: a non-transitory computer-readable storage medium storing a computer-readable description of a coherent aggregation system, ray tracing system, or graphics processing system as outlined above; a layout processing system configured to process the computer-readable description to generate a circuit layout description of an integrated circuit embodying the coherent aggregation system, ray tracing system, or graphics processing system; and an integrated circuit generation system configured to manufacture the coherent aggregation system, ray tracing system, or graphics processing system based on the circuit layout description. The layout processing system may be configured to determine the location information of logic components of a circuit derived from the integrated circuit description to generate a circuit layout description of an integrated circuit embodying the coherent aggregation system, ray tracing system, or graphics processing system.

[0051] As will be apparent to those skilled in the art, the above features can be appropriately combined, and can be combined with any aspect of the examples described herein.

Description of the Drawings

[0052] Examples will now be described in detail with reference to the accompanying drawings, in which:

[0053] Figure 1a shows a scene divided according to a bounding volume structure;

[0054] Figure 1b shows a hierarchical acceleration structure for the bounding volume structure shown in Figure 1a;

[0055] Figure 2 shows a hierarchical acceleration structure including top-level nodes and bottom-level nodes, according to an example;

[0056] Figure 3 is a simplified block diagram of an example system for coherent aggregation of rays;

[0057] Figure 4 shows the system of Figure 3 in more detail;

[0058] Figure 5 is a flowchart of a coherent aggregation method that can be implemented in the system of Figure 3;

[0059] Figure 6 is a more detailed flowchart, according to an example, showing the process of retrieving geometric information and instance transformations;

[0060] Figures 7a and 7b show a flowchart illustrating the caching process for instance transformations, according to an example;

[0061] Figure 8 shows the data structure stored in the instance transformation cache, according to an example;

[0062] Figure 9 shows a computer system in which a graphics processing system is implemented; and

[0063] Figure 10 shows an integrated circuit manufacturing system for generating an integrated circuit that embodies a graphics processing system.

[0064] The accompanying drawings illustrate various examples. Those skilled in the art will understand that the element boundaries (e.g., boxes, groups of boxes, or other shapes) shown in the drawings represent one example of a boundary. In some examples, it may be that one element can be designed as multiple elements, or multiple elements can be designed as one element. Where appropriate, common reference numerals are used throughout the drawings to indicate similar features.

Detailed Description

[0065] The following description is given by way of example to enable those skilled in the art to make and use the invention. The invention is not limited to the embodiments described herein, and various modifications to the disclosed embodiments will be readily apparent to those skilled in the art.

[0066] The embodiments will now be described by way of example only.

[0067] In typical hardware architectures, memory access is a relatively expensive operation (in terms of time and / or energy consumption). It is desirable to minimize any redundancy in requests to read data from memory. Therefore, it is beneficial to aggregate and group rays that need to be tested against the same part of the hierarchical structure. This is referred to herein as coherent aggregation. It allows geometric information to be read once and tested against multiple rays. This also facilitates parallel implementation, for example, using a Single Instruction Multiple Data (SIMD) model, whereby a single hardware unit processes different rays (in the same group) against the same geometric information in parallel. The examples disclosed herein can use coherent aggregation to facilitate more efficient intersection testing for ray tracing. In particular, it is desirable to improve the efficiency of intersection testing for BLAS nodes.

[0068] TLAS is defined in world space, that is, the scene's global coordinate system. The global coordinate system is an example of the first coordinate system. Rays are also defined in world space.

[0069] Because an object can appear in multiple different locations and orientations within a scene, the BLAS representing that object can be instantiated multiple times. For example, a BLAS describing car wheels can be instantiated four times, once for each wheel. For instance, the BLAS can have a hierarchical structure with 1,000 to 10,000 nodes. The wheel model remains the same in every case, but each wheel is located in a different position within the scene, and the front wheels can be oriented differently from the rear wheels.

[0070] While this could be handled by creating four separate copies of the “wheel” BLAS in memory (where the geometry of each wheel is defined in world space), this results in relatively inefficient memory usage. Instead, a TLAS can reference (instantiate) a single copy (BLAS) of the model multiple times. Using the latter approach, each BLAS defines its geometry in an “instance space”—the local coordinate system of the described object. The local coordinate system is an example of a second coordinate system. In the car example, each wheel is identical within the local coordinate system (instance space). The origin and axes of the local coordinate system can be defined in any convenient way. For example, the origin of the local coordinate system could be set to the centroid of the object, or the end of the object. The orientation of the axes in the local coordinate system can be defined based on one or more principal axes of the object, or they can be chosen substantially arbitrarily. The object is described hierarchically within the BLAS. For example, a BLAS describing a seat could include nodes describing the seat bottom, seat back, and legs. All nodes in a given BLAS use the same local coordinate system.

[0071] The "world-to-instance transformation" (or simply "instance transformation") defines the position and orientation of each instance of the BLAS within the scene. In this method, the geometric information of the BLAS is stored once (in instance space), and the instance transformation is stored for each instance—that is, for each individual reference to the BLAS. The instance transformation associates the local (instance space) geometric information of the BLAS with world space for each instance of the BLAS. This has the potential to significantly reduce the storage requirements for geometric information.

[0072] For example, a TLAS describing a car might reference a “wheel” BLAS (and many other BLAS to represent other parts of the car) four times. The bounding box and the geometric information of the primitives describing the wheel are stored once. Within the TLAS, each instance (i.e., reference instance) of the “wheel” BLAS is associated with a different instance transformation that positions and orients that particular wheel in world space.

[0073] To test a ray against the geometry of a specific BLAS instance, the ray needs to be transformed into the instance space of that instance. (Alternatively, the geometric information can be transformed into world space.) The instance transformation applies to all nodes in the BLAS; therefore, if a ray hits a parent node within the BLAS, the same instance transformation needs to be applied again to test the ray against the child nodes of that parent node. Typically, the transformation is provided by the controlling software application in the form of an instance-to-world transformation. This can be inverted by the ray tracing system to obtain the world-to-instance transformation. When rays are transformed into instance space, the world-to-instance transformation needs to be applied repeatedly (i.e., for each intersection test); therefore, it makes sense to store the transformation in this form. Conversely, if the geometric information is to be transformed into world space to perform intersection tests, it makes sense to store the instance-to-world transformation.
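As a minimal sketch of the inversion and ray transformation described in this paragraph, assume that an instance transformation is represented as a 3x4 affine matrix (a 3x3 linear part followed by a translation column). That representation, the function names, and the example values are assumptions for illustration; the text above does not fix a storage format.

```python
def invert_affine(m):
    """Invert a 3x4 affine matrix [A | t], returning [A^-1 | -A^-1 t]."""
    (a, b, c, tx), (d, e, f, ty), (g, h, i, tz) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if det == 0:
        raise ValueError("instance transformation is not invertible")
    inv = [
        [(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
        [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
        [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det],
    ]
    t = (tx, ty, tz)
    return [row + [-sum(r * v for r, v in zip(row, t))] for row in inv]

def transform_ray(world_to_instance, origin, direction):
    """Move a world-space ray into instance space.

    The origin is a point, so the translation applies to it; the direction
    is a vector, so only the linear part applies.
    """
    new_origin = tuple(
        sum(row[k] * origin[k] for k in range(3)) + row[3]
        for row in world_to_instance
    )
    new_direction = tuple(
        sum(row[k] * direction[k] for k in range(3))
        for row in world_to_instance
    )
    return new_origin, new_direction

# Hypothetical instance-to-world transformation: rotate 90 degrees about the
# z axis, then translate by (5, 0, 0).
instance_to_world = [
    [0, -1, 0, 5],
    [1, 0, 0, 0],
    [0, 0, 1, 0],
]
world_to_instance = invert_affine(instance_to_world)   # computed once, reused
origin, direction = transform_ray(world_to_instance, (5, 1, 0), (0, 1, 0))
```

The inverse is computed once and the result is reused for every intersection test in that instance, which is why storing the world-to-instance form is convenient. The direction is deliberately not renormalized here, so distances along the ray keep the same parameter values in both spaces.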

[0074] The inventors recognized that it would be desirable for a coherent aggregation algorithm to handle BLAS instances efficiently. Instead of grouping rays based on the BLAS node to be tested, rays should be grouped based on the specific instance of the BLAS node to be tested. In other words, coherent aggregation of rays should be instance-aware. By grouping rays according to each specific instance of each BLAS node, the system can schedule together, for testing, groups of rays that share the same transformation and the same BLAS node. Therefore, at most one memory request is needed to retrieve the transformation used for intersection testing of a given group of rays. According to the examples, this is further facilitated by using an instance transformation cache. When an instance transformation is needed for the first time, it is loaded into the instance transformation cache. The next time the same instance transformation is needed for intersection testing, it can be expected to be retrieved from the instance transformation cache without needing to be loaded from external memory. This reduces memory access overhead.
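Instance-aware grouping can be sketched by keying each group on the pair (instance, BLAS node) rather than on the BLAS node alone. The names and identifiers below are hypothetical, and the cache is simplified to an unbounded set of instance identifiers, which a real cache of limited size would not be.

```python
from collections import defaultdict

def aggregate_instance_aware(pending):
    """Group rays by (instance, BLAS node) instead of by BLAS node alone.

    pending: iterable of (ray_id, instance_id, blas_node_id) triples.
    """
    groups = defaultdict(list)
    for ray_id, instance_id, node_id in pending:
        groups[(instance_id, node_id)].append(ray_id)
    return dict(groups)

def count_transform_loads(groups, cached_instances):
    """Count loads from external memory, given a set of cached instance ids."""
    loads = 0
    for instance_id, _node_id in groups:
        if instance_id not in cached_instances:
            cached_instances.add(instance_id)   # first use: load, then keep
            loads += 1
    return loads

# Two instances of the same hypothetical wheel BLAS.
pending = [
    ("r0", "front_left", "wheel_root"),
    ("r1", "front_left", "wheel_root"),
    ("r2", "rear_right", "wheel_root"),
    ("r3", "front_left", "wheel_tyre"),
]
groups = aggregate_instance_aware(pending)
loads = count_transform_loads(groups, cached_instances=set())
```

Rays `r0` and `r2` target the same BLAS node but need different transformations, so they land in different groups. The third group reuses the `front_left` transformation already loaded for the first group, so only two loads occur for three groups.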

[0075] The instance transformation may be reused later when testing other rays against the same node. Alternatively, reuse can occur when testing a given ray against child nodes (and grandchild nodes, etc.) within the hierarchy. As mentioned above, the same instance transformation applies to all nodes in a given instance of a BLAS, and there may be thousands of such nodes; therefore, an instance transformation can be reused many times while traversing the hierarchy of a single BLAS instance.

[0076] Before explaining examples of coherent aggregation systems in detail, it will be helpful to explain examples of the acceleration structures used. Figures 1a and 1b relate to a hierarchy having a bounding volume structure. Figure 1a shows a scene 400 that includes three objects 402, 404, and 406. Figure 1b shows the nodes of the hierarchical acceleration structure, where the root node 410 represents the entire scene 400. The regions in the scene shown in Figure 1a have reference numerals that match those of the corresponding nodes in the hierarchy shown in Figure 1b, but the reference numerals for the regions in Figure 1a include an additional prime symbol ('). The objects in the scene are analyzed to construct the hierarchy, and two nodes 4121 and 4122 are defined within node 410, each defining a region that bounds one or more objects. In this example, nodes in the bounding volume hierarchy represent axis-aligned bounding boxes (AABBs), but in other examples, nodes may represent regions taking other forms, such as spheres or other simple shapes. Node 4121 represents the box 4121' covering objects 404 and 406. Node 4122 represents the box 4122' covering object 402. Node 4121 is subdivided into two nodes 4141 and 4142, which represent the AABBs (4141' and 4142') that bound objects 404 and 406, respectively. Methods for determining the AABBs of the nodes used to construct the hierarchy are known in the art and can be performed in a top-down manner (e.g., starting at the root node and working down the hierarchy) or in a bottom-up manner (e.g., starting at the leaf nodes and working up the hierarchy). In the example shown in Figures 1a and 1b, no object spans more than one leaf node.
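The bottom-up construction mentioned in this paragraph amounts to taking the union of child boxes at each level. The sketch below uses two-dimensional boxes with invented coordinates, since the figures give no numerical values; only the parent and child relationships follow the description of Figure 1b, and the `union` helper is a hypothetical name.

```python
def union(boxes):
    """Smallest axis-aligned box enclosing every (lo, hi) box in the list."""
    los, his = zip(*boxes)
    lo = tuple(min(coords) for coords in zip(*los))
    hi = tuple(max(coords) for coords in zip(*his))
    return lo, hi

# Invented 2D bounds for the three objects of Figure 1a.
box_404 = ((1, 1), (3, 3))      # circle, bounded by node 4141
box_406 = ((4, 1), (6, 3))      # triangle, bounded by node 4142
box_402 = ((8, 5), (10, 7))     # square

# Work upward from the leaves, as in a bottom-up build.
node_4121 = union([box_404, box_406])
node_4122 = union([box_402])
node_410 = union([node_4121, node_4122])
```

Each parent box encloses all of its children by construction, which is the property that lets a missed parent box rule out everything beneath it.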

[0077] The leaf nodes of the hierarchical structure are object primitives. In this example, the objects (circle 404, triangle 406, and square 402) are simple geometric shapes; therefore, each of them can be described using a single primitive. More complex objects can be described by multiple primitives. As is known to those skilled in the art, triangular primitives are common in graphics applications. However, the scope of this disclosure is not limited to triangular primitives. It can be seen from Figures 1a and 1b that a distinction can be made between nodes representing bounding boxes (“box” nodes) and (leaf) nodes representing object primitives (“primitive” nodes).

[0078] In this context, a BLAS consists of leaf nodes and the boxes that describe the hierarchy above them, up to the root node. BLAS nodes are also referred to here as "lower-level nodes," and the root node of a BLAS is called the "root lower-level node." A TLAS references at least one BLAS and typically groups multiple BLAS hierarchies together for traversal. A BLAS can be referenced multiple times within a TLAS structure through different instance transformations. This allows the hierarchy builder to write a BLAS once but reference it multiple times at different orientations / positions without rewriting it, thus saving memory bandwidth and overhead. TLAS nodes are also referred to here as "upper-level nodes."

[0079] Figure 2 shows an example hierarchy using BLAS and TLAS structures.

[0080] The root node (TLAS root box) 210 has two child nodes, namely TLAS boxes 2121 and 2122. TLAS box 2121 has two child nodes, namely TLAS instance format blocks 2141 and 2142. Each instance format block defines a different world-to-instance transformation. BLAS root box 2161 is referenced twice: once by each of the TLAS instance format blocks 2141 and 2142. That is, there are two "instances" of BLAS root box 2161. BLAS root box 2161 has two child nodes, namely BLAS boxes 2181 and 2182. BLAS box 2181 has primitive nodes as its child nodes, namely triangles 201 and 202. Similarly, box 2182 has triangles 203 and 205 as its primitive child nodes. Returning to root node 210, its other child node, TLAS box 2122, has a single child node, instance format block 2143. This instance format block 2143 references BLAS root box 2162. Therefore, BLAS root box 2162 is instantiated only once.
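The instancing relationship described for Figure 2 can be sketched as a small data structure. This is only an illustrative sketch: the class names, the `world_to_instance` placeholder (a bare translation), and the `instance_count` helper are assumptions, not part of the disclosure. The point it shows is that an instance format block holds a reference to a shared BLAS rather than a copy of it.

```python
from dataclasses import dataclass, field


@dataclass
class BlasRoot:
    name: str
    children: list = field(default_factory=list)


@dataclass
class InstanceBlock:
    name: str
    world_to_instance: tuple  # placeholder: a translation (tx, ty, tz)
    blas: BlasRoot            # shared reference, not a copy


blas_2161 = BlasRoot("2161", ["2181", "2182"])
blas_2162 = BlasRoot("2162", ["2183", "2184"])

# Two instance format blocks reference the same BLAS with different transforms.
inst_2141 = InstanceBlock("2141", (-1.0, 0.0, 0.0), blas_2161)
inst_2142 = InstanceBlock("2142", (4.0, 0.0, 2.0), blas_2161)
inst_2143 = InstanceBlock("2143", (0.0, 0.0, 0.0), blas_2162)


def instance_count(blas, instances):
    """Number of instance blocks that reference the given BLAS object."""
    return sum(1 for inst in instances if inst.blas is blas)
```

Because `inst_2141` and `inst_2142` point at the same `BlasRoot` object, the BLAS geometry exists once in memory while appearing twice in the scene.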

[0081] BLAS root box 2162 has two child nodes, BLAS boxes 2183 and 2184. These have procedural primitives 206 and 207 as their respective child nodes. Procedural primitives 206 and 207 have procedurally defined shapes, which allows for greater flexibility. For example, procedural primitives can be used where terrain or wave models can be mathematically represented and directly evaluated, thus avoiding the need to generate large amounts of geometric data. Therefore, it will be understood that the geometric information of a node can include bounding volumes, or it can include descriptions of one or more primitives.

[0082] Figure 3 shows a block diagram of a system 100 for coherent aggregation of rays in a ray tracing system, according to an example. It will be understood that this block diagram shows only part of a ray tracing system, whose other components are outside the scope of this disclosure. System 100 includes a ray storage device (RS) 110; an external memory 112; an acceleration structure cache (ASC) 114; a coherent aggregation unit (CGU) 120; and a box / primitive scheduler unit (BPS) 130. In this example, the ray storage device 110 is local to the ray tracing system, and the memory 112 is external to the ray tracing system. For example, the memory may be on a semiconductor die separate from the ray tracing system. The ray storage device 110 may be on the same semiconductor die as the ray tracing system and, in particular, on the same semiconductor die as the remaining components shown in Figure 3. This makes ray information retrieval faster and easier. BPS 130 includes BPS box unit 131 for scheduling intersection tests of box nodes, and BPS primitive unit 135 for scheduling intersection tests of primitive nodes. BPS primitive unit 135 is configured to communicate with one or more primitive tester units (PTUs) 145 for primitive node intersection tests. BPS box unit 131 is configured to communicate with one or more box tester units (BTUs) 141 for box node intersection tests. BPS 130 is configured to communicate with ray storage device 110 to retrieve ray information. CGU 120 is configured to communicate with ASC 114 to retrieve geometry information and instance transformations from external memory 112 via ASC. ASC 114 is configured to communicate with external memory 112. Typically, geometry information and instance transformations will involve too much data to be stored entirely within the ray tracing system. The CGU 120 provides initial ray IDs that identify the rays to be tested. Ray information for these rays is stored in ray storage device 110.
The CGU 120 performs coherence aggregation, that is, it aggregates rays together for testing against corresponding nodes (specifically, against a given instance of a BLAS node). Any suitable data structure can be used to associate aggregated rays with their respective nodes. In this example, rays are aggregated into groups. Each group contains rays to be tested against the same node. Additionally, a list of groups associated with a specific node can be maintained. A group can contain 8 rays and is the smallest unit that can be scheduled for testing. In other examples, other group sizes (e.g., 1, 4, 6, or 16 rays) can be used. The BPS 130 is configured to communicate with the CGU 120 so that the BPS schedules testing of aggregated ray groups.
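The grouping of rays per node can be sketched as follows. This is a simplified software model, not the hardware structure of the CGU: the class name `CoherenceGatherer`, its methods, and the use of a dictionary keyed by node ID are assumptions made for illustration. The group size of 8 is the one given in the example above.

```python
from collections import defaultdict

GROUP_SIZE = 8  # rays per group in the described example


class CoherenceGatherer:
    def __init__(self):
        # node ID -> list of groups; each group is a list of ray IDs
        self.groups = defaultdict(list)

    def add_ray(self, node_id, ray_id):
        """Queue a ray for testing against a node, opening a new group if needed."""
        group_list = self.groups[node_id]
        if not group_list or len(group_list[-1]) == GROUP_SIZE:
            group_list.append([])
        group_list[-1].append(ray_id)

    def evict(self, node_id):
        """Remove and return every group queued for the node."""
        return self.groups.pop(node_id, [])
```

Adding ten rays for one node produces one full group of 8 and one partial group of 2; evicting the node hands both groups over for scheduling and clears the entry.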

[0083] Figure 4 is a more detailed version of the block diagram of Figure 3. Specifically, Figure 4 shows the elements of an instance transformation cache. The instance transformation cache includes content-addressable memory (CAM) and random access memory (RAM). In this example, the CAM is included in CGU 120. It is provided in two parts: instance CAM 122 and instance CAM 126. The RAM is also provided in two parts: instance RAM 132 is included in BPS box unit 131, and instance RAM 136 is included in BPS primitive unit 135. Instance CAM 122 and instance RAM 132 form an instance transformation cache for box nodes. Instance CAM 126 and instance RAM 136 form an instance transformation cache for primitive nodes. BPS box unit 131 also includes an instance transformation unit (ITU) 133 and a geometry RAM 134. Similarly, BPS primitive unit 135 also includes an ITU 137 and a geometry RAM 138. Each instance RAM 132, 136 contains the transformation coefficients of the instance transformations currently being used for intersection testing. During intersection testing, a box is tested before the primitives below that box in the hierarchy; therefore, it can be beneficial to cache instance transformations separately for boxes and primitives (as done in this example). A given instance transformation will be needed for box testing earlier than it is needed for primitive testing. Similarly, box testing will finish with the instance transformation before primitive testing has finished with it. Each instance CAM 122, 126 serves as an index to the corresponding instance RAM 132, 136. Each ITU 133, 137 receives ray information from ray storage device 110 and instance transformation coefficients from the corresponding instance RAM 132, 136. The ITU uses the transformation coefficients to transform the ray from world space to instance space.
For box nodes, ITU 133 uses the transformation coefficients from instance RAM 132 to transform the ray received from ray storage device 110 to instance space. The transformed rays are provided from ITU 133 to BTU 141. BTU 141 also receives the geometry information of the box node being tested from geometry RAM 134. For primitive nodes, ITU 137 uses transformation coefficients from instance RAM 136 to transform rays received from ray storage device 110 to instance space. The transformed rays are provided by ITU 137 to PTU 145. PTU 145 also receives the geometry information of the primitive node being tested from geometry RAM 138. The geometry information in each geometry RAM 134, 138 is indexed by a geometry ID.
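The world-to-instance transformation applied by an ITU can be sketched as below. The disclosure only speaks of "transformation coefficients"; the 3x4 affine layout (a 3x3 linear part plus a translation column) and the function name `transform_ray` are assumptions chosen for this illustration. The origin is treated as a point and the direction as a vector, so the translation affects only the origin.

```python
def transform_ray(coeffs, origin, direction):
    """Transform a ray from world space to instance space.

    coeffs: three rows of four numbers (assumed 3x4 affine layout).
    origin, direction: 3-tuples in world space.
    """
    def apply(v, w):
        # w = 1 for points (translation applies), w = 0 for directions.
        return tuple(r[0] * v[0] + r[1] * v[1] + r[2] * v[2] + r[3] * w
                     for r in coeffs)

    return apply(origin, 1.0), apply(direction, 0.0)


# Hypothetical coefficients: translate by -2 along x, no rotation or scale.
shift_x = ((1, 0, 0, -2),
           (0, 1, 0, 0),
           (0, 0, 1, 0))
new_origin, new_direction = transform_ray(shift_x, (3, 1, 0), (0, 0, 1))
```

Transforming the ray once and then testing it against many nodes of the same BLAS instance is what makes caching the coefficients worthwhile.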

[0084] In this example, each unit shown in Figure 4 is implemented using fixed-function logic in hardware. This allows each unit to perform its function on a sequential basis while other units continue to perform their functions. This allows for parallel pipeline implementation. The system is designed to manage the data flow through the individual units in order to minimize situations where any unit is overloaded or lacks sufficient data for processing.

[0085] Figure 5 is a flowchart illustrating a method performed by the system of Figures 3 and 4, according to an example. Multiple rays are defined, each with associated ray information, including the ray origin and direction defined in world space. A hierarchical acceleration structure is also defined, comprising multiple upper-level (TLAS) nodes and multiple lower-level (BLAS) nodes. Each node has associated geometric information. As described above, this geometric information is defined in world space for the TLAS nodes and in instance space for the BLAS nodes. For each instance where one of the BLAS nodes is a child node of one of the TLAS nodes, a world-to-instance transformation is defined.

[0086] In step 710, the system stores the ray information in the (internal) ray storage device 110. In step 712, the system stores the geometry information and instance transformations in external memory 112. In step 714, the CGU 120 performs coherent aggregation of multiple rays, where each ray requires intersection testing against the corresponding node of the hierarchical structure. Coherent aggregation can be performed by maintaining, for each node, a list of the rays that need to be tested against that node as the rays traverse the hierarchical acceleration structure (e.g., by forming a list of accumulated ray groups in the CGU). The hierarchical structure can be traversed in any order. Various traversal strategies are known in the art and are outside the scope of this disclosure.

[0087] In step 716, CGU 120 selects one or more cumulative ray groups to form a ray group for testing. Typically, CGU 120 selects a node and then forms a ray group from one or more ray groups associated with that node. In some cases, CGU will form a ray group from all groups associated with the selected node (i.e., the entire list of groups). Typically, instances of TLAS and BLAS nodes will be selected for testing over time. However, for the purposes of this example, we will assume that an instance of a BLAS node is selected. According to this example, when a node is “evicted” from CGU 120, that node is selected for the intersection test. A node can be evicted under any of the following conditions:

[0088] ● When the number of rays aggregated for a node exceeds a first threshold (e.g., the number of groups associated with a node in the list exceeds a threshold);

[0089] ● When the total number of rays in all groups maintained by CGU exceeds the second threshold (to avoid running out of memory in CGU used to store lists);

[0090] ● When the tester units (BTU 141 and / or PTU 145) are idle, indicating that they have available capacity to perform intersection tests (to avoid underutilization of computing resources).
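The three eviction conditions listed above can be combined into a single predicate, sketched here. The function name, parameter names, and the default threshold values are hypothetical; the disclosure does not give concrete numbers. The extra check that the node has at least one ray queued is also an assumption, added so that an idle tester does not trigger eviction of an empty node.

```python
def should_evict(rays_for_node, total_rays, testers_idle,
                 node_threshold=64, total_threshold=1024):
    """Return True if a node should be evicted from the gatherer for testing."""
    if rays_for_node == 0:
        return False  # nothing queued for this node (assumed guard)
    return (rays_for_node > node_threshold   # first threshold: per-node rays
            or total_rays > total_threshold  # second threshold: total rays held
            or testers_idle)                 # tester units have spare capacity
```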

[0091] In step 718, CGU 120 retrieves the geometry information of the BLAS node selected for testing. This involves CGU 120 requesting the geometry information from ASC 114. ASC 114 is the local memory of the ray tracing system, used to cache geometry information and instance transformations that would otherwise need to be read from external memory 112. When CGU 120 requests geometry information, ASC 114 checks whether the geometry information already exists in the cache. If it exists, ASC provides it to CGU 120 without needing to read it from external memory 112. If it does not exist, ASC 114 reads it from external memory 112 before providing it to CGU 120. In this way, ASC 114 acts as an intermediary between CGU 120 and external memory 112. The purpose is to reduce the required memory bandwidth by reducing the number of repeated reads from external memory.
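The read-through behaviour described for ASC 114 can be sketched as follows. This is a simplified model under stated assumptions: external memory is represented by a plain dictionary, the class and attribute names are invented for the example, and no capacity limit or replacement policy is modelled.

```python
class AccelerationStructureCache:
    def __init__(self, external_memory):
        self.external = external_memory  # dict: address -> data (stand-in for memory 112)
        self.lines = {}                  # data currently held in the cache
        self.external_reads = 0          # counts reads that reached external memory

    def read(self, address):
        """Return data for an address, reading external memory only on a miss."""
        if address not in self.lines:
            self.external_reads += 1
            self.lines[address] = self.external[address]
        return self.lines[address]
```

Requesting the same address twice reaches external memory only once, which is the bandwidth saving the paragraph describes.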

[0092] In step 720, CGU 120 searches the instance transformation cache for the instance transformation associated with the currently selected BLAS node instance. This will be described in more detail below. In short, however, CGU 120 searches for the address of the desired instance transformation in the relevant instance CAM 122 or 126. If the node is a box node, CGU searches in instance CAM 122; if the node is a primitive node, CGU searches in instance CAM 126. If the instance transformation is already stored in the instance transformation cache, instance CAM 122 or 126 returns an index indicating the location of the instance transformation coefficient in the corresponding instance RAM 132 or 136. If the instance transformation exists in the cache (see step 722), CGU proceeds to submit the selected ray group for an intersection test (in step 726). If the desired instance transformation does not exist in the cache, CGU 120 retrieves the instance transformation in step 724 and loads it into the cache in step 725. In retrieval step 724, CGU 120 retrieves the instance transformation by requesting an instance transformation from ASC 114. ASC 114 processes the request in essentially the same manner as processing geometry information requests (as described above). If the instance transformation already exists in ASC 114, it is provided to CGU 120 without needing to read anything from external memory 112. If the instance transformation does not exist in ASC 114, ASC reads it from external memory 112 before providing it to CGU 120. In loading step 725, CGU 120 loads the retrieved instance transformation into the instance transformation cache. Specifically, it stores the memory address of the instance transformation in the associated instance CAM 122 or 126, and it stores the transformation coefficients of the instance transformation in the corresponding instance RAM 132 or 136. 
(If the node in question is a box node, the coefficients are stored in instance RAM 132; if the node is a primitive node, the coefficients are stored in instance RAM 136.) In this example, the boxes in the BLAS are traversed first; therefore, the given transformation will first be stored in instance RAM 132. Subsequently, when traversing the first leaf (primitive) node, the same transformation is retrieved from ASC 114 and loaded into instance RAM 136 in preparation for primitive intersection testing.

[0093] In step 726, CGU 120 submits the selected ray group for intersection testing. Specifically, CGU 120 submits the ray group to BPS 130. To do this, CGU 120 passes to BPS 130 one or more packets containing the selected ray group and the geometry information of the selected BLAS node. Depending on whether the node in question is a box node or a primitive node, the geometry information is stored in the geometry RAM 134 of BPS box unit 131 or the geometry RAM 138 of BPS primitive unit 135. In step 729, BPS 130 requests the ray information of the selected one or more ray groups from ray storage device 110. BPS 130 schedules intersection testing on the tester units (BTU 141 and PTU 145). In step 730, the intersection test is performed by the tester units (BTU 141 and PTU 145).

[0094] As described above, when submitting a ray group for testing, the CGU 120 has ensured that the required instance transform coefficients exist in the relevant instance RAM 132 / 136. This means that the required coefficients are locally available with minimal latency and no power consumption or latency involved in external memory read operations. This helps accelerate the process of scheduling and testing ray groups for nodes. It also helps avoid repeated, redundant accesses to external memory to read the same transform coefficients multiple times. Geometry information is also prepared in the relevant geometry RAM 134, 138. Note that performing step 718 (requesting geometry information) before step 720 (searching the instance transform cache) is not required. In some examples, the instance transform cache is searched first (step 720). If the instance transform is in the cache, only the geometry information is retrieved; otherwise, if the instance transform is not in the cache, both the geometry information and the instance transform are retrieved.

[0095] In principle, a geometry CAM could be provided to index the geometry RAM, similar to how an instance CAM is used to index the instance RAM. However, this is not implemented in this example. This is because, in typical scenarios, there are far more nodes than instance transformations—each BLAS root node has one instance transformation, but there will typically be a large number of nodes below that root node. Given a large number of nodes, the probability that the geometry information of a given node will still be in the geometry RAM when it is requested again is relatively low. Therefore, the benefit of caching geometry information in a (relatively small) geometry RAM is limited. ASC already provides relatively fast access to geometry data.

[0096] The BPS unit schedules the intersection test. For this purpose, ITU 133, 137 acquires the ray information provided by the ray storage device 110 and transforms the ray using transformation coefficients read from instance RAMs 132, 136. To perform the intersection test (step 730), the tester units (BTU 141 and PTU 145) acquire the transformed ray provided by ITU 133, 137, acquire the node geometry read from the geometry RAMs 134, 138, and test whether the transformed ray intersects the relevant node. Methods of intersection testing are, in themselves, known to those skilled in the art and are outside the scope of this disclosure.

[0097] The results of the intersection test are returned to BPS 130 and CGU 120. For each ray in the group, the result indicates whether the ray intersects the BLAS node in question. Further processing is performed based on the result. If the BLAS node is a box node and the ray intersects it, the CGU adds the ray to the ray groups maintained by the CGU for the child nodes of the intersected box node. This means the ray is ultimately tested against these child nodes (when the relevant group is selected for testing, e.g., when a child node is evicted from CGU 120). Alternatively, if the BLAS node is a primitive node and the ray intersects it, this fact is recorded (e.g., in the ray storage device), and the system continues traversing the hierarchy. Finally, if necessary, a shader program can be invoked (in step 740) to determine the effect of the intersection on the ray, e.g., whether the ray is reflected, refracted, absorbed, etc., by the object primitive. For example, in the case of reflection or refraction, a new ray can be emitted. In this case, the ray information of the new ray will be written to the ray storage device 110.

[0098] The system operates in this manner until all rays have been tested for all necessary nodes in the hierarchical structure.

[0099] Figure 6 is a more detailed flowchart explaining how geometry information and instance transformations are retrieved, according to an example.

[0100] The CGU tracks the current state of all nodes for which geometric information and (if necessary) instance transformations have been requested from ASC 114. ASC 114 may return data out of order. That is, ASC 114 may return data in an order different from the order in which the data was requested. This can happen, especially because some data already exists in the ASC and can therefore be returned quickly, while other data is not currently stored in the ASC and must be retrieved from external memory 112 before it can be returned. This other data may be returned more slowly.

[0101] Information associated with a ray group (directly or indirectly) includes the instance address, which is the memory address of the instance transform. In this example, the instance address is stored for each node and thus indirectly associated with one or more groups associated with that node. Alternatively, the instance address can be explicitly stored for each group, i.e., directly associated with the group. The requester module 306 checks instance CAM 122 / 126 to determine if the instance address is associated with a transform ID; in other words, to determine if the instance transform is already stored in the instance transform cache. If the instance address is not associated with a transform ID (i.e., the instance transform does not exist in the cache), the requester module 306 allocates a new transform ID and updates the CAM entry for that transform ID with the instance address. (If no transform ID is freely available, the system must stall at that point and wait until a transform ID becomes available.) The requester module 306 then requests instance transform coefficients from ASC 114. It sets a flag associated with the transform ID in the "Requested Transform List" 312. The flag in the Requested Transform List 312 indicates that transform coefficients have been requested from ASC 114 but have not yet been returned. The CGU 120 monitors the requested transformation list 312 to detect when instance transformation coefficients have been returned. This can be done by periodically checking the requested transformation list 312.

[0102] At a later time, ASC 114 returns the requested transform coefficients, which are received by response module 316. Response module 316 stores the transform coefficients in instance RAM 132 / 136. Response module 316 also clears the relevant flags of the requested transform list 312. This indicates that the transform coefficients have been returned and that intersection tests of the node and one or more ray groups (as well as any other nodes that may have been queued and also depend on the instance transform) can now be performed. Requester module 306 also requests geometry information from ASC 114. This is returned by ASC 114 to response module 316, which then writes it to geometry RAM 134 / 138. Another process (not shown) tracks when the geometry information has been returned.

[0103] When the required instance transformation and geometry data are available, CGU 120 releases packets to BPS unit 130. That is, in response to detecting that ASC 114 has returned the instance transformation and geometry information, CGU proceeds to submit one or more packets (and associated nodes) to BPS 130 for testing. As mentioned above, this does not need to happen in the same order in which the data was requested. By tracking data availability and releasing packets when data is available (regardless of the order in which they were requested), the system helps maximize the utilization of BPS 130 and tester units 141, 145.
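The out-of-order release behaviour can be sketched with a small tracker. This is an illustrative software model: the class `ReleaseTracker`, its method names, and the use of sets in place of the flags in the requested transform list are assumptions. A node is released as soon as both its instance transformation and its geometry have been returned, whatever the order of the requests.

```python
class ReleaseTracker:
    def __init__(self):
        self.requested_transforms = set()  # transform IDs requested, not yet returned
        self.requested_geometry = set()    # geometry IDs requested, not yet returned
        self.waiting = []                  # (node, transform_id, geometry_id)

    def request(self, node, transform_id, geometry_id, need_transform):
        """Record outstanding requests for a node."""
        if need_transform:
            self.requested_transforms.add(transform_id)
        self.requested_geometry.add(geometry_id)
        self.waiting.append((node, transform_id, geometry_id))

    def on_transform(self, transform_id):
        """Called when transform coefficients come back; returns released nodes."""
        self.requested_transforms.discard(transform_id)
        return self._release()

    def on_geometry(self, geometry_id):
        """Called when geometry data comes back; returns released nodes."""
        self.requested_geometry.discard(geometry_id)
        return self._release()

    def _release(self):
        ready = [w for w in self.waiting
                 if w[1] not in self.requested_transforms
                 and w[2] not in self.requested_geometry]
        self.waiting = [w for w in self.waiting if w not in ready]
        return [w[0] for w in ready]
```

In this sketch, a node requested later can be released earlier if its data happens to return first, while a second node that depends on an already requested transform simply waits for the same flag to clear.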

[0104] The process flowcharts of Figures 7a and 7b and the data structures shown in Figure 8 illustrate the caching of instance transformations. In step 606, the node address and instance address are read. In step 608, the requester module 306 checks whether the instance address is a special instance address. In this example, the hexadecimal zero address “h0” is used as the special address. The special instance address “h0” indicates that the node is a TLAS node without associated instance data; therefore, there is no need to query instance CAM 122 / 126. In this case, the process proceeds to step 610, where the corresponding special transformation ID is assigned, and then to step 612, where only geometry data is requested from ASC 114. In this example, hexadecimal zero “h0” is used as the special transformation ID used by all TLAS nodes. The instance CAM entry for transformation ID h0 always contains instance address h0, and the transformation coefficients at address h0 in instance RAM 132 / 136 are always those of the identity (or null) transformation matrix. If it is determined in step 608 that the instance address is not “h0”, the method proceeds to step 614, where the requester module 306 checks instance CAM 122 / 126 using the instance address. If a cache hit occurs (that is, if the instance address exists in the instance CAM, either instance CAM 122 or instance CAM 126 depending on the node type), then instance CAM 122 or 126 returns a transformation ID that indicates the slot of instance RAM 132 / 136 that stores the transformation coefficients.

[0105] The method then proceeds to step 616. Here, the reference counter associated with the transformation ID returned by instance CAM 122 / 126 is incremented. This reference counter is called "InFlightCount".

[0106] The reference counter records the number of nodes that are currently "in flight" (i.e., currently undergoing an intersection test) and depend on that instance transformation. From step 616, the method proceeds to step 612, where only geometry data is requested from ASC 114.

[0107] If it is determined in step 614 that the instance address does not exist in instance CAM 122 / 126 (i.e., if there is a cache miss), the method proceeds to step 618. Here, a new transform ID is allocated by the requester module 306, if a transform ID is available; if none is available, the node stalls at this point. Next, in step 620, the requester module 306 writes the instance address of the instance transform to instance CAM 122 / 126 in the slot corresponding to the newly allocated transform ID. The reference counter "InFlightCount" for this transform ID is incremented (in step 621) to indicate that a node currently being tested (and one or more associated ray groups) is using this instance transform. Finally, in step 622, the requester module 306 requests geometry data and instance transform coefficients from ASC 114.
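The decision flow of steps 608 to 622 can be sketched in software as follows. This is a minimal model, assuming a list of slots stands in for the instance CAM; the names `Slot`, `make_cam`, and `request_node`, and the returned strings, are invented for the illustration. For brevity, a miss here only allocates slots that have never been filled; reallocation of valid but unused slots is left out.

```python
from dataclasses import dataclass

SPECIAL = 0x0  # "h0": special instance address and special transform ID


@dataclass
class Slot:
    instance_address: int = 0
    in_flight_count: int = 0
    valid: bool = False


def make_cam(num_slots):
    """Create a CAM whose slot 0 is reserved for the special address h0."""
    cam = [Slot() for _ in range(num_slots)]
    cam[0] = Slot(instance_address=SPECIAL, in_flight_count=0, valid=True)
    return cam


def request_node(cam, instance_address):
    """Return (transform_id, what_to_request), or (None, 'stall') if no ID is free."""
    if instance_address == SPECIAL:                    # steps 608 and 610
        return 0, "geometry"                           # step 612
    for tid, slot in enumerate(cam):                   # step 614: CAM search
        if slot.valid and slot.instance_address == instance_address:
            slot.in_flight_count += 1                  # step 616: cache hit
            return tid, "geometry"                     # step 612
    for tid, slot in enumerate(cam):                   # step 618: cache miss
        if not slot.valid:
            slot.instance_address = instance_address   # step 620
            slot.valid = True
            slot.in_flight_count = 1                   # step 621
            return tid, "geometry+transform"           # step 622
    return None, "stall"
```

A second request for the same instance address returns the same transform ID and asks only for geometry, which is the saving the cache is meant to provide.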

[0108] Figure 8 shows the data structures used in instance CAMs 122, 126 and instance RAMs 132, 136 in this example. Each instance CAM has S slots 801_0 to 801_{S-1} for transformation IDs. The slots are indexed by the transformation ID. Separate ranges of transformation IDs are used in the two CAMs 122 and 126. Each slot stores an instance address and two additional data fields. The first is a reference counter "InFlightCount" associated with the instance transformation, and the second is a "valid" flag indicating whether the transformation ID is currently valid.

[0109] When instance CAMs 122 and 126 are first initialized, the "valid" bit for transform ID = 0 is set to 1, its instance address is set to "h0", and its "InFlightCount" is set to 0. All other "valid" bits are set to 0, indicating that the corresponding transform IDs are invalid and unused. As instance CAMs 122 and 126 are filled with instance addresses, the corresponding "valid" flag bits are set to 1, indicating that the corresponding transforms are valid. By maintaining both the flag bits and the reference counters, the system can distinguish between slots in the instance transform cache that are (so far) empty (valid = 0) and slots that contain data (valid = 1) but whose data is not currently in use (counter = 0). This allows the system to prioritize the allocation of transform IDs corresponding to slots that have not yet been used. Only when all slots are "valid" will the system reallocate valid transform IDs that are not currently used by in-flight nodes. This helps to keep instance transforms in the instance transform cache for as long as possible, thereby increasing the likelihood of cache hits and thus reducing unnecessary accesses to ASC 114 and / or external memory 112.

[0110] Instance RAMs 132 and 136 have the same number of slots, 802_0 to 802_{S-1}, as the corresponding instance CAMs 122 and 126, and they are similarly indexed by transform ID. Each slot stores the transform coefficients of the world-to-instance transform associated with the corresponding transform ID. The entries in the CAM are organized in the same order as the entries in the RAM. Thus, for example, if the address of a particular instance transform is stored in the 5th entry (transform ID=4) in the CAM, then the transform coefficients of that instance transform are stored in the 5th entry (transform ID=4) in the RAM.

[0111] In this context, separating the cache into CAM and RAM helps make it more efficient than a conventional cache. With a traditional associative cache, data (i.e., transformation coefficients) is stored within the cache itself, associated with an instance address. In the event of a cache hit, when the cache is queried at that address, the data is returned by that cache and stored in another memory from which the test unit will access that data.

[0112] By using a CAM+RAM layout, the cache does not need to be queried when the tester is performing intersection tests. The system ensures that all the transform data required by the tester is present in the instance transform RAM via a reference counter. The BPS simply has an index (transformation ID), and it can schedule tests by directly accessing RAM without querying the CAM, and no additional storage is required between the cache and the tester.
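The split between an associative lookup and a directly indexed read can be sketched as two parallel arrays. This is only a software analogy: the class `InstanceTransformCache`, its method names, and the 3x4 coefficient layout are assumptions. The scheduling side searches the CAM by instance address to obtain a transform ID, while the tester side reads the RAM by that ID without any search.

```python
IDENTITY = ((1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0))  # assumed 3x4 layout


class InstanceTransformCache:
    def __init__(self, num_slots):
        self.cam = [None] * num_slots  # transform ID -> instance address
        self.ram = [None] * num_slots  # transform ID -> transform coefficients
        self.cam[0], self.ram[0] = 0x0, IDENTITY  # special ID h0: identity

    def lookup(self, instance_address):
        """Associative search (scheduling side): address -> transform ID or None."""
        for tid, addr in enumerate(self.cam):
            if addr == instance_address:
                return tid
        return None

    def load(self, tid, instance_address, coeffs):
        """Fill the CAM and RAM entries that share one transform ID."""
        self.cam[tid] = instance_address
        self.ram[tid] = coeffs

    def coefficients(self, tid):
        """Direct indexed read (tester side): no CAM search is involved."""
        return self.ram[tid]
```

Once a transform ID has been attached to a node, the only operation needed at test time is `coefficients(tid)`, which mirrors the point that the BPS can schedule tests with just an index.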

[0113] Figure 7b shows the remainder of the process flowchart of Figure 7a. When the geometry ID is deallocated in step 630, this indicates that an intersection test has been completed for the node. Therefore, the reference counter (“InFlightCount”) for the corresponding instance transform (identified by the transform ID) is decremented by one. This indicates that one fewer node (and associated ray group) is currently using the instance transform. In step 632, the requester module 306 checks whether the decremented reference counter for the transform ID is now equal to zero. If so, this indicates that no in-flight node is using the instance transform. Therefore, if the requester module 306 needs to assign a new transform ID and there is no free transform ID, the transform ID can be reassigned (in step 634). On the other hand, if it is determined in step 632 that the decremented reference counter is not equal to zero, this indicates that the transform ID is still in use (see step 636) and cannot be reassigned yet.

[0114] When there are no free transform IDs, the requester module 306 must wait before allocating a transform ID until one becomes available (i.e., until one of the reference counters has been reduced to zero, so that no in-flight node is using the corresponding transform ID).
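The allocation priority and the release step can be sketched together. This is an illustrative model only: the `Slot` class and the `allocate` and `release` functions are invented names, and excluding slot 0 from reallocation is an assumption based on the statement that transform ID 0 always holds the special address h0.

```python
from dataclasses import dataclass


@dataclass
class Slot:
    instance_address: int = 0
    in_flight_count: int = 0
    valid: bool = False


def allocate(cam):
    """Pick a transform ID for a new instance transform, or None to stall."""
    # First preference: a slot that has never been filled (valid == False).
    for tid, slot in enumerate(cam):
        if not slot.valid:
            return tid
    # Otherwise: a valid slot that no in-flight node depends on.
    for tid, slot in enumerate(cam):
        if tid != 0 and slot.in_flight_count == 0:  # slot 0 assumed reserved
            return tid
    return None  # every transform ID is in use: the requester must wait


def release(cam, tid):
    """Step 630: a node finished testing. True if the ID may now be reassigned."""
    cam[tid].in_flight_count -= 1
    return cam[tid].in_flight_count == 0
```

Preferring never-filled slots keeps previously loaded transforms resident for longer, which raises the chance that a later request for the same instance address still hits.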

[0115] The coherence aggregation system according to this disclosure can be provided as part of a ray tracing system. The ray tracing system may include one or more systems for coherence aggregation, one or more tester units for intersection testing, and may implement one or more shader programs. The ray tracing system may be provided as part of a graphics processing system.

[0116] It should be understood that the scope of this disclosure is not limited to the examples described above. Various possible modifications will be apparent to those skilled in the art. For example, the example of Figure 4 uses separate instance CAMs 122, 126 and instance RAMs 132, 136 for box nodes and primitive nodes, respectively. In other implementations, there may be only a single instance RAM and a single instance CAM, which are used to store the instance transformations of both box nodes and primitive nodes. In other examples, there may be more than two CAMs and more than two RAMs.

[0117] Figure 9 shows a computer system in which such a graphics processing system can be implemented. The computer system includes a CPU 902, a GPU 904, memory 906, and other devices 914, such as a display 916, a speaker 918, and a camera 919. A processing block 910 (corresponding to the coherence aggregation system 100) is implemented on the GPU 904. In other examples, the processing block 910 may be implemented on the CPU 902. The components of the computer system can communicate with each other via a communication bus 920. A storage device 912 (corresponding to memory 112) is implemented as part of memory 906.

[0118] The coherent aggregation system of Figures 3 to 4 is shown as comprising a number of functional blocks. This is merely illustrative and is not intended to define a strict division between the different logical elements of such an entity. Each functional block may be provided in any suitable manner. It should be understood that the intermediate values described herein as being formed by the coherent aggregation system do not need to be physically generated by the coherent aggregation system at any point in time, and may merely represent logical values that conveniently describe the processing performed by the coherent aggregation system between its inputs and outputs.

[0119] The coherent aggregation systems (and ray tracing systems and / or graphics processing systems incorporating them) described herein can be embodied in hardware on integrated circuits. The systems described herein can be configured to perform any of the methods described herein. Generally, any of the functions, methods, techniques, or components described above can be implemented in software, firmware, hardware (e.g., fixed logic circuitry systems), or any combination thereof. The terms “module,” “function,” “component,” “element,” “cell,” “block,” and “logic” are used herein to generally denote software, firmware, hardware, or any combination thereof. In the case of a software implementation, a module, function, component, element, cell, block, or logic represents program code that, when executed on a processor, performs a specified task. The algorithms and methods described herein can be executed by one or more processors executing code that causes the processor to execute the algorithm / method. Examples of computer-readable storage media include random access memory (RAM), read-only memory (ROM), optical disk, flash memory, hard disk storage, and other memory devices that can use magnetic, optical, and other techniques to store instructions or other data and can be accessed by a machine.

[0120] As used herein, the terms computer program code and computer-readable instructions refer to any kind of executable code for processor execution, including code expressed in machine language, interpreted language, or scripting language. Executable code includes binary code, machine code, bytecode, code defining integrated circuits (e.g., hardware description languages or netlists), and code expressed in, for example, C++. Executable code can be expressed in programming languages such as OpenCL. Executable code can be, for example, any kind of software, firmware, script, module, or library that, when properly executed, processed, interpreted, compiled, or run in a virtual machine or other software environment, causes the processor of a computer system that supports the executable code to perform the tasks specified by the code.

[0121] A processor, computer, or computer system can be any kind of device, machine, or special-purpose circuit, or a collection or part thereof, that has the processing power to execute instructions. A processor can be any kind of general-purpose or special-purpose processor, such as a CPU, GPU, neural network accelerator (NNA), system-on-a-chip, state machine, media processor, application-specific integrated circuit (ASIC), programmable logic array, field-programmable gate array (FPGA), etc. A computer or computer system may include one or more processors.

[0122] This invention also intends to cover software defining the configuration of hardware as described herein, such as hardware description language (HDL) software, for designing integrated circuits or for configuring programmable chips to perform desired functions. That is, a computer-readable storage medium on which computer-readable program code in the form of an integrated circuit definition dataset is encoded may be provided, which, when processed (i.e., run) in an integrated circuit manufacturing system, configures the system to manufacture a coherent aggregation system (or ray tracing system or graphics processing system) configured to perform any of the methods described herein, or to manufacture a coherent aggregation system (or ray tracing system or graphics processing system) including any of the means described herein. The integrated circuit definition dataset may be, for example, an integrated circuit description.

[0123] Therefore, a method for manufacturing a coherent aggregation system (or ray tracing system or graphics processing system) as described herein can be provided in an integrated circuit manufacturing system. Furthermore, an integrated circuit definition dataset can be provided, which, when processed in an integrated circuit manufacturing system, enables the method for manufacturing the coherent aggregation system (or ray tracing system or graphics processing system) to be executed.

[0124] Integrated circuit definition datasets can be in the form of computer code, such as netlists, code for configuring programmable chips, or hardware description languages that define hardware suitable for manufacture in an integrated circuit at any level, including register-transfer level (RTL) code, high-level circuit representations (such as Verilog or VHDL), and low-level circuit representations (such as OASIS(RTM) and GDSII). Higher-level representations (such as RTL) that logically define hardware suitable for manufacture in an integrated circuit can be processed on a computer system configured to generate a manufacturing definition of the integrated circuit within the context of a software environment that includes definitions of circuit elements and rules for combining these elements, so as to generate the manufacturing definition of the integrated circuit defined by the representation. As is typically the case where software executes at a computer system to define a machine, one or more intermediate user steps (e.g., providing commands, variables, etc.) may be required in order for a computer system configured to generate a manufacturing definition of an integrated circuit to execute the code defining the integrated circuit and thereby generate that manufacturing definition.

[0125] An example of processing an integrated circuit definition dataset at an integrated circuit manufacturing system, in order to configure the system to manufacture a coherent aggregation system (or a ray tracing system or a graphics processing system), will now be described with reference to Figure 10.

[0126] Figure 10 shows an example of an integrated circuit (IC) manufacturing system 1002 configured to manufacture a coherent aggregation system (or ray tracing system or graphics processing system) as described in any example herein. Specifically, the IC manufacturing system 1002 includes a layout processing system 1004 and an integrated circuit generation system 1006. The IC manufacturing system 1002 is configured to receive an IC definition dataset (e.g., defining a coherent aggregation system as described in any example herein), process the IC definition dataset, and generate an IC based on the IC definition dataset (e.g., which embodies a coherent aggregation system as described in any example herein). Through the processing of the IC definition dataset, the IC manufacturing system 1002 is configured to manufacture integrated circuits embodying a coherent aggregation system (or ray tracing system or graphics processing system) as described in any example herein.

[0127] The layout processing system 1004 is configured to receive and process an IC definition dataset to determine a circuit layout. Methods for determining a circuit layout based on an IC definition dataset are known in the art and may involve, for example, synthesizing RTL code to determine the gate-level representation of the circuit to be generated, for example, in relation to logic components (e.g., NAND, NOR, AND, OR, MUX, and FLIP-FLOP components). By determining the location information of the logic components, the circuit layout can be determined based on the gate-level representation of the circuit. This can be done automatically or with user intervention to optimize the circuit layout. Once the layout processing system 1004 has determined the circuit layout, it can output the circuit layout definition to the IC generation system 1006. The circuit layout definition may be, for example, a circuit layout description.

[0128] As is known in the art, IC generation system 1006 generates ICs according to a circuit layout definition. For example, IC generation system 1006 may implement a semiconductor device manufacturing process for generating ICs, which may involve a multi-step sequence of photolithography and chemical processing steps, during which electronic circuits are gradually formed on a wafer made of semiconductor material. The circuit layout definition may be in the form of a mask, which can be used in the photolithography process to generate ICs according to the circuit definition. Alternatively, the circuit layout definition provided to IC generation system 1006 may be in the form of computer-readable code, which IC generation system 1006 can use to form a suitable mask for generating ICs.

[0129] The various processes performed by the IC manufacturing system 1002 can all be implemented in one location, for example, by one party. Alternatively, the IC manufacturing system 1002 can be a distributed system, allowing some processes to be performed in different locations and by different parties. For example, some of the following stages can be performed in different locations and / or by different parties: (i) synthesizing RTL code representing an IC definition dataset to form a gate-level representation of the circuit to be generated; (ii) generating a circuit layout based on the gate-level representation; (iii) forming a mask based on the circuit layout; and (iv) using the mask to manufacture the integrated circuit.

[0130] In other examples, processing of an integrated circuit definition dataset in an integrated circuit manufacturing system can configure the system to manufacture a coherent aggregation system (or ray tracing system or graphics processing system) without processing the IC definition dataset to determine circuit layout. For example, an integrated circuit definition dataset can define the configuration of a reconfigurable processor such as an FPGA, and processing of that dataset can configure the IC manufacturing system (e.g., by loading configuration data into the FPGA) to generate a reconfigurable processor with that defined configuration.

[0131] In some implementations, an integrated circuit manufacturing definition dataset, when processed in an integrated circuit manufacturing system, can cause the integrated circuit manufacturing system to generate devices as described herein. For example, configuring an integrated circuit manufacturing system by means of an integrated circuit manufacturing definition dataset, in the manner described above with reference to Figure 10, can cause devices as described herein to be manufactured.

[0132] In some examples, an integrated circuit definition dataset may include software that runs on hardware defined at the dataset, or in combination with hardware defined at the dataset. In the example shown in Figure 10, the IC generation system can be further configured by the integrated circuit definition dataset to load firmware onto the integrated circuit, according to the program code defined in the integrated circuit definition dataset, during the manufacture of the integrated circuit, or to otherwise provide the integrated circuit with program code for use with it.

[0133] Compared to known implementations, the implementation of the concepts set forth in this application in devices, apparatuses, modules, and / or systems (and in the methods implemented herein) can lead to performance improvements. Performance improvements may include one or more of increased computational performance, reduced latency, increased throughput, and / or reduced power consumption. During the manufacture of such devices, apparatuses, modules, and systems (e.g., in integrated circuits), trade-offs can be made between performance improvements and physical implementations, thereby improving manufacturing methods. For example, a trade-off can be made between performance improvements and layout area to match the performance of known implementations but using less silicon. This can be accomplished, for example, by reusing functional blocks serially or sharing functional blocks among elements of the device, apparatus, module, and / or system. Conversely, the concepts set forth in this application that lead to improvements in the physical implementation of devices, apparatuses, modules, and systems (such as reduced silicon area) can be traded off against performance improvements. This can be accomplished, for example, by manufacturing multiple instances of the module within a predefined area budget.

[0134] The applicant has independently disclosed each individual feature described herein, as well as any combination of two or more such features, to the extent that such features or combinations can be implemented based on the specification as a whole, in accordance with the common knowledge of those skilled in the art, regardless of whether such features or combinations of features solve any problem disclosed herein. In view of the foregoing description, those skilled in the art will understand that various modifications can be made within the scope of this invention.

Claims

1. A method for coherently focusing rays in a ray tracing system, the method comprising:
defining multiple rays, each ray having associated ray information defined in a first coordinate system; and
defining a hierarchical acceleration structure comprising multiple nodes, including upper-level nodes and lower-level nodes, each node in the acceleration structure having associated geometric structure information, wherein the geometric structure information of the upper-level nodes is defined in the first coordinate system, and the geometric structure information of each of the lower-level nodes is defined in a second coordinate system different from the first coordinate system, and wherein the lower-level nodes are instantiated as one or more instances within the acceleration structure, each instance being associated with an instance transformation that specifies the relationship between the first coordinate system and a corresponding second coordinate system for that instance;
the method further comprising:
storing (712) the geometric structure information and the instance transformations in memory (112);
grouping (714) the rays into multiple ray groups, wherein each group requires an intersection test against a corresponding node in the hierarchical acceleration structure;
selecting (716) one of the groups for the intersection test, wherein the corresponding node to be tested is an instance of a lower-level node;
searching (720) for the instance transformation of the instance in an instance transformation cache (122, 126, 132, 136);
if the instance transformation is found in the instance transformation cache, submitting (726) the selected ray group for the intersection test; and
if the instance transformation is not found in the instance transformation cache, retrieving (724) the instance transformation, loading (725) it into the instance transformation cache, and submitting (726) the selected ray group for the intersection test;
wherein the instance transformation cache includes a first instance content addressable memory (CAM) (122) and a first instance random access memory (RAM) (132), as well as a second instance CAM (126) and a second instance RAM (136);
wherein the first instance CAM (122) is configured to store the memory address of each of a plurality of instance transformations of bounding box nodes, and the first instance RAM (132) is configured to store a set of transformation coefficients of each of the plurality of instance transformations of bounding box nodes; and
wherein the second instance CAM (126) is configured to store the memory address of each of a plurality of instance transformations of primitive nodes, and the second instance RAM (136) is configured to store a set of transformation coefficients of each of the plurality of instance transformations of primitive nodes.
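
The search, retrieve, load, and submit steps (720 to 726) of claim 1 can be sketched in Python as follows. This is an illustrative software analogue only; the function name `submit_group`, the dictionary-based cache, and the `fetch` and `test` callbacks are hypothetical stand-ins for the cache, memory, and tester units.

```python
from typing import Callable, Dict, List, Tuple

# Placeholder for a set of transformation coefficients (layout is assumed).
Transform = Tuple[float, ...]


def submit_group(
    instance_address: int,
    rays: List[int],
    cache: Dict[int, Transform],
    fetch: Callable[[int], Transform],
    test: Callable[[List[int], Transform], None],
) -> bool:
    """Submit a ray group for intersection testing; return True on a cache hit."""
    hit = instance_address in cache           # search the cache (720)
    if not hit:
        transform = fetch(instance_address)   # retrieve the transform (724)
        cache[instance_address] = transform   # load it into the cache (725)
    test(rays, cache[instance_address])       # submit the group (726)
    return hit
```

A second group that references the same instance then finds the transformation already cached and skips the fetch, which is the memory-access saving the claim is directed at.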

2. The method according to claim 1, wherein: (A) Retrieving the instance transformation includes requesting (724) the instance transformation from the acceleration structure cache (114); and / or (B) The method further includes retrieving geometric information of the selected lower-level nodes from the cache before submitting the selected ray group for the intersection test.

3. The method of claim 2, wherein retrieving the geometry information includes requesting (718) the geometry information from the acceleration structure cache (114).

4. The method of claim 2, wherein the acceleration structure cache (114) retrieves from the memory (112) any requested geometry information and / or instance transformations not yet stored in the acceleration structure cache (114), and returns the requested geometry information and / or instance transformations.

5. The method according to any one of the preceding claims, the method comprising selecting (716) the ray group for the intersection test based on one or more of the following criteria: the number of rays detected in the group exceeds a first predetermined threshold; the total number of rays detected in all said groups exceeds a second predetermined threshold; and the computational resources used to perform the intersection test are found to be underutilized.
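
The three selection criteria of claim 5 can be sketched as a single predicate. The function below is a hypothetical illustration: the name `select_group`, the parameter names, and the policy of always picking the fullest group are assumptions, since the claim does not say which group is chosen once a criterion is met.

```python
from typing import Dict, List, Optional


def select_group(
    groups: Dict[int, List[int]],
    group_threshold: int,
    total_threshold: int,
    testers_underutilized: bool,
) -> Optional[int]:
    """Return the node ID of the group to test next, or None to keep gathering.

    `groups` maps a node ID to the rays waiting to be tested against it.
    """
    if not groups:
        return None
    # Assumption: the fullest group is the candidate for submission.
    fullest = max(groups, key=lambda node: len(groups[node]))
    if not groups[fullest]:
        return None
    total = sum(len(rays) for rays in groups.values())
    if (
        len(groups[fullest]) > group_threshold   # first criterion
        or total > total_threshold               # second criterion
        or testers_underutilized                 # third criterion
    ):
        return fullest
    return None
```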

6. The method of claim 1, wherein retrieving the instance transformation when it is not found in the instance transformation cache comprises: requesting (724) the instance transformation; monitoring whether the instance transformation has been returned; and, after detecting that the instance transformation has been returned, proceeding to submit (726) the selected ray group for the intersection test.

7. The method of claim 6, wherein monitoring whether the instance transformation has been returned comprises: setting a flag bit (312) associated with the instance transformation when requesting the instance transformation; and clearing the flag bit (312) upon receiving the returned instance transformation.
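
The flag-based monitoring of claims 6 and 7 might look like the following in software. The class `PendingTransformTracker` and the idea of parking waiting ray groups per cache slot are illustrative assumptions; only the set-on-request and clear-on-return behaviour of the flag bit comes from the claims.

```python
from typing import Dict, List


class PendingTransformTracker:
    """Tracks outstanding instance-transform requests with one flag per slot."""

    def __init__(self, num_slots: int) -> None:
        # pending[slot] models the flag bit (312) for that cache slot.
        self.pending: List[bool] = [False] * num_slots
        # Ray groups parked until the transform for a slot comes back.
        self.waiting: Dict[int, List[int]] = {}

    def request(self, slot: int, group_id: int) -> bool:
        """Park `group_id`; return True only if a new request must be issued."""
        must_issue = not self.pending[slot]
        self.pending[slot] = True  # set the flag when requesting
        self.waiting.setdefault(slot, []).append(group_id)
        return must_issue

    def on_return(self, slot: int) -> List[int]:
        """Clear the flag and return the groups that can now be submitted."""
        self.pending[slot] = False
        return self.waiting.pop(slot, [])
```

Returning False from `request` when the flag is already set avoids issuing a duplicate fetch for a transformation that is already on its way.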

8. A system (100) for coherently focusing rays in a ray tracing system, the system comprising:
a ray storage device (110) configured to store ray information of a plurality of rays, wherein the ray information of each ray defines the ray in a first coordinate system;
a memory (112) configured to store geometric information associated with each of a plurality of nodes in a hierarchical acceleration structure, the nodes including upper-level nodes and lower-level nodes, wherein the geometric information of the upper-level nodes is defined in the first coordinate system, and the geometric information of each of the lower-level nodes is defined in a second coordinate system different from the first coordinate system, wherein the lower-level nodes are instantiated within the acceleration structure as one or more instances, each instance being associated with an instance transformation specifying the relationship between the first coordinate system and a corresponding second coordinate system for that instance, the memory also being configured to store the instance transformations;
an instance transformation cache (122, 126, 132, 136) configured to temporarily store instance transformations; and
a coherence aggregation unit (120) configured to:
group (714) the rays into multiple ray groups, wherein each group requires an intersection test against a corresponding node in the hierarchical acceleration structure;
select (716) one of the groups for the intersection test, wherein the corresponding node to be tested is an instance of a lower-level node;
search (720) the instance transformation cache (122, 126, 132, 136) for the instance transformation of the lower-level node;
if the instance transformation is found in the instance transformation cache, submit (726) the selected ray group for the intersection test; and
if the instance transformation is not found in the instance transformation cache, retrieve (724) the instance transformation, load (725) it into the instance transformation cache (122, 126, 132, 136), and submit (726) the selected ray group for the intersection test;
wherein the instance transformation cache includes a first instance content addressable memory (CAM) (122) and a first instance random access memory (RAM) (132), as well as a second instance CAM (126) and a second instance RAM (136);
wherein the first instance CAM (122) is configured to store the memory address of each of a plurality of instance transformations of bounding box nodes, and the first instance RAM (132) is configured to store a set of transformation coefficients of each of the plurality of instance transformations of bounding box nodes; and
wherein the second instance CAM (126) is configured to store the memory address of each of a plurality of instance transformations of primitive nodes, and the second instance RAM (136) is configured to store a set of transformation coefficients of each of the plurality of instance transformations of primitive nodes.

9. The system of claim 8, further comprising an instance transformation unit (133, 137) configured to use instance transformation to transform ray information, wherein the coherence aggregation unit (120) is configured to submit the ray and associated instance transformation to the instance transformation unit (133, 137) when submitting a selected ray group for intersection testing.
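
As a rough illustration of what an instance transformation unit does with the transformation coefficients, the Python sketch below applies a transformation to a ray. The coefficient layout is not specified in the text; a 3x4 affine matrix that maps the first coordinate system to the second is assumed here, and the name `transform_ray` is hypothetical.

```python
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def transform_ray(
    matrix: Sequence[Sequence[float]],
    origin: Vec3,
    direction: Vec3,
) -> Tuple[Vec3, Vec3]:
    """Apply an assumed 3x4 affine matrix [A | t] to a ray.

    The origin is a point, so it receives A and the translation t.
    The direction is a vector, so it receives A only.
    The direction is not renormalized in this sketch.
    """

    def linear(v: Vec3) -> Vec3:
        return tuple(
            sum(matrix[r][c] * v[c] for c in range(3)) for r in range(3)
        )

    moved = linear(origin)
    new_origin = tuple(moved[r] + matrix[r][3] for r in range(3))
    return new_origin, linear(direction)
```

Transforming the ray once per group, rather than transforming the geometry of every instance, is what allows the lower-level node to be stored only once in its own coordinate system.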

10. The system of claim 8 or claim 9, further comprising at least one acceleration structure cache (114), the at least one acceleration structure cache being configured to temporarily store at least one of the geometric structure information and the instance transformation, and wherein the coherence aggregation unit (120) is configured to perform one or both of: (A) retrieving the geometric structure information by requesting it from the at least one acceleration structure cache (114); and (B) retrieving the instance transformation by requesting (724) it from the at least one acceleration structure cache (114).

11. The system of claim 10, wherein the acceleration structure cache (114) is configured to retrieve from the memory (112) any requested geometry information and / or instance transformations not yet stored in the acceleration structure cache (114), and to return the requested geometry information and / or instance transformations to the coherence aggregation unit (120).

12. The system according to claim 8, wherein: (A) the coherence aggregation unit (120) is configured to select the ray group for the intersection test based on one or more of the following criteria: the number of rays detected in the group exceeds a first predetermined threshold; the total number of rays detected in all said groups exceeds a second predetermined threshold; and the computational resources used to perform the intersection test are not being fully utilized; and / or (B) when the instance transformation is not found in the instance transformation cache (122, 126, 132, 136), the coherence aggregation unit (120) is configured to retrieve the instance transformation by: requesting the instance transformation; monitoring whether the instance transformation has been returned; and, after detecting that the instance transformation has been returned, proceeding to submit the selected ray group for the intersection test.

13. The system of claim 8, wherein the CAM (122, 126) is configured to store, for each of the plurality of instance transformations, the memory address of the instance transformation at a corresponding index location in the CAM, and the RAM (132, 136) is configured to store, for each of the plurality of instance transformations, the transformation coefficients of the instance transformation at a corresponding index location in the RAM (132, 136), such that, when queried with the memory address of an instance transformation, the CAM (122, 126) returns the index of the location in the RAM (132, 136) where the corresponding transformation coefficients are stored.
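
The paired CAM and RAM indexing of claim 13 can be modelled in a few lines of Python. The class `InstanceCamRam` and its method names are hypothetical, and the linear search merely stands in for a hardware CAM, which would compare all entries in parallel.

```python
from typing import List, Optional, Sequence


class InstanceCamRam:
    """Software model of a CAM and a RAM that share index locations."""

    def __init__(self, entries: int) -> None:
        # cam[i] holds the memory address of the transform stored at index i.
        self.cam: List[Optional[int]] = [None] * entries
        # ram[i] holds the transformation coefficients for the same index.
        self.ram: List[Optional[Sequence[float]]] = [None] * entries

    def store(self, index: int, address: int, coeffs: Sequence[float]) -> None:
        """Write the address and its coefficients at the same index."""
        self.cam[index] = address
        self.ram[index] = coeffs

    def query(self, address: int) -> Optional[int]:
        """Return the index whose CAM entry matches `address`, else None."""
        for index, stored in enumerate(self.cam):
            if stored == address:
                return index
        return None

    def coefficients(self, address: int) -> Optional[Sequence[float]]:
        """Look up coefficients by address via the index the CAM returns."""
        index = self.query(address)
        return None if index is None else self.ram[index]
```

Because the CAM returns a small index rather than the coefficients themselves, downstream units can refer to a transformation by that index instead of carrying the full set of coefficients with every ray group.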

14. The system of claim 8, 9, 12 or 13, wherein the CAM is configured to store a reference counter for each instance transform, the reference counter recording the number of ray groups currently being tested that reference the instance transform, and / or wherein the CAM is configured to store a validity flag for each instance transform in the instance transform cache, the validity flag indicating whether the instance transform is currently valid.

15. A graphics processing system comprising the system according to any one of claims 8 to 14 or configured to perform the method according to any one of claims 1 to 7.

16. A method of manufacturing the system of any one of claims 8 to 14 or the graphics processing system of claim 15 using an integrated circuit manufacturing system, the method comprising: processing, using a layout processing system, a computer-readable description of the system or graphics processing system to produce a circuit layout description of an integrated circuit embodying the system or graphics processing system; and manufacturing, using an integrated circuit manufacturing system, the system or graphics processing system according to the circuit layout description.

17. A computer-readable storage medium having computer-readable code encoded thereon, the computer-readable code being configured to cause the method of any one of claims 1 to 7 to be performed when the code is executed.

18. A non-transitory computer-readable storage medium storing a computer-readable description of a system according to any one of claims 8 to 14 or a graphics processing system according to claim 15, wherein the computer-readable description, when processed in an integrated circuit manufacturing system, causes the integrated circuit manufacturing system to: process, using a layout processing system, the computer-readable description of the system or the graphics processing system to produce a circuit layout description of an integrated circuit embodying the system or the graphics processing system; and manufacture, using an integrated circuit manufacturing system, the system or the graphics processing system according to the circuit layout description.

19. An integrated circuit manufacturing system, the integrated circuit manufacturing system comprising: a non-transitory computer-readable storage medium storing a computer-readable description of the system according to any one of claims 8 to 14 or the graphics processing system according to claim 15; a layout processing system configured to process the computer-readable description to produce a circuit layout description of an integrated circuit embodying the system or the graphics processing system; and an integrated circuit manufacturing system configured to manufacture the system or the graphics processing system according to the circuit layout description.

Citation Information

Patent Citations

  • Method for picking up three-dimensional geometric primitive based on GPU

    CN103473814A

  • Method and apparatus for graphic processing using parallel pipeline

    CN103593817A