Mixed reality encoding using superimposition
By combining overlay auxiliary image technology with the collaborative work of parallel processors, the problem of insufficient flexibility in existing video coding technologies has been solved, enabling efficient encoding and display control of mixed reality content.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- INTEL CORP
- Filing Date
- 2018-04-23
- Publication Date
- 2026-06-19
AI Technical Summary
Existing video coding technologies have low flexibility after synthesizing video content, making it difficult to achieve flexible control over the displayed content.
The mixed reality content is encoded using overlay auxiliary image technology. Through the collaborative work of parallel processors and graphics processing units, the real world and the rendered content are encoded and decoded independently.
It improves the display flexibility and encoding efficiency of video content, and supports flexible control and efficient processing of mixed reality content.
Figure CN108737829B_ABST
Abstract
Description
Technical Field
[0001] The embodiments generally relate to video coding, and more specifically, to mixed reality coding using overlay.
Background Technology
[0002] Currently, video content and rendered content are composited together at the front end before encoding. Once the content is encoded, there is very little flexibility regarding what will be displayed: anything sent to the display will be displayed.
Brief Description of the Drawings
[0003] The various advantages of the embodiments will become apparent to those skilled in the art from the following description and appended claims, and from the following drawings, in which:
[0004] Figure 1 is a block diagram illustrating a computer system configured to implement one or more aspects of the embodiments described herein;
[0005] Figures 2A to 2D illustrate parallel processor components according to an embodiment;
[0006] Figures 3A to 3B are block diagrams of graphics multiprocessors according to an embodiment;
[0007] Figures 4A to 4F illustrate an exemplary architecture in which multiple GPUs are communicatively coupled to multiple multi-core processors;
[0008] Figure 5 illustrates a graphics processing pipeline according to an embodiment;
[0009] Figure 6 illustrates the general framework used for encoding real-world content as well as rendered content;
[0010] Figure 7 is a block diagram illustrating an example of a device for encoding mixed reality content using overlaid auxiliary images according to an embodiment;
[0011] Figure 8 is a flowchart illustrating an example of a method for encoding mixed reality content using overlaid auxiliary images according to an embodiment;
[0012] Figure 9 is a block diagram illustrating an example of a device for decoding mixed reality content using overlaid auxiliary images according to an embodiment;
[0013] Figure 10 is a flowchart illustrating an example of a method for decoding mixed reality content using overlaid auxiliary images according to an embodiment;
[0014] Figure 11 is a block diagram of an example of a display with local backlight capability according to an embodiment;
[0015] Figure 12A is a block diagram of an example of a data processing apparatus according to an embodiment;
[0016] Figure 12B is an illustration of an example of distance determination according to an embodiment;
[0017] Figure 13 is a block diagram illustrating an example of a layered display architecture according to an embodiment;
[0018] Figure 14 is a block diagram of an example of a display architecture that includes a plurality of display units according to an embodiment;
[0019] Figure 15 is a block diagram illustrating an example of a cloud-assisted media delivery architecture according to an embodiment;
[0020] Figures 16 to 18 are block diagrams illustrating an example overview of a data processing system according to an embodiment;
[0021] Figure 19 is a block diagram of an example of a graphics processing engine according to an embodiment;
[0022] Figures 20 to 22 are block diagrams of examples of execution units according to an embodiment;
[0023] Figure 23 is a block diagram illustrating an example of a graphics pipeline according to an embodiment;
[0024] Figures 24A to 24B are block diagrams illustrating examples of a graphics pipeline according to an embodiment;
[0025] Figure 25 is a block diagram illustrating an example of a graphics software architecture according to an embodiment;
[0026] Figure 26 is a block diagram of an example of an intellectual property (IP) core development system according to an embodiment; and
[0027] Figure 27 is a block diagram of an example of a system-on-chip integrated circuit according to an embodiment.
Detailed Description
[0028] In the following description, numerous specific details are set forth to provide a more thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention may be practiced without one or more of these specific details. In other examples, well-known features have not been described to avoid obscuring the invention.
[0029] System Overview
[0030] Figure 1 is a block diagram illustrating a computing system 100 configured to implement one or more aspects of the embodiments described herein. The computing system 100 includes a processing subsystem 101 having one or more processors 102 and a system memory 104, which communicate via an interconnect path that may include a memory hub 105. The memory hub 105 may be a separate component within a chipset or may be integrated within the one or more processors 102. The memory hub 105 is coupled to an I/O subsystem 111 via a communication link 106. The I/O subsystem 111 includes an I/O hub 107 that enables the computing system 100 to receive input from one or more input devices 108. Additionally, the I/O hub 107 enables a display controller, which may be included within the one or more processors 102, to provide output to one or more display devices 110A. In one embodiment, the one or more display devices 110A coupled to the I/O hub 107 may include local, internal, or embedded display devices.
[0031] In one embodiment, the processing subsystem 101 includes one or more parallel processors 112 coupled to the memory hub 105 via a bus or other communication link 113. The communication link 113 may be one of any number of standards-based communication link technologies or protocols (such as, but not limited to, PCI Express), or it may be a vendor-specific communication interface or communication fabric. In one embodiment, the one or more parallel processors 112 form a computationally focused parallel or vector processing system that includes a large number of processing cores and/or processing clusters (e.g., many integrated core (MIC) processors). In one embodiment, the one or more parallel processors 112 form a graphics processing subsystem that can output pixels to one of the one or more display devices 110A coupled via the I/O hub 107. The one or more parallel processors 112 may also include a display controller and a display interface (not shown) to enable a direct connection to one or more display devices 110B.
[0032] Within the I/O subsystem 111, a system storage unit 114 can be connected to the I/O hub 107 to provide a storage mechanism for the computing system 100. An I/O switch 116 can be used to provide an interface mechanism that connects the I/O hub 107 to other components, such as a network adapter 118 and/or a wireless network adapter 119 that may be integrated into the platform, and to various other devices that can be added via one or more add-in devices 120. The network adapter 118 can be an Ethernet adapter or another wired network adapter. The wireless network adapter 119 can include one or more of a Wi-Fi, Bluetooth, Near Field Communication (NFC), or other network device that includes one or more radios.
[0033] The computing system 100 may include other components not explicitly shown, including USB or other port connections, optical storage drives, video capture devices, etc., which may also be connected to the I/O hub 107. The communication paths interconnecting the various components in Figure 1 can be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect) based protocols (e.g., PCI Express) or any other bus or point-to-point communication interface and/or protocol (e.g., the NVLink high-speed interconnect or interconnect protocols known in the art).
[0034] In one embodiment, the one or more parallel processors 112 include circuitry optimized for graphics and video processing (including, for example, video output circuitry) and constitute a graphics processing unit (GPU). In another embodiment, the one or more parallel processors 112 include circuitry optimized for general-purpose processing while preserving the underlying computing architecture described in more detail herein. In yet another embodiment, components of the computing system 100 may be integrated with one or more other system elements on a single integrated circuit. For example, the one or more parallel processors 112, the memory hub 105, the processor(s) 102, and the I/O hub 107 may be integrated into a system-on-a-chip (SoC) integrated circuit. Alternatively, components of the computing system 100 may be integrated into a single package to form a system-in-package (SIP) configuration. In one embodiment, at least a portion of the components of the computing system 100 may be integrated into a multi-chip module (MCM), which can be interconnected with other multi-chip modules into a modular computing system.
[0035] It will be appreciated that the computing system 100 shown herein is illustrative and that variations and modifications are possible. The connection topology can be modified as needed, including the number and arrangement of bridges, the number of processor(s) 102, and the number of parallel processor(s) 112. For example, in some embodiments, the system memory 104 is connected directly to the processor(s) 102 rather than via a bridge, while other devices communicate with the system memory 104 via the memory hub 105 and the processor(s) 102. In other alternative topologies, the parallel processor(s) 112 are connected to the I/O hub 107 or directly to one of the one or more processor(s) 102, rather than to the memory hub 105. In other embodiments, the I/O hub 107 and the memory hub 105 may be integrated into a single chip. Some embodiments may include two or more sets of processor(s) 102 attached via multiple sockets, which may be coupled to two or more instances of the parallel processor(s) 112.
[0036] Some of the specific components shown herein are optional and may not be included in all implementations of the computing system 100. For example, any number of add-in cards or peripherals may be supported, or some components may be eliminated. Furthermore, some architectures may use different terminology for components similar to those shown in Figure 1. For example, in some architectures, the memory hub 105 may be called the Northbridge, while the I/O hub 107 may be called the Southbridge.
[0037] Figure 2A illustrates a parallel processor 200 according to an embodiment. The various components of the parallel processor 200 can be implemented using one or more integrated circuit devices, such as programmable processors, application-specific integrated circuits (ASICs), or field-programmable gate arrays (FPGAs). According to an embodiment, the illustrated parallel processor 200 is a variant of the one or more parallel processors 112 shown in Figure 1.
[0038] In one embodiment, the parallel processor 200 includes a parallel processing unit 202. The parallel processing unit includes an I/O unit 204 that enables communication with other devices, including other instances of the parallel processing unit 202. The I/O unit 204 may be directly connected to other devices. In one embodiment, the I/O unit 204 is connected to other devices via a hub or switch interface (e.g., the memory hub 105). The connection between the memory hub 105 and the I/O unit 204 forms the communication link 113. Within the parallel processing unit 202, the I/O unit 204 is connected to a host interface 206 and a memory crossbar switch 216, wherein the host interface 206 receives commands directed to performing processing operations, and the memory crossbar switch 216 receives commands directed to performing memory operations.
[0039] When the host interface 206 receives a command buffer via the I/O unit 204, the host interface 206 can direct work operations for executing those commands to a front end 208. In one embodiment, the front end 208 is coupled to a scheduler 210, which is configured to distribute commands or other work items to a processing cluster array 212. In one embodiment, the scheduler 210 ensures that the processing cluster array 212 is properly configured and in a valid state before tasks are distributed to the processing clusters of the processing cluster array 212. In one embodiment, the scheduler 210 is implemented via firmware logic executing on a microcontroller. The microcontroller-implemented scheduler 210 can be configured to perform complex scheduling and work distribution operations at both coarse and fine granularity, enabling fast preemption and context switching of threads executing on the processing array 212. In one embodiment, host software can submit workloads for scheduling on the processing array 212 via one of a plurality of graphics processing doorbells. The workloads can then be automatically distributed across the processing array 212 by the scheduler 210 logic within the scheduler microcontroller.
[0040] The processing cluster array 212 may include up to "N" processing clusters (e.g., cluster 214A, cluster 214B, up to cluster 214N). Each cluster 214A-214N of the processing cluster array 212 can execute a large number of concurrent threads. The scheduler 210 may use various scheduling and/or work distribution algorithms to allocate work to the clusters 214A-214N of the processing cluster array 212, and these algorithms may vary depending on the workload arising for each type of program or computation. Scheduling may be handled dynamically by the scheduler 210 or may be assisted in part by compiler logic during compilation of the program logic configured for execution by the processing cluster array 212. In one embodiment, different clusters 214A-214N of the processing cluster array 212 may be assigned to process different types of programs or to perform different types of computations.
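As a non-limiting illustration of one work distribution algorithm of the kind referred to above, the following sketch assigns each work item to the cluster with the smallest accumulated load. The function name `distribute_work`, the per-item cost values, and the least-loaded policy are assumptions made for illustration only; the description does not specify which algorithm the scheduler 210 uses.

```python
import heapq
from typing import Dict, List, Tuple


def distribute_work(work_costs: List[int], num_clusters: int) -> Dict[int, List[int]]:
    """Assign each work item (by index) to the currently least-loaded cluster.

    Hypothetical policy: the source only states that "various" algorithms may be used.
    """
    if num_clusters <= 0:
        raise ValueError("num_clusters must be positive")
    # Min-heap of (accumulated_cost, cluster_id); ties go to the lowest cluster id.
    heap: List[Tuple[int, int]] = [(0, c) for c in range(num_clusters)]
    heapq.heapify(heap)
    assignment: Dict[int, List[int]] = {c: [] for c in range(num_clusters)}
    for item, cost in enumerate(work_costs):
        load, cluster = heapq.heappop(heap)
        assignment[cluster].append(item)
        heapq.heappush(heap, (load + cost, cluster))
    return assignment
```

For example, with costs `[4, 1, 1, 1, 1]` and two clusters, the expensive first item occupies one cluster while the four cheap items accumulate on the other.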
[0041] The processing cluster array 212 can be configured to perform various types of parallel processing operations. In one embodiment, the processing cluster array 212 is configured to perform general-purpose parallel computing operations. For example, the processing cluster array 212 may include logic for performing processing tasks, including filtering video and/or audio data, performing modeling operations (including physics operations), and performing data transformations.
[0042] In one embodiment, the processing cluster array 212 is configured to perform parallel graphics processing operations. In embodiments where the parallel processor 200 is configured to perform graphics processing operations, the processing cluster array 212 may include additional logic to support the execution of such operations, including but not limited to texture sampling logic for performing texture operations, as well as tessellation logic and other vertex processing logic. Additionally, the processing cluster array 212 may be configured to execute shader programs related to graphics processing, such as, but not limited to, vertex shaders, tessellation shaders, geometry shaders, and pixel shaders. The parallel processing unit 202 may transfer data from system memory via the I/O unit 204 for processing. During processing, the transferred data may be stored in on-chip memory (e.g., parallel processor memory 222) and then written back to system memory.
[0043] In one embodiment, when the parallel processing unit 202 is used to perform graphics processing, the scheduler 210 can be configured to divide the processing workload into approximately equal-sized tasks to better enable the distribution of graphics processing operations across multiple clusters 214A to 214N in the processing cluster array 212. In some embodiments, multiple portions of the processing cluster array 212 can be configured to perform different types of processing. For example, a first portion can be configured to perform vertex shading and topology generation, a second portion can be configured to perform tessellation and geometry shading, and a third portion can be configured to perform pixel shading or other screen-space operations to produce a rendered image for display. Intermediate data generated by one or more of the clusters 214A to 214N can be stored in a buffer to allow the intermediate data to be transferred between clusters 214A to 214N for further processing.
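The division of a workload into approximately equal-sized tasks can be sketched as follows. This is a minimal illustration only: the name `split_workload` and the choice of contiguous index ranges are assumptions, and the description does not state how the scheduler 210 actually partitions work.

```python
from typing import List


def split_workload(num_items: int, num_tasks: int) -> List[range]:
    """Split num_items work items into num_tasks contiguous, near-equal ranges."""
    if num_tasks <= 0:
        raise ValueError("num_tasks must be positive")
    base, extra = divmod(num_items, num_tasks)
    tasks: List[range] = []
    start = 0
    for t in range(num_tasks):
        # The first `extra` tasks take one additional item, so sizes differ by at most 1.
        size = base + (1 if t < extra else 0)
        tasks.append(range(start, start + size))
        start += size
    return tasks
```

Under these assumptions, ten items split across three tasks yield ranges of four, three, and three items, each of which could be handed to a different cluster.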
[0044] During operation, the processing cluster array 212 may receive processing tasks to be executed via the scheduler 210, which receives commands defining the processing tasks from the front end 208. For graphics processing operations, a processing task may include indices of the data to be processed (e.g., surface (patch) data, primitive data, vertex data, and/or pixel data), as well as state parameters and commands defining how the data is to be processed (e.g., what program is to be executed). The scheduler 210 may be configured to fetch the indices corresponding to the tasks, or may receive the indices from the front end 208. The front end 208 may be configured to ensure that the processing cluster array 212 is configured to a valid state before the workload specified by an incoming command buffer (e.g., a batch buffer, a push buffer, etc.) is initiated.
[0045] Each of the one or more instances of the parallel processing unit 202 may be coupled to a parallel processor memory 222. The parallel processor memory 222 may be accessed via the memory crossbar switch 216, which can receive memory requests from the processing cluster array 212 as well as from the I/O unit 204. The memory crossbar switch 216 may access the parallel processor memory 222 via a memory interface 218. The memory interface 218 may include multiple partition units (e.g., partition unit 220A, partition unit 220B, up to partition unit 220N), each of which can be coupled to a portion (e.g., a memory cell) of the parallel processor memory 222. In one implementation, the number of partition units 220A-220N is configured to equal the number of memory cells, such that a first partition unit 220A has a corresponding first memory cell 224A, a second partition unit 220B has a corresponding second memory cell 224B, and an Nth partition unit 220N has a corresponding Nth memory cell 224N. In other embodiments, the number of partition units 220A-220N may not equal the number of memory devices.
[0046] In various embodiments, the memory cells 224A to 224N may include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory (e.g., synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory). In one embodiment, the memory cells 224A to 224N may also include 3D stacked memory, including but not limited to high bandwidth memory (HBM). Those skilled in the art will recognize that the specific implementation of the memory cells 224A to 224N can vary and may be selected from a variety of conventional designs. Render targets (e.g., frame buffers or texture maps) may be stored across the memory cells 224A to 224N, allowing the partition units 220A to 220N to write portions of each render target in parallel to efficiently use the available bandwidth of the parallel processor memory 222. In some embodiments, a local instance of the parallel processor memory 222 may be excluded in favor of a unified memory design that utilizes system memory in conjunction with local cache memory.
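One way a render target could be spread across several partition units so that they can be written in parallel is simple address interleaving, sketched below. The 256-byte stripe size, the modulo mapping, and the function name `partition_for_address` are hypothetical; the description does not disclose the actual mapping used by the memory interface 218.

```python
from typing import Tuple


def partition_for_address(addr: int, num_partitions: int,
                          stripe_bytes: int = 256) -> Tuple[int, int]:
    """Map a linear byte address to (partition index, offset within that partition).

    Consecutive stripes rotate across partitions, so a large sequential write
    touches every partition roughly equally. Stripe size is an assumed value.
    """
    if addr < 0 or num_partitions <= 0 or stripe_bytes <= 0:
        raise ValueError("invalid argument")
    stripe = addr // stripe_bytes
    partition = stripe % num_partitions
    local_offset = (stripe // num_partitions) * stripe_bytes + addr % stripe_bytes
    return partition, local_offset
```

With four partitions, addresses 0, 256, 512, and 768 land on partitions 0 through 3, and address 1024 wraps back to partition 0 at local offset 256.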
[0047] In one embodiment, any of the clusters 214A-214N of the processing cluster array 212 can process data to be written to any of the memory cells 224A-224N within the parallel processor memory 222. The memory crossbar switch 216 can be configured to pass the output of each cluster 214A-214N to any partition unit 220A-220N or to another cluster 214A-214N, which can perform additional processing operations on the output. Each cluster 214A-214N can communicate with the memory interface 218 via the memory crossbar switch 216 to read from or write to various external memory devices. In one embodiment, the memory crossbar switch 216 has a connection to the memory interface 218 for communication with the I/O unit 204, as well as a connection to a local instance of the parallel processor memory 222, thereby enabling processing units within the different processing clusters 214A-214N to communicate with system memory or other memory that is not local to the parallel processing unit 202. In one embodiment, the memory crossbar switch 216 may use virtual channels to separate traffic streams between the clusters 214A-214N and the partition units 220A-220N.
[0048] While a single instance of the parallel processing unit 202 is shown within the parallel processor 200, any number of instances of the parallel processing unit 202 may be included. For example, multiple instances of the parallel processing unit 202 may be provided on a single add-in card, or multiple add-in cards may be interconnected. Different instances of the parallel processing unit 202 may be configured to interoperate even if they have different numbers of processing cores, different amounts of local parallel processor memory, and/or other configuration differences. For example, in one embodiment, some instances of the parallel processing unit 202 may include higher-precision floating-point units relative to other instances. Systems incorporating one or more instances of the parallel processing unit 202 or the parallel processor 200 may be implemented in a variety of configurations and form factors, including but not limited to desktop, laptop, or handheld personal computers, servers, workstations, game consoles, and/or embedded systems.
[0049] Figure 2B is a block diagram of a partition unit 220 according to an embodiment. In one embodiment, the partition unit 220 is an instance of one of the partition units 220A to 220N of Figure 2A. As shown, the partition unit 220 includes an L2 cache 221, a frame buffer interface 225, and a ROP 226 (raster operations unit). The L2 cache 221 is a read/write cache configured to perform load and store operations received from the memory crossbar switch 216 and the ROP 226. Read misses and urgent write-back requests are output by the L2 cache 221 to the frame buffer interface 225 for processing. Updates can also be sent to the frame buffer via the frame buffer interface 225 for processing. In one embodiment, the frame buffer interface 225 interfaces with one of the memory cells in the parallel processor memory (e.g., the memory cells 224A to 224N of Figure 2A, within the parallel processor memory 222).
[0050] In graphics applications, the ROP 226 is a processing unit that performs raster operations such as stencil, z-test, blending, and the like. The ROP 226 then outputs processed graphics data, which is stored in graphics memory. In some embodiments, the ROP 226 includes compression logic for compressing depth or color data that is written to memory and decompressing depth or color data that is read from memory. The compression logic can be lossless compression logic that uses one or more of a variety of compression algorithms. The type of compression performed by the ROP 226 can vary based on the statistical characteristics of the data to be compressed. For example, in one embodiment, delta color compression is performed on depth and color data on a per-tile basis.
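The general idea of lossless delta compression on a tile can be sketched as follows. This is only an illustration of delta encoding under stated assumptions: the tile is treated as a flat list of integer sample values, each value is stored as its difference from the previous one, and the function names are hypothetical. The actual compression scheme used by the ROP 226 is not disclosed here.

```python
from typing import List


def delta_encode(tile: List[int]) -> List[int]:
    """Store the first sample as-is, then each sample as a difference from its predecessor."""
    if not tile:
        return []
    encoded = [tile[0]]
    for prev, cur in zip(tile, tile[1:]):
        encoded.append(cur - prev)
    return encoded


def delta_decode(encoded: List[int]) -> List[int]:
    """Invert delta_encode by accumulating the differences (lossless round trip)."""
    decoded: List[int] = []
    acc = 0
    for i, d in enumerate(encoded):
        acc = d if i == 0 else acc + d
        decoded.append(acc)
    return decoded
```

When neighboring samples in a tile are similar, the differences are small and can be represented in fewer bits than the original values, which is why the benefit depends on the statistical characteristics of the data.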
[0051] In some embodiments, the ROP 226 is included within each processing cluster (e.g., clusters 214A to 214N of Figure 2A) rather than within the partition unit 220. In such embodiments, read and write requests for pixel data, rather than pixel fragment data, are transmitted via the memory crossbar switch 216. Processed graphics data can be displayed on a display device (e.g., one of the one or more display devices 110 of Figure 1), routed for further processing by the processor(s) 102, or routed for further processing by one of the processing entities within the parallel processor 200 of Figure 2A.
[0052] Figure 2C is a block diagram of a processing cluster 214 within a parallel processing unit according to an embodiment. In one embodiment, the processing cluster is an instance of one of the processing clusters 214A to 214N of Figure 2A. The processing cluster 214 can be configured to execute many threads in parallel, where the term "thread" refers to an instance of a particular program executing on a particular set of input data. In some embodiments, single-instruction, multiple-data (SIMD) instruction issue techniques are used to support the parallel execution of a large number of threads without providing multiple independent instruction units. In other embodiments, single-instruction, multiple-thread (SIMT) techniques are used to support the parallel execution of a large number of generally synchronized threads, using a common instruction unit configured to issue instructions to a set of processing engines within each of the processing clusters. Unlike a SIMD execution regime, in which all processing engines typically execute identical instructions, SIMT execution allows different threads to more readily follow divergent execution paths through a given thread program. Those skilled in the art will understand that a SIMD processing regime represents a functional subset of a SIMT processing regime.
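The way a single common instruction stream can still serve threads that take different branches may be sketched with a per-thread active mask, as below. Masked execution is one commonly used way to model SIMT divergence and is assumed here for illustration; the description does not state how divergence is handled in hardware. The branch bodies and the name `simt_execute` are likewise hypothetical.

```python
from typing import List


def simt_execute(inputs: List[int]) -> List[int]:
    """Model one thread group running: if x is even, x //= 2; else x = 3 * x + 1."""
    values = list(inputs)
    # All threads evaluate the branch condition under the same instruction.
    mask = [v % 2 == 0 for v in values]
    # Pass 1: the "then" path is issued once; only threads with mask set are active.
    for t in range(len(values)):
        if mask[t]:
            values[t] //= 2
    # Pass 2: the "else" path is issued once; only the remaining threads are active.
    for t in range(len(values)):
        if not mask[t]:
            values[t] = 3 * values[t] + 1
    return values
```

In this model the two paths are serialized, so a group whose threads all take the same branch behaves like the SIMD case, while a divergent group pays for both passes.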
[0053] The operation of the processing cluster 214 can be controlled via a pipeline manager 232, which distributes processing tasks to SIMT parallel processors. The pipeline manager 232 receives instructions from the scheduler 210 of Figure 2A and manages the execution of those instructions via a graphics multiprocessor 234 and/or a texture unit 236. The illustrated graphics multiprocessor 234 is an exemplary instance of a SIMT parallel processor. However, various types of SIMT parallel processors with differing architectures can be included within the processing cluster 214. One or more instances of the graphics multiprocessor 234 can be included within the processing cluster 214. The graphics multiprocessor 234 can process data, and a data crossbar switch 240 can be used to distribute the processed data to one of several possible destinations, including other shader units. The pipeline manager 232 can facilitate the distribution of processed data by specifying destinations for the processed data to be distributed via the data crossbar switch 240.
[0054] Each graphics multiprocessor 234 within the processing cluster 214 can include an identical set of functional execution logic (e.g., arithmetic logic units, load-store units, etc.). The functional execution logic can be configured in a pipelined manner, in which new instructions can be issued before previous instructions have completed. The functional execution logic supports a variety of operations, including integer and floating-point arithmetic, comparison operations, Boolean operations, bit shifting, and the computation of various algebraic functions. In one embodiment, the same functional-unit hardware can be used to perform different operations, and any combination of functional units may be present.
[0055] Instructions transmitted to processing cluster 214 constitute threads. A group of threads executing across a set of parallel processing engines is a thread group. Thread groups execute the same program on different input data. Each thread within a thread group can be assigned to a different processing engine within graphics multiprocessor 234. A thread group may include fewer threads than the number of processing engines within graphics multiprocessor 234. When a thread group includes fewer threads than the number of processing engines, one or more of the processing engines may be idle during the cycle in which the thread group is being processed. A thread group may also include more threads than the number of processing engines within graphics multiprocessor 234. When a thread group includes more threads than the number of processing engines within graphics multiprocessor 234, processing can be performed on consecutive clock cycles. In one embodiment, multiple thread groups can be executed concurrently on graphics multiprocessor 234.
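The relationship between thread group size and the number of processing engines described above reduces to simple arithmetic, sketched here. The name `schedule_thread_group` and the assumption that each pass occupies all engines for one step are illustrative only.

```python
import math
from typing import Tuple


def schedule_thread_group(num_threads: int, num_engines: int) -> Tuple[int, int]:
    """Return (passes needed, engines left idle during the final pass)."""
    if num_engines <= 0 or num_threads < 0:
        raise ValueError("invalid argument")
    passes = math.ceil(num_threads / num_engines)
    idle_in_last_pass = passes * num_engines - num_threads
    return passes, idle_in_last_pass
```

For instance, a group of 24 threads on 32 engines completes in one pass with 8 engines idle, whereas a group of 80 threads needs three consecutive passes and leaves 16 engines idle in the last one.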
[0056] In one embodiment, the graphics multiprocessor 234 includes an internal cache memory for performing load and store operations. In another embodiment, the graphics multiprocessor 234 may forgo an internal cache and use a cache memory within the processing cluster 214 (e.g., L1 cache 308). Each graphics multiprocessor 234 also has access to the L2 caches within the partition units (e.g., partition units 220A to 220N of Figure 2A), which are shared among all processing clusters 214 and can be used to transfer data between threads. The graphics multiprocessor 234 may also access off-chip global memory, which may include one or more of local parallel processor memory and/or system memory. Any memory external to the parallel processing unit 202 may be used as global memory. Embodiments in which the processing cluster 214 includes multiple instances of the graphics multiprocessor 234 may share common instructions and data, which may be stored in the L1 cache 308.
[0057] Each processing cluster 214 may include an MMU 245 (memory management unit) configured to map virtual addresses to physical addresses. In other embodiments, one or more instances of the MMU 245 may reside within the memory interface 218 of Figure 2A. The MMU 245 includes a set of page table entries (PTEs) used to map a virtual address to the physical address of a tile and, optionally, a cache line index. The MMU 245 may include address translation lookaside buffers (TLBs) or caches that may reside within the graphics multiprocessor 234, the L1 cache, or the processing cluster 214. The physical address is processed to distribute surface data access locality, allowing efficient request interleaving among the partition units. The cache line index can be used to determine whether a request for a cache line is a hit or a miss.
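A virtual-to-physical translation backed by page table entries and fronted by a small translation cache can be sketched as follows. The class name `SimpleMMU`, the 4096-byte page size, the two-entry capacity, and the least-recently-used eviction policy are all assumptions made for illustration; none of these parameters are given for the MMU 245.

```python
from collections import OrderedDict
from typing import Dict


class SimpleMMU:
    """Toy MMU: page-table lookup fronted by a small LRU translation cache (TLB)."""

    def __init__(self, page_table: Dict[int, int],
                 page_size: int = 4096, tlb_entries: int = 2) -> None:
        self.page_table = page_table      # virtual page number -> physical frame number
        self.page_size = page_size        # assumed value, not taken from the source
        self.tlb_entries = tlb_entries
        self.tlb: "OrderedDict[int, int]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, self.page_size)
        if vpn in self.tlb:
            self.hits += 1
            self.tlb.move_to_end(vpn)         # mark as most recently used
            frame = self.tlb[vpn]
        else:
            self.misses += 1
            frame = self.page_table[vpn]      # KeyError models an unmapped page
            self.tlb[vpn] = frame
            if len(self.tlb) > self.tlb_entries:
                self.tlb.popitem(last=False)  # evict the least recently used entry
        return frame * self.page_size + offset
```

A first access to a page walks the page table and counts as a miss; a second access to the same page is served from the translation cache and counts as a hit.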
[0058] In graphics and computing applications, the processing cluster 214 may be configured such that each graphics multiprocessor 234 is coupled to a texture unit 236 for performing texture mapping operations, such as determining texture sample locations, reading texture data, and filtering texture data. Texture data is read from an internal texture L1 cache (not shown) or, in some embodiments, from the L1 cache within the graphics multiprocessor 234, and is fetched from an L2 cache, local parallel processor memory, or system memory as needed. Each graphics multiprocessor 234 outputs processed tasks to the data crossbar switch 240 to provide the processed tasks to another processing cluster 214 for further processing, or to store the processed tasks in an L2 cache, local parallel processor memory, or system memory via the memory crossbar switch 216. A preROP 242 (pre-raster operations unit) is configured to receive data from the graphics multiprocessor 234 and direct the data to ROP units, which may be located with the partition units as described herein (e.g., partition units 220A to 220N of Figure 2A). The preROP 242 unit can perform optimizations for color blending, organize pixel color data, and perform address translations.
[0059] It will be appreciated that the core architecture described herein is illustrative, and various variations and modifications are possible. Any number of processing units (e.g., graphics multiprocessors 234, texture units 236, preROP 242, etc.) may be included within processing cluster 214. Furthermore, although only one processing cluster 214 is shown, parallel processing units as described herein may include any number of instances of processing cluster 214. In one embodiment, each processing cluster 214 may be configured to operate independently of other processing clusters 214 using separate and distinct processing units, L1 caches, etc.
[0060] Figure 2D illustrates a graphics multiprocessor 234 according to one embodiment. In such an embodiment, the graphics multiprocessor 234 is coupled to a pipeline manager 232 of a processing cluster 214. The graphics multiprocessor 234 has an execution pipeline including, but not limited to: an instruction cache 252, an instruction unit 254, an address mapping unit 256, a register file 258, one or more general-purpose graphics processing unit (GPGPU) cores 262, and one or more load / store units 266. The GPGPU cores 262 and the load / store units 266 are coupled to a cache memory 272 and a shared memory 270 via a memory and cache interconnect 268.
[0061] In one embodiment, instruction cache 252 receives a stream of instructions to be executed from pipeline manager 232. These instructions are cached in instruction cache 252 and dispatched for execution by instruction unit 254. Instruction unit 254 can dispatch instructions into thread groups (e.g., thread bundles), where each thread in the thread group is assigned to a different execution unit within GPGPU core 262. Instructions can access either the local, shared, or global address space by specifying an address within a unified address space. Address mapping unit 256 can be used to translate addresses in the unified address space into distinct memory addresses that can be accessed by load / store unit 266.
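The role of the address mapping unit can be pictured with the following Python sketch, which resolves an address in a unified address space into a memory space and an offset within it. The window bases and sizes, and the name `map_unified_address`, are hypothetical values chosen for illustration; the embodiments do not define a particular layout.

```python
# Hypothetical windows inside the unified address space: (name, base, size).
WINDOWS = [
    ("local", 0x0000_0000, 0x0001_0000),
    ("shared", 0x0001_0000, 0x0001_0000),
    ("global", 0x0002_0000, 0x1000_0000),
]


def map_unified_address(address):
    """Return (space, offset) for an address in the unified address space."""
    for name, base, size in WINDOWS:
        if base <= address < base + size:
            return name, address - base
    raise ValueError(f"address {address:#x} is outside every window")
```

Under these assumed windows, an instruction only supplies a single unified address, and the mapping step decides which distinct memory the load / store unit would access.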
[0062] Register file 258 provides a set of registers for the functional units of graphics multiprocessor 234. Register file 258 provides temporary storage for operands on data paths connected to functional units of graphics multiprocessor 234 (e.g., GPGPU core 262, load / store unit 266). In one embodiment, register file 258 is partitioned among each of these functional units, such that each functional unit is allocated a dedicated portion of register file 258. In another embodiment, register file 258 is partitioned among different thread bundles executed by graphics multiprocessor 234.
[0063] Each GPGPU core 262 may include a floating-point unit (FPU) and / or an integer arithmetic logic unit (ALU) for executing instructions of the graphics multiprocessor 234. According to embodiments, the GPGPU cores 262 may be architecturally similar or architecturally different. For example, in one embodiment, a first portion of the GPGPU cores 262 includes a single-precision FPU and an integer ALU, while a second portion of the GPGPU cores includes a double-precision FPU. In one embodiment, the FPU may implement the IEEE 754-2008 standard for floating-point arithmetic or may implement variable-precision floating-point arithmetic. The graphics multiprocessor 234 may additionally include one or more fixed-function or special-function units to perform specific functions (e.g., copying rectangles or pixel blending operations). In one embodiment, one or more of the GPGPU cores may also include fixed-function or special-function logic.
[0064] In one embodiment, GPGPU core 262 includes SIMD logic capable of executing a single instruction on multiple sets of data. In one embodiment, GPGPU core 262 can physically execute SIMD4, SIMD8, and SIMD16 instructions, and logically execute SIMD1, SIMD2, and SIMD32 instructions. The SIMD instructions for the GPGPU core can be generated by a shader compiler at compile time, or can be automatically generated when executing a program written and compiled for a Single Program Multiple Data (SPMD) or SIMT architecture. Multiple threads of a program configured for a SIMT execution model can be executed via a single SIMD instruction. For example, in one embodiment, eight SIMT threads performing the same or similar operations can be executed in parallel via a single SIMD8 logic unit.
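The relationship between SIMT threads and SIMD8 execution can be sketched in Python as follows. The lane packing shown here, and the function names `simd8_add` and `run_simt_add`, are assumptions for illustration and not a description of the actual hardware scheduling.

```python
SIMD_WIDTH = 8  # one SIMD8 logic unit, as in the example above


def simd8_add(lanes_a, lanes_b):
    """Model one SIMD8 instruction: the same add applied to up to 8 lanes."""
    assert len(lanes_a) == len(lanes_b) <= SIMD_WIDTH
    return [a + b for a, b in zip(lanes_a, lanes_b)]


def run_simt_add(a, b):
    """Run one add per SIMT thread, packing threads into SIMD8 issues."""
    results, issues = [], 0
    for start in range(0, len(a), SIMD_WIDTH):
        end = start + SIMD_WIDTH
        results += simd8_add(a[start:end], b[start:end])
        issues += 1
    return results, issues
```

With eight threads the sketch issues a single modeled SIMD8 instruction. With 32 threads it issues four, which is one assumed way a logically wider operation could be carried out on narrower physical hardware.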
[0065] The memory and cache interconnect 268 is an interconnect network that connects each functional unit of the graphics multiprocessor 234 to the register file 258 and to the shared memory 270. In one embodiment, the memory and cache interconnect 268 is a cross-switch interconnect that allows the load / store unit 266 to perform load and store operations between the shared memory 270 and the register file 258. The register file 258 can operate at the same frequency as the GPGPU core 262, resulting in very low latency for data transfer between the GPGPU core 262 and the register file 258. The shared memory 270 can be used to implement communication between threads executing on functional units within the graphics multiprocessor 234. The cache memory 272 can be used, for example, as a data cache to cache texture data communicated between functional units and texture units 236. The shared memory 270 can also be used as a program-managed cache. Threads executing on the GPGPU core 262 can programmatically store data in the shared memory other than the automatically cached data stored in the cache memory 272.
[0066] Figures 3A to 3B illustrate additional graphics multiprocessors according to embodiments. The illustrated graphics multiprocessors 325 and 350 are variants of the graphics multiprocessor 234 of Figure 2C. The illustrated graphics multiprocessors 325 and 350 can be configured as streaming multiprocessors (SMs) capable of executing a large number of execution threads simultaneously.
[0067] Figure 3A illustrates a graphics multiprocessor 325 according to an additional embodiment. Relative to the graphics multiprocessor 234 of Figure 2D, the graphics multiprocessor 325 includes multiple additional instances of execution resource units. For example, the graphics multiprocessor 325 may include multiple instances of instruction units 332A to 332B, register files 334A to 334B, and texture units 344A to 344B. The graphics multiprocessor 325 also includes multiple sets of graphics or compute execution units (e.g., GPGPU cores 336A to 336B, GPGPU cores 337A to 337B, GPGPU cores 338A to 338B) and multiple sets of load / store units 340A to 340B. In one embodiment, the execution resource units have a common instruction cache 330, a texture and / or data cache memory 342, and a shared memory 346.
[0068] Various components can communicate via interconnect structure 327. In one embodiment, interconnect structure 327 includes one or more crossbar switches to enable communication between various components of the graphics multiprocessor 325. In one embodiment, interconnect structure 327 is a separate high-speed network structure layer on which each component of the graphics multiprocessor 325 is stacked. Components of the graphics multiprocessor 325 communicate with remote components via interconnect structure 327. For example, GPGPU cores 336A-336B, 337A-337B, and 338A-338B can each communicate with shared memory 346 via interconnect structure 327. Interconnect structure 327 can arbitrate communication within the graphics multiprocessor 325 to ensure fair bandwidth allocation among components.
[0069] Figure 3B illustrates a graphics multiprocessor 350 according to an additional embodiment. The graphics processor includes multiple sets of execution resources 356A to 356D, wherein each set of execution resources includes multiple instruction units, register files, GPGPU cores, and load / store units, as shown in Figure 2D and Figure 3A. Execution resources 356A to 356D can work in concert with texture units 360A to 360D for texture operations, while sharing instruction cache 354 and shared memory 362. In one embodiment, execution resources 356A to 356D can share instruction cache 354 and shared memory 362, as well as multiple instances of texture and / or data cache memories 358A to 358B. The various components can communicate via an interconnect structure 352 similar to the interconnect structure 327 of Figure 3A.
[0070] Those skilled in the art will understand that the architectures described in Figure 1, Figures 2A to 2D, and Figures 3A to 3B are descriptive and do not limit the scope of the present embodiments. Therefore, the techniques described herein can be implemented on any properly configured processing unit without departing from the scope of the embodiments described herein, including but not limited to one or more mobile application processors, one or more desktop computer or server central processing units (CPUs) (including multi-core CPUs), one or more parallel processing units (e.g., parallel processing unit 202 of FIG. 2), and one or more graphics processors or dedicated processing units.
[0071] In some embodiments, a parallel processor or GPGPU, as described herein, is communicatively coupled to a host / processor core to accelerate graphics operations, machine learning operations, pattern analysis operations, and various general-purpose GPU (GPGPU) functions. The GPU may be communicatively coupled to the host processor / core via a bus or other interconnect (e.g., a high-speed interconnect such as PCIe or NVLink). In other embodiments, the GPU may be integrated on the same package or chip as these cores and communicatively coupled to these cores via an internal processor bus / interconnect (i.e., inside the package or chip). Regardless of how the GPU is connected, the processor core can assign work to the GPU in the form of a sequence of commands / instructions contained in a job descriptor. The GPU then uses dedicated circuitry / logic to efficiently process these commands / instructions.
[0072] Technologies for GPU-to-host processor interconnect
[0073] Figure 4A illustrates an exemplary architecture in which multiple GPUs 410 to 413 are communicatively coupled to multiple multi-core processors 405 to 406 via high-speed links 440 to 443 (e.g., buses, point-to-point interconnects, etc.). In one embodiment, depending on the implementation, high-speed links 440 to 443 support communication throughput of 4GB / s, 30GB / s, 80GB / s, or higher. Various interconnect protocols can be used, including but not limited to PCIe 4.0 or 5.0 and NVLink 2.0. However, the basic principles of the invention are not limited to any particular communication protocol or throughput.
[0074] Additionally, in one embodiment, two or more of GPUs 410 to 413 are interconnected via high-speed links 444 to 445, which may be implemented using the same or different protocols / links as those used for high-speed links 440 to 443. Similarly, two or more of multi-core processors 405 to 406 may be connected via high-speed link 433, which may be a symmetric multiprocessor (SMP) bus operating at 20GB / s, 30GB / s, 120GB / s, or higher. Alternatively, all communication between the various system components shown in Figure 4A can be achieved using the same protocol / link (e.g., via a common interconnect structure). However, as mentioned, the basic principles of the invention are not limited to any particular type of interconnect technology.
[0075] In one embodiment, each multi-core processor 405 to 406 is communicatively coupled to processor memories 401 to 402 via memory interconnects 430 to 431, respectively, and each GPU 410 to 413 is communicatively coupled to GPU memories 420 to 423 via GPU memory interconnects 450 to 453, respectively. Memory interconnects 430 to 431 and 450 to 453 may utilize the same or different memory access technologies. By way of example and without limitation, processor memories 401 to 402 and GPU memories 420 to 423 may be volatile memories, such as dynamic random access memory (DRAM) (including stacked DRAM), graphics DDR SDRAM (GDDR) (e.g., GDDR5, GDDR6), or high-bandwidth memory (HBM), and / or may be non-volatile memories, such as 3D XPoint or Nano-RAM. In one embodiment, a portion of the memory may be volatile memory, and another portion may be non-volatile memory (e.g., using a two-level memory (2LM) hierarchy).
[0076] As described below, although the various processors 405 to 406 and GPUs 410 to 413 can be physically coupled to specific memories 401 to 402 and 420 to 423 respectively, a unified memory architecture can be implemented, in which the same virtual system address space (also known as the “effective address” space) is distributed across all the various physical memories. For example, processor memories 401 to 402 can each include 64GB of system memory address space, and GPU memories 420 to 423 can each include 32GB of system memory address space (resulting in a total of 256GB of addressable memory in this example).
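The arithmetic of this example (2 x 64GB + 4 x 32GB = 256GB) and the idea of one address space spread over several physical memories can be sketched in Python as follows. The sizes come from the example above; the contiguous ordering of the ranges and the names `build_layout` and `owner_of` are assumptions made for illustration.

```python
GB = 1 << 30

# Sizes taken from the example above; the contiguous ordering is an assumption.
MEMORIES = [
    ("processor memory 401", 64 * GB),
    ("processor memory 402", 64 * GB),
    ("GPU memory 420", 32 * GB),
    ("GPU memory 421", 32 * GB),
    ("GPU memory 422", 32 * GB),
    ("GPU memory 423", 32 * GB),
]


def build_layout(memories):
    """Assign each physical memory a contiguous range of effective addresses."""
    layout, base = [], 0
    for name, size in memories:
        layout.append((name, base, base + size))
        base += size
    return layout, base


def owner_of(layout, effective_address):
    """Find which physical memory backs a given effective address."""
    for name, start, end in layout:
        if start <= effective_address < end:
            return name
    raise ValueError("address outside the unified address space")


layout, total = build_layout(MEMORIES)
```

Here `total` equals 256GB, and any effective address below that bound resolves to exactly one of the six physical memories.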
[0077] Figure 4B illustrates additional details of the interconnect between a multi-core processor 407 and a graphics acceleration module 446 according to one embodiment. The graphics acceleration module 446 may include one or more GPU chips integrated on a line card coupled to the processor 407 via a high-speed link 440. Alternatively, the graphics acceleration module 446 may be integrated on the same package or chip as the processor 407.
[0078] The processor 407 shown includes multiple cores 460A to 460D, each core having translation lookaside buffers 461A to 461D and one or more caches 462A to 462D. These cores may include various other components for executing instructions and processing data (e.g., instruction fetch unit, branch prediction unit, decoder, execution unit, reorder buffer, etc.), which are not shown to avoid obscuring the fundamental principles of the invention. Caches 462A to 462D may include Level 1 (L1) and Level 2 (L2) caches. Additionally, one or more shared caches 426 may be included in the cache hierarchy and shared by sets of cores 460A to 460D. For example, one embodiment of the processor 407 includes 24 cores, each core having its own L1 cache, 12 shared L2 caches, and 12 shared L3 caches. In this embodiment, one of the L2 and L3 caches is shared by two adjacent cores. The processor 407 and graphics acceleration module 446 are connected to the system memory 441, which may include processor memories 401 to 402.
[0079] Coherence of the data and instructions stored in the various caches 462A to 462D, 456 and system memory 441 is maintained via inter-core communication over the coherence bus 464. For example, each cache may have associated cache coherence logic / circuitry to communicate over the coherence bus 464 in response to a detected read or write to a specific cache line. In one implementation, a cache snooping protocol is implemented over the coherence bus 464 to snoop on cache accesses. Cache snooping / coherence techniques are well understood by those skilled in the art and will not be described in detail herein to avoid obscuring the basic principles of the invention.
[0080] In one embodiment, proxy circuitry 425 communicatively couples graphics acceleration module 446 to coherence bus 464, thereby allowing graphics acceleration module 446 to participate in cache coherence protocols as a peer of the core. Specifically, interface 435 provides connectivity to proxy circuitry 425 via high-speed link 440 (e.g., PCIe bus, NVLink, etc.), and interface 437 connects graphics acceleration module 446 to link 440.
[0081] In one implementation, accelerator integrated circuit 436 provides cache management, memory access, context management, and interrupt management services on behalf of multiple graphics processing engines 431, 432, N of graphics acceleration module 446. Graphics processing engines 431, 432, N may each include a separate graphics processing unit (GPU). Alternatively, graphics processing engines 431, 432, N may include different types of graphics processing engines within a GPU, such as graphics execution units, media processing engines (e.g., video encoders / decoders), samplers, and bit-block transfer engines. In other words, the graphics acceleration module may be a GPU with multiple graphics processing engines 431 to 432, N, or graphics processing engines 431 to 432, N may be individual GPUs integrated on a common package, line card, or chip.
[0082] In one embodiment, accelerator integrated circuit 436 includes a memory management unit (MMU) 439 for performing various memory management functions, such as virtual-to-physical memory translation (also known as effective-to-real memory translation) and memory access protocols for accessing system memory 441. MMU 439 may also include a translation lookaside buffer (TLB) (not shown) for caching virtual / effective to physical / real address translations. In one implementation, cache 438 stores commands and data for efficient access by graphics processing engines 431 to 432, N. In one embodiment, the data stored in cache 438 and graphics memories 433 to 434, N is kept coherent with core caches 462A to 462D, 456 and system memory 411. As mentioned, this can be achieved via proxy circuitry 425, which participates in the cache coherence mechanism on behalf of cache 438 and memories 433 to 434, N (e.g., sending updates to cache 438 related to modifications / accesses of cache lines on processor caches 462A to 462D, 456, and receiving updates from cache 438).
[0083] A set of registers 445 stores context data for threads executed by graphics processing engines 431 to 432, N, and context management circuitry 448 manages the thread contexts. For example, context management circuitry 448 can perform save and restore operations during context switches to save and restore the contexts of various threads (e.g., where a first thread is saved and a second thread is restored so that the second thread can be executed by a graphics processing engine). On a context switch, for instance, context management circuitry 448 can store the current register values to a designated region in memory (e.g., identified by a context pointer). It can then restore these register values upon returning to the context. In one embodiment, interrupt management circuitry 447 receives and processes interrupts received from system devices.
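The save and restore behavior described above can be modeled with the short Python sketch below. The dictionary standing in for memory regions, the register names, and the class name `ContextManager` are assumptions for illustration, not features recited in the embodiments.

```python
class ContextManager:
    """Toy model of context save / restore keyed by a context pointer."""

    def __init__(self):
        self.registers = {}      # live register values of the running thread
        self.save_regions = {}   # context pointer -> saved register snapshot

    def save(self, context_pointer):
        # Copy so later register writes do not alter the saved state.
        self.save_regions[context_pointer] = dict(self.registers)

    def restore(self, context_pointer):
        self.registers = dict(self.save_regions[context_pointer])

    def switch(self, from_pointer, to_pointer):
        """Save the outgoing thread, then restore the incoming one."""
        self.save(from_pointer)
        self.restore(to_pointer)
```

A switch away from a thread and back again leaves that thread's register values exactly as they were when it was saved.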
[0084] In one implementation, the MMU 439 translates the virtual / effective address from the graphics processing engine 431 into a real / physical address in system memory 411. One embodiment of the accelerator integrated circuit 436 supports multiple (e.g., 4, 8, 16) graphics accelerator modules 446 and / or other accelerator devices. The graphics accelerator module 446 may be dedicated to a single application executing on processor 407 or may be shared among multiple applications. In one embodiment, a virtualized graphics execution environment is presented, in which multiple applications or virtual machines (VMs) share the resources of graphics processing engines 431 to 432, N. These resources may be further divided into "slices," which are allocated to these VMs and / or applications based on processing requirements and priorities associated with different VMs and / or applications.
[0085] Therefore, the accelerator integrated circuit 436 acts as a bridge to the system for the graphics acceleration module 446, and provides address translation and system memory caching services. Additionally, the accelerator integrated circuit 436 can provide virtualization facilities for the host processor to manage virtualization of the graphics processing engines, interrupts, and memory management.
[0086] Because the hardware resources of graphics processing engines 431 to 432, N are explicitly mapped to the real address space seen by the host processor 407, any host processor can directly address these resources using effective address values. In one embodiment, one function of the accelerator integrated circuit 436 is the physical separation of the graphics processing engines 431 to 432, N, so that they appear to the system as independent units.
[0087] As mentioned, in the illustrated embodiment, one or more graphics memories 433 to 434, M are coupled to each of the graphics processing engines 431 to 432, N, respectively. Graphics memories 433 to 434, M store instructions and data processed by each of the graphics processing engines 431 to 432, N. Graphics memories 433 to 434, M can be volatile memories, such as DRAM (including stacked DRAM), GDDR memory (e.g., GDDR5, GDDR6), or HBM, and / or can be non-volatile memories, such as 3D XPoint or Nano-RAM.
[0088] In one embodiment, to reduce data traffic on link 440, a biasing technique is used to ensure that the data stored in graphics memories 433 to 434, M is the data that will be used most frequently by graphics processing engines 431 to 432, N and preferably not used (at least not frequently) by cores 460A to 460D. Similarly, the biasing mechanism attempts to store the data required by the cores (and preferably not by graphics processing engines 431 to 432, N) in the caches 462A to 462D, 456 of these cores and in system memory 411.
[0089] Figure 4C illustrates another embodiment, in which the accelerator integrated circuit 436 is integrated within the processor 407. In this embodiment, graphics processing engines 431 to 432, N communicate directly with the accelerator integrated circuit 436 over high-speed link 440 via interfaces 437 and 435 (which, again, can utilize any form of bus or interface protocol). The accelerator integrated circuit 436 can perform the same operations as those described with respect to Figure 4B, but potentially at a higher throughput given its close proximity to the coherence bus 464 and caches 462A to 462D, 426.
[0090] One embodiment supports different programming models, including a dedicated process programming model (without graphics acceleration module virtualization) and a shared programming model (with virtualization). The latter may include a programming model controlled by accelerator integrated circuit 436 and a programming model controlled by graphics acceleration module 446.
[0091] In one embodiment of the dedicated process model, graphics processing engines 431 to 432, N are dedicated to a single application or process within a single operating system. A single application can funnel requests from other applications to graphics engines 431 to 432, N, thereby providing virtualization within a VM / partition.
[0092] In a dedicated process programming model, graphics processing engines 431 to 432, N can be shared by multiple VM / application partitions. This shared model requires a hypervisor to virtualize graphics processing engines 431 to 432, N to allow access by each operating system. For single-partition systems without a hypervisor, graphics processing engines 431 to 432, N are owned by the operating system. In both cases, the operating system can virtualize graphics processing engines 431 to 432, N to provide access to each process or application.
[0093] For a shared programming model, the graphics acceleration module 446 or individual graphics processing engines 431 to 432, N use process handles to select process elements. In one embodiment, process elements are stored in system memory 411 and can be addressed using the effective address to real address translation techniques described herein. The process handle can be an implementation-specific value provided to the host process when registering its context with the graphics processing engines 431 to 432, N (i.e., invoking system software to add process elements to the process element linked list). The lower 16 bits of the process handle can be the offset of the process element within the process element linked list.
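The use of the lower 16 bits of a process handle as an offset can be illustrated in Python as follows. Only the 16-bit offset comes from the paragraph above; the helper names and the treatment of the upper bits as an opaque value are assumptions, since the handle is described as implementation-specific.

```python
OFFSET_MASK = 0xFFFF  # lower 16 bits, per the paragraph above


def element_offset(process_handle):
    """Extract the process element offset from a process handle."""
    return process_handle & OFFSET_MASK


def make_handle(upper_bits, offset):
    """Build a handle; the meaning of the upper bits is implementation-specific."""
    if not 0 <= offset <= OFFSET_MASK:
        raise ValueError("offset must fit in 16 bits")
    return (upper_bits << 16) | offset
```

For example, a handle of 0x12345678 yields an offset of 0x5678 into the process element linked list, regardless of what the upper bits encode.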
[0094] Figure 4D illustrates an exemplary accelerator integration slice 490. As used herein, a “slice” comprises a designated portion of the processing resources of the accelerator integrated circuit 436. The application effective address space 482 within system memory 411 stores process elements 483. In one embodiment, process elements 483 are stored in response to GPU calls 481 from applications 480 executing on processor 407. A process element 483 contains the process state of the corresponding application 480. A work descriptor (WD) 484 contained in the process element 483 may be a single job requested by the application, or may contain a pointer to a queue of jobs. In the latter case, WD 484 is a pointer to the job request queue in the application's address space 482.
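The two forms a WD can take, a single job or a reference to a queue of job requests, can be sketched in Python as follows. The class name `WorkDescriptor`, the use of strings as jobs, and the use of a `deque` object in place of a pointer into address space 482 are assumptions for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Optional


@dataclass
class WorkDescriptor:
    """Either one job, or a reference to a queue of job requests."""
    job: Optional[str] = None
    queue: Optional[deque] = None  # stands in for a pointer into address space 482


def next_job(wd):
    """Return the next job described by the WD, or None when nothing is left."""
    if wd.queue is not None:
        return wd.queue.popleft() if wd.queue else None
    job, wd.job = wd.job, None
    return job
```

A consumer of the WD can call `next_job` without knowing which form the application chose, which mirrors how the WD hides that choice from the engine that fetches it.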
[0095] The graphics acceleration module 446 and / or individual graphics processing engines 431 to 432, N can be shared by all processes or a subset of processes in the system. Embodiments of the invention include infrastructure for setting process states and sending WD 484 to the graphics acceleration module 446 to initiate operations in a virtualized environment.
[0096] In one implementation, the dedicated process programming model is implementation-specific. In this model, a single process owns either the graphics acceleration module 446 or an individual graphics processing engine 431. Since the graphics acceleration module 446 is owned by a single process, when the graphics acceleration module 446 is assigned, the hypervisor initializes the accelerator integrated circuit 436 for the owning partition, and the operating system initializes the accelerator integrated circuit 436 for the owning process.
[0097] In operation, the WD acquisition unit 491 in the accelerator integration slice 490 acquires the next WD 484, which includes an indication of the work to be performed by one of the graphics processing engines of the graphics acceleration module 446. Data from the WD 484 may be stored in registers 445 and used by the MMU 439, interrupt management circuitry 447, and / or context management circuitry 448 as shown. For example, one embodiment of the MMU 439 includes segment / page walk circuitry for accessing segment / page tables 486 within the OS virtual address space 485. The interrupt management circuitry 447 may handle interrupt events 492 received from the graphics acceleration module 446. When performing graphics operations, the MMU 439 translates an effective address 493 generated by the graphics processing engines 431 to 432, N into a real address.
[0098] In one embodiment, the same set of registers 445 is duplicated for each graphics processing engine 431 to 432, N and / or graphics acceleration module 446, and these registers can be initialized by a hypervisor or operating system. Each of these duplicated registers may be included in an accelerator integration slice 490. Exemplary registers that can be initialized by a hypervisor are shown in Table 1.
[0099] Table 1 - Registers initialized by the hypervisor
[0100]
1 Slice Control Register
2 Real Address (RA) Scheduled Processes Area Pointer
3 Authority Mask Override Register
4 Interrupt Vector Table Entry Offset
5 Interrupt Vector Table Entry Limit
6 Status Register
7 Logical Partition ID
8 Real Address (RA) Hypervisor Accelerator Utilization Record Pointer
9 Storage Description Register
[0101] Table 2 shows exemplary registers that can be initialized by the operating system.
[0102] Table 2 - Registers initialized by the operating system
[0103]
1 Process and Thread Identification
2 Effective Address (EA) Context Save / Restore Pointer
3 Virtual Address (VA) Accelerator Utilization Record Pointer
4 Virtual Address (VA) Storage Segment Table Pointer
5 Authority Mask
6 Work Descriptor
[0104] In one embodiment, each WD 484 is specific to a particular graphics acceleration module 446 and / or graphics processing engines 431 to 432, N. It contains all the information required for the graphics processing engines 431 to 432, N to complete their work, or it may be a pointer to a memory location where the application has set up a command queue of tasks to be completed.
[0105] Figure 4E illustrates additional details of one embodiment of the shared model. This embodiment includes a hypervisor real address space 498 in which a process element list 499 is stored. The hypervisor real address space 498 is accessible via a hypervisor 496, which virtualizes the graphics acceleration module engines for the operating system 495.
[0106] The shared programming model allows all processes or subsets of processes from all partitions or subsets of partitions in the system to use the graphics acceleration module 446. Two programming models exist where the graphics acceleration module 446 is shared by multiple processes and partitions: time-slice sharing and graphics-directed sharing.
[0107] In this model, the hypervisor 496 owns the graphics acceleration module 446 and makes its functionality available to all operating systems 495. For the graphics acceleration module 446 to support virtualization by the hypervisor 496, the graphics acceleration module 446 may meet the following requirements: 1) An application's job request must be autonomous (i.e., no state needs to be maintained between jobs), or the graphics acceleration module 446 must provide a context save and restore mechanism. 2) The graphics acceleration module 446 guarantees that an application's job request completes within a specified amount of time, including any translation faults, or the graphics acceleration module 446 provides the ability to preempt the processing of a job. 3) The graphics acceleration module 446 must guarantee fairness among processes when operating in the directed shared programming model.
[0108] In one embodiment, for the shared model, application 480 is required to make an operating system 495 system call with a graphics acceleration module 446 type, a work descriptor (WD), an authority mask register (AMR) value, and a context save / restore region pointer (CSRP). The graphics acceleration module 446 type describes the target acceleration function for the system call and can be a system-specific value. The WD is formatted specifically for the graphics acceleration module 446 and can take the form of a graphics acceleration module 446 command, an effective address pointer to a user-defined structure, an effective address pointer to a command queue, or any other data structure describing the work to be performed by the graphics acceleration module 446. In one embodiment, the AMR value is the AMR state to be used for the current process. The value passed to the operating system is similar to an application setting the AMR. If the accelerator integrated circuit 436 and graphics acceleration module 446 implementations do not support a User Authority Mask Override Register (UAMOR), the operating system can apply the current UAMOR value to the AMR value before passing the AMR in the hypervisor call. Optionally, hypervisor 496 may apply the current Authority Mask Override Register (AMOR) value before placing the AMR into process element 483. In one embodiment, the CSRP is one of registers 445 and contains the effective address of a region in the application's address space 482 that the graphics acceleration module 446 uses to save and restore context state. This pointer is optional if no state needs to be saved between jobs or when a job is preempted. The context save / restore region may be pinned system memory.
[0109] Upon receiving a system call, the operating system 495 verifies that application 480 has been registered and granted permission to use the graphics acceleration module 446. Then, the operating system 495 uses the information shown in Table 3 to invoke the hypervisor 496.
[0110] Table 3 - OS to hypervisor call parameters
[0111]
1 Work Descriptor (WD)
2 Authority Mask Register (AMR) value (potentially masked)
3 Effective Address (EA) Context Save / Restore Region Pointer (CSRP)
4 Process ID (PID) and optional Thread ID (TID)
5 Virtual Address (VA) Accelerator Utilization Record Pointer (AURP)
6 Virtual Address (VA) Storage Segment Table Pointer (SSTP)
7 Logical Interrupt Service Number (LISN)
[0112] Upon receiving a hypervisor call, hypervisor 496 verifies that operating system 495 is registered and has been granted permission to use graphics acceleration module 446. Then, hypervisor 496 places process element 483 into a linked list of process elements corresponding to graphics acceleration module 446 type. Process elements may include the information shown in Table 4.
[0113] Table 4 - Process element information
[0114]
1 Work Descriptor (WD)
2 Authority Mask Register (AMR) value (potentially masked)
3 Effective Address (EA) Context Save / Restore Region Pointer (CSRP)
4 Process ID (PID) and optional Thread ID (TID)
5 Virtual Address (VA) Accelerator Utilization Record Pointer (AURP)
6 Virtual Address (VA) Storage Segment Table Pointer (SSTP)
7 Logical Interrupt Service Number (LISN)
8 Interrupt vector table, derived from the hypervisor call parameters
9 Status Register (SR) value
10 Logical Partition ID (LPID)
11 Real Address (RA) Hypervisor Accelerator Utilization Record Pointer
12 Storage Description Register (SDR)
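The flow from hypervisor call to process element list can be pictured with the following Python sketch. The field names abbreviate the entries of Tables 3 and 4; the class names, the use of a Python list in place of a linked list, the placeholder status register value, and the returned index are all assumptions made for illustration rather than details of the embodiments.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class HypervisorCallParams:
    """Fields of Table 3, passed from the operating system to the hypervisor."""
    wd: int
    amr: int
    csrp: int
    pid: int
    tid: Optional[int]
    aurp: int
    sstp: int
    lisn: int


@dataclass
class ProcessElement:
    """Table 3 fields plus some of the hypervisor-supplied fields of Table 4."""
    params: HypervisorCallParams
    status_register: int
    lpid: int


@dataclass
class Hypervisor:
    registered_lpids: set
    # One process element list per graphics acceleration module type.
    lists: Dict[str, List[ProcessElement]] = field(default_factory=dict)

    def add_process_element(self, module_type, lpid, params):
        # Verify that the calling operating system (partition) is registered.
        if lpid not in self.registered_lpids:
            raise PermissionError("operating system is not registered")
        element = ProcessElement(params, status_register=0, lpid=lpid)
        self.lists.setdefault(module_type, []).append(element)
        return len(self.lists[module_type]) - 1  # index used as a stand-in offset
```

The sketch keeps the two steps described above separate: the permission check happens first, and only then is the process element added to the list for the requested module type.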
[0115] In one embodiment, the hypervisor initializes multiple accelerator integration slice 490 registers 445.
[0116] As shown in Figure 4F, one embodiment of the invention employs a unified memory addressable via a common virtual memory address space used to access physical processor memories 401-402 and GPU memories 420-423. In this implementation, operations executed on GPUs 410-413 use the same virtual / effective memory address space to access processor memories 401-402 and vice versa, thereby simplifying programmability. In one embodiment, a first portion of the virtual / effective address space is allocated to processor memory 401, a second portion to the second processor memory 402, a third portion to GPU memory 420, and so on. The entire virtual / effective memory space (sometimes referred to as the effective address space) is thereby distributed across each of processor memories 401-402 and GPU memories 420-423, allowing any processor or GPU to access any physical memory using a virtual address mapped to that memory.
[0117] In one embodiment, bias / coherence management circuitry 494A to 494E within one or more of the MMUs 439A to 439E ensures cache coherence between the caches of the host processor (e.g., 405) and the GPUs 410 to 413, and implements biasing techniques that indicate the physical memories in which certain types of data should be stored. Although several instances of bias / coherence management circuitry 494A to 494E are shown in Figure 4F, the bias / coherence circuitry can be implemented within the MMU of one or more host processors 405 and / or within the accelerator integrated circuit 436.
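The partitioning of one virtual / effective address space across several physical memories can be illustrated with a short lookup. This is only a sketch: the base addresses and sizes below are invented for the example and do not come from the embodiments, and `resolve` is a hypothetical helper name.

```python
from bisect import bisect_right

# ASSUMPTION: illustrative layout; each memory owns one contiguous slice.
REGIONS = [  # (base address, size in bytes, backing memory)
    (0x0000_0000, 0x4000_0000, "processor memory 401"),
    (0x4000_0000, 0x4000_0000, "processor memory 402"),
    (0x8000_0000, 0x2000_0000, "GPU memory 420"),
    (0xA000_0000, 0x2000_0000, "GPU memory 421"),
]
_BASES = [base for base, _, _ in REGIONS]


def resolve(vaddr):
    """Map a virtual address to (backing memory, offset within that memory)."""
    index = bisect_right(_BASES, vaddr) - 1
    if index < 0:
        raise ValueError("address below the mapped range")
    base, size, name = REGIONS[index]
    if vaddr >= base + size:
        raise ValueError("address above the mapped range")
    return name, vaddr - base
```

Because every requester uses the same table, a processor and a GPU that issue the same virtual address reach the same physical memory.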
[0118] One embodiment allows GPU-attached memories 420 to 423 to be mapped as part of system memory and accessed using shared virtual memory (SVM) technology, without suffering the typical performance drawbacks associated with full system cache coherence. This ability to access GPU-attached memories 420 to 423 as system memory without the heavy overhead of cache coherence provides a beneficial operating environment for GPU offloading. This arrangement allows host processor 405 software to set operands and access computation results without the overhead of traditional I / O DMA data copying. Such traditional copying involves driver calls, interrupts, and memory-mapped I / O (MMIO) accesses, all of which are inefficient compared to simple memory access. Meanwhile, the ability to access GPU-attached memories 420 to 423 without cache coherence overhead can be critical to the execution time of offloaded computations. In cases with substantial streaming write memory traffic, for example, cache coherence overhead can significantly reduce the effective write bandwidth seen by GPUs 410 to 413. The efficiency of operand setting, the efficiency of result access, and the efficiency of GPU computation all play a role in determining the effectiveness of GPU offloading.
[0119] In one implementation, the choice between GPU bias and host processor bias is driven by a bias tracker data structure. A bias table can be used, for example, which could be a page-granular structure comprising 1 or 2 bits per GPU-attached memory page (i.e., controlled at the memory page level). The bias table can be implemented using one or more stolen memory ranges of GPU-attached memory 420-423, with or without a bias cache in GPUs 410-413 (e.g., for caching frequently used / recently used entries of the bias table). Alternatively, the entire bias table can be kept within the GPU.
[0120] In one implementation, the bias table entry associated with each access to GPU-attached memory 420-423 is accessed before the actual access to the GPU memory, resulting in the following operations. First, local requests from GPUs 410-413 that find their page in GPU bias are forwarded directly to the corresponding GPU memory 420-423. Local requests from the GPU that find their page in host bias are forwarded to processor 405 (e.g., via a high-speed link as discussed above). In one embodiment, requests from processor 405 that find the requested page in host processor bias complete like a normal memory read. Alternatively, requests directed to a GPU-biased page can be forwarded to GPUs 410-413. The GPU can then transition the page to host processor bias if it is not currently using the page.
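The routing decision driven by the bias table can be sketched as below. This is a simplified model under stated assumptions: one bit per page (the text allows 1 or 2), a 4096-byte page size, and hypothetical names (`BiasTable`, `route`) and return labels chosen for readability.

```python
PAGE_SIZE = 4096  # ASSUMPTION: page size is not given in the text


class BiasTable:
    """Page-granular table: bit = 1 means GPU bias, bit = 0 means host bias."""

    def __init__(self, num_pages):
        self.bits = bytearray((num_pages + 7) // 8)

    def is_gpu_biased(self, page):
        return bool((self.bits[page >> 3] >> (page & 7)) & 1)

    def set_gpu_bias(self, page, gpu_biased):
        mask = 1 << (page & 7)
        if gpu_biased:
            self.bits[page >> 3] |= mask
        else:
            self.bits[page >> 3] &= ~mask & 0xFF


def route(table, requester, address):
    """Consult the bias entry before the access and pick a destination."""
    gpu_biased = table.is_gpu_biased(address // PAGE_SIZE)
    if requester == "gpu":
        # GPU-local request: served locally only when the page is GPU-biased.
        return "gpu_memory" if gpu_biased else "forward_to_host"
    if requester == "host":
        # Host request: completes like a normal read when host-biased.
        return "forward_to_gpu" if gpu_biased else "normal_memory_read"
    raise ValueError("unknown requester")
```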
[0121] The page bias state can be changed by a software-based mechanism, a hardware-assisted software-based mechanism, or a purely hardware-based mechanism for a limited set of cases.
[0122] One mechanism for changing the bias state employs an API call (e.g., OpenCL) that in turn calls the GPU's device driver, which then sends a message (or enqueues a command descriptor) to the GPU directing it to change the bias state and, for some transitions, to perform a cache flushing operation in the host. The cache flushing operation is required for a transition from host processor 405 bias to GPU bias, but not for the reverse transition.
[0123] In one embodiment, cache coherence is maintained by temporarily rendering GPU-biased pages uncacheable by the host processor 405. To access these pages, the processor 405 may request access from the GPU 410, which may or may not grant access right away, depending on the implementation. Therefore, to reduce communication between the processor 405 and the GPU 410, it is advantageous to ensure that GPU-biased pages are those required by the GPU but not by the host processor 405, and vice versa.
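The asymmetry of the flush requirement can be shown in a few lines. This is a hedged sketch, not a driver interface: `change_bias` and the `flush_host_cache` callback are hypothetical names, and the page state is modeled as a plain dictionary.

```python
def change_bias(page_state, page, target, flush_host_cache):
    """Switch one page to `target` bias ("host" or "gpu").

    Returns True when a host cache flush was performed for the transition.
    """
    if target not in ("host", "gpu"):
        raise ValueError("target must be 'host' or 'gpu'")
    current = page_state[page]
    if current == target:
        return False
    flushed = False
    if current == "host" and target == "gpu":
        # Only the host -> GPU transition requires flushing the host cache.
        flush_host_cache(page)
        flushed = True
    page_state[page] = target
    return flushed
```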
[0124] Graphics processing pipeline
[0125] Figure 5 shows a graphics processing pipeline 500 according to an embodiment. In one embodiment, a graphics processor may implement the illustrated graphics processing pipeline 500. The graphics processor may be included within a parallel processing subsystem as described herein, such as the parallel processor 200 of Figure 2, which in one embodiment is a variant of the parallel processor(s) 112 of Figure 1. Various parallel processing systems can implement the graphics processing pipeline 500 via one or more instances of a parallel processing unit as described herein (e.g., parallel processing unit 202 of Figure 2). For example, a shader unit (e.g., graphics multiprocessor 234 of Figure 3) can be configured to perform the functions of one or more of the vertex processing unit 504, tessellation control processing unit 508, tessellation evaluation processing unit 512, geometry processing unit 516, and fragment / pixel processing unit 524. The functions of the data assembler 502, primitive assemblers 506, 514, 518, tessellation unit 510, rasterizer 522, and raster operation unit 526 can also be performed by other processing engines within a processing cluster (e.g., processing cluster 214 of Figure 3) and a corresponding partitioning unit (e.g., partitioning units 220A to 220N of Figure 2). The graphics processing pipeline 500 can also be implemented using dedicated processing units for one or more functions. In one embodiment, one or more portions of the graphics processing pipeline 500 may be executed by parallel processing logic within a general-purpose processor (e.g., a CPU). In one embodiment, one or more portions of the graphics processing pipeline 500 may access on-chip memory (e.g., parallel processor memory 222 as in Figure 2) via a memory interface 528, which may be an instance of memory interface 218 of Figure 2.
[0126] In one embodiment, the data assembler 502 is a processing unit that collects vertex data for surfaces and primitives. The data assembler 502 then outputs the vertex data, including vertex attributes, to the vertex processing unit 504. The vertex processing unit 504 is a programmable execution unit that executes vertex shader programs, lighting and transforming the vertex data as specified by the vertex shader program. The vertex processing unit 504 reads data stored in cache, local memory, or system memory for use in processing the vertex data, and can be programmed to transform the vertex data from an object-based coordinate representation to a world-space coordinate space or a normalized device coordinate space.
[0127] The first instance of primitive assembler 506 receives vertex attributes from vertex processing unit 504. Primitive assembler 506 reads the stored vertex attributes as needed and constructs graphical primitives for processing by tessellation control processing unit 508. Graphical primitives include triangles, lines, points, patches, etc., supported by various graphics processing application programming interfaces (APIs).
[0128] The tessellation control processing unit 508 treats the input vertices as control points for a geometric patch. The control points are transformed from an input representation of the patch (e.g., the patch's bases) to a representation suitable for use in surface evaluation by the tessellation evaluation processing unit 512. The tessellation control processing unit 508 can also compute tessellation factors for the edges of the geometric patch. A tessellation factor applies to a single edge and quantifies the view-dependent level of detail associated with that edge. The tessellation unit 510 is configured to receive the tessellation factors for the edges of the patch and to subdivide the patch into multiple geometric primitives, such as lines, triangles, or quadrilaterals, which are then transmitted to the tessellation evaluation processing unit 512. The tessellation evaluation processing unit 512 operates on the parametric coordinates of the subdivided patch to generate a surface representation and vertex attributes for each vertex associated with the geometric primitives.
[0129] A second instance of the primitive assembler 514 receives vertex attributes from the tessellation evaluation processing unit 512, reads stored vertex attributes as needed, and constructs graphical primitives for processing by the geometry processing unit 516. The geometry processing unit 516 is a programmable execution unit that executes a geometry shader program to transform the graphical primitives received from the primitive assembler 514 as specified by the geometry shader program. In one embodiment, the geometry processing unit 516 is programmed to further subdivide the graphical primitives into one or more new graphical primitives and calculate parameters for rasterizing the new graphical primitives.
[0130] In some embodiments, the geometry processing unit 516 may add or remove elements in the geometry stream. The geometry processing unit 516 outputs parameters and vertices specifying new graphics primitives to the primitive assembler 518. The primitive assembler 518 receives the parameters and vertices from the geometry processing unit 516 and constructs graphics primitives for processing by the viewport scaling, culling, and clipping unit 520. The geometry processing unit 516 reads data stored in parallel processor memory or system memory for use in processing the geometry data. The viewport scaling, culling, and clipping unit 520 performs clipping, culling, and viewport scaling and outputs the processed graphics primitives to the rasterizer 522.
[0131] Rasterizer 522 can perform depth culling and other depth-based optimizations. Rasterizer 522 also performs scan conversion on the new graphics primitives to generate fragments and outputs those fragments and associated coverage data to fragment / pixel processing unit 524. Fragment / pixel processing unit 524 is a programmable execution unit configured to execute fragment shader programs or pixel shader programs. Fragment / pixel processing unit 524 transforms fragments or pixels received from rasterizer 522 as specified by the fragment or pixel shader program. For example, fragment / pixel processing unit 524 can be programmed to perform operations including but not limited to texture mapping, shading, blending, texture correction, and perspective correction to produce shaded fragments or pixels that are output to raster operation unit 526. Fragment / pixel processing unit 524 can read data stored in parallel processor memory or system memory for use in processing the fragment data. Fragment or pixel shader programs can be configured to shade at sample, pixel, tile, or other granularities depending on the sampling rate configured for the processing unit.
[0132] Raster operation unit 526 is a processing unit that performs raster operations including but not limited to stencil, z-test, blending, and the like, and outputs pixel data as processed graphics data to be stored in graphics memory (e.g., parallel processor memory 222 as in Figure 2 and / or system memory 104 as in Figure 1), to be displayed on one or more display devices 110, or for further processing by one of the one or more processors 102 or parallel processor(s) 112. In some embodiments, the raster operation unit 526 is configured to compress z or color data that is written to memory and decompress z or color data that is read from memory.
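The order in which data moves through the units of graphics processing pipeline 500, as described in the preceding paragraphs, can be summarized in a small lookup. This is merely a reading aid: the table restates the unit numbers from the text, and `downstream` is a hypothetical helper name.

```python
# Stage order as described for graphics processing pipeline 500.
PIPELINE_500 = [
    (502, "data assembler"),
    (504, "vertex processing unit"),
    (506, "primitive assembler"),
    (508, "tessellation control processing unit"),
    (510, "tessellation unit"),
    (512, "tessellation evaluation processing unit"),
    (514, "primitive assembler"),
    (516, "geometry processing unit"),
    (518, "primitive assembler"),
    (520, "viewport scaling, culling, and clipping unit"),
    (522, "rasterizer"),
    (524, "fragment / pixel processing unit"),
    (526, "raster operation unit"),
]


def downstream(unit):
    """Return the (number, name) of the stage fed by `unit`, or None at the end."""
    numbers = [number for number, _ in PIPELINE_500]
    index = numbers.index(unit)  # raises ValueError for an unknown unit
    if index + 1 < len(PIPELINE_500):
        return PIPELINE_500[index + 1]
    return None
```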
[0133] Using overlay for mixed reality coding
[0134] Figure 6 shows a general block diagram for encoding real-world content and rendered content. Camera 602 captures real-world content and sends it to compositor 606. Real-world content can include images, pictures, videos, or other visual real-world content. Compositor 606 also receives rendered content 604 generated via a graphics pipeline of a graphics processing unit (GPU) (see, e.g., graphics pipeline 500 in Figure 5). The rendered content 604 can include graphics. The real-world content, such as input video, and the graphics generated by the graphics pipeline are composited together and sent to the encoder 608. The encoder 608 encodes the input data by converting it into a normalized or compressed format and sends the output data / signals over a communication link to a client (not shown) for display.
[0135] Content delivered by the above process offers little or no flexibility. Compositing real-world content with rendered content and then encoding the data before sending it to the client prevents the client from choosing which content to display and which to omit. The information sent to the client is exactly what the client will see. In other words, the transmitted data cannot be modified or changed.
[0136] High Efficiency Video Coding (HEVC), or ITU-T Rec. H.265, is a video compression standard from the International Telecommunication Union (ITU), Place des Nations, 1211 Geneva 20, Switzerland. HEVC has a special extension called HEVC overlay auxiliary pictures. With this extension, real-world content such as pictures, images, videos, or other visual content is encoded using conventional coding. In one embodiment, the real-world content is encoded into the base layer, or layer 0, and is primarily intended for display. HEVC overlay auxiliary pictures are pictures typically used for auxiliary purposes; they provide additional layers beyond the base layer. An HEVC overlay auxiliary picture can be used as an overlay on the pictures, images, videos, or other visual content in the base layer. Another HEVC overlay auxiliary picture can represent the layout of where objects within an overlay on another layer will be placed; this can be called mapping data. Yet another type of HEVC overlay auxiliary picture can represent the transparency or blending of overlay objects from one or more overlay auxiliary pictures; this type can be called alpha data. Each HEVC overlay auxiliary picture is identified by a layer ID value, and the overlay auxiliary picture for each layer ID is encoded separately using an HEVC encoder. Having a separate layer for each overlay auxiliary picture allows for better compression efficiency and gives the client more flexibility in deciding whether to display the overlay auxiliary picture.
[0137] In one embodiment, overlay auxiliary images are used to process mixed reality content. Figure 7 shows an exemplary block diagram 700 for encoding mixed reality content as an overlay auxiliary image according to an embodiment. Diagram 700 includes a camera 702 and a semiconductor package device coupled to the camera 702. The semiconductor package device includes a substrate and logic coupled to the substrate. In one embodiment, the substrate may be silicon. Other substrates will be appreciated by those skilled in the art. The logic includes a graphics pipeline (e.g., as shown in Figure 5), multiple encoders 706, 708, 710, and 712, and a multiplexer 714. Each of the encoders 706, 708, 710, and 712 is coupled to the multiplexer 714.
[0138] As shown in Figure 7, camera 702 provides real-world input to encoder 706 at the base layer, or layer 0. Encoder 706 can be a conventional encoder. An overlay auxiliary image is used for the rendered content 704, and therefore an HEVC encoder is needed to encode the rendered content as an overlay auxiliary image. The rendered content 704 is input to encoder 708 at the first non-base layer, or layer 1. The rendered content 704 is rendered using a graphics pipeline (e.g., as shown in Figure 5). The rendered content 704 includes mixed reality content. Multiplexer 714 interleaves the encoded real-world content from the base layer with the encoded mixed reality content from the first non-base layer to obtain a single output signal with mixed reality content. Mapping data 716 associated with the rendered content 704 is input to encoder 710 at the second non-base layer, or layer 2. Mapping data 716 is encoded as an overlay auxiliary image. Mapping data 716 provides an overlay layout for the rendered content 704. Alpha data 718 associated with the rendered content 704 is input to encoder 712 at the third non-base layer, or layer 3. Alpha data 718 is also encoded as an overlay auxiliary image. Alpha data 718 can be used for blending, indicating the transparency of the overlay auxiliary image relative to the encoded real-world content from the base layer. Alpha data 718 is drawn with dashed lines to show that it is an optional input.
[0139] The embodiments are not limited to one set of rendered content 704, mapping data 716, and alpha data 718. Although not shown in Figure 7, additional rendered content can be used by adding additional layers (overlay auxiliary images) for the rendered content, the mapping data, and optionally the alpha data, respectively.
[0140] Multiplexer 714 continues by interleaving the second non-base layer, or layer 2 (encoded mapping data), and the third non-base layer, or layer 3 (encoded alpha data), with the first non-base layer and the base layer to form a single output signal containing the encoded bitstream. The single output signal is sent to a client device, which decodes and displays the content.
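The encoder-side arrangement of Figure 7, with one encoder per layer feeding a multiplexer, can be sketched as follows. This is a toy model under loud assumptions: `zlib.compress` merely stands in for the per-layer encoders and is not HEVC, and the packet framing (a 1-byte layer ID followed by a 4-byte length) is invented for the example rather than taken from the HEVC bitstream syntax.

```python
import struct
import zlib

LAYER_BASE = 0      # real-world content (encoder 706)
LAYER_RENDERED = 1  # rendered mixed reality content (encoder 708)
LAYER_MAPPING = 2   # mapping data (encoder 710)
LAYER_ALPHA = 3     # optional alpha data (encoder 712)

HEADER = struct.Struct(">BI")  # hypothetical framing: layer ID, payload length


def encode_layer(raw):
    """Stand-in for a per-layer encoder; a real system would use HEVC here."""
    return zlib.compress(raw)


def multiplex(layers):
    """Encode each layer independently and interleave into one output signal.

    `layers` maps layer ID to raw bytes; the alpha layer may be absent.
    """
    if LAYER_BASE not in layers:
        raise ValueError("the base layer is required")
    out = bytearray()
    for layer_id in sorted(layers):
        payload = encode_layer(layers[layer_id])
        out += HEADER.pack(layer_id, len(payload)) + payload
    return bytes(out)
```

Because every layer is encoded on its own and only tagged when interleaved, a receiver can separate the layers again and decide per layer whether to use it.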
[0141] In one or more embodiments, the client device may be co-located with the transmitting device. For example, the transmitting device may be a personal computer, and the client device may be a wireless head-mounted display (HMD). In another embodiment, the client device may be a laptop computer co-located with the transmitting device, wherein the transmitting device is a desktop computer. In yet another embodiment, the client device may be an HMD, and the transmitting device may be a mobile device, such as, for example, a smartphone. In other embodiments, the transmitting device may communicate with the client device via a network in a manner well known to those skilled in the art.
[0142] Figure 8 shows an exemplary flowchart of a method for encoding mixed reality content as an overlay auxiliary image according to an embodiment. The illustrated method can be implemented as one or more modules in a set of logic instructions stored in a non-transitory machine- or computer-readable storage medium such as random access memory (RAM), read-only memory (ROM), programmable ROM (PROM), firmware, flash memory, etc.; in configurable logic such as, for example, a programmable logic array (PLA), a field-programmable gate array (FPGA), or a complex programmable logic device (CPLD); in fixed-function hardware logic using circuit technologies such as, for example, application-specific integrated circuit (ASIC), complementary metal-oxide-semiconductor (CMOS), or transistor-transistor logic (TTL) technology; or in any combination thereof.
[0143] The process begins at block 802 and proceeds directly to block 804. In block 804, real-world content is encoded into a base layer, or layer 0, using a standard encoder. In one embodiment, real-world content may include video captured from a camera. Real-world content is not limited to captured video but may also include images, pictures, and other visual real-world data. The process then proceeds to block 806.
[0144] In block 806, the rendered content is encoded into a first non-base layer, or layer 1, using an HEVC encoder. The rendered content is encoded as the layer 1 overlay auxiliary image. The process then proceeds to block 808.
[0145] In block 808, the base layer is interleaved with the first non-base layer to obtain a single output signal with multiple signals. The first signal is the encoded real-world content. The second signal is the mixed reality content encoded as an overlay auxiliary image. The process then proceeds to block 810.
[0146] In block 810, the mapping data is encoded into a second non-base layer, or layer 2, using an HEVC encoder. The mapping data is encoded as the layer 2 overlay auxiliary image. The process then proceeds to block 812.
[0147] In block 812, the second non-base layer is interleaved with the first non-base layer and the base layer to further maintain a single output signal with multiple signals. Now, the multiple signals include the addition of mapping data. The process proceeds to optional blocks 814 and 816, or to block 818 if optional blocks 814 and 816 will not be performed.
[0148] In optional block 814, the alpha data is encoded into a third non-base layer, or layer 3, using an HEVC encoder. The alpha data is encoded as the layer 3 overlay auxiliary image. The process then proceeds to optional block 816, where the third non-base layer is interleaved with the second non-base layer, the first non-base layer, and the base layer to further maintain a single output signal with multiple signals. Now, the multiple signals include the addition of the alpha data. The process then proceeds to block 818.
[0149] In block 818, the single output signal having multiple signals is transmitted to the client via a communication link. If the transmission is local, the single output signal can be transmitted to the client via Bluetooth or Wi-Fi. Bluetooth and Wi-Fi are well known to those skilled in the art. If the transmission is remote, the single output signal can be transmitted to the client via a network. The transmission is performed in a manner well known to those skilled in the art.
[0150] After the single output signal is received by the client device, the client device must individually recover each of the multiple signals and decode them for display on the client device. Before display, the decoded signals are sent to a compositor, which combines them. In one embodiment, the user has better interactivity because the user can select the overlay auxiliary images to display and deselect any overlay auxiliary images not wanted. An overlay that the user does not select is not combined at the compositor. This gives the user greater flexibility.
[0151] Figure 9 shows an exemplary block diagram 900 for decoding mixed reality content encoded as an overlay auxiliary image according to an embodiment. Diagram 900 includes a display 916 coupled to a semiconductor package device. The semiconductor package device includes a substrate and logic coupled to the substrate. In one embodiment, the substrate may be silicon. Those skilled in the art will understand that other substrates may be used. The logic includes a demultiplexer 904, a plurality of decoders 906, 908, 910, and 912, and a compositor 914. The demultiplexer 904 is coupled to the decoders 906, 908, 910, and 912. The decoders 906, 908, 910, and 912 are coupled to the compositor 914. The compositor 914 is coupled to the display 916.
[0152] Demultiplexer 904 receives an input signal 902 from a transmitting device (e.g., as shown in Figure 7). As previously indicated, the input signal 902 consists of multiple signals interleaved to form a single complex signal. The demultiplexer 904 recovers each signal by separating it from the single complex signal. The demultiplexer 904 separates the base layer signal and sends it to the base layer decoder, identified as layer 0. The base layer signal includes real-world content.
[0153] Demultiplexer 904 separates the first non-base layer signal and sends it to the first non-base layer decoder, identified as layer 1. Layer 1 represents the first overlay auxiliary image layer. The first non-base layer signal includes mixed reality content.
[0154] Demultiplexer 904 separates the second non-base layer signal and sends it to the second non-base layer decoder, identified as layer 2. Layer 2 represents the second overlay auxiliary image layer. The second non-base layer signal includes mapping data.
[0155] Demultiplexer 904 separates the third non-base layer signal and sends it to the third non-base layer decoder, identified as layer 3. Layer 3 represents the third overlay auxiliary image layer. The third non-base layer signal includes alpha data for blending. As previously indicated, the alpha data is optional.
[0156] Decoders 906, 908, 910, and 912 decode the base layer, the first non-base layer, the second non-base layer, and the third non-base layer, respectively. The decoded data from each decoder 906, 908, 910, and 912 is sent to compositor 914.
[0157] Compositor 914 combines the real-world content, which serves as the main image, with the mixed reality content, which serves as an overlay auxiliary image, to form the final composite. Compositor 914 uses information from the mapping data and the alpha data (if used) to place the mixed reality overlay auxiliary image onto the main image containing the real-world content.
[0158] The illustrated compositor 914 includes a selector that allows a user to choose which overlay auxiliary images to include. The user can use the selector to turn an overlay input on or off. If the overlay input is off, the compositor 914 will not include the overlay when the inputs are combined to form the final composite. The final composite is then sent to a display 916. As previously indicated, the display can be a head-mounted display, a laptop display, a desktop display, a mobile device display, or any other display device capable of displaying the combined real-world content and mixed reality content.
[0159] As previously indicated, the embodiments are not limited to one set of rendered content data. Those skilled in the art will understand that additional sets of rendered content data (rendered content, mapping data, and optional alpha data) require additional overlay auxiliary image layers. These additional layers in turn require additional encoders and decoders.
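The client-side separation of layers and the on / off selection of the overlay can be sketched as below. This is an illustrative model only: the input is assumed to be a sequence of already-parsed `(layer_id, payload)` packets, decoding is omitted, and the function names are hypothetical.

```python
from collections import defaultdict


def demultiplex(packets):
    """Separate an interleaved packet sequence into per-layer streams."""
    streams = defaultdict(list)
    for layer_id, payload in packets:
        streams[layer_id].append(payload)
    return dict(streams)


def layers_for_compositor(streams, overlay_enabled):
    """Return the layer IDs that would be combined into the final composite.

    Layer 0 (real-world content) is always used. The overlay (layer 1) and
    its mapping data (layer 2) are used only when the user has the overlay
    switched on; alpha data (layer 3) is added when it is present.
    """
    if 0 not in streams:
        raise ValueError("base layer missing from the received signal")
    selected = [0]
    if overlay_enabled and 1 in streams and 2 in streams:
        selected += [1, 2]
        if 3 in streams:
            selected.append(3)
    return selected
```

With the overlay switched off, only the base layer reaches the compositing step, which is the flexibility that a pre-composited single-layer stream cannot offer.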
[0160] Figure 10 shows an exemplary flowchart of a method for decoding mixed reality content encoded as an overlay auxiliary image according to an embodiment. The illustrated method can be implemented as one or more modules in a set of logic instructions stored in a non-transitory machine- or computer-readable storage medium such as RAM, ROM, PROM, firmware, flash memory, etc.; in configurable logic such as PLA, FPGA, or CPLD; in fixed-function hardware logic using circuit technologies such as ASIC, CMOS, or TTL; or in any combination thereof.
[0161] The process begins at block 1002 and proceeds directly to block 1004. In block 1004, a single complex signal is received. This single complex signal consists of multiple encoded signals interleaved together. The process then proceeds to block 1006.
[0162] In block 1006, a demultiplexer is used to separate the plurality of encoded signals from the single complex signal. The single complex signal is separated into a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal. The base layer signal represents encoded real-world content. The first non-base layer signal represents mixed reality content encoded as the layer 1 overlay auxiliary image. The second non-base layer signal represents mapping data encoded as the layer 2 overlay auxiliary image. Optionally, the third non-base layer signal represents alpha data encoded as the layer 3 overlay auxiliary image. The process then proceeds to block 1008.
[0163] In block 1008, the base layer signal and the three (3) non-base layer signals are decoded. The process then proceeds to block 1010.
[0164] In block 1010, the overlay auxiliary image to be used in the final composite is selected. This is achieved by allowing the user to turn the overlay auxiliary image on and off. The process then proceeds to block 1012.
[0165] In block 1012, the base layer and the selected overlay auxiliary image are composited together to form the final composite, using the mapping data and the alpha data (if blending is selected) to place the mixed reality overlay onto the main image (real-world content). The process then proceeds to block 1014.
[0166] In block 1014, the final composite is presented on a display device, such as, for example, a head-mounted display, a laptop computer, a desktop computer, a mobile device, or another display device capable of displaying real-world content with a mixed reality overlay.
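The compositing step of block 1012 can be illustrated with a small blending routine. This is a sketch under assumptions the text does not fix: images are single-channel lists of rows with float values, the mapping data is reduced to a top-left `(x, y)` placement, and alpha values lie in the range 0.0 to 1.0 with the usual `alpha * overlay + (1 - alpha) * base` blend.

```python
def composite(base, overlay, origin, alpha=None):
    """Place `overlay` onto a copy of `base` at `origin` = (x, y).

    `alpha`, when given, has the same shape as `overlay`; when omitted the
    overlay is treated as fully opaque. Overlay pixels that fall outside the
    base image are skipped.
    """
    x0, y0 = origin
    out = [row[:] for row in base]  # leave the decoded base layer untouched
    for r, overlay_row in enumerate(overlay):
        for c, value in enumerate(overlay_row):
            y, x = y0 + r, x0 + c
            if not (0 <= y < len(out) and 0 <= x < len(out[y])):
                continue
            a = 1.0 if alpha is None else alpha[r][c]
            out[y][x] = a * value + (1.0 - a) * out[y][x]
    return out
```

If the user has switched the overlay off, this step is simply skipped and the base image is displayed as decoded.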
[0167] Display technology
[0168] Turning now to Figure 11, a performance-enhanced computing system 1100 is shown. In the illustrated example, a processor 1110 is coupled to a display 1120. The processor 1110 typically generates images to be displayed on an LCD panel 1150 of the display 1120. In one example, the processor 1110 includes a communication interface such as, for example, a Video Graphics Array (VGA) interface, a DisplayPort (DP) interface, an Embedded DisplayPort (eDP) interface, a High-Definition Multimedia Interface (HDMI), a Digital Visual Interface (DVI), etc. The processor 1110 may be a graphics processor (e.g., a graphics processing unit / GPU) that processes graphics data and generates the images (e.g., video frames, still images) displayed on the LCD panel 1150. Furthermore, the processor 1110 may include one or more image processing pipelines that generate pixel data. The image processing pipelines may conform to the OpenGL architecture or other suitable architectures. Additionally, the processor 1110 may be connected to a host processor (e.g., a central processing unit / CPU), where the host processor may execute one or more device drivers that control and / or interact with the processor 1110.
[0169] The illustrated display 1120 includes a timing controller (TCON) 1130 that can individually address different pixels on the LCD panel 1150 and update each individual pixel on the LCD panel 1150 every refresh cycle. In this regard, the LCD panel 1150 may include multiple liquid crystal elements, such as, for example, liquid crystals and integrated color filters. Each pixel of the LCD panel 1150 may include a triplet of liquid crystal elements, one each with a red, a green, and a blue color filter. The LCD panel 1150 can arrange the pixels in a two-dimensional (2D) array controlled via row drivers 1152 and column drivers 1154 to update the image being displayed by the LCD panel 1150. Therefore, the TCON 1130 can drive the row drivers 1152 and column drivers 1154 to address specific pixels of the LCD panel 1150. The TCON 1130 can also adjust the voltage supplied to the liquid crystal elements in a pixel to change the intensity of the light passing through each of the three liquid crystal elements, and thus change the color of the pixel displayed on the surface of the LCD panel 1150.
[0170] The backlight 1160 may include a plurality of light-emitting elements, such as, for example, light-emitting diodes (LEDs), arranged at the edges of the LCD panel 1150. Accordingly, the light generated by the LEDs may be dispersed through the LCD panel 1150 by a diffuser (not shown). In another example, LEDs are arranged in a 2D array directly behind the LCD panel 1150 in a configuration whereby each LED disperses light through one or more corresponding pixels of the LCD panel 1150 positioned in front of that LED; therefore, this configuration is sometimes referred to as direct backlighting. The light-emitting elements may also include compact fluorescent lamps (CFLs) arranged along one or more edges of the LCD panel 1150. To eliminate multiple edges, the combination of edges may be varied to achieve selective illumination of areas, where fewer than the entire group of lighting elements is used with less power.
[0171] The light-emitting element may also include one or more sheets of electroluminescent material placed behind the LCD panel 1150. In such cases, light from the surface of the sheet can be dispersed through the pixels of the LCD panel 1150. Furthermore, the sheet can be divided into multiple regions, such as, for example, quadrants. In one example, each region is individually controlled to illuminate only a portion of the LCD panel 1150. Other backlighting solutions may also be used.
[0172] The illustrated display 1120 also includes a backlight controller (BLC) 1140 that supplies voltage to the light-emitting elements of the backlight 1160. For example, the BLC 1140 may include a pulse-width modulation (PWM) driver (not shown) to generate a PWM signal that activates at least a portion of the light-emitting elements of the backlight 1160. The duty cycle and frequency of the PWM signal can be set to dim the light generated by the light-emitting elements. For example, a 100% duty cycle may correspond to the light-emitting elements being fully on, while a 0% duty cycle may correspond to the light-emitting elements being fully off. Therefore, intermediate duty cycles (e.g., 25%, 50%) typically keep the light-emitting elements on for the corresponding percentage of the cycle period. The cycle period can be short enough that the flickering of the light-emitting elements is imperceptible to the human eye. Furthermore, the effect on the user may be that the level of light emitted by the backlight 1160 is lower than when the backlight 1160 is fully activated. The BLC 1140 may be separate from or incorporated into the TCON 1130.
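The relationship between duty cycle and perceived backlight level described above can be illustrated with a minimal sketch. The function names, the 100-step period, and the assumption that light output is linear in on-time are illustrative choices and are not taken from the embodiments.

```python
def pwm_waveform(duty_cycle: float, steps_per_period: int = 100) -> list:
    """Return one PWM period as a list of 1 (on) and 0 (off) samples."""
    if not 0.0 <= duty_cycle <= 1.0:
        raise ValueError("duty_cycle must be between 0.0 and 1.0")
    on_steps = round(duty_cycle * steps_per_period)
    return [1] * on_steps + [0] * (steps_per_period - on_steps)


def average_brightness(duty_cycle: float, full_brightness: float = 1.0) -> float:
    """Time-averaged light output over one period.

    Assumption: output is linear in on-time, which real LEDs and human
    perception only approximate.
    """
    wave = pwm_waveform(duty_cycle)
    return full_brightness * sum(wave) / len(wave)
```

Under these assumptions, `average_brightness(0.25)` returns `0.25`, that is, a quarter of the fully activated level, while duty cycles of 0.0 and 1.0 correspond to fully off and fully on.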
[0173] Alternatively, an emissive display system can be used, in which the LCD panel 1150 is replaced by an emissive display panel (e.g., organic light-emitting diode / OLED), the backlight 1160 is omitted, and the row driver 1152 and column driver 1154 can be used to directly modulate the pixel color and brightness, respectively.
[0174] Distance-based display resolution
[0175] Figure 12A depicts a scenario in which a user 1218 interacts with a data processing device 1200 including a display unit 1228. The data processing device 1200 may include, for example, a notebook computer, desktop computer, tablet computer, convertible tablet, mobile internet device (MID), personal digital assistant (PDA), wearable device (e.g., head-mounted display / HMD), media player, etc., or any combination thereof. The illustrated data processing device 1200 includes a processor 1224 (e.g., an embedded controller, microcontroller, host processor, graphics processor) coupled to a memory 1222, which may include storage locations addressable by the processor 1224. As will be discussed in more detail, a distance sensor 1210 may enable distance-based display resolution relative to the display unit 1228.
[0176] The illustrated memory 1222 includes display data 1226 to be rendered on the display unit 1228. In one example, the processor 1224 performs data transformation on the display data 1226 before presenting it on the display unit 1228. A post-processing engine 1214 can be executed on the processor 1224 to receive the display data 1226 and the output of the distance sensor 1210. The post-processing engine 1214 can modify the display data 1226 to enhance the readability of screen content on the display unit 1228, reduce power consumption in the data processing device 1200, etc., or any combination thereof.
[0177] The illustrated memory 1222 stores display resolution settings 1216 in addition to an operating system 1212 and an application 1220. The display resolution settings 1216 specify the number of pixels of the display data 1226 to be rendered on the display unit 1228 along both the length and width dimensions. If the display data 1226 generated by the application 1220 is incompatible with the format of the display unit 1228, the processor 1224 can scale the display data 1226 to match the format of the display unit 1228. In this regard, the display resolution settings 1216 can be associated with and / or incorporated into configuration data that defines other settings for the display unit 1228. Furthermore, the display resolution settings 1216 can be defined in terms of pixels per unit distance or area (e.g., pixels per inch / PPI) or other suitable parameters.
[0178] Application 1220 can generate a user interface in which user 1218 can interact to select display resolution setting 1216 from one or more options provided through the user interface, type display resolution setting 1216 as a requested value, etc. Therefore, the size of display data 1226 can be adjusted to fit display resolution setting 1216 before being rendered on display unit 1228.
[0179] The distance sensor 1210 can track the distance between the user 1218 and the display unit 1228, wherein distance sensing can be triggered by a physical button associated with the data processing device 1200 / display unit 1228, by a user interface provided by the application 1220 and / or by the loading of the operating system 1212, etc. For example, during the boot of the data processing device 1200, the operating system 1212 can execute an automatic process to trigger distance sensing in the background or foreground. Distance sensing can be performed periodically or continuously.
[0180] Figure 12B shows an example of a distance sensing scenario. In the illustrated example, the distance sensor 1210 uses a transceiver 1208 to transmit an electromagnetic beam 1202 in the direction of the user 1218. Accordingly, the transceiver 1208 can be positioned on a front-facing surface of the data processing device 1200 (Figure 12A). The electromagnetic beam 1202 can strike the user 1218 and be reflected / scattered from the user 1218 as a return electromagnetic beam 1204. The return electromagnetic beam 1204 can be analyzed by, for example, the processor 1224 (Figure 12A) and / or the post-processing engine 1214 (Figure 12A) to determine the distance 1206 between the user 1218 and the display unit 1228 (Figure 12A). The distance 1206 can be used to adjust the display resolution settings 1216.
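As a non-limiting illustration, the sketch below shows one way a measured distance could drive a resolution setting. The time-of-flight model, the threshold table, the PPI values, and the policy of lowering resolution as distance grows are all assumptions made for the example; the embodiments do not specify how the return beam is analyzed or how the setting is chosen.

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458.0


def distance_from_round_trip(round_trip_seconds: float) -> float:
    """Assumed time-of-flight model: the beam travels to the user and back."""
    return SPEED_OF_LIGHT_M_PER_S * round_trip_seconds / 2.0


# Hypothetical policy table: (maximum distance in metres, pixels per inch).
RESOLUTION_POLICY = [(0.5, 300), (1.0, 200), (2.0, 120)]
FALLBACK_PPI = 72  # hypothetical value used beyond the last threshold


def select_ppi(distance_m: float) -> int:
    """Pick the first policy entry whose distance limit covers the user."""
    for max_distance_m, ppi in RESOLUTION_POLICY:
        if distance_m <= max_distance_m:
            return ppi
    return FALLBACK_PPI
```

With this hypothetical table, a user at 0.4 m is given 300 PPI and a user at 5 m is given the 72 PPI fallback, which trades detail the user is unlikely to resolve for lower rendering cost.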
[0181] Display layer
[0182] Turning now to Figure 13, a display system 1300 is shown in which cascaded display layers 1361, 1362, and 1363 are used to implement spatial / temporal super-resolution in a display component 1360. In the illustrated example, a processor 1310 provides raw graphics data 1334 (e.g., video frames, still images) to the system 1300 via a bus 1320. A cascaded display program 1331 may be stored in a memory 1330, wherein the cascaded display program 1331 may be part of a display driver associated with the display component 1360. The illustrated memory 1330 also includes the raw graphics data 1334 and decomposed graphics data 1335. In one example, the cascaded display program 1331 includes a temporal decomposition component 1332 and a spatial decomposition component 1333. The temporal decomposition component 1332 performs temporal decomposition calculations, while the spatial decomposition component 1333 performs spatial decomposition calculations. The cascaded display program 1331 can derive the decomposed graphics data 1335 based on user configuration and the raw graphics data 1334 for rendering on each of the display layers 1361, 1362, and 1363.
[0183] Display component 1360 can be implemented as an LCD (Liquid Crystal Display) for use in applications such as head-mounted displays (HMDs). More specifically, display component 1360 may include a stack of LCD panels, interface boards, lens accessories, etc. Each panel can operate at, for example, a native resolution of 1280*1280 and a refresh rate of 60Hz. Other native resolutions, refresh rates, display panel technologies, and / or layer configurations can be used.
[0184] Multiple display units
[0185] Figure 14 shows a graphics display system 1400 comprising a set of display units 1430 (1430a-1430n). The display units 1430 are generally used to output a widescreen (e.g., panoramic) presentation 1440, which includes coordinated content in a cohesive and structured topological form. In the illustrated example, a data processing device 1418 includes a processor 1415 that applies a logic function 1424 to hardware profile data 1402 received via a network 1420 from the set of display units 1430. When no match is found between the hardware profile data 1402 and a set of settings in a hardware profile lookup table 1412, applying the logic function 1424 to the hardware profile data 1402 creates a set of automatic topology settings 1406. The illustrated set of automatic topology settings 1406 is transmitted from the data processing device 1418 to the display units 1430 via the network 1420.
[0186] The processor 1415 may execute the logic function 1424 after receiving it from a display driver 1410. In this regard, the display driver 1410 may include an automatic topology module 1408 that automatically configures and constructs the topology of the display units 1430 to create the presentation 1440. In one example, the display driver 1410 is a set of instructions that, when executed by the processor 1415, cause the data processing device 1418 to communicate with the display units 1430, video cards, etc., and perform automatic topology generation operations.
[0187] The data processing device 1418 may include, for example, a server, desktop computer, laptop computer, tablet computer, convertible tablet, MID, PDA, wearable device, media player, etc. Therefore, the data processing device 1418 may include a hardware control module 1416, a storage device 1414, random access memory (RAM, not shown), controller cards including one or more video controller cards, etc. In one example, the display units 1430 are flat panel displays (e.g., liquid crystal, active matrix, plasma, etc.), HMDs, video projection devices, etc., that work together to produce the presentation 1440. Furthermore, the presentation 1440 may be generated based on media files stored in the storage device 1414, wherein the media files may include, for example, movies, video clips, animations, advertisements, etc., or any combination thereof.
[0188] The term "topology" can be considered as the number, scaling, shape, and / or other configuration parameters of the first display unit 1430a, the second display unit 1430b, the third display unit 1430n, etc. Accordingly, the topology of the display units 1430 allows the presentation 1440 to be presented visually consistently, ensuring that the various segments of the presentation 1440 are proportional to and compatible with the original scale and extent of the media being played through the display units 1430. Therefore, the topology can constitute spatial relationships and / or geometric properties unaffected by continuous changes in the shape or size of the content rendered in the presentation 1440. In one example, the automatic topology module 1408 includes a timing module 1426, a control module 1428, a signal monitor module 1432, and a signal display module 1434. The timing module 1426 can designate a specific display unit from the set of display units 1430 as a sample display unit. In such cases, the timing module 1426 can designate the remaining display units 1430 as additional display units. In one example, the timing module 1426 automatically sets the shape factor to be compatible with the hardware profile data 1402, where the presentation 1440 is automatically initiated by a sequence 1422 of graphic signals.
[0189] In one example, control module 1428 modifies a set of automatic topology settings 1406. Furthermore, signal monitor module 1432 can automatically monitor the sequence 1422 of graphic signals and trigger storage device 1414 to associate the set of automatic topology settings 1406 with hardware profile lookup table 1412. Additionally, signal monitor module 1432 can automatically detect changes in a set of display units 1430 according to a set of change criteria and automatically generate a new topology profile corresponding to the changes in the set of display units 1430. Thus, the new topology profile can be applied to the set of display units 1430. If the sequence 1422 of graphic signals does not meet a set of criteria, signal monitor module 1432 can also trigger signal display module 1434 to reapply the set of automatic topology settings 1406. If hardware profile data 1402 does not support automatic topology display of the sequence 1422 of graphic signals, data processing device 1418 can report an error and log the error in error log 1413.
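The lookup-then-generate behavior described for the hardware profile lookup table 1412 and the automatic topology settings 1406 can be sketched as follows. The dictionary keys, the `auto_topology` helper, and the single-row layout it produces are hypothetical stand-ins for the logic function 1424, whose actual behavior is not specified here.

```python
def auto_topology(hardware_profile: dict) -> dict:
    """Hypothetical generator: arrange all units in one horizontal row."""
    width, height = hardware_profile["resolution"]
    count = hardware_profile["unit_count"]
    return {"rows": 1, "columns": count, "canvas": (width * count, height)}


def resolve_topology(hardware_profile: dict, lookup_table: dict):
    """Return (settings, was_generated) for the reported hardware profile.

    A stored entry is reused when the profile matches; otherwise settings are
    generated and associated with the lookup table for later reuse.
    """
    key = (hardware_profile["unit_count"], hardware_profile["resolution"])
    if key in lookup_table:
        return lookup_table[key], False
    settings = auto_topology(hardware_profile)
    lookup_table[key] = settings
    return settings, True
```

For example, three 1920 x 1080 units with an empty table yield a generated 5760 x 1080 canvas on the first call, and the same settings are served from the table on the second call.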
[0190] Cloud-assisted media delivery
[0191] Turning now to Figure 15, a cloud gaming system 1500 includes a client 1540 coupled to a server 1520 via a network 1510. The client 1540 can generally be a consumer of graphical (e.g., game, virtual reality / VR, augmented reality / AR) content hosted, processed, and rendered on the server 1520. The illustrated server 1520 is scalable and has the capacity to serve graphical content to multiple clients simultaneously (e.g., by utilizing parallel and amortized processing and rendering resources). In one example, the scalability of the server 1520 is limited by the capacity of the network 1510. Accordingly, there may be a threshold number of clients beyond which service is degraded for all clients.
[0192] In one example, the server 1520 includes a graphics processor (e.g., GPU) 1530, a host processor (e.g., CPU) 1524, and a network interface card (NIC) 1522. The NIC 1522 can receive requests for graphics content from the client 1540. Requests from the client 1540 can cause graphics content to be fetched from memory via an application executing on the host processor 1524. The host processor 1524 can perform high-level operations, such as, for example, determining the position, collision, and motion of objects in a given scene. Based on these high-level operations, the host processor 1524 can generate rendering commands that are combined with scene data and executed by the graphics processor 1530. The rendering commands enable the graphics processor 1530 to define scene geometry, shading, lighting, motion, textures, camera parameters, etc., for a scene to be rendered via the client 1540.
[0193] More specifically, the illustrated graphics processor 1530 includes a graphics renderer 1532 that performs the rendering process according to the rendering commands generated by the host processor 1524. The output of the graphics renderer 1532 may be a raw video frame stream provided to a frame capture unit 1534. The illustrated frame capture unit 1534 is coupled to an encoder 1536 that can compress / format the raw video stream for transmission over the network 1510. The encoder 1536 may use various video compression algorithms, such as, for example, the H.264 standard from the International Telecommunication Union Telecommunication Standardization Sector (ITU-T), the MPEG-4 Advanced Video Coding (AVC) standard from the International Organization for Standardization / International Electrotechnical Commission (ISO / IEC), and so on.
[0194] The illustrated client 1540 (which may be a desktop computer, laptop computer, tablet computer, convertible tablet, wearable device, MID, PDA, media player, etc.) includes a NIC 1542 to receive the transmitted video stream from the server 1520. The NIC 1542 may include the physical layer and the software layer foundation of the network interface in the client 1540 to facilitate communication on the network 1510. The client 1540 may also include a decoder 1544 employing the same formatting / compression scheme as the encoder 1536. Therefore, the decompressed video stream can be provided from the decoder 1544 to a video renderer 1546. The illustrated video renderer 1546 is coupled to a display 1548 that visually presents the graphical content.
[0195] As already noted, the graphical content may include game content. In this regard, the client 1540 may perform a real-time interactive streaming process involving collecting user input from an input device 1550 and delivering the user input to the server 1520 via the network 1510. This real-time interactive aspect of cloud gaming presents challenges regarding latency.
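The client threshold mentioned above can be estimated with simple arithmetic when network capacity is the only limit. The sketch below assumes a fixed per-client stream bitrate and ignores protocol overhead and server-side limits; the function names and figures are illustrative only.

```python
def max_clients(network_capacity_mbps: float, per_client_mbps: float) -> int:
    """Largest number of streams that fit in the available capacity."""
    if per_client_mbps <= 0:
        raise ValueError("per_client_mbps must be positive")
    return int(network_capacity_mbps // per_client_mbps)


def is_degraded(active_clients: int, network_capacity_mbps: float,
                per_client_mbps: float) -> bool:
    """True once the active client count exceeds the capacity threshold."""
    return active_clients > max_clients(network_capacity_mbps, per_client_mbps)
```

For instance, a 1000 Mbit/s link carrying 15 Mbit/s streams gives `max_clients(1000, 15) == 66`; under this simplified model a 67th client would push every stream below its target bitrate.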
[0196] Additional system overview example
[0197] Figure 16 is a block diagram of a processing system 1600 according to an embodiment. In various embodiments, the system 1600 includes one or more processors 1602 and one or more graphics processors 1608, and may be a single-processor desktop computer system, a multiprocessor workstation system, or a server system having a large number of processors 1602 or processor cores 1607. In one embodiment, the system 1600 is a processing platform included in a system-on-a-chip (SoC) integrated circuit for use in mobile devices, handheld devices, or embedded devices.
[0198] Embodiments of the system 1600 may include or be included in the following: a server-based game platform, a game console (including game and media consoles), a mobile game console, a handheld game console, or an online game console. In some embodiments, the system 1600 is a mobile phone, smartphone, tablet computing device, or mobile internet device. The data processing system 1600 may also include, be coupled to, or be integrated into the following: wearable devices, such as smartwatches, smart glasses, augmented reality devices, or virtual reality devices. In some embodiments, the data processing system 1600 is a television or set-top box device having one or more processors 1602 and a graphical interface generated by one or more graphics processors 1608.
[0199] In some embodiments, the one or more processors 1602 each include one or more processor cores 1607 for processing instructions that, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 1607 is configured to process a specific instruction set 1609. In some embodiments, the instruction set 1609 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via a Very Long Instruction Word (VLIW). Multiple processor cores 1607 may each process a different instruction set 1609, which may include instructions for facilitating emulation of other instruction sets. The processor cores 1607 may also include other processing devices, such as digital signal processors (DSPs).
[0200] In some embodiments, the processor 1602 includes cache memory 1604. Depending on the architecture, the processor 1602 may have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 1602. In some embodiments, the processor 1602 also uses an external cache (e.g., a Level 3 (L3) cache or Last Level Cache (LLC)) (not shown), which can be shared among the processor cores 1607 using known cache coherence techniques. A register file 1606 is additionally included in the processor 1602, and the register file may include different types of registers for storing different types of data (e.g., integer registers, floating-point registers, status registers, and instruction pointer registers). Some registers may be general-purpose registers, while others may be specific to the design of the processor 1602.
[0201] In some embodiments, the processor 1602 is coupled to a processor bus 1610 to transmit communication signals (e.g., address, data, or control signals) between the processor 1602 and other components in the system 1600. In one embodiment, the system 1600 uses an exemplary 'hub' system architecture including a memory controller hub 1616 and an input / output (I / O) controller hub 1630. The memory controller hub 1616 facilitates communication between memory devices and other components of the system 1600, while the I / O controller hub (ICH) 1630 provides connectivity to I / O devices via a local I / O bus. In one embodiment, the logic of the memory controller hub 1616 is integrated within the processor.
[0202] Memory device 1620 may be a dynamic random access memory (DRAM) device, a static random access memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device with suitable performance to serve as process memory. In one embodiment, memory device 1620 may operate as system memory of system 1600 to store data 1622 and instructions 1621 for use when one or more processors 1602 execute an application or process. Memory controller hub 1616 is also coupled to an optional external graphics processor 1612, which may be coupled to graphics processor 1608 in processor 1602 to perform graphics and media operations.
[0203] In some embodiments, ICH 1630 enables peripheral devices to be connected to memory device 1620 and processor 1602 via a high-speed I / O bus. I / O peripheral devices include, but are not limited to: audio controller 1646, firmware interface 1628, wireless transceiver 1626 (e.g., Wi-Fi, Bluetooth), data storage device 1624 (e.g., hard disk drive, flash memory, etc.), and a conventional I / O controller 1640 for coupling conventional (e.g., Personal System 2 (PS / 2)) devices to the system. One or more Universal Serial Bus (USB) controllers 1642 connect input devices (e.g., a keyboard and mouse combination 1644). Network controller 1634 may also be coupled to ICH 1630. In some embodiments, a high-performance network controller (not shown) is coupled to processor bus 1610. It will be appreciated that the illustrated system 1600 is exemplary and not limiting, as other types of data processing systems configured differently may also be used. For example, the I / O controller hub 1630 may be integrated within one or more processors 1602, or the memory controller hub 1616 and the I / O controller hub 1630 may be integrated within a discrete external graphics processor (such as external graphics processor 1612).
[0204] Figure 17 is a block diagram of an embodiment of a processor 1700, which has one or more processor cores 1702A to 1702N, an integrated memory controller 1714, and an integrated graphics processor 1708. Those elements of Figure 17 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to, but not limited to, those described elsewhere herein. The processor 1700 may include additional cores up to and including an additional core 1702N, indicated by the dashed boxes. Each of the processor cores 1702A to 1702N includes one or more internal cache units 1704A to 1704N. In some embodiments, each processor core is also able to access one or more shared cache units 1706.
[0205] Internal cache units 1704A to 1704N and shared cache unit 1706 represent the cache memory hierarchy within processor 1700. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core and one or more levels of shared intermediate level caches (e.g., Level 2 (L2), Level 3 (L3), Level 4 (L4), or other levels of cache), wherein the highest-level cache preceding external memory is classified as LLC. In some embodiments, cache coherence logic maintains coherence between the various cache units 1706 and 1704A to 1704N.
[0206] In some embodiments, the processor 1700 may further include a set of one or more bus controller units 1716 and a system agent core 1710. The one or more bus controller units 1716 manage a set of peripheral buses, such as one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express). The system agent core 1710 provides management functions for the various processor components. In some embodiments, the system agent core 1710 includes one or more integrated memory controllers 1714 for managing access to various external memory devices (not shown).
[0207] In some embodiments, one or more of processor cores 1702A to 1702N include support for simultaneous multithreaded processing. In such an embodiment, system agent core 1710 includes components for coordinating and operating cores 1702A to 1702N during multithreaded processing. System agent core 1710 may additionally include a power control unit (PCU) including logic and components for regulating the power states of processor cores 1702A to 1702N and graphics processor 1708.
[0208] In some embodiments, the processor 1700 further includes a graphics processor 1708 for performing graphics processing operations. In some embodiments, the graphics processor 1708 is coupled to the set of shared cache units 1706 and to the system agent core 1710, which includes the one or more integrated memory controllers 1714. In some embodiments, a display controller 1711 is coupled to the graphics processor 1708 to drive graphics processor output to one or more coupled displays. In some embodiments, the display controller 1711 may be a separate module coupled to the graphics processor via at least one interconnect, or it may be integrated within the graphics processor 1708 or the system agent core 1710.
[0209] In some embodiments, a ring-based interconnect unit 1712 is used to couple the internal components of the processor 1700. However, alternative interconnect units, such as point-to-point interconnects, switched interconnects, or other technologies, including those well known in the art, may be used. In some embodiments, the graphics processor 1708 is coupled to the ring interconnect 1712 via I / O link 1713.
[0210] The exemplary I / O link 1713 represents at least one of a variety of I / O interconnects, including an on-package I / O interconnect that facilitates communication between various processor components and a high-performance embedded memory module 1718 (such as an eDRAM module). In some embodiments, each of the processor cores 1702A to 1702N and the graphics processor 1708 uses the embedded memory module 1718 as a shared last-level cache.
[0211] In some embodiments, the processor cores 1702A to 1702N are homogeneous cores executing the same instruction set architecture. In another embodiment, the processor cores 1702A to 1702N are heterogeneous in terms of instruction set architecture (ISA), wherein one or more of the processor cores 1702A to 1702N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, the processor cores 1702A to 1702N are heterogeneous in terms of microarchitecture, wherein one or more cores with relatively higher power consumption are coupled to one or more cores with lower power consumption. Additionally, the processor 1700 can be implemented on one or more chips or implemented as an SoC integrated circuit having, among other components, the components shown.
[0212] Figure 18 is a block diagram of a graphics processor 1800, which may be a discrete graphics processing unit or a graphics processor integrated with multiple processing cores. In some embodiments, the graphics processor communicates via a memory-mapped I / O interface to registers on the graphics processor and via commands placed in processor memory. In some embodiments, the graphics processor 1800 includes a memory interface 1814 for accessing memory. The memory interface 1814 may be an interface to local memory, one or more internal caches, one or more shared external caches, and / or to system memory.
[0213] In some embodiments, the graphics processor 1800 further includes a display controller 1802 for driving display output data to a display device 1820. The display controller 1802 includes hardware for one or more overlay planes for the display and for the composition of multiple layers of video or user interface elements. In some embodiments, the graphics processor 1800 includes a video codec engine 1806 for encoding, decoding, or transcoding media to, from, or between one or more media encoding formats, including but not limited to: Moving Picture Experts Group (MPEG) formats (such as MPEG-2), Advanced Video Coding (AVC) formats (such as H.264 / MPEG-4 AVC), Society of Motion Picture & Television Engineers (SMPTE) 421M / VC-1, and Joint Photographic Experts Group (JPEG) formats (such as JPEG and Motion JPEG (MJPEG)).
[0214] In some embodiments, the graphics processor 1800 includes a block image transfer (BLIT) engine 1804 for performing two-dimensional (2D) rasterizer operations, including, for example, bit boundary block transfer. However, in one embodiment, 2D graphics operations are performed using one or more components of a graphics processing engine (GPE) 1810. In some embodiments, the graphics processing engine 1810 is a computational engine for performing graphics operations, including three-dimensional (3D) graphics operations and media operations.
[0215] In some embodiments, the GPE 1810 includes a 3D pipeline 1812 for performing 3D operations, such as rendering 3D images and scenes using processing functions that act on 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 1812 includes programmable and fixed-function elements that perform various tasks within the elements and / or generate execution threads to a 3D / media subsystem 1815. While the 3D pipeline 1812 can be used to perform media operations, embodiments of the GPE 1810 also include a media pipeline 1816 specifically for performing media operations, such as video post-processing and image enhancement.
[0216] In some embodiments, the media pipeline 1816 includes fixed-function or programmable logic units for performing one or more specialized media operations, such as video decoding acceleration, video deinterlacing, and video encoding acceleration, in place of or on behalf of the video codec engine 1806. In some embodiments, the media pipeline 1816 further includes a thread generation unit to generate threads for execution on the 3D / media subsystem 1815. The generated threads perform the computations for the media operations on one or more graphics execution units included in the 3D / media subsystem 1815.
[0217] In some embodiments, the 3D / media subsystem 1815 includes logic for executing threads generated by the 3D pipeline 1812 and the media pipeline 1816. In one embodiment, the pipelines send thread execution requests to the 3D / media subsystem 1815, the 3D / media subsystem including thread dispatch logic for arbitrating and dispatching requests to available thread execution resources. Execution resources include an array of graphics execution units for processing 3D and media threads. In some embodiments, the 3D / media subsystem 1815 includes one or more internal caches for thread instructions and data. In some embodiments, the subsystem also includes shared memory (including registers and addressable memory) for sharing data between threads and storing output data.
[0218] 3D / Media Processing
[0219] Figure 19 is a block diagram of a graphics processing engine 1910 of a graphics processor according to some embodiments. In one embodiment, the GPE 1910 is a version of the GPE 1810 shown in Figure 18. Elements of Figure 19 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to, but not limited to, those described elsewhere herein.
[0220] In some embodiments, GPE 1910 is coupled to command stream converter 1903, which provides command streams to the GPE's 3D pipeline 1912 and media pipeline 1916. In some embodiments, command stream converter 1903 is coupled to memory, which may be system memory, or one or more of internal cache memory and shared cache memory. In some embodiments, command stream converter 1903 receives commands from memory and sends the commands to 3D pipeline 1912 and / or media pipeline 1916. The commands are instructions obtained from a ring buffer storing instructions for 3D pipeline 1912 and media pipeline 1916. In one embodiment, the ring buffer may additionally include a batch command buffer storing multiple batches of multiple commands. 3D pipeline 1912 and media pipeline 1916 process the commands by performing operations via logic within their respective pipelines or by dispatching one or more execution threads to execution unit array 1914. In some embodiments, the execution unit array 1914 is scalable, such that the array includes a variable number of execution units based on the target power and performance level of the GPE 1910.
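The flow of commands from a ring buffer to the two pipelines can be pictured with the following minimal sketch. The class and function names, the fixed capacity, and the use of plain strings as commands are assumptions for illustration and do not reflect the actual command format or hardware interface.

```python
class CommandRing:
    """Fixed-size ring buffer holding (target_pipeline, command) entries."""

    def __init__(self, capacity: int):
        self._slots = [None] * capacity
        self._head = 0   # index of the next entry to read
        self._tail = 0   # index of the next free slot to write
        self._count = 0

    def push(self, target: str, command: str) -> bool:
        """Store a command; return False when the ring is full."""
        if self._count == len(self._slots):
            return False
        self._slots[self._tail] = (target, command)
        self._tail = (self._tail + 1) % len(self._slots)
        self._count += 1
        return True

    def pop(self):
        """Return the oldest entry, or None when the ring is empty."""
        if self._count == 0:
            return None
        entry = self._slots[self._head]
        self._head = (self._head + 1) % len(self._slots)
        self._count -= 1
        return entry


def stream_commands(ring: CommandRing, pipelines: dict) -> int:
    """Drain the ring, routing each command to its pipeline's queue."""
    dispatched = 0
    entry = ring.pop()
    while entry is not None:
        target, command = entry
        pipelines[target].append(command)
        dispatched += 1
        entry = ring.pop()
    return dispatched
```

In this sketch, `pipelines` is a dictionary such as `{"3d": [], "media": []}`; the read and write indices wrap around the fixed storage, which is the property that distinguishes a ring buffer from an unbounded queue.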
[0221] In some embodiments, the sampling engine 1930 is coupled to memory (e.g., cache memory or system memory) and the execution unit array 1914. In some embodiments, the sampling engine 1930 provides a memory access mechanism for the execution unit array 1914, which allows the execution unit array 1914 to read graphics and media data from memory. In some embodiments, the sampling engine 1930 includes logic for performing specialized image sampling operations for media.
[0222] In some embodiments, the dedicated media sampling logic in the sampling engine 1930 includes a denoising / deinterlacing module 1932, a motion estimation module 1934, and an image scaling and filtering module 1936. In some embodiments, the denoising / deinterlacing module 1932 includes logic for performing one or more of a denoising or deinterlacing algorithm on the decoded video data. The deinterlacing logic combines the alternating fields of the interlaced video content into a single frame of video. The denoising logic reduces or removes data noise from the video and image data. In some embodiments, the denoising and deinterlacing logic is motion-adaptive and uses spatial or temporal filtering based on the amount of motion detected in the video data. In some embodiments, the denoising / deinterlacing module 1932 includes dedicated motion detection logic (e.g., within the motion estimation engine 1934).
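As an illustration of combining two interlaced fields into one frame, the following sketch shows the simplest "weave" approach. It is not the motion-adaptive logic the embodiments describe, and the function name `weave_fields` and the row-list frame representation are assumptions made for this example.

```python
def weave_fields(top_field, bottom_field):
    """Interleave two fields (lists of pixel rows) into one progressive frame.

    Rows of the top field become the even rows of the frame and rows of the
    bottom field become the odd rows.
    """
    if len(top_field) != len(bottom_field):
        raise ValueError("fields must have the same number of rows")
    frame = []
    for top_row, bottom_row in zip(top_field, bottom_field):
        frame.append(top_row)
        frame.append(bottom_row)
    return frame


top = [[10, 10], [30, 30]]      # rows 0 and 2 of the frame
bottom = [[20, 20], [40, 40]]   # rows 1 and 3 of the frame
frame = weave_fields(top, bottom)
```

A motion-adaptive implementation would fall back to spatial interpolation in regions where the two fields disagree because of motion; that step is omitted here.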
[0223] In some embodiments, the motion estimation engine 1934 provides hardware acceleration for video operations by performing video acceleration functions (such as motion vector estimation and prediction) on the video data. The motion estimation engine determines motion vectors that describe the transformation of image data between consecutive video frames. In some embodiments, the graphics processor media codec uses the video motion estimation engine 1934 to perform macroblock-level operations on video that might otherwise be too computationally intensive for a general-purpose processor. In some embodiments, the motion estimation engine 1934 is generally available to graphics processor components to assist video decoding and processing functions that are sensitive to or adaptive to the direction or magnitude of motion within the video data.
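To make the notion of a motion vector concrete, the sketch below performs an exhaustive block-matching search using the sum of absolute differences (SAD). The embodiments do not specify the search algorithm used by the motion estimation engine; block matching is swapped in here as a common technique, and all function names and parameters are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized blocks."""
    return sum(abs(a - b)
               for row_a, row_b in zip(block_a, block_b)
               for a, b in zip(row_a, row_b))


def get_block(frame, top, left, size):
    """Extract a size x size block whose upper-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def estimate_motion(prev_frame, cur_frame, top, left, size, search_range):
    """Find (dy, dx) so that the current block best matches the previous
    frame at (top + dy, left + dx). Returns the vector and its SAD cost."""
    target = get_block(cur_frame, top, left, size)
    height, width = len(prev_frame), len(prev_frame[0])
    best_vector, best_cost = (0, 0), None
    for dy in range(-search_range, search_range + 1):
        for dx in range(-search_range, search_range + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + size > height or x + size > width:
                continue  # candidate block falls outside the frame
            cost = sad(target, get_block(prev_frame, y, x, size))
            if best_cost is None or cost < best_cost:
                best_vector, best_cost = (dy, dx), cost
    return best_vector, best_cost


def make_frame(top, left):
    """6x6 black frame with a bright 2x2 square at (top, left)."""
    frame = [[0] * 6 for _ in range(6)]
    for y in range(top, top + 2):
        for x in range(left, left + 2):
            frame[y][x] = 9
    return frame


prev_frame = make_frame(1, 1)
cur_frame = make_frame(2, 3)   # the square moved down 1 row and right 2 columns
vector, cost = estimate_motion(prev_frame, cur_frame,
                               top=2, left=3, size=2, search_range=2)
```

Here `vector` points from the block's current position back to where it was in the previous frame, which is the convention used when predicting a block from a reference frame.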
[0224] In some embodiments, the image scaling and filtering module 1936 performs image processing operations to improve the visual quality of the resulting images and videos. In some embodiments, the scaling and filtering module 1936 processes image and video data during sampling operations before providing data to the execution unit array 1914.
[0225] In some embodiments, the GPE 1910 includes a data port 1944 that provides additional mechanisms for enabling the graphics subsystem to access memory. In some embodiments, the data port 1944 facilitates memory access for operations including render target writes, constant buffer reads, temporary memory space reads / writes, and media surface access. In some embodiments, the data port 1944 includes cache memory space for cached access to memory. The cache memory may be a single data cache or may be partitioned into multiple caches (e.g., render buffer cache, constant buffer cache, etc.) for multiple subsystems accessing memory via the data port. In some embodiments, threads executing on execution units in the execution unit array 1914 communicate with the data port by exchanging messages via a data distribution interconnect coupled to each subsystem of the GPE 1910.
[0226] Execution unit
[0227] Figure 20 is a block diagram of another embodiment of a graphics processor 2000. Elements of Figure 20 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.
[0228] In some embodiments, the graphics processor 2000 includes a ring interconnect 2002, a pipeline front-end 2004, a media engine 2037, and graphics cores 2080A to 2080N. In some embodiments, the ring interconnect 2002 couples the graphics processor to other processing units, including other graphics processors or one or more general-purpose processor cores. In some embodiments, the graphics processor is one of a plurality of processors integrated within a multi-core processing system.
[0229] In some embodiments, the graphics processor 2000 receives multiple batches of commands via a ring interconnect 2002. The incoming commands are translated by a command stream converter 2003 in the pipeline front-end 2004. In some embodiments, the graphics processor 2000 includes scalable execution logic for performing 3D geometry processing and media processing via graphics cores 2080A to 2080N. For 3D geometry processing commands, the command stream converter 2003 supplies commands to a geometry pipeline 2036. For at least some media processing commands, the command stream converter 2003 supplies commands to a video front-end 2034, which is coupled to a media engine 2037. In some embodiments, the media engine 2037 includes a video quality engine (VQE) 2030 for video and image post-processing and a multi-format encoding / decoding (MFX) engine 2033 for providing hardware-accelerated media data encoding and decoding. In some embodiments, the geometry pipeline 2036 and the media engine 2037 each generate execution threads for use with thread execution resources provided by at least one graphics core 2080A.
[0230] In some embodiments, the graphics processor 2000 includes scalable thread execution resources characterized by modular cores 2080A to 2080N (sometimes referred to as core slices), each modular core having a plurality of sub-cores 2050A to 2050N, 2060A to 2060N (sometimes referred to as core sub-slices). In some embodiments, the graphics processor 2000 may have any number of graphics cores 2080A to 2080N. In some embodiments, the graphics processor 2000 includes a graphics core 2080A, which has at least a first sub-core 2050A and a second sub-core 2060A. In other embodiments, the graphics processor is a low-power processor having a single sub-core (e.g., 2050A). In some embodiments, the graphics processor 2000 includes a plurality of graphics cores 2080A to 2080N, each graphics core including a set of first sub-cores 2050A to 2050N and a set of second sub-cores 2060A to 2060N. Each of the first set of sub-cores 2050A to 2050N includes at least a first set of execution units 2052A to 2052N and media / texture samplers 2054A to 2054N. Each of the second set of sub-cores 2060A to 2060N includes at least a second set of execution units 2062A to 2062N and samplers 2064A to 2064N. In some embodiments, each sub-core 2050A to 2050N and 2060A-2060N shares a set of shared resources 2070A to 2070N. In some embodiments, these shared resources include shared cache memory and pixel operation logic. Other shared resources may also be included in various embodiments of the graphics processor.
[0231] Figure 21 shows thread execution logic 2100, including an array of processing elements employed in some embodiments of a GPE. Elements of Figure 21 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.
[0232] In some embodiments, thread execution logic 2100 includes a pixel shader 2102, a thread dispatcher 2104, an instruction cache 2106, a scalable execution unit array including multiple execution units 2108A to 2108N, a sampler 2110, a data cache 2112, and a data port 2114. In one embodiment, the included components are interconnected via an interconnect structure that links to each of the components. In some embodiments, thread execution logic 2100 includes one or more connections to memory (e.g., system memory or cache memory) via one or more of the instruction cache 2106, data port 2114, sampler 2110, and execution unit array 2108A to 2108N. In some embodiments, each execution unit (e.g., 2108A) is an individual vector processor capable of executing multiple concurrent threads and processing multiple data elements in parallel for each thread. In some embodiments, execution unit array 2108A to 2108N includes any number of individual execution units.
[0233] In some embodiments, execution unit array 2108A to 2108N is primarily used to execute "shader" programs. In some embodiments, the execution units in array 2108A to 2108N execute an instruction set that includes native support for many standard 3D graphics shader instructions, so that shader programs from graphics libraries (e.g., Direct3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders).
[0234] Each execution unit in the execution unit arrays 2108A to 2108N operates on an array of data elements. The number of data elements is the "execution size" or the number of channels used for instructions. An execution channel is a logical execution unit used for data element access, masking, and flow control within instructions. The number of channels may be independent of the number of physical arithmetic logic units (ALUs) or floating-point units (FPUs) for a particular graphics processor. In some embodiments, execution units 2108A to 2108N support both integer and floating-point data types.
[0235] The execution unit instruction set includes Single Instruction Multiple Data (SIMD) instructions. The various data elements can be stored in registers as packed data types, and the execution unit processes these elements based on their data size. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (quad-word (QW) size), eight separate 32-bit packed data elements (double-word (DW) size), sixteen separate 16-bit packed data elements (word (W) size), or thirty-two separate 8-bit data elements (byte (B) size). However, different vector widths and register sizes are possible.
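The four ways of viewing one 256-bit register can be illustrated in software with Python's standard `struct` module. This is a sketch only: the helper name `packed_views` is hypothetical, and little-endian byte order is an assumption, since the embodiments do not state the element ordering within a register.

```python
import struct


def packed_views(vector_bytes):
    """Reinterpret one 256-bit value as QW, DW, W and B element arrays."""
    if len(vector_bytes) != 32:
        raise ValueError("expected a 256-bit (32-byte) vector")
    return {
        "QW": struct.unpack("<4Q", vector_bytes),   # four 64-bit elements
        "DW": struct.unpack("<8I", vector_bytes),   # eight 32-bit elements
        "W": struct.unpack("<16H", vector_bytes),   # sixteen 16-bit elements
        "B": struct.unpack("<32B", vector_bytes),   # thirty-two 8-bit elements
    }


vector = bytes(range(32))   # bytes 0x00, 0x01, ..., 0x1F
views = packed_views(vector)
```

The same 32 bytes yield 4, 8, 16, or 32 elements depending on the element size selected, which is the sense in which the execution unit "processes these elements based on their data size".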
[0236] One or more internal instruction caches (e.g., 2106) are included in thread execution logic 2100 to cache thread instructions for the execution unit. In some embodiments, one or more data caches (e.g., 2112) are included to cache thread data during thread execution. In some embodiments, sampler 2110 is included for providing texture sampling for 3D operations and media sampling for media operations. In some embodiments, sampler 2110 includes dedicated texture or media sampling functions to process texture or media data during the sampling process before providing sampled data to the execution unit.
[0237] During execution, the graphics pipeline and media pipeline send thread initiation requests to thread execution logic 2100 via thread generation and dispatch logic. In some embodiments, thread execution logic 2100 includes a local thread dispatcher 2104 that arbitrates thread initiation requests from the graphics pipeline and media pipeline and instantiates the requested threads on one or more execution units 2108A to 2108N. For example, the geometry pipeline (e.g., 2036 of Figure 20) dispatches vertex processing, tessellation, or geometry processing threads to thread execution logic 2100 (Figure 21). In some embodiments, thread dispatcher 2104 can also handle runtime thread generation requests from executing shader programs.
[0238] Once a set of geometric objects has been processed and rasterized into pixel data, pixel shader 2102 is invoked to further compute output information and cause the results to be written to an output surface (e.g., a color buffer, depth buffer, stencil buffer, etc.). In some embodiments, pixel shader 2102 computes the values of the vertex attributes that are to be interpolated across the rasterized object. In some embodiments, pixel shader 2102 then executes a pixel shader program supplied through an application programming interface (API). To execute the pixel shader program, pixel shader 2102 dispatches threads to an execution unit (e.g., 2108A) via thread dispatcher 2104. In some embodiments, pixel shader 2102 uses texture sampling logic in sampler 2110 to access texture data in texture maps stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric fragment, or discard one or more pixels from further processing.
[0239] In some embodiments, data port 2114 provides a memory access mechanism for enabling thread execution logic 2100 to output processed data to memory for processing on the graphics processor output pipeline. In some embodiments, data port 2114 includes or is coupled to one or more cache memories (e.g., data cache 2112) to cache data via the data port for memory access.
[0240] Figure 22 is a block diagram illustrating graphics processor instruction formats 2200 according to some embodiments. In one or more embodiments, the graphics processor execution unit supports an instruction set having multiple instruction formats. Solid lines indicate components that are typically included in an execution unit instruction, while dashed lines indicate components that are optional or included only in a subset of the instructions. In some embodiments, the instruction formats 2200 described and illustrated are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations that result from instruction decoding once the instruction is processed.
[0241] In some embodiments, the graphics processor execution unit natively supports instructions in 128-bit format 2210. A 64-bit compact instruction format 2230 can be used for some instructions based on the selected instruction, instruction options, and number of operands. The native 128-bit format 2210 provides access to all instruction options, while some options and operations are restricted in the 64-bit format 2230. The native instructions available in 64-bit format 2230 vary depending on the embodiment. In some embodiments, instructions are partially compacted using a set of index values in index field 2213. The execution unit hardware references a set of compression tables based on these index values and uses the output of the compression tables to reconstruct the native instructions in 128-bit format 2210.
[0242] For each format, the instruction opcode 2212 defines the operation to be performed by the execution unit. The execution unit executes each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or picture element. By default, the execution unit executes each instruction across all data channels of the operands. In some embodiments, the instruction control field 2214 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For the 128-bit instruction 2210, the execution size field 2216 limits the number of data channels that will be executed in parallel. In some embodiments, the execution size field 2216 is not available in the 64-bit compact instruction format 2230.
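The interaction between per-channel execution, the execution size limit, and channel selection can be modeled as follows. This is a behavioral sketch, not the encoding of fields 2214 and 2216: the function `simd_add` and its `exec_size` and `predicate` parameters are hypothetical names, and the choice to leave disabled channels of the destination unchanged is an assumption.

```python
def simd_add(src0, src1, dst, exec_size=None, predicate=None):
    """Add src0 and src1 channel by channel.

    exec_size limits how many leading channels execute (all by default).
    predicate is an optional per-channel mask; channels whose entry is
    False keep the value already present in dst (an assumption).
    """
    channels = len(src0) if exec_size is None else exec_size
    result = list(dst)
    for ch in range(channels):
        if predicate is None or predicate[ch]:
            result[ch] = src0[ch] + src1[ch]
    return result


src0 = [1, 2, 3, 4]
src1 = [10, 20, 30, 40]
dst = [0, 0, 0, 0]

all_channels = simd_add(src0, src1, dst)
first_two = simd_add(src0, src1, dst, exec_size=2)
masked = simd_add(src0, src1, dst, predicate=[True, False, True, False])
```

With no options the add runs across every channel, matching the default behavior described above; the two optional arguments narrow which channels are written.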
[0243] Some execution unit instructions have up to three operands, including two source operands src0 2220 and src1 2222, and a destination 2218. In some embodiments, the execution unit supports dual-destination instructions, where one of the destinations is implied. Data manipulation instructions may have a third source operand (e.g., src2 2224), where the instruction opcode 2212 determines the number of source operands. The last source operand of an instruction may be an immediate (e.g., hard-coded) value passed with the instruction.
[0244] In some embodiments, the 128-bit instruction format 2210 includes access / address mode information 2226, which specifies, for example, whether to use direct register addressing mode or indirect register addressing mode. When using direct register addressing mode, the register addresses of one or more operands are provided directly by bits in the instruction 2210.
[0245] In some embodiments, the 128-bit instruction format 2210 includes an access / address mode field 2226 that specifies the address mode and / or access mode of the instruction. In one embodiment, the access mode defines the data access alignment of the instruction. Some embodiments support access modes including a 16-byte aligned access mode and a 1-byte aligned access mode, wherein the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction 2210 may use byte-aligned addressing for both the source and destination operands, and when in a second mode, the instruction 2210 may use 16-byte aligned addressing for all source and destination operands.
[0246] In one embodiment, the address mode portion of the access / address mode field 2226 determines whether the instruction will use direct or indirect addressing. When using direct register addressing mode, the bits in instruction 2210 directly provide the register addresses of one or more operands. When using indirect register addressing mode, the register addresses of one or more operands can be calculated based on the address register value and the address immediate field in the instruction.
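The difference between the two addressing modes can be sketched as a small resolver. The text says only that the indirect address is "calculated based on" the address register value and the address immediate field; combining them by simple addition is an assumption made here, and `resolve_register` with its parameters is a hypothetical helper.

```python
def resolve_register(mode, field, address_register_value=0):
    """Return the register address an operand refers to.

    direct:   the instruction field is the register address itself.
    indirect: the field is an immediate that is combined with the value
              of an address register (addition is assumed here).
    """
    if mode == "direct":
        return field
    if mode == "indirect":
        return address_register_value + field
    raise ValueError("unknown address mode: " + repr(mode))


direct_reg = resolve_register("direct", 12)
indirect_reg = resolve_register("indirect", 4, address_register_value=32)
```

Indirect addressing lets the same instruction bits reach different registers from one execution to the next, because the address register can be changed at run time.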
[0247] In some embodiments, instructions are grouped based on bit fields of the opcode 2212 to simplify opcode decoding 2240. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely exemplary. In some embodiments, the move and logic opcode group 2242 includes data move and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, the move and logic group 2242 shares the five most significant bits (MSB), where move (mov) instructions are in the form of 0000xxxxb, and logic instructions are in the form of 0001xxxxb. The flow control instruction group 2244 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). The miscellaneous instruction group 2246 includes a mixture of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). The parallel math instruction group 2248 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 2248 performs arithmetic operations in parallel across data channels. The vector math group 2250 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic such as dot product calculations on vector operands.
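The grouping by bits 4, 5, and 6 can be expressed as a short lookup. The bit patterns below are taken from the exemplary grouping in this paragraph; the function name `opcode_group`, the table name, and the `"unassigned"` label for patterns the text does not list are assumptions made for illustration.

```python
# Group selected by bits 6..4 of an 8-bit opcode (exemplary grouping).
OPCODE_GROUPS = {
    0b000: "move",            # 0000xxxxb
    0b001: "logic",           # 0001xxxxb
    0b010: "flow control",    # 0010xxxxb, e.g. 0x20
    0b011: "miscellaneous",   # 0011xxxxb, e.g. 0x30
    0b100: "parallel math",   # 0100xxxxb, e.g. 0x40
    0b101: "vector math",     # 0101xxxxb, e.g. 0x50
}


def opcode_group(opcode):
    """Classify an 8-bit opcode by inspecting bits 4, 5 and 6."""
    if not 0 <= opcode <= 0xFF:
        raise ValueError("opcode must fit in 8 bits")
    group_bits = (opcode >> 4) & 0b111
    return OPCODE_GROUPS.get(group_bits, "unassigned")


groups = [opcode_group(op) for op in (0x01, 0x10, 0x20, 0x30, 0x40, 0x50)]
```

Because only three bits are examined, the group can be determined before the remaining opcode bits are decoded, which is the simplification the paragraph refers to.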
[0248] Graphics Pipeline
[0249] Figure 23 is a block diagram of another embodiment of a graphics processor 2300. Elements of Figure 23 having the same reference numerals (or names) as elements in any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such.
[0250] In some embodiments, the graphics processor 2300 includes a graphics pipeline 2320, a media pipeline 2330, a display engine 2340, thread execution logic 2350, and a rendering output pipeline 2370. In some embodiments, the graphics processor 2300 is a graphics processor within a multi-core processing system that includes one or more general-purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or by commands issued to the graphics processor 2300 via a ring interconnect 2302. In some embodiments, the ring interconnect 2302 couples the graphics processor 2300 to other processing components, such as other graphics processors or general-purpose processors. Commands from the ring interconnect 2302 are interpreted by a command stream converter 2303, which supplies instructions to individual components of the graphics pipeline 2320 or the media pipeline 2330.
[0251] In some embodiments, command stream converter 2303 directs the operation of vertex acquirer 2305, which reads vertex data from memory and executes vertex processing commands provided by command stream converter 2303. In some embodiments, vertex acquirer 2305 provides vertex data to vertex shader 2307, which performs coordinate space transformation and lighting operations on each vertex. In some embodiments, vertex acquirer 2305 and vertex shader 2307 execute vertex processing instructions by dispatching execution threads to execution units 2352A and 2352B via thread dispatcher 2331.
[0252] In some embodiments, execution units 2352A and 2352B are arrays of vector processors having an instruction set for performing graphics and media operations. In some embodiments, execution units 2352A and 2352B have additional L1 caches 2351 specifically for each array or shared between arrays. The caches may be configured as data caches, instruction caches, or a single cache partitioned to contain data and instructions in different partitions.
[0253] In some embodiments, the graphics pipeline 2320 includes tessellation components for performing hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 2311 configures the tessellation operation. A programmable domain shader 2317 provides back-end evaluation of the tessellation output. A tessellation unit 2313 operates at the direction of the hull shader 2311 and includes dedicated logic for generating a set of detailed geometric objects based on a coarse geometric model that is provided as input to the graphics pipeline 2320. In some embodiments, the tessellation components 2311, 2313, and 2317 can be bypassed if tessellation is not used.
[0254] In some embodiments, the complete geometry object may be processed by the geometry shader 2319 via one or more threads dispatched to execution units 2352A, 2352B, or may proceed directly to the clipper 2329. In some embodiments, the geometry shader operates on the entire geometry object (rather than vertices or vertex patches such as those in previous stages of the graphics pipeline). If tessellation is disabled, the geometry shader 2319 receives input from the vertex shader 2307. In some embodiments, the geometry shader 2319 may be programmed by a geometry shader program to perform geometric tessellation when the tessellation unit is disabled.
[0255] Prior to rasterization, clipper 2329 processes vertex data. Clipper 2329 can be a fixed-function clipper or a programmable clipper with clipping and geometry shader capabilities. In some embodiments, a rasterizer and depth test component 2373 in the rendering output pipeline 2370 dispatches pixel shaders to convert geometric objects into their per-pixel representations. In some embodiments, pixel shader logic is included in thread execution logic 2350. In some embodiments, an application can bypass rasterizer 2373 and access the unrasterized vertex data via a stream-out unit 2323.
[0256] The graphics processor 2300 has an interconnect bus, interconnect structure, or some other interconnect mechanism that allows data and messages to be transferred among the main components of the processor. In some embodiments, execution units 2352A, 2352B and the associated cache(s) 2351, texture and media sampler 2354, and texture / sampler cache 2358 are interconnected via data port 2356 to perform memory accesses and communicate with the processor's rendering output pipeline components. In some embodiments, sampler 2354, caches 2351, 2358, and execution units 2352A, 2352B each have a separate memory access path.
[0257] In some embodiments, the rendering output pipeline 2370 includes a rasterizer 2373 that converts vertex-based objects into associated pixel-based representations. In some embodiments, the rasterizer logic includes windower / masker units for performing fixed-function triangle and line rasterization. Associated rendering cache 2378 and depth cache 2379 are also available in some embodiments. Pixel manipulation unit 2377 performs pixel-based operations on the data; however, in some examples, pixel operations associated with 2D operations (e.g., bit-block image transfer and blending) are performed by the 2D engine 2341, or at display time by the display controller 2343 using an overlay display plane. In some embodiments, a shared L3 cache 2375 is available for all graphics components, allowing data to be shared without using main system memory.
[0258] In some embodiments, the graphics processor media pipeline 2330 includes a media engine 2337 and a video front-end 2334. In some embodiments, the video front-end 2334 receives pipeline commands from a command stream converter 2303. In some embodiments, the media pipeline 2330 includes a separate command stream converter. In some embodiments, the video front-end 2334 processes media commands before sending them to the media engine 2337. In some embodiments, the media engine 2337 includes thread generation functionality for generating threads for dispatching to thread execution logic 2350 via a thread dispatcher 2331.
[0259] In some embodiments, the graphics processor 2300 includes a display engine 2340. In some embodiments, the display engine 2340 is external to the processor 2300 and coupled to the graphics processor via a ring interconnect 2302 or some other interconnect bus or structure. In some embodiments, the display engine 2340 includes a 2D engine 2341 and a display controller 2343. In some embodiments, the display engine 2340 includes dedicated logic capable of operating independently of the 3D pipeline. In some embodiments, the display controller 2343 is coupled to a display device (not shown), which may be a system-integrated display device (such as in a laptop computer) or an external display device attached via a display device connector.
[0260] In some embodiments, the graphics pipeline 2320 and media pipeline 2330 may be configured to perform operations based on multiple graphics and media programming interfaces and are not specific to any one application programming interface (API). In some embodiments, the graphics processor's driver software translates API calls specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for the Open Graphics Library (OpenGL) and Open Computing Language (OpenCL) from the Khronos Group, the Direct3D library from Microsoft, or both OpenGL and D3D. Support may also be provided for the Open Source Computer Vision Library (OpenCV). Future APIs with a compatible 3D pipeline will also be supported if a mapping can be made from the pipeline of the future API to the pipeline of the graphics processor.
[0261] Graphics Pipeline Programming
[0262] Figure 24A is a block diagram illustrating a graphics processor command format 2400 according to some embodiments. Figure 24B is a block diagram illustrating a graphics processor command sequence 2410 according to an embodiment. Solid lines in Figure 24A represent components that are typically included in a graphics command, while dashed lines represent components that are optional or only included in a subset of the graphics commands. The exemplary graphics processor command format 2400 of Figure 24A includes data fields for identifying the target client 2402 of the command, the command operation code (opcode) 2404, and the associated data 2406 for the command. Some commands also include a sub-opcode 2405 and a command size 2408.
[0263] In some embodiments, client 2402 specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition further processing of the command and route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a rendering unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once a command is received by a client unit, the client unit reads opcode 2404 and (if present) sub-opcode 2405 to determine the operation to be performed. The client unit uses information in data field 2406 to execute the command. For some commands, an explicit command size 2408 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned to multiples of a double word.
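The parsing and routing steps in this paragraph can be modeled with a small data structure. The field names mirror client 2402, opcode 2404, sub-opcode 2405, data 2406, and size 2408, but the actual bit layout is not given in the text, so none is assumed. The `Command` class, the client-unit strings, and the `IMPLICIT_SIZES` table with its values are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class Command:
    client: str                       # target client unit (2402)
    opcode: int                       # operation code (2404)
    data: Tuple[int, ...] = ()        # associated data (2406)
    sub_opcode: Optional[int] = None  # present only in some commands (2405)
    size: Optional[int] = None        # explicit size in double words (2408)


CLIENT_UNITS = ("memory_interface", "render", "2d", "3d", "media")

# Hypothetical sizes (in double words) implied by the opcode alone.
IMPLICIT_SIZES = {0x01: 2, 0x02: 4}


def command_size(cmd):
    """Use the explicit size if given, otherwise derive it from the opcode."""
    if cmd.size is not None:
        return cmd.size
    return IMPLICIT_SIZES[cmd.opcode]


def route(commands):
    """Group commands by client unit, as a command parser might."""
    queues = {unit: [] for unit in CLIENT_UNITS}
    for cmd in commands:
        if cmd.client not in queues:
            raise ValueError("unknown client unit: " + repr(cmd.client))
        queues[cmd.client].append((cmd.opcode, cmd.sub_opcode, command_size(cmd)))
    return queues


queues = route([
    Command(client="3d", opcode=0x01, data=(7, 8)),
    Command(client="media", opcode=0x10, sub_opcode=0x3, size=6),
    Command(client="3d", opcode=0x02),
])
```

Each client unit then reads the opcode and, if present, the sub-opcode from its own queue to decide which operation to perform.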
[0264] The flowchart of Figure 24B illustrates an exemplary graphics processor command sequence 2410. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the illustrated command sequence to set up, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for illustrative purposes only, as embodiments are not limited to these particular commands or to this command sequence. Furthermore, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor processes the sequence of commands at least partially concurrently.
[0265] In some embodiments, the graphics processor command sequence 2410 may begin with a pipeline flush command 2412 that causes any active graphics pipeline to complete its currently pending commands. In some embodiments, the 3D pipeline 2422 and the media pipeline 2424 do not operate simultaneously. The pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, the graphics processor's command parser suspends command processing until the active graphics engines complete their pending operations and the associated read caches are invalidated. Optionally, any data marked 'dirty' in the render cache may be flushed to memory. In some embodiments, pipeline flush command 2412 may be used for pipeline synchronization or before placing the graphics processor in a low-power state.
[0266] In some embodiments, pipeline selection command 2413 is used when a command sequence requires the graphics processor to switch explicitly between pipelines. In some embodiments, pipeline selection command 2413 is required only once in an execution context before pipeline commands are issued, unless the context issues commands for both pipelines. In some embodiments, a pipeline flush command 2412 is required immediately before a pipeline switch via pipeline selection command 2413.
[0267] In some embodiments, pipeline control command 2414 configures a graphics pipeline for operation and is used to program the 3D pipeline 2422 and the media pipeline 2424. In some embodiments, pipeline control command 2414 configures the pipeline state of an active pipeline. In one embodiment, pipeline control command 2414 is used for pipeline synchronization and for clearing data from one or more cache memories within an active pipeline before processing a batch of commands.
[0268] In some embodiments, return buffer state command 2416 is used to configure a set of return buffers into which the respective pipelines write data. Some pipeline operations require allocating, selecting, or configuring one or more return buffers into which the operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and perform cross-thread communication. In some embodiments, return buffer state 2416 includes selecting the size and number of return buffers to use for a set of pipeline operations.
[0269] The remaining commands in the command sequence differ based on the active pipeline used for the operation. Based on pipeline determination 2420, the command sequence is tailored either to the 3D pipeline 2422, beginning with 3D pipeline state 2430, or to the media pipeline 2424, beginning with media pipeline state 2440.
[0270] Commands for 3D pipeline state 2430 include 3D state setting commands for: vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables that will be configured before processing 3D primitive commands. The values of these commands are determined at least in part based on the specific 3D API in use. In some embodiments, 3D pipeline state 2430 commands can also selectively disable or bypass specific pipeline components (if those components will not be used).
[0271] In some embodiments, the 3D primitive 2432 command is used to submit 3D primitives to be processed by the 3D pipeline. The commands and associated parameters passed to the graphics processor via the 3D primitive 2432 command are forwarded to a vertex fetch function in the graphics pipeline. The vertex fetch function uses the 3D primitive 2432 command data to generate vertex data structures. The vertex data structures are stored in one or more return buffers. In some embodiments, the 3D primitive 2432 command is used to perform vertex operations on the 3D primitives via vertex shaders. To process the vertex shaders, the 3D pipeline 2422 dispatches shader execution threads to the graphics processor execution units.
[0272] In some embodiments, the 3D pipeline 2422 is triggered by an execute command 2434 or an event. In some embodiments, a register write triggers command execution. In some embodiments, execution is triggered by a 'go' or 'kick' command in the command sequence. In one embodiment, command execution is triggered using a pipeline synchronization command to flush the command sequence through the graphics pipeline. The 3D pipeline performs geometry processing on the 3D primitives. Once the operations are complete, the resulting geometric objects are rasterized, and the pixel engine colors the resulting pixels. Additional commands for controlling pixel shading and pixel back-end operations may also be included for those operations.
[0273] In some embodiments, when performing media operations, the graphics processor command sequence 2410 follows the media pipeline 2424 path. Generally, the specific use and programming of the media pipeline 2424 depend on the media or compute operations to be performed. During media decoding, specific media decoding operations can be offloaded to the media pipeline. In some embodiments, the media pipeline can also be bypassed, and media decoding can be performed wholly or partially using resources provided by one or more general-purpose processing cores. In one embodiment, the media pipeline also includes elements for general-purpose graphics processing unit (GPGPU) operations, in which the graphics processor performs SIMD vector operations using compute shader programs that are not explicitly related to the rendering of graphics primitives.
[0274] In some embodiments, the media pipeline 2424 is configured in a manner similar to that of the 3D pipeline 2422. A set of media pipeline state commands 2440 is dispatched to or placed in a command queue before the media object command 2442. In some embodiments, the media pipeline state commands 2440 include data for configuring the media pipeline elements that will be used to process media objects. This includes data for configuring the video decoding and video encoding logic within the media pipeline (e.g., encoding or decoding modes). In some embodiments, the media pipeline state commands 2440 also support using one or more pointers to "indirect" state elements that contain a batch of state settings.
[0275] In some embodiments, media object command 2442 supplies pointers to media objects to be processed by the media pipeline. The media object includes a memory buffer containing video data to be processed. In some embodiments, all media pipeline states must be valid before issuing media object command 2442. Once the pipeline states are configured and media object command 2442 is queued, media pipeline 2424 is triggered via execution command 2444 or an equivalent execution event (e.g., register write). The output from media pipeline 2424 can then be post-processed by operations provided by 3D pipeline 2422 or media pipeline 2424. In some embodiments, GPGPU operations are configured and executed in a manner similar to media operations.
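The ordering of the command sequence 2410 described above can be summarized in a short Python sketch. It is only an illustration: the command names are hypothetical strings, not actual hardware opcodes, and the sketch assumes that command 2412 and pipeline selection command 2413 are issued only when the target pipeline differs from the currently selected one.

```python
from enum import Enum


class Pipeline(Enum):
    THREE_D = "3d"
    MEDIA = "media"


def build_command_sequence(target, active=None):
    """Return the ordered (hypothetical) command names for one batch of work.

    target: the Pipeline the batch should run on.
    active: the Pipeline currently selected, or None if none is selected yet.
    """
    commands = []
    if active is not target:
        # Command 2412 lets pending work on the active pipeline finish
        # before the explicit switch made by selection command 2413.
        commands.append("PIPELINE_FLUSH")
        commands.append("PIPELINE_SELECT:" + target.value)
    # Commands 2414 and 2416 configure pipeline state and return buffers.
    commands.append("PIPELINE_CONTROL")
    commands.append("RETURN_BUFFER_STATE")
    # Pipeline determination 2420: the rest depends on the active pipeline.
    if target is Pipeline.THREE_D:
        commands += ["3D_PIPELINE_STATE", "3D_PRIMITIVE", "EXECUTE"]
    else:
        commands += ["MEDIA_PIPELINE_STATE", "MEDIA_OBJECT", "EXECUTE"]
    return commands


if __name__ == "__main__":
    for name in build_command_sequence(Pipeline.MEDIA, active=Pipeline.THREE_D):
        print(name)
```

In this sketch, a batch that stays on the already selected pipeline skips the first two commands, which mirrors the statement that the selection command is needed only once per execution context unless both pipelines are used.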
[0276] Graphics software architecture
[0277] Figure 25 shows an exemplary graphics software architecture for a data processing system 2500 according to some embodiments. In some embodiments, the software architecture includes a 3D graphics application 2510, an operating system 2520, and at least one processor 2530. In some embodiments, the processor 2530 includes a graphics processor 2532 and one or more general-purpose processor cores 2534. The graphics application 2510 and the operating system 2520 each execute in the system memory 2550 of the data processing system.
[0278] In some embodiments, the 3D graphics application 2510 includes one or more shader programs, which include shader instructions 2512. The shader language instructions may be in the form of a high-level shader language, such as High-Level Shader Language (HLSL) or OpenGL Shader Language (GLSL). The application also includes executable instructions 2514 in machine language suitable for execution by a general-purpose processor core 2534. The application also includes geometric objects 2516 defined by vertex data.
[0279] In some embodiments, the operating system 2520 is an operating system from Microsoft Corporation, a proprietary Unix-like operating system, or an open-source Unix-like operating system using a variant of the Linux kernel. When the Direct3D API is in use, the operating system 2520 uses a front-end shader compiler 2524 to compile any shader instructions 2512 written in HLSL into a lower-level shader language. This compilation can be just-in-time (JIT) compilation, or the application can pre-compile its shaders. In some embodiments, high-level shaders are compiled into low-level shaders during the compilation of the 3D graphics application 2510.
[0280] In some embodiments, the user-mode graphics driver 2526 includes a back-end shader compiler 2527 for translating shader instructions 2512 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 2512 in GLSL high-level language are passed to the user-mode graphics driver 2526 for compilation. In some embodiments, the user-mode graphics driver 2526 uses operating system kernel-mode functions 2528 to communicate with the kernel-mode graphics driver 2529. In some embodiments, the kernel-mode graphics driver 2529 communicates with the graphics processor 2532 to dispatch commands and instructions.
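The two shader compilation paths described above can be sketched as a small routing function. This is a simplified illustration in Python: the stage names are hypothetical labels built from the reference numerals, and the sketch assumes that the low-level output of the front-end shader compiler 2524 is subsequently handed to the back-end shader compiler 2527, which the description does not state explicitly.

```python
def compilation_stages(api, shader_language):
    """Return the ordered compilation stages for a shader (labels are hypothetical)."""
    stages = []
    if api == "Direct3D" and shader_language == "HLSL":
        # The operating system's front-end compiler lowers HLSL first.
        stages.append("front_end_shader_compiler_2524")
    elif api == "OpenGL" and shader_language == "GLSL":
        # GLSL source is passed directly to the user-mode graphics driver.
        pass
    else:
        raise ValueError("unsupported API/language pair: %s/%s" % (api, shader_language))
    # Assumption: in both cases the user-mode driver's back-end compiler
    # produces the hardware-specific representation.
    stages.append("back_end_shader_compiler_2527")
    return stages


if __name__ == "__main__":
    print(compilation_stages("Direct3D", "HLSL"))
    print(compilation_stages("OpenGL", "GLSL"))
```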
[0281] IP core implementation
[0282] One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium that represents and/or defines logic within an integrated circuit (e.g., a processor). For example, the machine-readable medium may include instructions representing various logic within the processor. When read by a machine, the instructions may cause the machine to fabricate the logic for performing the techniques described herein. Such representations (referred to as "IP cores") are reusable units of logic for an integrated circuit that may be stored on a tangible, machine-readable medium as a hardware model describing the structure of the integrated circuit. The hardware model may be supplied to customers or manufacturing facilities, which load the hardware model onto the fabrication machines that manufacture the integrated circuit. The integrated circuit may be manufactured such that the circuit performs the operations described in association with any of the embodiments described herein.
[0283] Figure 26 is a block diagram illustrating an IP core development system 2600 according to an embodiment, which can be used to manufacture integrated circuits to perform operations. The IP core development system 2600 can be used to generate modular, reusable designs that can be incorporated into larger designs or used to build entire integrated circuits (e.g., SOC integrated circuits). Design facility 2630 can generate a software simulation 2610 of the IP core design using a high-level programming language (e.g., C/C++). Software simulation 2610 can be used to design, test, and verify the behavior of the IP core. A register transfer level (RTL) design 2615 can then be created or synthesized from the simulation model. RTL design 2615 is an abstraction of the behavior of the integrated circuit that models the flow of digital signals between hardware registers, including the associated logic performed using the modeled digital signals. In addition to RTL design 2615, lower-level designs at the logic level or transistor level can also be created, designed, or synthesized. Thus, the specific details of the initial design and simulation can vary.
[0284] The RTL design 2615 or its equivalent can be further synthesized into a hardware model 2620 by the design facility. This hardware model can be in the form of a hardware description language (HDL) or some other representation of physical design data. The HDL can be further simulated or tested to validate the IP core design. The IP core design can be stored in non-volatile memory 2640 (e.g., hard disk, flash memory, or any non-volatile storage medium) for delivery to a third-party manufacturing facility 2665. Alternatively, the IP core design can be transmitted (e.g., via the Internet) via a wired connection 2650 or a wireless connection 2660. The manufacturing facility 2665 can then manufacture an integrated circuit at least partially based on the IP core design. The manufactured integrated circuit can be configured to perform operations according to at least one embodiment described herein.
[0285] Figure 27 is a block diagram illustrating an exemplary system-on-a-chip integrated circuit 2700 according to an embodiment, which can be fabricated using one or more IP cores. The exemplary integrated circuit includes one or more application processors 2705 (e.g., a CPU), at least one graphics processor 2710, and may additionally include an image processor 2715 and/or a video processor 2720, any of which may be modular IP cores from the same or multiple different design facilities. The integrated circuit includes peripheral or bus logic, including a USB controller 2725, a UART controller 2730, an SPI/SDIO controller 2735, and an I²S/I²C controller 2740. Additionally, the integrated circuit may include a display device 2745 coupled to one or more of a High Definition Multimedia Interface (HDMI) controller 2750 and a Mobile Industry Processor Interface (MIPI) display interface 2755. Storage may be provided by a flash memory subsystem 2760 (including flash memory and a flash memory controller). A memory interface may be provided via a memory controller 2765 for accessing SDRAM or SRAM memory devices. Some integrated circuits also include an embedded security engine 2770.
[0286] Additionally, other logic and circuitry may be included in the processor of the integrated circuit 2700, including additional graphics processors / cores, peripheral interface controllers, or general-purpose processor cores.
[0287] Additional notes and examples
[0288] Example 1 may include a system comprising: a camera for acquiring real-world content; a semiconductor package device coupled to the camera, wherein the semiconductor package device includes a substrate and logic coupled to the substrate; wherein the logic includes: a graphics pipeline for generating rendered content; a base layer encoder and a first layer encoder, the base layer encoder for encoding real-world content into a base layer, and the first layer encoder for encoding rendered content into a first non-base layer; a multiplexer for interleaving the base layer and the first non-base layer to obtain a single output signal with mixed reality content; and a transmitter for transmitting the single output signal.
[0289] Example 2 may include a system as described in Example 1, wherein a second-layer encoder is used to encode mapping data into a second non-base layer, and the multiplexer is used to interleave the second non-base layer with the first non-base layer and the base layer.
[0290] Example 3 may include a system as described in Example 2, wherein the first layer encoder and the second layer encoder encode the rendered content and the mapping data into an overlay auxiliary image, and the mapping data distinguishes between the placement of multiple objects in another overlay auxiliary image on another layer.
[0291] Example 4 may include a system as described in Example 2, wherein a third-layer encoder is used to encode fused data into a third non-base layer, and the multiplexer is used to interleave the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
[0292] Example 5 may include a system as described in Example 4, wherein the third-layer encoder encodes the fused data into an overlay auxiliary image, and the fused data represents the transparency of an overlay object from one or more overlay auxiliary images.
[0293] Example 6 may include a system as described in Example 1, wherein the real-world content includes the captured video.
[0294] Example 7 may include a system as described in Example 1, wherein the base layer and the first non-base layer bypass synthesis.
[0295] Example 8 may include a device comprising: a substrate; and logic coupled to the substrate; wherein the logic is implemented in one or more of configurable logic or fixed-function hardware logic, the logic comprising: a base layer encoder and a first layer encoder, the base layer encoder being used to encode real-world content into a base layer, and the first layer encoder being used to encode rendered content into a first non-base layer; and a multiplexer for interleaving the base layer with the first non-base layer to obtain a single output signal with mixed reality content.
[0296] Example 9 may include the device as described in Example 8, wherein a second layer encoder is used to encode the mapping data into a second non-base layer, and the multiplexer is used to interleave the second non-base layer with the first non-base layer and the base layer.
[0297] Example 10 may include a device as described in Example 9, wherein the first layer encoder and the second layer encoder encode the rendered content and the mapping data into an overlay auxiliary image, and the mapping data distinguishes between the placement of objects within another overlay auxiliary image on another layer.
[0298] Example 11 may include the device as described in Example 8, wherein a third-layer encoder is used to encode fused data into a third non-base layer, and the multiplexer is used to interleave the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
[0299] Example 12 may include a device as described in Example 11, wherein the third-layer encoder encodes the fused data into an overlay auxiliary image, and the fused data represents the transparency of an overlay object from one or more overlay auxiliary images.
[0300] Example 13 may include a device as described in Example 8, wherein the real-world content includes captured video.
[0301] Example 14 may include a device as described in Example 8, wherein the base layer and the first non-base layer bypass synthesis.
[0302] Example 15 may include a system comprising: a display for presenting real-world content with a mixed reality overlay; a semiconductor package device coupled to the display, wherein the semiconductor package device includes a substrate and logic coupled to the substrate, wherein the logic includes: a demultiplexer for receiving a single complex signal having a plurality of interleaved signals, and for separating a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal from the single complex signal; a base layer decoder and a first layer decoder, the base layer decoder being configured to decode the base layer signal into real-world content, and the first layer decoder being configured to decode the first non-base layer into a mixed reality content overlay; a second layer decoder and a third layer decoder, the second layer decoder being configured to decode the second non-base layer into a mapping overlay, and the third layer decoder being configured to decode the third non-base layer into a fusion overlay; and a synthesizer for allowing a user to select or deselect one or more overlays and for compositing the selected overlays with the decoded base layer for display.
[0303] Example 16 may include a system as described in Example 15, wherein the real-world content includes the captured video.
[0304] Example 17 may include a system as described in Example 15, wherein the mapping overlay distinguishes between the placement of multiple objects within an overlay auxiliary image on another layer.
[0305] Example 18 may include a system as described in Example 15, wherein the fusion overlay represents the transparency of an overlay object from one or more overlay auxiliary images.
[0306] Example 19 may include a device comprising: a substrate and logic coupled to the substrate; wherein the logic is implemented in one or more of configurable logic or fixed-function hardware logic, the logic comprising: a demultiplexer for receiving a single complex signal having a plurality of interleaved signals, and for separating a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal from the single complex signal; a base layer decoder and a first layer decoder, the base layer decoder for decoding the base layer signal into real-world content, and the first layer decoder for decoding the first non-base layer into a mixed reality content overlay; a second layer decoder and a third layer decoder, the second layer decoder for decoding the second non-base layer into a mapping overlay, and the third layer decoder for decoding the third non-base layer into a fusion overlay; and a synthesizer for allowing a user to select or deselect one or more overlays and for compositing the selected overlays with the decoded base layer for display.
[0307] Example 20 may include a device as described in Example 19, wherein the real-world content includes the captured video.
[0308] Example 21 may include a device as described in Example 19, wherein the mapping overlay distinguishes between the placement of multiple objects within an overlay auxiliary image on another layer.
[0309] Example 22 may include a device as described in Example 19, wherein the fusion overlay represents the transparency of an overlay object from one or more overlay auxiliary images.
[0310] Example 23 may include a method comprising: encoding real-world content into a base layer; encoding rendered content into a first non-base layer; and interleaving the base layer with the first non-base layer to obtain a single output signal having mixed reality content.
[0311] Example 24 may include the method as described in Example 23, further comprising: encoding the mapping data into a second non-base layer; and interleaving the second non-base layer with the first non-base layer and the base layer.
[0312] Example 25 may include the method as described in Example 24, wherein the rendered content is encoded as an overlay auxiliary image, and the mapping data distinguishes between the placement of multiple objects within one or more overlay auxiliary images on another layer.
[0313] Example 26 may include the method described in Example 24, further comprising: encoding fused data into a third non-base layer, and interleaving the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
[0314] Example 27 may include the method described in Example 23, wherein the real-world content includes the captured video.
[0315] Example 28 may include the method as described in Example 23, wherein the base layer and the first non-base layer bypass synthesis.
[0316] Example 29 may include at least one computer-readable medium, including a set of instructions that, when executed by a computing system, cause the computing system to: encode real-world content into a base layer; encode rendered content into a first non-base layer; and interleave the base layer with the first non-base layer to obtain a single output signal having mixed reality content.
[0317] Example 30 may include at least one computer-readable medium as described in Example 29, and further includes instructions that, when executed by the computing system, cause the computing system to: encode the mapping data into a second non-base layer; and to interleave the second non-base layer with the first non-base layer and the base layer.
[0318] Example 31 may include at least one computer-readable medium as described in Example 30, wherein the rendered content is encoded as an overlay auxiliary image, and the mapping data distinguishes between the placement of objects within one or more overlay auxiliary images on another layer.
[0319] Example 32 may include at least one computer-readable medium as described in Example 30, and further includes instructions that, when executed by the computing system, cause the computing system to: encode fused data into a third non-base layer; and to interleave the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
[0320] Example 33 may include at least one computer-readable medium as described in Example 29, wherein the real-world content includes captured video.
[0321] Example 34 may include at least one computer-readable medium as described in Example 29, wherein the base layer and the first non-base layer bypass synthesis.
[0322] Example 35 may include at least one computer-readable medium, including a set of instructions that, when executed by a computing system, cause the computing system to perform the method as described in any one of Examples 23 to 28.
[0323] Example 36 may include an apparatus comprising means for performing the method as described in any one of Examples 23 to 28.
[0324] Example 37 may include a method comprising: receiving a single complex signal, the single complex signal comprising a plurality of encoded signals interleaved together; separating the plurality of encoded signals from the single complex signal into a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal; decoding the base layer signal, the first non-base layer signal, the second non-base layer signal, and the third non-base layer signal; selecting an overlay auxiliary image for a final composite; combining the base layer and the selected overlay auxiliary image to form the final composite; and displaying the final composite.
[0325] Example 38 may include the method as described in Example 37, wherein the final composite is displayed on one or more of a head-mounted display, a laptop computer, a desktop computer, a mobile device, or another display device capable of displaying real-world content with a mixed reality overlay.
[0326] Example 39 may include the method as described in Example 37, wherein the base layer signal represents encoded real-world content, the first non-base layer signal represents encoded mixed reality content using a layer 1 overlay auxiliary image, the second non-base layer signal represents encoded mapping data using a layer 2 overlay auxiliary image, and the third non-base layer signal represents encoded alpha data using a layer 3 overlay auxiliary image.
[0327] Example 40 may include the method as described in Example 37, wherein the final composite includes using mapping data and alpha data to place a mixed reality overlay onto a main image, the main image including real-world content.
[0328] Example 41 may include at least one computer-readable medium, including a set of instructions that, when executed by a computing system, cause the computing system to perform the method as described in any one of Examples 37 to 40.
[0329] Example 42 may include an apparatus comprising means for performing the method as described in any one of Examples 37 to 40.
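The encode, interleave, separate, select, and composite steps recited in the examples above can be illustrated with a toy Python sketch. Everything in it is an assumption made for illustration: frames are flat lists of grayscale pixel values, "encoding" is the identity, the layer numbering constants are hypothetical, the mapping data is modeled as a per-pixel object identifier (0 meaning no overlay object), and the alpha data is modeled as a per-pixel opacity between 0 and 1. It does not represent an actual layered video codec.

```python
from collections import defaultdict

# Hypothetical layer identifiers: base layer plus three non-base layers.
BASE, RENDERED, MAPPING, ALPHA = 0, 1, 2, 3


def interleave(layers):
    """Multiplex per-layer frame lists into one stream of (frame, layer, payload)."""
    stream = []
    for frame in range(len(layers[BASE])):
        for layer_id in sorted(layers):
            stream.append((frame, layer_id, layers[layer_id][frame]))
    return stream


def demultiplex(stream):
    """Separate the single interleaved stream back into per-layer frame lists."""
    layers = defaultdict(list)
    for _frame, layer_id, payload in stream:
        layers[layer_id].append(payload)
    return dict(layers)


def composite(base, overlay, mapping, alpha):
    """Blend overlay pixels onto the base wherever the mapping marks an object."""
    out = []
    for b, o, m, a in zip(base, overlay, mapping, alpha):
        out.append(round(a * o + (1 - a) * b) if m else b)
    return out


def display_frame(layers, frame, show_overlay=True):
    """Return the frame to display; the viewer may deselect the overlay."""
    base = layers[BASE][frame]
    if not show_overlay:
        return list(base)
    return composite(base, layers[RENDERED][frame],
                     layers[MAPPING][frame], layers[ALPHA][frame])


if __name__ == "__main__":
    sent = {
        BASE: [[10, 20, 30, 40]],          # real-world content
        RENDERED: [[200, 200, 200, 200]],  # rendered content
        MAPPING: [[0, 1, 1, 0]],           # where the overlay object sits
        ALPHA: [[0.0, 0.5, 1.0, 1.0]],     # per-pixel opacity
    }
    received = demultiplex(interleave(sent))
    print(display_frame(received, 0))                      # [10, 110, 200, 40]
    print(display_frame(received, 0, show_overlay=False))  # [10, 20, 30, 40]
```

Because the base layer and the overlay layers travel separately and are combined only at display time, the receiving side in this sketch can show or hide the overlay without any change to the transmitted stream.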
[0330] The term "coupled" as used herein may refer to any type of direct or indirect relationship between the components under discussion and may apply to electrical, mechanical, fluid, optical, electromagnetic, electromechanical, or other connections. Furthermore, the terms "first," "second," etc., are used herein for convenience only and, unless otherwise indicated, carry no particular temporal or chronological significance. Additionally, it should be understood that the indefinite articles "a" or "an" carry the meaning of "one or more" or "at least one."
[0331] As used in this application and in the claims, a list of items joined by the term "one or more of" may refer to any combination of the listed items. For example, the phrase "one or more of A, B, and C" means A; B; C; A and B; A and C; B and C; or A, B, and C.
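Reading "any combination of the listed items" as every non-empty subset of the list (an interpretation adopted here for illustration only), the combinations for a three-item list can be enumerated with a short Python sketch. The function name is hypothetical.

```python
from itertools import combinations


def one_or_more_of(items):
    """Return every non-empty combination of the listed items, smallest first."""
    result = []
    for size in range(1, len(items) + 1):
        result.extend(combinations(items, size))
    return result


if __name__ == "__main__":
    for combo in one_or_more_of(["A", "B", "C"]):
        print(" and ".join(combo))
```

For three items this yields seven combinations: three single items, three pairs, and the full set.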
[0332] The embodiments have been described above with reference to specific examples. However, those skilled in the art will understand that various modifications and changes can be made thereto without departing from the broader spirit and scope of the embodiments as set forth in the appended claims. Therefore, the foregoing description and drawings are considered illustrative rather than restrictive.
Claims
1. A system comprising: a camera for capturing real-world content; and a semiconductor package device coupled to the camera, wherein the semiconductor package device includes a substrate and functional execution logic coupled to the substrate, wherein the functional execution logic includes: a graphics pipeline for generating rendered content; a base layer encoder and a first layer encoder, the base layer encoder being used to encode the real-world content into a base layer, and the first layer encoder being used to encode the rendered content into a first non-base layer; a multiplexer for interleaving the base layer with the first non-base layer to obtain a single output signal with mixed reality content; and a transmitter for transmitting the single output signal; wherein a second layer encoder is used to encode mapping data into a second non-base layer, and the multiplexer is used to interleave the second non-base layer with the first non-base layer and the base layer; and wherein the first layer encoder and the second layer encoder encode the rendered content and the mapping data into an overlay auxiliary image, and the mapping data is used to distinguish the placement of multiple objects in another overlay auxiliary image on another layer.
2. The system as claimed in claim 1, wherein a third layer encoder is used to encode fused data into a third non-base layer, and the multiplexer is used to interleave the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
3. The system as claimed in claim 2, wherein the third layer encoder encodes the fused data into an overlay auxiliary image, and the fused data represents the transparency of an overlay object from one or more overlay auxiliary images.
4. The system as claimed in claim 1, wherein the real-world content includes captured video.
5. The system as claimed in claim 1, wherein the base layer and the first non-base layer bypass synthesis.
6. An apparatus comprising: a substrate; and functional execution logic coupled to the substrate, wherein the functional execution logic is implemented in one or more of configurable logic or fixed-function hardware logic, the functional execution logic comprising: a demultiplexer for receiving a single complex signal having multiple interleaved signals, and for separating a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal from the single complex signal; a base layer decoder and a first layer decoder, wherein the base layer decoder is used to decode the base layer signal into real-world content, and the first layer decoder is used to decode the first non-base layer into a mixed reality content overlay; a second layer decoder and a third layer decoder, wherein the second layer decoder is used to decode the second non-base layer into a mapping overlay, and the third layer decoder is used to decode the third non-base layer into a fusion overlay; and a synthesizer for allowing a user to select or deselect one or more overlays and for compositing the selected overlays with the decoded base layer for display; wherein the mapping overlay is used to distinguish the placement of objects within an overlay auxiliary image on another layer.
7. The apparatus as claimed in claim 6, wherein the real-world content includes captured video.
8. The apparatus as claimed in claim 6, wherein the fusion overlay represents the transparency of an overlay object from one or more overlay auxiliary images.
9. A method comprising: encoding real-world content into a base layer; encoding rendered content into a first non-base layer; interleaving the base layer with the first non-base layer to obtain a single output signal with mixed reality content; and encoding mapping data into a second non-base layer and interleaving the second non-base layer with the first non-base layer and the base layer; wherein the rendered content is encoded into an overlay auxiliary image, and the mapping data is used to distinguish the placement of multiple objects within one or more overlay auxiliary images on another layer.
10. The method of claim 9, further comprising: encoding fused data into a third non-base layer; and interleaving the third non-base layer with the second non-base layer, the first non-base layer, and the base layer.
11. The method of claim 9, wherein the real-world content includes captured video.
12. The method of claim 9, wherein the base layer and the first non-base layer bypass synthesis.
13. An apparatus comprising means for performing the method as described in any one of claims 9 to 12.
14. A method comprising: Receive a single complex signal, the single complex signal comprising multiple encoded signals interleaved together; The plurality of encoded signals from the single complex signal are separated into a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal, wherein the base layer signal is obtained by encoding real-world content, the first non-base layer signal is obtained by encoding rendered content generated by the graphics pipeline, the second non-base layer signal is obtained by encoding mapping data, and the third non-base layer signal is obtained by encoding fused data. Decode the base layer signal, the first non-base layer signal, the second non-base layer signal, and the third non-base layer signal; Select the overlay auxiliary image to be used for the final synthesis, wherein the overlay auxiliary image is selected from the group consisting of the first non-base layer signal, the second non-base layer signal and the third non-base layer signal; The base layer and the selected overlay auxiliary image are combined together to form the final composite; as well as The final synthesis is shown. The mapping data is used to distinguish the placement of multiple objects within one or more overlaid auxiliary images on another layer.
15. The method of claim 14, wherein the final composite is displayed on one or more of a head-mounted display, a laptop computer, a desktop computer, a mobile device, or another display device capable of displaying real-world content with mixed reality as an overlay.
16. The method of claim 14, wherein the base layer signal represents encoded real-world content, the first non-base layer signal represents encoded mixed reality content using a layer 1 overlay auxiliary image, the second non-base layer signal represents encoded mapping data using a layer 2 overlay auxiliary image, and the third non-base layer signal represents encoded α data using a layer 3 overlay auxiliary image, wherein the α data represents the transparency of the overlay auxiliary images.
17. The method of claim 14, wherein the final composite involves using the mapping data and α data to overlay mixed reality onto a main image that includes the real-world content.
18. An apparatus comprising means for performing the method as claimed in any one of claims 14 to 17.
19. A system comprising: a display to present real-world content with mixed reality overlays; and a semiconductor packaging apparatus coupled to the display, wherein the semiconductor packaging apparatus includes: a substrate; and functional execution logic coupled to the substrate, wherein the functional execution logic includes: a demultiplexer to receive a single complex signal having multiple interleaved signals and to separate a base layer signal, a first non-base layer signal, a second non-base layer signal, and a third non-base layer signal from the single complex signal; a base layer decoder and a first layer decoder, wherein the base layer decoder is to decode the base layer signal into real-world content, and the first layer decoder is to decode the first non-base layer signal into a mixed reality content overlay; a second layer decoder and a third layer decoder, wherein the second layer decoder is to decode the second non-base layer signal into a mapping overlay, and the third layer decoder is to decode the third non-base layer signal into a fused overlay; and a synthesizer to allow a user to select or deselect one or more overlays and to composite the selected overlays with the decoded base layer for display, wherein mapping data is used to distinguish the placement of multiple objects within one or more overlay auxiliary images on another layer.
20. The system of claim 19, wherein the real-world content includes captured video.
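
As a non-limiting illustration of the layered flow recited in claims 9, 14, and 17, the following minimal Python sketch interleaves four layers into one signal, separates them again, and composites only the selected overlay objects onto the base layer. It is not the claimed implementation. The numeric layer ids, the `Packet` type, the frame-by-frame interleaving order, the representation of mapping data as one object id per pixel, and the treatment of α data as 0 to 255 opacity are all assumptions made for the example, and raw bytes stand in for encoded frames in place of a real video codec.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Sequence, Set

# Hypothetical layer ids; the claims name the layers but assign no numeric ids.
BASE, RENDERED, MAPPING, ALPHA = 0, 1, 2, 3


@dataclass(frozen=True)
class Packet:
    layer_id: int   # layer the packet belongs to
    payload: bytes  # stand-in for one encoded frame of that layer


def interleave(layers: Dict[int, Sequence[bytes]]) -> List[Packet]:
    """Merge per-layer frame lists into one signal, frame by frame (assumed order)."""
    frame_count = max(len(frames) for frames in layers.values())
    signal: List[Packet] = []
    for frame in range(frame_count):
        for layer_id in sorted(layers):
            if frame < len(layers[layer_id]):
                signal.append(Packet(layer_id, layers[layer_id][frame]))
    return signal


def demultiplex(signal: Iterable[Packet]) -> Dict[int, List[bytes]]:
    """Separate the single signal back into per-layer frame lists."""
    layers: Dict[int, List[bytes]] = {}
    for packet in signal:
        layers.setdefault(packet.layer_id, []).append(packet.payload)
    return layers


def composite(base: Sequence[int], overlay: Sequence[int], mapping: Sequence[int],
              alpha: Sequence[int], selected: Set[int]) -> List[int]:
    """Blend overlay pixels of the selected objects onto the base image."""
    out = []
    for b, o, obj, a in zip(base, overlay, mapping, alpha):
        if obj in selected:
            out.append((o * a + b * (255 - a)) // 255)  # integer alpha blend
        else:
            out.append(b)  # object deselected or absent: keep the real-world pixel
    return out


# One frame per layer, four single-channel pixels each (illustrative values).
frames = {
    BASE: [bytes([100, 100, 100, 100])],
    RENDERED: [bytes([200, 200, 0, 50])],
    MAPPING: [bytes([0, 1, 1, 2])],        # assumed: object id per pixel, 0 = none
    ALPHA: [bytes([255, 255, 128, 255])],  # assumed: 255 = fully opaque overlay
}
decoded = demultiplex(interleave(frames))
result = composite(decoded[BASE][0], decoded[RENDERED][0],
                   decoded[MAPPING][0], decoded[ALPHA][0], selected={1})
print(result)  # [100, 200, 49, 100]
```

Because the layers remain separate until `composite` runs, passing a different `selected` set (for example `{1, 2}` or an empty set) changes what is displayed without re-encoding anything, which is the flexibility the claims attribute to bypassing compositing at the encoder.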