Apparatus and method for motion blur using dynamic quantization grid
1. A method, comprising:
generating a bounding volume hierarchy (BVH) comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes;
determining a motion value of a quantization grid based on motion values of the one or more child nodes of the at least one BVH node; and
mapping the linear boundaries of each of the child nodes to the quantization grid.
2. The method of claim 1, wherein mapping the linear boundaries of each of the child nodes further comprises:
obtaining one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and
deriving quantization boundaries for the one or more child nodes from the one or more residual motion values.
3. The method of claim 2, wherein the one or more child nodes comprise primitives.
4. The method of claim 3, wherein the primitives are in motion.
5. The method of claim 4, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
6. The method of any one of claims 3 to 5, wherein the primitives comprise triangles.
7. The method of any of claims 2 to 6, further comprising:
performing ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of rays.
8. The method of claim 7, further comprising:
executing one or more shaders to perform a graphics operation with respect to the one or more intersection points.
9. A machine-readable medium having program code stored thereon, which when executed by a machine, causes the machine to perform operations comprising:
generating a bounding volume hierarchy (BVH) comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes;
determining a motion value of a quantization grid based on motion values of the one or more child nodes of the at least one BVH node; and
mapping the linear boundaries of each of the child nodes to the quantization grid.
10. The machine-readable medium of claim 9, wherein mapping the linear boundaries of each of the child nodes further comprises:
obtaining one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and
deriving quantization boundaries for the one or more child nodes from the one or more residual motion values.
11. The machine-readable medium of claim 10, wherein the one or more child nodes comprise primitives.
12. The machine-readable medium of claim 11, wherein the primitives are in motion.
13. The machine-readable medium of claim 12, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
14. The machine-readable medium of any of claims 11 to 13, wherein the primitives comprise triangles.
15. The machine-readable medium of any of claims 10-14, further comprising program code to cause the machine to:
perform ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of rays.
16. The machine-readable medium of claim 15, further comprising program code to cause the machine to:
execute one or more shaders to perform a graphics operation with respect to the one or more intersection points.
17. A graphics processor, comprising:
a Bounding Volume Hierarchy (BVH) generator to construct a BVH comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes; and
motion blur processing hardware logic to determine motion values for a quantization grid based on motion values of the one or more child nodes of the at least one BVH node and to map linear boundaries of each of the child nodes to the quantization grid.
18. The graphics processor of claim 17, wherein to map linear boundaries of each of the child nodes, the motion blur processing hardware logic is to: obtain one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and derive a quantization boundary for the one or more child nodes from the one or more residual motion values.
19. The graphics processor of claim 18, wherein the one or more child nodes comprise primitives.
20. The graphics processor of claim 19, wherein the primitives are in motion.
21. The graphics processor of claim 20, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
22. The graphics processor of any one of claims 19 to 21, wherein the primitives comprise triangles.
23. The graphics processor of any of claims 18 to 22, further comprising:
ray traversal and intersection hardware logic to perform ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of a ray.
24. The graphics processor of claim 23, further comprising:
a plurality of execution circuits to execute one or more shaders to perform graphics operations with respect to the one or more intersections.
25. An apparatus, comprising:
means for generating a bounding volume hierarchy (BVH) comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes;
means for determining a motion value of a quantization grid based on motion values of the one or more child nodes of the at least one BVH node; and
means for mapping linear boundaries of each of the child nodes to the quantization grid.
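For illustration only, the following minimal C++ sketch shows one way the steps recited in claims 1 and 2 could fit together: a grid motion value is derived from the children's motion values, residuals are obtained by subtraction, and quantization boundaries are derived from the residuals. Every name is hypothetical, and the midpoint heuristic for the grid motion value is an assumption, not the claimed implementation.

```cpp
#include <algorithm>
#include <array>
#include <cstdint>
#include <vector>

// Hypothetical names throughout; a sketch of claims 1-2, not the patented design.
struct ChildBounds {
    std::array<float, 3> lo, hi;               // child box at shutter open (t = 0)
    std::array<float, 3> motion_lo, motion_hi; // per-axis motion of each bound over the shutter
};

struct QuantizationGrid {
    std::array<float, 3> motion;      // motion value of the grid (claim 1)
    std::array<float, 3> base, scale; // grid placement used for 8-bit quantization
};

// Claim 1: determine the grid's motion value from the children's motion values.
// The midpoint of the children's motion range is one plausible choice (an assumption).
QuantizationGrid derive_grid_motion(const std::vector<ChildBounds>& children) {
    QuantizationGrid grid{};
    for (int axis = 0; axis < 3; ++axis) {
        float m_min = children[0].motion_lo[axis], m_max = m_min;
        for (const ChildBounds& c : children) {
            m_min = std::min({m_min, c.motion_lo[axis], c.motion_hi[axis]});
            m_max = std::max({m_max, c.motion_lo[axis], c.motion_hi[axis]});
        }
        grid.motion[axis] = 0.5f * (m_min + m_max);
    }
    return grid;
}

// Claim 2: residual motion = child motion minus grid motion; the child's
// quantization boundaries are then derived from the residuals.
float residual_motion(float child_motion, float grid_motion) {
    return child_motion - grid_motion;
}

uint8_t quantize(float value, float base, float scale) {
    float t = (value - base) * scale;  // map onto [0, 1] relative to the grid
    return static_cast<uint8_t>(std::clamp(t, 0.0f, 1.0f) * 255.0f);
}
```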
Background
Path tracing is a prior-art technique for rendering realistic images for special effects in movies, animated films, and professional visualizations. Generating these realistic images requires computing a physical simulation of light transport in a virtual 3D scene, using ray tracing as a tool for visibility queries. High-performance implementations of these visibility queries require the construction of a 3D hierarchy over the scene primitives (typically triangles) in a pre-processing stage. This hierarchy allows the ray tracing step to quickly determine the closest intersection point between a ray and a primitive (triangle).
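For background, the per-node visibility query that such a hierarchy accelerates is a ray/axis-aligned-box test. The sketch below is the textbook slab method, with illustrative names; it is not the hardware traversal described later in this document.

```cpp
#include <algorithm>

// Textbook ray/axis-aligned-box (slab) test for one BVH node.
struct Ray {
    float org[3];
    float inv_dir[3]; // 1 / direction, precomputed per ray
    float t_min, t_max;
};

bool hit_box(const Ray& r, const float lo[3], const float hi[3]) {
    float t0 = r.t_min, t1 = r.t_max;
    for (int axis = 0; axis < 3; ++axis) {
        float t_near = (lo[axis] - r.org[axis]) * r.inv_dir[axis];
        float t_far  = (hi[axis] - r.org[axis]) * r.inv_dir[axis];
        if (t_near > t_far) std::swap(t_near, t_far);
        t0 = std::max(t0, t_near);
        t1 = std::min(t1, t_far);
    }
    return t0 <= t1; // the box overlaps the ray's active interval
}
```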
Motion blur is an important feature in the photorealistic rendering of animations, in which the effect of objects moving in a scene while the camera shutter is open is simulated. Simulating this effect blurs moving objects, which makes the animation appear smooth when played back. Rendering motion blur requires randomly sampling the time of each evaluated ray path, and averaging over many such paths produces the desired blurring effect. To implement this technique, the underlying ray tracing engine must be able to trace rays through the scene at any time within the camera shutter interval. This requires encoding the motion of geometric objects within the spatial acceleration structure used for ray tracing.
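As a rough illustration of this sampling, the sketch below draws a random shutter time per ray and evaluates linearly interpolated vertex positions at that time; linear motion over a [0, 1] shutter interval, and all names, are assumptions made for exposition.

```cpp
#include <random>

struct Vec3 { float x, y, z; };

static Vec3 lerp(const Vec3& a, const Vec3& b, float t) {
    return { a.x + t * (b.x - a.x),
             a.y + t * (b.y - a.y),
             a.z + t * (b.z - a.z) };
}

// Each ray path is evaluated at a randomly sampled time within the shutter interval...
float sample_shutter_time(std::mt19937& rng) {
    return std::uniform_real_distribution<float>(0.0f, 1.0f)(rng);
}

// ...and the moving geometry is evaluated at that time.
Vec3 vertex_at_time(const Vec3& v_begin, const Vec3& v_end, float ray_time) {
    return lerp(v_begin, v_end, ray_time);
}
```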
Drawings
A better understanding of the present invention can be obtained from the following detailed description in conjunction with the following drawings, in which:
FIG. 1 is a block diagram of an embodiment of a computer system with a processor having one or more processor cores and a graphics processor;
FIG. 2 is a block diagram of one embodiment of a processor having one or more processor cores, an integrated memory controller, and an integrated graphics processor;
FIG. 3 is a block diagram of one embodiment of a graphics processor, which may be a discrete graphics processing unit or may be a graphics processor integrated with multiple processing cores;
FIG. 4 is a block diagram of an embodiment of a graphics processing engine for a graphics processor;
FIG. 5 is a block diagram of another embodiment of a graphics processor;
FIGS. 6A-B illustrate examples of execution circuitry and logic;
FIG. 7 illustrates a graphics processor execution unit instruction format, according to an embodiment;
FIG. 8 is a block diagram of another embodiment of a graphics processor including a graphics pipeline, a media pipeline, a display engine, thread execution logic, and a render output pipeline;
FIG. 9A is a block diagram that illustrates a graphics processor command format, according to an embodiment;
FIG. 9B is a block diagram that illustrates a graphics processor command sequence, according to an embodiment;
FIG. 10 illustrates an exemplary graphics software architecture for a data processing system, according to an embodiment;
FIG. 11 illustrates an exemplary IP core development system that may be used to fabricate integrated circuits and exemplary package assemblies;
FIG. 12 illustrates an exemplary system-on-chip integrated circuit that may be fabricated using one or more IP cores, according to an embodiment;
FIGS. 13A-B illustrate an exemplary graphics processor of a system-on-chip integrated circuit that may be fabricated using one or more IP cores;
FIGS. 14A-B illustrate exemplary graphics processor architectures;
FIG. 15 is an illustration of a bounding volume, according to an embodiment;
FIGS. 16A-B illustrate representations of a bounding volume hierarchy;
FIG. 17 is an illustration of a ray-box intersection test according to an embodiment;
FIG. 18 is a block diagram illustrating an exemplary quantized BVH node, according to an embodiment;
FIG. 19 is a block diagram of a compound floating point data block for use with a quantized BVH node, according to a further embodiment;
FIG. 20 illustrates ray-box intersections using quantization values to define child bounding boxes relative to parent bounding boxes, in accordance with an embodiment;
FIG. 21 is a flow diagram of BVH decompression and traversal logic, according to an embodiment;
FIG. 22 is an illustration of an exemplary two-dimensional shared plane bounding box;
FIG. 23 is a flow diagram of shared plane BVH logic, according to an embodiment;
FIG. 24 is a block diagram of a computing device including a graphics processor with bounding volume hierarchy logic, according to an embodiment;
FIG. 25 illustrates a device or system upon which an embodiment of the invention may be implemented;
FIG. 26 illustrates one embodiment of an apparatus for constructing, compressing, and decompressing nodes of a bounding volume hierarchy;
FIG. 27 illustrates one embodiment in which leaf nodes are compressed by replacing pointers with offsets;
FIG. 28 shows code associated with three BVH node types;
FIG. 29 compares an embodiment of the present invention with a prior implementation with respect to memory consumption (in MB) and total rendering performance (in fps);
FIG. 30 compares a prior implementation with an embodiment of the present invention with respect to memory consumption (in MB), traversal statistics, and overall performance;
FIG. 31 shows a naïve extension of quantized bounding boxes to motion-blurred triangles;
FIG. 32 illustrates one embodiment of the present invention using smaller quantization grids at the start and end times;
FIG. 33 illustrates one embodiment of an architecture including motion blur processing hardware/logic; and
FIG. 34 illustrates a method according to one embodiment of the invention.
Detailed Description
In the following description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention described below. It will be apparent, however, to one skilled in the art that embodiments of the invention may be practiced without some of these specific details. In other instances, well-known structures and devices are shown in block diagram form in order to avoid obscuring the underlying principles of embodiments of the present invention.
Exemplary graphics processor architecture and data types
Overview of the System
Fig. 1 is a block diagram of a processing system 100 according to an embodiment. In various embodiments, system 100 includes one or more processors 102 and one or more graphics processors 108, and may be a single-processor desktop system, a multi-processor workstation system, or a server system having a large number of processors 102 or processor cores 107. In one embodiment, system 100 is a processing platform incorporated within a system on a chip (SoC) integrated circuit for use in a mobile, handheld, or embedded device.
In one embodiment, the system 100 may comprise or be incorporated into a server-based gaming platform, a gaming console (including a gaming and media console, a mobile gaming console, a handheld gaming console, or an online gaming console). In some embodiments, the system 100 is a mobile phone, a smart phone, a tablet computing device, or a mobile internet device. The processing system 100 may also include, be coupled with, or integrated within a wearable device, such as a smart watch wearable device, a smart eyewear device, an augmented reality device, or a virtual reality device. In some embodiments, the processing system 100 is a television or set-top box device having one or more processors 102 and a graphical interface generated by one or more graphics processors 108.
In some embodiments, the one or more processors 102 each include one or more processor cores 107 to process instructions that, when executed, perform operations for system and user software. In some embodiments, each of the one or more processor cores 107 is configured to process a particular instruction set 109. In some embodiments, the instruction set 109 may facilitate Complex Instruction Set Computing (CISC), Reduced Instruction Set Computing (RISC), or computing via Very Long Instruction Words (VLIW). Multiple processor cores 107 may each process a different instruction set 109, which instruction set 109 may include instructions to facilitate emulation of other instruction sets. Processor core 107 may also include other processing devices, such as a Digital Signal Processor (DSP).
In some embodiments, processor 102 includes cache memory 104. Depending on the architecture, the processor 102 can have a single internal cache or multiple levels of internal cache. In some embodiments, the cache memory is shared among various components of the processor 102. In some embodiments, the processor 102 also uses an external cache (e.g., a Level 3 (L3) cache or Last Level Cache (LLC)) (not shown), which may be shared among the processor cores 107 using known cache coherency techniques. A register file 106 is additionally included in the processor 102 and may include different types of registers (e.g., integer registers, floating point registers, status registers, and an instruction pointer register) for storing different types of data. Some registers may be general-purpose registers, while other registers may be specific to the design of the processor 102.
In some embodiments, one or more processors 102 are coupled with one or more interface buses 110 to transmit communication signals, such as address, data, or control signals, between the processors 102 and other components in the system 100. In one embodiment, the interface bus 110 can be a processor bus, such as a version of a Direct Media Interface (DMI) bus. However, processor buses are not limited to the DMI bus and may include one or more Peripheral Component Interconnect buses (e.g., PCI, PCI Express), memory buses, or other types of interface buses. In one embodiment, the processor(s) 102 include an integrated memory controller 116 and a platform controller hub 130. The memory controller 116 facilitates communication between memory devices and other components of the system 100, while the Platform Controller Hub (PCH) 130 provides connections to I/O devices via a local I/O bus.
Memory device 120 can be a Dynamic Random Access Memory (DRAM) device, a Static Random Access Memory (SRAM) device, a flash memory device, a phase-change memory device, or some other memory device having suitable performance to serve as process memory. In one embodiment, the memory device 120 can operate as system memory for the system 100, to store data 122 and instructions 121 for use when the one or more processors 102 execute an application or process. The memory controller 116 is also coupled with an optional external graphics processor 112, which may communicate with the one or more graphics processors 108 in the processors 102 to perform graphics and media operations.
In some embodiments, a display device 111 can be connected to the processor(s) 102. The display device 111 can be one or more of an internal display device as in a mobile electronic device or laptop device or an external display device attached via a display interface (e.g., DisplayPort, etc.). In one embodiment, display device 111 can be a Head Mounted Display (HMD), such as a stereoscopic display device for use in Virtual Reality (VR) applications or Augmented Reality (AR) applications.
In some embodiments, the platform controller hub 130 enables peripherals to connect to the memory device 120 and the processor 102 via a high-speed I/O bus. The I/O peripherals include, but are not limited to, an audio controller 146, a network controller 134, a firmware interface 128, a wireless transceiver 126, a touch sensor 125, and a data storage device 124 (e.g., hard drive, flash memory, etc.). The data storage device 124 can be connected via a storage interface (e.g., SATA) or via a peripheral bus, such as a Peripheral Component Interconnect bus (e.g., PCI, PCI Express). The touch sensor 125 can include a touch screen sensor, a pressure sensor, or a fingerprint sensor. The wireless transceiver 126 can be a Wi-Fi transceiver, a Bluetooth transceiver, or a mobile network transceiver such as a 3G, 4G, or Long Term Evolution (LTE) transceiver. The firmware interface 128 enables communication with system firmware and can be, for example, a Unified Extensible Firmware Interface (UEFI). The network controller 134 may implement a network connection to a wired network. In some embodiments, a high-performance network controller (not shown) is coupled to the interface bus 110. The audio controller 146, in one embodiment, is a multi-channel high definition audio controller. In one embodiment, the system 100 includes an optional legacy I/O controller 140 for coupling legacy (e.g., Personal System 2 (PS/2)) devices to the system. The platform controller hub 130 can also connect to one or more Universal Serial Bus (USB) controllers 142 to connect input devices, such as a keyboard and mouse 143 combination, a camera 144, or other USB input devices.
It will be appreciated that the illustrated system 100 is exemplary and not limiting, as other types of data processing systems configured in different ways may also be used. For example, the instances of the memory controller 116 and the platform controller hub 130 may be integrated into a separate external graphics processor, such as the external graphics processor 112. In one embodiment, the platform controller hub 130 and/or the memory controller 116 may be external to the one or more processors 102. For example, the system 100 can include an external memory controller 116 and a platform controller hub 130, which may be configured as a memory controller hub and a peripheral controller hub within a system chipset in communication with the processor(s) 102.
FIG. 2 is a block diagram of an embodiment of a processor 200 having one or more processor cores 202A-202N, an integrated memory controller 214, and an integrated graphics processor 208. Those elements of fig. 2 having the same reference numbers (or names) as the elements of any other figure herein may operate or function in any manner similar to that described elsewhere herein, but are not limited to such. Processor 200 may include additional cores up to and including additional core 202N, represented by the dashed box. Each of the processor cores 202A-202N includes one or more internal cache units 204A-204N. In some embodiments, each processor core may also access one or more shared cache units 206.
Internal cache units 204A-204N and shared cache unit 206 represent cache levels within processor 200. The cache memory hierarchy may include at least one level of instruction and data cache within each processor core, as well as one or more levels of shared mid-level cache, such as level 2 (L2), level 3 (L3), level 4 (L4), or other levels of cache, with the highest level of cache preceding external memory classified as LLC. In some embodiments, cache coherency logic maintains coherency between the various cache units 206 and 204A-204N.
In some embodiments, processor 200 may also include a set of one or more bus controller units 216 and a system agent core 210. The one or more bus controller units 216 manage a set of peripheral buses, such as one or more PCI or PCI Express buses. The system agent core 210 provides management functionality for the various processor components. In some embodiments, the system agent core 210 includes one or more integrated memory controllers 214 to manage access to various external memory devices (not shown).
In some embodiments, one or more of the processor cores 202A-202N include support for simultaneous multithreading. In such embodiments, the system agent core 210 includes components for coordinating and operating the cores 202A-202N during multi-threaded processing. The system agent core 210 may additionally include a Power Control Unit (PCU), which includes logic and components to regulate the power states of the processor cores 202A-202N and the graphics processor 208.
In some embodiments, the processor 200 additionally includes a graphics processor 208 to perform graphics processing operations. In some embodiments, the graphics processor 208 is coupled with the set of shared cache units 206 and the system agent core 210, which includes the one or more integrated memory controllers 214. In some embodiments, the system agent core 210 also includes a display controller 211 to drive graphics processor output to one or more coupled displays. In some embodiments, the display controller 211 may also be a separate module coupled with the graphics processor via at least one interconnect, or may be integrated within the graphics processor 208.
In some embodiments, ring-based interconnect unit 212 is used to couple internal components of processor 200. However, alternative interconnect elements may be used, such as point-to-point interconnects, switched interconnects, or other techniques, including techniques known in the art. In some embodiments, the graphics processor 208 is coupled with the ring interconnect 212 via an I/O link 213.
Exemplary I/O link 213 represents at least one of a plurality of kinds of I/O interconnects, including on-package I/O interconnects that facilitate communication between various processor components and a high performance embedded memory module 218, such as an eDRAM module. In some embodiments, each of the processor cores 202A-202N and the graphics processor 208 use the embedded memory module 218 as a shared last level cache.
In some embodiments, processor cores 202A-202N are homogeneous cores that execute the same instruction set architecture. In another embodiment, the processor cores 202A-202N are heterogeneous in Instruction Set Architecture (ISA), wherein one or more of the processor cores 202A-202N execute a first instruction set, while at least one of the other cores executes a subset of the first instruction set or a different instruction set. In one embodiment, the processor cores 202A-202N are heterogeneous in micro-architecture, with one or more cores having relatively higher power consumption coupled with one or more power cores having lower power consumption. Additionally, processor 200 can be implemented on one or more chips or as an SoC integrated circuit having the illustrated components, among other components.
Fig. 3 is a block diagram of a graphics processor 300, which graphics processor 300 may be a discrete graphics processing unit or may be a graphics processor integrated with multiple processing cores. In some embodiments, the graphics processor communicates via a memory mapped I/O interface to registers on the graphics processor and with commands placed in processor memory. In some embodiments, graphics processor 300 includes a memory interface 314 to access memory. Memory interface 314 can be an interface to local memory, one or more internal caches, one or more shared external caches, and/or to system memory.
In some embodiments, graphics processor 300 also includes a display controller 302 to drive display output data to a display device 320. The display controller 302 includes hardware for one or more overlay planes for displaying and combining multiple layers of video or user interface elements. The display device 320 can be an internal or external display device. In one embodiment, display device 320 is a head mounted display device, such as a Virtual Reality (VR) display device or an Augmented Reality (AR) display device. In some embodiments, graphics processor 300 includes a video codec engine 306 to encode, decode, or transcode media to, from, or between one or more media encoding formats, including, but not limited to, Moving Picture Experts Group (MPEG) formats such as MPEG-2, Advanced Video Coding (AVC) formats such as H.264/MPEG-4 AVC, the Society of Motion Picture and Television Engineers (SMPTE) 421M/VC-1, and Joint Photographic Experts Group (JPEG) formats such as JPEG and Motion JPEG (MJPEG).
In some embodiments, graphics processor 300 includes a block image transfer (BLIT) engine 304 to perform two-dimensional (2D) rasterizer operations, including, for example, bit boundary block transfers. However, in one embodiment, 2D graphics operations are performed using one or more components of a Graphics Processing Engine (GPE) 310. In some embodiments, GPE 310 is a compute engine for performing graphics operations including three-dimensional (3D) graphics operations and media operations.
In some embodiments, GPE 310 includes a 3D pipeline 312 for performing 3D operations, such as rendering three-dimensional images and scenes using processing functions that act on 3D primitive shapes (e.g., rectangles, triangles, etc.). The 3D pipeline 312 includes programmable and fixed function elements that perform various tasks within the elements and/or spawn execution threads to the 3D/media subsystem 315. While the 3D pipeline 312 can be used to perform media operations, embodiments of the GPE 310 also include a media pipeline 316 that is specifically used to perform media operations, such as video post-processing and image enhancement.
In some embodiments, media pipeline 316 includes fixed-function or programmable logic units to perform one or more specialized media operations, such as video decode acceleration, video de-interlacing, and video encode acceleration, in place of or on behalf of the video codec engine 306. In some embodiments, media pipeline 316 additionally includes a thread spawning unit to spawn threads for execution on the 3D/media subsystem 315. The spawned threads perform computations for the media operations on one or more graphics execution units included in the 3D/media subsystem 315.
In some embodiments, 3D/media subsystem 315 includes logic for executing threads spawned by 3D pipeline 312 and media pipeline 316. In one embodiment, the pipeline sends thread execution requests to the 3D/media subsystem 315, the 3D/media subsystem 315 including thread dispatch logic for arbitrating and dispatching various requests to available thread execution resources. The execution resources include an array of graphics execution units to process 3D and media threads. In some embodiments, 3D/media subsystem 315 includes one or more internal caches for thread instructions and data. In some embodiments, the subsystem further includes a shared memory including registers and addressable memory to share data between the threads and to store output data.
Graphics processing engine
FIG. 4 is a block diagram of a graphics processing engine 410 of a graphics processor, according to some embodiments. In one embodiment, the Graphics Processing Engine (GPE) 410 is a version of the GPE 310 shown in FIG. 3. The elements of FIG. 4 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. For example, the 3D pipeline 312 and the media pipeline 316 of FIG. 3 are shown. The media pipeline 316 is optional in some embodiments of the GPE 410 and may not be explicitly included within the GPE 410. For example, and in at least one embodiment, a separate media and/or image processor is coupled to the GPE 410.
In some embodiments, GPE 410 is coupled with or includes a command streamer 403, which provides a command stream to the 3D pipeline 312 and/or media pipeline 316. In some embodiments, command streamer 403 is coupled with memory, which can be system memory, or one or more of internal cache memory and shared cache memory. In some embodiments, command streamer 403 receives commands from the memory and sends the commands to the 3D pipeline 312 and/or media pipeline 316. The commands are directives fetched from a ring buffer, which stores commands for the 3D pipeline 312 and the media pipeline 316. In one embodiment, the ring buffer can additionally include batch command buffers storing batches of multiple commands. The commands for the 3D pipeline 312 can also include references to data stored in memory, such as, but not limited to, vertex and geometry data for the 3D pipeline 312 and/or image data and memory objects for the media pipeline 316. The 3D pipeline 312 and the media pipeline 316 process the commands and data by performing operations via logic within the respective pipelines or by dispatching one or more execution threads to a graphics core array 414. In one embodiment, the graphics core array 414 includes one or more blocks of graphics cores (e.g., graphics core(s) 415A, graphics core(s) 415B), each block including one or more graphics cores. Each graphics core includes a set of graphics execution resources that includes general-purpose and graphics-specific execution logic to perform graphics and compute operations, as well as fixed-function texture processing and/or machine learning and artificial intelligence acceleration logic.
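As a software analogy only (the command streamer 403 is fixed-function hardware, and these names are hypothetical), a ring-buffer command fetch of the kind described above might look like:

```cpp
#include <cstddef>
#include <cstdint>

struct CommandRing {
    const uint32_t* cmds; // backing storage for queued commands
    std::size_t size;     // capacity, in command words
    std::size_t head;     // next command to fetch
    std::size_t tail;     // one past the last command written
};

// Fetch the next command, if any, advancing the read pointer with wrap-around.
bool fetch_command(CommandRing& ring, uint32_t& out) {
    if (ring.head == ring.tail) return false; // ring is empty
    out = ring.cmds[ring.head];
    ring.head = (ring.head + 1) % ring.size;  // wrap: the buffer is a ring
    return true;
}
```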
In various embodiments, 3D pipeline 312 includes fixed function and programmable logic to process one or more shader programs (such as vertex shaders, geometry shaders, pixel shaders, fragment shaders, compute shaders, or other shader programs) by processing the instructions and dispatching execution threads to the graphics core array 414. The graphics core array 414 provides a unified block of execution resources for use in processing these shader programs. Multi-purpose execution logic (e.g., execution units) within the graphics core(s) 415A-415B of the graphics core array 414 includes support for various 3D API shader languages and can execute multiple simultaneous execution threads associated with multiple shaders.
In some embodiments, graphics core array 414 also includes execution logic to perform media functions, such as video and/or image processing. In one embodiment, the execution unit additionally includes general purpose logic that is programmable to perform parallel general purpose computing operations in addition to graphics processing operations. The general purpose logic is capable of performing processing operations in parallel or in conjunction with the general purpose logic within the processor core(s) 107 of fig. 1 or cores 202A-202N as in fig. 2.
Output data generated by threads executing on the graphics core array 414 can be output to memory in a unified return buffer (URB) 418. The URB 418 can store data for multiple threads. In some embodiments, the URB 418 may be used to send data between different threads executing on the graphics core array 414. In some embodiments, the URB 418 may additionally be used for synchronization between threads on the graphics core array and fixed function logic within the shared function logic 420.
In some embodiments, the graphics core array 414 is scalable such that the array includes a variable number of graphics cores each having a variable number of execution units based on the target power and performance level of the GPE 410. In one embodiment, the execution resources are dynamically scalable such that the execution resources may be enabled or disabled as needed.
Graphics core array 414 is coupled to shared function logic 420, which shared function logic 420 includes a plurality of resources shared between graphics cores in the graphics core array. The shared function within shared function logic 420 is a hardware logic unit that provides dedicated supplemental functionality to graphics core array 414. In various embodiments, shared function logic 420 includes, but is not limited to, sampler 421, math 422, and inter-thread communication (ITC) 423 logic. Additionally, some embodiments implement one or more caches 425 within shared function logic 420.
A shared function is implemented where the demand for a given specialized function is insufficient for inclusion within the graphics core array 414. Instead, a single instantiation of that specialized function is implemented as a stand-alone entity in the shared function logic 420 and shared among the execution resources within the graphics core array 414. The precise set of functions that are shared between the graphics core array 414 and included within the graphics core array 414 varies across embodiments. In some embodiments, specific shared functions within the shared function logic 420 that are used extensively by the graphics core array 414 may be included within shared function logic 416 within the graphics core array 414. In various embodiments, the shared function logic 416 within the graphics core array 414 can include some or all logic within the shared function logic 420. In one embodiment, all logic elements within the shared function logic 420 may be duplicated within the shared function logic 416 of the graphics core array 414. In one embodiment, the shared function logic 420 is excluded in favor of the shared function logic 416 within the graphics core array 414.
FIG. 5 is a block diagram of hardware logic of a graphics processor core 500, according to some embodiments described herein. Elements of FIG. 5 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. In some embodiments, the illustrated graphics processor core 500 is included within the graphics core array 414 of FIG. 4. The graphics processor core 500, sometimes referred to as a core slice, can be one or multiple graphics cores within a modular graphics processor. The graphics processor core 500 is exemplary of one graphics core slice, and a graphics processor as described herein may include multiple graphics core slices based on target power and performance envelopes. Each graphics processor core 500 can include a fixed function block 530 coupled with multiple sub-cores 501A-501F, also referred to as sub-slices, that include modular blocks of general-purpose and fixed function logic.
In some embodiments, fixed function block 530 includes a geometry/fixed function pipeline 536 that can be shared by all sub-cores in the graphics processor core 500, for example, in lower performance and/or lower power graphics processor implementations. In various embodiments, the geometry/fixed function pipeline 536 includes a 3D fixed function pipeline (e.g., 3D pipeline 312 as in FIG. 3 and FIG. 4), a video front-end unit, a thread spawner and thread dispatcher, and a unified return buffer manager, which manages a unified return buffer, such as the unified return buffer 418 of FIG. 4.
In one embodiment, fixed function block 530 also includes a graphics SoC interface 537, a graphics microcontroller 538, and a media pipeline 539. The graphics SoC interface 537 provides an interface between the graphics processor core 500 and other processor cores within a system-on-a-chip integrated circuit. The graphics microcontroller 538 is a programmable sub-processor that is configurable to manage various functions of the graphics processor core 500, including thread dispatch, scheduling, and preemption. The media pipeline 539 (e.g., media pipeline 316 of FIG. 3 and FIG. 4) includes logic to facilitate the decoding, encoding, pre-processing, and/or post-processing of multimedia data, including image and video data. The media pipeline 539 implements media operations via requests to compute or sampling logic within the sub-cores 501A-501F.
In one embodiment, SoC interface 537 enables the graphics processor core 500 to communicate with general-purpose application processor cores (e.g., CPUs) and/or other components within an SoC, including memory hierarchy elements such as a shared last level cache memory, system RAM, and/or embedded on-chip or on-package DRAM. The SoC interface 537 can also enable communication with fixed function devices within the SoC, such as camera imaging pipelines, and enables the use of and/or implements global memory atomics that may be shared between the graphics processor core 500 and CPUs within the SoC. The SoC interface 537 can also implement power management controls for the graphics processor core 500 and enable an interface between a clock domain of the graphics core 500 and other clock domains within the SoC. In one embodiment, the SoC interface 537 enables receipt of command buffers from a command streamer and global thread dispatcher that are configured to provide commands and instructions to each of one or more graphics cores within a graphics processor. The commands and instructions can be dispatched to the media pipeline 539 when media operations are to be performed, or to a geometry and fixed function pipeline (e.g., geometry and fixed function pipeline 536, geometry and fixed function pipeline 514) when graphics processing operations are to be performed.
Graphics microcontroller 538 can be configured to perform various scheduling and management tasks for the graphics processor core 500. In one embodiment, the graphics microcontroller 538 can perform graphics and/or compute workload scheduling on the various graphics parallel engines within the Execution Unit (EU) arrays 502A-502F, 504A-504F within the sub-cores 501A-501F. In this scheduling model, host software executing on a CPU core of an SoC including the graphics processor core 500 can submit workloads to one of multiple graphics processor doorbells, which invokes a scheduling operation on the appropriate graphics engine. Scheduling operations include determining which workload to run next, submitting a workload to a command streamer, preempting existing workloads running on an engine, monitoring progress of a workload, and notifying host software when a workload is complete. In one embodiment, the graphics microcontroller 538 can also facilitate low-power or idle states for the graphics processor core 500, providing the graphics processor core 500 with the ability to save and restore registers within the graphics processor core 500 across low-power state transitions independently of the operating system and/or graphics driver software on the system.
Graphics processor core 500 may have more or fewer sub-cores 501A-501F than shown, up to N modular sub-cores. For each set of N sub-cores, graphics processor core 500 can also include shared function logic 510, shared and/or cache memory 512, geometry/fixed function pipeline 514, and additional fixed function logic 516 to accelerate various graphics and computing processing operations. Shared function logic 510 can include logic units (e.g., samplers, math and/or inter-thread communication logic) associated with shared function logic 420 of fig. 4 that can be shared by every N sub-cores within graphics processor core 500. The shared and/or cache memory 512 can be a last level cache for a set of N sub-cores 501A-501F within the graphics processor core 500, and can also act as a shared memory accessible by multiple sub-cores. Geometry/fixed function pipeline 514 can be included in place of geometry/fixed function pipeline 536 within fixed function block 530 and can include the same or similar logic elements.
In one embodiment, graphics processor core 500 includes additional fixed function logic 516, which can include various fixed function acceleration logic for use by the graphics processor core 500. In one embodiment, the additional fixed function logic 516 includes an additional geometry pipeline for use in position-only shading. In position-only shading, two geometry pipelines exist: the full geometry pipeline within the geometry/fixed function pipelines 514, 536, and a cull pipeline, which is an additional geometry pipeline that may be included within the additional fixed function logic 516. In one embodiment, the cull pipeline is a trimmed-down version of the full geometry pipeline. The full pipeline and the cull pipeline can execute different instances of the same application, each instance having a separate context. Position-only shading can hide long cull runs of discarded triangles, enabling shading to complete earlier in some instances. For example, and in one embodiment, the cull pipeline logic within the additional fixed function logic 516 can execute position shaders in parallel with the main application and generally generates critical results faster than the full pipeline, as the cull pipeline fetches and shades only the position attribute of the vertices, without performing rasterization and rendering of the pixels to the frame buffer. The cull pipeline can use the generated critical results to compute visibility information for all the triangles, regardless of whether those triangles are culled. The full pipeline (which in this instance may be referred to as a replay pipeline) can consume the visibility information to skip the culled triangles, shading only the visible triangles that are finally passed to the rasterization phase.
In one embodiment, the additional fixed function logic 516 can also include machine learning acceleration logic, such as fixed function matrix multiplication logic, for implementations that include optimizations for machine learning training or inference.
Within each graphics sub-core 501A-501F is included a set of execution resources that may be used to perform graphics, media, and computational operations in response to requests by a graphics pipeline, media pipeline, or shader program. The graphics sub-cores 501A-501F include a plurality of EU arrays 502A-502F, 504A-504F, thread dispatch and inter-thread communication (TD/IC) logic 503A-503F, 3D (e.g., texture) samplers 505A-505F, media samplers 506A-506F, shader processors 507A-507F, and Shared Local Memories (SLM) 508A-508F. The EU arrays 502A-502F, 504A-504F each include a plurality of execution units, which are general purpose graphics processing units capable of performing floating point and integer/fixed point logical operations for servicing graphics, media, or computational operations, including graphics, media, or compute shader programs. The TD/IC logic 503A-503F performs local thread dispatch and thread control operations for execution units within the sub-cores and facilitates communication between threads executing on the execution units of the sub-cores. The 3D samplers 505A-505F can read texture or other 3D graphics related data into memory. The 3D sampler can read texture data in different ways based on the configured sample states and the texture format associated with a given texture. Media samplers 506A-506F can perform similar read operations based on the type and format associated with the media data. In one embodiment, each graphics sub-core 501A-501F can alternately include unified 3D and media samplers. Threads executing on execution units within each of the sub-cores 501A-501F can utilize shared local memory 508A-508F within each sub-core to enable threads executing within a thread group to execute using a common pool of on-chip memory.
Execution unit
FIGS. 6A-6B illustrate thread execution logic 600 including an array of processing elements employed in a graphics processor core according to embodiments described herein. The elements of FIGS. 6A-6B having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such. FIG. 6A shows an overview of thread execution logic 600, which may include a variant of the hardware logic illustrated with each sub-core 501A-501F of FIG. 5. FIG. 6B shows exemplary internal details of an execution unit.
As shown in fig. 6A, in some embodiments, thread execution logic 600 includes shader processor 602, thread dispatcher 604, instruction cache 606, scalable execution unit array including a plurality of execution units 608A-608N, sampler 610, data cache 612, and data port 614. In one embodiment, the scalable array of execution units is capable of being dynamically scaled by enabling or disabling one or more execution units (e.g., any of execution units 608A, 608B, 608C, 608D through 608N-1 and 608N) based on the computational requirements of the workload. In one embodiment, the included components are interconnected via an interconnection fabric linked to each of the components. In some embodiments, the thread execution logic 600 includes one or more connections to memory (such as system memory or cache memory) through one or more of the instruction cache 606, data port 614, sampler 610, and execution units 608A-608N. In some embodiments, each execution unit (e.g., 608A) is an independently programmable general purpose computing unit capable of executing multiple simultaneous hardware threads while processing multiple data elements in parallel for each thread. In various embodiments, the array of execution units 608A-608N is scalable to include any number of individual execution units.
In some embodiments, the EUs 608A-608N are primarily used to execute shader programs. A shader processor 602 can process the various shader programs and dispatch execution threads associated with the shader programs via a thread dispatcher 604. In one embodiment, the thread dispatcher includes logic to arbitrate thread initiation requests from the graphics and media pipelines and instantiate the requested threads on one or more of the execution units 608A-608N. For example, the geometry pipeline can dispatch vertex, tessellation, or geometry shaders to the thread execution logic for processing. In some embodiments, the thread dispatcher 604 can also process runtime thread spawning requests from the executing shader programs.
In some embodiments, execution units 608A-608N support an instruction set that includes native support for many standard 3D graphics shader instructions, such that shader programs from graphics libraries (e.g., Direct3D and OpenGL) are executed with minimal translation. The execution units support vertex and geometry processing (e.g., vertex programs, geometry programs, vertex shaders), pixel processing (e.g., pixel shaders, fragment shaders), and general-purpose processing (e.g., compute and media shaders). Each of the execution units 608A-608N is capable of multi-issue Single Instruction Multiple Data (SIMD) execution, and multi-threaded operation enables an efficient execution environment in the face of higher latency memory accesses. Each hardware thread within each execution unit has a dedicated high-bandwidth register file and associated independent thread state. Execution is multi-issue per clock to pipelines capable of integer, single- and double-precision floating point operations, SIMD branch capability, logical operations, transcendental operations, and other miscellaneous operations. While waiting for data from memory or one of the shared functions, dependency logic within the execution units 608A-608N causes a waiting thread to sleep until the requested data has been returned. While the waiting thread is sleeping, hardware resources may be devoted to processing other threads. For example, during a delay associated with a vertex shader operation, an execution unit can perform operations for a pixel shader, fragment shader, or another type of shader program, including a different vertex shader.
Each of the execution units 608A-608N operates on an array of data elements. The number of data elements is the "execution size" or number of lanes for the instruction. An execution channel is a logical unit for the execution of data element access, masking, and flow control within an instruction. The number of lanes may be independent of the number of physical Arithmetic Logic Units (ALUs) or Floating Point Units (FPUs) for a particular graphics processor. In some embodiments, execution units 608A-608N support both integer and floating point data types.
The execution unit instruction set includes SIMD instructions. The various data elements can be stored as a packed data type in a register, and the execution unit will process the various elements based on the data size of the elements. For example, when operating on a 256-bit wide vector, the 256 bits of the vector are stored in a register, and the execution unit operates on the vector as four separate 64-bit packed data elements (Quad-Word (QW) size data elements), eight separate 32-bit packed data elements (Double-Word (DW) size data elements), sixteen separate 16-bit packed data elements (Word (W) size data elements), or thirty-two separate 8-bit data elements (Byte (B) size data elements). However, different vector widths and register sizes are possible.
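The packed-data view described above can be pictured as one 256-bit register reinterpreted at four lane widths; the union below is a schematic model, not the EU register hardware:

```cpp
#include <cstdint>

// One 256-bit register viewed at four different packed lane widths.
union Reg256 {
    uint64_t qw[4];  // four 64-bit Quad-Word (QW) elements
    uint32_t dw[8];  // eight 32-bit Double-Word (DW) elements
    uint16_t w[16];  // sixteen 16-bit Word (W) elements
    uint8_t  b[32];  // thirty-two 8-bit Byte (B) elements
};
static_assert(sizeof(Reg256) == 32, "256 bits in every view");
```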
In one embodiment, one or more execution units can be combined into a fused execution unit 609A-609N having thread control logic (607A-607N) that is common to the fused EUs. Multiple EUs can be fused into an EU group. Each EU in the fused EU group can be configured to execute a separate SIMD hardware thread. The number of EUs in a fused EU group can vary according to embodiments. Additionally, various SIMD widths can be performed per EU, including but not limited to SIMD8, SIMD16, and SIMD32. Each fused graphics execution unit 609A-609N includes at least two execution units. For example, the fused execution unit 609A includes a first EU 608A, a second EU 608B, and thread control logic 607A that is common to the first EU 608A and the second EU 608B. The thread control logic 607A controls the threads executed on the fused graphics execution unit 609A, allowing each EU within the fused execution units 609A-609N to execute using a common instruction pointer register.
One or more internal instruction caches (e.g., 606) are included in the thread execution logic 600 to cache thread instructions for the execution units. In some embodiments, one or more data caches (e.g., 612) are included to cache thread data during thread execution. In some embodiments, sampler 610 is included to provide texture samples for 3D operations and media samples for media operations. In some embodiments, sampler 610 includes dedicated texture or media sampling functionality to process texture or media data during a sampling process prior to providing the sampled data to an execution unit.
During execution, the graphics and media pipeline sends thread initiation requests to the thread execution logic 600 via thread spawn and dispatch logic. Once a group of geometric objects has been processed and rasterized into pixel data, pixel processor logic (e.g., pixel shader logic, fragment shader logic, etc.) within shader processor 602 is invoked to further compute output information and cause the results to be written to an output surface (e.g., a color buffer, a depth buffer, a stencil buffer, etc.). In some embodiments, a pixel shader or fragment shader computes values for various vertex attributes to be interpolated across rasterized objects. In some embodiments, pixel processor logic within shader processor 602 then executes an Application Programming Interface (API) supplied pixel or fragment shader program. To execute shader programs, shader processor 602 dispatches threads to execution units (e.g., 608A) via thread dispatcher 604. In some embodiments, shader processor 602 uses texture sampling logic in sampler 610 to access texture data in a texture map stored in memory. Arithmetic operations on the texture data and the input geometry data compute pixel color data for each geometric segment, or discard one or more pixels from further processing.
In some embodiments, data port 614 provides a memory access mechanism for thread execution logic 600 to output processed data to memory for further processing on a graphics processor output pipeline. In some embodiments, data port 614 includes or is coupled to one or more cache memories (e.g., data cache 612) to cache data for memory access via the data port.
As shown in FIG. 6B, the graphics execution unit 608 can include an instruction fetch unit 637, a general register file array (GRF) 624, an architectural register file array (ARF) 626, a thread arbiter 622, a send unit 630, a branch unit 632, a set of SIMD Floating Point Units (FPUs) 634, and, in one embodiment, a set of dedicated integer SIMD ALUs 635. The GRF 624 and ARF 626 include the set of general register files and architectural register files associated with each simultaneous hardware thread that may be active in the graphics execution unit 608. In one embodiment, per-thread architectural state is maintained in the ARF 626, while data used during thread execution is stored in the GRF 624. The execution state of each thread, including the instruction pointers for each thread, can be held in thread-specific registers in the ARF 626.
In one embodiment, the graphics execution unit 608 has an architecture that is a combination of Simultaneous Multithreading (SMT) and fine-grained Interleaved Multithreading (IMT). The architecture has a modular configuration that can be fine-tuned at design time based on the number of registers per execution unit and the target number of simultaneous threads, where execution unit resources are partitioned across logic used to execute multiple simultaneous threads.
In one embodiment, the graphics execution unit 608 can co-issue multiple instructions, which may each be different instructions. The thread arbiter 622 of the graphics execution unit 608 can dispatch the instructions to one of the send unit 630, the branch unit 632, or the SIMD FPU(s) 634 for execution. Each execution thread can access 128 general-purpose registers within the GRF 624, where each register can store 32 bytes, accessible as a SIMD8-element vector of 32-bit data elements. In one embodiment, each execution unit thread has access to 4 kilobytes within the GRF 624, although embodiments are not so limited, and greater or fewer register resources may be provided in other embodiments. In one embodiment, up to seven threads can execute simultaneously, although the number of threads per execution unit can also vary according to embodiments. In an embodiment in which seven threads may access 4 kilobytes each, the GRF 624 can store a total of 28 kilobytes. Flexible addressing modes can permit registers to be addressed together to build effectively wider registers or to represent strided rectangular block data structures.
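The register-file sizing arithmetic in this paragraph, written out as a small self-checking snippet:

```cpp
// Sizes taken directly from the text above.
constexpr int kRegsPerThread  = 128;  // general-purpose registers per thread
constexpr int kBytesPerReg    = 32;   // one SIMD8 vector of 32-bit elements
constexpr int kBytesPerThread = kRegsPerThread * kBytesPerReg;  // 4096 B = 4 KB
constexpr int kThreads        = 7;    // simultaneous threads in this embodiment
constexpr int kGrfTotalBytes  = kThreads * kBytesPerThread;     // 28672 B
static_assert(kGrfTotalBytes == 28 * 1024, "matches the 28 KB figure above");
```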
In one embodiment, memory operations, sampler operations, and other longer-latency system communications are dispatched via "send" instructions that are executed by the message-passing send unit 630. In one embodiment, branch instructions are dispatched to a dedicated branch unit 632 to facilitate SIMD divergence and eventual convergence.
In one embodiment, graphics execution unit 608 includes one or more SIMD Floating Point Units (FPU(s)) 634 to perform floating-point operations. In one embodiment, the FPU(s) 634 also support integer computation. In one embodiment, the FPU(s) 634 can SIMD-execute up to M 32-bit floating-point (or integer) operations, or SIMD-execute up to 2M 16-bit integer or 16-bit floating-point operations. In one embodiment, at least one of the FPU(s) provides extended math capability to support high-throughput transcendental math functions and double-precision 64-bit floating point. In some embodiments, a set of 8-bit integer SIMD ALUs 635 is also present and may be specifically optimized to perform operations associated with machine learning computations.
In one embodiment, an array of multiple instances of the graphics execution unit 608 can be instantiated in a graphics sub-core grouping (e.g., a subslice). For scalability, the product architect can select the exact number of execution units per sub-core grouping. In one embodiment, the execution unit 608 is capable of executing instructions across multiple execution channels. In additional embodiments, each thread executing on the graphics execution unit 608 executes on a different channel.
FIG. 7 is a block diagram illustrating graphics processor instruction formats 700 according to some embodiments. In one or more embodiments, the graphics processor execution units support an instruction set having instructions in multiple formats. The solid boxes illustrate the components that are generally included in an execution unit instruction, while the dashed lines include components that are optional or that are only included in a subset of the instructions. In some embodiments, the instruction formats 700 described and illustrated are macro-instructions, in that they are instructions supplied to the execution unit, as opposed to micro-operations resulting from instruction decode once the instruction is processed.
In some embodiments, the graphics processor execution unit natively supports instructions in the 128-bit instruction format 710. Based on the selected instruction, instruction options, and number of operands, a 64-bit compact instruction format 730 may be used for some instructions. The native 128-bit instruction format 710 provides access to all instruction options, while in the 64-bit format 730 some options and operations are restricted. The available native instructions in the 64-bit format 730 vary from embodiment to embodiment. In some embodiments, instructions are partially compressed using a set of index values in the index field 713. The execution unit hardware references a set of compression tables based on the index values and uses the compression table outputs to reconstruct native instructions in the 128-bit instruction format 710.
For each format, the instruction opcode 712 defines the operation that the execution unit is to perform. The execution units execute each instruction in parallel across the multiple data elements of each operand. For example, in response to an add instruction, the execution unit performs a simultaneous add operation across each color channel representing a texture element or a picture element. By default, the execution unit executes each instruction across all data channels of the operands. In some embodiments, the instruction control field 714 enables control over certain execution options, such as channel selection (e.g., predication) and data channel order (e.g., swizzle). For instructions that employ the 128-bit instruction format 710, the execution size field 716 limits the number of data channels that will be executed in parallel. In some embodiments, the execution size field 716 is not available for use in the 64-bit compact instruction format 730.
Some execution unit instructions have up to three operands, including two source operands, src0 720 and src1 722, and one destination 718. In some embodiments, the execution units support dual-destination instructions, where one of the destinations is implied. Data manipulation instructions can have a third source operand (e.g., SRC2 724), where the instruction opcode 712 determines the number of source operands. An instruction's last source operand can be an immediate (e.g., hard-coded) value passed with the instruction.
In some embodiments, 128-bit instruction format 710 includes an access/address mode field 726, the access/address mode field 726 specifying, for example, whether direct register addressing mode or indirect register addressing mode is used. When using the direct register addressing mode, the register address of one or more operands is provided directly by bits in the instruction.
In some embodiments, the 128-bit instruction format 710 includes an access/address mode field 726, the access/address mode field 726 specifying an address mode and/or an access mode for the instruction. In one embodiment, the access mode is used to define a data access alignment for the instruction. Some embodiments support access modes that include a 16-byte aligned access mode and a 1-byte aligned access mode, where the byte alignment of the access mode determines the access alignment of the instruction operands. For example, when in a first mode, the instruction may use byte-aligned addressing for the source and destination operands, and when in a second mode, the instruction may use 16-byte-aligned addressing for all source and destination operands.
In one embodiment, the address mode portion of the access/address mode field 726 determines whether the instruction will use direct addressing or indirect addressing. When using the direct register addressing mode, bits in the instruction directly provide the register address of one or more operands. When using the indirect register addressing mode, register addresses for one or more operands may be calculated based on an address register value and an address immediate field in the instruction.
In some embodiments, instructions are grouped based on opcode 712 bit-fields to simplify opcode decode 740. For an 8-bit opcode, bits 4, 5, and 6 allow the execution unit to determine the type of opcode. The precise opcode grouping shown is merely an example. In some embodiments, a move and logic opcode group 742 includes data movement and logic instructions (e.g., move (mov), compare (cmp)). In some embodiments, the move and logic group 742 shares the five most significant bits (MSBs), where move (mov) instructions are in the form of 0000xxxxb and logic instructions are in the form of 0001xxxxb. A flow control instruction group 744 (e.g., call, jump (jmp)) includes instructions in the form of 0010xxxxb (e.g., 0x20). A miscellaneous instruction group 746 includes a mix of instructions, including synchronization instructions (e.g., wait, send) in the form of 0011xxxxb (e.g., 0x30). A parallel math instruction group 748 includes component-wise arithmetic instructions (e.g., add, multiply (mul)) in the form of 0100xxxxb (e.g., 0x40). The parallel math group 748 performs the arithmetic operations in parallel across the data channels. A vector math group 750 includes arithmetic instructions (e.g., dp4) in the form of 0101xxxxb (e.g., 0x50). The vector math group performs arithmetic such as dot product calculations on vector operands.
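For illustration only, the grouping by bits 4, 5, and 6 can be sketched as follows; the enum and function names are hypothetical and not part of this description:

    #include <cstdint>

    enum class OpcodeGroup {
        MoveLogic, FlowControl, Miscellaneous, ParallelMath, VectorMath, Other
    };

    // Classify an 8-bit opcode by bits 4, 5, and 6, per the groupings above.
    OpcodeGroup classify_opcode(uint8_t opcode) {
        switch ((opcode >> 4) & 0x7) {
            case 0x0:                                    // mov:   0000xxxxb
            case 0x1: return OpcodeGroup::MoveLogic;     // logic: 0001xxxxb
            case 0x2: return OpcodeGroup::FlowControl;   // 0010xxxxb (e.g., 0x20)
            case 0x3: return OpcodeGroup::Miscellaneous; // 0011xxxxb (e.g., 0x30)
            case 0x4: return OpcodeGroup::ParallelMath;  // 0100xxxxb (e.g., 0x40)
            case 0x5: return OpcodeGroup::VectorMath;    // 0101xxxxb (e.g., 0x50)
            default:  return OpcodeGroup::Other;
        }
    }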
Graphics pipeline
Fig. 8 is a block diagram of another embodiment of a graphics processor 800. The elements of fig. 8 having the same reference numbers (or names) as the elements of any other figure herein can operate or function in any manner similar to that described elsewhere herein, but are not limited to such.
In some embodiments, graphics processor 800 includes geometry pipeline 820, media pipeline 830, display engine 840, thread execution logic 850, and render output pipeline 870. In some embodiments, graphics processor 800 is a graphics processor within a multi-core processing system that includes one or more general purpose processing cores. The graphics processor is controlled by register writes to one or more control registers (not shown) or via commands issued to the graphics processor 800 over the ring interconnect 802. In some embodiments, ring interconnect 802 couples graphics processor 800 to other processing components, such as other graphics processors or general purpose processors. Commands from the ring interconnect 802 are interpreted by a command streamer 803, which command streamer 803 supplies instructions to the various components of the geometry pipeline 820 or media pipeline 830.
In some embodiments, the command streamer 803 directs the operation of a vertex fetcher 805, which vertex fetcher 805 reads the vertex data from memory and executes the vertex processing commands provided by the command streamer 803. In some embodiments, vertex fetcher 805 provides vertex data to vertex shader 807, which vertex shader 807 performs coordinate space transformations and lighting operations on each vertex. In some embodiments, vertex fetcher 805 and vertex shader 807 execute vertex processing instructions by dispatching execution threads to execution units 852A-852B via thread dispatcher 831.
In some embodiments, execution units 852A-852B are an array of vector processors having sets of instructions for performing graphics and media operations. In some embodiments, execution units 852A-852B have an attached L1 cache 851, the L1 cache 851 being specific to each array, or shared between arrays. The cache can be configured as a data cache, an instruction cache, or a single cache partitioned to contain data and instructions in different partitions.
In some embodiments, the geometry pipeline 820 includes tessellation components to perform hardware-accelerated tessellation of 3D objects. In some embodiments, a programmable hull shader 811 configures the tessellation operations. A programmable domain shader 817 provides back-end evaluation of the tessellation output. A tessellator 813 operates at the direction of the hull shader 811 and contains special-purpose logic to generate a set of detailed geometric objects based on a coarse geometric model that is provided as input to the geometry pipeline 820. In some embodiments, if tessellation is not used, the tessellation components (e.g., hull shader 811, tessellator 813, and domain shader 817) can be bypassed.
In some embodiments, complete geometric objects can be processed by a geometry shader 819 via one or more threads dispatched to the execution units 852A-852B, or can proceed directly to the clipper 829. In some embodiments, the geometry shader operates on entire geometric objects, rather than on vertices or patches of vertices as in previous stages of the graphics pipeline. If tessellation is disabled, the geometry shader 819 receives input from the vertex shader 807. In some embodiments, the geometry shader 819 is programmable by a geometry shader program to perform geometry tessellation if the tessellation units are disabled.
Before rasterization, the clipper 829 processes the vertex data. The clipper 829 may be a programmable clipper or a fixed function clipper with clipping and geometry shader functions. In some embodiments, a rasterizer and depth test component 873 in the render output pipeline 870 dispatches pixel shaders to convert the geometric objects into a per-pixel representation. In some embodiments, pixel shader logic is included in thread execution logic 850. In some embodiments, the application can bypass the rasterizer and depth test component 873 and access the un-rasterized vertex data via the stream output unit 823.
Graphics processor 800 has an interconnect bus, interconnect fabric, or some other interconnect mechanism that allows data and messages to be passed between the main components of the processor. In some embodiments, the execution units 852A-852B and associated logic units (e.g., L1 cache 851, sampler 854, texture cache 858, etc.) are interconnected via data ports 856 to perform memory accesses and communicate with the rendering output pipeline components of the processor. In some embodiments, sampler 854, caches 851, 858 and execution units 852A-852B each have separate memory access paths. In one embodiment, the texture cache 858 can also be configured as a sampler cache.
In some embodiments, the render output pipeline 870 contains a rasterizer and depth test component 873 that converts vertex-based objects into an associated pixel-based representation. In some embodiments, the rasterizer logic includes a windower/masker unit to perform fixed-function triangle and line rasterization. An associated render cache 878 and depth cache 879 are also available in some embodiments. A pixel operations component 877 performs pixel-based operations on the data, although in some instances pixel operations associated with 2D operations (e.g., bit-block image transfers with blending) are performed by the 2D engine 841 or substituted at display time by the display controller 843 using overlay display planes. In some embodiments, a shared L3 cache 875 is available to all graphics components, allowing the sharing of data without the use of main system memory.
In some embodiments, graphics processor media pipeline 830 includes a media engine 837 and a video front end 834. In some embodiments, video front end 834 receives pipeline commands from command streamer 803. In some embodiments, media pipeline 830 includes a separate command streamer. In some embodiments, the video front end 834 processes media commands before sending the commands to the media engine 837. In some embodiments, media engine 837 includes thread spawning functionality to spawn threads for dispatch to thread execution logic 850 via thread dispatcher 831.
In some embodiments, graphics processor 800 includes a display engine 840. In some embodiments, display engine 840 is external to processor 800 and is coupled with the graphics processor via ring interconnect 802 or some other interconnect bus or fabric. In some embodiments, display engine 840 includes a 2D engine 841 and a display controller 843. In some embodiments, the display engine 840 contains dedicated logic that can operate independently of the 3D pipeline. In some embodiments, the display controller 843 is coupled with a display device (not shown), which may be a system-integrated display device (as in a laptop computer) or may be an external display device attached via a display device connector.
In some embodiments, geometry pipeline 820 and media pipeline 830 may be configured to perform operations based on multiple graphics and media programming interfaces and are not specific to any one Application Programming Interface (API). In some embodiments, driver software for the graphics processor translates API calls specific to a particular graphics or media library into commands that can be processed by the graphics processor. In some embodiments, support is provided for an open graphics library (OpenGL), open computing language (OpenCL), and/or Vulkan graphics and computing APIs, all from the Khronos Group. In some embodiments, support may also be provided for the Direct3D library from microsoft corporation. In some embodiments, a combination of these libraries may be supported. Support may also be provided for an open source computer vision library (OpenCV). Future APIs with compatible 3D pipelines will also be supported if a mapping from the pipeline of the future API to the pipeline of the graphics processor can be made.
Graphics pipeline programming
FIG. 9A is a block diagram illustrating a graphics processor command format 900 according to some embodiments. FIG. 9B is a block diagram that illustrates a graphics processor command sequence 910, according to an embodiment. The solid line boxes in FIG. 9A show components that are typically included in graphics commands, while the dashed lines include components that are optional or included only in a subset of graphics commands. The exemplary graphics processor command format 900 of FIG. 9A includes data fields to identify the client 902, command operation code (opcode) 904, and data 906 of the command. Some commands also include a subopcode 905 and a command size 908.
In some embodiments, the client 902 specifies the client unit of the graphics device that processes the command data. In some embodiments, a graphics processor command parser examines the client field of each command to condition the further processing of the command and route the command data to the appropriate client unit. In some embodiments, the graphics processor client units include a memory interface unit, a render unit, a 2D unit, a 3D unit, and a media unit. Each client unit has a corresponding processing pipeline that processes the commands. Once the command is received by the client unit, the client unit reads the opcode 904 and, if present, the sub-opcode 905 to determine the operation to perform. The client unit performs the command using information in the data field 906. For some commands, an explicit command size 908 is expected to specify the size of the command. In some embodiments, the command parser automatically determines the size of at least some of the commands based on the command opcode. In some embodiments, commands are aligned via multiples of a doubleword.
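As a hedged illustration only, software might model the fields named above as follows; the field widths and packing are assumptions, not the hardware command format:

    #include <cstddef>
    #include <cstdint>

    // Hypothetical modeling of client 902, opcode 904, sub-opcode 905,
    // and command size 908; data 906 follows the header in memory.
    struct GraphicsCommandHeader {
        uint32_t client    : 8;   // target client unit (e.g., render, 2D, 3D, media)
        uint32_t opcode    : 8;   // command operation code 904
        uint32_t subopcode : 8;   // sub-operation code 905, when present
        uint32_t size      : 8;   // explicit command size 908, in doublewords
    };

    // Commands are aligned on doubleword multiples, so a parser can advance
    // by the decoded size in 4-byte units.
    size_t next_command_offset(size_t current_offset, const GraphicsCommandHeader& h) {
        return current_offset + size_t(h.size) * 4;
    }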
An exemplary graphics processor command sequence 910 is shown in the flow diagram of FIG. 9B. In some embodiments, software or firmware of a data processing system that features an embodiment of a graphics processor uses a version of the command sequence shown to set up, execute, and terminate a set of graphics operations. A sample command sequence is shown and described for purposes of example only, as embodiments are not limited to these specific commands or to this command sequence. Moreover, the commands may be issued as a batch of commands in a command sequence, such that the graphics processor will process the sequence of commands in an at least partially concurrent manner.
In some embodiments, graphics processor command sequence 910 may begin with a pipeline flush command 912 to cause any active graphics pipeline to complete the current pending commands of the pipeline. In some embodiments, 3D pipeline 922 and media pipeline 924 do not operate concurrently. A pipeline flush is performed to cause the active graphics pipeline to complete any pending commands. In response to a pipeline flush, a command parser for a graphics processor will halt command processing until the active drawing engine completes pending operations and the associated read cache is invalidated. Alternatively, any data in the render cache marked as dirty can be flushed to memory. In some embodiments, pipeline flush command 912 can be used for pipeline synchronization or used before placing the graphics processor into a low power state.
In some embodiments, the pipeline select command 913 is used when the command sequence requires the graphics processor to explicitly switch between pipelines. In some embodiments, the pipeline select command 913 is required only once within the execution context before issuing the pipeline command unless the context is to issue commands for both pipelines. In some embodiments, a pipeline flush command 912 is required immediately prior to a pipeline switch via pipeline select command 913.
In some embodiments, pipeline control commands 914 configure the graphics pipeline for operation and are used to program 3D pipeline 922 and media pipeline 924. In some embodiments, the pipeline control commands 914 configure the pipeline state for the active pipeline. In one embodiment, the pipeline control command 914 is used for pipeline synchronization and to flush data from one or more caches within the active pipeline before processing a batch of commands.
In some embodiments, the return buffer status command 916 is used to configure a set of return buffers for the respective pipeline to write data. Some pipelining operations require allocation, selection, or configuration of one or more return buffers into which these operations write intermediate data during processing. In some embodiments, the graphics processor also uses one or more return buffers to store output data and perform cross-thread communications. In some embodiments, return buffer status 916 includes selecting the size and number of return buffers to be used for a set of pipelining operations.
The remaining commands in the command sequence differ based on the active pipeline used for the operation. Based on the pipeline determination 920, the command sequence is customized to either the 3D pipeline 922, which starts in a 3D pipeline state 930, or the media pipeline 924, which starts in a media pipeline state 940.
The commands used to configure the 3D pipeline state 930 include 3D state set commands for vertex buffer state, vertex element state, constant color state, depth buffer state, and other state variables to be configured before processing the 3D primitive command. The values of these commands are determined based at least in part on the particular 3D API in use. In some embodiments, the 3D pipeline state 930 commands can also selectively disable or bypass certain pipeline elements if those elements are not to be used.
In some embodiments, the 3D primitive 932 command is used to submit a 3D primitive to be processed by the 3D pipeline. Commands and associated parameters passed to the graphics processor via the 3D primitive 932 commands are forwarded to vertex fetch functions in the graphics pipeline. The vertex fetch function uses the 3D primitive 932 command data to generate the vertex data structure. The vertex data structure is stored in one or more return buffers. In some embodiments, 3D primitive 932 commands are used to perform vertex operations on 3D primitives via a vertex shader. To process the vertex shader, 3D pipeline 922 dispatches shader execution threads to the graphics processor execution unit.
In some embodiments, the 3D pipeline 922 is triggered via an execute 934 command or event. In some embodiments, the register write triggers the command execution. In some embodiments, execution is triggered via a "go" or "kick" command in the command sequence. In one embodiment, a pipeline synchronization command to flush a sequence of commands through a graphics pipeline is used to trigger command execution. The 3D pipeline will perform geometric processing for the 3D primitives. Once the operation is complete, the resulting geometric object is rasterized and the pixel engine colors the resulting pixels. For those operations, additional commands to control pixel shading and pixel back-end operations may also be included.
In some embodiments, graphics processor command sequence 910 follows the path of media pipeline 924 when performing media operations. In general, the particular use and manner of programming for media pipeline 924 depends on the media or computing operation to be performed. Certain media decoding operations may be offloaded to the media pipeline during media decoding. In some embodiments, the media pipeline can also be bypassed and media decoding can be performed in whole or in part using resources provided by one or more general purpose processing cores. In one embodiment, the media pipeline further includes elements for General Purpose Graphics Processor Unit (GPGPU) operations, wherein the graphics processor is used to perform SIMD vector operations using a compute shader program that is not explicitly related to the rendering of graphics primitives.
In some embodiments, media pipeline 924 is configured in a similar manner as 3D pipeline 922. A set of commands to configure the media pipeline state 940 is dispatched or placed into a command queue prior to the media object command 942. In some embodiments, the commands for the media pipeline state 940 include data to configure the media pipeline elements that will be used to process the media object. This includes data, such as encoding and decoding formats, used to configure the video decoding and video encoding logic within the media pipeline. In some embodiments, the commands for the media pipeline state 940 also support the use of one or more pointers to "indirect" state elements containing a collection of state settings.
In some embodiments, media object command 942 supplies a pointer to a media object for processing by the media pipeline. The media object includes a memory buffer containing video data to be processed. In some embodiments, all of the media pipeline state must be valid before issuing the media object command 942. Once the pipeline state is configured and the media object command 942 is queued, the media pipeline 924 is triggered via an execute command 944 or equivalent execute event (e.g., a register write). The output from media pipeline 924 may then be post-processed by operations provided by 3D pipeline 922 or media pipeline 924. In some embodiments, GPGPU operations are configured and performed in a similar manner as media operations.
Graphics software architecture
FIG. 10 illustrates an exemplary graphics software architecture for data processing system 1000 in accordance with some embodiments. In some embodiments, the software architecture includes a 3D graphics application 1010, an operating system 1020, and at least one processor 1030. In some embodiments, processor 1030 includes a graphics processor 1032 and one or more general purpose processor cores 1034. Graphics application 1010 and operating system 1020 each execute in system memory 1050 of the data processing system.
In some embodiments, 3D graphics application 1010 contains one or more shader programs, including shader instructions 1012. The shader language instructions can be in a high level shader language, such as High Level Shader Language (HLSL) or OpenGL shader language (GLSL). The application also includes executable instructions 1014 in a machine language suitable for execution by the general purpose processor core 1034. The application also includes a graphical object 1016 defined by the vertex data.
In some embodiments, the operating system 1020 is a Microsoft® Windows® operating system from the Microsoft Corporation, a proprietary UNIX-like operating system, or an open-source UNIX-like operating system using a variant of the Linux kernel. The operating system 1020 can support a graphics API 1022, such as the Direct3D API, the OpenGL API, or the Vulkan API. When the Direct3D API is in use, the operating system 1020 uses a front-end shader compiler 1024 to compile any shader instructions 1012 in HLSL into a lower-level shader language. The compilation may be a just-in-time (JIT) compilation, or the application may perform shader pre-compilation. In some embodiments, high-level shaders are compiled into low-level shaders during compilation of the 3D graphics application 1010. In some embodiments, the shader instructions 1012 are provided in an intermediate form, such as a version of the Standard Portable Intermediate Representation (SPIR) used by the Vulkan API.
In some embodiments, user mode graphics driver 1026 contains a back-end shader compiler 1027 to convert shader instructions 1012 into a hardware-specific representation. When the OpenGL API is in use, shader instructions 1012 in the GLSL high-level language are passed to user-mode graphics driver 1026 for compilation. In some embodiments, the user mode graphics driver 1026 uses operating system kernel mode functions 1028 to communicate with the kernel mode graphics driver 1029. In some embodiments, the kernel mode graphics driver 1029 communicates with the graphics processor 1032 to dispatch commands and instructions.
IP core implementations
One or more aspects of at least one embodiment may be implemented by representative code stored on a machine-readable medium which represents and/or defines logic within an integrated circuit such as a processor. For example, a machine-readable medium may include instructions representing various logic within a processor. When read by a machine, the instructions may cause the machine to fabricate logic to perform the techniques described herein. Such a representation, referred to as an "IP core," is a reusable unit of logic for an integrated circuit that may be stored on a tangible machine-readable medium as a hardware model that describes the structure of the integrated circuit. The hardware model may be supplied to various customers or manufacturing facilities that load the hardware model on fabrication machines that manufacture integrated circuits. An integrated circuit may be fabricated such that the circuit performs the operations described in association with any of the embodiments described herein.
Fig. 11A is a block diagram illustrating an IP core development system 1100 that may be used to fabricate integrated circuits to perform operations, according to an embodiment. The IP core development system 1100 may be used to generate a modular, reusable design that can be incorporated into a larger design or used to construct an entire integrated circuit (e.g., an SOC integrated circuit). Design facility 1130 is capable of generating software simulations 1110 of IP core designs in a high-level programming language (e.g., C/C++). Software simulation 1110 can be used to design, test, and verify the behavior of an IP core using simulation model 1112. Simulation model 1112 may include functional, behavioral, and/or timing simulations. A Register Transfer Level (RTL) design 1115 can then be created or synthesized from simulation model 1112. RTL design 1115 is an abstraction of the behavior of an integrated circuit that models the flow of digital signals between hardware registers, including associated logic that executes using the modeled digital signals. In addition to RTL design 1115, lower level designs at the logic level or transistor level may be created, designed, or synthesized. Thus, the specific details of the initial design and simulation may differ.
The RTL design 1115 or equivalent may be further synthesized by the design facility into a hardware model 1120, which hardware model 1120 may employ a Hardware Description Language (HDL) or some other representation of physical design data. The HDL may be further simulated or tested to verify the IP core design. Non-volatile memory 1140 (e.g., a hard disk, flash memory, or any non-volatile storage medium) can be used to store the IP core design for delivery to third party fabrication facility 1165. Alternatively, the IP core design may be communicated over a wired connection 1150 or a wireless connection 1160 (e.g., via the Internet). Fabrication facility 1165 may then fabricate an integrated circuit based at least in part on the IP core design. The integrated circuit fabricated can be configured to perform operations according to at least one embodiment described herein.
Figure 11B illustrates a cross-sectional side view of an integrated circuit package assembly 1170, according to some embodiments described herein. The integrated circuit package assembly 1170 illustrates an implementation of one or more processor or accelerator devices as described herein. The package assembly 1170 includes multiple units of hardware logic 1172, 1174 connected to a substrate 1180. The logic 1172, 1174 may be implemented at least partly in configurable logic or fixed-functionality logic hardware, and can include one or more portions of any of the processor core(s), graphics processor(s), or other accelerator devices described herein. Each unit of logic 1172, 1174 can be implemented within a semiconductor die and coupled with the substrate 1180 via an interconnect structure 1173. The interconnect structure 1173 may be configured to route electrical signals between the logic 1172, 1174 and the substrate 1180, and can include interconnects such as, but not limited to, bumps or pillars. In some embodiments, the interconnect structure 1173 may be configured to route electrical signals such as, for example, input/output (I/O) signals and/or power or ground signals associated with the operation of the logic 1172, 1174. In some embodiments, the substrate 1180 is an epoxy-based laminate substrate. In other embodiments, the package substrate 1180 may include other suitable types of substrates. The package assembly 1170 can be connected to other electrical devices via a package interconnect 1183. The package interconnect 1183 may be coupled to a surface of the substrate 1180 to route electrical signals to other electrical devices, such as a motherboard, other chipset, or a multi-chip module.
In some embodiments, the logic units 1172, 1174 are electrically coupled with a bridge 1182, the bridge 1182 configured to route electrical signals between the logics 1172, 1174. Bridge 1182 may be a dense interconnect structure that provides routing for electrical signals. The bridge 1182 may include a bridge substrate composed of glass or a suitable semiconductor material. Circuit routing features can be formed on the bridge substrate to provide chip-to-chip connections between the logic 1172, 1174.
Although two logic units 1172, 1174 and a bridge 1182 are shown, embodiments described herein may include more or fewer logic units on one or more dies. The one or more dies may be connected by zero or more bridges, as the bridge 1182 may be excluded when the logic is included on a single die. Alternatively, multiple dies or units of logic can be connected by one or more bridges. Additionally, multiple logic units, dies, and bridges can be connected together in other possible configurations, including three-dimensional configurations.
Exemplary System-on-chip Integrated Circuit
Fig. 12-14 illustrate an example integrated circuit and associated graphics processor that may be fabricated using one or more IP cores, according to various embodiments described herein. Other logic and circuitry may be included in addition to those shown, including additional graphics processor/cores, peripheral interface controllers, or general purpose processor cores.
FIG. 12 is a block diagram illustrating an exemplary system-on-chip integrated circuit 1200 that may be fabricated using one or more IP cores, according to an embodiment. The exemplary integrated circuit 1200 includes one or more application processors 1205 (e.g., CPUs), at least one graphics processor 1210, and may additionally include an image processor 1215 and/or a video processor 1220, any of which may be a modular IP core from the same or multiple different design facilities. The integrated circuit 1200 includes peripheral or bus logic including a USB controller 1225, a UART controller 1230, an SPI/SDIO controller 1235, and an I2S/I2C controller 1240. Additionally, the integrated circuit can include a display device 1245 coupled to one or more of a High-Definition Multimedia Interface (HDMI) controller 1250 and a Mobile Industry Processor Interface (MIPI) display interface 1255. Storage may be provided by a flash memory subsystem 1260 including flash memory and a flash memory controller. A memory interface may be provided via a memory controller 1265 for access to SDRAM or SRAM memory devices. Some integrated circuits additionally include an embedded security engine 1270.
Fig. 13A-13B are block diagrams illustrating an exemplary graphics processor for use within a SoC according to embodiments described herein. FIG. 13A illustrates an exemplary graphics processor 1310 of a system-on-chip integrated circuit that may be fabricated using one or more IP cores, according to an embodiment. FIG. 13B illustrates an additional exemplary graphics processor 1340 of a system-on-chip integrated circuit that may be fabricated using one or more IP cores, according to an embodiment. Graphics processor 1310 of FIG. 13A is an example of a low power graphics processor core. Graphics processor 1340 of fig. 13B is an example of a higher performance graphics processor core. Each of the graphics processors 1310, 1340 can be a variation of the graphics processor 1210 of fig. 12.
As shown in FIG. 13A, graphics processor 1310 includes a vertex processor 1305 and one or more fragment processors 1315A-1315N (e.g., 1315A, 1315B, 1315C, 1315D through 1315N-1 and 1315N). Graphics processor 1310 is capable of executing different shader programs via separate logic, such that vertex processor 1305 is optimized to perform operations for vertex shader programs, while one or more fragment processors 1315A-1315N perform fragment (e.g., pixel) shading operations for fragment or pixel shader programs. Vertex processor 1305 executes the vertex processing stages of the 3D graphics pipeline and generates primitive and vertex data. The fragment processor(s) 1315A-1315N use the primitive and vertex data generated by the vertex processor 1305 to produce a frame buffer for display on a display device. In one embodiment, fragment processor(s) 1315A-1315N are optimized to execute fragment shader programs as provided for in the OpenGL API, which can be used to perform similar operations as for the pixel shader programs as provided in the Direct3D API.
Graphics processor 1310 additionally includes one or more Memory Management Units (MMUs) 1320A-1320B, cache(s) 1325A-1325B, and circuit interconnect(s) 1330A-1330B. The one or more MMUs 1320A-1320B provide virtual address to physical address mapping for the graphics processor 1310, including for the vertex processor 1305 and/or the fragment processor(s) 1315A-1315N, which may reference vertex or image/texture data stored in memory in addition to vertex or image/texture data stored in one or more caches 1325A-1325B. In one embodiment, one or more MMUs 1320A-1320B may be synchronized with other MMUs within the system, including one or more MMUs associated with one or more application processors 1205, image processors 1215, and/or video processors 1220 of FIG. 12, enabling each processor 1205-1220 to participate in a shared or unified virtual memory system. According to an embodiment, one or more circuit interconnects 1330A-1330B enable graphics processor 1310 to interface with other IP cores within the SoC via the SoC's internal bus or via a direct connection.
As shown in FIG. 13B, graphics processor 1340 includes the one or more MMUs 1320A-1320B, caches 1325A-1325B, and circuit interconnects 1330A-1330B of the graphics processor 1310 of FIG. 13A. Graphics processor 1340 includes one or more shader cores 1355A-1355N (e.g., 1355A, 1355B, 1355C, 1355D, 1355E, 1355F through 1355N-1 and 1355N), which provide a unified shader core architecture in which a single core or a single type of core can execute all types of programmable shader code, including shader program code to implement vertex shaders, fragment shaders, and/or compute shaders. The exact number of shader cores present can vary among embodiments and implementations. Additionally, graphics processor 1340 includes an inter-core task manager 1345, which acts as a thread dispatcher to dispatch execution threads to the one or more shader cores 1355A-1355N, and a tiling unit 1358 to accelerate tiling operations for tile-based rendering, in which rendering operations for a scene are subdivided in image space, for example to exploit local spatial coherence within the scene or to optimize use of internal caches.
FIGS. 14A-14B illustrate additional exemplary graphics processor logic, according to embodiments described herein. FIG. 14A illustrates a graphics core 1400, which may be included within the graphics processor 1210 of FIG. 12 and which may be a unified shader core 1355A-1355N as in FIG. 13B. FIG. 14B illustrates a highly parallel general-purpose graphics processing unit 1430 suitable for deployment on a multi-chip module.
As shown in fig. 14A, the graphics core 1400 includes a shared instruction cache 1402, a texture unit 1418, and a cache/shared memory 1420 that are common to the execution resources within the graphics core 1400. The graphics core 1400 can include multiple slices 1401A-1401N or partitions per core, and a graphics processor can include multiple instances of the graphics core 1400. The slices 1401A-1401N can include support logic including a local instruction cache 1404A-1404N, a thread scheduler 1406A-1406N, a thread dispatcher 1408A-1408N, and a set of registers 1410A-1410N. To perform logic operations, the slices 1401A-1401N can include a set of additional function units (AFUs 1412A-1412N), floating-point units (FPUs 1414A-1414N), integer arithmetic logic units (ALUs 1416A-1416N), address computational units (ACUs 1413A-1413N), double-precision floating-point units (DPFPUs 1415A-1415N), and matrix processing units (MPUs 1417A-1417N).
Some of the computational units operate at a specific precision. For example, the FPUs 1414A-1414N can perform single-precision (32-bit) and half-precision (16-bit) floating-point operations, while the DPFPUs 1415A-1415N perform double-precision (64-bit) floating-point operations. The ALUs 1416A-1416N can perform variable-precision integer operations at 8-bit, 16-bit, and 32-bit precision, and can be configured for mixed-precision operations. The MPUs 1417A-1417N can also be configured for mixed-precision matrix operations, including half-precision floating-point and 8-bit integer operations. The MPUs 1417A-1417N can perform a variety of matrix operations to accelerate machine learning application frameworks, including enabling support for accelerated general matrix-matrix multiplication (GEMM). The AFUs 1412A-1412N can perform additional logic operations not supported by the floating-point or integer units, including trigonometric operations (e.g., sine, cosine, etc.).
As shown in FIG. 14B, a general-purpose graphics processing unit (GPGPU) 1430 can be configured to enable highly parallel compute operations to be performed by an array of graphics processing units. Additionally, the GPGPU 1430 can be linked directly to other instances of the GPGPU to create a multi-GPU cluster to improve training speed for particularly deep neural networks. The GPGPU 1430 includes a host interface 1432 to enable a connection with a host processor. In one embodiment, the host interface 1432 is a PCI Express interface. However, the host interface can also be a vendor-specific communications interface or communications fabric. The GPGPU 1430 receives commands from the host processor and uses a global scheduler 1434 to distribute the execution threads associated with those commands to a set of compute clusters 1436A-1436H. The compute clusters 1436A-1436H share a cache memory 1438. The cache memory 1438 can serve as a higher-level cache for the cache memories within the compute clusters 1436A-1436H.
The GPGPU 1430 includes memory 1434A-1434B coupled with the compute clusters 1436A-1436H via a set of memory controllers 1442A-1442B. In various embodiments, the memory 1434A-1434B can include various types of memory devices, including dynamic random access memory (DRAM) or graphics random access memory, such as synchronous graphics random access memory (SGRAM), including graphics double data rate (GDDR) memory.
In one embodiment, compute clusters 1436A-1436H each include a set of graphics cores (such as graphics core 1400 of fig. 14A), which may include multiple types of integer and floating point logic units capable of performing compute operations with a range of precision including a precision suitable for machine learning computations. For example, in one embodiment, at least a subset of the floating point units in each of the compute clusters 1436A-1436H may be configured to perform 16-bit or 32-bit floating point operations, while a different subset of the floating point units may be configured to perform 64-bit floating point operations.
Multiple instances of the GPGPU 1430 can be configured to operate as a compute cluster. The communication mechanism used by the compute cluster for synchronization and data exchange varies across embodiments. In one embodiment, the multiple instances of the GPGPU 1430 communicate over the host interface 1432. In one embodiment, the GPGPU 1430 includes an I/O hub 1439 that couples the GPGPU 1430 with a GPU link 1440 that enables a direct connection to other instances of the GPGPU. In one embodiment, the GPU link 1440 is coupled to a dedicated GPU-to-GPU bridge that enables communication and synchronization between multiple instances of the GPGPU 1430. In one embodiment, the GPU link 1440 couples with a high-speed interconnect to transmit and receive data to and from other GPGPUs or parallel processors. In one embodiment, the multiple instances of the GPGPU 1430 are located in separate data processing systems and communicate via a network device that is accessible via the host interface 1432. In one embodiment, the GPU link 1440 can be configured to enable a connection to a host processor in addition to, or as an alternative to, the host interface 1432.
While the illustrated configuration of the GPGPU 1430 can be configured to train neural networks, one embodiment provides an alternate configuration of the GPGPU 1430 that can be configured for deployment within a high-performance or low-power inferencing platform. In an inferencing configuration, the GPGPU 1430 includes fewer of the compute clusters 1436A-1436H relative to the training configuration. Additionally, the memory technology associated with the memory 1434A-1434B may differ between inferencing and training configurations, with higher-bandwidth memory technologies devoted to training configurations. In one embodiment, the inferencing configuration of the GPGPU 1430 can support inferencing-specific instructions. For example, an inferencing configuration can provide support for one or more 8-bit integer dot product instructions, which are commonly used during inferencing operations for deployed neural networks.
Apparatus and method for bounding volume hierarchy (BVH) compression
An N-wide Bounding Volume Hierarchy (BVH) node includes N bounding volumes corresponding to the N children of the given node. In addition to bounding volumes, references to each child node are included as indexes or pointers. A bit of an index or pointer may be assigned to indicate whether the node is an internal node or a leaf node. A common bounding volume format for ray tracing in particular is the Axis Aligned Bounding Volume (AABV) or the Axis Aligned Bounding Box (AABB). The AABB can be defined with only minimum and maximum extents in each dimension, thereby providing efficient ray intersection testing.
Typically, an AABB is stored in an uncompressed format using single-precision (e.g., 4-byte) floating-point values. To define an uncompressed three-dimensional AABB, two single-precision floating-point values (min/max) are used for each of the three axes (2 × 3 × 4), resulting in 24 bytes for storing the extents of the AABB, plus an index or pointer to the child node (e.g., a 4-byte integer or an 8-byte pointer). Thus, each AABB defined for a BVH node may take up to 32 bytes. A binary BVH node with two children may then require 64 bytes, a 4-wide BVH node may require 128 bytes, and an 8-wide BVH node may require up to 256 bytes.
Oriented bounding boxes using discrete oriented polytopes in k directions (k-DOPs) are also a common bounding volume format that may be used with the embodiments described herein. For a k-DOP, lower and upper bounds are stored for multiple arbitrary directions. In contrast to an AABB, a k-DOP is not limited to bounds along the coordinate axes only, but encloses the geometry in space along any number of directions.
To reduce the memory size requirements of a bounding volume hierarchy (BVH), the BVH data may be stored in a compressed format. For example, each AABB can be stored hierarchically compressed relative to its parent. However, hierarchical encoding can cause problems for ray tracing implementations when BVH node references are pushed onto the stack during ray traversal. When dereferenced later, the final AABB is computed by following the path to the root node, potentially resulting in long dependency chains. An alternative solution stores the current AABB on the stack, which requires a large amount of stack memory to store the additional data, since the stack depth per ray typically ranges between 40 and 60 entries.
Embodiments described herein provide apparatus, systems, methods, and various logical processes for compressing BVH nodes in a simple and efficient manner without requiring references to parent nodes or additional stack storage space to decompress child boundaries of the nodes, thereby significantly reducing the complexity of implementing ray tracing acceleration hardware.
In one embodiment, to reduce memory requirements, the N child bounding boxes of an N-wide BVH node are encoded relative to the merged box of all children, by storing the parent bounding box with absolute coordinates and full (e.g., floating-point) precision and storing the child bounding boxes at lower precision relative to the parent bounding box.
The approach described herein reduces memory storage and bandwidth requirements compared to conventional approaches that store full-precision bounding boxes for all sub-items. Each node may decompress separately from the other nodes. Thus, during traversal, the complete bounding box is not stored on the stack, and the entire path from the root of the tree is not re-traversed on pop operations to decompress the nodes. Additionally, ray-node intersection tests may be performed with reduced precision, thereby reducing the complexity required within the arithmetic hardware unit.
Bounding volume and ray-box intersection testing
Fig. 15 is an illustration of a bounding volume 1502, according to an embodiment. The bounding volume 1502 is shown axis-aligned to a three-dimensional axis 1500. However, the embodiments are applicable to different bounding representations (e.g., oriented bounding boxes, discrete oriented polytopes, spheres, etc.) and to any number of dimensions. The bounding volume 1502 defines the minimum and maximum extents of a three-dimensional object 1504 along each dimension of the axes. To generate a BVH for a scene, a bounding box is constructed for each object in the set of objects in the scene. A set of parent bounding boxes may then be constructed around groupings of the bounding boxes constructed for each object.
16A-B illustrate representations of bounding volume hierarchies of two-dimensional objects. Fig. 16A shows a set of bounding volumes 1600 surrounding a set of geometric objects. FIG. 16B shows the ordered tree 1602 of the bounding volume 1600 of FIG. 16A.
As shown in FIG. 16A, the set of bounding volumes 1600 includes a root bounding volume N1, which is the parent bounding volume for all of the other bounding volumes N2-N7. The bounding volumes N2 and N3 are internal bounding volumes between the root volume N1 and the leaf volumes N4-N7. The leaf volumes N4-N7 include the geometric objects O1-O8 of the scene.
FIG. 16B shows an ordered tree 1602 of the bounding volumes N1-N7 and the geometric objects O1-O8. The illustrated ordered tree 1602 is a binary tree in which each node of the tree has two child nodes. A data structure configured to contain the information for each node may include the boundary information for the bounding volume of the node (e.g., a bounding box), as well as at least a reference to the node for each child of the node.
The ordered tree 1602 of bounding volumes defines a hierarchy that may be used to perform hierarchical versions of various operations, including but not limited to collision detection and ray-box intersection. In the case of ray-box intersection, nodes may be tested in a hierarchical fashion beginning with the root node N1, which is the parent node of all other bounding volume nodes in the hierarchy. If the ray-box intersection test for the root node N1 fails, all other nodes of the tree may be bypassed. If the ray-box intersection test for the root node N1 passes, subtrees of the tree may be tested and traversed or bypassed in an ordered fashion until, at the least, the set of intersected leaf nodes N4-N7 is determined, as sketched below. The precise testing and traversal algorithms used may vary according to embodiments.
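For illustration only, a minimal recursive sketch of such a test-and-descend traversal over a binary tree like the ordered tree 1602 follows; the Node and Ray types and the intersects() test are assumptions, not part of this description:

    #include <vector>

    struct Ray;                              // ray representation (assumed)
    struct Node {
        bool is_leaf;
        const Node* children[2];             // binary tree, as in ordered tree 1602
        // ... bounding volume data (e.g., an AABB) ...
    };

    bool intersects(const Ray& ray, const Node& node);  // ray-box test (assumed)

    // Collect the leaf nodes whose bounding volumes the ray intersects.
    void traverse(const Ray& ray, const Node& node, std::vector<const Node*>& hits) {
        if (!intersects(ray, node))
            return;                          // bypass this node and its entire subtree
        if (node.is_leaf) {
            hits.push_back(&node);           // intersected leaf (e.g., N4-N7)
            return;
        }
        for (const Node* child : node.children)
            traverse(ray, *child, hits);
    }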
FIG. 17 is an illustration of a ray-box intersection test, according to an embodiment. During the ray-box intersection test, a ray 1702 is cast, and the equation defining the ray can be used to determine whether the ray intersects the planes that define the bounding box 1700 under test. The ray 1702 can be represented as O + t·D, where O corresponds to the origin of the ray, D is the direction of the ray, and t is a real value. The variable t can be used to define any point along the ray. The ray 1702 is said to intersect the bounding box 1700 when the largest entry-plane intersection distance is less than or equal to the smallest exit-plane distance. For the ray 1702 of FIG. 17, the y-plane entry intersection distance is shown as tmin-y 1704, and the y-plane exit intersection distance is shown as tmax-y 1708. The x-plane entry intersection distance can be calculated as tmin-x 1706, and the x-plane exit intersection distance is shown as tmax-x 1710. Accordingly, the ray 1702 can be mathematically shown to intersect the bounding box along at least the x and y planes, because tmin-x 1706 is less than tmax-y 1708. To perform the ray-box intersection test using a graphics processor, the graphics processor is configured to store an acceleration data structure defining at least each bounding box to be tested. To accelerate using a bounding volume hierarchy, at least the references to the child nodes of the bounding box are stored.
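The test above corresponds to the well-known slab method; a hedged sketch follows, in which Vec3 and the use of a precomputed reciprocal ray direction are illustrative assumptions:

    #include <algorithm>

    struct Vec3 { float x, y, z; };

    // Slab-method ray/AABB test: compute per-axis entry/exit distances
    // (tmin-x / tmax-x, etc.) and compare the largest entry against the
    // smallest exit.
    bool ray_box_intersect(const Vec3& origin, const Vec3& inv_dir,
                           const Vec3& box_lower, const Vec3& box_upper) {
        float tx0 = (box_lower.x - origin.x) * inv_dir.x;
        float tx1 = (box_upper.x - origin.x) * inv_dir.x;
        float ty0 = (box_lower.y - origin.y) * inv_dir.y;
        float ty1 = (box_upper.y - origin.y) * inv_dir.y;
        float tz0 = (box_lower.z - origin.z) * inv_dir.z;
        float tz1 = (box_upper.z - origin.z) * inv_dir.z;

        float t_entry = std::max({std::min(tx0, tx1), std::min(ty0, ty1), std::min(tz0, tz1)});
        float t_exit  = std::min({std::max(tx0, tx1), std::max(ty0, ty1), std::max(tz0, tz1)});

        // Hit when the largest entry distance does not exceed the smallest
        // exit distance and the box is not entirely behind the ray origin.
        return t_entry <= t_exit && t_exit >= 0.0f;
    }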
Bounding volume node compression
For an axis-aligned bounding box in 3D space, the acceleration data structure may store the lower and upper bounds of the bounding box in three dimensions. A software implementation may store these bounds using 32-bit floating-point numbers, which adds up to 2 × 3 × 4 = 24 bytes per bounding box. For an N-wide BVH node, N boxes and N child references must be stored. In total, the storage for a BVH node is N × 24 bytes for the boxes plus N × 4 bytes for the child references (assuming 4 bytes per reference), i.e. (24 + 4) × N bytes in total: 112 bytes for a 4-wide BVH node and 224 bytes for an 8-wide BVH node.
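A hedged sketch of such an uncompressed node layout, consistent with the arithmetic above (the exact field order is an assumption):

    #include <cstdint>

    // Uncompressed N-wide BVH node: six float bounds (24 bytes) plus a
    // 4-byte reference per child, giving (24 + 4) x N bytes per node.
    template <int N>
    struct UncompressedBVHNode {
        float    lower_x[N], lower_y[N], lower_z[N];  // per-child lower bounds
        float    upper_x[N], upper_y[N], upper_z[N];  // per-child upper bounds
        uint32_t child[N];                            // index or pointer per child
    };

    static_assert(sizeof(UncompressedBVHNode<4>) == 112, "4-wide node: 112 bytes");
    static_assert(sizeof(UncompressedBVHNode<8>) == 224, "8-wide node: 224 bytes");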
In one embodiment, the size of a BVH node is reduced by storing a single, higher-precision parent bounding box that encloses all of the child bounding boxes, and storing each child bounding box at lower precision relative to that parent box. Depending on the usage scenario, different numeric representations may be used to store the higher-precision parent bounding box and the lower-precision relative child bounds.
Fig. 18 is a block diagram illustrating an exemplary quantized BVH node 1810, according to an embodiment. The quantized BVH node 1810 may include higher-precision values that define the parent bounding box of the BVH node. For example, parent_lower_x 1812, parent_lower_y 1814, parent_lower_z 1816, parent_upper_x 1822, parent_upper_y 1824, and parent_upper_z 1826 may be stored using single- or double-precision floating-point values. For each child bounding box stored in the node, the values of the child bounding box may be quantized and stored as lower-precision values, such as fixed-point representations of bounding box values defined relative to the parent bounding box. For example, child_lower_x 1832, child_lower_y 1834, child_lower_z 1836, child_upper_x 1842, child_upper_y 1844, and child_upper_z 1846 may be stored as lower-precision fixed-point values. In addition, a child reference 1852 can be stored for each child node. The child reference 1852 may be an index into a table that stores the location of each child node, or may be a pointer to the child node.
As shown in FIG. 18, the parent bounding box may be stored using single- or double-precision floating-point values, while the relative child bounding boxes may be encoded using M-bit fixed-point values. The data structure of the quantized BVH node 1810 of FIG. 18 can be defined by the quantized N-wide BVH node shown in Table 1 below.
Table 1: quantized N-wide BVH node
The quantized node of Table 1 achieves a reduced data structure size by quantizing the child values while maintaining baseline accuracy by storing higher-precision values for the extents of the parent bounding box. In Table 1, Real denotes a higher-accuracy numeric representation (e.g., a 32-bit or 64-bit floating-point value), and UintM denotes a lower-accuracy unsigned integer with M bits of accuracy, used to represent fixed-point numbers. Reference denotes the type used to represent references to child nodes (e.g., a 4-byte index or an 8-byte pointer).
A typical instantiation of this approach may use 32-bit child references, single-precision floating-point values for the parent bounds, and M = 8 bits (1 byte) for the relative child bounds. The compressed node then requires 6 × 4 + 6 × N + 4 × N bytes. For a 4-wide BVH this amounts to 64 bytes (compared to 112 bytes for the uncompressed version), and for an 8-wide BVH this amounts to 104 bytes (compared to 224 bytes uncompressed).
To traverse such a compressed BVH node, graphics processing logic may decompress the relative child bounding boxes and then intersect the decompressed node using standard methods. The uncompressed lower bound can thus be obtained for each dimension x, y, and z. Equation 1 below shows the formula for obtaining the child_lower_x value.
Equation 1: Child node decompression of a BVH node
child_lower_x_decompressed = parent_lower_x + child_lower_x × (parent_upper_x − parent_lower_x) / (2^M − 1)
In Equation 1 above, M represents the number of bits of accuracy of the fixed point representation of the child bounds. The logic to decompress the child data for each dimension of a BVH node may be implemented as in Table 2 below.
Table 2: child node decompression of BVH node
Table 2 shows that the floating point value of the lower bound of a child bounding box is calculated based on the floating point values of the extent of the parent bounding box and the fixed point value of the child bounding box, which is stored as an offset from the extent of the parent bounding box. The child upper bounds may be calculated in a similar manner.
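As a concrete illustration of the Table 2 logic, the following sketch decompresses the lower x bound of child i, assuming the QuantizedNode4 layout sketched above with M = 8 (so 2^M − 1 = 255):
static inline float dequantize_child_lower_x(const struct QuantizedNode4 *n, int i)
{
    /* Map the 8-bit fixed point offset back into the parent's floating point range. */
    const float extent = n->parent_upper_x - n->parent_lower_x;
    return n->parent_lower_x + (float)n->child_lower_x[i] * extent / 255.0f;
}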
In one embodiment, decompression performance may be improved by storing a scaled parent bounding box size (e.g., (parent_upper_x − parent_lower_x)/(2^M − 1)) instead of the parent_upper_x/y/z values. In such an embodiment, the child bounding box extents may be calculated according to the example logic shown in Table 3.
Table 3: enhancer node decompression of BVH nodes
Note that in an optimized version, the decompression/dequantization can be formulated as MAD instructions (multiply-and-add), where hardware support exists for such instructions. In one embodiment, the operations for each child node may be performed using SIMD/vector logic, enabling all children within the node to be evaluated simultaneously.
While the above-described method works well for shader or CPU-based implementations, one embodiment provides specialized hardware configured to perform ray tracing operations, including ray-box intersection tests using bounding volume hierarchies. In such an embodiment, the specialized hardware may be configured to store a further quantized representation of BVH node data and to automatically dequantize such data when performing ray-box intersection tests.
FIG. 19 is a block diagram of a composite floating point data block 1900 for use by a quantized BVH node 1910, according to a further embodiment. In one embodiment, instead of a 32-bit single precision floating point representation or a 64-bit double precision floating point representation of the extents of the parent bounding box, logic to support the composite floating point data block 1900 may be defined within specialized logic of the graphics processor. The composite floating point (CFP) data block 1900 may include a 1-bit sign bit 1902, a variable sized (E-bit) signed integer exponent 1904, and a variable sized (K-bit) mantissa 1906. The values of E and K may be configurable by adjusting values stored in configuration registers of the graphics processor. In one embodiment, the values of E and K may be independently configured within a range of values. In one embodiment, a fixed set of interrelated values for E and K may be selected via a configuration register. In one embodiment, single values for E and K are hard-coded into the BVH logic of the graphics processor. The values E and K enable the CFP data block 1900 to be used as a custom (e.g., special purpose) floating point data type that can be tailored to the data set.
Using the CFP data block 1900, the graphics processor may be configured to store bounding box data in a quantized BVH node 1910. In one embodiment, the lower bounds of the parent bounding box (parent_lower_x 1912, parent_lower_y 1914, parent_lower_z 1916) are stored at an accuracy level determined by the E and K values selected for the CFP data block 1900. The stored values for the lower bounds of the parent bounding box will generally be set to a higher precision level than the values of the child bounding boxes (child_lower_x 1924, child_upper_x 1926, child_lower_y 1934, child_upper_y 1936, child_lower_z 1944, child_upper_z 1946), which are stored as fixed point values. The scaled parent bounding box size is stored as power-of-2 exponents (e.g., exp_x 1922, exp_y 1932, exp_z 1942). Additionally, a reference to each child (e.g., child reference 1952) may be stored. The size of the quantized BVH node 1910 may scale with the width (e.g., number of children) stored in each node, where the amount of storage used for the child references and the bounding box values of the child nodes increases with each additional child.
The logic for the implementation of the quantized BVH node of fig. 19 is shown in table 4 below.
Table 4: quantized N-wide BVH node for hardware implementation
As shown in Table 4, a composite floating point data block (e.g., struct Float) may be defined to represent the values of the parent bounding box. The Float structure includes a 1-bit sign (int1 sign), an E-bit signed integer to store a power-of-2 exponent (intE exp), and a K-bit unsigned integer to represent the mantissa (uintK mantissa), which are used to store the high precision bounds. For the child bounding box data, M-bit unsigned integers (uintM child_lower_x/y/z; uintM child_upper_x/y/z) may be used to store fixed point numbers that encode the relative child bounds.
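The following is a hypothetical C rendering of the Table 4 layout for E = 8, K = 16, M = 8, and N = 4. C bit-fields stand in for the intE/uintK types, so a compiler may pad this sketch slightly beyond the 52 bytes of a packed hardware layout; the node struct name is an assumption.
#include <stdint.h>
struct Float {                 /* composite floating point value */
    uint32_t sign     : 1;     /* int1 sign */
    int32_t  exp      : 8;     /* intE exp, E = 8: signed power-of-2 exponent */
    uint32_t mantissa : 16;    /* uintK mantissa, K = 16 (one bit may be implied) */
};
struct QuantizedNodeHW {
    struct Float parent_lower_x, parent_lower_y, parent_lower_z;
    int8_t   exp_x, exp_y, exp_z;                 /* scaled parent extent as power-of-2 exponent */
    uint8_t  child_lower_x[4], child_upper_x[4];  /* uintM relative child bounds, M = 8 */
    uint8_t  child_lower_y[4], child_upper_y[4];
    uint8_t  child_lower_z[4], child_upper_z[4];
    uint32_t child[4];                            /* 32-bit child references */
};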
For the example of E = 8, K = 16, M = 8, and using 32 bits for the child references, the quantized node structure of Table 4 has a size of 52 bytes for a 4-wide BVH and a size of 92 bytes for an 8-wide BVH, which is a reduction in structure size relative to the quantized node of Table 1 and a significant reduction relative to existing implementations. Note that for the mantissa values (K = 16), one bit of the mantissa may be implied, reducing the storage requirement to 15 bits.
The layout of the BVH node structure of Table 4 enables hardware of reduced complexity to perform ray-box intersection tests on the child bounding boxes. The hardware complexity is reduced for several reasons. A smaller number of bits K may be chosen, because the relative child bounds add M bits of precision. The scaled parent bounding box size is stored as a power of 2 (the exp_x/y/z fields), which simplifies the calculations. Additionally, the calculations are restructured to reduce the size of the multipliers.
In one embodiment, ray intersection logic of the graphics processor calculates hit distances of a ray to axis-aligned planes to perform the ray-box test. The ray intersection logic may use BVH node logic that includes support for the quantized node structure of Table 4. The logic calculates the distance to the lower bounds of the parent bounding box using the higher precision parent lower bounds and the quantized relative extents of the child boxes. Exemplary logic for the x-plane calculations is shown in Table 5 below.
Table 5: ray-box intersection distance determination
With respect to the logic of Table 5, if the ray is represented with single precision floating point accuracy, a 23-bit × 15-bit multiplier may be used, because the parent_lower_x value is stored with a 15-bit mantissa. The distances to the lower bounds of the parent bounding box in the y and z planes can be calculated in a manner similar to the calculation of dist_parent_lower_x.
Using the parent lower bounds, the intersection distances to the relative child bounding boxes can be calculated for each child bounding box, as illustrated by the calculation of dist_child_lower_x and dist_child_upper_x in Table 5. The calculation of the dist_child_lower/upper_x/y/z values may be performed using a 23-bit × 8-bit multiplier.
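Since Table 5 itself is not reproduced, the following sketch shows the x-plane distance computations it describes, using the QuantizedNodeHW sketch above; dequant() is a hypothetical helper that expands the composite float to a regular float, and the ray is assumed to carry its origin and reciprocal direction.
#include <math.h>
float dequant(struct Float f);  /* hypothetical composite float expansion */
static inline void plane_distances_x(const struct QuantizedNodeHW *n, int i,
                                     float ray_org_x, float ray_rcp_dir_x,
                                     float *dist_child_lower_x, float *dist_child_upper_x)
{
    const float parent_lower_x      = dequant(n->parent_lower_x);
    const float dist_parent_lower_x = (parent_lower_x - ray_org_x) * ray_rcp_dir_x;
    const float scale_x             = exp2f((float)n->exp_x);  /* power-of-2 parent extent */
    /* child plane distance = parent plane distance + child offset * scale * 1/direction */
    *dist_child_lower_x = dist_parent_lower_x + (float)n->child_lower_x[i] * scale_x * ray_rcp_dir_x;
    *dist_child_upper_x = dist_parent_lower_x + (float)n->child_upper_x[i] * scale_x * ray_rcp_dir_x;
}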
FIG. 20 illustrates a ray-box intersection using quantized values to define a child bounding box 2010 relative to a parent bounding box 2000, according to an embodiment. Applying the ray-box intersection distance determination equations for the x-plane shown in Table 5, the distances along the ray 2002 at which the x-planes of the parent bounding box 2000 are intersected can be determined. The position dist_parent_lower_x 2003 at which the ray 2002 crosses the lower bounding plane 2004 of the parent bounding box 2000 may be determined. Based on dist_parent_lower_x 2003, dist_child_lower_x 2005 may be determined for the position at which the ray intersects the minimum bounding plane 2006 of the child bounding box 2010. Additionally, based on dist_parent_lower_x 2003, dist_child_upper_x 2007 may be determined for the position at which the ray intersects the maximum bounding plane 2008 of the child bounding box 2010. Similar determinations may be performed for each dimension in which the parent bounding box 2000 and the child bounding box 2010 are defined (e.g., along the y-axis and z-axis). The plane intersection distances may then be used to determine whether the ray intersects the child bounding box. In one embodiment, graphics processing logic may determine intersection distances for multiple dimensions and multiple bounding boxes in parallel using SIMD and/or vector logic. Additionally, at least a first portion of the computations described herein may be performed on a graphics processor, while a second portion may be performed on one or more application processors coupled to the graphics processor.
FIG. 21 is a flow diagram of BVH decompression and traversal logic 2100, according to an embodiment. In one embodiment, the BVH decompression and traversal logic 2100 resides in dedicated hardware logic of the graphics processor, or may be performed by shader logic executed on execution resources of the graphics processor. The BVH decompression and traversal logic 2100 may cause the graphics processor to perform an operation to calculate the distance along a ray to the lower bounding plane of a parent bounding volume, as shown at block 2102. At block 2104, the logic may calculate the distance to the lower bounding plane of a child bounding volume based in part on the calculated distance to the lower bounding plane of the parent bounding volume. At block 2106, the logic may calculate the distance to the upper bounding plane of the child bounding volume based in part on the calculated distance to the lower bounding plane of the parent bounding volume.
At block 2108, the BVH decompression and traversal logic 2100 may determine a ray intersection for the child bounding volume based in part on the distances to the upper and lower bounding planes of the child bounding volume, where the intersection is determined using the intersection distances for each dimension of the bounding box. In one embodiment, the BVH decompression and traversal logic 2100 determines the ray intersection for the child bounding volume by determining whether the ray's maximum entry-plane intersection distance is less than or equal to its minimum exit-plane distance. In other words, the ray intersects the child bounding volume when it enters the bounding volume along all defined planes before exiting the bounding volume along any defined plane. If at 2110 the BVH decompression and traversal logic 2100 determines that the ray intersects the child bounding volume, the logic may traverse to the child node for that bounding volume to test the child bounding volumes within the child node, as shown at block 2112. At block 2112, a node traversal is performed in which the reference to the node associated with the intersected bounding box is accessed. The child bounding volume then becomes the parent bounding volume, and the children of the intersected bounding volume are evaluated. If at 2110 the BVH decompression and traversal logic 2100 determines that the ray does not intersect the child bounding volume, the branch of the bounding hierarchy associated with that child bounding volume is skipped, as shown at block 2114, because the ray will not intersect any bounding volume in the subtree branch beneath a non-intersected child bounding volume.
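A minimal sketch of the intersection decision at block 2108 follows, assuming a ray with positive direction components so that the lower planes are the entry planes; a full implementation swaps the lower/upper plane distances per direction sign.
#include <math.h>
static inline int ray_hits_child(float lo_x, float lo_y, float lo_z,
                                 float hi_x, float hi_y, float hi_z)
{
    const float t_entry = fmaxf(fmaxf(lo_x, lo_y), lo_z);  /* latest entry plane */
    const float t_exit  = fminf(fminf(hi_x, hi_y), hi_z);  /* earliest exit plane */
    return t_entry <= t_exit;  /* hit if the ray enters all planes before exiting any */
}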
Further compression via shared plane bounding boxes
For any N-wide BVH using bounding boxes, the bounding volume hierarchy may be constructed such that each of the six sides of a parent's 3D bounding box is shared by at least one child bounding box. For a 3D shared plane bounding box (SPBB), 6 × log2(N) bits may be used to indicate whether a given plane of the parent bounding box is shared with a child bounding box. For N = 4, 12 bits would be used to indicate the shared planes, with each pair of two bits identifying which of the four children reuses the respective potentially shared parent plane. In the case of a 2-wide BVH, 6 additional bits may be added to indicate, for each plane of the parent bounding box, whether that plane (e.g., side) of the bounding box is shared by a child; each bit then indicates whether the parent plane is reused by a particular child. Although the SPBB concept is applicable to any N, in one embodiment the benefits of the SPBB are generally greatest for an SPBB width of 2 (e.g., binary).
Using shared plane bounding boxes can further reduce the amount of data stored when using BVH node quantization as described herein. In the example of a 3D, 2-wide BVH, the six shared plane bits refer to min_x, max_x, min_y, max_y, min_z, and max_z of the parent bounding box. If the min_x bit is zero, the first child inherits that shared plane from the parent bounding box. For each child that shares a plane with the parent bounding box, no quantized value for that plane needs to be stored, which reduces both the decompression cost and the storage cost of the node. Additionally, the higher precision value of the plane may be used for the child bounding box.
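A sketch of how a shared plane might be inherited during decompression follows; the node layout, the dequant_min_x() helper, and the bit assignment are assumptions for illustration, not taken from the text.
struct SPBBNode2 {
    float parent_lower_x;               /* ...remaining parent planes and quantized child data... */
};
float dequant_min_x(const struct SPBBNode2 *n, int child);  /* hypothetical decoder */
static inline float child_min_x(const struct SPBBNode2 *n, int child, unsigned plane_bits)
{
    /* bit 0 is taken to be the min_x plane bit: 0 = left child, 1 = right child */
    if (((plane_bits >> 0) & 1u) == (unsigned)child)
        return n->parent_lower_x;       /* inherit the full precision parent plane */
    return dequant_min_x(n, child);     /* otherwise decode the stored quantized value */
}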
FIG. 22 is an illustration of an exemplary two-dimensional shared plane bounding box 2200. The two-dimensional (2D) shared plane bounding box (SPBB) 2200 includes a left child 2202 and a right child 2204. For a 2D binary SPBB, 4 × log2(2) = 4 additional bits may be used to indicate which child shares each of the four potentially shared planes of the parent bounding box, with one bit associated with each plane. In one embodiment, zero may be associated with the left child 2202 and one with the right child 2204, such that the shared plane bits for the SPBB 2200 are min_x = 0, max_x = 1, min_y = 0, max_y = 0, because the left child 2202 shares the lower_x, upper_y, and lower_y planes with the parent SPBB 2200, while the right child 2204 shares the upper_x plane.
FIG. 23 is a flow diagram of shared plane BVH logic 2300, according to an embodiment. The shared plane BVH logic 2300 may be used to reduce the number of quantized values stored for the lower and upper extents of one or more child bounding boxes, to reduce the decompression/dequantization cost of a BVH node, and to improve the accuracy of the values used in ray-box intersection tests on the child bounding boxes of a BVH node. In one embodiment, the shared plane BVH logic 2300 includes defining a parent bounding box over a set of child bounding boxes such that the parent bounding box shares one or more planes with one or more of the child bounding boxes, as shown at block 2302. In one embodiment, the parent bounding box may be defined by selecting an existing set of axis-aligned bounding boxes for geometric objects in a scene and defining the parent bounding box based on the minimum and maximum extents of the set in each plane. For example, the upper plane value for each plane of the parent bounding box is defined as the maximum value for that plane within the set of child bounding boxes. At block 2304, the shared plane BVH logic 2300 may encode the shared child planes for each plane of the parent bounding box. As shown at block 2306, the shared plane BVH logic 2300 may have a child plane with a shared plane inherit the parent plane value during the ray-box intersection test. The shared plane value for a child plane may be inherited at the higher precision at which the parent plane values are stored in the BVH node structure, and the generation and storage of a lower precision quantized value for the shared plane may be bypassed.
FIG. 24 is a block diagram of a computing device 2400 including a graphics processor 2404 with bounding volume hierarchy logic 2424, according to an embodiment. The computing device 2400 may be a computing device such as the data processing system 100 of FIG. 1. The computing device 2400 may also be or be included within a communication device such as a set-top box (e.g., an Internet-based cable set-top box), a global positioning system (GPS)-based device, and the like. The computing device 2400 may also be or be included within a mobile computing device such as a cellular phone, smartphone, personal digital assistant (PDA), tablet computer, laptop computer, e-reader, smart television, television platform, wearable device (e.g., glasses, watches, bracelets, smart cards, jewelry, clothing items, etc.), media player, and so forth. For example, in one embodiment, the computing device 2400 includes a mobile computing device employing an integrated circuit ("IC"), such as a system on a chip ("SoC" or "SOC"), that integrates various hardware and/or software components of the computing device 2400 on a single chip.
In one embodiment, the bounding volume hierarchy (BVH) logic 2424 includes logic to encode a compressed representation of a bounding volume hierarchy and additional logic to decode and interpret the compressed representation. The BVH logic 2424 may work in concert with the ray tracing logic 2434 to perform hardware-accelerated ray-box intersection tests. In one embodiment, the BVH logic 2424 is configured to encode a plurality of child bounding volumes relative to a reference bounding volume. For example, the BVH logic 2424 may encode the reference bounding volume and the child bounding volumes using upper and lower bounds in multiple directions, where the reference bounding volume is encoded using floating point values and the child bounding volumes are encoded using fixed point values. The BVH logic 2424 may be configured to encode the reference bounding volume as lower bounds and a scaled extent of the lower and upper bounds in multiple directions, and the child bounding volumes using lower and upper bounds. In one embodiment, the BVH logic 2424 is configured to use the encoded plurality of child bounding volumes to encode a node of the bounding volume hierarchy.
The ray tracing logic 2434 may operate, at least in part, in concert with execution resources 2444 of the graphics processor 2404, where the execution resources 2444 include execution units and associated logic, such as the logic in the graphics cores 580A-N of FIG. 5 and/or the execution logic 600 shown in FIG. 6. The ray tracing logic 2434 may perform ray traversal through the bounding volume hierarchy and test whether a ray intersects the encoded child bounding volumes of a node. The ray tracing logic 2434 may be configured to calculate bounding plane distances to test for ray-bounding volume intersection by calculating the distance to the lower reference bounding plane and, to obtain the distances to all of the child bounding planes, adding to that distance the arithmetic product of the child bounding plane position, the scaled extent of the reference bounds, and the reciprocal ray direction.
In one embodiment, a set of registers 2454 may also be included to store configuration and operational data for the components of the graphics processor 2404. The graphics processor 2404 may additionally include a memory device configured as a cache 2414. In one embodiment, cache 2414 is a render cache for performing rendering operations. In one embodiment, the cache 2414 may also include additional levels of the memory hierarchy, such as a last level cache stored in the embedded memory module 218 of fig. 2.
As shown, in one embodiment, in addition to graphics processor 2404, computing device 2400 may further include any number and type of hardware components and/or software components, such as (but not limited to) an application processor 2406, memory 2408, and input/output (I/O) sources 2410. The application processor 2406 may interact with a hardware graphics pipeline as shown with reference to fig. 3 to share graphics pipeline functionality. The processed data is stored in buffers in the hardware graphics pipeline and state information is stored in memory 2408. The resulting image is then transmitted to a display controller for output via a display device, such as display device 320 of FIG. 3. The display device may be of various types, such as a Cathode Ray Tube (CRT), a Thin Film Transistor (TFT), a Liquid Crystal Display (LCD), an Organic Light Emitting Diode (OLED) array, etc., and may be configured to display information to a user.
The application processor 2406 may include one or more processors, such as the processor(s) 102 of FIG. 1, and may be the central processing unit (CPU) that executes, at least in part, an operating system (OS) 2402 of the computing device 2400. The OS 2402 may serve as an interface between the hardware and/or physical resources of the computing device 2400 and a user. The OS 2402 may include driver logic 2422 for various hardware devices in the computing device 2400. The driver logic 2422 may include graphics driver logic 2423, such as the user mode graphics driver 1026 and/or the kernel mode graphics driver 1029 of FIG. 10. In one embodiment, the graphics driver logic 2423 may be used to configure the BVH logic 2424 and the ray tracing logic 2434 of the graphics processor 2404.
It is contemplated that in some embodiments, graphics processor 2404 may exist as part of application processor 2406 (such as part of a physical CPU package), in which case at least a portion of memory 2408 may be shared by application processor 2406 and graphics processor 2404, although at least a portion of memory 2408 may be exclusive to graphics processor 2404, or graphics processor 2404 may have a separate memory bank. Memory 2408 may include pre-allocated regions of buffers (e.g., frame buffers); however, those skilled in the art will appreciate that embodiments are not so limited and any memory accessible to the lower graphics pipeline may be used. Memory 2408 may include various forms of Random Access Memory (RAM) (e.g., SDRAM, SRAM, etc.) including applications that utilize graphics processor 2404 to render desktop or 3D graphics scenes. A memory controller hub, such as the memory controller hub 116 of figure 1, can access data in the memory 2408 and forward it to the graphics processor 2404 for graphics pipeline processing. Memory 2408 may be made available to other components within computing device 2400. For example, in an implementation of a software program or application, any data (e.g., input graphics data) received from the various I/O sources 2410 of the computing device 2400 may be temporarily queued into memory 2408 before it is operated on by one or more processors (e.g., application processor 2406). Similarly, data that the software program determines should be sent from the computing device 2400 to an external entity through one of the computing system interfaces or stored into an internal storage element is typically temporarily queued in memory 2408 before it is transmitted or stored.
The I/O sources 2410 may include devices such as touch screens, touch panels, touch pads, virtual or regular keyboards, virtual or regular mice, ports, connectors, network devices, and the like, and may be attached via an input/output (I/O) control hub (ICH) 130 as referenced in FIG. 1. Additionally, the I/O sources 2410 may include one or more I/O devices implemented to transfer data to and/or from the computing device 2400 (e.g., a network adapter), or to provide large-scale non-volatile storage within the computing device 2400 (e.g., hard disk drives). User input devices, including alphanumeric and other keys, may be used to communicate information and command selections to the graphics processor 2404. Another type of user input device is a cursor control device, such as a mouse, trackball, touch screen, touchpad, or cursor direction keys, used to communicate direction information and command selections to the GPU and to control cursor movement on the display device. The camera and microphone arrays of the computing device 2400 may be employed to observe gestures, record audio and video, and receive and transmit visual and audio commands.
The I/O sources 2410, configured as network interfaces, may provide access to a network, such as a LAN, a Wide Area Network (WAN), a Metropolitan Area Network (MAN), a Personal Area Network (PAN), bluetooth, a cloud network, a cellular or mobile network (e.g., third generation (3G), fourth generation (4G), etc.), an intranet, the internet, and so forth. The network interface(s) may include, for example, a wireless network interface having one or more antennas. The network interface(s) may also include, for example, a wired network interface for communicating with remote devices over a network cable, which may be, for example, an ethernet cable, a coaxial cable, a fiber optic cable, a serial cable, or a parallel cable.
The network interface(s) may provide access to a LAN, e.g., by conforming to IEEE 802.11 standards, and/or the wireless network interface may provide access to a personal area network, e.g., by conforming to a bluetooth standard. Other wireless network interfaces and/or protocols may also be supported, including previous and subsequent versions of the standard. In addition to, or in lieu of, communication via wireless LAN standards, the network interface(s) may provide wireless communication using, for example, Time Division Multiple Access (TDMA) protocols, global system for mobile communications (GSM) protocols, Code Division Multiple Access (CDMA) protocols, and/or any other type of wireless communication protocol.
It will be appreciated that for certain implementations, systems equipped with fewer or more components than the examples described above may be preferred. Therefore, the configuration of the computing device 2400 may vary from implementation to implementation depending on numerous factors, such as price constraints, performance requirements, technological improvements, or other circumstances. Examples include (without limitation) a mobile device, a personal digital assistant, a mobile computing device, a smartphone, a cellular phone, a handset, a one-way pager, a two-way pager, a messaging device, a computer, a personal computer (PC), a desktop computer, a laptop computer, a notebook computer, a handheld computer, a tablet computer, a server array or server farm, a web server, a network server, an Internet server, a workstation, a microcomputer, a mainframe computer, a supercomputer, a network appliance, a web appliance, a distributed computing system, a multiprocessor system, a processor-based system, consumer electronics, programmable consumer electronics, a television, a digital television, a set-top box, a wireless access point, a base station, a subscriber station, a mobile subscriber center, a radio network controller, a wireless network controller, a router, a hub, a gateway, a bridge, a switch, a machine, or a combination thereof.
Apparatus and method for compressing leaf nodes of bounding volume hierarchies
A disadvantage of acceleration structures such as bounding volume hierarchies (BVHs) and k-d trees is that they cost both time and memory to build and store. One way to reduce this overhead is to employ some sort of compression and/or quantization on the acceleration data structure, which works particularly well for BVHs, which lend themselves naturally to conservative incremental (delta) encoding. On the positive side, this can significantly reduce the size of the acceleration structure, often halving the size of BVH nodes. On the negative side, compressing BVH nodes also incurs overhead, which may fall into different categories: first, there is the obvious cost of decompressing each BVH node during traversal; second, in particular for hierarchical encoding schemes, the need to track parent information slightly complicates stack operations; and third, quantizing the bounds conservatively means the bounding boxes are somewhat less tight than uncompressed ones, which triggers a measurable increase in the number of nodes and primitives that must be traversed and intersected, respectively.
Compressing the BVH by local quantization is a known method to reduce its size. An N-wide BVH node contains the axis-aligned bounding boxes (AABBs) of its "N" children in single precision floating point format. Local quantization represents the "N" child AABBs relative to the parent's AABB and stores these values in a quantized, e.g., 8-bit, format, thereby reducing the size of the BVH node.
Local quantization of the entire BVH introduces multiple overhead factors: (a) the dequantized AABBs are coarser than the original single precision floating point AABBs, introducing additional traversal and intersection steps for each ray; and (b) the dequantization operation itself is costly, adding overhead to each ray traversal step. Because of these drawbacks, compressed BVHs are used only in certain application scenarios and are not widely adopted.
One embodiment of the present invention employs techniques to compress leaf nodes for hair primitives in bounding volume hierarchies. In particular, in one embodiment, several groups of oriented primitives are stored together with a parent bounding box, eliminating child pointer storage in the leaf nodes. An oriented bounding box is then stored for each primitive using 16-bit coordinates that are quantized with respect to the corners of the parent box. Finally, a quantized normal is stored for each group of primitives to indicate the orientation. This approach may lead to a significant reduction in bandwidth and memory footprint for BVH hair primitives.
In some embodiments, BVH nodes (e.g., for an 8-wide BVH) are compressed by storing the parent bounding box and encoding the N child bounding boxes (e.g., 8 children) with lower precision relative to the parent bounding box. A drawback of applying this idea to every node of the BVH is that some decompression overhead is introduced at each node when traversing rays through the structure, which may degrade performance.
To address this issue, one embodiment of the invention uses compressed nodes only at the lowest level of the BVH. This provides the advantage that the higher BVH levels run at optimal performance (i.e., large boxes are touched most often by rays, but there are only few of them), and compression at the lower/lowest level is also very effective, since most of the data of the BVH resides in the lowest level(s).
Additionally, in one embodiment, quantization is also applied to BVH nodes that store oriented bounding boxes. As discussed below, this operation is somewhat more complex than for axis-aligned bounding boxes. In one implementation, the use of compressed BVH nodes with oriented bounding boxes is combined with the use of compressed nodes only at the lowest level (or lower levels) of the BVH.
Thus, one embodiment improves upon fully compressed BVHs by introducing a single dedicated layer of compressed leaf nodes while using regular, uncompressed BVH nodes for the internal nodes. One motivation behind this approach is that almost all of the compression savings come from the lowest levels of a BVH (which, especially for 4-wide and 8-wide BVHs, make up the vast majority of all nodes), while most of the overhead comes from the internal nodes. Consequently, introducing a single layer of dedicated "compressed leaf nodes" gives almost the same (and in some cases even better) compression gains as a fully compressed BVH, while maintaining almost the same traversal performance as an uncompressed BVH.
In one embodiment, the techniques described herein are integrated within traversal/intersection circuitry within a graphics processor (such as GPU 2505 as shown in FIG. 25) that includes a dedicated set of graphics processing resources arranged into multi-core groups 2500A-N. Although only the details of a single multi-core group 2500A are provided, it will be appreciated that other multi-core groups 2500B-N may be equipped with the same or similar sets of graphics processing resources.
As shown, the multi-core group 2500A may include a set of graphics cores 2530, a set of tensor cores 2540, and a set of ray tracing cores 2550. A scheduler/dispatcher 2510 schedules and dispatches graphics threads for execution on the various cores 2530, 2540, 2550. A set of register files 2520 stores operand values used by the cores 2530, 2540, 2550 when executing the graphics threads. These may include, for example, integer registers for storing integer values, floating point registers for storing floating point values, vector registers for storing packed data elements (integer and/or floating point data elements), and tile registers for storing tensor/matrix values. In one embodiment, the tile registers are implemented as combined sets of vector registers.
One or more level 1 caches and texture units 2560 locally store graphics data, such as texture data, vertex data, pixel data, ray data, bounding volume data, and the like, within each multi-core group 2500A. A level 2 (L2) cache 2580 shared by all or a subset of the multi-core groups 2500A-N stores graphics data and/or instructions for multiple concurrent graphics threads. One or more memory controllers 2570 couple the GPU 2505 to memory 2598, which memory 2598 can be system memory (e.g., DRAM) and/or dedicated graphics memory (e.g., GDDR6 memory).
Input/output (IO) circuitry 2595 couples GPU 2505 to one or more IO devices 2590, such as Digital Signal Processors (DSPs), network controllers, or user input devices. The I/O device 2590 may be coupled to the GPU 2505 and memory 2598 using on-chip interconnects. One or more IO memory management units (IOMMU) 2570 of the IO circuitry 2595 directly couple the IO devices 2590 to the system memory 2598. In one embodiment, the IOMMU 2570 manages multiple sets of page tables to map virtual addresses to physical addresses in the system memory 2598. In this embodiment, IO device 2590, CPU(s) 2599, and GPU(s) 2505 may share the same virtual address space.
In one implementation, the IOMMU 2570 supports virtualization. In this case, it may use a first set of page tables to map guest/graphics virtual addresses to guest/graphics physical addresses, and may use a second set of page tables to map guest/graphics physical addresses to system/host physical addresses (e.g., within system memory 2598).
In one embodiment, CPU 2599, GPU 2505, and IO devices 2590 are integrated on a single semiconductor chip and/or chip package. The memory 2598 shown may be integrated on the same chip or may be coupled to the memory controller 2570 via an off-chip interface. In one implementation, memory 2598 includes GDDR6 memory that shares the same virtual address space as other physical system-level memory, although the underlying principles of the invention are not limited to this particular implementation.
In one embodiment, the tensor cores 2540 include a plurality of execution units specifically designed to perform matrix operations, which are the fundamental compute operations used to perform deep learning operations. For example, matrix multiplication operations may be used for neural network training and inference. The tensor cores 2540 may perform matrix processing using a variety of operand precisions, including single precision floating point (e.g., 32 bits), half precision floating point (e.g., 16 bits), integer words (16 bits), bytes (8 bits), and half-bytes (4 bits). In one embodiment, a neural network implementation extracts features of each rendered scene, potentially combining details from multiple frames, to construct a high quality final image.
In one embodiment, ray tracing core 2550 accelerates ray tracing operations for both real-time ray tracing and non-real-time ray tracing implementations. For example, with respect to embodiments of the invention, ray trace core 2550 may include circuitry/logic to compress the leaf nodes of BVHs. Additionally, ray tracing core 2550 may include ray traversal/intersection circuitry to perform ray traversal using the BVH and identify intersections between rays and primitives contained within the BVH volume. Ray tracing core 2550 may also include circuitry for performing depth testing and culling (e.g., using a Z-buffer or similar arrangement). Using a dedicated ray tracing core 2550 for traversal/intersection operations significantly reduces the load on the graphics core 2530. Without these ray tracing cores 2550, traversal and intersection operations would be implemented using shaders running on the graphics core 2530, which would consume most of the graphics processing resources of the GPU 2505, making real-time ray tracing impractical.
Figure 26 illustrates an exemplary ray trace engine 2600 that performs the leaf node compression and decompression operations described herein. In one embodiment, the ray trace engine 2600 includes the circuitry of one or more of the ray trace cores 2550 described above. Alternatively, the ray tracing engine 2600 may be implemented on a core of the CPU 2599 or on other types of graphics cores (e.g., Gfx core 2530, tensor core 2540, etc.).
In one embodiment, a ray generator 2602 generates rays that a traversal/intersection unit 2603 traces through a scene comprising a plurality of input primitives 2606. For example, an application such as a virtual reality game may generate command streams from which the input primitives 2606 are generated. The traversal/intersection unit 2603 traverses the rays through the BVH 2605 generated by a BVH builder 2607 and identifies hit points where the rays intersect one or more of the primitives 2606. Although illustrated as a single unit, the traversal/intersection unit 2603 may include a traversal unit coupled to a separate intersection unit. These units may be implemented in circuitry, in software/commands executed by a GPU or CPU, or in any combination thereof.
Node compression/decompression
In one embodiment, the BVH processing circuitry/logic 2604 includes a BVH builder 2607 that generates the BVH 2605 as described herein, based on the spatial relationships between the primitives 2606 in the scene. Additionally, the BVH processing circuitry/logic 2604 includes a BVH compressor 2625 and a BVH decompressor 2626 for compressing and decompressing the leaf nodes, respectively, as described herein. For purposes of illustration, the following description focuses on 8-wide BVHs (BVH8).
As shown in FIG. 27, one embodiment of a single 8-wide BVH node 2700A contains 8 bounding boxes 2701-2708 and 8 (64-bit) child pointers/references 2710 pointing to the bounding boxes/leaf data 2701-2708. In one embodiment, the BVH compressor 2625 performs an encoding in which the 8 child bounding boxes 2701A-2708A are expressed relative to the parent bounding box 2700A and quantized to 8-bit uniform values, shown as bounding box leaf data 2701B-2708B. The quantized 8-wide BVH, the QBVH8 node 2700B, is encoded by the BVH compressor 2625 using start and extent values stored as two 3-dimensional single precision vectors (2 × 12 bytes). The eight quantized child bounding boxes 2701B-2708B are stored as 8 bytes for each of the lower and upper bounds per dimension of the bounding box (48 bytes total). Note that this layout differs from existing implementations in that the extent is stored at full precision, which in general provides tighter bounds but requires more space.
In one embodiment, the BVH decompressor 2626 decompresses the QBVH8 node 2700B as follows. The decompressed lower bound in dimension i can be computed as QBVH8.start_i + (byte-to-float)QBVH8.lower_i × QBVH8.extend_i, which requires five instructions per dimension and box: two loads (start, extend), a byte-to-int conversion, an int-to-float conversion, and one multiply-add. In one embodiment, all 8 quantized child bounding boxes 2701B-2708B are decompressed in parallel using SIMD instructions, which adds roughly 10 instructions of overhead to the ray-node intersection test, making it at least twice as expensive as in the standard uncompressed node case. In one embodiment, these instructions are executed on cores of the CPU 2599. Alternatively, a comparable set of instructions is executed by the ray tracing cores 2550.
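A scalar sketch of this per-dimension decompression for child i is shown below, using the QBVH8Node layout reproduced later in this section (its scale field plays the role of extend in the formula above):
static inline float qbvh8_lower_x(const struct QBVH8Node *n, int i)
{
    /* byte-to-float conversion followed by a single multiply-add */
    return n->start.x + (float)(unsigned char)n->lowerX[i] * n->scale.x;
}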
Without pointers, the QBVH8 node requires 72 bytes, while an uncompressed BVH8 node requires 192 bytes, resulting in a reduction factor of 2.66x. With 8 (64-bit) pointers included, the reduction factor shrinks to 1.88x, making it necessary to address the storage cost of handling the leaf pointers.
Leaf-level compression and layout
In one embodiment, when only the leaf level of the BVH8 nodes is compressed into QBVH8 nodes, all child pointers of the 8 children 2701-2708 will only point to leaf primitive data. In one implementation, this fact is exploited by storing all referenced primitive data directly behind the QBVH8 node 2700B itself, as shown in FIG. 27. This makes it possible to reduce the full 64-bit child pointers 2710 of the QBVH8 to mere 8-bit offsets 2722. In one embodiment, if the primitive data has a fixed size, the offsets 2722 are skipped entirely, as they can be computed directly from the index of the intersected bounding box and the pointer to the QBVH8 node 2700B itself.
BVH builder modification
When using a top-down BVH8 builder, compressing only the BVH8 leaf level requires only slight modifications to the build process. In one embodiment, these build modifications are implemented in the BVH builder 2607. During the recursive build phase, the BVH builder 2607 tracks whether the current number of primitives is below a certain threshold. In one implementation, N × M is the threshold, where N refers to the width of the BVH and M is the number of primitives within a BVH leaf. For a BVH8 node with, for example, four triangles per leaf, the threshold is 32. Hence, for all subtrees with fewer than 32 primitives, the BVH processing circuitry/logic 2604 enters a special code path in which it continues the surface area heuristic (SAH)-based splitting process but creates a single QBVH8 node 2700B. When the QBVH8 node 2700B is finally created, the BVH compressor 2625 collects all referenced primitive data and copies it directly behind the QBVH8 node.
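The decision itself reduces to a simple threshold test in the recursive builder; the following sketch is illustrative, with N, M, and the emit_* helpers assumed rather than taken from the text.
struct Prim;
void emit_compressed_leaf(struct Prim *prims, int prim_count);
void emit_uncompressed_node(struct Prim *prims, int prim_count);
enum { N_WIDTH = 8, M_PRIMS_PER_LEAF = 4 };     /* threshold = N * M = 32 */
void build_node(struct Prim *prims, int prim_count)
{
    if (prim_count <= N_WIDTH * M_PRIMS_PER_LEAF) {
        /* SAH splitting continues below this subtree, but the result is emitted
         * as a single QBVH8 node followed by its referenced primitive data. */
        emit_compressed_leaf(prims, prim_count);
    } else {
        emit_uncompressed_node(prims, prim_count);  /* regular recursive build */
    }
}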
Traversal
The actual BVH8 traversal performed by the ray tracing cores 2550 or the CPU 2599 is only slightly affected by the leaf-level compression. Essentially, the leaf-level QBVH8 node 2700B is treated as an extended leaf type (e.g., it is marked as a leaf). This means that the regular BVH8 top-down traversal continues until a QBVH8 node 2700B is reached. At that point, a single ray-QBVH node intersection is performed, and for all of its intersected children 2701B-2708B the respective leaf pointer is reconstructed and regular ray-primitive intersections are performed. Interestingly, ordering the intersected children 2701B-2708B of the QBVH by intersection distance may not provide any measurable benefit, since in the majority of cases a ray intersects only a single child anyway.
Leaf data compression
One embodiment of the leaf-level compression scheme even allows for lossless compression of the actual primitive leaf data by extracting common features. For example, triangles within a compressed leaf BVH (CLBVH) node are very likely to share vertices/vertex indices and properties, such as the same objectID. By storing these shared properties only once per CLBVH node and using small local byte-sized indices in the primitives, memory consumption is reduced further.
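A sketch of what such de-duplicated leaf data might look like follows; all names and field sizes are hypothetical.
#include <stdint.h>
struct CLBVHSharedData {
    uint32_t objectID;        /* shared by all triangles in the leaf */
    float    vertices[16][3]; /* de-duplicated vertex pool for the leaf */
};
struct CLBVHTriangle {
    uint8_t v0, v1, v2;       /* local byte-sized indices into the shared vertex pool */
};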
In one embodiment, the techniques for exploiting common, spatially coherent geometric features within a BVH leaf are used for other, more complex primitive types as well. Primitives such as hair segments are likely to share a common direction per BVH leaf. In one embodiment, the BVH compressor 2625 implements a compression scheme that takes this common direction property into account to efficiently compress oriented bounding boxes (OBBs), which have been shown to be very useful for bounding long diagonal primitive types.
The leaf-level compressed BVH described herein introduces BVH node quantization only at the lowest BVH level and therefore allows for additional memory reduction optimizations while preserving the traversal performance of an uncompressed BVH. Since only the BVH nodes at the lowest level are quantized, all of their children point to leaf data 2701B-2708B, which may be stored contiguously in a block of memory or in one or more cache lines 2698.
This idea can also be applied to hierarchies that use oriented bounding boxes (OBBs), which are typically used to accelerate the rendering of hair primitives. To illustrate one particular embodiment, the memory reductions are evaluated for the typical case of a standard 8-wide BVH over triangles.
The layout of an 8-wide BVH node 2700 is given by the following code sequence:
struct BVH8Node {
    float lowerX[8], upperX[8];   // 8 lower and upper bounds in the X dimension
    float lowerY[8], upperY[8];   // 8 lower and upper bounds in the Y dimension
    float lowerZ[8], upperZ[8];   // 8 lower and upper bounds in the Z dimension
    void *ptr[8];                 // 8 64-bit pointers to 8 child nodes or leaf data
};
This structure requires 256 bytes of memory. The standard 8-wide quantized node layout can be defined as:
struct QBVH8Node {
    Vec3f start, scale;
    char lowerX[8], upperX[8];    // 8 byte-quantized lower/upper bounds in the X dimension
    char lowerY[8], upperY[8];    // 8 byte-quantized lower/upper bounds in the Y dimension
    char lowerZ[8], upperZ[8];    // 8 byte-quantized lower/upper bounds in the Z dimension
    void *ptr[8];                 // 8 64-bit pointers to 8 child nodes or leaf data
};
This structure requires 136 bytes.
Because the quantized BVH nodes are used only at the leaf level, all of their child pointers actually point to leaf data 2701B-2708B. In one embodiment, the 8 child pointers are removed from the quantized BVH node 2700B by storing the quantized node 2700B and all of the leaf data 2701B-2708B its children point to in a single contiguous block of memory 2698. Eliminating the child pointers reduces the quantized node layout to:
struct QBVH8NodeLeaf {
    Vec3f start, scale;           // parent AABB's start position and extent vector
    char lowerX[8], upperX[8];    // 8 byte-quantized lower and upper bounds in the X dimension
    char lowerY[8], upperY[8];    // 8 byte-quantized lower and upper bounds in the Y dimension
    char lowerZ[8], upperZ[8];    // 8 byte-quantized lower and upper bounds in the Z dimension
};
This requires only 72 bytes. Due to the contiguous layout in memory/cache 2698, the child pointer of the i-th child can now be computed simply as: childPtr(i) = addr(QBVH8NodeLeaf) + sizeof(QBVH8NodeLeaf) + i × sizeof(LeafDataType).
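Expressed as a helper, the computation might look as follows, assuming fixed-size leaf records:
#include <stddef.h>
static inline void *child_ptr(struct QBVH8NodeLeaf *n, int i, size_t leaf_size)
{
    /* leaf records start immediately after the node and are laid out contiguously */
    return (char *)n + sizeof(struct QBVH8NodeLeaf) + (size_t)i * leaf_size;
}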
Since the lowest-level nodes of a BVH account for more than half of the BVH's overall size, the leaf-level-only compression described herein yields a reduction to 0.5 + 0.5 × 72/256 ≈ 0.64x the original size.
Additionally, the overhead of coarser bounds and the cost of decompressing the quantized BVH nodes themselves occur only at the BVH leaf level (compared to all levels when the entire BVH is quantized). Thus, the often quite significant traversal and intersection overhead due to coarser bounds (introduced by quantization) is largely avoided.
Another benefit of embodiments of the present invention is improved hardware and software pre-fetch efficiency. This is due to the fact that: all of the leaf data is stored in a relatively small contiguous block of cache line(s) or memory.
Because the geometry at the BVH leaf level is spatially coherent, it is highly likely that all primitives referenced by the QBVH8NodeLeaf node share common attributes/features, such as objectID, one or more vertices, etc. Accordingly, one embodiment of the present invention further reduces storage by removing primitive data duplicates. For example, primitives and associated data may be stored only once per QBVH8NodeLeaf node, further reducing memory consumption of leaf data.
Quantized oriented bounding boxes (OBBs) at the BVH leaf level
Effective bounding of hair primitives is described below as one example of the significant memory reductions achieved by exploiting common geometric properties at the BVH leaf level. To accurately bound a hair primitive, which is a long but thin structure oriented in space, a well-known approach is to compute an oriented bounding box that tightly encloses the geometry. First, a coordinate space aligned to the hair direction is calculated. For example, the z-axis may be chosen to point along the hair direction, with the x-axis and y-axis perpendicular to the z-axis. Using this oriented space, a standard AABB can now be used to tightly bound the hair primitive. Intersecting a ray with such an oriented bound requires first transforming the ray into the oriented space and then performing a standard ray/box intersection test.
The problem with this approach is its memory usage: the transformation into the oriented space requires 9 floating point values, while storing the bounding box requires an additional 6 floating point values, resulting in 60 bytes total.
In one embodiment of the invention, the BVH compressor 2625 compresses this oriented space together with the bounding boxes of multiple hair primitives that lie close together in space. These compressed bounds can then be stored inside the compressed leaf level to tightly bound the hair primitives stored in the leaf. In one embodiment, the following approach is used to compress the oriented bounds. The oriented space can be represented by three normalized vectors v_x, v_y, and v_z that are orthogonal to each other. Transforming a point p into this space works by projecting it onto these axes: p′ = (p · v_x, p · v_y, p · v_z).
Because the vectors v_x, v_y, and v_z are normalized, their components lie in the range [−1, 1]. Thus, instead of using 8-bit signed integers and a constant scale, the vectors are quantized using 8-bit signed fixed point numbers, producing the quantized vectors v_x′, v_y′, and v_z′. This approach reduces the memory required to encode the oriented space from 36 bytes (9 floating point values) to only 9 bytes (9 fixed point values, 1 byte each).
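A sketch of such a component quantization follows; the scale factor of 127 is an assumption, not a value from the text.
#include <stdint.h>
#include <math.h>
static inline int8_t quantize_axis(float c)    { return (int8_t)lroundf(c * 127.0f); }
static inline float  dequantize_axis(int8_t q) { return (float)q / 127.0f; }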
In one embodiment, memory consumption of the oriented space is reduced further by exploiting the fact that all the vectors are orthogonal to each other. Thus, only two of the vectors need to be stored (e.g., v_y′ and v_z′), and the third can be calculated as their cross product, v_x′ = v_y′ × v_z′, further reducing the required storage to only six bytes.
What remains is to quantize the AABB within the quantized oriented space. The problem here is that projecting a point p onto a compressed coordinate axis of that space (e.g., by computing the dot product p · v_x′) yields values of a potentially large range (because the point p is typically encoded as a floating point number). For that reason, the bounds would need to be encoded using floating point numbers, forfeiting the potential savings.
To address this problem, one embodiment of the invention first transforms the multiple hair primitives into a space in which their coordinates lie within the range [0, 1/√3]. This may be done by determining the world-space axis-aligned bounding box b of the multiple hair primitives and using a transformation T that first translates by b.lower and then scales each coordinate into the range [0, 1/√3]: T(p) = (1/√3) × (p − b.lower) / max(b.size.x, b.size.y, b.size.z).
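A sketch of T follows, assuming (consistently with the stated target range) that the 1/√3 factor is folded into the scale; the Vec3f definition mirrors the one used in the node layouts above.
#include <math.h>
typedef struct { float x, y, z; } Vec3f;
static inline Vec3f transform_T(Vec3f p, Vec3f b_lower, Vec3f b_size)
{
    const float m = fmaxf(b_size.x, fmaxf(b_size.y, b_size.z));
    const float s = 1.0f / (sqrtf(3.0f) * m);
    Vec3f r = { (p.x - b_lower.x) * s, (p.y - b_lower.y) * s, (p.z - b_lower.z) * s };
    return r;
}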
This embodiment ensures that the geometry remains within the range [0, 1/√3] after the transformation, because the projection of a transformed point onto the quantized vectors v_x′, v_y′, and v_z′ then stays within the range [−1, 1]. This means that the AABB of the curve geometry can be quantized when transformed using T and then transformed into the quantized oriented space. In one embodiment, 8-bit signed fixed point arithmetic is used. However, for accuracy reasons, 16-bit signed fixed point numbers may be used (e.g., encoded using a 16-bit signed integer and a constant scale). This reduces the memory requirements for encoding the axis-aligned bounding box from 24 bytes (6 floating point values) to only 12 bytes (6 words), plus the offset b.lower (3 floats) and the scale (1 float), which are shared across the multiple hair primitives.
For example, in the case of 8 hair primitives to be bounded, this embodiment reduces memory consumption from 8 × 60 bytes = 480 bytes to only 8 × (6 + 12) + 3 × 4 + 4 = 160 bytes, a reduction to one third. Intersecting a ray with these quantized oriented bounds works by first transforming the ray using the transformation T and then projecting the ray using the quantized v_x′, v_y′, and v_z′. Finally, the ray is intersected with the quantized AABB.
FIG. 29 shows the memory consumption (in MB) and overall rendering performance (in fps) for one embodiment of the invention (CLBVH) implemented on the Intel Embree architecture, compared with Embree's regular BVH8 (ref) and Embree's fully compressed QBVH8 variant, for two typical Embree BVH configurations: highest performance (SBVH + pre-gathered triangle data) and lowest memory consumption (BVH + triangle indices). In its two possible configurations ("fast" and "compact"), the embodiment either achieves the same memory savings as Embree's QBVH with a much lower performance impact ("fast"), or achieves even better compression with approximately the same performance impact ("compact").
FIG. 30 shows the memory consumption (in MB), traversal statistics, and overall performance for the two Embree BVH configurations: highest performance (SBVH + pre-gathered triangle data) and lowest memory consumption (BVH + triangle indices). One embodiment of the invention (CLBVH) achieves similar, and sometimes even greater, memory savings than a fully compressed BVH while reducing the runtime overhead to just a few percent.
One embodiment utilizes a modified version of the Embree 3.0 [11] CPU ray tracing framework. As a comparison framework, a publicly available path tracer for primary rays [1] was used. For benchmarking, the path tracer was set to purely diffuse path tracing (up to 8 bounces), with each CPU hardware thread tracing a single ray. In this benchmark, 15-20% of the time is spent on shading. The hardware platform was a dual-socket Xeon workstation with 2 × 28 cores and 96 GB of memory, and as benchmark scenes, four different models with complexity ranging from 10M to 350M triangles were tested (using many different camera positions). Performance and memory consumption were measured for two setups: "best performance" and "lowest memory consumption". These two modes require different BVH settings and primitive layouts: the first pre-gathers all triangles of each BVH leaf into a compact layout and uses a BVH with spatial splits (SBVH), while the second mode stores only the vertex indices of each triangle and uses a regular BVH without spatial splits.
For the best-performance setup, the table in FIG. 30 shows that the overhead of decompressing BVH nodes degrades rendering performance by 10-20%. In contrast, the CLBVH approach causes a slowdown of only 2-4% compared to a fully compressed BVH, while providing a similar, sometimes even slightly larger, reduction in BVH node size (43-45%). The size of the primitive data is unchanged. These embodiments provide a reduction in overall size (BVH + leaf primitive data) of 8-10%, similar to a fully compressed BVH.
Reducing the memory consumption of BVH nodes is even more effective in the lowest-memory setup, where the size of the primitive data (storing only vertex indices rather than full pre-gathered vertices) is smaller relative to the size of the BVH nodes. The reduction in overall memory consumption increases to 16-24% when using fully compressed BVH nodes or the CLBVH method. However, the CLBVH method has a runtime overhead of only 0-3.7%, whereas for fully compressed BVH nodes it ranges between 7% and 14%.
To achieve maximum memory reduction, a lossless leaf data compression scheme is additionally employed for the CLBVH method (see above). This CLBVH variant has a larger runtime overhead than plain CLBVH, but enables a 15-23% reduction in the size of the leaf data (vertex indices, objectIDs, etc. per triangle), increasing the overall size reduction to 26-37% compared to the uncompressed baseline.
References:
[1] Attila T. Áfra, Carsten Benthin, Ingo Wald, and Jacob Munkberg. 2016. Local Shading Coherence Extraction for SIMD-Efficient Path Tracing on CPUs. In Proceedings of High Performance Graphics (HPG ’16). Eurographics Association, 119–128.
[2] Holger Dammertz, Johannes Hanika, and Alexander Keller. 2008. Shallow Bounding Volume Hierarchies for Fast SIMD Ray Tracing of Incoherent Rays. In Computer Graphics Forum (Proc. 19th Eurographics Symposium on Rendering). 1225–1234.
[3] Manfred Ernst and Gunter Greiner. 2008. Multi Bounding Volume Hierarchies. In Proceedings of the 2008 IEEE/EG Symposium on Interactive Ray Tracing. 35–40.
[4] Vlastimil Havran. 2001. Heuristic Ray Shooting Algorithms. Ph.D. Dissertation. Faculty of Electrical Engineering, Czech TU in Prague.
[5] Sean Keely. 2014. Reduced Precision for Hardware Ray Tracing in GPUs. In Proceedings of the Conference on High Performance Graphics 2014.
[6] Christian Lauterbach, Sung-Eui Yoon, Ming Tang, and Dinesh Manocha. 2008. ReduceM: Interactive and Memory Efficient Ray Tracing of Large Models. Computer Graphics Forum 27, 4 (2008), 1313–1321.
[7] Jeffrey Mahovsky and Brian Wyvill. 2006. Memory-Conserving Bounding Volume Hierarchies with Coherent Raytracing. Computer Graphics Forum 25, 2 (June 2006).
[8] S.G. Parker, J. Bigler, A. Dietrich, H. Friedrich, J. Hoberock, D. Luebke, D. McAllister, M. McGuire, K. Morley, A. Robison, and others. 2010. OptiX: a general purpose ray tracing engine. ACM Transactions on Graphics (TOG) 29, 4 (2010).
[9] Benjamin Segovia and Manfred Ernst. 2010. Memory Efficient Ray Tracing with Hierarchical Mesh Quantization. In Graphics Interface 2010. 153–160.
[10] Ingo Wald, Carsten Benthin, and Solomon Boulos. 2008. Getting Rid of Packets: Efficient SIMD Single-Ray Traversal using Multi-branching BVHs. In Proc. of the IEEE/EG Symposium on Interactive Ray Tracing. 49–57.
[11] Ingo Wald, Sven Woop, Carsten Benthin, Gregory S. Johnson, and Manfred Ernst. 2014. Embree: A Kernel Framework for Efficient CPU Ray Tracing. ACM Transactions on Graphics 33, 4, Article 143 (2014), 8 pages.
[12] Henri Ylitie, Tero Karras, and Samuli Laine. 2017. Efficient Incoherent Ray Traversal on GPUs Through Compressed Wide BVHs. In Eurographics/ACM SIGGRAPH Symposium on High Performance Graphics. ACM.
Apparatus and method for motion blur using dynamic quantization grid
As mentioned, motion blur may be used to simulate the effect of objects moving in a scene while the camera shutter is open. Simulating this effect blurs the moving object along its direction of motion, which makes the animation appear smooth when played back. Rendering motion blur requires randomly sampling the time of each evaluated ray path; averaging over many such paths produces the desired blurring effect. To implement this technique, the underlying ray-tracing engine must be able to trace rays through the scene at any time within the camera shutter interval. This requires encoding the motion of geometric objects within the spatial acceleration structure used for ray tracing.
In practice, such a data structure is constructed by building a bounding volume hierarchy (BVH) over linear motion segments of the triangles, where the triangle vertices are linearly blended from a start time to an end time. Using many such motion segments makes it possible to encode complex motion by enclosing it with linear bounds during the camera shutter interval. These linear bounds store the bounding boxes at the start and end times of the motion, such that linearly interpolating between these bounds yields an appropriate bounding of the geometry at any particular time.
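As an illustration only, the following C++ sketch shows how such linear bounds can be interpolated; the types and the helper bounds_at are hypothetical and not part of the described embodiment:

struct float3 { float x, y, z; };
struct AABB { float3 lower, upper; };

static inline float lerpf(float a, float b, float t) { return (1.0f - t) * a + t * b; }
static inline float3 lerp3(float3 a, float3 b, float t) {
    return { lerpf(a.x, b.x, t), lerpf(a.y, b.y, t), lerpf(a.z, b.z, t) };
}

// Interpolating the stored start/end boxes bounds the linearly blended
// triangle vertices at any time in [0, 1].
AABB bounds_at(AABB bounds_start, AABB bounds_end, float time) {
    return { lerp3(bounds_start.lower, bounds_end.lower, time),
             lerp3(bounds_start.upper, bounds_end.upper, time) };
}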
For ray-tracing hardware implementations, it is important that each BVH node consume as little memory as possible to reduce node-fetch bandwidth. In one embodiment, local per-node quantization of the bounding boxes of all children of a wide BVH node is applied. In particular, the quantization grid of the wide BVH node encodes the bounding box of each child using grid coordinates with a small number of bits (e.g., 8 bits instead of 32 bits at full floating-point precision).
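A minimal sketch of such conservative per-node quantization for a single axis, assuming an 8-bit grid; the names (quantize_axis, QuantizedAxis) are illustrative, not prescribed by the embodiment:

#include <cmath>
#include <cstdint>

// Conservative 8-bit quantization of one axis of a child's bounds: the lower
// bound is rounded down and the upper bound up, so the quantized box never
// shrinks below the true box.
struct QuantizedAxis { uint8_t lo, hi; };

QuantizedAxis quantize_axis(float lower, float upper,
                            float grid_start, float grid_size) {
    float scale = 255.0f / grid_size;               // grid coordinates in [0, 255]
    float qlo = std::floor((lower - grid_start) * scale);
    float qhi = std::ceil((upper - grid_start) * scale);
    QuantizedAxis q;
    q.lo = (uint8_t)std::fmax(0.0f, std::fmin(255.0f, qlo));
    q.hi = (uint8_t)std::fmax(0.0f, std::fmin(255.0f, qhi));
    return q;
}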
One embodiment extends this method to the linear bounds used for motion blur by using the quantization scheme to store quantized bounds for the start and end times of each child. However, for fast motion of very detailed geometry, this naive extension is prone to performance problems: small triangles that move far relative to their size force the BVH node to store a fairly large and therefore coarse quantization grid that cannot tightly enclose the small triangle features.
One embodiment of the present invention solves this problem by using not a static quantization grid (as in the current implementation) but a dynamic quantization grid that moves according to the motion of the enclosed child nodes. This embodiment exploits the fact that adjacent geometry typically moves in a very similar manner; thus, during the motion, the children of a BVH node remain fairly close together and often move in the same direction.
In one implementation, this property is exploited by determining a quantization grid with a fixed extent that moves linearly along the common motion of the children of the BVH node. The linear bounds of each child can now be mapped into this moving quantization grid, because the residual motion obtained by subtracting the linear grid motion from the linear child motion is itself a linear motion, whose linear quantized bounds can be derived directly.
The advantage of this technique is that the extent of the moving interpolated grid need only be large enough to cover all geometry at the start time when placed at the start grid position, and at the end time when placed at the end grid position. Thus, its size depends on the approximate size of the geometry contained within the BVH node at the start and end times, and does not depend on the volume spanned by the entire animation path. The quantization grid is therefore much smaller, reducing storage requirements.
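Under these assumptions, the moving grid could be derived from the union of the children's start-time and end-time bounds roughly as follows; this is a sketch with hypothetical names (derive_moving_grid), not the embodiment's prescribed procedure:

#include <algorithm>

struct float3 { float x, y, z; };
struct AABB { float3 lower, upper; };

// Sketch: place the grid origin at the lower corner of the start-time union
// and of the end-time union, and pick one fixed extent per axis that covers
// both endpoints (in practice this extent may be rounded up, e.g. to a
// power of two, as discussed later in the text).
void derive_moving_grid(AABB union_start, AABB union_end,
                        float3& grid_start, float3& grid_end, float3& grid_size) {
    grid_start = union_start.lower;  // grid position at the start time
    grid_end   = union_end.lower;    // grid position at the end time
    grid_size.x = std::max(union_start.upper.x - union_start.lower.x,
                           union_end.upper.x - union_end.lower.x);
    grid_size.y = std::max(union_start.upper.y - union_start.lower.y,
                           union_end.upper.y - union_end.lower.y);
    grid_size.z = std::max(union_start.upper.z - union_start.lower.z,
                           union_end.upper.z - union_end.lower.z);
}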
FIG. 31 shows an implementation of the naive extension of quantized bounding boxes to the motion-blurred triangles 3101-3103. Assume that the BVH node has the three illustrated triangles 3101-3103 as children, which move from left to right as shown. The quantization grid 3100 for this BVH node must span the entire motion; it is therefore large and can only coarsely enclose the triangles at the start and end times.
FIG. 32 illustrates a variation employed in one embodiment of the present invention that uses a much smaller quantization grid that encloses the same triangles 3201-3203 significantly more tightly. In particular, the start-time quantization grid 3200A is translated into the end-time quantization grid 3200B based on the detected motion of the triangles 3201-3203. The quantization grid 3200A-B moves linearly along the collective motion of the children of the BVH node. The linear bounds of each child can now be mapped into this moving quantization grid 3200A-B, because the residual motion obtained by subtracting the linear grid motion from the linear child motion is itself a linear motion, whose linear quantized bounds can be derived directly.
FIG. 33 illustrates one embodiment of an architecture for implementing the motion blur techniques described herein. In operation, BVH processor 3304 constructs BVH 3300 based on the current set of input primitives 3309 of the graphics scene. Ray generator 3301 generates rays, which traversal circuitry 3305 traces through BVH 3307. Intersection circuitry 3310 identifies ray-primitive intersections to generate hits 3315 for further processing (e.g., generating secondary rays based on material specifications, etc.). One or more shaders may perform specified shading operations to render the image frame.
In one embodiment, motion blur processing logic 3312 implements the motion blur techniques described herein based on the grid data 3318 and the motion of the graphics primitives detected within the BVH nodes. In one embodiment, the quantization grid motion evaluator 3314 determines the motion of the quantization grid over a specified time period, which the motion blur processing logic 3312 uses to perform its motion blur operations. The motion blur processing logic 3312 may be implemented as program code (e.g., an executable shader), circuitry, or a combination of circuitry and program code. The underlying principles of the invention are not limited to any particular implementation of the motion blur processing logic 3312.
One embodiment of a method for motion blur processing is shown in FIG. 34. The method may be implemented within the context of the architecture described above, but is not limited to any particular architecture.
At 3400, a bounding volume hierarchy (BVH) including hierarchically arranged BVH nodes is generated based on the input primitives. At 3401, a quantization grid is generated that encloses a set of BVH nodes, wherein each BVH node includes one or more child nodes. At 3402, the motion of the quantization grid is determined based on the detected motion of the child nodes of a particular BVH node. At 3404, the linear bounds of each child node are mapped to the moving quantization grid. In one embodiment, to perform the mapping, one or more residual motion values are obtained by subtracting the linear quantization-grid motion from the linear child-node motion; a linear quantization boundary is then derived from the residual motion values.
If it is determined at 3404 that child nodes of another BVH node need to be processed, the process returns to 3401, where a new quantization grid is computed for the current BVH node. If not, the process ends.
Additional details of one embodiment of the present invention will now be provided. It should be noted, however, that the underlying principles of the invention are not limited to these specific details.
In one embodiment, the interpolated grid data 3318 includes a start position (grid_start), an end position (grid_end), and a grid size (grid_size) that is the same for all time values (i.e., as the grid moves based on the motion of primitives in the scene). In one embodiment, all of these grid attributes are stored as 3D vectors.
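A possible in-memory layout for this grid data is sketched below; the struct and field names are illustrative only and simply mirror the formulas that follow:

struct float3 { float x, y, z; };

// Illustrative layout of the interpolated grid data 3318: all three
// attributes are 3D vectors; grid_size stays fixed while the grid origin
// moves from grid_start to grid_end.
struct MotionGrid {
    float3 grid_start;  // grid origin at shutter time 0
    float3 grid_end;    // grid origin at shutter time 1
    float3 grid_size;   // fixed per-axis extent, valid for all times
};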
The quantization grid motion evaluator 3314 represents the grid motion as:
grid_base(time) = lerp(grid_start, grid_end, time)
               = (1.0 - time) * grid_start + time * grid_end
This is a linear blend for the special case of shutter times 0 and 1. The linear motion of a bounding box (where bounds_start refers to the bounding box at the start time and bounds_end to the bounding box at the end time) can be expressed as:
bounds(time) = lerp(bounds_start, bounds_end, time)
            = (1.0 - time) * bounds_start + time * bounds_end
This is also a linear motion. In one embodiment, the quantization grid motion evaluator 3314 translates the linear bounds of the triangle motion (over time) into grid coordinate space to obtain the residual motion residual_bounds(time) relative to the moving grid:
residual_bounds(time)
  = (bounds(time) - grid_base(time)) / grid_size
  = (lerp(bounds_start, bounds_end, time) - lerp(grid_start, grid_end, time)) / grid_size
  = lerp(bounds_start - grid_start, bounds_end - grid_end, time) / grid_size
  = lerp((bounds_start - grid_start) / grid_size, (bounds_end - grid_end) / grid_size, time)
  = lerp(residual_bounds_start, residual_bounds_end, time)

residual_bounds_start = (bounds_start - grid_start) / grid_size
residual_bounds_end = (bounds_end - grid_end) / grid_size
Therefore, the grid-relative bounds of a triangle at the start time are residual_bounds_start = (bounds_start - grid_start) / grid_size, and the grid-relative bounds at the end time are residual_bounds_end = (bounds_end - grid_end) / grid_size. The residual motion relative to the moving grid is simply a linear blend of these residual_bounds_start and residual_bounds_end positions. Thus, linear bounds expressed relative to the moving grid also move linearly within the grid itself.
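Written as code, the grid-relative bounds at the two shutter endpoints might be computed as follows; this is a sketch directly transcribing the formulas above, with illustrative helper names:

struct float3 { float x, y, z; };
struct AABB { float3 lower, upper; };

static inline float3 sub3(float3 a, float3 b) { return { a.x - b.x, a.y - b.y, a.z - b.z }; }
static inline float3 div3(float3 a, float3 b) { return { a.x / b.x, a.y / b.y, a.z / b.z }; }

// Grid-relative (residual) bounds of a child at the start and end times;
// lerping between the two gives the residual bounds at any time.
AABB residual_bounds_start(AABB b, float3 grid_start, float3 grid_size) {
    return { div3(sub3(b.lower, grid_start), grid_size),
             div3(sub3(b.upper, grid_start), grid_size) };
}
AABB residual_bounds_end(AABB b, float3 grid_end, float3 grid_size) {
    return { div3(sub3(b.lower, grid_end), grid_size),
             div3(sub3(b.upper, grid_end), grid_size) };
}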
Note that this embodiment obtains a residual linear motion only because grid_size is not linearly blended; there is a single fixed grid_size. If grid_size also varied linearly, the quantization grid motion evaluator 3314 would have to evaluate the product of two lerp operations, which does not decompose into a sum of lerps.
The residual bounds residual_bounds_start and residual_bounds_end can easily be quantized conservatively using the quantization grid at its start and end positions, yielding quantized residual bounds quantized_residual_bounds_start and quantized_residual_bounds_end with the corresponding linear interpolation:
quantized_residual_bounds(time) =
    lerp(quantized_residual_bounds_start, quantized_residual_bounds_end, time)
To obtain world-space dequantized bounds from these, the quantization grid motion evaluator 3314 blends the quantized bounds, scales them by the grid_size factor, and then adds the blended grid position:
dequantized_bounds(time) = quantized_residual_bounds(time) * grid_size + grid_base(time)
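For a single axis, the dequantization might be sketched as follows, assuming purely for illustration that the 8-bit grid coordinates are normalized by 255:

// Sketch, one axis: lerp the quantized residual bound and the grid origin,
// then scale by grid_size and translate by grid_base(time).
float dequantized_lower(float q_start, float q_end,   // quantized residual bound at t=0 and t=1
                        float grid_start, float grid_end,
                        float grid_size, float time) {
    float q    = (1.0f - time) * q_start + time * q_end;        // quantized_residual_bounds(time)
    float base = (1.0f - time) * grid_start + time * grid_end;  // grid_base(time)
    return (q * (1.0f / 255.0f)) * grid_size + base;            // world-space bound
}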
To intersect a ray, given by the linear equation org + t * dir, with these bounds, the distances to the bounding planes are determined:
t_lower = (dequantized_bounds(time).lower - org) * rcp(dir)
t_upper = (dequantized_bounds(time).upper - org) * rcp(dir)
This yields the distances to the three lower and three upper bounding planes, which the intersection circuitry 3310 then uses in a ray/box test to determine whether the bounds are hit.
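These six distances feed a standard ray/box (slab) test; a minimal sketch of such a test:

#include <algorithm>

// Minimal slab test: intersect the three per-axis distance intervals with
// the ray's [tmin, tmax] range; the box is hit if the result is non-empty.
bool hit_box(const float t_lower[3], const float t_upper[3],
             float ray_tmin, float ray_tmax) {
    float tnear = ray_tmin, tfar = ray_tmax;
    for (int axis = 0; axis < 3; ++axis) {
        float t0 = std::min(t_lower[axis], t_upper[axis]);  // handles negative dir
        float t1 = std::max(t_lower[axis], t_upper[axis]);
        tnear = std::max(tnear, t0);
        tfar  = std::min(tfar, t1);
    }
    return tnear <= tfar;
}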
In one embodiment, the processing required for the above distance calculations is reduced using the techniques described above (i.e., for decompression and traversal of bounding volume hierarchies). These techniques reduce complexity by sharing a higher-precision distance calculation among all children of a node and adding per-child corrections determined from the reduced-precision quantized bounds:
t_lower = (dequantized_bounds(time).lower - org) * rcp(dir)
        = (quantized_residual_bounds(time) * grid_size + grid_base(time) - org) * rcp(dir)
        = (grid_base(time) - org) * rcp(dir) + quantized_residual_bounds(time) * grid_size * rcp(dir)
The first term, (grid_base(time) - org) * rcp(dir), is computed only once for all children, since it depends only on the quantization grid. The second term, quantized_residual_bounds(time) * grid_size * rcp(dir), is computed for each child. However, when the interpolation of the quantized bounds yields a low-precision output and grid_size is chosen to be a power of two, this term is simply the product of a number with few bits and the floating-point number rcp(dir), which is also cheap to implement in hardware.
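In code, this split might look as follows for one axis (a sketch; the point is that the shared term is cached across all children of the node):

// Sketch, one axis: shared_term depends only on the grid and the ray, so it
// is computed once per node; child_term is recomputed per child but is cheap
// when grid_size is a power of two and the quantized bounds have few bits.
float shared_term(float grid_start, float grid_end,
                  float org, float rcp_dir, float time) {
    float grid_base = (1.0f - time) * grid_start + time * grid_end;
    return (grid_base - org) * rcp_dir;   // (grid_base(time) - org) * rcp(dir)
}
float child_term(float quantized_residual_bound, float grid_size, float rcp_dir) {
    return quantized_residual_bound * grid_size * rcp_dir;
}
// Per-child plane distance: t = cached shared_term + child_term(...)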
In one embodiment, the complexity of computing the first term is reduced even further as follows:
Term1 = (grid_base(time) - org) * rcp(dir)
      = (lerp(grid_start, grid_end, time) - org) * rcp(dir)
      = (grid_start + time * (grid_end - grid_start) - org) * rcp(dir)
      = (grid_start + time * grid_end_start - org) * rcp(dir)
where grid_end_start = grid_end - grid_start is the vector from grid_start to grid_end. When no motion blur is performed, the formula looks almost the same but lacks the time * grid_end_start term. The extra complexity of computing this term can be reduced by assessing how much precision is required for grid_end_start and for time. In one embodiment, storing only 8 mantissa bits of the grid_end_start term and using only 16 mantissa bits of time is sufficient, which significantly reduces the hardware complexity of the operation. The reduction of the bits of grid_end_start must be performed in such a way that the grid_end_start vector becomes longer (so that the moving grid still contains all the geometry). The reduced temporal precision introduces some uncertainty in the grid position, which must be corrected by appropriately extending the residual motion bounds (the bounds can simply be extended by the maximum grid misalignment introduced by this temporal quantization).
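One way to round grid_end_start conservatively, so that its magnitude only grows, is to truncate the mantissa while rounding away from zero. The following sketch assumes IEEE-754 single precision and is illustrative only:

#include <cstdint>
#include <cstring>

// Sketch: keep only 8 of the 23 mantissa bits, rounding the magnitude up
// (away from zero) so the moving grid still contains all the geometry.
// Adding to the bit pattern carries into the exponent when the mantissa
// overflows, which still increases the magnitude as required.
float round_to_8_mantissa_bits_up(float v) {
    if (v == 0.0f) return 0.0f;
    uint32_t bits;
    std::memcpy(&bits, &v, sizeof(bits));
    const uint32_t drop = (1u << 15) - 1u;  // low 15 of the 23 mantissa bits
    if (bits & drop)
        bits = (bits & ~drop) + (1u << 15); // round magnitude away from zero
    else
        bits &= ~drop;
    float out;
    std::memcpy(&out, &bits, sizeof(out));
    return out;
}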
For scenes with many small triangles and large motion, a statistical evaluation of the embodiments described above shows that the number of intersection steps per ray is reduced by more than an order of magnitude compared to the naive quantization method.
In embodiments, the term "engine" or "module" or "logic" may refer to, be part of, or include the following: an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory (shared, dedicated, or group) that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that provide the described functionality. In an embodiment, an engine, module, or logic may be implemented in firmware, hardware, software, or any combination of firmware, hardware, and software.
Examples of the invention
The following are example implementations of different embodiments of the present invention.
Example 1. A method, comprising: generating a bounding volume hierarchy (BVH) comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes; determining a motion value of a quantization grid based on motion values of the one or more child nodes of the at least one BVH node; and mapping the linear boundaries of each of the child nodes to the quantization grid.
Example 2. The method of example 1, wherein mapping the linear boundaries of each of the child nodes further comprises: obtaining one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and deriving quantization boundaries for the one or more child nodes from the one or more residual motion values.
Example 3. The method of example 2, wherein the one or more child nodes comprise primitives.
Example 4. The method of example 3, wherein the primitives are in motion.
Example 5. The method of example 4, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
Example 6. The method of example 3, wherein the primitives comprise triangles.
Example 7. The method of example 2, further comprising: performing ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of rays.
Example 8. The method of example 7, further comprising: executing one or more shaders to perform a graphics operation with respect to the one or more intersection points.
Example 9. A machine-readable medium having program code stored thereon, which when executed by a machine, causes the machine to perform operations comprising: generating a bounding volume hierarchy (BVH) comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes; determining a motion value of a quantization grid based on motion values of the one or more child nodes of the at least one BVH node; and mapping the linear boundaries of each of the child nodes to the quantization grid.
Example 10. The machine-readable medium of example 9, wherein mapping the linear boundaries of each of the child nodes further comprises: obtaining one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and deriving quantization boundaries for the one or more child nodes from the one or more residual motion values.
Example 11. The machine-readable medium of example 10, wherein the one or more child nodes comprise primitives.
Example 12. The machine-readable medium of example 11, wherein the primitives are in motion.
Example 13. The machine-readable medium of example 12, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
Example 14. The machine-readable medium of example 11, wherein the primitives comprise triangles.
Example 15. The machine-readable medium of example 10, further comprising program code to cause the machine to: perform ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of rays.
Example 16. The machine-readable medium of example 15, further comprising program code to cause the machine to: execute one or more shaders to perform a graphics operation with respect to the one or more intersection points.
Example 17. A graphics processor, comprising: a Bounding Volume Hierarchy (BVH) generator to construct a BVH comprising hierarchically arranged BVH nodes based on input primitives, at least one BVH node comprising one or more child nodes; and motion blur processing hardware logic to determine motion values of a quantization grid based on the motion values of the one or more child nodes of the at least one BVH node and to map linear boundaries of each of the child nodes to the quantization grid.
Example 18. The graphics processor of example 17, wherein to map the linear boundaries of each of the child nodes, the motion blur processing hardware logic is to: obtain one or more residual motion values by subtracting motion values of the quantization grid from motion values associated with the one or more child nodes; and derive quantization boundaries for the one or more child nodes from the one or more residual motion values.
Example 19. The graphics processor of example 18, wherein the one or more child nodes comprise primitives.
Example 20. The graphics processor of example 19, wherein the primitives are in motion.
Example 21. The graphics processor of example 20, wherein the motion values associated with the one or more child nodes are determined based on motion of the primitives.
Example 22. The graphics processor of example 19, wherein the primitives comprise triangles.
Example 23. The graphics processor of example 18, further comprising: ray traversal and intersection hardware logic to perform ray traversal and/or intersection operations according to the quantization boundaries of the one or more child nodes to determine one or more intersection points of a ray.
Example 24. The graphics processor of example 23, further comprising: a plurality of execution circuits to execute one or more shaders to perform graphics operations with respect to the one or more intersection points.
Embodiments of the invention may include various steps that have been described above. The steps may be embodied in machine-executable instructions, which may be used to cause a general-purpose or special-purpose processor to perform the steps. Alternatively, these steps may be performed by specific hardware components that contain hardwired logic for performing the steps, or by any combination of programmed computer components and custom hardware components.
As described herein, an instruction may refer to: a particular configuration of hardware, such as an Application Specific Integrated Circuit (ASIC) configured to perform certain operations or having predetermined functionality; or software instructions stored in a memory embodied in a non-transitory computer-readable medium. Thus, the techniques illustrated in the figures can be implemented using code and data stored and executed on one or more electronic devices (e.g., an end station, a network element, etc.). Such electronic devices use computer-machine-readable media, such as non-transitory computer-machine-readable storage media (e.g., magnetic disks; optical disks; random access memories; read-only memories; flash memory devices; phase-change memories) and transitory computer-machine-readable communication media (e.g., electrical, optical, acoustical or other forms of propagated signals, such as carrier waves, infrared signals, digital signals, etc.) to store and communicate code and data (internally and/or with other electronic devices over a network).
Additionally, such electronic devices typically include a set of one or more processors coupled to one or more other components, such as one or more storage devices (non-transitory machine-readable storage media), user input/output devices (e.g., a keyboard, a touchscreen, and/or a display), and network connections. The coupling of the set of processors and other components is typically through one or more buses and bridges (also known as bus controllers). The storage devices and the signals carrying the network traffic represent one or more machine-readable storage media and machine-readable communication media, respectively. Thus, the storage of a given electronic device typically stores code and/or data for execution on a set of one or more processors of the electronic device. Of course, one or more portions of embodiments of the invention may be implemented using different combinations of software, firmware, and/or hardware. Throughout this detailed description, for purposes of explanation, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced without some of these specific details. In some instances, well-known structures and functions have not been described in detail so as not to obscure the subject matter of the present invention. Therefore, the scope and spirit of the present invention should be judged in terms of the claims which follow.