SIGGRAPH 2008: Beyond Programmable Shading Class

This class was about non-traditional processing performed on GPUs, similar to GPGPU but for graphics. As we discuss in the “Futures” chapter at the end of our book, this is a particularly interesting direction of research and may well represent the future of rendering. The recent disclosures on Direct3D 11 Compute Shaders and Larrabee make this a particularly hot topic.

The full course notes are available at the course web site.

The talk by Jon Olick from id software was perhaps the most interesting. He discussed a sparse voxel octree data structure which is rendered directly using CUDA. This extends id’s megatexture idea to geometry and may very well find its way into id’s next engine in some form.

SIGGRAPH 2008: The Authors Meet

All the work on the book was done remotely, via email and CVS. In fact, I had never met Tomas until this morning at SIGGRAPH. Here you can see all three of us, pleased as punch that the book is finally done. Left to right, Eric, Naty, Tomas:

(Eric here. I guess this is a tradition: Tomas and I didn’t meet until after the first edition was published.)

SIGGRAPH 2008: Advances in Real-Time Rendering in 3D Graphics and Games

I attended the “Advances in Real-Time Rendering in 3D Graphics and Games” class today at SIGGRAPH. This is the third year in a row Natasha Tatarchuk from AMD has organized this class. Each year different game developers as well as people from the AMD demo team are brought in to talk about graphics, and some of the best real-time stuff at SIGGRAPH in the last two years has been in this course.

Unfortunately, due to Little Big Planet crunch, Alex Evans from Media Molecule was unable to give his planned talk and a different speaker was brought in instead. This was a bit of a bummer since Alex’s SIGGRAPH 2006 talk was very good and I was hoping to hear more about his unorthodox take on real-time rendering.

The remaining talks were of high quality, including talks by the developers of games such as Halo 3, Starcraft 2 and Crysis. Unlike previous years, where it took many weeks for the course notes to be available online, the full course notes are already available at AMD’s Technical Publications page – check them out!

Direct 3D Details Part V: Other Features

This grab-bag of a post summarizes the various other features of Direct3D 11 which Microsoft described at Gamefest.

Dynamic shader linkage is supported (similar to the interfaces feature of Cg). This allows for separate light and material shaders to be written and compiled. These are later linked when the shader is set. This offers a solution to the combinatorial explosion resulting from a variety of lights and materials (this explosion, and some other solutions to it, are discussed in section 7.9 of our book).

Two new compressed texture formats have been added. BC6 supports high dynamic range RGB textures, using 1 byte per texel (instead of 6 bytes for an RGB 16-bit float texture). BC7 supports low dynamic range RGB or RGBA textures. It also uses one byte per texel (like DXT5/BC3), but offers significantly higher quality than texture formats available in D3D10. Both formats offer multiple block types (the compression tool selects the appropriate block type based on its content).

The block compression formats in D3D9 and D3D10 are based on the idea that each 4×4 texel block has all its values arranged along a single line, and the bits for each texel encode where on the line it is placed. For example, in DXT1/BC1, a line in RGB space is represented by two RGB endpoints, and each texel gets two bits to select one of four points along the line.

The new D3D11 formats support block types with one, two or even three (in the case of BC7) color lines. There is a tradeoff between the number of lines and the number of points along each line, since each block takes up the same amount of memory.

In principle, a 4×4 block with two color lines would need 16 additional bits per block to determine which line each texel was associated with (even more bits are needed for three color lines). To reduce storage requirements, only a subset of possible line association patterns are supported. The compression tool selects the best association out of this subset for each block.

Direct3D11 also tightens up the texture specifications. Decompression results must be bit-accurate, and subtexel/submip filtering precision is required to be at least 8 bits.

Direct3D11 increases the texture size limits from 8K texels to 16K texels. Note that a 16K x 16K DXT1/BC1 texture takes up 128MB – not many games will have textures this large! In general, D3D11 allows for resources as large as 2GB.

Hardware can optionally support double-precision floats. This was the only optional feature of D3D11 mentioned at Gamefest.

There was a slide listing a bunch of other features without further explanation. Most are a bit mysterious, but I list them here in case someone else is able to puzzle out what they mean:

  • Addressable Stream Out
  • Draw Indirect
  • Pull-model attribute eval
  • Improved Gather4
  • Min-LOD texture clamps
  • Conservative oDepth
  • Geometry shader instance programming model
  • Read-only depth or stencil views

This completes my report on Direct3D11 from Gamefest. Check the XNA Presentations Page for the slides and audio – they are not up there yet, but hopefully will be there soon.

Direct3D 11 Details Part IV: Multithreaded Rendering

Direct3D 10 only allows graphics commands to be issued from a single thread (there is a multithreaded mode, but Microsoft explicitly warns against using it due to its poor performance). In an API such as Direct3D, issuing graphics commands involves a fair amount of CPU overhead. Given the trend towards increasing the number of cores on a processor rather than the performance of a single core, it is desirable to efficiently spread this work among multiple threads.

Direct3D 11 adds the ability to create display lists from multiple threads and execute them from the main rendering thread. In addition, the Device (which creates resources) has been separated from the Context (which issues graphics commands). This enables creating resources asynchronously. Deferred Contexts are used to create display lists and the Immediate Context issues graphics commands to the GPU, including the execution of display lists created on Deferred Contexts.

Unlike the other features in Direct3D 11, multithreaded rendering is not a hardware feature at all. With the appropriate drivers, D3D10 (perhaps even D3D9) hardware will be able to perform multithreaded rendering efficiently (some level of multithreaded performance will be available even without new drivers, but it was unclear what the limitations would be in this case).

Direct 3D 11 Details Part III: Compute Shaders & Unordered Memory

GPGPU (General-Purpose computation on GPU) approaches such as NVIDIA’s CUDA have become increasingly popular the last few years, recently coming full-circle with various non-traditional rendering algorithms (perhaps this should be called GPGPUG?). However, the existing solutions are vendor-specific, often requiring reprogramming even for different GPUs from the same vendor. They also tend not to “play well” with the traditional graphics pipeline. For example, on GeForce 8000-series GPUs using CUDA there is a large delay when switching between CUDA and traditional graphics rendering.

Direct3D 11 introduces a new kind of shader called a Compute Shader. A compute shader is invoked as a regular array of threads. The threads are divided into groups. Each group has 32KB of memory shared among the threads in the group. Thus the threads can use partial results computed by other threads in the same group, improving performance. Threads can also perform random-access reads and writes to graphics resources such as textures, vertex arrays or render targets. These memory accesses are unordered, although various synchronization instructions exist to impose ordering when needed.

Pixel shaders can also perform random-access (unordered) writes. This allows them to write data structures such as linked lists that can then be processed by a compute shader, or vice-versa (pixel shaders have always had the ability to perform random access reads via texture lookups).

Several examples of compute shaders were shown at Gamefest, performing post-process operations such as finding the average luminance of a render target, or computing a luminance histogram (both used in tone mapping). For these operations, a 2X speedup was quoted over the best performance possible using pixel shaders.

Compute shaders can also perform operations such as computing summed-area tables and fast-Fourier transforms significantly faster than traditional GPU methods. Microsoft is looking into providing library functions to perform such operations.

Microsoft speculated that algorithms such as A-buffer rendering and ray tracing could also be performed efficiently, but they don’t have any hard performance numbers for those.

Larrabee

Solid information about Intel’s new Larrabee architecture came out a few days ago, the Level of Detail blog has a good set of links. The major news is that Intel’s SIGGRAPH paper is now available for download from ACM’s Digital Library. Unfortunately, not everyone has access to this site’s resources (it costs money to subscribe). My contribution to the cause:

http://softwarecommunity.intel.com/UserFiles/en-us/File/larrabee_manycore.pdf

Thanks to Tom Forsyth for the link.

I’m excited by Larrabee not because of any particular technical feature (though I’m entirely savoring the paper itself, reading two pages a day at lunch), but rather by the fact that it opens up a whole new ecosystem for implementing graphics algorithms. Regardless of whether Larrabee wins or loses in the long-run, it will have a huge effect in increasing our knowledge by helping us explore different hardware and software designs for rendering.

Direct3D 11 Details Part II: Tessellation

Direct3D 11 adds three new pipeline stages, with the goal of enabling efficient tessellation of higher order surfaces. This is the Direct3D 10 pipeline, as shown in “Real-Time Rendering, 3rd Edition”:

Direct3D 10 Pipeline

The color of each stage indicates whether it is fully programmable (green), configurable (yellow) or fixed function (blue). The stages are described more fully in the “Graphics Processing Unit” chapter of the book. Note that the “Geometry Shader” stage is new to Direct3D 10, but the other stages have been in the pipeline for quite a while.

The Direct3D 11 pipeline adds three new stages between the vertex and geometry shader stages (framed in red). Two of the new stages are programmable (the hull and domain shader stages) and one is configurable (the tessellator stage):

Direct3D 11 Pipeline

This pipeline operates on meshes represented as a series of surface patches. Triangle and quad surface patches are primitives in Direct3D 11 (there is also a tessellated line primitive). The shape of each patch is defined by a number of control points. These control points are transformed, skinned and / or morphed one by one in the vertex shader.

The hull shader is called for each patch, using the patch control points from the vertex shader as inputs. The hull shader has two main responsibilities. The first is to (optionally) convert the control points from one representation (basis) to another. for example, it can implement the technique introduced in Loop and Schaefer‘s paper “Approximating Catmull-Clark Subdivision Surfaces with Bicubic Patches“. The control points are sent directly to the domain shader, bypassing the tessellator. The hull shader’s second responsibility is to compute appropriate tessellation factors, which are passed to the tessellation stage. This allows for adaptive tessellation, which can be used for continuous view-dependent LOD (level of detail). The tessellation factors are specified per patch edge, and range from 2 to 64. This means that each edge of the patch may be split into at least 2 (and as many as 64) triangle (or quad) edges.

The tessellator is a fixed-function (but highly configurable) stage, which uses the tessellation factors to tessellate (subdivide) the patch into multiple triangle or quad primitives. The tessellator does not have access to the control points – all tessellation decisions are made based on configuration and the tessellation factors passed on from the hull shader. Each vertex resulting from the tessellation is output to the domain shader. Only the patch parametrization coordinates are passed on for each vertex.

The domain shader operates on the patch parametrization coordinates of each vertex separately, although it can also access the transformed control points for the entire patch. The domain shader sends the complete data for the vertex (position, texture coordinates, etc.) to the geometry shader (or the clipping stage if no geometry shader is present). Effectively, it evaluates the surface representation at each vertex. Techniques such as displacement mapping can also be applied by this shader stage.

Although Microsoft gave an example using Catmull-Clark subdivision surfaces, the programmability of the pipeline enables other surface representations to be used. Alternatively, the tessellation stages can be turned off and traditional triangle or quad meshes can be used.

Direct3D 11 Details Part I: Intro

I attended Gamefest 2008 last week. Gamefest (formerly called Meltdown) is a Microsoft-run Windows and Xbox 360 game development conference. This year there were two notable announcements: XNA Community games (discussed in a previous blog post) and the first public disclosure of Direct3D 11.

Direct3D is, of course, the API used by most Windows games, but its importance extends beyond Windows. Direct3D features guide the development of graphics hardware in general, so these features are bound to show up in future consoles, as well as in OpenGL.

The announcement that Direct3D 11 would not be tied to the next version of Windows (as many had feared), and would be available on Windows Vista was very significant to Windows developers, many of whom complained about the tying of Direct3D 10 to Windows Vista. Direct3D 11 will also be available on Direct3D 9, 10, and 10.1 level graphics hardware (although the new features will not be available there, with the exception of some multithreading enhancements).

The fact that the Direct3D 11 API is a strict superset of the 10/10.1 API is also cause for relief among game developers. From Direct3D 9 to 10, the API went through extensive changes. These changes were mostly long-overdue cleanups and improvements, but they left developers supporting two very different APIs if they wanted to support the many customers using Windows XP and also expose the new Direct3D 10 hardware features.

This is the first part of a multi-part post which will summarize the essential facts about Direct3D 11, as known from the Gamefest slides. Eventually, the slides should show up on the XNA Presentations page.

Full disclosure of Direct3D 11 should occur later this year – the November 2008 DirectX SDK release will feature a preview version of the API, including full documentation and code samples.