There are two speed tricks with Gaussian filtering using the pixel shader. The first is that the Gaussian filter (along with the box filter) is separable: you can filter horizontally, then vertically (or vice versa, of course). So for a 9×9 filter kernel you then have 18 texture samples in 2 passes instead of 81 samples in a single pass. The second trick is that each of the samples you use can actually be in-between two texels, e.g. if you need to sample texels 1 through 9, you could sample just once in between 1 and 2 and use the GPU to linearly interpolate between the two, then once between 3 and 4, etc., for a total of 5 samples (four pairs plus the leftover ninth texel). Where exactly each sample lands between its two texels is set by the ratio of their two weights. So instead of 18 samples you could get by with 10 samples. This is old news, dating back to at least ShaderX2 and GPU Gems, and we talk about it in our 3rd edition, from around page 469 on.
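To make the separability claim concrete, here is a minimal CPU-side sketch in Python (not shader code, and not from any of the articles mentioned). It assumes clamp-to-edge addressing and an odd-length 1D kernel; the function names are made up for illustration. It filters a small image once with the full 2D kernel and once as a horizontal pass followed by a vertical pass, so the two results can be compared.

```python
def clamp(i, n):
    """Clamp index i to the valid range [0, n-1] (clamp-to-edge addressing)."""
    return max(0, min(n - 1, i))

def blur_2d(img, k):
    """Single pass: len(k) * len(k) samples per pixel, using the outer product of k with itself."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    return [[sum(k[dy + r] * k[dx + r] * img[clamp(y + dy, h)][clamp(x + dx, w)]
                 for dy in range(-r, r + 1)
                 for dx in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]

def blur_h(img, k):
    """Horizontal pass: len(k) samples per pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    return [[sum(k[dx + r] * img[y][clamp(x + dx, w)] for dx in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]

def blur_v(img, k):
    """Vertical pass: len(k) samples per pixel."""
    h, w, r = len(img), len(img[0]), len(k) // 2
    return [[sum(k[dy + r] * img[clamp(y + dy, h)][x] for dy in range(-r, r + 1))
             for x in range(w)]
            for y in range(h)]

def blur_separable(img, k):
    """Two passes: 2 * len(k) samples per pixel in total."""
    return blur_v(blur_h(img, k), k)

if __name__ == "__main__":
    kernel = [0.25, 0.5, 0.25]  # tiny example kernel; any 1D kernel works
    image = [[float((3 * x + 5 * y) % 7) for x in range(6)] for y in range(5)]
    a = blur_2d(image, kernel)
    b = blur_separable(image, kernel)
    print(max(abs(p - q) for ra, rb in zip(a, b) for p, q in zip(ra, rb)))
```

The two versions should agree to within floating-point rounding; the only thing that changes is the sample count per pixel.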
Some bits I didn’t know were discussed by Daniel Rákos in his article, and also coded up by JeGX in a GLSL shader demo collection. First, I hadn’t thought of using the Pascal’s triangle numbers as the weights for the Gaussian (nice visualization here). To be honest, I’m not 100% sure that’s right; it seems like you want the area under the Gaussian’s curve and not discrete samples, but the numbers are in the ball park. It’s also a heck of a lot easier than messing with the standard deviation; let’s face it, it’s a blur and we chop off the ends of the (infinite) Gaussian somewhat arbitrarily. That said, if a filtering expert wants to set me straight, please do.
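For anyone who wants to check the "ball park" remark numerically, here is a small Python sketch (mine, not from Daniel's article). It builds a normalized row of Pascal's triangle and two Gaussian-derived kernels of the same width: one from point samples of the curve, one from the area under the curve across each texel. The choice of sigma = sqrt(n)/2 is an assumption: it matches the variance n/4 of a binomial distribution with n trials, which seems like the natural comparison. The function names are hypothetical.

```python
import math

def binomial_weights(n):
    """Row n of Pascal's triangle, normalized to sum to 1 (n + 1 weights)."""
    total = 2 ** n
    return [math.comb(n, k) / total for k in range(n + 1)]

def sampled_gaussian_weights(radius, sigma):
    """Gaussian evaluated at integer texel centers, then renormalized."""
    raw = [math.exp(-(x * x) / (2.0 * sigma * sigma))
           for x in range(-radius, radius + 1)]
    s = sum(raw)
    return [v / s for v in raw]

def integrated_gaussian_weights(radius, sigma):
    """Area under the Gaussian across each texel [x - 0.5, x + 0.5], renormalized."""
    def cdf(t):
        return 0.5 * (1.0 + math.erf(t / (sigma * math.sqrt(2.0))))
    raw = [cdf(x + 0.5) - cdf(x - 0.5) for x in range(-radius, radius + 1)]
    s = sum(raw)
    return [v / s for v in raw]

if __name__ == "__main__":
    n = 8                        # row 8 gives a 9-tap kernel
    sigma = math.sqrt(n) / 2.0   # assumption: match the binomial variance n/4
    rows = zip(binomial_weights(n),
               sampled_gaussian_weights(n // 2, sigma),
               integrated_gaussian_weights(n // 2, sigma))
    for b, s, i in rows:
        print(f"{b:.4f}  {s:.4f}  {i:.4f}")
```

For the 9-tap case, all three kernels should agree to within roughly a hundredth per weight, which is consistent with "in the ball park" for a blur.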
The second tidbit: by using the linear interpolation trick, this shader was found to be 60% faster. Which sounds about right, if you assume that the taps are the main cost: the discrete version uses 9 taps, the interpolated version 5. Still, guessing and knowing are two different things, so I’m now glad to know this trick actually pays off for real, and by a significant amount.
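The 9-tap to 5-tap reduction can be sketched on the CPU as follows. This is a Python illustration under my own assumptions, not the shader from the article: `fetch_linear` stands in for a linearly filtered texture fetch, taps are paired left to right (the article may pair them differently, e.g. around the center), and all names are hypothetical. Two adjacent taps with weights w1 and w2 become one tap with weight w1 + w2, placed at the weighted average of their offsets.

```python
import math

def merge_taps(offsets, weights):
    """Pair up adjacent taps; each pair becomes one linearly filtered tap."""
    taps = []
    for i in range(0, len(weights) - 1, 2):
        w = weights[i] + weights[i + 1]
        o = (offsets[i] * weights[i] + offsets[i + 1] * weights[i + 1]) / w
        taps.append((o, w))
    if len(weights) % 2 == 1:  # odd tap count: the last texel stays on its own
        taps.append((offsets[-1], weights[-1]))
    return taps

def fetch_linear(row, pos):
    """Emulate a linearly filtered 1D fetch at a fractional texel position."""
    i = math.floor(pos)
    frac = pos - i
    if frac == 0.0:
        return row[i]
    return row[i] * (1.0 - frac) + row[i + 1] * frac

def blur_discrete(row, center, offsets, weights):
    """One fetch per texel: 9 fetches for a 9-tap kernel."""
    return sum(w * row[center + o] for o, w in zip(offsets, weights))

def blur_linear(row, center, taps):
    """One fetch per merged tap: 5 fetches for a 9-tap kernel."""
    return sum(w * fetch_linear(row, center + o) for o, w in taps)

# Example data: 9 binomial weights and an arbitrary row of texel values.
offsets = list(range(-4, 5))
weights = [math.comb(8, k) / 256 for k in range(9)]
taps = merge_taps(offsets, weights)
row = [((7 * i) % 11) / 10.0 for i in range(32)]

if __name__ == "__main__":
    print(len(taps), "taps")
    print(blur_discrete(row, 16, offsets, weights), blur_linear(row, 16, taps))
```

In this emulation the two results match to within floating-point rounding. On real hardware the filtering weights typically have limited fractional precision, so expect small differences there.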
The last interesting bit I learned was from a comment by heliosdev on Daniel’s article. He noted that computing the offset locations for the texture samples once (well, 4 times, once for each corner) in the vertex shader and passing these values to the pixel shader is a win. For him, it sped the process by 10%-15% on his GPU; another commenter, Panos, verified this result with his own benchmarks. Daniel is planning on benchmarking this version himself, and I’ll be interested to see what he finds. Daniel points out that it’s surprising that this trick actually gives any benefit. I was also under the impression that, because texture fetches take so long compared to floating-point operations, you could do a few “free” flops (as long as they weren’t dependent on the texture’s result) in between taps.
Long and short, I thought this was a good little trick, though one you want to benchmark to make sure it’s helping. Certainly, constants shouldn’t get passed from VS to PS; that sort of thing gets optimized by the compiler (discussed here, for example). But I can certainly imagine that computing numbers in the VS and passing them down could be more efficient. My main worry was that registering these values as PS inputs might have some overhead. You usually want to minimize the number of registers used in a PS, so that more fragments can be put in flight.
Keep in mind that on a certain platform that can’t directly texture from its render targets, a non-separable 9×9 Gaussian is faster than a separable 9+9 version with a resolve (transfer from RT to texture) between the H and V passes.
Whether it’s faster to calculate offsets in the VS and pass them down via interpolators to the PS, or to calculate them directly in the PS, should be benchmarked on your particular hardware; some GPUs really hate additional vertex attributes.
And now there are tricks for saving bandwidth and storing texture access results in thread local storage with OpenCL/DirectX Compute/CUDA.
A similar article that predates these: http://prideout.net/archive/bloom/
> A similar article that predates these: http://prideout.net/archive/bloom/
Interestingly enough, the author of that article has the first comment on Daniel’s blog post, warning people away from it (which begs the question, why not just fix that article?):
Sweet. This is similar to my old bloom tutorial, which I wrote when I was young and stupid: http://prideout.net/archive/bloom/
Some aspects of my tutorial are an embarrassment to me now, so I hope folks will encounter your article before mine. Keep up the good work Daniel!
No, Philip is just being too modest. I suppose he’s referring to the mipmap chain stuff which is not really necessary for a gaussian filter, but it’s nevertheless useful for some effects (Halo 2 for example uses a similar approach for their bloom filter). In any case, that’s completely orthogonal to the main points (pascal triangle, separability and bilinear interpolation).
The mipmap chain stuff is very useful for very-large-kernel gaussian blurs (for example, you can get results identical to applying a 40×40 gaussian kernel by recursively applying a 5×5 gaussian blur kernel over 4 mip chains). I’m not sure of any other way to do big kernels efficiently on DX9 hardware.