There are two speed tricks with Gaussian filtering using the pixel shader. The first is that the Gaussian filter (along with the box filter) is separable: you can filter horizontally, then vertically (or vice versa, of course). So for a 9×9 filter kernel you then have 18 texture samples in 2 passes instead of 81 samples in a single pass. The second trick is that each of the samples you use can actually be in-between two texels. For example, if you need to sample texels 1 through 9, you could sample just once in between 1 and 2 and let the GPU linearly interpolate between the two, then once between 3 and 4, and so on, for a total of 5 samples. The sample position within each pair is shifted toward the texel with the larger weight, so that the interpolation reproduces the ratio of the two weights. So instead of 18 samples you could get by with 10 samples. This is old news, dating back to at least ShaderX2 and GPU Gems, and we talk about it in our 3rd edition from around page 469 on.
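To make the second trick concrete, here is a minimal CPU-side sketch in Python of how adjacent taps could be merged. It is not shader code and not anyone's published implementation: the function names, the example 9-tap kernel, and the 1D "texture" are all made up for illustration, and the pairing (1 with 2, 3 with 4, and so on, with the last texel left alone) simply follows the description above.

```python
def merge_taps(weights):
    """Merge adjacent discrete taps into linearly interpolated taps.

    weights[k] is the kernel weight for the texel at offset k.
    Returns a list of (offset, weight) pairs, one per pair of texels.
    """
    merged = []
    for i in range(0, len(weights), 2):
        w0 = weights[i]
        w1 = weights[i + 1] if i + 1 < len(weights) else 0.0
        w = w0 + w1
        # Place the sample between texel i and i + 1 so that linear
        # filtering weights the two texels in the ratio w0 : w1.
        offset = i + (w1 / w if w else 0.0)
        merged.append((offset, w))
    return merged


def sample_linear(row, x):
    """Emulate a linearly filtered 1D fetch; texel centers sit at integers."""
    i = int(x)
    frac = x - i
    right = row[min(i + 1, len(row) - 1)]
    return row[i] * (1.0 - frac) + right * frac


def filter_discrete(row, start, weights):
    """One fetch per texel: len(weights) taps."""
    return sum(w * row[start + k] for k, w in enumerate(weights))


def filter_merged(row, start, taps):
    """One fetch per merged pair: roughly half as many taps."""
    return sum(w * sample_linear(row, start + off) for off, w in taps)


# Hypothetical example data: a normalized 9-tap kernel and a 1D row of texels.
kernel = [c / 256.0 for c in (1, 8, 28, 56, 70, 56, 28, 8, 1)]
row = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.5, 0.4, 0.6, 1.0, 0.0]
taps = merge_taps(kernel)

print(len(kernel), "discrete taps ->", len(taps), "merged taps")
print(filter_discrete(row, 1, kernel), filter_merged(row, 1, taps))
```

Both filters should give the same result up to floating-point rounding, since each merged fetch is just the weighted sum of its two texels. On a real GPU the interpolation weights have limited precision, so the match would be close rather than exact.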
Some bits I didn’t know were discussed by Daniel Rákos in his article, and also coded up by JeGX in a GLSL shader demo collection. First, I hadn’t thought of using the numbers from a row of Pascal’s triangle as the weights for the Gaussian (nice visualization here). To be honest, I’m not 100% sure that’s right; it seems like you want the area under the Gaussian’s curve and not discrete samples, but the numbers are in the ballpark. It’s also a heck of a lot easier than messing with the standard deviation; let’s face it, it’s a blur and we chop off the ends of the (infinite) Gaussian somewhat arbitrarily. That said, if a filtering expert wants to set me straight, please do.
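For a rough feel of how close "in the ballpark" is, here is a small Python sketch that prints normalized Pascal's triangle weights next to a Gaussian sampled at the tap centers. The function names are invented for this example. The choice of standard deviation, sqrt(n)/2, is an assumption on my part: it is the standard deviation of the binomial distribution that row n represents, which seems the natural match. Note that this compares against point samples of the Gaussian, not the area under the curve per texel, so it does not settle the question raised above.

```python
import math


def binomial_weights(n):
    """Row n of Pascal's triangle, normalized to sum to 1 (n + 1 taps)."""
    total = 2 ** n
    return [math.comb(n, k) / total for k in range(n + 1)]


def sampled_gaussian(n, sigma):
    """Gaussian sampled at the n + 1 tap centers, renormalized after truncation."""
    center = n / 2
    g = [math.exp(-((k - center) ** 2) / (2 * sigma ** 2)) for k in range(n + 1)]
    total = sum(g)
    return [v / total for v in g]


n = 8                        # row 8 gives a 9-tap kernel
sigma = math.sqrt(n) / 2     # assumed: std dev of the matching binomial
b = binomial_weights(n)
g = sampled_gaussian(n, sigma)

print("binomial  gaussian")
for bw, gw in zip(b, g):
    print(f"{bw:.4f}    {gw:.4f}")
```

The two columns come out close but not identical, which fits the intuition that the Pascal weights are a convenient approximation rather than an exact Gaussian.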
The second tidbit: by using the linear interpolation trick, this shader was found to be 60% faster. Which sounds about right, if you assume that the taps are the main cost: the discrete version uses 9 taps, the interpolated version 5. Still, guessing and knowing are two different things, so I’m now glad to know this trick actually pays off for real, and by a significant amount.
The last interesting bit I learned was from a comment by heliosdev on Daniel’s article. He noted that computing the offset locations for the texture samples once (well, 4 times, once for each corner) in the vertex shader and passing these values to the pixel shader is a win. For him, it sped the process up by 10%-15% on his GPU; another commenter, Panos, verified this result with his own benchmarks. Daniel is planning on benchmarking this version himself, and I’ll be interested to see what he finds. Daniel points out that it’s surprising that this trick actually gives any benefit. I was also under the impression that, because texture fetches take so long compared to floating-point operations, you could do a few “free” flops (as long as they weren’t dependent on the texture’s result) in between taps.
Long and short, I thought this was a good little trick, though one you want to benchmark to make sure it’s helping. Certainly, you don’t want to pass constants from the VS to the PS; that sort of thing gets optimized by the compiler (discussed here, for example). But I can certainly imagine that computing numbers in the VS and passing them down could be more efficient. My main worry was that holding these values in PS input registers might add some overhead. You usually want to minimize the number of registers used in a PS, so that more fragments can be put in flight.