Right, since we will be requiring quite a bit of post-processing, I went and wrote some code to render to a fullscreen quad first. In D3D10+, you NEED to use a vertexshader at all times, you can’t just pass XYZRHW coordinates directly to the rasterizer stage anymore. Even if you wanted to pass XYZRHW coordinates, you would have to create a passthrough shader to make them work.
Anyway, it’s a bit of a blessing in disguise, since I can now use a fullscreen quad that is -1…1 size, and it will be stretched to fullscreen automatically by the viewport projection step that is done after the vertexshader. This means I can just use a static vertexbuffer, no need to update it when the screen size changes.
And since D3D10+ has a 1:1 pixel:texel mapping, I don’t need to do any kind of corrections either.
For D3D9 backwards compatibility though, I have added some shift factors to the vertex shader, so that it can shift the positions to map the texels to pixels (which is relative to screen space, now that you have normalized coordinates in the shader). Still nicer than updating the vertexbuffer itself, like I used to do (then again, that code is from DX7 days, so back then I didn’t really have a choice. I just never bothered to rewrite it with shaders).
Once the basics of rendering a fullscreen texture worked, I also worked on render-to-texture, and then I made my own StretchRect()-clone. That was a very useful function in D3D9, but since it was basically just a wrapper around some standard D3D calls, Microsoft decided not to supply one for D3D10+. You have to roll your own now.
With this code in place, I could now do some nice post-processing. One of the effects I wanted to do was a boxfilter. I first started with a naive pixelshader version, which simply sampled the surrounding texels and averaged them. This caused a lot of redundant texel reads, and proved to be very inefficient for larger filter kernel sizes.
Back in the DX7 days I had developed an implementation, exploiting the fact that a boxfilter is separable into horizontal and vertical passes, and using the two texture stages of my GeForce2 at the time (those were the days) to build a filter with logarithmic complexity. I would set the same texture to both stages, and align them so that I could average two samples at each pass.
Pass 1 produces averages of pixels: (0+1), (1+2), (2+3), (3+4) etc.
Pass 2 uses these factors to generate: (0+1+2+3), (1+2+3+4), (2+3+4+5) etc.
In short, a filter of size N in one direction takes log(N) passes. A filter of NxM would take log(N)+log(M) passes.
This turned out to be extremely efficient, even back in those days. You could run extremely large filter kernels, blurring the screen beyond recognition in realtime.
I still have an old demo online which used this boxfilter effect, see here. If you still have an old box with a GeForce2 or similar card, you should try it, to see just how fast it still is.
Anyway, I have now ported that routine over to the D3D10+ framework. The code always had a deficiency though, and that is that it only supported kernels with sizes that are powers-of-two, because of the way it works. There is a solution to that, and it is actually not all that difficult: if you use an extra accumulator rendertarget, and factor your kernel size into powers-of-two, you can accumulate all the terms as you go through the passes, and support any kernel size.
I’ve decided to do some other experiment though, before implementing this improvement on this boxfilter algorithm. And that experiment was (drum roll): Compute shaders!
Yes… there was something I wanted to do for a while now. My Java engine used a boxfilter based on the Summed Area Table algorithm (SAT). It consists of two passes. First pass builds a table with the sums of all pixels from left-to-right, and top-to-bottom. The second pass does the actual boxfilter by taking the values from the four corners of the box from the SAT and subtracting them to get the sum of just that box.
This means that the filter has O(1) complexity: it has constant speed, regardless of kernel size. Building the table is always the same, and the lookup always needs only 4 samples from the table, regardless of size (granted, because of caching, larger kernels tend to be a tad slower, since the samples have poorer cache coherency).
This filter can be seen in action in the Croissant 9 demo in a few places, such as the opening scene.
The reason why I could never implement this algorithm on GPU is because building the table requires you to sum up neighbouring pixels. You can not read back from a rendertarget directly. Only in a following pass.
However, Compute Shaders do not have this restriction. They support Unordered Access buffers. These buffers can be read from and written to in any order. This gives you the same freedom as a regular CPU in terms of memory access. This is nothing short of a revolutionary feature of modern GPUs.
So, I decided to put my D3D11 codebase to good use for a change, and I tried to build a Compute Shader that calculates the SAT. I did the actual sampling of the table in a regular pixelshader. On downlevel hardware (Direct3D 10 hardware can also support Compute shaders, CS4.0 or 4.1 rather than CS5.0), you have the restriction that textures cannot be used as unordered access views (the Compute Shader equivalent of a render target, so to say. So you cannot output directly to a texture). Pixel shaders can read unordered access buffers though, so with an extra pixelshader pass, you can copy your compute shader output buffer to a ‘real’ texture. I figured I could kill two birds with one stone, and do the table-based filtering in the pixelshader, while rendering to a texture.
And, after a bit of a struggle, I managed to get it to work:
I have to say though, the performance is not as good as I hoped it would be. Since it is ~constant time, I can get ~700 fps regardless of kernel size… But that is also its ‘top speed’. The above-mentioned multipass algorithm dating from the DX7 era is much faster for small kernels. The scene renders at about 5000 fps with no filter applied, and doesn’t drop to around ~800 fps until I pump the filter up to about 512×512 size. By then the scene is already blurred beyond recognition… so sadly this means that the point where the SAT algorithm starts outperforming the multipass algorithm is past the area of interest.
Anyway, it was fun writing some REAL D3D11 code, and it was educational to get the hang of how Compute Shaders fit into the D3D11 model, and how they cooperate with textures and pixelshaders (I have dabbled with Cuda a bit, when I had my GeForce 8800, but Compute Shaders are very different in some ways. You still write them as regular HLSL, and compile and execute them much like other shaders in D3D, where Cuda was more similar to C/C++ and blended into your application code almost seamlessly). After all, Compute is one of the biggest new features of D3D11.
I may look into alternative algorithms in the future, since SAT is not all that ideal for CS yet. But first I will extend my multipass algorithm with the accumulator rendertarget that I mentioned, so that I can support filter kernels of arbitrary size.