Things have been going quite smoothly. The state handling mechanism was done rather quickly, and really works like a charm. Now I can have global and local settings again, and I can toggle them however I like, much like how D3D9 and earlier versions worked. The new state-object approach of D3D10 and up may make the API and driver implementation simpler and more efficient, but it's far from convenient from a 3D-engine point of view. Those differences are now abstracted away, and I have good control over the renderstates again.
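As a rough illustration of what I mean by global and local settings (the names here are mine, not the engine's actual interface): local state overrides simply fall back to global defaults where they're not set.

```cpp
#include <cassert>
#include <optional>

// Hypothetical sketch of D3D9-style toggling: a local state block
// overrides the global one per object, field by field.
struct StateBlock {
    std::optional<bool> depthWrite;
    std::optional<bool> alphaBlend;
};

struct ResolvedState {
    bool depthWrite;
    bool alphaBlend;
};

// Local settings win where present; otherwise fall back to the global
// setting, and finally to a hard default.
inline ResolvedState resolve(const StateBlock& global, const StateBlock& local) {
    return ResolvedState{
        local.depthWrite.value_or(global.depthWrite.value_or(true)),
        local.alphaBlend.value_or(global.alphaBlend.value_or(false)),
    };
}
```

The point is just that the engine can resolve the effective renderstate per draw call, regardless of how the underlying API wants its states grouped.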
That pretty much wrapped up the basic framework of the engine (admittedly it's still rather rough around the edges, but I'll just add to the current interface as and when required), so it's time to actually DO something with the engine now. So I looked at this idea of mine for building a custom tessellation pipeline via GPGPU. I've looked at some DX11 compute shader (CS) code, and at some Cuda. My first impression of DX11 CS is that it's rather cumbersome to use: Microsoft made it work almost exactly like the other shader types in the DX11 API, which takes away from the GP part of GPGPU, in my opinion. There also didn't seem to be all that much documentation on CS in the current DX SDK, so I figured I'd give Cuda a try first.
Cuda was a very pleasant surprise. I had played around a bit with some of the Cuda SDK samples before, but I had never actually tried to put any Cuda code into a project of my own. Setting up the project and having the Cuda kernel compile and link properly was actually the most work… but even that was quite simple once I knew where to look. The programming manual itself isn't much help for setting up Visual Studio projects, but after some poking around I found that they had some Cuda custom build scripts in the common directory of the SDK. They also had a file for syntax-highlighting the Cuda keywords, and a small readme on how to get Visual Studio to recognize .cu files as C++-like code and highlight them.
Once I got it all set up, it worked like a dream. You only need a few lines to attach a Cuda context to your D3D device, and to register whichever D3D resources you may want to map into Cuda later. The actual Cuda code is then compiled and linked against your own code, so you can call it almost as if it were a regular function in your program. In the .cu file you create a function with C++ code, from which you launch the actual kernel. This function is then exported, so you can call it from your own code. Very simple, very clean.
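For reference, a minimal sketch of that setup, assuming a D3D10 device and the Cuda 2.x-era cudaD3D10 interop API (function names are the real API; everything else, including error handling being omitted, is illustrative):

```cpp
#include <cuda_runtime.h>
#include <cuda_d3d10_interop.h>   // Cuda 2.x-era D3D10 interop header

// Forward declaration of the launcher exported from the .cu file.
extern "C" void launchKernel(void* vertexData, int vertexCount);

// One-time setup: attach Cuda to the D3D device and register the
// vertex buffer we'll want to map into Cuda later.
void setupInterop(ID3D10Device* device, ID3D10Buffer* vertexBuffer)
{
    cudaD3D10SetDirect3DDevice(device);
    cudaD3D10RegisterResource(vertexBuffer, cudaD3D10RegisterFlagsNone);
}

// Each frame: map the buffer into Cuda, run the kernel on it, unmap.
void runKernel(ID3D10Buffer* vertexBuffer, int vertexCount)
{
    ID3D10Resource* resources[] = { vertexBuffer };
    cudaD3D10MapResources(1, resources);

    void* data = nullptr;
    cudaD3D10ResourceGetMappedPointer(&data, vertexBuffer, 0);
    launchKernel(data, vertexCount);

    cudaD3D10UnmapResources(1, resources);
}
```

That really is about all there is to it; the map/unmap pair is also exactly the part whose overhead I measure below.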
So I wrote a simple test program where I took a vertex buffer and moved all vertices along their normal, to basically ‘blow up’ the object. I had everything up and running very quickly; it's quite intuitive, since it's so similar to regular C++ code, and the parallelism is implicit. You just create a function which can be called many times in parallel, so you don't need to worry about starting and stopping threads, synchronization, waiting for completion and all that.
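The kernel for that test is about as simple as Cuda gets. A sketch (my names; assuming positions and normals are separate, tightly packed float3 arrays):

```cuda
// Sketch of the 'blow up' test: push each vertex outward along its normal.
__global__ void explodeKernel(float3* pos, const float3* nrm, float amount, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= count)
        return;   // guard against the last, partially filled block

    pos[i].x += nrm[i].x * amount;
    pos[i].y += nrm[i].y * amount;
    pos[i].z += nrm[i].z * amount;
}

// Host-side launcher, exported so the engine can call it like a normal function.
extern "C" void launchExplode(float3* pos, const float3* nrm, float amount, int count)
{
    int threads = 256;                              // threads per block
    int blocks = (count + threads - 1) / threads;   // round up to cover all vertices
    explodeKernel<<<blocks, threads>>>(pos, nrm, amount, count);
}
```

Note how the kernel is just the per-vertex body; the "loop" over all vertices is the launch configuration.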
The only thing that was a bit strange was the performance. When I ran my test program from Visual Studio the first time, I got around 300 fps (in release mode, though). At some point I closed a Task Manager window that had been open for a while, and suddenly the framerate of my program jumped to over 500 fps, and stayed there consistently. I then experimented with resizing the window, and the speed seemed to vary with that as well. Not so much because of the size of the window itself, but because the framerate seemed almost random every time it had to reinitialize. Or perhaps it had something to do with the width of the screen, and some values being suboptimal… I'm not sure what went on exactly. When I ran it again outside of Visual Studio, I got over 850 fps right away, and it seemed pretty much indifferent to window size.
I must say I was very happy to see that figure of 850 fps. Before I added the Cuda code, my test scene rendered at about 1000-1100 fps. Going down to 300 fps would have meant that Cuda has huge overhead for mapping D3D resources and modifying them (I did actually comment out the kernel itself, having it only map and unmap the resource; even that gave the same performance hit). But at 850 fps the overhead is actually quite small: in frame time, dropping from 1000 to 850 fps costs only about 0.18 ms per frame, whereas dropping to 300 fps would have cost about 2.3 ms. So this idea may have potential. Currently I don't actually replace the vertex shaders. You can't completely replace them anyway, but you could have vertex shaders that merely copy vertex data from the buffer and send it off to the rasterizer, and have the Cuda code handle the T&L instead, while it's processing the geometry anyway. At this point I don't really know what would be more efficient. I'll have to toy around a bit and see what performance is like… and I also have to decide on exactly what kind of algorithm I'd like to implement, and how I will map it to Cuda.