A few days ago, nVidia released the Cuda 3.0 beta SDK. This went together with a new GeForce 195.39 beta driver. I’ve given them a quick look and I like what I see. Some highlights:
- The problem with naming conventions and calling conventions is now solved. In this release nVidia also uses stdcall and undecorated names, which means there is now binary compatibility between nVidia’s and AMD’s OpenCL.dll. I tested this by running a few of the ATi Stream examples without recompiling, and they worked. Sadly I couldn’t try running the nVidia examples on the AMD CPU driver, because for some reason it complained about a missing aticalcl64.dll. I’m not sure whether I didn’t configure something correctly, or whether their CPU drivers only work when you have the Catalyst driver installed.
- nVidia now supports the double precision features of OpenCL. I believe nVidia is again the first to support this.
- I’ve complained before about the poor performance of some of the included OpenCL samples (e.g. the Volume Rendering sample). I was very happy to see a massive increase in performance there. I haven’t done any in-depth benchmarking, but it does look like OpenCL and C for Cuda now run at more or less the same speed. I’m not sure how much of that comes from compiler improvements, and how much from nVidia now supporting new features in OpenCL, like the new image formats, byte addressing, better interoperability with OpenGL etc… But OpenCL support is maturing fast.
- The release notes mention “Early support for the Fermi architecture”, so with a bit of luck, Fermi’s release isn’t too far away now.
nVidia’s OpenCL support seems to be quite mature now. It supports a lot of features, the performance is good, and the function naming and calling convention seems to be sorted out. I hope there will be a public release of this driver soon.
On AMD’s side, the driver for the Stream beta 4 SDK was labeled Catalyst 9.11 beta, so it is apparently scheduled to be released in November. However, I’ve heard people complain about very poor performance on AMD GPUs:
The explanation seems to be that the OpenCL memory model doesn’t quite match the way AMD’s GPUs work:
Well, as I said earlier, OpenCL *was* based on Cuda, and nVidia would probably have an advantage there. It seems AMD is having trouble shoehorning it into their architecture.
As one developer said though, we don’t really care WHY it doesn’t perform:
"So these are not exactly apple to apple comparison" – I think you are badly mistaken here.
The real and most important question for a developer is "when I write a normal program in Brook+ (or CAL) and compare it to a program written in OpenCL, which will be faster". This is the question behind the decision between using Brook+ and OpenCL (of course there are other factors, such as ease of development, but these are not important when the speed difference is big).
So for a developer it really doesn’t matter how ATI implemented OpenCL. If it’s too slow then sorry, but we will not use it (and probably switch to nvidia’s cards and their better opencl implementation).
There might be perfectly valid and logical explanations for why it’s not as fast as you want it to be, but in the end it’s a problem that AMD has to solve, by improving their software and possibly also their hardware. After all, the point of using OpenCL is not only that it works on all hardware, but also that you get acceptable performance, just like with OpenGL and Direct3D for example.
So although AMD now has GPU support ‘on paper’ with their first beta release, they don’t quite seem ready to take on nVidia in the GPGPU arena yet. At this point it looks like nVidia has an advantage because of its architecture, and its software is also more mature and feature-rich.
Update: I found that indeed there is a dependency problem with AMD’s OpenCL driver:
After manually placing the ATi Catalyst drivers into the same directory as OpenCL.dll, I could run the AMD CPU implementation. And indeed it is now more or less a drop-in replacement for nVidia’s… It’s just that nVidia’s samples query GPU devices only by default, so that had to be fixed. Many of them also use some kind of OpenGL interop and such, so some of them still won’t work, or, as in the case of the Volume Rendering sample, it works but the resulting image is wrong. But at least it should now be possible, even without an ICD, to get some lowest-common-denominator OpenCL code running, which is binary compatible with both vendors’ implementations.