Who was first, DirectX 12 or Mantle? nVidia or AMD?

There has been quite a bit of speculation on which API and/or which vendor was first… I will just list a number of facts, and then everyone can decide for themselves.

  • Microsoft’s first demonstrations of working DX12 software (3DMark and a Forza demo, Forza being a port from the AMD-powered XBox One) were running on an nVidia GeForce Titan card, not AMD hardware (despite the XBox One connection and the low-level API work done there).
  • For these two applications to be ported to DX12, the API and drivers must have been reasonably stable for a few months before the demonstration. Turn 10, the developers of Forza, claimed that the port to DX12 was done in about 4 man-months.
  • nVidia has been working on lowering CPU-overhead with things like bindless resources in OpenGL since 2009 at least.
  • AMD has yet to reveal the Mantle API to the general public. Currently only insiders know exactly what the API looks like. So far AMD has only given a rough global overview in some presentations, which were released only a few months ago. And actual beta drivers have only been around since January 30th. So Microsoft/nVidia could only have copied its design through corporate espionage and/or reverse engineering, in an unrealistically short timeframe.
  • AMD was a part of all DX12 development, and was intimately familiar with the API details and requirements.
  • DX12 will be supported on all DX11 hardware from nVidia, from Fermi and up. DX12 will only be supported on GCN-based hardware from AMD.
  • The GCN architecture represented a remarkable change of direction for AMD, moving their architecture much closer to nVidia’s Fermi.

Update: This article at Tech Report also gives some background on DirectX 12 and Mantle development: http://techreport.com/review/26239/a-closer-look-at-directx-12

Posted in Direct3D, Hardware news, OpenGL, Software development, Software news | 24 Comments

DirectX 12: A first quick look

Today, Microsoft has presented the first information on DirectX 12 at the Game Developers Conference, and also published a blog on DirectX 12. nVidia also responded with a blog on DirectX 12.

In Direct3D 12, the idea of deferred contexts for preparing workloads on other threads/cores is refined a lot further than what Direct3D 11 offered. This should improve efficiency and reduce the CPU load. Another change is that state is distributed over even fewer state objects than before, which should make state translation to the native GPU even more efficient in the driver.

Other changes include a more lightweight way of binding resources (probably similar to the ‘bindless resources’ that nVidia introduced in OpenGL extensions last year), and dynamic indexing of resources inside shaders. That sounds quite interesting, as it should make new rendering algorithms possible.

And to prove that this isn’t just marketing talk, they ported 3DMark from D3D11 to D3D12 in order to demonstrate the improved CPU scaling and utilization. The CPU time was roughly cut in half. This demonstration also seems to imply that porting code from D3D11 to D3D12 will not be all that much work.

But most importantly: the API will work on existing hardware! (and apparently it already does, since they demonstrated 3DMark and a demo of Forza Motorsport 5 running under D3D12).

NVIDIA will support the DX12 API on all the DX11-class GPUs it has shipped; these belong to the Fermi, Kepler and Maxwell architectural families.

As an aside, nVidia also paints a more realistic picture about API developments than AMD does with Mantle:

Developers have been asking for a thinner, more efficient API that allows them to control hardware resources more directly. Despite significant efficiency improvements delivered by continuous advancement of existing API implementations, next-generation applications want to extract all possible performance from multi-core systems.

Indeed, it’s not like Direct3D and OpenGL have completely ignored the CPU-overhead problems. But you have to be realistic: Direct3D 11 dates back to 2009, so it was designed for entirely different hardware. You can’t expect an API from 2009 to take full advantage of the CPUs and GPUs in 2014. Microsoft did however introduce multiple contexts in D3D11, allowing for multi-core optimizations.

As nVidia also says:

Our work with Microsoft on DirectX 12 began more than four years ago with discussions about reducing resource overhead.

Which is actually BEFORE we heard anything about Mantle. The strange thing is that AMD claimed, less than a year ago, that there would not be a DirectX 12.

For the past year, NVIDIA has been working closely with the DirectX team to deliver a working design and implementation of DX12 at GDC.

That’s right, they have a working implementation already. Which would be impossible to pull off if they had just copied Mantle (which itself is not even out of beta yet). There simply would not have been enough time, especially since AMD has not even publicly released any documentation, let alone an SDK.

Sadly, we still have to wait quite a while though:

We are targeting Holiday 2015 games

Although… a preview of DirectX 12 is ‘due out later this year’

Update: AMD also wrote a press release on DirectX 12. Their supported hardware only includes GCN-based cards:

AMD revealed that it will support DirectX® 12 on all AMD Radeon™ GPUs that feature the Graphics Core Next (GCN) architecture.


GnuTLS: Just because people can read the source, doesn’t mean that they do

I don’t want to spend too much time on this topic… Just want to get this out there. As you may have heard, a vulnerability was discovered in GnuTLS, because of sloppy coding: http://blog.existentialize.com/the-story-of-the-gnutls-bug.html

I want to stress two points here:

1) This code has been in GnuTLS since 2005. So the bug went unnoticed for some 9 years.

2) The bug was discovered by Red Hat themselves. Possibly the recent TLS bug discovered in iOS and OS X (which had only been in there since 2012) inspired them to double-check their own code.

So what this means is: even though the code is open source, it took many years to find the bug, and when the bug was eventually found, it was found by Red Hat themselves, during an audit, rather than by an independent party. This once again dispels the myth that open source is more secure because the thousands of eyes watching the code will find bugs quickly (also known as Linus’s Law). Apparently serious security flaws can lie dormant in the code for many years.

For more background information, see Ars Technica.



Independent Mantle benchmarks start to trickle in

Traditionally I like to look at the benchmarks from Anandtech’s review. And their findings fit exactly with the prediction I made earlier: a high-end setup will likely see 10-15% gain at best. The lower the graphics detail, the more Mantle is able to gain… But why would anyone run on less than Ultra settings with an R9 290X?

Although Direct3D is slightly slower, it still gets well over 60 fps on average, so the few extra FPS that Mantle gives you aren’t all that relevant. Note also that even underclocking the CPU to 2 GHz and disabling some cores does not affect performance, so you certainly don’t need a full-blown i7-4960X to get these framerates in Direct3D mode. It also seems that Mantle does not do all that much for multithreading, although this was one of the initial claims made by AMD/DICE.

So the first impression is that Mantle is just not that interesting. Its gains are mainly in unrealistic scenarios: running very low detail settings on a high-end rig, or combining a low-end CPU with a top-of-the-line GPU.

I don’t think these gains are much incentive for most developers to start supporting Mantle in their games. Also, these gains will only get smaller over time. CPUs keep getting faster, so the low-end will continue to move up as well. And Direct3D and OpenGL will also continue to reduce CPU overhead in future versions. Which also means that there is little incentive for Intel and nVidia to want to support Mantle themselves.

After all the hype, these results are a bit of a letdown. Mantle does not look like it’s going to be the revolution it was hyped to be. It looks like a short-term solution to a problem that is disappearing anyway.

Update: At Tech Report, they actually bothered to test an nVidia card as well. It turned out to be a lot less CPU-limited in Direct3D than the AMD cards:

One thing we didn’t expect to see was Nvidia’s Direct3D driver performing so much better than AMD’s. We don’t often test different GPU brands in CPU-constrained scenarios, but perhaps we should. Looks like Nvidia has done quite a bit of work polishing its D3D driver for low CPU overhead.


Windows 8 doesn’t detect my sound card!

Or does it? When I recently installed a Creative X-Fi Titanium Fatal1ty Pro PCI-e card in a Windows 8 system, nothing appeared to happen.
When I looked at the Device Manager, there was no Creative X-Fi card to be found. No undetected devices either, and the system did not do any driver search and install.
So what happened? Did it just ignore the card? Or was the card somehow incompatible with my system’s PCI-e controller, and did the system fail to detect it altogether?

I booted into Windows 7, to see what would happen there. And much to my surprise, Windows 7 immediately detected new hardware, and started downloading the X-Fi drivers. Right, so on the hardware-side, everything seemed to work. After the drivers were installed, I plugged in some headphones, and indeed, sound was playing from the card.

So what happened in Windows 8 then? As you may know, any modern system will have a number of “High Definition Audio” devices. Usually you have an onboard sound card, and then a few devices for your HDMI ports and whatnot. Upon closer inspection, I noticed that one of these generic HD Audio devices had only just been installed, using the standard Microsoft drivers. Hmm, wait a minute. Is that what happened? Windows actually DID detect the sound card, but it automatically installed the generic HD Audio driver for it, so you never even saw it detecting and installing the card?
Yes indeed. I checked the card’s hardware ID, and it indicated that it was a Creative card. And when I used this device, the audio would play through the X-Fi card.

Okay, so that explains where the X-Fi went! Apparently the X-Fi drivers are available through Windows Update on Windows 7, but not on Windows 8. So I just downloaded them manually, and luckily, they installed just fine on Windows 8. The generic HD Audio device disappeared from Device Manager, and the card was now detected as a proper Creative X-Fi. Everything is back to normal again!


nVidia stability issues on GeForce 400/500 series

I am still using the trusty old GeForce GTX460 in one of my machines. When I upgraded the drivers to version 320.18, I started having problems. Every now and then, the driver would reset, and in some cases it would even fail to reset, and the desktop would freeze. The only way out was to reset the machine. I noticed that both Windows 7 and Windows 8 suffered from the same issue.

I found out that when I went back to the previous drivers, version 314.22, the problems disappeared again. At the time I did not pay too much attention to it; the old drivers worked fine anyway.
Every time a new driver was released, I tried it, and every time I ran into the same problem, so I kept going back to 314.22.

When version 327.23 came around, and still suffered the same issue, I started to worry a bit though. These were the first drivers with official Windows 8.1 support. So I was wondering what would happen once the Windows 8.1 update was released. Could I still use my 314.22 drivers?

I started looking around on the GeForce forums, and noticed that other people had been suffering from the same issue. I decided to report the issue myself as well, hoping it would help. Initially I just got a standard reply, telling me to check my PSU and do a clean reinstall and all that. I replied that I’m not an idiot, and that, as I said, the system works fine when I put 314.22 back on, so we could rule out any kind of hardware or installation problems.

After a while, I got a reply from second-line support, and it seemed that nVidia was now taking the issue quite seriously. I told them that I suspected it had something to do with power management: it mostly seemed to occur while browsing, generally when you want to play a movie or scroll a page, i.e. when the video card has to switch from idle to some kind of accelerated mode.
They had released a number of beta drivers attempting to solve the issue, but so far they had failed to isolate it. I was kept up-to-date on new betas by email, and was asked to report back.

In the meantime, Windows 8.1 came out. I was hoping that the stability problems had solved themselves, but after a while I started seeing the same issues again in 8.1. Luckily, 314.22 still worked fine in Windows 8.1, but Windows Update kept trying to install 327.23, so you had to ignore that update.

Then nVidia got REALLY serious, and even asked people to send in their problematic cards. A few people responded and sent in their cards (it could be related to cards with a modified BIOS, such as my own GTX460, which is a factory-overclocked card from Gigabyte). After a number of updates, nVidia posted a specially modified driver based on version 331.58 on page 50 of that thread, which seemed to do the trick.

nVidia also contacted me about these drivers by email, and asked for my feedback again. After trying them for 3 days without issues, I reported that they were looking good so far. nVidia thanked me for testing, and they were confident that the issue was fixed now. A beta driver including this fix would be part of a regular beta release soon. This beta arrived in the form of version 331.93 a few days later. As the release notes say: “[326.80, Fermi]: End users report browser freezes and crashes. [1358403]”. I have been running this driver since its release without any issues.

So, it’s a shame that it took nVidia so long to locate the bug (it had been in there since their driver release in late May, and they finally fixed it in late November). But on the other hand, it’s nice that they committed themselves to fixing the bug, even though the 400/500 series are already a few years old, and two generations behind. If anyone has been suffering from the same issues with their card, I can recommend the 331.93 beta drivers. And within a few weeks, we should see a WHQL release incorporating this fix, so it is truly history.


On 1-7-2014, nVidia released the 332.21 WHQL drivers, which include the fix for this problem: “[Fermi-class GPU]: Browser freezes and crashes. [1358403]”


Just keeping it real… like it’s 1991

Last weekend, there was a special demoscene party, the 1991 party, with, obviously, 1991 as its theme. Well, that is just my bag, baby! The focus was mainly on C64 and Amiga, which were the most popular platforms for gaming and demoscene activity in those days. I wanted to do a small production myself as well. I decided to go with a PC production: because of my oldskool code experiments, the PC stuff was in the most mature state. And also, the PC platforms of those early days have not been explored very much, so it is still possible to create some refreshing things and work out interesting ideas for early PC.


Speaking of early PCs… Because of my early PC escapades, I came into contact with Trixter a while ago, and we started bouncing ideas back and forth. Trixter’s platform of choice is the PCjr, the slowest PC ever made, but it does have slightly more capable graphics and audio hardware than a regular PC with CGA and PC speaker. Earlier this year, Trixter released the world’s first intro on PCjr, namely INTROjr:

I especially liked the rasterbars. There is no raster interrupt on the PC, so there is no easy way to time code to the screen position. In INTROjr, the rasterbars may not be perfectly stable, but that is understandable, given that the CPU is far too slow to do any remotely accurate polling for the horizontal blank interval, and unlike on a C64, it is virtually impossible to count out cycle-exact code. Firstly, because the 8088 CPU itself has some internal buffers and things that make it very difficult to predict what state the CPU is in at any given time (e.g. is the next instruction already fetched, or not?). Secondly, because of the way DRAM is refreshed on a PC.

A refresher

The ‘D’ in DRAM stands for ‘Dynamic’. This means that the memory can only hold its contents for a limited time, as opposed to static RAM (SRAM). So the RAM needs to be ‘refreshed’ (its contents read and written back) periodically in order to retain its data. On many systems, this memory refresh is performed automatically by the chipset. For example, on a C64, the VIC-II chip takes care of the refresh. And on the Amiga, one of the DMA channels of the Agnus chip performs the refresh.

A PC is built up from generic parts, rather than a custom chipset. It does not have any dedicated hardware to automatically refresh the memory. Instead, a ‘software’ solution is employed: one of the timers in the chipset is set up to trigger the DMA controller periodically to read a single byte (a DRAM read is destructive, and the row that was read is rewritten automatically, so a read always triggers a refresh).

The problem with this solution for cycle-exact coding is that you never quite know when the refresh DMA will steal away some bus cycles from the CPU. On systems like the C64 and Amiga, the refresh is synced to the display output, so it is very predictable: memory is always refreshed in the exact same cycles on screen, so the whole process happens in the background and can basically be ignored.
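The cycle-stealing problem can be illustrated with a small model. This is only a sketch with made-up numbers (the real 8088 refresh runs on different timings); the point is just that the total execution time of a piece of code depends on the phase of the refresh timer at the moment the code starts:

```python
# Toy model of DMA refresh stealing bus cycles. Illustrative numbers
# only: 'period' and 'steal' are not real PC timings.

def run_time(code_cycles, phase, period=72, steal=4):
    """Total bus time for a block of CPU work, when a refresh steals
    'steal' cycles every 'period' cycles, and the code starts 'phase'
    cycles after the previous refresh."""
    first = phase // period + 1                 # first refresh we hit
    last = (phase + code_cycles - 1) // period  # last refresh we hit
    events = max(0, last - first + 1)
    return code_cycles + events * steal

# The same code takes a different number of cycles depending on when
# it starts relative to the refresh timer:
print(run_time(1000, phase=0))
print(run_time(1000, phase=35))
```

Since the phase is effectively unknown to the programmer, counting out cycles C64-style is a lost cause on a stock PC.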

If you want to read more about DRAM refreshes on PC, and how to get around it, there is some more in-depth information on Andrew Jenner’s blog, and also some information on how to get the CPU and CGA adapter into a synchronized state which he calls ‘lockstep’. This idea is similar to the ‘stable raster’ I described for C64 earlier, but on PC it is even harder to do. The short version of it is that this is for a real PC only, and the PCjr performs its DRAM refresh in a slightly different way. Trixter has not yet found a reliable workaround for the problem that works on PCjr. So, given these limitations, the rasterbars in INTROjr are very nice indeed.

Another thing I really liked is that the scroller and the rasterbars run at full framerate, even though the other effects may not. Running effects ‘in 1 frame’ is the hallmark of good C64 and Amiga demos. So it is very nice to see that on the PC, where it is even harder to synchronize things and perform asynchronous processing (remember, we are coding on the bare metal here: there is no OS with threading functionality, let alone multiple CPU cores. And even if we did have threading functionality, we wouldn’t have the resources to run multiple threads this efficiently and well-synchronized).

1-bit ought to be enough for everyone

Another really cool thing that Trixter has made is MONOTONE, a tracker aimed at the PC speaker. In the hands of a skilled musician, it is capable of some very interesting sounds:

So, hopefully Trixter and I can combine forces to push the limits of early PCs. The first idea was to do something like INTROjr, with a logo on top, a scroller at the bottom, some music, and some 3d objects in the center of the screen. That would look like a classic Amiga demo, such as Phenomena’s Animotion:

Anyway, for the 1991 party, I wanted to do a simple PC intro, based on the subpixel-correct donut renderer that I developed earlier, for 16-bit x86 systems. I wanted to add a logo and a scroller. This is what I came up with:

I took Trixter’s idea for the timer interrupt, and modified it to work on different PCs. Namely, a PCjr is a fully synchronized system, much like a C64 or Amiga, where all timings are based off the same crystal. This means that the CPU, video chip and things like timers all run in sync (which is why early PCs had such peculiar clock speeds as 4.77, 7.16 and 9.54 MHz: they are all derived from the same 14.31818 MHz crystal, four times the NTSC colour subcarrier frequency of 3.579545 MHz). So on the PCjr, you can create a raster interrupt by waiting for the vertical blank interval once, then setting up a timer at the refresh rate of your screen, and it will trigger at the vertical blank interval every frame.

On later PCs, different parts of the system would have their own clock generator, and they would run asynchronously. This is a problem that I ran into for my target machine (a fast 286, or an entry-level 386sx/dx). If you set up a timer interrupt, it will not be in sync with the video refresh, so it will drift quickly. So instead, I used a one-shot timer, which I resynchronized every frame. I set up a timer to trigger a few scanlines before the end of the screen (an arbitrary safety margin), and then I enter a loop to wait for the vertical blank interval to start. Now I am re-synchronized to the screen, and I can set up a new one-shot timer to act as a faux raster interrupt. There will always be some inaccuracy, because not all systems run their timers at exactly the same speed, but there is no drift. So worst case you may not hit the exact scanline you were aiming for, but at least you will always hit the same scanline (give or take some jitter), rather than drifting up or down the screen over time.
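The drift problem and the one-shot fix can be sketched with a few lines of arithmetic. The numbers below are purely illustrative (a roughly 60 Hz frame and a timer that is off by a few microseconds per period), not measured values from any real machine:

```python
# Sketch: a free-running timer with a slightly wrong period drifts
# further from the display every frame, while a one-shot timer that is
# re-armed from the vertical blank never accumulates more than one
# period's worth of mismatch. Illustrative numbers, not real timings.

def free_running_error(frames, frame_us=16683.0, timer_us=16680.0):
    """Phase error (microseconds) after each frame for a timer that
    just keeps running at its own rate."""
    errors = []
    timer_time = 0.0
    for n in range(1, frames + 1):
        timer_time += timer_us
        errors.append(timer_time - n * frame_us)
    return errors

def one_shot_error(frames, frame_us=16683.0, timer_us=16680.0):
    """Phase error when the timer is re-armed at every vertical blank:
    each frame starts from zero error again."""
    return [timer_us - frame_us for _ in range(frames)]

print(abs(free_running_error(100)[-1]))  # grows with the frame count
print(abs(one_shot_error(100)[-1]))      # stays at the per-frame jitter
```

This is exactly the trade-off described above: the one-shot approach still jitters by the per-frame mismatch, but it never walks up or down the screen.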

And that is good enough for this particular intro. The intro runs in a 16-colour videomode. The timer interrupt is used to switch palettes between the logo, the donut and the scroller, so that more than 16 colours are visible on screen at a time (as you can see, I have a few black scanlines between the different parts, so I can easily hide any inaccuracies of the timer interrupt). The scroller is also updated as part of the timer interrupt, so it will always run at full framerate, regardless of how fast the system can render the donut.

Palette banking

The palette switching is not done by just overwriting all the RGB registers. Instead it uses a lesser-known feature of the VGA card, which allows you to have either 4 banks of 64 colours or 16 banks of 16 colours. It is quite a simple trick, if you know how: you can override the highest bits of the palette index via the Color Select register in the attribute controller (port 0x3C0, index 14h). For more information, see this page. This allows you to switch palettes with just a single command, rather than having to write 48 bytes of palette data, which can take quite a bit of CPU time.
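The index calculation behind this trick can be modelled in a few lines. This is a simplified model of the commonly documented VGA behaviour in ‘16 banks of 16 colours’ mode, not a register-level reference:

```python
# Simplified model of how the VGA attribute controller forms the final
# 8-bit DAC index in '16 banks of 16 colours' mode: the low 4 bits come
# from the palette register selected by the pixel, the high 4 bits from
# the Color Select register.

def dac_index(pixel, palette_regs, color_select):
    """pixel: 4-bit pixel value from video memory (0..15)
    palette_regs: the 16 attribute-controller palette registers
    color_select: value written to the Color Select register (14h)"""
    low = palette_regs[pixel] & 0x0F   # low 4 bits from the palette
    bank = color_select & 0x0F         # bank bits become bits 4..7
    return (bank << 4) | low

palette = list(range(16))              # identity palette

# One register write moves all 16 colours to a different 16-entry bank:
print(dac_index(5, palette, 0))        # bank 0
print(dac_index(5, palette, 3))        # bank 3: same pixel, new bank
```

So switching all 16 on-screen colours costs one OUT to the attribute controller instead of 48 DAC writes.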

Another way to scroll

Getting a scroller to work in the planar 16-colour videomode was quite an interesting problem as well. EGA/VGA do have support for horizontal scrolling (panning), but as far as I know, it can only be used for scrolling the top of the screen, and optionally keeping the bottom part of the screen fixed, using the line-compare function. So where you can trigger scrolling at any given time via a raster interrupt or copper on C64/Amiga, this approach is not likely to be compatible with most EGA/VGA hardware. I have to do some more testing on this, because I am not 100% sure that my code was working properly at the time, but it seems logical that the scroll register is latched for the line-compare function, and is only read at vertical blank.

So, instead I opted for a ‘software’ scroller. The scroller in INTROjr is a software scroller as well, since the PCjr hardware does not have any horizontal scrolling support. However, the pixelformat for PCjr is much like CGA: two 4-bit pixels are packed together in a single byte. So you can make an acceptable scroller by just moving the data one byte in memory, which effectively scrolls 2 pixels.

In my case, a byte-oriented scroller would scroll 8 pixels at a time, which would be far too much. So the scroller would have to perform bit-oriented movement of the data. If you were to do this on the CPU, it would get very expensive, since shift/rotate are very slow operations on these early CPUs.

However, the EGA ALU has support for rotating and masking built in! So I decided to use that instead. On the CPU side, it would be reasonably cheap: I just have to write all bytes for the scrolltext twice, so that I can handle the overlap of 2 characters within 1 byte. The EGA ALU will then rotate and mask the bits into place, so they can be displaced in a pixel-accurate way.
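What the scroller has to achieve per bitplane can be sketched like this. On the real hardware the EGA ALU’s rotate and bit mask do this work for free during the write; the Python below only models the data movement, to show why every source byte contributes to two destination bytes:

```python
# Model of a sub-byte shift on one bitplane row. Each byte holds 8
# pixels; shifting left by 'shift' bits pulls the top bits of the next
# byte into the bottom of the current one. On real hardware the EGA
# ALU's rotate and bit mask do this; here it is done in Python just to
# illustrate the data movement.

def scroll_row(row, shift):
    """Shift a row of bytes left by 'shift' bits (0..7)."""
    if shift == 0:
        return list(row)
    out = []
    for i in range(len(row)):
        hi = (row[i] << shift) & 0xFF
        lo = row[i + 1] >> (8 - shift) if i + 1 < len(row) else 0
        out.append(hi | lo)   # every output byte combines two sources
    return out

row = [0b10000001, 0b11000011]
print([f"{b:08b}" for b in scroll_row(row, 2)])
```

Doing this shift on an 8088/286 with CPU shift instructions for every byte of every plane would be far too slow, which is why offloading the rotate to the EGA ALU matters.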


The donut will have to be drawn to a backbuffer in order to avoid nasty flashing and garbage during drawing. For the logo, this is not really a problem: I can just place the logo in both the front- and backbuffer, and make sure I never overwrite it while drawing. The scroller, however, will be racing the beam on the frontbuffer. The flip now has to be performed by the timer interrupt, rather than having the 3d rendering routine itself wait for VBL after it has finished a frame. A simple solution is to set a global flag when a frame has completed rendering, and then enter a loop that waits until the flag is reset. Every time the timer interrupt is triggered and has reached the VBL, it checks the flag. If the flag is set, it performs a flip and resets the flag. This makes the 3d rendering loop exit its wait loop, and it can start rendering the next frame in the new backbuffer.
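The handshake can be sketched as follows. The ‘interrupt’ is simulated by calling a handler function by hand; on the real machine it fires at the vertical blank, and the renderer busy-waits on the flag instead of returning:

```python
# Sketch of the flip handshake between the render loop and the timer
# interrupt. The interrupt is simulated by calling vbl_handler() by
# hand; names and structure are illustrative only.

class Display:
    def __init__(self):
        self.frame_ready = False    # set by renderer, cleared at flip
        self.front, self.back = 0, 1
        self.flips = 0

    def vbl_handler(self):
        """Timer interrupt at VBL: flip only if the renderer finished."""
        if self.frame_ready:
            self.front, self.back = self.back, self.front
            self.flips += 1
            self.frame_ready = False  # releases the renderer's wait loop

    def render_frame(self):
        # ... draw the donut into self.back here ...
        self.frame_ready = True
        # the real renderer now spins until the interrupt clears the flag

d = Display()
d.render_frame()
d.vbl_handler()    # a frame was ready: this VBL performs the flip
d.vbl_handler()    # no new frame yet: this VBL does nothing
print(d.flips, d.front)
```

Note that the interrupt simply skips the flip when no frame is ready, so the scroller keeps running at full framerate no matter how long the donut takes to render.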

This way you can get nicely asynchronous effects, while still making use of backbuffering in videomemory and using hardware page flipping.

Unusual bugs

An interesting bug that occurs here is the following: the polygon renderer makes use of the EGA latches. When the timer interrupt kicks in to update the scroller, these latches will be invalidated. When the polygon renderer then continues, it may render garbage, because the latches had been initialized just before the scroller kicked in, and have now been overwritten with scroll-data.

Although this is an interesting and somewhat peculiar bug, I wanted to have a pixel-perfect intro, so I decided to look for a workaround. I came up with the following: before the scroller starts, we assume that the latches still contain data that the polygon renderer needs. The scroller writes a byte to the backbuffer in the scroll area, in order to save the latches. After it is done updating the scroller, it loads this byte again, so that the latches contain the same data as they did before the timer interrupt kicked in.

Since the scroller runs on the frontbuffer, the pixels are not used in the backbuffer, and can freely be used as temporary storage. The scroller always runs at full framerate, so the data is always overwritten with valid scroll data before it is displayed after the buffers have been flipped.
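The save/restore trick can be modelled with a toy version of the latches. This is heavily simplified: the real EGA has four latches (one byte per plane) that are filled by any CPU read of video memory, while here they are collapsed into a single value just to show the idea:

```python
# Toy model of the latch save/restore workaround. A CPU read of video
# memory fills the latches, and a latched write dumps them back into
# memory. Heavily simplified; not real EGA register access.

class Ega:
    def __init__(self):
        self.latches = 0
        self.mem = {}

    def read(self, addr):             # CPU read fills the latches
        self.latches = self.mem.get(addr, 0)

    def write_latched(self, addr):    # latched write: memory <- latches
        self.mem[addr] = self.latches

SCRATCH = 0x100   # unused byte in the backbuffer's scroll area

def scroller_interrupt(ega):
    ega.write_latched(SCRATCH)        # save latches to the scratch byte
    ega.latches = 0xAB                # ... the scroller clobbers them ...
    ega.read(SCRATCH)                 # restore the saved latches

ega = Ega()
ega.mem[0x10] = 0x5A
ega.read(0x10)                        # polygon renderer fills latches
scroller_interrupt(ega)
print(hex(ega.latches))               # same value as before the interrupt
```

The scratch byte lives in the backbuffer’s scroll area, so it is guaranteed to be overwritten with valid scroll data before that buffer is ever displayed.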
