Putting things together

So, over time I have discussed various isolated topics related to 8088-based PCs.

These topics are not as isolated as they seem at first. Namely, I was already using the auto-EOI trick for the streaming data program, to get the best possible performance. And I streamed audio data, which is related to sound cards. When I discussed the latching timer, I also hinted at music already (and the auto-EOI feature). And again, when I discussed auto-EOI in detail, I mentioned digital audio playback.

Once I had built my Tandy sound card (using the PCB that I ordered from lo-tech, many thanks to James Pearce for making this possible), I needed software to do something with it. The easiest way to get it to play something is to use VGM files. There are quite a few captured songs from games (mostly from the Sega Master System, which used the same audio chip), and various trackers can also export songs to VGM.

VGM is a very simple file format: it simply stores the raw commands sent to the sound chip(s). Between commands, it stores delays. There are simple VGM files, which only update the music 50 times or 60 times per second (synchronized to PAL or NTSC screen updates). These are easy to play: Just set up your timer interrupt to fire at that rate, and output the commands. But then there’s the more interesting files, which contain digital samples, which play at much higher rates, and they are stored with very fine-grained delay commands. These delays are in ticks of 44.1 kHz resolution. So the format is flexible enough to support very fast updates to sound chips, eg for outputting single samples at a time, up to 44.1 kHz sample rate.
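To make that concrete, here is a rough sketch in C of the single-chip subset of the format, with command lengths as per the VGM spec (the struct and function names are my own, and commands for other chips are omitted):

```c
#include <stdint.h>
#include <stddef.h>

/* One decoded VGM event: an optional byte for the SN76489, then a delay. */
struct vgm_event {
    int      has_psg_byte;  /* 1 if psg_byte should be sent to the chip */
    uint8_t  psg_byte;
    uint16_t delay;         /* ticks of 1/44100 s to wait afterwards */
    int      end;           /* 1 on the 0x66 end-of-data marker */
};

/* Decode the command at p; returns the number of bytes consumed. */
size_t vgm_decode(const uint8_t *p, struct vgm_event *ev)
{
    ev->has_psg_byte = 0; ev->delay = 0; ev->end = 0;
    switch (p[0]) {
    case 0x50:  /* 0x50 dd: write dd to the SN76489 command port */
        ev->has_psg_byte = 1; ev->psg_byte = p[1]; return 2;
    case 0x61:  /* 0x61 nn nn: wait nnnn ticks (little-endian) */
        ev->delay = (uint16_t)(p[1] | (p[2] << 8)); return 3;
    case 0x62: ev->delay = 735; return 1;  /* wait one 60 Hz frame */
    case 0x63: ev->delay = 882; return 1;  /* wait one 50 Hz frame */
    case 0x66: ev->end = 1; return 1;      /* end of sound data */
    default:
        if ((p[0] & 0xF0) == 0x70) {       /* 0x7n: wait n+1 ticks */
            ev->delay = (p[0] & 0x0F) + 1; return 1;
        }
        return 1;  /* commands for other chips omitted in this sketch */
    }
}
```

The 0x7n commands are what make sample streaming viable: a single byte can encode the typical short gap of 1 to 16 ticks between chip writes.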

The question of course is: how do you play that? On a modern system, it’s no problem to process data at 44.1 kHz in realtime. But on an 8088 at 4.77 MHz, not so much. You have about 108 CPU cycles to process each sample. That is barely enough to just process an interrupt, let alone actually processing any logic and outputting data to the sound chip. A single write to the SN76489 takes about 42 CPU cycles by the way.

So the naïve way of just firing a timer interrupt at 44.1 kHz is not going to work. Polling a timer is also going to be difficult, because it takes quite some time to read a 16-bit value from the PIT. And those 16-bit values can only account for an 18.2 Hz rate at the lowest, so you will want to detect wraparound and extend it to 32-bit to be able to handle longer delays in the file. This will make it difficult to keep up with high-frequency data as well. It would also tie up the CPU 100%, so you can’t do anything else while playing music.

But what if we view VGM not as a file containing 44.1 kHz samples, but rather as a timeline of events, where the resolution is 44.1 kHz, but the actual event rate is generally much lower? Now this sounds an awful lot like the earlier raster effect with the latched timer interrupt! We’ve seen there that it’s possible to reprogram the timer interrupt from inside the interrupt handler. By not resetting the timer, but merely setting a new countdown value, we avoid any jitter, so we remain at an ‘absolute’ time scale. The only downside is that the new countdown value gets activated immediately after the counter reaches 0, which is before the CPU can reach your interrupt handler. Effectively that means you always need to plan 1 interrupt handler ahead.

We basically have two challenges here, with the VGM format:

  1. We need a way to ‘snoop ahead’ to get the proper delay to set in the current handler.
  2. We need to process the VGM data as quickly as possible.

For the first challenge, I initially divided my VGM command processor in two: one part would send commands to the SN76489 until it encountered a delay command. The other would skip all data until it encountered a delay command, and return its value.

Each processor has its own internal pointer, so in theory they could be in completely different places in the file. Making the delay processor be ‘one ahead’ in the stream was simple this way.

There was still that second challenge however: Firstly, I had to process the VGM stream byte-by-byte, and act on each command explicitly in a switch-case statement. Secondly, the delay values were in 44.1 kHz ticks, so I had to translate them to 1.19 MHz ticks for the PIT. Even though I initially tried with a look-up-table for short delays, it still wasn’t all that fast.
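The conversion itself is simple in principle: the PIT counts at 1193182 Hz (14.31818 MHz divided by 12), so every 44.1 kHz tick is about 27.05 PIT ticks. A sketch of the look-up table approach (helper names are mine; note that any delay longer than roughly 2422 ticks overflows a single 16-bit PIT countdown anyway, so longer delays have to be split up first):

```c
#include <stdint.h>

#define PIT_HZ 1193182UL  /* PIT input clock: 14.31818 MHz / 12 */
#define VGM_HZ   44100UL

/* Exact conversion: 32-bit multiply and divide, expensive on an 8088.
   Only valid while the result fits in 16 bits (vgm_ticks <= ~2422). */
static uint16_t vgm_to_pit_slow(uint16_t vgm_ticks)
{
    return (uint16_t)((uint32_t)vgm_ticks * PIT_HZ / VGM_HZ);
}

/* Look-up table for short delays, filled once at startup. */
static uint16_t pit_lut[256];

static void init_pit_lut(void)
{
    uint16_t i;
    for (i = 0; i < 256; i++)
        pit_lut[i] = vgm_to_pit_slow(i);
}

static uint16_t vgm_to_pit(uint16_t vgm_ticks)
{
    if (vgm_ticks < 256)
        return pit_lut[vgm_ticks];      /* fast path for the common case */
    return vgm_to_pit_slow(vgm_ticks);  /* rarer, longer delays */
}
```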

So eventually I decided that I would just preprocess the data into my own format, and play from there. The format could be really simple, just runs of:

uint16_t delay;
uint8_t data_count;
uint8_t data[data_count];

Where ‘delay’ is already in PIT ticks, and since there is only one command in VGM for the SN76489, which sends a single byte to its command port, I can just group them together in a single buffer. This is nice and compact.

Since I support a data_count of 0, I can get around the limitation of the PIT only being able to wait for 65536 ticks at most: I can just split up longer delays into multiple blocks with 0 commands.
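A sketch of how the preprocessor could emit such blocks, splitting over-long delays into empty ones (the function name is mine; for simplicity I cap a single block at 0xFFFF ticks here, even though the PIT can technically count 65536 when programmed with 0):

```c
#include <stdint.h>
#include <stddef.h>

/* Emit one block of the preprocessed stream: a 16-bit PIT delay,
   a count, and then `count` bytes for the SN76489 command port. */
static size_t emit_block(uint8_t *out, uint32_t pit_delay,
                         const uint8_t *data, uint8_t count)
{
    size_t n = 0;
    /* Split delays the PIT cannot count down in one go into
       blocks with zero command bytes. */
    while (pit_delay > 0xFFFF) {
        out[n++] = 0xFF; out[n++] = 0xFF;  /* delay = 65535 */
        out[n++] = 0;                      /* no command bytes */
        pit_delay -= 0xFFFF;
    }
    out[n++] = (uint8_t)(pit_delay & 0xFF);  /* little-endian delay */
    out[n++] = (uint8_t)(pit_delay >> 8);
    out[n++] = count;
    while (count--)
        out[n++] = *data++;
    return n;  /* bytes written to the output stream */
}
```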

I only use a byte for the data_count. That means I can only support 255 command bytes at most. Is that a problem? Well no, because as mentioned above, a single write takes 42 CPU cycles, and there are about 108 CPU cycles in a single tick at 44.1 kHz. Therefore, you couldn’t physically send more than 2 bytes to the SN76489 in a single tick. The third byte would already trickle over into the next tick. So if I were ever to encounter more than 255 bytes with no delays, then I could just add a delay of 1 in my stream, and split up the commands. In practice it is highly unlikely that you’ll ever encounter this. There are only 4 different channels on the chip, and the longest commands you can send are two bytes. You might also want to adjust the volume of each channel, which is 1 byte. So worst-case, you’d probably send 12 bytes at a time to the chip. Then you’d want a delay so you could actually hear the change take effect.

That’s all there is to it! This system can now play VGM data at the resolution of 44.1 kHz, with the only limitation being that you can’t have too many commands in too short a period of time, because the CPU and/or the SN76489 chip will not be able to keep up.

Well, not really, because there is a third challenge:

  3. VGM files (or the preprocessed data derived from them) may exceed 64k (a single segment) or even 640k (the maximum amount of conventional memory in an 8088 system).

Initially I wanted to just accept these limitations, load as much of the file into memory as possible, and play only that portion. But then I figured: technically this routine is a ‘background’ routine, since it is entirely driven by an interrupt, and I can still run other code in the ‘foreground’, as long as the music doesn’t keep the CPU too busy.

This brought me back to the earlier experiment with streaming PWM/PCM data to PC speaker and Covox. The idea of loading the data into a ringbuffer of 64k and placing the int handler inside this ringbuffer makes a lot of sense in this scenario as well.

Since the data is all preprocessed, the actual interrupt handler is very compact and simple, and copying it around is very little overhead. The data rate should also be relatively low, unless VGMs use a lot of samples. In most cases, a HDD, or even a floppy, should be able to keep up with the data. So I gave it a try, and indeed, it works:

Or well, it would, if the floppy could keep up! This is a VGM capture of the music from Skate or Die, by Rob Hubbard. It uses samples extensively, so it is a bit of ‘worst case’ for my player. But as you can hear, it plays the samples properly, even while it is loading from disk. It only messes up when there’s a buffer underrun, but eventually recovers. Simpler VGM files play perfectly from floppy. Sadly this machine does not have a HDD, so I will need to try the Skate or Die music again some other time, when I have installed the card into a system with a HDD. I’m confident that it will then keep up and play the music perfectly.

But for now, I have other plans. They are also music-related, and I hope to have a quick demonstration of those before long.

Posted in Oldskool/retro programming | Leave a comment

Programming: What separates the men from the boys?

I have come up with the following list of topics:

  • Pointer arithmetic
  • Unicode vs ASCII strings
  • Memory management
  • Calling conventions
  • Basic mathematics, such as linear algebra (eg 2d rotations, translations and scaling, things you’ll regularly find in basic GUI stuff).
  • Multithreading/concurrency

Over time I have found that some programmers master such concepts at an early stage in their career, while others continue to struggle with these things for the rest of their lives.

Thoughts? Some more topics we can add to the list?


Posted in Oldskool/retro programming, Software development | 21 Comments

Any real-keeping lately?

The 5-year anniversary of my inaugural ‘Just keeping it real’-article came and went. Has it been that long already? It’s also been quite some time since I’ve last written about some oldskool coding, or even anything at all. Things have been rather busy.

However, I have still been doing things on and off. So I might as well give a short overview of things that I have been doing, or things that I may plan to do in the future.

Libraries: fortresses of knowledge

One thing that has been an ongoing process, has been to streamline my 8088-related code, and to extract the ‘knowledge’ from the effects and utilities that I have been developing into easy-to-use include and source files for assembly and C. Basically I want to create some headers for each chip that you regularly interact with, such as the 8253, the 8259A and the 6845. And at a slightly higher level, also routines for dealing with MDA, Hercules, CGA, EGA and VGA, and also audio, such as the PC speaker or the SN76489.

For example, to make it easy to set up a timer interrupt at a given frequency, or to enable/disable auto-EOI mode, or to perform horizontal or vertical sync for low-level display hacking like in 8088 MPH. That is the real challenge here. The header files should be easy to use, while at the same time giving maximum control and performance.
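As a trivial example of the kind of constants and macros I mean, here is the reload value for a desired timer interrupt rate (names are illustrative; the real macro would also write the mode byte to port 43h and the divisor, low byte first, to port 40h):

```c
#include <stdint.h>

/* 8253/8254 PIT: channel 0 counts down from a reload value at
   1193182 Hz and fires IRQ0 each time it reaches zero. */
#define PIT_CLOCK_HZ 1193182UL

/* Reload value for a given interrupt rate in Hz;
   a reload value of 0 means 65536, i.e. the default 18.2 Hz. */
#define PIT_DIVISOR(hz) ((uint16_t)(PIT_CLOCK_HZ / (hz)))
```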

I am somewhat inspired by the Amiga’s NDK. It contains header files that allow easy access to all the custom chip registers. For some reason, something similar is not around for PC hardware, as far as I know. There is very extensive documentation, such as Ralf Brown’s Interrupt List and the Bochs ports list, but it is not in a format that can be readily used in a programming environment. So I would like to make a list of constants that describes all registers and flags, in a way that they can be used immediately in a programming context. In this case that means assembly and C, but it should be easy to take a header file and translate it to another language. (In fact, I currently write the assembly headers first, then convert them to C.) On top of that, I try to use macros where possible, to add basic support routines. Macros have the advantage that they are inlined in the code, so there is no calling overhead. If you design your macro just right, it can be just as efficient as hand-written code. It can even take care of basic loop unrolling and such.

Once this library gets more mature, I might release it for everyone to use and extend.

Standards are great, you can never have too many of them

As I was creating these header files, I came to the conclusion that I was doing it wrong, at least, for the graphics part. Namely, when I first started doing my oldskool PC graphics stuff, I started with VGA and then worked my way down to the older standards. I created some basic library routines in C, where I considered EGA to be a subset of VGA, and CGA a subset of EGA in turn. I tried to create a single set of routines that could work in CGA, EGA or VGA mode, depending on a #define that you could set. Aside from that, I also added a Hercules mode, which didn’t quite fit in there, since Hercules is not an IBM standard, and is not compatible at all.

There are two problems with that approach:

  1. As we know from software such as 8088 MPH, EGA and VGA are in fact not fully backward compatible with CGA at all. Where CGA uses a real 6845 chip, EGA and VGA do not. So some of the 6845 registers are repurposed/redefined on EGA/VGA. Various special modes and tricks work entirely differently on CGA than they do on EGA or VGA (eg, you can program an 80×50 textmode on all, but not in the same way).
  2. If you set a #define to select the mode in which the library operates, then by definition it can only operate in one mode at a time. This doesn’t work for example in the scenario where you want to be able to support multiple display adapters in a single program, and allow the user to select which mode to use (you could of course build multiple programs, one for each mode, and put them behind some menu frontend or such. Various games actually did that, so you often find separate CGA, EGA, VGA and/or Tandy executables. But it is somewhat cumbersome). Another issue is that certain videocards can actually co-exist in a single system, and can work at the same time (yes, early multi-monitor). For example, you can combine a ‘monochrome’ card with a ‘color’ card, because the IBM 5150 was originally designed that way, with MDA and CGA. They each used different IO and memory ranges, so that both could be installed and used at the same time. By extension, Hercules can also be used together with CGA/EGA/VGA.

So now that I have seen the error of my ways, I have decided to only build header files on top of other header files when they truly are supersets. For example, I have a 6845 header file, and MDA, Hercules and CGA use this header. That is because they all use a physical 6845 chip. For EGA and VGA, I do not use it. Also, I use separate symbol names for all graphics cards. For example, I don’t just make a single WaitVBL-macro, but I make one specific for every relevant graphics card. So you get a CGA_WaitVBL, a HERC_WaitVBL etc. You can still masquerade them behind a single alias yourself, if you so please. But you can also use the symbols specific to each given card side-by-side.

And on his farm he had some PICs, E-O, E-O-I

The last oldskool article I did was focused on the 8259A Programmable Interrupt Controller, and its automatic End-of-Interrupt functionality. At the time I already mentioned that it would be interesting for high-frequency timer interrupt routines, such as playing back digital audio on the PC speaker. That was actually the main reason why I was interested in shaving a few cycles off. I have since used the auto-EOI code in a modified version of the chipmod routine from the endpart of 8088 MPH. Instead of the music player taking all CPU, it can now be run from a timer interrupt in the background. By reducing the mixing rate, you can free up some time to do graphics effects in the foreground.

That routine was the result of some crazy experimentation. Namely, for a foreground routine, the entire CPU is yours. But when you want to run a routine that triggers from an interrupt, then you need to save the CPU state, do your routines, and then restore the CPU state. So the less CPU state you need to save, the better. One big problem with the segmented memory model of the 8088 is how to get access to your data. When the interrupt triggers, the only segment you can be sure of is the code segment. You have no idea what DS and ES are pointing to. You can have some control over SS, because you can make sure that your entire program only uses a single segment for the stack throughout.

So one idea was to reserve some space at the top of the stack, to store data there. But then I figured that it might be easier to just use self-modifying code to store data directly in the operands of some instructions.

Then I had an even better idea: what if I were to use an entire segment for my sample data? It can effectively be a 64k ringbuffer, where wraparound is automatic, saving the need to do any boundschecking on my sample pointer. It is a 16-bit pointer, so it will wrap around automatically. And what if I would put this in the code segment? I only need a few bytes of code for an interrupt handler that plays a sample, increments the sample pointer, and returns from the interrupt. I can divide the ringbuffer into two halves. When the sample pointer is in the low half, I put the interrupt handler in the high half, and when the sample pointer crosses into the high half, I move the interrupt handler to the low half.

Since each half is so large, I do not need to check at every single sample. I can just do it in the logic of the foreground routine, every frame or such. This makes it a very efficient approach.
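The foreground check boils down to a single bit test on the 16-bit sample pointer. A sketch of the logic in C (the real code does this in assembly, and ‘placing’ the handler means copying its few bytes of code into the other 32k half of the buffer):

```c
#include <stdint.h>

enum half { LOW_HALF, HIGH_HALF };

/* Which 32k half of the 64k ring buffer is the pointer in?
   Bit 15 of the 16-bit pointer tells us directly. */
static enum half ptr_half(uint16_t sample_ptr)
{
    return (sample_ptr & 0x8000) ? HIGH_HALF : LOW_HALF;
}

/* Foreground check, run once per frame or so: keep the interrupt
   handler in the half that playback is NOT currently in, so the
   playback pointer never runs into the handler's code bytes. */
static enum half place_handler(uint16_t sample_ptr)
{
    return ptr_half(sample_ptr) == LOW_HALF ? HIGH_HALF : LOW_HALF;
}
```

The wraparound itself costs nothing: incrementing a `uint16_t` past 0xFFFF lands back at 0 by definition, which is exactly what a 16-bit pointer does on the 8088.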

I also had this idea of placing the interrupt handlers in segment 0. The advantage here is that CS will point to 0, which means that you can modify the interrupt vector table directly, by just doing something like mov cs:[20h], offset myhandler. This allows you to have a separate handler for each sample, and include the sample in the code itself, so the code effectively becomes the sample buffer. At the time I thought it might be too much of a hassle, but then reenigne suggested the exact same thing, so I thought about it once more. There may be something here yet.

I ended up giving it a try. I decided to place my handlers 32 bytes apart, which was enough room for a handler that plays a sample and updates the interrupt vector. The advantage of spacing all handlers evenly in memory is that the instruction that loads the sample sits in the same place in each of them, so the samples were all spaced 32 bytes apart as well. This made it easy to address these samples, and update them with self-modifying code from a mixing loop. It required some tricky code that backs up the existing interrupt vector table, then disables all interrupts except irq 0 (the timer interrupt), and restores everything upon exit. But after some trial-and-error I managed to get it working.
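The addressing scheme is easy to model: with every handler padded to a 32-byte stride, the sample operand of handler n sits at a fixed offset from the start of handler n. A sketch (`memory[]` stands in for segment 0, and the offset of the sample operand within a handler is illustrative, since it depends on the actual handler code):

```c
#include <stdint.h>

#define HANDLER_STRIDE 32  /* each handler is padded to 32 bytes */
#define SAMPLE_OFS      2  /* offset of the sample immediate within a
                              handler (illustrative value) */

/* Stand-in for the block of segment 0 where the handlers live. */
static uint8_t memory[256 * HANDLER_STRIDE];

/* The mixing loop updates sample n by patching the immediate operand
   of the 'load sample' instruction in handler n: self-modifying code. */
static void write_sample(uint16_t base, uint16_t n, uint8_t sample)
{
    memory[base + n * HANDLER_STRIDE + SAMPLE_OFS] = sample;
}
```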

As we were discussing these routines, we were wondering if this would perhaps be good enough as a ‘replacement’ for Trixter’s Sound Blaster routines in 8088 Corruption and 8088 Domination. Namely, the Sound Blaster was the only anachronism in these productions, because streaming audio would have been impossible without the Sound Blaster and its DMA functionality.

So I decided to make a proof-of-concept streaming audio player for my 5160:

As you can see, or well, hear, it actually works quite well. At least, with the original controller and Seagate ST-225, as in my machine. Apparently this system uses the DMA controller, and as such, disk transfers can work in the background nicely. It introduces a small amount of jitter in the sample playback, since the DMA steals bus cycles. But for a 4.77 MHz 8088 system, it’s quite impressive just how well this works. With other disk controllers you may get worse results, when they use PIO rather than DMA. Fun fact: the floppy drive also uses DMA, and the samples are of a low enough bitrate that they can play from a floppy as well, without a problem.

Where we’re going, we don’t need… Wait, where are we going anyway?

So yes, audio programming. That has been my main focus since 8088 MPH. Because, aside from the endpart, the weakest link in the demo is the audio. The beeper is just a very limited piece of hardware. There must be some kind of sweet-spot somewhere between the MONOTONE audio and the chipmod player of the endpart. Something that sounds more advanced than MONOTONE, but which doesn’t hog the entire CPU like the chipmod player, so you can still do cool graphics effects.

Since there has not been any 8088 production to challenge us, audio still remains our biggest challenge. Aside from the above-mentioned disk streaming and background chipmod routine, I also have some other ideas. However, to actually experiment with those, I need to develop a tool that lets me compose simple music and sound effects. I haven’t gotten too far with that yet.

We could also approach it from a different angle, and use some audio hardware. One option is the Covox LPT DAC. It will require the same high-frequency timer interrupt trickery to bang out each sample. However, the main advantage is that it does not use PWM, and therefore it has no annoying carrier wave. This means that you can get more acceptable sound, even at relatively low sample rates.

A slightly more interesting option is the Disney Sound Source. It includes a small 16-byte buffer. It is limited to 7 kHz playback, but at least you won’t need to send every sample individually, so it is less CPU-intensive.

Yet another option is looking at alternative PC-compatible systems. There’s the PCjr and Tandy, which have an SN76489 chip on board. This allows 3 square wave channels and a noise channel. Aside from that, you can also make any of the square wave channels play 4-bit PCM samples relatively easily (and again no carrier wave). Listen to one of Rob Hubbard’s tunes on it, for example:

What is interesting is that there’s a home-brew Tandy clone card being developed as we speak. I am building my own as well. This card allows you to add an SN76489 chip to any PC, making its audio compatible with Tandy/PCjr. It would be very interesting if this card became somewhat ‘standard’ for demoscene productions.

(Why not just take an AdLib, you ask? Well, various reasons. For one, it was rather late to the market, not so much an integral part of 8088 culture. Also, it’s very difficult and CPU-consuming to program. Lastly, it’s not as easy to play samples on as the other devices mentioned. So the SN76489 seems like a better choice for the 8088. The fact that it was also used in the 8088-based PCjr and Tandy 1000 gives it some extra ‘street cred’).
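For reference, the SN76489 programming model that makes it so attractive here is very simple: a tone is a 10-bit divider of the 3.579545 MHz clock, sent as a ‘latch’ byte plus a ‘data’ byte. A sketch of the encoding (function names are mine):

```c
#include <stdint.h>

#define SN_CLOCK 3579545UL  /* NTSC colour burst clock, as on PCjr/Tandy */

/* 10-bit tone divider for a frequency in Hz: f = clock / (32 * n). */
static uint16_t sn_divider(uint32_t hz)
{
    return (uint16_t)(SN_CLOCK / (32UL * hz)) & 0x3FF;
}

/* Latch byte: 1 rrr dddd, where rrr is the register number
   (channel * 2 for tone) and dddd the low 4 bits of the divider.
   Data byte: 0 0 dddddd, the high 6 bits of the divider. */
static void sn_tone_bytes(uint8_t channel, uint16_t div, uint8_t out[2])
{
    out[0] = (uint8_t)(0x80 | ((channel * 2) << 4) | (div & 0x0F));
    out[1] = (uint8_t)((div >> 4) & 0x3F);
}
```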

Aside from that, I also got myself an original Hercules GB102 card some time ago. I don’t think it would be interesting to do another demo on exactly the same platform as 8088 MPH. Instead, it would be more interesting to explore other hardware from the 8088 era. The Hercules is also built around a 6845 chip, so some of the trickery done in 8088 MPH may be translated to Hercules. At the same time, the Hercules also offers unique features, such as 64 kB of memory, arranged in two pages of 32 kB. So we may be able to make it do some tricks of its own. Sadly, it would not be a ‘world’s first’ Hercules demo, because someone already beat me to it some months ago:

Posted in Oldskool/retro programming, Software development | 6 Comments

AMD Zen: a bit of a deja-vu?

AMD has released the first proper information on their new Zen architecture. Anandtech seems to have done some of the most in-depth coverage, as usual. My first impression is that of a deja-vu… in more than one way.

Firstly, it reminds me of what AMD did a few years ago on the GPU front: they ditched their VLIW-based architecture, and moved to a SIMD-based architecture which was remarkably similar to nVidia’s (nVidia had been using SIMD-based architectures since the 8800GTX). In this case, Zen seems to follow Intel’s Core i7 architecture quite closely. They are moving back to high-IPC cores, just as in their K7/K8 heyday (which at the time followed Intel’s P6 architecture closely), and they seem to target lower clockspeeds, around the 3-4 GHz area where Intel also operates. They are also adopting a micro-op cache, something that Intel has been doing for a long time.

Secondly, AMD is abandoning their CMT approach, and going for a more conventional SMT approach. This is another one of those “I told you so” moments. Even before Bulldozer was launched, I already said that having 2 ALUs hardwired per core is not going to work well. Zen is now using 4 ALUs per two logical cores, so technically they still have the same number of ALUs per ‘module’. However, like the Core i7, they can now use all 4 ALUs for each thread, so you get much better IPC for single threads. This again is something I said a few years ago already. AMD apparently agrees with that. Their fanbase did not, sadly.

We can only wonder why AMD did not go for SMT right away with Bulldozer. I personally think that AMD knew all along that SMT was the better option. However, their CMT was effectively a ‘lightweight’ SMT, where only the FPU portion did proper SMT. I think it may be a combination of two factors here:

  1. SMT was originally developed by IBM, and Intel has been using their HyperThreading variation for many years. Both companies have collected various patents on the technology over the years. Perhaps for AMD it was not worthwhile to use fullblown SMT, because it would touch on too many patents and the licensing costs would be prohibitive. It could be that some of these patents have now expired, so the equation has changed to AMD’s favour. It could also be that AMD is now willing to take a bigger risk, because they have to get back in the CPU race at all cost.
  2. Doing a fullblown SMT implementation for the entire CPU may have been too much of a step for AMD in a single generation. AMD only has a limited R&D budget, so they may have had to spread SMT out over two generations. We don’t know how long it took Intel to develop HyperThreading, but we do know that even though their first implementation in the Pentium 4 worked well enough in practice, there were still various small bugs and glitches in it. Not necessarily stability-wise, but also security-wise. The concept of SMT is not that complicated, but shoehorning it into the massively complex x86 architecture, which has tons of legacy software that needs to continue working flawlessly, is an entirely different matter. This is quite a risky undertaking, and proper validation can take a long time.

At any rate, Zen looks more promising than Bulldozer ever did. I think AMD made a wise choice in going back to ‘follow the leader’-mode. Not necessarily because Intel’s architecture is the right one, but because Intel’s architecture is the most widespread one. I have said the same thing about Pentium 4 in the past: the architecture itself was not necessarily as bad as people think. Its biggest disadvantage was that it did not handle code optimized for the P6-architecture very well, and most applications had been developed for P6. If all applications would be recompiled with Pentium 4 optimizations, it would already have made quite a different impression. Let alone if developers actually optimized their code specifically for Pentium 4’s strengths (something we mainly saw with video encoding/decoding and 3D rendering).

Bulldozer was facing a similar problem: it required a different type of software. If Intel couldn’t pull off a big change in software optimization with the Pentium 4, then a smaller player like AMD certainly wouldn’t either. That is the main reason why I never understood Bulldozer.

Posted in Hardware news | 25 Comments

DirectX 12 and Vulkan: what it is, and what it isn’t

I often read comments in the vein of: “… but vendor A’s hardware is designed more for DX12/Vulkan than vendor B’s”. It’s a bit more complicated than that, because it is somewhat of a chicken-and-egg problem. So I thought I’d do a quick blog to try and explain it.

APIs vs hardware features

A large part of the confusion seems to be because the capabilities of hardware tend to be categorized by versions of the DirectX API. In a way that makes sense, since each new version of the DirectX API also introduces support for new hardware features. So this became a de-facto way of categorizing the hardware capabilities of GPUs. Since DirectX 11, we even have different feature levels that can be referred to.

The main new features in DirectX 12 are Conservative Rasterization, Volume Tiled Resources and Rasterizer Ordered Views. However, these have been ‘backported’ to DirectX 11.3 as well, so apparently they are not specific to the DirectX 12 API.

But what is an API really? API stands for Application Programming Interface. It is the ‘thing’ that you ‘talk to’ when programming something, in this case graphics. And the ‘new thing’ about DirectX 12, Vulkan (and Metal and Mantle) is that the interface follows a new paradigm, a new programming model. In earlier versions of DirectX, the driver was responsible for tasks such as resource management and synchronization (eg, if you first render to a buffer, and later want to use that buffer as a texture on some surface, the driver makes sure that the rendering to the buffer is complete before rendering with the texture starts).

These ‘next-gen’ APIs however, work on a lower level, and give the programmer control over such tasks. Leaving it to the driver can work well in the general case, and makes things easier and less error-prone for the programmer. However, the driver has to work with all software, and will use a generic approach. By giving the programmer fine-grained control over the hardware, these tasks can be optimized specifically for an engine or game. This way the programmer can shave off redundant work and reduce overhead on the CPU side. The API calls are now lighter and simpler, because they don’t have to take care of all the bookkeeping, validation and other types of management. These have now been pushed to the engine code instead. On the GPU side, things generally stay the same however, but more on that later.

Command lists

Another change in the programming model is that GPU commands are now ‘decoupled’ at the API side: Sending rendering commands to the GPU is now a two-step process:

  1. Add all the commands you want to execute to a list
  2. Execute your command list(s)
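The difference with immediate-mode APIs is easiest to see in a toy model: recording does no GPU work at all, it just appends to a buffer, and only execution plays the commands back in order (this is plain C with made-up command types, not the actual DirectX 12 API):

```c
#include <stdint.h>
#include <stddef.h>

/* A toy model of the two-step model: commands are first recorded
   into a list, then the whole list is executed in order. */
enum cmd_op { CMD_CLEAR, CMD_DRAW };

struct cmd { enum cmd_op op; uint32_t arg; };

struct cmd_list {
    struct cmd cmds[64];  /* no bounds check in this sketch */
    size_t     count;
};

static void cl_init(struct cmd_list *cl) { cl->count = 0; }

/* Step 1: recording just appends; nothing is submitted yet. */
static void record(struct cmd_list *cl, enum cmd_op op, uint32_t arg)
{
    cl->cmds[cl->count].op  = op;
    cl->cmds[cl->count].arg = arg;
    cl->count++;
}

/* Step 2: execution plays the list back in order. */
static uint32_t execute(const struct cmd_list *cl)
{
    uint32_t drawn = 0;
    size_t i;
    for (i = 0; i < cl->count; i++)
        if (cl->cmds[i].op == CMD_DRAW)
            drawn += cl->cmds[i].arg;
    return drawn;  /* e.g. total primitives 'drawn' */
}
```

Because recording only touches its own list, each thread can record its own list in parallel; and because execution is a separate step, a recorded list can simply be executed again next frame without re-recording it.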

Classic rendering APIs are ‘immediate’ renderers: with an API call you send the command directly to the driver/GPU. Drivers might internally buffer this, and create/optimize their own command lists, but this is transparent to the programmer. A big problem in this programming model is that the order in which the commands are executed is important. That basically means that you can only use a single thread to send commands. If you were to use multiple threads, you’d have to synchronize them so that they all sent their commands in order, which basically would mean they’d run one after another, so you might as well use a single thread.

DirectX 11 tried to work around this by introducing ‘deferred’ contexts. You would have one ‘immediate’ context, which would execute all commands immediately. But you could create additional contexts, which would buffer commands in a list, which you could later hand down to the immediate context to execute.

There were however two problems with this approach:

  1. The deferred contexts supported only a subset of all commands
  2. Only nVidia’s implementation managed to get significant performance from this

To clarify that second point, Futuremark built an API overhead test, which includes tests with DX11 using immediate and deferred contexts, with a single or multiple threads. See Anandtech’s review of this test.

In that test, this feature does absolutely nothing on AMD hardware. They are stuck at 1.1M calls regardless of which technique you use, or how many cores you throw at it.

With nVidia however, you see that with 4 or 6 cores, it goes up to 2.2M-2.3M calls. Funny enough, nVidia’s performance on the single-threaded DX11 code also goes up with the 6-core machine, so the total gains from this technique are not very dramatic. Apparently nVidia already performs some parallel processing inside the driver.

DirectX 12 takes this concept further. You now have a command queue, in which you can queue up command lists, which will be executed in-order. The commands inside the command list will also be executed in-order. You can create multiple command lists, and create a thread for each list, to add the commands to it, so that they can all work in parallel. There are no restrictions on the command lists anymore, like there were with the deferred context in DX11 (technically you no longer have an ‘immediate’ context in DX12, they are all ‘deferred’).

An added advantage is that you can re-use these command lists. In various cases, you want to send the same commands every frame (to render the same objects and such), so you can now remove redundant work by just using the same command list over and over again.

Honourable mention for Direct3D 1 here: The first version of Direct3D actually used a very similar concept to command lists, known as ‘execute buffers’. You would first store your commands as bytecode in an ‘execute buffer’, and then execute the buffer. Technically this could be used in multi-threaded environments in much the same way: use multiple threads, which each fill their own execute buffer in parallel.

Asynchronous compute

Why is there a queue for the command lists, you might ask? Can’t you just send the command lists directly to an Execute( myList ) function? The answer is: there can be more than one queue. You can see this as a form of ‘GPU multithreading’: you can have multiple command lists executing at the same time. If you want to compare it to CPU multithreading, you could view a command queue as a thread, and a command list as an instruction stream (a ‘ThreadProc’ that is called when the thread is running).

There are three different classes of queues and command lists:

  1. Graphics
  2. Compute
  3. DMA/Copy

The idea behind this is that modern GPUs are capable of performing multiple tasks at the same time, since they use different parts of the GPU. Eg, you can upload a texture to VRAM via DMA while you are also rendering and/or performing compute tasks (previously this was done automatically by the driver).

The most interesting new feature here is that you can run a graphics task and a compute task together. The classic example of how you can use this is rendering shadowmaps; Shadowmaps do not need any pixel shading, they just need to store a depth value. So you are mainly running vertex shaders and using the rasterizer. In most cases, your geometry is not all that complex, so there are relatively few vertices that need processing, leaving a lot of ALUs on the GPU sitting idle. With these next-gen APIs you can now execute a compute task at the same time, and make use of the ALUs that would otherwise sit idle (compute does not need the rasterizer).

This is called ‘asynchronous’ compute because, like with conventional multithreading on the CPU, you are scheduling two (or more) tasks to run concurrently, and you don’t really care about which order they run in. They can run at the same time if the hardware is capable of it, or one-after-another, or they can switch multiple times (time-slicing) until they are both complete. (On CPUs there are a number of ways to run multiple threads: single-core, multi-core, multi-CPU, HyperThreading; the OS will use a combination of techniques to schedule threads on the available hardware. See also my earlier blog.) You may care about the priority of each task, so that you can allocate more resources to one of them to make it complete faster. But in general, they run asynchronously. You re-synchronize by checking that both tasks have triggered the event that signals their completion.

Now that the introduction is over…

So, how does it look when you actually want to render something? Well, let’s have a (slightly simplified) look at rendering an object in DirectX 11:

// Set the viewport and scissor rectangle.
D3D11_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_immediateContext->RSSetViewports(1, &viewport);
m_immediateContext->RSSetScissorRects(1, &m_scissorRect);

// Send drawing commands.
ID3D11RenderTargetView* renderTargetView = m_deviceResources->GetRenderTargetView();
ID3D11DepthStencilView* depthStencilView = m_deviceResources->GetDepthStencilView();
m_immediateContext->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue);
m_immediateContext->ClearDepthStencilView(depthStencilView, D3D11_CLEAR_DEPTH, 1.0f, 0);

m_immediateContext->OMSetRenderTargets(1, &renderTargetView, depthStencilView);

UINT stride = sizeof(Vertex), offset = 0;
m_immediateContext->IASetVertexBuffers(0, 1, m_vertexBuffer.GetAddressOf(), &stride, &offset);
m_immediateContext->IASetIndexBuffer(m_indexBuffer.Get(), DXGI_FORMAT_R16_UINT, 0);
m_immediateContext->DrawIndexedInstanced(36, 1, 0, 0, 0);

And in DirectX 12 it would look like this (again, somewhat simplified):

// Set the viewport and scissor rectangle.
D3D12_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_commandList->RSSetViewports(1, &viewport);
m_commandList->RSSetScissorRects(1, &m_scissorRect);

// Indicate this resource will be in use as a render target.
CD3DX12_RESOURCE_BARRIER renderTargetResourceBarrier =
    CD3DX12_RESOURCE_BARRIER::Transition(m_renderTarget.Get(),
        D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
m_commandList->ResourceBarrier(1, &renderTargetResourceBarrier);

// Record drawing commands.
D3D12_CPU_DESCRIPTOR_HANDLE renderTargetView = m_deviceResources->GetRenderTargetView();
D3D12_CPU_DESCRIPTOR_HANDLE depthStencilView = m_deviceResources->GetDepthStencilView();
m_commandList->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue, 0, nullptr);
m_commandList->ClearDepthStencilView(depthStencilView, D3D12_CLEAR_FLAG_DEPTH, 1.0f, 0, 0, nullptr);

m_commandList->OMSetRenderTargets(1, &renderTargetView, false, &depthStencilView);

m_commandList->IASetVertexBuffers(0, 1, &m_vertexBufferView);
m_commandList->IASetIndexBuffer(&m_indexBufferView);
m_commandList->DrawIndexedInstanced(36, 1, 0, 0, 0);

// Indicate that the render target will now be used to present when the command list is done executing.
CD3DX12_RESOURCE_BARRIER presentResourceBarrier =
    CD3DX12_RESOURCE_BARRIER::Transition(m_renderTarget.Get(),
        D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
m_commandList->ResourceBarrier(1, &presentResourceBarrier);


// Close the command list and execute it.
m_commandList->Close();
ID3D12CommandList* ppCommandLists[] = { m_commandList.Get() };
m_deviceResources->GetCommandQueue()->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

As you can see, the actual calls are very similar. The functions mostly have the same names, and even the parameters are mostly the same. At a higher level, most of what you do is exactly the same: you use a rendertarget and a depth/stencil surface, and you set up a viewport and scissor rectangle. Then you clear the rendertarget and depth/stencil for a new frame, and send a list of triangles to the GPU, stored in a vertex buffer and index buffer pair. (You would already have initialized a vertex shader and pixel shader at an earlier stage, and uploaded the geometry to the vertex and index buffers, but I left those parts out for simplicity. That code is again very similar between the APIs, although DirectX 12 requires a bit more of it, because you have to tell the API in more detail what you actually want. Uploading the geometry also requires a command list there.)
So what the GPU actually has to do, is exactly the same, regardless of whether you use DirectX 11 or DirectX 12. The differences are mainly on the CPU-side, as you can see.

The same argument also extends to Vulkan. The code may look a bit different from what you’re doing in DirectX 12, but in essence, you’re still creating the same vertex and index buffers, and sending the same triangle list draw command to the GPU, rendering to a rendertarget and depth/stencil buffer.

So, what this means is that you do not really need to ‘design’ your hardware for DirectX 12 or Vulkan at all. The changes are mainly on the API side, and affect the workload of the CPU and driver, not the GPU. Which is also why DirectX 12 supports feature levels of 11_x: the API can also support hardware that pre-dates DirectX 12.

Chickens and eggs

But, how exactly did these new hardware features arrive in the DirectX 11.3 and 12 APIs? And why exactly did this new programming model emerge in these new APIs?

The first thing to point out is that Microsoft does not develop hardware. This means that Microsoft can’t just think up new hardware features out-of-the-blue and hope that hardware will support it. For each update of DirectX, Microsoft will have meetings with the big players on the hardware market, such as Intel, nVidia and AMD (in recent years, also Qualcomm, for mobile devices). These IHVs will give Microsoft input on what kind of features they would like to include in the next DirectX. Together with Microsoft, these features will then be standardized in a way that they can be implemented by all IHVs. So this is a somewhat democratic process.

Aside from IHVs, Microsoft also includes some of the larger game engine developers, to get more input from the software/API side of things. Together they will try to work out solutions to current problems and bottlenecks in the API, and also work out ways to include the new features presented by the IHVs. In some cases, the IHVs will think of ways to solve these bottlenecks by adding new hardware features. After all, game engines and rendering algorithms also evolve over time, as does the way they use the API and hardware. For example, in the early days of shaders, you wrote separate shaders for each material, and these used few parameters. These days, you tend to use the same shader for most materials, and only the parameters change. So at one point, switching between shaders efficiently was important, but now updating shader parameters efficiently is more important. Different requirements call for different APIs (so it’s not like older APIs were ‘bad’, they were just designed against different requirements, a different era with different hardware and different rendering approaches).

So, as you can see, it is a bit of cross-pollination between all the parties. Sometimes an IHV comes up with a new approach first, and it is included in the API. Other times, developers come up with problems/bottlenecks first, and then the API is modified, and hardware redesigned to make it work. Since not all hardware is equal, it is also often the case that one vendor had a feature already, while others had to modify their hardware to include it. So for one vendor, the chicken came first, for others the egg came first.

The development of the API is an iterative process, where the IHVs will work closely with Microsoft over a period of time, to make new hardware and drivers available for testing, and move towards a final state of the API and driver model for the new version of DirectX.

But what does it mean?

In short, it means it is pretty much impossible to say how much of a given GPU was ‘designed for API X or Y’, and how much of the API was ‘designed for GPU A or B’. It is a combination of things.

In DirectX 12/Vulkan, it seems clear that Rasterizer Ordered Views came from Intel, since they already had the feature on their DirectX 11 hardware. It looks like nVidia designed this feature into Maxwell v2 for the upcoming DX11.3 and DX12 APIs. AMD has yet to implement the feature.

Conservative rasterization is not entirely clear. nVidia was the first to market with the feature, in Maxwell v2. However, Intel followed not much later, and implemented it at Tier 3 level. So I cannot say with certainty whether it originated from nVidia or Intel. AMD has yet to implement the feature.

Asynchronous compute is a bit of a special case. The API does not consider it a specific hardware feature, and leaves it up to the driver how to handle multiple queues. The idea most likely originated from AMD, since they have had support for running graphics and compute at the same time since the first GCN architecture, and the only way to make good use of that is to have multiple command queues. nVidia added limited support in Maxwell v2 (they had asynchronous compute support in CUDA since Kepler, but they could not run graphics tasks in parallel), and more flexible/efficient support in Pascal. Intel has yet to support this feature (that is, they support code that uses multiple queues, but as far as I know, they cannot actually run graphics and compute tasks in parallel, so they cannot use it to improve performance by better ALU usage).

Also, you can compare performance differences from DirectX 11 to 12, or from OpenGL to Vulkan… but it is impossible to draw conclusions from these results. Is the DirectX 11 driver that good, or is the DirectX 12 engine that bad for a given game/GPU? Or perhaps the other way around? Was OpenGL that bad, and is the Vulkan engine that good for a given game/GPU?

Okay, but what about performance?

The main advantage of this new programming model in DirectX 12/Vulkan is also a potential disadvantage. I see a parallel with the situation of compilers versus hand-written assembly on the CPU: it is possible for an assembly programmer to outperform a compiler, but there are two main issues here:

  1. Compilers have become very good at what they do, so you have to be REALLY good to even write assembly that is on par with what a modern compiler will generate.
  2. Optimizing assembly code can only be done for one CPU at a time. Chances are that the tricks you use to maximize performance on CPU A, will not work, or even be detrimental to performance on CPU B. Writing code that works well on all CPUs is even more difficult. Compilers however are very good at this, and can also easily optimize code for multiple CPUs, and include multiple codepaths in the executable.

In the case of DirectX 12/Vulkan, you will be taking on the driver development team. In DirectX 11/OpenGL, you had the advantage that the low-level resource management, synchronization and such, was always done by the driver, which was optimized for a specific GPU, by the people who built that GPU. So like with compilers, you had a very good baseline of performance. As an engine developer, you have to design and optimize your engine very well, before you get on par with these drivers (writing a benchmark that shows that you can do more calls per second in DX12 than in DX11 is one thing. Rendering an actual game more efficiently is another).

Likewise, because of the low-level nature of DirectX 12/Vulkan, you need to pay more attention to the specific GPUs and videocards you are targeting. The best way to manage your resources on GPU A might not be the best way on GPU B. Normally the driver would take care of it. Now you may need to write multiple paths, and select the fastest one for each GPU.

Asynchronous compute is especially difficult to optimize for. Running two things at the same time means you have to share your resources. If this is not balanced well, then one task may be starving the other of resources, and you may actually get lower performance than if you would just run the tasks one after another.

What makes it even more complicated is that this balance is specific not only to the GPU architecture, but even to the specific model of video card, to a certain extent. If we take the above example of rendering shadowmaps while doing a compute task (say postprocessing the previous frame)… What if GPU A renders shadowmaps quickly and compute tasks slowly, but GPU B renders the shadowmaps slowly and compute tasks quickly? This would throw off the balance. For example, once the shadowmaps are done, the next graphics task might require a lot of ALU power, and the compute task that is still running will be starving that graphics task.

And things like rendering speed will depend on various factors, including the relative speed of the rasterizers to the VRAM. So, even if two videocards use the same GPU with the same rasterizers, variations in VRAM bandwidth could still disturb the balance.

On Xbox One and PlayStation 4, asynchronous compute makes perfect sense. You only have a single target to optimize for, so you can carefully tune your code for the best performance. On a Windows system however, things are quite unpredictable. Especially looking to the future. Even if you were to optimize for all videocards available today, that is no guarantee that the code will still perform well on future videocards.

So we will have to see what the future brings. Firstly, will engine developers actually be able to extract significant gains from this new programming model? Secondly, will these gains stand the test of time? As in, are these gains still available 1, 2 or 3 generations of GPUs from now, or will some code actually become suboptimal on future GPUs? Code which, when handled by an optimized ‘high-level’ driver such as in DirectX 11 or OpenGL, will actually be faster than the DirectX 12/Vulkan equivalent in the engine code? I think this is a more interesting aspect than which GPU is currently better for a given API.

Posted in Direct3D, Hardware news, OpenGL, Software development, Software news, Vulkan | 11 Comments

FutureMark’s Time Spy: some people still don’t get it

Today I read a review of AMD’s new Radeon RX470 on Tweakers.net, by Jelle Stuip. He used Time Spy as a benchmark, and added the following description:

There has recently been some controversy about 3DMark Time Spy, because it turns out that the benchmark does not use vendor-specific code paths, but a single generic code path for all hardware. FutureMark sees that as a plus: it does not matter which video card you test, because they all follow the same code path when running Time Spy, so you can make fair comparisons. It also means that there is no specific optimization for AMD GPUs, and that AMD’s implementation of asynchronous compute is not fully exploited. In games that do exploit it, the relative standing of AMD and nVidia GPUs can therefore differ from what the Time Spy benchmark represents.

Sorry Jelle, but you don’t get it.

Indeed, Time Spy does not use vendor-specific code paths. However, ‘vendor’ is a misnomer here anyway. I mean, you can write a path specific to AMD’s or nVidia’s current GPU architecture, but that is no guarantee that it is going to be any good on architectures from the past, or architectures from the future. What people of the “vendor specific code path” school of thought really mean is that you need to write architecture-specific paths. In this case, it is not just the microarchitecture itself: the actual configuration of the video card also has a direct effect on how async compute code performs (the balance between the number of shaders, shader performance, memory bandwidth and similar factors).

However, in practice that is not going to happen of course, because that means that games have to receive updates to their code indefinitely, for each new videocard that arrives, until the end of time. So in practice your shiny new videocard will not have any specific paths for yesterday’s games either.

But, more importantly… He completely misinterprets the results of Time Spy. Yes, there is less of a difference between Pascal and Polaris than in most games/benchmarks using Async Compute. However, the reason for that is obvious: Currently Time Spy is the only piece of software where Async Compute is enabled on nVidia devices *and* results in a performance gain. The performance gains on AMD hardware are as expected (around 10-15%). However, since nVidia hardware now benefits from this feature as well, the difference between AMD and nVidia hardware is smaller than in other async compute scenarios.

Important to note also is that both nVidia and AMD are part of FutureMark’s Benchmark Development Program: http://www.futuremark.com/business/benchmark-development-program

As such, both vendors have been actively involved in the development process, had access to the source code throughout the development of the benchmark, and have actively worked with FutureMark on designing the tests and optimizing the code for their hardware. If anything, Time Spy might not be representative of games because it is actually MORE fair than various games out there, which are skewed towards one vendor.

So not only does Time Spy exploit async compute very well on AMD hardware (as AMD themselves attest to here: http://radeon.com/radeon-wins-3dmark-dx12/), but Time Spy *also* exploits async compute well on nVidia hardware. Most other async compute games/benchmarks were optimized by/for AMD hardware alone, and as such do not represent how nVidia hardware would perform with this feature, since it is not even enabled in the first place. We will probably see more games that benefit as much as Time Spy does on nVidia hardware, once they start optimizing for the Pascal architecture as well. And once that happens, we can judge how well Time Spy has predicted the performance. Currently, DX12/Vulkan titles are still too much of a vendor-specific mess to draw any fair conclusions (eg. AoTS and Hitman are AMD-sponsored, ROTR is nVidia-sponsored, DOOM Vulkan doesn’t have async compute enabled for nVidia (yet?), and uses AMD-specific shader extensions).

Too bad, Jelle. Next time, please try to do some research on the topic, and get your facts straight.

Posted in Direct3D, Hardware news, Software development, Software news, Vulkan | 5 Comments

GeForce GTX1060: nVidia brings Pascal to the masses

Right, we can be short about the GTX1060… It does exactly what you’d expect: it scales down Pascal as we know it from the GTX1080 and GTX1070 to a smaller, cheaper chip, aiming at the mainstream market. The card is functionally exactly the same, apart from missing an SLI connector.

But let’s compare it to the competition, the RX480. And as this is a technical blog, I will disregard price. Instead, I will concentrate on the technical features and specs.

Radeon RX480:

Die size: 230 mm²
Process: GloFo 14 nm FinFET
Transistor count: 5.7 billion
Compute performance: ~5.8 TFLOPS
Memory bandwidth: 256 GB/s
Memory bus: 256-bit
Memory size: 4/8 GB
TDP: 150W
DirectX Feature level: 12_0

GeForce GTX1060:

Die size: 200 mm²
Process: TSMC 16 nm FinFET
Transistor count: 4.4 billion
Compute performance: ~4.4 TFLOPS (boost)
Memory bandwidth: 192 GB/s
Memory bus: 192-bit
Memory size: 6 GB
TDP: 120W
DirectX Feature level: 12_1

And well, if we would just go by these numbers, then the Radeon RX480 looks like a sure winner. On paper it all looks very strong. You’d almost think it’s a slightly more high-end card, given the higher TDP, the larger die, higher transistor count, higher TFLOPS rating, more memory and more bandwidth (the specs are ~30% higher than the GTX1060). In fact, the memory specs are identical to that of the GTX1070, as is the TDP.

But that is exactly where Pascal shines: due to the excellent efficiency of the architecture, the GTX1060 is as fast as or faster than the RX480 in pretty much every benchmark you care to throw at it. If it came to a price war, nVidia would easily win: their GPU is smaller, their PCB can be simpler because of the narrower memory interface and lower power consumption, and they can use a smaller, cheaper cooler because they have less heat to dissipate. So the cost of building a GTX1060 will be lower than that of an RX480.

Anyway, speaking of benchmarks…

Time Spy

FutureMark recently released a new benchmark called Time Spy, which uses DirectX 12, and makes use of that dreaded async compute functionality. As you may know, this was one of the points that AMD has marketed heavily in their DX12-campaign, to the point where a lot of people thought that:

  1. AMD was the only one supporting the feature
  2. Async compute is the *only* new feature in DX12
  3. All gains that DX12 gets, come from using async compute (rather than the redesign of the API itself to reduce validation, implicit synchronization and other things that may reduce efficiency and add CPU overhead)

Now, the problem is… Time Spy actually showed that GTX10x0-cards gained performance when async compute was enabled! Not a surprise to me of course, as I already explained earlier that nVidia can do async compute as well. But many people were convinced that nVidia could not do async compute at all, not even on Pascal. In fact, they seemed to believe that nVidia hardware could not even process graphics and compute in parallel, period. And if you take that as absolute truth, then you have to ‘explain’ this by FutureMark/nVidia cheating in Time Spy!

Well, of course FutureMark and nVidia are not cheating, so FutureMark revised their excellent Technical Guide to deal with the criticisms, and also published an additional press release regarding the ‘criticism’.

This gives a great overview of how the DX12 API works with async compute, and how FutureMark made use of this feature to boost performance.

And if you want to know more about the hardware-side, then AnandTech has just published an excellent in-depth review of the GTX1070/1080, and they dive deep into how nVidia performs asynchronous compute and fine-grained pre-emption.

I was going to write something about that myself, but I think Ryan Smith did an excellent job, and I don’t have anything to add to that. TL;DR: nVidia could indeed do async compute, even on Maxwell v2. The scheduling was not very flexible however, which made it difficult to tune your workload to get proper gains. If you got it wrong, you could receive considerable performance hits instead. Therefore nVidia decided not to run async code in parallel by default, but just serialize it. The plan may have been to ‘whitelist’ games that are properly optimized, and do get gains. We see that even in DOOM, the async compute path is not enabled yet on Pascal. But the hardware certainly is capable of it, to a certain extent, as I have also said before. Question is: will anyone ever optimize for Maxwell v2, now that Pascal has arrived?

Update: AMD has put a blog-post online talking about how happy they are with Time Spy, and how well it pushes their hardware with async compute: http://radeon.com/radeon-wins-3dmark-dx12

I suppose we can say that AMD has given Time Spy its official seal-of-approval (publicly, that is. They already approved it within the FutureMark BDP of course).

Posted in Direct3D, Hardware news, OpenGL, Software development, Vulkan | 21 Comments