DMA activation

No, not the pseudoscience stuff: I am talking about Direct Memory Access. More specifically, in the context of the IBM PC and compatibles, which use the Intel 8237A DMA controller.

For some reason, I had never used the 8237A before. I suppose that’s because the DMA controller has very limited use. In theory it can perform memory-to-memory operations, but only between channels 0 and 1, and IBM hardwired channel 0 to perform memory refresh on the PC/XT, so using channel 0 for anything else has its consequences. Aside from that, the 8237A is a 16-bit DMA controller shoehorned into a 20-bit address space, so the DMA controller itself can only address 64k of memory. IBM added external ‘page registers’ for each channel, which store the high 4 bits of the 20-bit address; these are combined with the low 16-bit address from the DMA controller on the bus. This means there are only 16 pages of 64k, aligned to 64k boundaries, so you have to be careful when allocating a buffer for DMA: it must not cross a 64k page boundary, because the 16-bit address simply wraps around within the page. However, since channel 0 was reserved for memory refresh on the PC/XT, they did not add a page register for it. This means that you can only do memory-to-memory transfers within the same 64k page as channel 1, which is not very useful in general.
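To make that concrete, here is a minimal sketch (in C, with hypothetical helper names) of how a 20-bit physical address maps onto the page register and the 8237A’s 16-bit address register, and how you could check that a buffer does not straddle a page boundary:

/* Split a 20-bit physical address into the value for the external page
   register (high 4 bits) and the 16-bit offset that goes into the 8237A's
   address register. */
unsigned char dma_page(unsigned long phys)    { return (unsigned char)(phys >> 16); }
unsigned short dma_offset(unsigned long phys) { return (unsigned short)(phys & 0xFFFF); }

/* A buffer is only usable for DMA if it does not cross a 64k boundary:
   the 8237A would simply wrap around within the page. */
int dma_safe(unsigned long phys, unsigned long length)
{
    return dma_page(phys) == dma_page(phys + length - 1);
}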

On the AT however, they added separate memory refresh circuitry, so channel 0 became available for general use. They also introduced a new page register for it (as well as a second DMA controller for 16-bit DMA transfers, as I also mentioned in this earlier article). So on an AT it may actually work. There is another catch, however: the 8237A was never designed to run at speeds beyond 5 MHz. So where the 8237A runs at the full 4.77 MHz on a regular PC/XT, it runs at half the clock speed on an AT (either 3 or 4 MHz, depending on whether you have a 6 or 8 MHz model). So DMA transfers are actually slower on an AT than on a PC/XT, while at the same time the CPU is considerably faster. Which means that most of the time, you’re better off using the CPU for memory-to-memory transfers.

Therefore, DMA is mostly used for transferring data to/from I/O devices. Its main uses are floppy and hard drive transfers, and audio devices. Being primarily a graphics programmer, I never had any need for that; I needed memory-to-memory transfers. You generally don’t want to implement your own floppy and hard drive handling on the PC, because of the variety of hardware out there. It is better to rely on BIOS or DOS routines, because they abstract the hardware differences away.

It begins

But in the past weeks/months I have finally been doing some serious audio programming, so I eventually arrived at a need for DMA: the Sound Blaster. In order to play high-quality digital audio (up to 23 kHz 8-bit mono) on the Sound Blaster, you have to set up a DMA transfer. The DSP (‘Digital Sound Processor’ in this context, not Signal) on the SB will read the samples from memory via DMA, using an internal timer to maintain a fixed sampling rate. So playing a sample is like a ‘fire-and-forget’ operation: you set up the DMA controller and DSP to transfer N bytes, and the sample will play without any further intervention from the CPU.
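To give an idea of what such a ‘fire-and-forget’ setup looks like, here is a rough sketch in C (not my actual code): it assumes a Sound Blaster at the usual base port 0x220 on DMA channel 1, a DSP that has already been reset, a buffer placed in DMA-safe memory, and the outp()/inp() routines provided by the DOS compilers of the era:

#include <conio.h>   /* outp()/inp() on most DOS compilers */

#define SB_BASE   0x220
#define DSP_WRITE (SB_BASE + 0xC)   /* write command/data; bit 7 on a read = busy */

static void dsp_write(unsigned char v)
{
    while (inp(DSP_WRITE) & 0x80);  /* wait for the busy flag to clear */
    outp(DSP_WRITE, v);
}

/* Play 'length' 8-bit samples from a 20-bit physical address via DMA channel 1,
   in single-cycle mode. */
void sb_play_single_cycle(unsigned long phys, unsigned int length, unsigned int rate)
{
    unsigned int count = length - 1;

    /* Program the 8237A: single transfer mode, address increment, read from memory */
    outp(0x0A, 0x05);                 /* mask channel 1                 */
    outp(0x0C, 0x00);                 /* clear the byte flip-flop       */
    outp(0x0B, 0x49);                 /* single mode, read, channel 1   */
    outp(0x02, phys & 0xFF);          /* address low byte               */
    outp(0x02, (phys >> 8) & 0xFF);   /* address high byte              */
    outp(0x83, phys >> 16);           /* page register for channel 1    */
    outp(0x03, count & 0xFF);         /* count low byte                 */
    outp(0x03, count >> 8);           /* count high byte                */
    outp(0x0A, 0x01);                 /* unmask channel 1               */

    /* Program the DSP: speaker on, sample rate, then start the transfer */
    dsp_write(0xD1);                  /* turn the DAC output on          */
    dsp_write(0x40);                  /* set time constant               */
    dsp_write((unsigned char)(256 - 1000000UL / rate));
    dsp_write(0x14);                  /* 8-bit single-cycle DMA output   */
    dsp_write(count & 0xFF);
    dsp_write(count >> 8);
}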

This is a big step up from the sample playing we have been doing so far, with the PC Speaker, Covox or SN76489 (‘Tandy/PCjr’ audio). Namely, all these devices required the CPU to output individual samples to the device. The CPU was responsible for accurate timing. This requires either cycle-counted loops or high-frequency timer interrupts. Using DMA is more convenient than a cycle-counted loop, and far more efficient than having to handle an interrupt for every single sample. You can now play back 23 kHz mono 8-bit audio at little more than the cost of the bandwidth on the data bus (which is about 23 KB/s in this case: you transfer 1 byte for each sample), so you still have plenty of processing time left to do other stuff. The DMA controller will just periodically signal a HOLD to the CPU. Once the CPU acknowledges this with a HLDA signal, the DMA controller has taken over the bus from the CPU (‘stealing cycles’), and can put a byte from memory onto the bus for the I/O device to consume. The CPU won’t be able to use the bus until the DMA transfer is complete (this can either be a single byte transfer or a block transfer).

It’s never that easy

If it sounds too good to be true, it usually is, right? Well, in a way, yes. At least, it is for my chosen target: the original Sound Blaster 1.0. It makes sense that when you target the original IBM PC 5150/5160, you also target the original Sound Blaster, right? Well, as usual, this opened up a can of worms. The keyword here is ‘seamless playback’. As stated above, the DMA controller can only transfer up to 64k at a time. At 22 kHz that is about 3 seconds of audio. How do you handle longer samples?

After the DMA transfer is complete, the DSP will issue an interrupt. For longer samples you are expected to send the next buffer of up to 64k immediately. And that is where the trouble is. No matter what you try, you cannot start the next buffer quickly enough. The DSP has a ‘busy’ flag, and you need to wait for the flag to clear before you send each command byte. I have measured that on my 8088 at 4.77 MHz, it takes 316 cycles to send the 3 bytes required for a new buffer command (the 0x14 command to play a DMA buffer, then the low byte and high byte of the number of samples to play). At 4.77 MHz, a single sample at 22050 Hz lasts about 216 CPU cycles. So you just cannot start a new transfer quickly enough. There is always a small ‘glitch’. A faster CPU doesn’t help: it’s the DSP that is the bottleneck. And you have to wait for the busy-flag to clear, because if you don’t, it will not process the command properly.

Nagging the DSP

Some early software tried to be creative (no pun intended) with the Sound Blaster, and implemented unusual ways to output sound. One example is Crystal Dream, which uses a method that is described by Jon Campbell of DOSBox-X as ‘Nagging the DSP’. Crystal Dream does not bother with the interrupt at all. Apparently they found out that you can just send a new 0x14 command, regardless of whether you received and acknowledged the interrupt or not. In fact, you can even send it while the buffer is still playing. You will simply ‘restart’ the DSP with a new buffer.

Now, it would be great if this resulted in seamless output, but experimentation on real hardware revealed that this is not the case (I have made a small test program which people can run on their hardware here). Apparently the output stops as soon as you send the 0x14 command, and it doesn’t start again until you’ve sent all 3 bytes, which still takes those 316 cycles, so you effectively get the exact same glitching as you would with a proper interrupt handler.

State of confusion

So what is the solution here? Well, I’m afraid there is no software solution. It is just a design flaw in the DSP. This only affects DSP v1.xx. Later Sound Blaster 1.x cards were sold with DSP v2.00, and Creative also offered these DSPs as upgrades to existing users, as the rest of the hardware was not changed. See this old Microsoft page for more information. The early Sound Blasters had a ‘click’ in digital output that they could not get rid of:

If a board with the versions 1.x DSP is installed and Multimedia Windows is running in enhanced mode, a periodic click is audible when playing a wave file. This is caused by interrupt latency, meaning that interrupts are not serviced immediately. This causes the Sound Blaster to click because the versions 1.x DSP produce an interrupt when the current DMA buffer is exhausted. The click is the time it takes for the interrupt to be serviced by the Sound Blaster driver (which is delayed by the 386 enhanced mode of Windows).

The click is still present in standard mode, although it is much less pronounced because the interrupt latency is less. The click is more pronounced for pure tones.

The version 2.0 DSP solves this problem by using the auto-initialize mode of the DMA controller (the 8237). In this mode, the DMA controller automatically reloads the start address and count registers with the original values. In this way, the Sound Blaster driver can allocate a 4K DMA buffer; using the lower 2K as the “ping” buffer and the upper 2K as the “pong” buffer.

While the DMA controller is processing the contents of the ping buffer, the driver can update the pong; and vice versa. Therefore, when the DMA controller auto-initializes, it will already have valid data available. This removes the click from the output sound.

What is confusing here is the nomenclature that the Sound Blaster Hardware Programming Guide uses:

Single-cycle DMA mode

They refer to the ‘legacy’ DSP v1.xx output mode as ‘single-cycle DMA mode’, which is true, in a sense: you program the DMA controller for a ‘single transfer mode’ read. A single-cycle transfer means that the DMA controller will transfer one byte at a time, whenever the device raises a DMA request. After that, the bus is released to the CPU again. Which makes sense for a DAC, since it wants to play the sample data at a specific rate, such as 22 kHz. For the next byte, the DSP will initiate a new DMA request by asserting the DREQ line again. This is opposed to a ‘block transfer’, where the DMA controller will fetch the next byte immediately after each transfer, so a device can consume data as quickly as possible, without having to explicitly signal a DREQ for each byte.

Auto-Initialize DMA mode

The ‘new’ DSP 2.00+ output mode is called ‘auto-initialize DMA mode’. In this mode, the DSP will automatically restart the transfer at every interrupt. This gives you seamless playback, because it no longer has to process a command from the CPU.

The confusion here is that the DMA controller also has an ‘autoinitialize mode’. This mode will automatically reload the address and count registers after a transfer is complete. So the DMA controller is immediately reinitialized to perform the same transfer again. Basically the same as what the DSP is doing in ‘auto-initialize DMA mode’. You normally want to use both the DMA controller and DSP in their respective auto-init modes. Double-buffering can then be done by setting up the DMA controller with a transfer count that is twice as large as the block size you set on the DSP. As a result, the DSP will give you an interrupt when it is halfway through the DMA buffer, and another one when it is at the end. That way you can re-fill the half of the buffer that has just finished playing at each interrupt, without any need to perform extra synchronization anywhere. The DMA controller will automatically go back to the start of the buffer, and the DSP also restarts its transfer, and will keep requesting data, so effectively you have created a ringbuffer:

[Diagram: SB DMA]

However, for the DMA controller, this is not a separate kind of transfer, but rather a mode that you can enable for any of the transfer types (single transfer, block transfer or demand). So you are still performing a ‘single transfer’ on the DMA controller (one byte for every DREQ), just with auto-init enabled.
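As a concrete illustration of the double-buffered auto-init setup described above, here is a rough sketch (again not my actual code), assuming DSP v2.00+, DMA channel 1, and the dsp_write() helper and 8237A ports from the earlier sketch:

/* Program the 8237A for the whole buffer with auto-init enabled, and the DSP
   for half of it, so you get an interrupt at every half. */
void sb_play_autoinit(unsigned long phys, unsigned int buffer_len, unsigned int rate)
{
    unsigned int dma_count = buffer_len - 1;        /* full buffer */
    unsigned int dsp_count = buffer_len / 2 - 1;    /* half buffer */

    outp(0x0A, 0x05);                 /* mask channel 1                           */
    outp(0x0C, 0x00);                 /* clear the byte flip-flop                 */
    outp(0x0B, 0x59);                 /* single mode + auto-init, read, channel 1 */
    outp(0x02, phys & 0xFF);
    outp(0x02, (phys >> 8) & 0xFF);
    outp(0x83, phys >> 16);
    outp(0x03, dma_count & 0xFF);
    outp(0x03, dma_count >> 8);
    outp(0x0A, 0x01);                 /* unmask channel 1                         */

    dsp_write(0xD1);                  /* turn the DAC output on                   */
    dsp_write(0x40);                  /* set time constant                        */
    dsp_write((unsigned char)(256 - 1000000UL / rate));
    dsp_write(0x48);                  /* set DSP block size (half the buffer)     */
    dsp_write(dsp_count & 0xFF);
    dsp_write(dsp_count >> 8);
    dsp_write(0x1C);                  /* start 8-bit auto-init DMA output         */
}

The DSP then raises an interrupt at every half buffer; in the handler you refill the half that just finished playing and acknowledge the DSP by reading port 0x22E.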

You can also use this auto-init mode when using the legacy single-cycle mode of the DSP, because the DSP doesn’t know or care who or what programs the DMA, or what its address and count are. It simply requests the DMA controller to transfer a byte, nothing more. So by using auto-init on the DMA controller you can at least remove the overhead of having to reprogram DMA at every interrupt in single-cycle mode. You only have to send a new command to the DSP, to minimize the glitch.

Various sources seem to confuse the two types of auto-init, thinking they are the same thing and/or that they can only be used together. Not at all. In theory you can use the single-cycle mode for double-buffering in the same way as they recommend for auto-init mode: Set the DMA transfer count to twice the block size for the DSP, so you get two interrupts per buffer.

And then there is GoldPlay… It also gets creative with the Sound Blaster. Namely, it sets up a DMA transfer of only a single byte, with the DMA controller in auto-init mode. So if you start a DSP transfer, it would just loop over the same sample endlessly, right? Well no, because GoldPlay sets up a timer interrupt handler that updates that sample at the replay frequency.
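In (hypothetical, heavily simplified) code, the GoldPlay approach boils down to something like this, using the Turbo C-style interrupt keyword; mixer_next_sample() is a placeholder for whatever produces the next output sample:

/* One-byte DMA buffer in auto-init mode: the DSP keeps re-reading the same
   byte, and a timer interrupt at the replay rate simply rewrites that byte. */
static unsigned char far *dma_sample;    /* points at the 1-byte DMA buffer */

void interrupt timer_handler(void)
{
    *dma_sample = mixer_next_sample();   /* CPU-timed output, just like speaker/Covox */
    outp(0x20, 0x20);                    /* EOI to the 8259A */
}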

That is silly and smart at the same time, depending on how you look at it. Silly, because you basically give up the advantage of ‘Fire-and-forget’ DMA transfers, and you’re back to outputting CPU-timed samples like on a PC speaker or Covox. But smart, for exactly that reason: you can ‘retrofit’ Sound Blaster support quite easily to a system that is already capable of playing sound on a PC speaker/Covox. That is probably the reason why they did it this way. Crystal Dream also uses this approach by the way.

There is a slight flaw there, however: the DSP does not run in sync with the CPU. The DSP has its own crystal on the card. What this means is that, when the timer and DSP drift too far out of sync, you will eventually either miss a sample completely, or output the same sample twice. But since these early SB cards already have glitches by design, one extra glitch every now and then is no big deal either, right?

The best of both worlds

Well, not for me. I see two requirements here:

  1. We want as few glitches as possible.
  2. We want low latency when outputting audio.

For the DSP auto-init mode, it would be simple: You just set your DMA buffer to a small size to have low latency, and handle the interrupts from the DSP to update the buffers. You don’t have to worry about glitches.

For single-cycle mode, the smaller your buffers, the more glitches you get. So the two requirements seem mutually exclusive.

But they might not be. As GoldPlay and Crystal Dream show, you don’t have to match the buffer size of the DMA with the DSP at all. So you can set the DSP to the maximum length of 64k samples, to get the least amount of glitches possible.

Setting the DMA buffer to just 1 sample would not be my choice, however. That defeats the purpose of having a Sound Blaster. I would rather set up a timer interrupt to fire once every N samples, so the timer interrupt would be a replacement for the ‘real’ DSP interrupt you’d get in auto-init mode. If you choose your DSP length to be a multiple of the N samples you choose for your DMA buffer, you can reset the timer every time the DSP interrupt occurs, so that you re-sync the two. Be careful of the race condition here: in theory the DSP and the timer fire at the same time at the end of the buffer, but since they run on different clock generators, you never know which will fire first.

One way to get around that would be to implement some kind of flag to see if the timer interrupt had already fired, eg a counter would do. You know how many interrupts to expect, so you could just check the counter in the timer interrupt, and not perform the update when the counter exceeds the expected value. Or, you could turn it around: just increment the counter in the timer interrupt. Then when the DSP interrupt fires, you check the counter, and if you see the timer had not fired yet, you can perform the last update from the DSP handler instead. That removes the branching from the timer interrupt.

Another way could be to take the ‘latched timer’ approach, as I also discussed in a previous article. You define a list of PIT count values, and update the count at every interrupt, walking down the list. You’d just set the last count in the list to a value of 0 (interpreted as 65536 ticks), so you’re sure it never fires before the DSP does.
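A rough sketch of that idea (hypothetical values; it assumes the handler only has to write a new count to PIT channel 0, which takes effect once the current count expires):

/* One list of PIT counts per DSP block: a number of equal periods, and a final
   count of 0 (interpreted as 65536 ticks), so the timer is guaranteed not to
   fire again before the DSP interrupt re-syncs us to the start of the list. */
#define UPDATES_PER_BLOCK 16
#define TICKS_PER_UPDATE  1576          /* example value: ~757 Hz update rate */

unsigned int pit_counts[UPDATES_PER_BLOCK];
int pit_index = 0;

void init_pit_counts(void)
{
    int i;
    for (i = 0; i < UPDATES_PER_BLOCK - 1; i++)
        pit_counts[i] = TICKS_PER_UPDATE;
    pit_counts[UPDATES_PER_BLOCK - 1] = 0;   /* sentinel: never fires before the DSP does */
}

/* Called from the timer handler: program the next count in the list. */
void set_next_pit_count(void)
{
    unsigned int count = pit_counts[pit_index++];
    outp(0x40, count & 0xFF);     /* PIT channel 0, low byte  */
    outp(0x40, count >> 8);       /*                high byte */
}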

Once you have that up and running, you’ll have the same low-latency and low CPU load as with DSP v2.00+, and your glitches will be reduced to the minimum possible. Of course I would only recommend the above method as a fallback for DSP v1.xx. On other hardware, you should just use auto-init, which is 100% glitch-free.


A picture says more than a thousand words

When I work out ideas, I sometimes draw things out on paper, or I use some test-images made in Paint.NET or whatnot. So I was thinking… my blog is mostly text-oriented, aside from some example images and videos. Perhaps I should try to draw some diagrams or such, to illustrate certain ideas, to help people understand them more quickly.

So here is my first try of such a diagram. I have drawn out the VGM data and how it is processed by the interrupt handlers, as discussed in my previous blog:

[Diagram: vgm-data-order]

I have used draw.io for this, which seems to work well for me so far. I will update the previous post with these diagrams. Let me know what you think.


Putting the things together

So, over time I have discussed various isolated things related to 8088-based PCs. Specifically:

  • Streaming data from disk
  • The ‘latched’ timer interrupt
  • Auto-EOI on the 8259A interrupt controller

These topics are not as isolated as they seem at first. Namely, I was already using the auto-EOI trick for the streaming data program, to get the best possible performance. And I streamed audio data, which is related to sound cards. When I discussed the latching timer, I also hinted at music already (and the auto-EOI feature). And again, when I discussed auto-EOI in detail, I mentioned digital audio playback.

Once I had built my Tandy sound card (using the PCB that I ordered from lo-tech, many thanks to James Pearce for making this possible), I needed software to do something with it. The easiest way to get it to play something is to use VGM files. There are quite a few captured songs from games (mostly from the Sega Master System, which used the same audio chip), and various trackers can also export songs to VGM.

VGM is a very simple file format: it simply stores the raw commands sent to the sound chip(s), with delays stored between the commands. There are simple VGM files, which only update the music 50 or 60 times per second (synchronized to PAL or NTSC screen updates). These are easy to play: just set up your timer interrupt to fire at that rate, and output the commands. But then there are the more interesting files, which contain digital samples; these play at much higher rates, and are stored with very fine-grained delay commands. These delays are expressed in ticks at 44.1 kHz resolution. So the format is flexible enough to support very fast updates to sound chips, eg for outputting single samples at a time, up to a 44.1 kHz sample rate.

The question of course is: how do you play that? On a modern system, it’s no problem to process data at 44.1 kHz in realtime. But on an 8088 at 4.77 MHz, not so much. You have about 108 CPU cycles to process each sample. That is barely enough to just process an interrupt, let alone actually processing any logic and outputting data to the sound chip. A single write to the SN76489 takes about 42 CPU cycles by the way.

So the naïve way of just firing a timer interrupt at 44.1 kHz is not going to work. Polling a timer is also going to be difficult, because it takes quite some time to read a 16-bit value from the PIT. And those 16-bit values can only go down to an 18.2 Hz rate, which gets you about 55 ms as the maximum delay, so you will want to detect wraparound and extend it to 32-bit to be able to handle longer delays in the file. This will make it difficult to keep up with high-frequency data as well. It would also tie up the CPU 100%, so you can’t do anything else while playing music.

But what if we view VGM not as a file containing 44.1 kHz samples, but rather as a timeline of events, where the resolution is 44.1 kHz, but the actual event rate is generally much lower than 44.1 kHz? Now this sounds an awful lot like the earlier raster effect with the latched timer interrupt! We’ve seen there that it’s possible to reprogram the timer interrupt from inside the interrupt handler. By not resetting the timer, but merely setting a new countdown value, we avoid any jitter, so we remain on an ‘absolute’ time scale. The only downside is that the new countdown value gets activated immediately after the counter reaches 0, so before the CPU can reach your interrupt handler. Effectively that means you always need to plan 1 interrupt handler ahead.

I have tried to draw out how the note data and delay data is stored in the VGM file, and how they need to be processed by the interrupt handlers:

[Diagram: vgm-data-order]

We basically have two challenges here, with the VGM format:

  1. We need a way to ‘snoop ahead’ to get the proper delay to set in the current handler.
  2. We need to process the VGM data as quickly as possible.

For the first challenge, I initially divided my VGM command processor into two: one that would send commands to the SN76489 until it encountered a delay command. The other would skip all data until it encountered a delay command, and return its value.

Each processor has its own internal pointer, so in theory they could be in completely different places in the file. Making the delay processor be ‘one ahead’ in the stream was simple this way.

There was still that second challenge, however: firstly, I had to process the VGM stream byte-by-byte, and act on each command explicitly in a switch-case statement. Secondly, the delay values were in 44.1 kHz ticks, so I had to translate them to 1.19 MHz ticks for the PIT. Even though I initially tried a look-up table for short delays, it still wasn’t all that fast.
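For reference, a simplified sketch (not my actual code) of what such a ‘delay processor’ could look like, based on the standard VGM command set (0x50 = SN76489 write, 0x61/0x62/0x63/0x7n = waits):

/* Skip ahead from skip_ptr until the next delay command, and return that delay
   converted from 44.1 kHz ticks to PIT ticks. */
unsigned char far *skip_ptr;

unsigned long next_delay_in_pit_ticks(void)
{
    unsigned long samples = 0;
    unsigned char cmd;

    for (;;)
    {
        cmd = *skip_ptr++;

        if (cmd == 0x50) { skip_ptr++; continue; }       /* SN76489 write: skip its data byte */
        if (cmd == 0x61) {                               /* wait nnnn samples */
            samples  = *skip_ptr++;
            samples |= (unsigned long)*skip_ptr++ << 8;
            break;
        }
        if (cmd == 0x62) { samples = 735; break; }       /* wait 1/60th of a second */
        if (cmd == 0x63) { samples = 882; break; }       /* wait 1/50th of a second */
        if ((cmd & 0xF0) == 0x70) { samples = (cmd & 0x0F) + 1; break; } /* short wait */
        if (cmd == 0x66) break;                          /* end of sound data */
    }

    /* 44.1 kHz ticks -> 1.19 MHz PIT ticks: 1193182/44100 ~= 11932/441,
       scaled down so the multiplication fits in 32 bits. */
    return (samples * 11932UL) / 441UL;
}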

So eventually I decided that I would just preprocess the data into my own format, and play from there. The format could be really simple, just runs of:

uint16_t delay;
uint8_t data_count;
uint8_t data[data_count];

Where ‘delay’ is already in PIT ticks, and since there is only one command in VGM for the SN76489, which sends a single byte to its command port, I can just group them together in a single buffer. This is nice and compact.

I have now reversed the order of the delays and note data in the stream, and in the following diagram you can see how that simplifies the processing for the interrupt handlers:

[Diagram: preprocessed-data-order]

As you can see, I can now just process the data ‘in order’: The first delay is sent at initialization, then I just process note data and delays as they occur in the stream.

Since I support a data_count of 0, I can get around the limitation of the PIT only being able to wait for 65536 ticks at most: I can just split up longer delays into multiple blocks with 0 commands.

I only use a byte for the data_count. That means I can only support 255 command bytes at most. Is that a problem? Well no, because as mentioned above, a single write takes 42 CPU cycles, and there are about 108 CPU cycles in a single tick at 44.1 kHz. Therefore, you couldn’t physically send more than 2 bytes to the SN76489 in a single tick. The third byte would already trickle over into the next tick. So if I were ever to encounter more than 255 bytes with no delays, then I could just add a delay of 1 in my stream, and split up the commands. In practice it is highly unlikely that you’ll ever encounter this. There are only 4 different channels on the chip, and the longest commands you can send are two bytes. You might also want to adjust the volume of each channel, which is 1 byte. So worst-case, you’d probably send 12 bytes at a time to the chip. Then you’d want a delay so you could actually hear the change take effect.
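To illustrate, a rough sketch of what the interrupt handler for this format could look like (hypothetical names, Turbo C-style interrupt handler; the SN76489 is assumed at port 0xC0, as on the PCjr/Tandy):

static unsigned char far *stream;   /* preprocessed data; starts just past the first delay */

void interrupt vgm_handler(void)
{
    unsigned int delay;
    unsigned char count, i;

    /* Output the note data for this slot */
    count = *stream++;
    for (i = 0; i < count; i++)
        outp(0xC0, *stream++);            /* SN76489 command port */

    /* Program the delay for the *next* slot: just a new PIT count, no reset */
    delay  = *stream++;
    delay |= (unsigned int)*stream++ << 8;
    outp(0x40, delay & 0xFF);
    outp(0x40, delay >> 8);

    outp(0x20, 0x20);                     /* EOI */
}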

That’s all there is to it! This system can now play VGM data at the resolution of 44.1 kHz, with the only limitation being that you can’t have too many commands in too short a period of time, because the CPU and/or the SN76489 chip will not be able to keep up.

Well, not really, because there is a third challenge:

  3. VGM files (or the preprocessed data derived from them) may exceed 64k (a single segment) or even 640k (the maximum amount of conventional memory in an 8088 system).

Initially I just wanted to accept these limitations: load as much of the file into memory as possible, and play only that portion. But then I figured: technically this routine is a ‘background’ routine, since it is entirely driven by an interrupt, and I can still run other code in the ‘foreground’, as long as the music doesn’t keep the CPU too busy.

This brought me back to the earlier experiment with streaming PWM/PCM data to PC speaker and Covox. The idea of loading the data into a ringbuffer of 64k and placing the interrupt handler inside this ringbuffer makes a lot of sense in this scenario as well.

Since the data is all preprocessed, the actual interrupt handler is very compact and simple, and copying it around is very little overhead. The data rate should also be relatively low, unless VGMs use a lot of samples. In most cases, a HDD, or even a floppy, should be able to keep up with the data. So I gave it a try, and indeed, it works:

Or well, it would, if the floppy could keep up! This is a VGM capture of the music from Skate or Die, by Rob Hubbard. It uses samples extensively, so it is a bit of ‘worst case’ for my player. But as you can hear, it plays the samples properly, even while it is loading from disk. It only messes up when there’s a buffer underrun, but eventually recovers. Simpler VGM files play perfectly from floppy. Sadly this machine does not have a HDD, so I will need to try the Skate or Die music again some other time, when I have installed the card into a system with a HDD. I’m confident that it will then keep up and play the music perfectly.

But for now, I have other plans. They are also music-related, and I hope to have a quick demonstration of those before long.


Programming: What separates the men from the boys?

I have come up with the following list of topics:

  • Pointer arithmetic
  • Unicode vs ASCII strings
  • Memory management
  • Calling conventions
  • Basic mathematics, such as linear algebra (eg 2d rotations, translations and scaling, things you’ll regularly find in basic GUI stuff).
  • Multithreading/concurrency

Over time I have found that some programmers master such concepts at an early stage in their career, while others continue to struggle with these things for the rest of their lives.

Thoughts? Some more topics we can add to the list?

 


Any real-keeping lately?

The 5-year anniversary of my inaugural ‘Just keeping it real’-article came and went. Has it been that long already? It’s also been quite some time since I’ve last written about some oldskool coding, or even anything at all. Things have been rather busy.

However, I have still been doing things on and off. So I might as well give a short overview of things that I have been doing, or things that I may plan to do in the future.

Libraries: fortresses of knowledge

One thing that has been an ongoing process is streamlining my 8088-related code, and extracting the ‘knowledge’ from the effects and utilities that I have been developing into easy-to-use include and source files for assembly and C. Basically I want to create some headers for each chip that you regularly interact with, such as the 8253, the 8259A and the 6845. And at a slightly higher level, also routines for dealing with MDA, Hercules, CGA, EGA and VGA, and also audio, such as the PC speaker or the SN76489.

For example, to make it easy to set up a timer interrupt at a given frequency, or to enable/disable auto-EOI mode, or to perform horizontal or vertical sync for low-level display hacking like in 8088 MPH. That is the real challenge here. The header files should be easy to use, while at the same time giving maximum control and performance.

I am somewhat inspired by the Amiga’s NDK. It contains header files that allow easy access to all the custom chip registers. For some reason, something similar is not around for the PC hardware, as far as I know. There’s very extensive documentation, such as Ralf Brown’s Interrupt List, and the BOCHS ports list. But these are not in a format that can be readily used in a programming environment. So I would like to make a list of constants that describes all registers and flags, in a way they can be used immediately in a programming context (in this case assembly and C, but it should be easy to take a header file and translate it to another language. In fact, currently I generally write the assembly headers first, then convert them to C). On top of that, I try to use macros where possible, to add basic support routines. Macros have the advantage that they are inlined in the code, so there is no calling overhead. If you design your macro just right, it can be just as efficient as hand-written code. It can even take care of basic loop unrolling and such.
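To give an idea of the kind of header I mean, here is a small mock-up in C (the names and macros are made up for illustration, not the actual library):

/* 8253/8254 PIT */
#define PIT_CH0_DATA    0x40
#define PIT_CH2_DATA    0x42
#define PIT_COMMAND     0x43
#define PIT_CLOCK       1193182UL

/* Command byte: channel 0, lobyte/hibyte access, mode 2 (rate generator) */
#define PIT_CH0_MODE2_LOHI  0x34

/* Program channel 0 to fire at (approximately) the given frequency.
   Because it is a macro, it is inlined and has no call overhead. */
#define PIT_SetTimerFreq(freq)                                     \
    do {                                                           \
        unsigned int count_ = (unsigned int)(PIT_CLOCK / (freq));  \
        outp(PIT_COMMAND, PIT_CH0_MODE2_LOHI);                     \
        outp(PIT_CH0_DATA, count_ & 0xFF);                         \
        outp(PIT_CH0_DATA, count_ >> 8);                           \
    } while (0)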

Once this library gets more mature, I might release it for everyone to use and extend.

Standards are great, you can never have too many of them

As I was creating these header files, I came to the conclusion that I was doing it wrong, at least, for the graphics part. Namely, when I first started doing my oldskool PC graphics stuff, I started with VGA and then worked my way down to the older standards. I created some basic library routines in C, where I considered EGA to be a subset of VGA, and CGA a subset of EGA in turn. I tried to create a single set of routines that could work in CGA, EGA or VGA mode, depending on a #define that you could set. Aside from that, I also added a Hercules mode, which didn’t quite fit in there, since Hercules is not an IBM standard, and is not compatible at all.

There are two problems with that approach:

  1. As we know from software such as 8088 MPH, EGA and VGA are in fact not fully backward compatible with CGA at all.  Where CGA uses a real 6845 chip, EGA and VGA do not. So some of the 6845 registers are repurposed/redefined on EGA/VGA. Various special modes and tricks work entirely differently on CGA than they do on EGA or VGA (eg, you can program an 80×50 textmode on all, but not in the same way).
  2. If you set a #define to select the mode in which the library operates, then by definition it can only operate in one mode at a time. This doesn’t work for example in the scenario where you want to be able to support multiple display adapters in a single program, and allow the user to select which mode to use (you could of course build multiple programs, one for each mode, and put them behind some menu frontend or such. Various games actually did that, so you often find separate CGA, EGA, VGA and/or Tandy executables. But it is somewhat cumbersome). Another issue is that certain videocards can actually co-exist in a single system, and can work at the same time (yes, early multi-monitor). For example, you can combine a ‘monochrome’ card with a ‘color’ card, because the IBM 5150 was originally designed that way, with MDA and CGA. They each used different IO and memory ranges, so that both could be installed and used at the same time. By extension, Hercules can also be used together with CGA/EGA/VGA.

So now that I have seen the error of my ways, I have decided to only build header files on top of other header files when they truly are supersets. For example, I have a 6845 header file, and MDA, Hercules and CGA use this header. That is because they all use a physical 6845 chip. For EGA and VGA, I do not use it. Also, I use separate symbol names for all graphics cards. For example, I don’t just make a single WaitVBL-macro, but I make one specific for every relevant graphics card. So you get a CGA_WaitVBL, a HERC_WaitVBL etc. You can still masquerade them behind a single alias yourself, if you so please. But you can also use the symbols specific to each given card side-by-side.

And on his farm he had some PICs, E-O, E-O-I

The last oldskool article I did was focused around the 8259A Programmable Interrupt Controller, and the automatic End-of-Interrupt functionality. At the time I already mentioned that it would be interesting for high-frequency timer interrupt routines, such as playing back digital audio on the PC speaker. That was actually the main reason why I was interested in shaving a few cycles off. I have since used the auto-EOI code in a modified version of the chipmod routine from the endpart of 8088 MPH. Instead of the music player taking all CPU, it can now be run from a timer interrupt in the background. By reducing the mixing rate, you can free up some time to do graphics effects in the foreground.

That routine was the result of some crazy experimentation. Namely, for a foreground routine, the entire CPU is yours. But when you want to run a routine that triggers from an interrupt, then you need to save the CPU state, do your routines, and then restore the CPU state. So the less CPU state you need to save, the better. One big problem with the segmented memory model of the 8088 is how to get access to your data. When the interrupt triggers, the only segment you can be sure of is the code segment. You have no idea what DS and ES are pointing to. You can have some control over SS, because you can make sure that your entire program only uses a single segment for the stack throughout.

So one idea was to reserve some space at the top of the stack, to store data there. But then I figured that it might be easier to just use self-modifying code to store data directly in the operands of some instructions.

Then I had an even better idea: what if I were to use an entire segment for my sample data? It can effectively be a 64k ringbuffer, where wraparound is automatic, saving the need to do any bounds checking on my sample pointer. It is a 16-bit pointer, so it will wrap around automatically. And what if I would put this in the code segment? I only need a few bytes of code for an interrupt handler that plays a sample, increments the sample pointer, and returns from the interrupt. I can divide the ringbuffer into two halves. When the sample pointer is in the low half, I put the interrupt handler in the high half, and when the sample pointer switches to the high half, I move the interrupt handler to the low half.

Since each half is so large, I do not need to check the pointer at every single sample. I can just do it in the logic of the foreground routine, every frame or so. This makes it a very efficient approach.

I also had this idea of placing the interrupt handlers in segment 0. The advantage here is that CS will point to 0, which means that you can modify the interrupt vector table directly, by just doing something like mov cs:[20h], offset myhandler. This allows you to have a separate handler for each sample, and include the sample in the code itself, so the code effectively becomes the sample buffer. At the time I thought it would be too much of a hassle, but then reenigne suggested the exact same thing, so I thought about it once more. There may be something here yet.

I ended up giving it a try. I decided to place my handlers 32 bytes apart. 32 bytes was enough to make a handler that plays a sample and updates the interrupt vector. The advantage of spacing all handlers evenly in memory is that they all had the instruction that loaded the sample in the same place, so these instructions were all spaced 32 bytes apart as well. This made it easy to address these samples, and update them with self-modifying code from a mixing loop. It required some tricky code that backs up the existing interrupt vector table, then disables all interrupts except IRQ 0 (the timer interrupt), and restores everything upon exit. But after some trial-and-error I managed to get it working.

As we were discussing these routines, we were wondering if this would perhaps be good enough as a ‘replacement’ for Trixter’s Sound Blaster routines in 8088 Corruption and 8088 Domination. Namely, the Sound Blaster was the only anachronism in these productions, because streaming audio would have been impossible without the Sound Blaster and its DMA functionality.

So I decided to make a proof-of-concept streaming audio player for my 5160:

As you can see, or well, hear, it actually works quite well. At least, with the original controller and Seagate ST-225, as in my machine. Apparently this system uses the DMA controller, and as such, disk transfers can work in the background nicely. It introduces a small amount of jitter in the sample playback, since the DMA steals bus cycles. But for a 4.77 MHz 8088 system, it’s quite impressive just how well this works. With other disk controllers you may get worse results, when they use PIO rather than DMA. Fun fact: the floppy drive also uses DMA, and the samples are of a low enough bitrate that they can play from a floppy as well, without a problem.

Where we’re going, we don’t need… Wait, where are we going anyway?

So yes, audio programming. That has been my main focus since 8088 MPH. Because, aside from the endpart, the weakest link in the demo is the audio. The beeper is just a very limited piece of hardware. There must be some kind of sweet-spot somewhere between the MONOTONE audio and the chipmod player of the endpart. Something that sounds more advanced than MONOTONE, but which doesn’t hog the entire CPU like the chipmod player, so you can still do cool graphics effects.

Since there has not been any 8088 production to challenge us, audio still remains our biggest challenge. Aside from the above-mentioned disk streaming and background chipmod routine, I also have some other ideas. However, to actually experiment with those, I need to develop a tool that lets me compose simple music and sound effects. I haven’t gotten too far with that yet.

We could also approach it from a different angle, and use some audio hardware. One option is the Covox LPT DAC. It will require the same high-frequency timer interrupt trickery to bang out each sample. However, the main advantage is that it does not use PWM, and therefore it has no annoying carrier wave. This means that you can get more acceptable sound, even at relatively low sample rates.

A slightly more interesting option is the Disney Sound Source. It includes a small 16-byte buffer. It is limited to 7 kHz playback, but at least you won’t need to send every sample individually, so it is less CPU-intensive.

Yet another option is looking at alternative PC-compatible systems. There’s the PCjr and Tandy, which have an SN76489 chip on board. This allows 3 square wave channels and a noise channel. Aside from that, you can also make any of the square wave channels play 4-bit PCM samples relatively easily (and again no carrier wave). Listen to one of Rob Hubbard’s tunes on it, for example:

What is interesting is that there’s a home-brew Tandy clone card being developed as we speak. I am building my own as well. This card allows you to add an SN76489 chip to any PC, making its audio compatible with Tandy/PCjr. It would be very interesting if this card became somewhat ‘standard’ for demoscene productions.

(Why not just take an AdLib, you ask? Well, various reasons. For one, it was rather late to the market, not so much an integral part of 8088 culture. Also, it’s very difficult and CPU-consuming to program. Lastly, it’s not as easy to play samples on as the other devices mentioned. So the SN76489 seems like a better choice for the 8088. The fact that it was also used in the 8088-based PCjr and Tandy 1000 gives it some extra ‘street cred’).

Aside from that, I also got myself an original Hercules GB102 card some time ago. I don’t think it would be interesting to do another demo on exactly the same platform as 8088 MPH. Instead, it would be more interesting to explore other hardware from the 8088 era. The Hercules is also built around a 6845 chip, so some of the trickery done in 8088 MPH may be translated to Hercules. At the same time, the Hercules also offers unique features, such as 64 kB of memory, arranged in two pages of 32 kB. So we may be able to make it do some tricks of its own. Sadly, it would not be a ‘world’s first’ Hercules demo, because someone already beat me to it some months ago:


AMD Zen: a bit of a deja-vu?

AMD has released the first proper information on their new Zen architecture. Anandtech seems to have done some of the most in-depth coverage, as usual. My first impression is that of a deja-vu… in more than one way.

Firstly, it reminds me of what AMD did a few years ago on the GPU-front: They ditched their VLIW-based architecture, and moved to a SIMD-based architecture, which was remarkably similar to nVidia’s architecture (nVidia had been using SIMD-based architectures since their 8800GTX). In this case, Zen seems to follow Intel’s Core i7-architecture quite closely. They are moving back to high-IPC cores, just as in their K7/K8 heyday (which at the time was following Intel’s P6-architecture closely), and they seem to target lower clockspeeds, around the 3-4 GHz area where Intel also operates. They are also adopting a micro-op cache, something that Intel has been doing for a long time.

Secondly, AMD is abandoning their CMT-approach, and going for a more conventional SMT-approach. This is another one of those “I told you so”-moments. Even before Bulldozer was launched, I already said that having 2 ALUs hardwired per core is not going to work well. Zen is now using 4 ALUs per two logical cores. So technically they still have the same amount of ALUs per ‘module’. However, like the Core i7, each thread can now use all 4 ALUs, so you get much better IPC for single threads. This again is something I said a few years ago already. AMD apparently agrees with that. Their fanbase did not, sadly.

We can only wonder why AMD did not go for SMT right away with Bulldozer. I personally think that AMD knew all along that SMT was the better option. However, their CMT was effectively a ‘lightweight’ SMT, where only the FPU portion did proper SMT. I think it may be a combination of two factors here:

  1. SMT was originally developed by IBM, and Intel has been using their HyperThreading variation for many years. Both companies have collected various patents on the technology over the years. Perhaps for AMD it was not worthwhile to use fullblown SMT, because it would touch on too many patents and the licensing costs would be prohibitive. It could be that some of these patents have now expired, so the equation has changed to AMD’s favour. It could also be that AMD is now willing to take a bigger risk, because they have to get back in the CPU race at all cost.
  2. Doing a fullblown SMT implementation for the entire CPU may have been too much of a step for AMD in a single generation. AMD only has a limited R&D budget, so they may have had to spread SMT out over two generations. We don’t know how long it took Intel to develop HyperThreading, but we do know that even though their first implementation in the Pentium 4 worked well enough in practice, there were still various small bugs and glitches in their implementations; not just stability-wise, but also security-wise. The concept of SMT is not that complicated, but shoehorning it into the massively complex x86 architecture, which has tons of legacy software which needs to continue working flawlessly, is an entirely different matter. This is quite a risky undertaking, and proper validation can take a long time.

At any rate, Zen looks more promising than Bulldozer ever did. I think AMD made a wise choice in going back to ‘follow the leader’-mode. Not necessarily because Intel’s architecture is the right one, but because Intel’s architecture is the most widespread one. I have said the same thing about the Pentium 4 in the past: the architecture itself was not necessarily as bad as people think. Its biggest disadvantage was that it did not handle code optimized for the P6-architecture very well, and most applications had been developed for P6. If all applications had been recompiled with Pentium 4 optimizations, it would already have made quite a different impression. Let alone if developers actually optimized their code specifically for Pentium 4’s strengths (something we mainly saw with video encoding/decoding and 3D rendering).

Bulldozer was facing a similar problem: it required a different type of software. If Intel couldn’t pull off a big change in software optimization with the Pentium 4, then a smaller player like AMD certainly wouldn’t either. That is the main reason why I never understood Bulldozer.


DirectX 12 and Vulkan: what it is, and what it isn’t

I often read comments in the vein of: “… but vendor A’s hardware is designed more for DX12/Vulkan than vendor B’s”. It’s a bit more complicated than that, because it is somewhat of a chicken-and-egg problem. So I thought I’d do a quick blog to try and explain it.

APIs vs hardware features

A large part of the confusion seems to be because the capabilities of hardware tend to be categorized by versions of the DirectX API. In a way that makes sense, since each new version of the DirectX API also introduces support for new hardware features. So this became a de-facto way of categorizing the hardware capabilities of GPUs. Since DirectX 11, we even have different feature levels that can be referred to.

As you can see, the main new features in DirectX 12 are Conservative Rasterization, Volume Tiled Resources and Rasterizer ordered views. But, as you can also see, these have been ‘backported’ to DirectX 11.3 as well, so apparently they are not specific to the DirectX 12 API.

But what is an API really? API stands for Application Programming Interface. It is the ‘thing’ that you ‘talk to’ when programming something, in this case graphics. And the ‘new thing’ about DirectX 12, Vulkan (and Metal and Mantle) is that the interface follows a new paradigm, a new programming model. In earlier versions of DirectX, the driver was responsible for tasks such as resource management and synchronization (eg, if you first render to a buffer, and later want to use that buffer as a texture on some surface, the driver makes sure that the rendering to the buffer is complete before rendering with the texture starts).

These ‘next-gen’ APIs however, work on a lower level, and give the programmer control over such tasks. Leaving it to the driver can work well in the general case, and makes things easier and less error-prone for the programmer. However, the driver has to work with all software, and will use a generic approach. By giving the programmer fine-grained control over the hardware, these tasks can be optimized specifically for an engine or game. This way the programmer can shave off redundant work and reduce overhead on the CPU side. The API calls are now lighter and simpler, because they don’t have to take care of all the bookkeeping, validation and other types of management. These have now been pushed to the engine code instead. On the GPU side, things generally stay the same however, but more on that later.

Command lists

Another change in the programming model is that GPU commands are now ‘decoupled’ at the API side: Sending rendering commands to the GPU is now a two-step process:

  1. Add all the commands you want to execute to a list
  2. Execute your command list(s)

Classic rendering APIs are ‘immediate’ renderers: with an API call you send the command directly to the driver/GPU. Drivers might internally buffer this, and create/optimize their own command lists, but this is transparent to the programmer. A big problem in this programming model is that the order in which the commands are executed is important. That basically means that you can only use a single thread to send commands. If you were to use multiple threads, you’d have to synchronize them so that they all sent their commands in-order, which basically would mean they’d run one-after-another, so you might as well use a single thread.

DirectX 11 tried to work around this by introducing ‘deferred’ contexts. You would have one ‘immediate’ context, which would execute all commands immediately. But you could create additional contexts, which would buffer commands in a list, which you could later hand down to the immediate context to execute.
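A rough sketch of how that looks in code (error handling omitted, member names are placeholders):

// Create a deferred context next to the regular immediate context.
ID3D11DeviceContext* deferredContext = nullptr;
m_device->CreateDeferredContext(0, &deferredContext);

// Record commands on the deferred context (possibly on a worker thread)...
deferredContext->ClearRenderTargetView(renderTargetView, DirectX::Colors::Black);
// ...and turn them into a command list:
ID3D11CommandList* commandList = nullptr;
deferredContext->FinishCommandList(FALSE, &commandList);

// Later, on the main thread, hand it to the immediate context for execution:
m_immediateContext->ExecuteCommandList(commandList, TRUE);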

There were however two problems with this approach:

  1. The deferred contexts supported only a subset of all commands
  2. Only nVidia’s implementation managed to get significant performance from this

To clarify that second point, FutureMark built an API overhead test, which includes tests with DX11 using immediate and deferred contexts, with a single or multiple threads. See Anandtech’s review of this test.

As you can see, this feature does absolutely nothing on AMD hardware. They are stuck at 1.1M calls regardless of what technique you use, or how many cores you throw at it.

With nVidia however, you see that with 4 or 6 cores, it goes up to 2.2M-2.3M calls. Funny enough, nVidia’s performance on the single-threaded DX11 code also goes up with the 6-core machine, so the total gains from this technique are not very dramatic. Apparently nVidia already performs some parallel processing inside the driver.

DirectX 12 takes this concept further. You now have a command queue, in which you can queue up command lists, which will be executed in-order. The commands inside the command list will also be executed in-order. You can create multiple command lists, and create a thread for each list, to add the commands to it, so that they can all work in parallel. There are no restrictions on the command lists anymore, like there were with the deferred context in DX11 (technically you no longer have an ‘immediate’ context in DX12, they are all ‘deferred’).

An added advantage is that you can re-use these command lists. In various cases, you want to send the same commands every frame (to render the same objects and such), so you can now remove redundant work by just using the same command list over and over again.

Honourable mention for Direct3D 1 here: The first version of Direct3D actually used a very similar concept to command lists, known as ‘execute buffers’. You would first store your commands as bytecode in an ‘execute buffer’, and then execute the buffer. Technically this could be used in multi-threaded environments in much the same way: use multiple threads, which each fill their own execute buffer in parallel.

Asynchronous compute

Why is there a queue for the command lists, you might ask? Can’t you just send the command lists directly to an Execute( myList ) function? The answer is: there can be more than one queue. You can see this as a form of ‘GPU multithreading’: you can have multiple command lists executing at the same time. If you want to compare it to CPU multithreading, you could view a command queue as a thread, and a command list as an instruction stream (a ‘ThreadProc’ that is called when the thread is running).

There are three different classes of queues and command lists:

  1. Graphics
  2. Compute
  3. DMA/Copy

The idea behind this is that modern GPUs are capable of performing multiple tasks at the same time, since they use different parts of the GPU. Eg, you can upload a texture to VRAM via DMA while you are also rendering and/or performing compute tasks (previously this was done automatically by the driver).

The most interesting new feature here is that you can run a graphics task and a compute task together. The classic example of how you can use this is rendering shadowmaps: shadowmaps do not need any pixel shading, they just need to store a depth value. So you are mainly running vertex shaders and using the rasterizer. In most cases, your geometry is not all that complex, so there are relatively few vertices that need processing, leaving a lot of ALUs on the GPU sitting idle. With these next-gen APIs you can now execute a compute task at the same time, and make use of the ALUs that would otherwise sit idle (compute does not need the rasterizer).

This is called ‘asynchronous’ compute, because, like with conventional multithreading on the CPU, you are scheduling two (or more) tasks to run concurrently, and you don’t really care about which order they run in. They can be run at the same time if the hardware is capable of it, or they can be run one-after-another, or they can switch multiple times (time-slicing) until they are both complete (on CPUs there are a number of ways to run multiple threads, single-core, multi-core, multi-CPU, HyperThreading. And the OS will use a combination of techniques to schedule threads on the available hardware. see also my earlier blog). You may care about the priority of both, so that you can allocate more resources to one of them, to make it complete faster. But in general, they are running asynchronously. You need to re-synchronize by checking that they have both triggered their event to signal that they have completed.
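A rough sketch of what that looks like in DirectX 12 (error handling omitted, member names are placeholders):

// Create a separate compute queue next to the regular graphics ('direct') queue.
D3D12_COMMAND_QUEUE_DESC computeDesc = {};
computeDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
m_device->CreateCommandQueue(&computeDesc, IID_PPV_ARGS(&m_computeQueue));

// Kick off graphics and compute work on their own queues; they may overlap on the GPU.
ID3D12CommandList* graphicsLists[] = { m_graphicsCommandList.Get() };
ID3D12CommandList* computeLists[]  = { m_computeCommandList.Get() };
m_graphicsQueue->ExecuteCommandLists(_countof(graphicsLists), graphicsLists);
m_computeQueue->ExecuteCommandLists(_countof(computeLists), computeLists);

// Re-synchronize with a fence: let the queue signal it on the GPU, wait for it on the CPU.
m_computeQueue->Signal(m_computeFence.Get(), ++m_computeFenceValue);
m_computeFence->SetEventOnCompletion(m_computeFenceValue, m_fenceEvent);
WaitForSingleObject(m_fenceEvent, INFINITE);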

Now that the introduction is over…

So, how does it look when you actually want to render something? Well, let’s have a (slightly simplified) look at rendering an object in DirectX 11:

// Set the viewport and scissor rectangle.
D3D11_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_immediateContext->RSSetViewports(1, &viewport);
m_immediateContext->RSSetScissorRects(1, &m_scissorRect);

// Send drawing commands.
ID3D11RenderTargetView* renderTargetView = m_deviceResources->GetRenderTargetView();
ID3D11DepthStencilView* depthStencilView = m_deviceResources->GetDepthStencilView();
m_immediateContext->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue);
m_immediateContext->ClearDepthStencilView(depthStencilView, D3D11_CLEAR_DEPTH, 1.0f, 0);

m_immediateContext->OMSetRenderTargets(1, &renderTargetView, depthStencilView);

m_immediateContext->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
UINT stride = sizeof(VertexPositionColor), offset = 0; // stride of your vertex struct (VertexPositionColor assumed here)
m_immediateContext->IASetVertexBuffers(0, 1, &m_vertexBuffers, &stride, &offset);
m_immediateContext->IASetIndexBuffer(m_indexBuffer, DXGI_FORMAT_R16_UINT, 0);
m_immediateContext->DrawIndexedInstanced(36, 1, 0, 0, 0);

And in DirectX 12 it would look like this (again, somewhat simplified):

// Set the viewport and scissor rectangle.
D3D12_VIEWPORT viewport = m_deviceResources->GetScreenViewport();
m_commandList->RSSetViewports(1, &viewport);
m_commandList->RSSetScissorRects(1, &m_scissorRect);

// Indicate this resource will be in use as a render target.
CD3DX12_RESOURCE_BARRIER renderTargetResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(), D3D12_RESOURCE_STATE_PRESENT, D3D12_RESOURCE_STATE_RENDER_TARGET);
m_commandList->ResourceBarrier(1, &renderTargetResourceBarrier);

// Record drawing commands.
D3D12_CPU_DESCRIPTOR_HANDLE renderTargetView = m_deviceResources->GetRenderTargetView();
D3D12_CPU_DESCRIPTOR_HANDLE depthStencilView = m_deviceResources->GetDepthStencilView();
m_commandList->ClearRenderTargetView(renderTargetView, DirectX::Colors::CornflowerBlue, 0, nullptr);
m_commandList->ClearDepthStencilView(depthStencilView, D3D12_CLEAR_FLAG_DEPTH, 1.0f, 0, 0, nullptr);

m_commandList->OMSetRenderTargets(1, &renderTargetView, false, &depthStencilView);

m_commandList->IASetPrimitiveTopology(D3D_PRIMITIVE_TOPOLOGY_TRIANGLELIST);
m_commandList->IASetVertexBuffers(0, 1, &m_vertexBufferView);
m_commandList->IASetIndexBuffer(&m_indexBufferView);
m_commandList->DrawIndexedInstanced(36, 1, 0, 0, 0);

// Indicate that the render target will now be used to present when the command list is done executing.
CD3DX12_RESOURCE_BARRIER presentResourceBarrier =
	CD3DX12_RESOURCE_BARRIER::Transition(m_deviceResources->GetRenderTarget(), D3D12_RESOURCE_STATE_RENDER_TARGET, D3D12_RESOURCE_STATE_PRESENT);
m_commandList->ResourceBarrier(1, &presentResourceBarrier);

m_commandList->Close();

// Execute the command list.
ID3D12CommandList* ppCommandLists[] = { m_commandList.Get() };
m_deviceResources->GetCommandQueue()->ExecuteCommandLists(_countof(ppCommandLists), ppCommandLists);

As you can see, the actual calls are very similar. The functions mostly have the same names, and even the parameters are mostly the same. At a higher level, most of what you do is exactly the same: you use a rendertarget and a depth/stencil surface, and you set up a viewport and scissor rectangle. Then you clear the rendertarget and depth/stencil for a new frame, and send a list of triangles to the GPU, stored in a vertex buffer and index buffer pair. (You would already have initialized a vertex shader and pixel shader at an earlier stage, and already uploaded the geometry to the vertex and index buffers, but I left those parts out for simplicity. That code is again very similar between the APIs, although DirectX 12 again requires a bit more of it, because you have to tell the API in more detail what you actually want. Uploading the geometry also requires a command list there.)
So what the GPU actually has to do is exactly the same, regardless of whether you use DirectX 11 or DirectX 12. The differences are mainly on the CPU-side, as you can see.
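
To give an idea of what that omitted setup looks like, here is a rough DirectX 11-style sketch of creating the vertex and index buffers. The Vertex struct, the data arrays and the m_device member are assumptions for illustration, not code from the engine above:

// Hypothetical vertex layout; a real engine would define its own.
struct Vertex { float position[3]; float color[4]; };

Vertex vertices[24] = { /* cube vertices, omitted */ };
unsigned short indices[36] = { /* 12 triangles, omitted */ };

// Create the vertex buffer with the geometry as initial data.
D3D11_BUFFER_DESC vbDesc = {};
vbDesc.ByteWidth = sizeof(vertices);
vbDesc.Usage = D3D11_USAGE_DEFAULT;
vbDesc.BindFlags = D3D11_BIND_VERTEX_BUFFER;

D3D11_SUBRESOURCE_DATA vbData = {};
vbData.pSysMem = vertices;
m_device->CreateBuffer(&vbDesc, &vbData, &m_vertexBuffer);

// Create the index buffer (16-bit indices, matching DXGI_FORMAT_R16_UINT above).
D3D11_BUFFER_DESC ibDesc = {};
ibDesc.ByteWidth = sizeof(indices);
ibDesc.Usage = D3D11_USAGE_DEFAULT;
ibDesc.BindFlags = D3D11_BIND_INDEX_BUFFER;

D3D11_SUBRESOURCE_DATA ibData = {};
ibData.pSysMem = indices;
m_device->CreateBuffer(&ibDesc, &ibData, &m_indexBuffer);

In DirectX 12 the same step involves placing the data in an upload heap and recording a copy on a command list, which is where the extra code comes from.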

The same argument also extends to Vulkan. The code may look a bit different from what you’re doing in DirectX 12, but in essence, you’re still creating the same vertex and index buffers, and sending the same triangle list draw command to the GPU, rendering to a rendertarget and depth/stencil buffer.

So, what this means is that you do not really need to ‘design’ your hardware for DirectX 12 or Vulkan at all. The changes are mainly on the API side, and affect the workload of the CPU and driver, not the GPU. Which is also why DirectX 12 supports feature levels of 11_x: the API can also support hardware that pre-dates DirectX 12.
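
You can see this directly in the API: when creating a DirectX 12 device you specify a minimum feature level, and 11_0 is a perfectly valid choice. A minimal sketch, with adapter selection and error handling omitted (requires <d3d12.h>, <wrl/client.h> and d3d12.lib):

// Create a DirectX 12 device on DirectX 11-class hardware by asking for
// feature level 11_0 as the minimum.
Microsoft::WRL::ComPtr<ID3D12Device> device;
HRESULT hr = D3D12CreateDevice(
	nullptr,                   // nullptr: use the default adapter
	D3D_FEATURE_LEVEL_11_0,    // DirectX 11-class hardware is enough
	IID_PPV_ARGS(&device));

The feature level only describes what the hardware can do; the new API and driver model apply regardless of which feature level you end up with.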

Chickens and eggs

But, how exactly did these new hardware features arrive in the DirectX 11.3 and 12 APIs? And why exactly did this new programming model emerge in these new APIs?

The first thing to point out is that Microsoft does not develop hardware. This means that Microsoft can't just think up new hardware features out of the blue and hope that hardware will support them. For each update of DirectX, Microsoft will have meetings with the big players on the hardware market, such as Intel, nVidia and AMD (and in recent years also Qualcomm, for mobile devices). These IHVs (Independent Hardware Vendors) give Microsoft input on what kind of features they would like to see in the next DirectX. Microsoft and the IHVs then standardize these features in a way that all of them can implement. So this is a somewhat democratic process.

Aside from IHVs, Microsoft also includes some of the larger game engine developers, to get more input from the software/API side of things. Together they will try to work out solutions to current problems and bottlenecks in the API, and also work out ways to include the new features presented by the IHVs. In some cases, the IHVs will think of ways to solve these bottlenecks by adding new hardware features. After all, game engines and rendering algorithms also evolve over time, as does the way they use the API and hardware. For example, in the early days of shaders, you wrote separate shaders for each material, and these used few parameters. These days, you tend to use the same shader for most materials, and only the parameters change. So at one point, switching between shaders efficiently was important, but now updating shader parameters efficiently is more important. Different requirements call for different APIs (so it's not that older APIs were 'bad'; they were just designed against different requirements, in a different era, with different hardware and different rendering approaches).
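
To make the shader example concrete, here is a rough DirectX 11-style sketch of the 'one shader, many parameter sets' approach. MaterialParams, m_sharedPixelShader, m_materialConstantBuffer, the material list and DrawGeometry are made-up names for illustration:

// One pixel shader shared by all materials; per-material data lives in a constant buffer.
struct MaterialParams
{
	float diffuseColor[4];
	float specularPower;
	float padding[3];    // constant buffer size must be a multiple of 16 bytes
};

m_immediateContext->PSSetShader(m_sharedPixelShader, nullptr, 0);
m_immediateContext->PSSetConstantBuffers(0, 1, &m_materialConstantBuffer);

for (const Material& material : materials)
{
	// No shader switch here; only the parameters are updated per material.
	m_immediateContext->UpdateSubresource(m_materialConstantBuffer, 0, nullptr,
		&material.params, 0, 0);

	DrawGeometry(material);    // issues the draw calls for this material
}

So the cost that matters in this style of rendering is the per-draw parameter update, not the shader switch.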

So, as you can see, it is a bit of cross-pollination between all the parties. Sometimes an IHV comes up with a new approach first, and it is included in the API. Other times, developers run into problems or bottlenecks first, and then the API is modified, and hardware redesigned, to make it work. Since not all hardware is equal, it is also often the case that one vendor already had a feature, while others had to modify their hardware to include it. So for one vendor the chicken came first; for others, the egg.

The development of the API is an iterative process, where the IHVs will work closely with Microsoft over a period of time, to make new hardware and drivers available for testing, and move towards a final state of the API and driver model for the new version of DirectX.

But what does it mean?

In short, it means it is pretty much impossible to say how much of a given GPU was ‘designed for API X or Y’, and how much of the API was ‘designed for GPU A or B’. It is a combination of things.

In DirectX 12/Vulkan, it seems clear that Rasterizer Ordered Views came from Intel, since they already had the feature on their DirectX 11 hardware. It looks like nVidia designed this feature into Maxwell v2 for the upcoming DX11.3 and DX12 APIs. AMD has yet to implement the feature.

Conservative rasterization is not entirely clear. nVidia was the first to market with the feature, in Maxwell v2. However, Intel followed not much later, and implemented it at Tier 3 level. So I cannot say with certainty whether it originated from nVidia or Intel. AMD has yet to implement the feature.

Asynchronous compute is a bit of a special case. The API does not consider it a specific hardware feature, and leaves it up to the driver how to handle multiple queues. The idea most likely originated from AMD, since they have had support for running graphics and compute at the same time since the first GCN architecture, and the only way to make good use of that is to have multiple command queues. nVidia added limited support in Maxwell v2 (they had asynchronous compute support in CUDA since Kepler, but they could not run graphics tasks in parallel), and more flexible/efficient support in Pascal. Intel has yet to support this feature (that is, they support code that uses multiple queues, but as far as I know, they cannot actually run graphics and compute tasks in parallel, so they cannot use it to improve performance by better ALU usage).
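
In DirectX 12 terms, 'using asynchronous compute' simply means creating a second command queue of the compute type next to the direct (graphics) queue, and submitting work to both; whether anything actually runs in parallel is up to the driver and the hardware. A minimal sketch, where m_device, the queue members, m_fence and fenceValue are assumed:

// Direct queue: accepts graphics, compute and copy command lists.
D3D12_COMMAND_QUEUE_DESC graphicsQueueDesc = {};
graphicsQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;
m_device->CreateCommandQueue(&graphicsQueueDesc, IID_PPV_ARGS(&m_graphicsQueue));

// Compute queue: compute and copy only. Work submitted here may overlap with
// the graphics queue, but the API makes no guarantee that it actually does.
D3D12_COMMAND_QUEUE_DESC computeQueueDesc = {};
computeQueueDesc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;
m_device->CreateCommandQueue(&computeQueueDesc, IID_PPV_ARGS(&m_computeQueue));

// Dependencies between the queues have to be expressed explicitly with fences:
m_graphicsQueue->Signal(m_fence.Get(), fenceValue);   // graphics marks a point in time
m_computeQueue->Wait(m_fence.Get(), fenceValue);      // compute waits for that point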

Also, you can compare performance differences from DirectX 11 to 12, or from OpenGL to Vulkan… but it is impossible to draw conclusions from these results. Is the DirectX 11 driver that good, or is the DirectX 12 engine that bad for a given game/GPU? Or perhaps the other way around? Was OpenGL that bad, and is the Vulkan engine that good for a given game/GPU?

Okay, but what about performance?

The main advantage of this new programming model in DirectX 12/Vulkan is also a potential disadvantage. I see a parallel with the situation of compilers versus hand-written assembly on the CPU: it is possible for an assembly programmer to outperform a compiler, but there are two main issues here:

  1. Compilers have become very good at what they do, so you have to be REALLY good to even write assembly that is on par with what a modern compiler will generate.
  2. Optimizing assembly code can only be done for one CPU at a time. Chances are that the tricks you use to maximize performance on CPU A will not work on CPU B, or will even hurt performance there. Writing code that works well on all CPUs is even more difficult. Compilers, however, are very good at this: they can easily optimize code for multiple CPUs, and even include multiple codepaths in the executable.

In the case of DirectX 12/Vulkan, you will be taking on the driver development team. In DirectX 11/OpenGL, you had the advantage that the low-level resource management, synchronization and so on were always done by the driver, which was optimized for a specific GPU by the people who built that GPU. So, as with compilers, you had a very good baseline of performance. As an engine developer, you have to design and optimize your engine very well before you get on par with these drivers (writing a benchmark that shows you can do more calls per second in DX12 than in DX11 is one thing; rendering an actual game more efficiently is another).

Likewise, because of the low-level nature of DirectX 12/Vulkan, you need to pay more attention to the specific GPUs and videocards you are targeting. The best way to manage your resources on GPU A might not be the best way on GPU B. Normally the driver would take care of it. Now you may need to write multiple paths, and select the fastest one for each GPU.
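
A crude but common way to make that selection is to inspect the DXGI adapter description and key on the vendor. A sketch: the adapter pointer and the ResourcePath enum are made up here, and the vendor IDs are the standard PCI IDs:

// Pick a resource-management path per vendor, based on the DXGI adapter description.
DXGI_ADAPTER_DESC adapterDesc;
adapter->GetDesc(&adapterDesc);

switch (adapterDesc.VendorId)
{
case 0x10DE:   // nVidia
	m_resourcePath = ResourcePath::NvidiaTuned;
	break;
case 0x1002:   // AMD
	m_resourcePath = ResourcePath::AmdTuned;
	break;
case 0x8086:   // Intel
	m_resourcePath = ResourcePath::IntelTuned;
	break;
default:
	m_resourcePath = ResourcePath::Generic;   // safe fallback for unknown hardware
	break;
}

In practice you would probably also take the architecture, driver version and amount of VRAM into account, but the principle stays the same: the engine now makes decisions that a DirectX 11 driver used to make for you.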

Asynchronous compute is especially difficult to optimize for. Running two things at the same time means you have to share your resources. If this is not balanced well, one task may starve the other of resources, and you may actually get lower performance than if you had just run the tasks one after another.

What makes it even more complicated is that this balance is specific not only to the GPU architecture, but even to the specific model of video card, to a certain extent. If we take the above example of rendering shadowmaps while doing a compute task (say postprocessing the previous frame)… What if GPU A renders shadowmaps quickly and compute tasks slowly, but GPU B renders the shadowmaps slowly and compute tasks quickly? This would throw off the balance. For example, once the shadowmaps are done, the next graphics task might require a lot of ALU power, and the compute task that is still running will be starving that graphics task.

And things like rendering speed depend on various factors, including the speed of the rasterizers relative to the VRAM bandwidth. So even if two videocards use the same GPU with the same rasterizers, variations in VRAM bandwidth could still disturb the balance.

On Xbox One and PlayStation 4, asynchronous compute makes perfect sense. You only have a single target to optimize for, so you can carefully tune your code for the best performance. On a Windows system however, things are quite unpredictable. Especially looking to the future. Even if you were to optimize for all videocards available today, that is no guarantee that the code will still perform well on future videocards.

So we will have to see what the future brings. Firstly, will engine developers actually be able to extract significant gains from this new programming model? Secondly, will these gains stand the test of time? As in, are these gains still available 1, 2 or 3 generations of GPUs from now, or will some code actually become suboptimal on future GPUs? Code which, when handled by an optimized ‘high-level’ driver such as in DirectX 11 or OpenGL, will actually be faster than the DirectX 12/Vulkan equivalent in the engine code? I think this is a more interesting aspect than which GPU is currently better for a given API.
