The past few days I’ve been programming some graphics… but not in the usual way… No, I had to get some things out of my system. I’ve had a lot of computers over the years, and I’ve always been interested in graphics, both in games and demos. As far as PCs go, I suppose I started at the beginning. My first PC had an 8088 processor and a Paradise PVC4 video chip, which was compatible with Hercules and CGA (and the Plantronics ColorPlus graphics card, sort of like ‘SuperCGA’). I started out with a monochrome Hercules monitor, and later got a colour CGA/EGA monitor for it. I skipped EGA, but got VGA at an early stage.
I’ve also been programming graphics for a long time now, but I started somewhat late in the SVGA era. I’ve done a bit of the classic mode 13h VGA stuff in DOS, but quickly moved to hi/truecolour stuff and Windows/DirectDraw. So as I grew up, I was intrigued by the various graphics tricks and things in games and demos, but never got to actually implement them myself, as they were a bit ‘before my time’. One thing that intrigued me in particular was fast 3d flatshaded polygon filling in the so-called ‘mode X’ of VGA.
Originally the PC was not my main system. We had a PC at home because my father used one. For gaming and demos I had a C64 and later an Amiga. Especially Amiga demos were incredibly impressive at the time. However, at some point, PCs started to catch up. CGA and EGA were pretty poor standards, but with VGA, the PC made a big step forward, not only in display quality, but also in performance.
Some early PC demos that really impressed me were UltraForce’s Vector Demo:
And Triton’s Crystal Dream:
They showed smoothly animated 3d objects, and played nice music, much like Amiga demos. Sure, there have been many great PC demos since then… but at the time these demos really stood out, as they ran smoothly even on a relatively modest PC (286/386SX), where other PC demos needed a high-end 486 to get decent framerates. I’m pretty sure that is because these particular demos make use of mode X to fill multiple pixels at a time.
An early 3d game that really impressed me as well, was F29 Retaliator:
Just like the aforementioned demos, this game really stood out in terms of speed. When I first got this game, I still had my first PC with the 8088 CPU at 9.54 MHz, and an 8-bit VGA card. Most 3d games were quite useless on it, but this game was very playable. In fact, it ran even better than the Amiga version. I wonder how they made it that fast. It could have used mode X, but then it would be a very early example of it, as it wasn’t a widely known trick until Michael Abrash published his mode X article in Dr. Dobb’s Journal of July 1991.
At any rate, such fast 3d on such modest hardware has always fascinated me, and I’ve always wanted to do an optimized mode X polygon routine myself. I have decided to finally give it a try. I realized, after doing a 3d renderer for the iPhone/iPad, that programming these days is mostly repeating the same trick. There was nothing fundamentally different to an iPhone/iPad compared to a regular desktop machine, especially not once I figured out how to write regular C++ for it, rather than Objective-C. The OS was very much like any flavour of *nix, and the graphics API was just a variation of OpenGL. I could just port most of my existing code over, and it worked just fine. So I figured it would be nice to work on something completely different for a change. It has been a while.
Right, so first I refreshed my memory on mode X. I googled some information. Funnily enough I landed on the same information I used years ago, when I got into DOS/VGA programming: tutorials from VLA, Asphyxia, Michael Abrash’s Black Book, the PC-GPE, and Ralf Brown’s Interrupt List. All this information is still widely available on the internet today.
So, to try and explain it in a nutshell: Mode 13h in VGA (the standard 320×200 256 colour mode) is a bit of a special case: it provides you with a linear framebuffer. Each pixel is stored as a single byte, allowing the framebuffer to be addressed as a simple byte-array (sometimes referred to as ‘chunky’ pixel mode). This byte array can be found at address A000:0000. Each byte represents the colour of that pixel as an index to the 256-entry palette. All other video modes are ‘planar’: the bits that make up the palette index for a pixel are spread out over ‘bitplanes’. E.g., for a 4-bit colour mode (16 colours), you have 4 bitplanes, each carrying one bit of each pixel’s palette index. So each byte in a bitplane contains one bit of 8 different pixels. The bits of the pixels are ‘rotated 90 degrees’ compared to a linear framebuffer, as it were:
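As a rough illustration of that rotation, here is a minimal sketch in C that spreads 8 chunky 4-bit pixels over 4 bitplane bytes. The function name `chunky_to_planar` is made up for this example, and it assumes the leftmost pixel lands in the most significant bit of each plane byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: spread 8 chunky 4-bit pixels over 4 bitplane bytes.
   Assumption: pixels[0] is the leftmost pixel and ends up in bit 7. */
void chunky_to_planar(const uint8_t pixels[8], uint8_t planes[4])
{
    for (unsigned p = 0; p < 4; ++p) {
        uint8_t byte = 0;
        for (unsigned i = 0; i < 8; ++i) {
            assert(pixels[i] < 16);   /* must be a 4-bit palette index */
            /* take bit p of each pixel and shift it in from the right */
            byte = (uint8_t)((byte << 1) | ((pixels[i] >> p) & 1u));
        }
        planes[p] = byte;             /* plane p holds bit p of all 8 pixels */
    }
}
```

Plane 0 ends up with the lowest bit of every pixel, plane 3 with the highest, which is why changing a single pixel touches all four planes.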
If mode 13h was planar, it would require 8 bitplanes, which would make it very inefficient to update each pixel. Instead, the hardware ‘chains’ the bitplanes together, making it appear as a linear framebuffer. The downside of this is that in chained mode, only 64kb of the total 256kb of VGA memory can be addressed (only the A000-segment). This is just enough to store a single 320×200 frame, but there is no room for double-buffering, or any related techniques such as scrolling or quick block image transfers (‘blitting’) from one part of the videomemory to the other.
One of the characteristics of mode X is that it disables the chained mode of mode 13h. Not only is this possible, but for some reason, instead of giving you 8 separate bitplanes, it results in 4 bitplanes, or actually ‘byteplanes’ for lack of a better word. That is, you get 4 different planes, where plane N stores each (X mod 4 = N)th pixel. So the first plane stores pixels 0, 4, 8 etc., the second plane stores pixels 1, 5, 9 etc. Each pixel is still stored as a single byte. This means it is still relatively easy to address individual pixels. You just need to select the correct plane based on the x-coordinate of your pixel (x mod 4).
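The addressing can be sketched in a few lines of C. This is only the arithmetic, under the assumption of a 320-pixel-wide screen (so 80 bytes per scanline in each plane); the names are invented for the example. On real hardware the mask value would be sent to the VGA sequencer's Map Mask register, and that port I/O is left out here.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed layout: 320 pixels per scanline over 4 planes = 80 bytes per line. */
enum { MODEX_WIDTH = 320, MODEX_PLANES = 4, MODEX_STRIDE = MODEX_WIDTH / MODEX_PLANES };

/* Plane that holds pixel x: x mod 4. */
unsigned modex_plane(unsigned x)
{
    return x & 3u;
}

/* Byte offset of pixel (x, y) inside its plane, relative to A000:0000. */
unsigned modex_offset(unsigned x, unsigned y)
{
    assert(x < MODEX_WIDTH);
    return y * MODEX_STRIDE + (x >> 2);
}

/* Write-enable mask with only the pixel's own plane selected (one bit per plane). */
uint8_t modex_plane_mask(unsigned x)
{
    return (uint8_t)(1u << modex_plane(x));
}
```

Pixels 4 to 7 of a scanline all share offset 1, and differ only in which plane is selected.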
As a side-effect, since each plane holds only every 4th pixel, a single frame now takes only 320×200/4 = 16000 bytes in each plane, so only 16000 addresses in segment A000. But the full 64kb of each plane can still be addressed through segment A000. And since you can also select the other planes, you can address 4*64kb, so you now have access to the full 256kb of VGA memory. That leaves room for four such frames, which allows you to do double-buffering in video memory, scrolling, and blitting.
Another characteristic of this unchained mode is that you can select multiple planes at a time to write to. So if for example you enable all 4 planes, writing a single byte results in the byte being written to all 4 planes, which means that you actually drew 4 consecutive pixels on screen, with the same colour. This is what allows you to write fast polygon fillers. It can be taken even further: you can write a word (2 bytes), and 8 consecutive pixels are drawn with a single operation. On a 386+, even a dword (4 bytes) can be done in a single operation, resulting in 16 pixels on screen.
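To make the multi-plane write concrete, here is a small software model in C. It is not hardware code: `UnchainedVga`, `vga_write` and `vga_pixel` are hypothetical names, the model holds just one 320×200 page, and `map_mask` stands in for the plane write-enable setting.

```c
#include <assert.h>
#include <stdint.h>

enum { PAGE_BYTES = 16000 };   /* one 320x200 page: 80 bytes x 200 lines per plane */

/* Software model of unchained VGA memory (assumption: one page only). */
typedef struct {
    uint8_t plane[4][PAGE_BYTES];
    uint8_t map_mask;          /* bit p set = plane p accepts CPU writes */
} UnchainedVga;

/* One CPU byte write: the same byte lands in every enabled plane. */
void vga_write(UnchainedVga *v, unsigned offset, uint8_t colour)
{
    assert(offset < PAGE_BYTES);
    for (unsigned p = 0; p < 4; ++p)
        if (v->map_mask & (1u << p))
            v->plane[p][offset] = colour;
}

/* Look up the colour that ends up on screen at (x, y). */
uint8_t vga_pixel(const UnchainedVga *v, unsigned x, unsigned y)
{
    return v->plane[x & 3u][y * 80u + (x >> 2)];
}
```

With `map_mask` set to 0x0F, a single `vga_write` colours four neighbouring pixels; with 0x06 it colours only the middle two of the group.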
Building the polygon filler
The next step is to apply this trick to a polygon filler. You will start by scan-converting the edges of the polygon, resulting in a startpoint and endpoint on each scanline. This span of consecutive pixels now has to be split up into three parts:
- Any ‘leading pixels’ before we reach the first pixel on plane 0
- A multiple of 4 pixels, which we can draw with all 4 planes enabled
- Any ‘trailing pixels’ from the last pixel on plane 3 up to the actual endpoint
Simple enough so far. There is a catch, though: switching which plane or planes to write to, is a relatively expensive operation. So we want to reduce the amount of switching to a minimum.
A naive approach to rendering the leading and trailing spans of pixels would be to write one pixel at a time, and switch the plane each time. Observe however that the leading and trailing runs are each shorter than four pixels, and each falls within a single group of four, so within planes 0 to 3 of one byte address. This means each run can always be written with a single byte, as long as the proper planes are enabled. This reduces the number of plane switches to a maximum of 3 per scanline.
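The three-way split could look something like the following C sketch. It is one possible way to do it, not necessarily the routine described here: `SpanParts` and `split_span` are invented names, the span is given as inclusive x-coordinates on one scanline, and columns are byte offsets within that scanline. A mask of 0 means that part is absent.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical result type: up to three writes for one scanline span. */
typedef struct {
    unsigned lead_column;  uint8_t lead_mask;   /* partial byte at the start */
    unsigned mid_column;   unsigned mid_count;  /* bytes with all 4 planes on */
    unsigned trail_column; uint8_t trail_mask;  /* partial byte at the end */
} SpanParts;

/* Split the inclusive pixel span [x0, x1] into leading, middle, trailing parts. */
SpanParts split_span(unsigned x0, unsigned x1)
{
    SpanParts s = {0, 0, 0, 0, 0, 0};
    assert(x0 <= x1);
    unsigned first = x0 >> 2, last = x1 >> 2;
    /* planes from (x0 mod 4) up to 3, and planes 0 up to (x1 mod 4) */
    uint8_t left  = (uint8_t)((0x0Fu << (x0 & 3u)) & 0x0Fu);
    uint8_t right = (uint8_t)(0x0Fu >> (3u - (x1 & 3u)));

    if (first == last) {              /* whole span fits in one byte column */
        s.lead_column = first;
        s.lead_mask = (uint8_t)(left & right);
        return s;
    }
    if (left != 0x0F) {               /* partial leading byte */
        s.lead_column = first;
        s.lead_mask = left;
        ++first;
    }
    if (right != 0x0F) {              /* partial trailing byte */
        s.trail_column = last;
        s.trail_mask = right;
        --last;
    }
    s.mid_column = first;             /* what remains is full 4-pixel groups */
    s.mid_count = last + 1u - first;
    return s;
}
```

For the span from x = 2 to x = 9 this gives a leading mask of 0x0C (planes 2 and 3), one full byte, and a trailing mask of 0x03 (planes 0 and 1).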
Can we reduce the amount of switching even further? Yes. There are only 16 different possibilities, for any combination of planes. And in our case, not even all of them will occur. Namely, each scanline will always have a consecutive run of pixels. Which means that you will never run into a case where planes 0 and 2 are enabled, but planes 1 and 3 aren’t, for example. We can reduce the total set of cases to the following subset:
Now, if we don’t draw each scanline immediately, but instead buffer each part of the scanline and sort them according to the above cases, we can draw any polygon (or set of polygons) with only 10 switches for the entire screen.
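The count of 10 can be checked with a short C sketch: it enumerates the 4-bit plane masks whose set bits form one consecutive run, and then shows how sorting buffered writes by mask cuts down the number of switches. All names here (`PlaneWrite`, `sort_by_mask`, and so on) are hypothetical, and the counting sort is just one simple way to do the bucketing.

```c
#include <assert.h>
#include <stdint.h>

/* Nonzero if the set bits of a 4-bit plane mask form one consecutive run. */
int is_consecutive_run(unsigned mask)
{
    mask &= 0x0Fu;
    if (mask == 0) return 0;
    while ((mask & 1u) == 0) mask >>= 1;   /* drop the zeros below the run */
    return (mask & (mask + 1u)) == 0;      /* the rest must look like 0..01..1 */
}

/* Collect all such masks in ascending order; returns how many there are. */
unsigned collect_span_masks(uint8_t out[16])
{
    unsigned n = 0;
    for (unsigned m = 1; m < 16; ++m)
        if (is_consecutive_run(m)) out[n++] = (uint8_t)m;
    return n;
}

/* A buffered write: byte offset in the plane plus the planes to enable. */
typedef struct { unsigned offset; uint8_t mask; } PlaneWrite;

/* Number of plane-mask changes needed to issue the writes in this order. */
unsigned count_switches(const PlaneWrite *w, unsigned n)
{
    unsigned switches = 0, current = 0;    /* 0 = nothing selected yet */
    for (unsigned i = 0; i < n; ++i)
        if (w[i].mask != current) { current = w[i].mask; ++switches; }
    return switches;
}

/* Stable counting sort of the buffered writes by plane mask. */
void sort_by_mask(const PlaneWrite *in, PlaneWrite *out, unsigned n)
{
    unsigned start[17] = {0};
    for (unsigned i = 0; i < n; ++i) start[(in[i].mask & 0x0Fu) + 1u]++;
    for (unsigned m = 1; m < 17; ++m) start[m] += start[m - 1];
    for (unsigned i = 0; i < n; ++i) out[start[in[i].mask & 0x0Fu]++] = in[i];
}
```

Drawn in scanline order, two scanlines with leading, middle and trailing parts need 6 switches; after sorting they need 3, and no amount of extra scanlines pushes that past 10.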
So what else is there?
Now that I had my mode X polygon routine figured out, I thought it would be a shame to just stop at VGA. I had never done much with real planar displays. And now that I had this VGA routine, I thought it would not be all that difficult to modify it to work on EGA as well, and perhaps even CGA. It would be nice to truly understand all that hardware that I used all those years ago. The main idea would be quite similar: you can draw a consecutive run of 8 pixels at a time by writing byte 0xFF to a bitplane. And the remaining leading and trailing pixels can be drawn in a similar way as well, with some bit masking.
I wanted to start with EGA first. But I found remarkably little information on EGA programming at first. I did find some helpful information on CGA though, so I decided to start there first.
On to CGA it is then
The CGA video chip turned out to be remarkably simple. It has only two graphics modes: a 320×200 mode with 4 colours and a 640×200 monochrome mode. It has only 16kb of video memory, and as such, there isn’t any room for double buffering or anything else. As simple as it is, it is a tad quirky though. Namely, it uses two bitplanes, as one would expect with 4 colours. However, they are not used in the ‘90 degree rotated’ fashion I described above. Instead, the two bits of each pixel are grouped together, so 4 adjacent pixels are packed into each byte of a bitplane. So this can be seen as a linear or ‘chunky’ pixel arrangement. Where does the second bitplane come in then? Well, the display is split up in even and odd scanlines. Plane 0 stores the even scanlines, and plane 1 stores the odd scanlines. Plane 1 is not positioned directly after plane 0 in memory either. Plane 0 starts at address B800:0000, and takes 320×100/4 bytes (100 scanlines of 80 bytes each). That is 8000 bytes. Plane 1 is at an 8kb offset however, so 8192 bytes, in other words: at address BA00:0000.
CGA does not offer any hardware functionality to select which plane or which pixel to access, so everything must be done on the CPU. So to draw a pixel, the CPU must read the entire byte, use bitmasking to replace only the bits for that pixel, and then write back the byte. There is a big similarity with unchained VGA here: a byte contains 4 pixels, so you can write 4/8/16 pixels at a time by writing a single byte/word/dword.
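A minimal sketch of that read-modify-write for the 320×200 4-colour mode, in C. It works on an ordinary 16kb array standing in for the memory at B800:0000, the names are invented for the example, and it assumes the leftmost pixel of each byte sits in the two highest bits.

```c
#include <assert.h>
#include <stdint.h>

enum { CGA_BYTES_PER_LINE = 80, CGA_ODD_OFFSET = 0x2000, CGA_MEM_BYTES = 0x4000 };

/* Byte offset of pixel (x, y): odd scanlines live 8192 bytes further on. */
unsigned cga_offset(unsigned x, unsigned y)
{
    return (y & 1u) * CGA_ODD_OFFSET + (y >> 1) * CGA_BYTES_PER_LINE + (x >> 2);
}

/* Plot one pixel by read-modify-write on the byte that contains it. */
void cga_plot(uint8_t *vram, unsigned x, unsigned y, unsigned colour)
{
    assert(x < 320 && y < 200 && colour < 4);
    unsigned shift = 6u - 2u * (x & 3u);   /* assumption: leftmost pixel in bits 7..6 */
    uint8_t *p = &vram[cga_offset(x, y)];
    /* clear the two bits of this pixel, then put the new colour in */
    *p = (uint8_t)((*p & ~(3u << shift)) | (colour << shift));
}
```

A filled run of four aligned pixels skips the masking entirely: the colour is simply replicated into all four bit pairs of the byte (0x55 for colour 1, 0xAA for colour 2, 0xFF for colour 3).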
Only EGA remains
There isn’t much more to CGA than that, so it is time to give EGA another try. This time I stumbled upon an obscure page on the internet, which had links to the actual IBM manuals from 1984, including the programmers reference for CGA and EGA. Great, this was exactly what I was looking for!
While reading through the EGA manual, it dawned on me just how similar EGA and VGA really are, especially in the unchained mode. All of the registers that I had to tweak in VGA to get unchained mode and double buffering were actually EGA registers. EGA also has its frame buffer at A000:0000, and since EGA comes with 64kb of memory, there is enough memory for 2 buffers of 320×200 in 16-colour mode (4 bits per pixel). So double buffering works exactly the same way as it does on VGA in mode X.
EGA has a ‘classic’ planar configuration, the ’90 degree rotated’ system I mentioned earlier, with one bitplane for each bit. So just like VGA’s mode X, there are 4 bitplanes here. EGA uses the same register to control which bitplanes to write to as mode X. However, since the pixels are oriented differently with EGA, it makes more sense to use different modes.
Namely, if you want to draw a pixel in planar mode, you need to update all planes. If you use the same method as mode X, then you need two plane selections, and two writes, per pixel: the palette index is a 4-bit number which can contain both ‘0’ and ‘1’ bits. You need to write the bits of this palette index to the bitplanes, but you can only access one byte at a time. So you need to write a 0 in all bitplanes corresponding to the ‘0’ values in the palette index. And you have to write a 1 in the bitplanes corresponding to the ‘1’ values.
But, EGA has more tricks up its sleeve: there is also a mode where it splits up the byte over the bitplanes for you. So if you write a byte to an address, it will write bit 0 to plane 0, bit 1 to plane 1, and so on. This way you only need a single write to update the bitplanes. What’s more: you don’t have to swap modes every time. With the earlier method, the selection of planes was dependent on the colour you used. With this method, you can leave the EGA card in the same mode all the time: all planes enabled. Sadly this mode does not appear to work on more than one pixel at a time, so using word or dword writes to speed up rendering even further does not seem to be an option.
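The difference between the two methods can be modelled in C on a single byte position (8 pixels, one byte in each of the 4 planes). Both functions below are hypothetical models that fill all 8 pixels with one colour. The second one assumes that the splitting mode replicates each colour bit across the whole plane byte, which is how IBM's documentation describes what it calls write mode 2.

```c
#include <assert.h>
#include <stdint.h>

/* Method 1: plane selection depends on the colour, two passes per write. */
void fill_byte_two_pass(uint8_t planes[4], uint8_t colour)
{
    assert(colour < 16);
    uint8_t ones  = (uint8_t)(colour & 0x0Fu);    /* planes that must get 1-bits */
    uint8_t zeros = (uint8_t)(~colour & 0x0Fu);   /* planes that must get 0-bits */
    for (unsigned p = 0; p < 4; ++p)              /* pass 1: enable 'ones', write 0xFF */
        if (ones & (1u << p)) planes[p] = 0xFF;
    for (unsigned p = 0; p < 4; ++p)              /* pass 2: enable 'zeros', write 0x00 */
        if (zeros & (1u << p)) planes[p] = 0x00;
}

/* Method 2: all planes stay enabled, the hardware splits the colour bits. */
void fill_byte_split(uint8_t planes[4], uint8_t colour)
{
    assert(colour < 16);
    for (unsigned p = 0; p < 4; ++p)
        planes[p] = ((colour >> p) & 1u) ? 0xFF : 0x00;  /* bit p goes to plane p */
}
```

Both end up with the same plane contents for every colour; the second simply gets there with one write and no change of plane selection.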
Now let’s look at updating single pixels inside a byte. The problem with the planar videomemory access is that you cannot read from multiple planes at a time. Since you can only access a plane byte-by-byte, you would need to read a byte, update only the bits you are interested in, and write it back, much like how I described with CGA earlier. However, with EGA, you would have to select each of the 4 planes separately, and do the read-modify-write operation on all of them, to update the entire pixel.
Luckily EGA has a trick for that as well: for each bitplane, there is a data latch in the graphics processor. Each latch stores the value of the last byte that was read by the CPU. The graphics processor has a simple ALU which can do some basic logical operations between the latch and the byte that is sent from the CPU. The ALU has another register, which contains a bitmask. With all this combined, updating pixels becomes reasonably easy:
- Set a bitmask with 1-bits for each pixel in the byte that you want to update
- Read the byte from the current bitplane with the CPU, so that the latches contain the byte you want to modify (we are not interested in the actual value, so it doesn’t matter which bitplane is selected for reading, the read access is merely to trigger the latches).
- Write a byte with the value you want.
The graphics processor will then take the byte you wrote, AND it with the mask you set, then OR it together with the latched byte, after applying the reverse mask to that. Something like this:
result = (input AND mask) OR (latch AND NOT mask);
So now the graphics processor does the read-modify-write for you, and it does it on all bitplanes that have been selected for writing at the same time. Combined with the write mode discussed earlier, where you write a 4-bit palette index, which is automatically split up, you can now plot pixels quite easily: only the bitmask needs to be updated to select the pixel or pixels you want to plot. Then reading and writing a single byte is enough to update all bitplanes.
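Put together, the plotting sequence can be modelled in software like this. It is a sketch of the behaviour described above, not of the actual register programming: `EgaModel` and its functions are hypothetical, the layout assumes 320×200 in 16 colours (40 bytes per scanline per plane), and the leftmost pixel is taken to be bit 7 of each byte.

```c
#include <assert.h>
#include <stdint.h>

enum { EGA_STRIDE = 40, EGA_PLANE_BYTES = 40 * 200 };

/* Software model: 4 bitplanes, one latch per plane, and the bit mask. */
typedef struct {
    uint8_t plane[4][EGA_PLANE_BYTES];
    uint8_t latch[4];
    uint8_t bit_mask;
} EgaModel;

/* A CPU read fills all four latches, whichever plane the CPU gets to see. */
uint8_t ega_read(EgaModel *e, unsigned offset)
{
    assert(offset < EGA_PLANE_BYTES);
    for (unsigned p = 0; p < 4; ++p) e->latch[p] = e->plane[p][offset];
    return e->latch[0];
}

/* A CPU write of a palette index: result = (input AND mask) OR (latch AND NOT mask). */
void ega_write_colour(EgaModel *e, unsigned offset, uint8_t colour)
{
    assert(offset < EGA_PLANE_BYTES && colour < 16);
    for (unsigned p = 0; p < 4; ++p) {
        uint8_t input = ((colour >> p) & 1u) ? 0xFF : 0x00;  /* colour bit p, spread out */
        e->plane[p][offset] =
            (uint8_t)((input & e->bit_mask) | (e->latch[p] & ~e->bit_mask));
    }
}

/* Plot: set the mask, read to load the latches, write the colour. */
void ega_plot(EgaModel *e, unsigned x, unsigned y, uint8_t colour)
{
    unsigned offset = y * EGA_STRIDE + (x >> 3);
    e->bit_mask = (uint8_t)(0x80u >> (x & 7u));
    (void)ega_read(e, offset);          /* value unused, only the latches matter */
    ega_write_colour(e, offset, colour);
}

/* Reassemble the palette index of pixel (x, y) from the four planes. */
uint8_t ega_pixel(const EgaModel *e, unsigned x, unsigned y)
{
    unsigned offset = y * EGA_STRIDE + (x >> 3), shift = 7u - (x & 7u);
    uint8_t colour = 0;
    for (unsigned p = 0; p < 4; ++p)
        colour |= (uint8_t)(((e->plane[p][offset] >> shift) & 1u) << p);
    return colour;
}
```

Setting `bit_mask` to a run of bits instead of a single one plots several neighbouring pixels with the same read and write, which is the EGA counterpart of the leading and trailing masks in the mode X filler.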
These latches can also be used for other tricks. By setting the ALU to write only the latch value and ignore any data from the CPU, it is possible to do efficient blitting: you read from the source byte, so the latches contain 4 bytes of source data. Then you write to the destination byte. The graphics processor will then write the 4 latched bytes to the target address in all 4 bitplanes.
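The latch copy can be modelled the same way. In this C sketch `LatchModel` and `latch_copy` are made-up names, and the model simply assumes the card has been set up so that a write stores the latch contents and ignores the CPU data.

```c
#include <assert.h>
#include <stdint.h>

enum { LATCH_PLANE_BYTES = 8000 };

/* Software model: 4 planes plus one latch per plane. */
typedef struct {
    uint8_t plane[4][LATCH_PLANE_BYTES];
    uint8_t latch[4];
} LatchModel;

/* Copy 'count' byte positions from src to dst through the latches. */
void latch_copy(LatchModel *m, unsigned src, unsigned dst, unsigned count)
{
    assert(src + count <= LATCH_PLANE_BYTES && dst + count <= LATCH_PLANE_BYTES);
    for (unsigned i = 0; i < count; ++i) {
        /* CPU read at the source: all 4 latches are loaded at once */
        for (unsigned p = 0; p < 4; ++p) m->latch[p] = m->plane[p][src + i];
        /* CPU write at the destination: the latches are stored, CPU data is ignored */
        for (unsigned p = 0; p < 4; ++p) m->plane[p][dst + i] = m->latch[p];
    }
}
```

Each read and write pair moves 4 bytes, so 8 pixels in a 16-colour mode, while the CPU itself only handles single bytes.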
EGA has some other features, like a logical colour compare mode, and tricks like horizontal scrolling, or even resetting the bitplane pointers at a given point, so that you can have one part of the screen scrolling vertically, while the other remains static (such as for a HUD interface). All in all, it’s a deceptively complex chip. At any rate, I’ve had some fun playing around with it.
Well, so much for PCs then… I now understand the basics of CGA, EGA and VGA programming at the hardware level. Time to dive into the hardware of some other machines I used in the past. I am currently looking into the Commodore Amiga. It is the machine I learnt some of my first graphics programming on, but at the time I used high-level languages and tools, since I wasn’t very good at assembly or C yet. So my knowledge of the hardware was rather superficial. But it was the machine that really got me hooked on graphics and the demoscene in general. And another machine that I’d like to get into at some point is the C64. Also a machine that I used a lot in the old days, with some great games and demos. Two machines that meant a lot to me in the early days, and I’d like to pay tribute to them now.