Today I want to talk about a rather obscure, yet interesting demo, namely CGADEMO by Codeblasters, from 1992:
As you can read from the scroller, what’s interesting about this demo is that it runs at full framerate (60 Hz) even on the original IBM PC (8088 at 4.77 MHz with CGA). And that there are 16 colours on screen at the same time.
To start with the 16 colours… They use a trick similar to my palette switching in the 1991 donut. Namely, they change the background colour of the CGA palette at every scanline, which gives a rasterbar effect. They can pick any of the 16 CGA colours for the background, which allows them to show all 16 colours on screen at a time. This is very similar to what I have discussed on C64. This demo does not use a stable raster however, since that is very difficult to achieve on a PC anyway. Instead, they use polling of the ‘display enable’ status bit to determine when a scanline is finished (and the rasterbeam enters the horizontal retrace). The downside of this technique is that you’re burning up valuable CPU time in the polling loop, which you could have used for other effects.
When I discussed the rasterbars in INTROjr, I said this:
“In INTROjr, the rasterbars may not be perfectly stable, but that is understandable, given that the CPU is far too slow to do any remotely accurate polling for the horizontal blank interval.”
Was I right? Well, I was… but I wasn’t… Namely, with CGADEMO, this happens with the rasterbars on the left of the screen:
Visible jitter. Sometimes the polling is so slow that the background colour is switched in the visible part of the scanline, so you see the old colour on the left. I figured that this would happen on a PC. However, when I wrote the above, I thought it would be worse than it actually is: on a C64 you have very fast access to the memory-mapped registers of the VIC-II chip, and I figured that communicating with the CRT controller on a CGA card with port I/O over an 8-bit ISA bus would be much slower. It’s actually not even that bad. You might not even notice at first.
Trixter originally sent me the CGADEMO binary, and he mentioned something about removing the noise. So I figured I’d have a look at the code myself, and see what he was talking about.
After reverse-engineering the code a bit, I isolated the rasterbar code:
waitHBL1:                       ; CODE XREF: start_0+70j start_0+7Fj
        in      al, dx          ; Video status bits:
                                ; 0: retrace. 1=display is in vert or horiz retrace.
                                ; 1: 1=light pen is triggered; 0=armed
                                ; 2: 1=light pen switch is open; 0=closed
                                ; 3: 1=vertical sync pulse is occurring.
        test    al, 1
        jnz     short waitHBL1
waitHBL2:                       ; CODE XREF: start_0+75j
        in      al, dx          ; Video status bits:
                                ; 0: retrace. 1=display is in vert or horiz retrace.
                                ; 1: 1=light pen is triggered; 0=armed
                                ; 2: 1=light pen switch is open; 0=closed
                                ; 3: 1=vertical sync pulse is occurring.
        test    al, 1
        jz      short waitHBL2
        mov     dx, 3D9h
        lodsb
        out     dx, al
        mov     dx, 3DAh
        loop    waitHBL1
The first HBL-loop waits until a new scanline starts. The second HBL-loop waits until the scanline ends, and we enter the horizontal retrace/blank interval. This gives us a small period of time to modify the background colour before the new scanline starts drawing, so it is important to change it as soon as possible. You could still insert some other effect code between the two loops. Or, if you are not concerned about the code running on machines that are ‘too fast’, you could remove the first loop altogether, provided your effect code guarantees that by the time you enter the polling loop for start-of-hbl, you are no longer in the previous hbl.
The problem here is the ‘lodsb’: the rasterbar colours are pre-calculated in a table, and loaded at the very last moment to change the colour (the out dx, al immediately after it). On an 8088, memory access is very slow: each byte takes 4 cycles to transfer over the bus (this goes for both instructions and data). While lodsb in itself is a very efficient instruction, it is not the best choice in this particular case, because we have plenty of time to load from memory while waiting for our scanline to finish drawing. If we perform the memory access there, we save those 4 cycles later, where they count.
But there is a problem: for port numbers above 0FFh, such as the CGA registers, both the in and out instructions are hardwired to use dx for the port and al (or ax) for the data. This is the sort of thing that makes so many assembly programmers hate x86. Most other systems don’t hardwire registers to instructions like that, and/or just use memory-mapped I/O, so you don’t need special instructions to access hardware registers. So, we can’t just move the lodsb out of the loop, since al will get overwritten by the in. What we can do however, is to store the colour in a different register, and then just copy it to al before the out.
So, you figure… bx is still free, shall we use that? Just mov al, bl and that’s that? Well, then the joke is on you: the shortest form of a mov-instruction is still 2 bytes, where lodsb was only 1 byte. So you are only trading the memory access for the byte you wanted to read for the fetch of an additional instruction byte. There is a ‘loophole’ however: some instructions on x86 have a special short 1-byte encoding for the accumulator register. xchg is such an instruction. It only exists for ax though, not for al. But that doesn’t matter: we can just do xchg ax, bx.
I noticed another place where we can have a small gain however: test al, 1. Although we cannot shave a byte off this instruction… what we CAN do is to replace it with test al, ah. Using a register takes one less cycle than using an immediate operand. We are not using ah anyway, except that it gets swapped with bh as part of the xchg ax, bx. So we can easily load it with 1, and use the slightly faster test al, ah.
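To put some numbers on the last few paragraphs, here is a sketch of the candidate instructions with their encodings. The opcode bytes are what I would expect a typical assembler to emit (some of these instructions have two equivalent encodings, so your assembler may pick the other one), and the clock counts for test are the nominal figures from Intel's published 8086/8088 timings, which do not include the time needed to fetch the instruction bytes themselves:

```asm
        lodsb                   ; AC      1 byte, plus 1 data byte read from ds:si
        mov     al, bl          ; 8A C3   2 bytes, no data access (88 D8 is equivalent)
        xchg    ax, bx          ; 93      1 byte, no data access
        test    al, 1           ; A8 01   2 bytes, 4 clocks (immediate with accumulator)
        test    al, ah          ; 84 E0   2 bytes, 3 clocks (register with register;
                                ;         84 C4 is an equivalent encoding)
```

On an 8088 every one of those bytes costs a 4-cycle bus access, which is why swapping the 1-byte lodsb for a 2-byte mov gains nothing, while the 1-byte xchg does.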
So my polling loop now looks like this:
        mov     bh, 1
loopHBL:
        lodsb
        xchg    ax, bx
waitHBL1:                       ; CODE XREF: start_0+70j start_0+7Fj
        in      al, dx          ; Video status bits:
                                ; 0: retrace. 1=display is in vert or horiz retrace.
                                ; 1: 1=light pen is triggered; 0=armed
                                ; 2: 1=light pen switch is open; 0=closed
                                ; 3: 1=vertical sync pulse is occurring.
        test    al, ah
        jnz     short waitHBL1
waitHBL2:                       ; CODE XREF: start_0+75j
        in      al, dx          ; Video status bits:
                                ; 0: retrace. 1=display is in vert or horiz retrace.
                                ; 1: 1=light pen is triggered; 0=armed
                                ; 2: 1=light pen switch is open; 0=closed
                                ; 3: 1=vertical sync pulse is occurring.
        test    al, ah
        jz      short waitHBL2
        mov     dx, 3D9h
        xchg    ax, bx
        out     dx, al
        mov     dx, 3DAh
        loop    loopHBL
The few cycles we save are enough to move the jitter of the colour change outside of the visible area. Not the same approach as the stable raster I demonstrated on C64 earlier, but the result looks just as good in practice. So I have to correct myself on what I said earlier: even the slowest PC *is* fast enough to perform accurate enough rasterbars (but mind you, INTROjr runs on a PCjr, which is even slower than a regular PC, since it shares videomemory with the CPU, causing extra waitstates on reading instructions/data from memory).
A second problem I noticed was with the scroller:
What is happening here? Well, I have discussed ‘racing the beam’ earlier. In this case, the scroller is losing the race to the beam. You can see the part where the beam overtakes the scroll routine.
Trixter says the scroller looks fine on his real IBM PC 5150 with real IBM CGA. My PC is a clone, a Philips P3105 with an ATi Small Wonder CGA-compatible videocard. It does have an original Intel 8088 CPU at 4.77 MHz. But apparently the system as a whole is even slower than the real IBM PC. I think the Small Wonder is the culprit.
As you may know, there is no scrolling hardware whatsoever on a CGA card, so what we have here is a bruteforce software scroller. Just a memcpy() routine basically. The original code looks like this:
doScroll        proc near               ; CODE XREF: start_0+A0p
                mov     ax, 0B800h
                mov     ds, ax
                assume ds:nothing
                mov     es, ax
                mov     si, 1A91h
                mov     di, 1A90h
                mov     cx, 4Fh
                rep movsb
                ...
                retn
doScroll        endp
The problem here is rep movsb: the 8088 CPU executes this literally. So it performs 79 movsb instructions in a loop, moving 1 byte at a time.
However, the 8088 is a 16-bit CPU (it is basically an 8086 on an 8-bit data bus), so it also has the movsw instruction, which moves a word (2 bytes) at a time. If we use that, we only need half the iterations, and this cuts some of the extra overhead, so we save some valuable cycles. Why did they choose rep movsb? Not sure. Perhaps they assumed there would be no performance difference, since the 8088 has an 8-bit data bus (they also used rep movsb to copy the logo to videoram)… Or they used it because 79 is an odd number. But that is easy enough to solve: Just move 78 bytes with 39 movsw instructions, and then add an extra movsb for the 79th byte:
doScroll        proc near               ; CODE XREF: start_0+A0p
                mov     ax, 0B800h
                mov     ds, ax
                assume ds:nothing
                mov     es, ax
                mov     si, 1A91h
                mov     di, 1A90h
                mov     cx, 27h
                rep movsw
                movsb
                ...
                retn
doScroll        endp
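As a rough sanity check on the saving, here are the two copy loops side by side, annotated with nominal clock counts. The figures are assumptions taken from Intel's published 8086 timings (9 clocks of setup plus 17 per iteration for a repeated movs, 18 for a single one), with the documented 8088 penalty of 4 extra clocks for each 16-bit transfer (a movsw does two: one read, one write). They ignore CGA wait states, DRAM refresh and instruction fetch, so the real numbers will be higher on both sides:

```asm
; original: 79 byte moves
        mov     cx, 4Fh
        rep movsb               ; 9 + 79*17       = 1352 clocks

; improved: 39 word moves, then 1 byte move
        mov     cx, 27h
        rep movsw               ; 9 + 39*(17+4+4) =  984 clocks
        movsb                   ; 18 clocks, for a total of 1002
```

By that estimate the word version needs roughly three quarters of the time. The bus still has to move the same number of bytes, so the gain comes purely from halving the per-iteration overhead.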
This bought me just enough time to keep the scroller ahead of the beam (an alternative solution would be to move the scroller position down a few scanlines, to buy some extra time. Or, the area of rasterbars could be reduced by 1 or 2 scanlines. But luckily that was not required, so the improved version looks exactly like the original).
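The split into 39 words plus one byte is hardcoded above because the length is a constant. For a length that is only known at run time, a commonly used idiom performs the same split via the carry flag. This is a sketch and not code from the demo; it assumes the byte count is already in cx, that ds:si and es:di are set up as in doScroll, and that the direction flag is clear:

```asm
        shr     cx, 1           ; cx = number of words, CF = leftover byte (0 or 1)
        rep movsw               ; copy the words; string moves do not change the flags
        adc     cx, cx          ; cx is 0 after the rep, so this sets cx = CF
        rep movsb               ; copies the odd byte, or nothing when cx = 0
```

With a count of 79 this does exactly what the hand-written version does: 39 word moves followed by a single byte move.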
I have done some other small tweaks and modifications to the source code. I have restructured some code, to avoid some jmps over data and such. I have also removed some code and data that was duplicated or unused. So the improved version is not only faster, it is also some 400 bytes smaller than the original. If you are interested, you can download the original, the improved version, and my reverse-engineered/semi-commented source code here.