Today I want to talk about a rather obscure, yet interesting demo, namely CGADEMO by Codeblasters, from 1992:
As you can read from the scroller, what’s interesting about this demo is that it runs at full framerate (60 Hz) even on the original IBM PC (8088 at 4.77 MHz with CGA), and that it shows 16 colours on screen at the same time.
To start with the 16 colours… They use a trick similar to my palette switching in the 1991 donut. Namely, they change the background colour of the CGA palette at every scanline, which gives a rasterbar effect. They can pick any of the 16 CGA colours for the background, which allows them to show all 16 colours on screen at a time. This is very similar to what I have discussed on C64. This demo does not use a stable raster however, since that is very difficult to achieve on a PC anyway. Instead, they use polling of the ‘display enable’ status bit to determine when a scanline is finished (and the rasterbeam enters the horizontal retrace). The downside of this technique is that you’re burning up valuable CPU time in the polling loop, which you could have used for other effects.
When I discussed the rasterbars in INTROjr, I said this:
In INTROjr, the rasterbars may not be perfectly stable, but that is understandable, given that the CPU is far too slow to do any remotely accurate polling for the horizontal blank interval
Was I right? Well, I was… but I wasn’t… Namely, with CGADEMO, this happens with the rasterbars on the left of the screen:
Visible jitter. Sometimes the polling is so slow that the background colour is switched in the visible part of the scanline, so you see the old colour on the left. I figured that this would happen on a PC. However, when I wrote the above, I thought it would be worse than it actually is. On a C64 you have very fast access to the memory-mapped registers of the VIC-II chip, and I figured that communicating with the CRT controller on a CGA card with port I/O over an 8-bit ISA bus would be much slower. It’s actually not even that bad. You might not even notice at first.
Trixter originally sent me the CGADEMO binary, and he mentioned something about removing the noise. So I figured I’d have a look at the code myself, and see what he was talking about.
After reverse-engineering the code a bit, I isolated the rasterbar code:
    waitHBL1:                       ; CODE XREF: start_0+70j start_0+7Fj
        in      al, dx              ; Video status bits:
                                    ; 0: retrace. 1=display is in vert or horiz retrace.
                                    ; 1: 1=light pen is triggered; 0=armed
                                    ; 2: 1=light pen switch is open; 0=closed
                                    ; 3: 1=vertical sync pulse is occurring.
        test    al, 1
        jnz     short waitHBL1
    waitHBL2:                       ; CODE XREF: start_0+75j
        in      al, dx              ; Video status bits:
                                    ; 0: retrace. 1=display is in vert or horiz retrace.
                                    ; 1: 1=light pen is triggered; 0=armed
                                    ; 2: 1=light pen switch is open; 0=closed
                                    ; 3: 1=vertical sync pulse is occurring.
        test    al, 1
        jz      short waitHBL2
        mov     dx, 3D9h
        lodsb
        out     dx, al
        mov     dx, 3DAh
        loop    waitHBL1
The first HBL-loop waits until a new scanline starts. The second HBL-loop waits until the scanline ends, and we enter the horizontal retrace/blank interval. This gives us a small period of time to modify the background colour before the new scanline starts drawing. So it is important to change it as soon as possible (you could still insert some other effect code between the two loops… or if you are not concerned about the code running on machines that are ‘too fast’, you could remove the first loop altogether, if you can assume that your effect code will make sure that by the time you enter the polling loop for start-of-hbl, you will not still be in the previous hbl).
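To see why the first loop matters, here is a small Python model of the two polling loops (Python only because it is easy to run; the sample trace and the name `next_retrace_start` are made up for illustration, not taken from the demo). It treats bit 0 of the status port as a list of samples and returns the index of the sample at which the colour write would happen.

```python
def next_retrace_start(status, pos, wait_for_display_first=True):
    """Return the sample index at which the polling code would write the colour.

    status: sequence of 0/1 samples of bit 0 of port 3DAh (1 = in retrace).
    pos:    index of the first sample the code gets to read.
    """
    if wait_for_display_first:
        # waitHBL1: spin while the bit is still 1 (leftover from the previous retrace)
        while status[pos] & 1:
            pos += 1
    # waitHBL2: spin while the bit is 0 (scanline is being drawn)
    while not (status[pos] & 1):
        pos += 1
    return pos


# Hypothetical trace: 3 samples of retrace followed by 5 samples of display, repeated.
trace = [1, 1, 1, 0, 0, 0, 0, 0] * 3

# Entering in the middle of a retrace (index 1):
with_first = next_retrace_start(trace, 1)             # 8: start of the next retrace
without_first = next_retrace_start(trace, 1, False)   # 1: fires immediately, still in the old retrace

# Entering while the scanline is being drawn (index 4): both variants agree.
during_display = next_retrace_start(trace, 4, False)  # 8
```

With the first loop removed, the code is only correct if whatever runs between two colour changes always takes long enough to get past the previous retrace, which is the assumption mentioned above.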
The problem here is the ‘lodsb’: the rasterbar colours are precalculated in a table, and loaded at the very last moment to change the colour (the out dx, al immediately after it). On an 8088, memory access is very slow. Each byte takes 4 cycles to load (this goes for both instructions and data). While lodsb in itself is a very efficient instruction, it is not the best choice in this particular case. Namely, we have plenty of time to load from memory while waiting for our scanline to finish drawing. If we perform the memory access there instead, we save those 4 cycles at the moment that matters.
But there is a problem: both the in and out instructions are hardwired to use dx and al (this is the sort of thing that makes so many assembly programmers hate x86. Most other systems don’t hardwire their registers to instructions like that, and/or just use memory-mapped I/O, so you don’t need special instructions to access hardware registers). So, we can’t just move the lodsb out of the loop, since al will get overwritten by the in. What we can do however, is to store it in a different register, and then just copy it to al before the out.
So, you figure… bx is still free, shall we use that? Just mov al, bl and that’s that? Well, then the joke is on you: the shortest form of a mov instruction is still 2 bytes, where lodsb was only 1 byte. So you are only moving the memory access from the data byte you wanted to read to an additional instruction byte. There is a ‘loophole’ however: some instructions on x86 have a special short encoding for the accumulator register. xchg is such an instruction. It only exists for ax though, not for al. But that doesn’t matter: we can just do xchg ax, bx.
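The byte counting behind that argument can be written down as a tiny Python tally. This is only a sketch of the rule of thumb used in the text (4 cycles per byte moved over the bus, for instructions and data alike); it deliberately ignores the prefetch queue and execution time, so the numbers are not real 8088 timings.

```python
CYCLES_PER_BUS_BYTE = 4  # rule of thumb from the text: every byte costs 4 cycles on the 8088

# (instruction bytes fetched, data bytes read) for getting the colour into al
# right before the out. The mov entry stands for any 2-byte register-to-register mov.
candidates = {
    "lodsb":       (1, 1),  # 1 opcode byte + the colour byte itself
    "mov al, bl":  (2, 0),  # 2 opcode bytes, no data access
    "xchg ax, bx": (1, 0),  # 1 opcode byte (short accumulator form), no data access
}


def bus_cycles(instr_bytes, data_bytes):
    """Cycles spent on bus transfers under the 4-cycles-per-byte simplification."""
    return (instr_bytes + data_bytes) * CYCLES_PER_BUS_BYTE


cost = {name: bus_cycles(*byte_counts) for name, byte_counts in candidates.items()}
for name, cycles in cost.items():
    print(f"{name:12s} {cycles} cycles")
```

Under this simplification lodsb and the 2-byte mov both come out at 8 cycles, and only the 1-byte xchg gets down to 4.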
I noticed another place where we can have a small gain however: test al, 1. Although we cannot shave a byte off this instruction… what we CAN do is to replace it with test al, ah. Using a register takes one less cycle than using an immediate operand. We are not using ah anyway, except that it gets swapped with bh as part of the xchg ax, bx. So we can easily load it with 1, and use the slightly faster test al, ah.
So my polling loop now looks like this:
        mov     bh, 1
    loopHBL:
        lodsb
        xchg    ax, bx
    waitHBL1:                       ; CODE XREF: start_0+70j start_0+7Fj
        in      al, dx              ; Video status bits:
                                    ; 0: retrace. 1=display is in vert or horiz retrace.
                                    ; 1: 1=light pen is triggered; 0=armed
                                    ; 2: 1=light pen switch is open; 0=closed
                                    ; 3: 1=vertical sync pulse is occurring.
        test    al, ah
        jnz     short waitHBL1
    waitHBL2:                       ; CODE XREF: start_0+75j
        in      al, dx              ; Video status bits:
                                    ; 0: retrace. 1=display is in vert or horiz retrace.
                                    ; 1: 1=light pen is triggered; 0=armed
                                    ; 2: 1=light pen switch is open; 0=closed
                                    ; 3: 1=vertical sync pulse is occurring.
        test    al, ah
        jz      short waitHBL2
        mov     dx, 3D9h
        xchg    ax, bx
        out     dx, al
        mov     dx, 3DAh
        loop    loopHBL
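The register juggling is easy to get wrong, so here is a minimal Python trace of it. It models only the four byte registers involved; the class `Regs`, the function `run_loop` and the `status_byte` value are illustrative stand-ins, not part of the demo. The point it checks is that the mask loaded with mov bh, 1 sits in ah during every polling pass, and that al holds the table colour at every out.

```python
class Regs:
    """Just enough of the register file for this loop: AX and BX as byte pairs."""

    def __init__(self):
        self.ah = self.al = self.bh = self.bl = 0

    def xchg_ax_bx(self):
        # xchg ax, bx swaps the full 16-bit registers, so both halves move together
        self.ah, self.al, self.bh, self.bl = self.bh, self.bl, self.ah, self.al


def run_loop(colours, status_byte=0x09):
    """Trace the rewritten loop; status_byte is an arbitrary stand-in for port 3DAh."""
    r = Regs()
    r.bh = 1                     # mov bh, 1
    masks_seen, written = [], []
    for colour in colours:       # loop loopHBL
        r.al = colour            # lodsb: loads al only, ah is left alone
        r.xchg_ax_bx()           # colour parked in bl, mask 1 now in ah
        r.al = status_byte       # in al, dx: the polling loops clobber al
        masks_seen.append(r.ah)  # test al, ah uses this mask
        r.xchg_ax_bx()           # colour back in al, mask back in bh
        written.append(r.al)     # out dx, al
    return masks_seen, written


masks, out = run_loop([4, 12, 14, 15])
```

Because each iteration performs exactly two swaps, the mask always ends up back in bh, so the single mov bh, 1 before the loop is enough.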
The few cycles we save are enough to move the jitter of the colour change outside of the visible area. Not the same approach as the stable raster I demonstrated on C64 earlier, but the result looks just as good in practice. So I have to correct myself on what I said earlier: even the slowest PC *is* fast enough to perform accurate enough rasterbars (but mind you, INTROjr runs on a PCjr, which is even slower than a regular PC, since it shares videomemory with the CPU, causing extra waitstates on reading instructions/data from memory).
A second problem I noticed was with the scroller:
What is happening here? Well, I have discussed ‘racing the beam’ earlier. In this case, the scroller is losing the race to the beam. You can see the part where the beam overtakes the scroll routine.
Trixter says the scroller looks fine on his real IBM PC 5150 with real IBM CGA. My PC is a clone, a Philips P3105 with an ATi Small Wonder CGA-compatible videocard. It does have an original Intel 8088 CPU at 4.77 MHz. But apparently the system as a whole is even slower than the real IBM PC. I think the Small Wonder is the culprit.
As you may know, there is no scrolling hardware whatsoever on a CGA card, so what we have here is a bruteforce software scroller. Just a memcpy() routine basically. The original code looks like this:
    doScroll proc near              ; CODE XREF: start_0+A0p
        mov     ax, 0B800h
        mov     ds, ax
        assume ds:nothing
        mov     es, ax
        mov     si, 1A91h
        mov     di, 1A90h
        mov     cx, 4Fh
        rep movsb
        ...
        retn
    doScroll endp
The problem here is rep movsb: the 8088 CPU executes this literally. So it performs 79 movsb instructions in a loop, moving 1 byte at a time.
However, the 8088 is a 16-bit CPU (it is basically an 8086 on an 8-bit data bus), so it also has the movsw instruction, which moves a word (2 bytes) at a time. If we use that, we only need half the iterations, and this cuts some of the extra overhead, so we save some valuable cycles. Why did they choose rep movsb? Not sure. Perhaps they assumed there would be no performance difference, since the 8088 has an 8-bit data bus (they also used rep movsb to copy the logo to videoram)… Or they used it because 79 is an odd number. But that is easy enough to solve: Just move 78 bytes with 39 movsw instructions, and then add an extra movsb for the 79th byte:
    doScroll proc near              ; CODE XREF: start_0+A0p
        mov     ax, 0B800h
        mov     ds, ax
        assume ds:nothing
        mov     es, ax
        mov     si, 1A91h
        mov     di, 1A90h
        mov     cx, 27h
        rep movsw
        movsb
        ...
        retn
    doScroll endp
This bought me just enough time to keep the scroller ahead of the beam (an alternative solution would be to move the scroller position down a few scanlines, to buy some extra time. Or, the area of rasterbars could be reduced by 1 or 2 scanlines. But luckily that was not required, so the improved version looks exactly like the original).
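For a rough idea of how much time this buys, here is a back-of-the-envelope calculation in Python. The cycle counts are assumptions: they are the nominal figures from Intel's published 8088 timing tables as I recall them, and the 304 cycles per scanline assumes 912 dot clocks per CGA scanline at 3 dot clocks per CPU cycle. Real hardware (and certainly a clone with a slower video card) will differ, so treat the result as an estimate only.

```python
REP_SETUP = 9               # assumed nominal cycles to start a rep movs
PER_MOVSB = 17              # assumed nominal cycles per repeated byte move
PER_MOVSW = 25              # assumed nominal cycles per repeated word move on the 8-bit bus
SINGLE_MOVSB = 18           # assumed nominal cycles for one stand-alone movsb
CYCLES_PER_SCANLINE = 304   # assumed: 912 dot clocks / 3 dot clocks per CPU cycle


def rep_movsb(n_bytes):
    """Nominal cost of copying n_bytes with rep movsb."""
    return REP_SETUP + PER_MOVSB * n_bytes


def rep_movsw_plus_tail(n_bytes):
    """Nominal cost of rep movsw for the word part, plus one movsb if n_bytes is odd."""
    words, tail = divmod(n_bytes, 2)
    return REP_SETUP + PER_MOVSW * words + SINGLE_MOVSB * tail


before = rep_movsb(79)            # 9 + 17 * 79      = 1352
after = rep_movsw_plus_tail(79)   # 9 + 25 * 39 + 18 = 1002
saved = before - after            # 350 cycles per 79-byte copy

print(f"saved {saved} cycles, about {saved / CYCLES_PER_SCANLINE:.2f} scanlines")
```

Under these assumptions each 79-byte copy gets roughly a quarter cheaper, and the saving per copy is a little over one scanline's worth of CPU time.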
I have done some other small tweaks and modifications to the source code. I have restructured some code, to avoid some jmps over data and such. I have also removed some code and data which was duplicate or not used. So the improved version is not only faster, it is also some 400 bytes smaller than the original. If you are interested, you can download the original, the improved version, and my reverse-engineered/semi-commented source code here.
Reblogged this on Oldskooler Ramblings and commented:
I don’t normally reblog other people’s blogs, but my own blog has been neglected lately due to work on other vintage computing and programming projects, so I thought I would share a post by my friend Scali that covers a vintage programming problem just as well as, if not better than, I would have covered it. Enjoy.
Given your past posts on PowerVR and PowerSGL, I thought you would find this interesting.
We’ve been slowly (over the last decade) working on collecting SDKs for proprietary 3D APIs over at VOGONS.org and just as slowly people are starting to write wrappers to wrap them to more generic 3D APIs like Direct3D and OpenGL (in some cases modern versions, in other cases more “retro” versions like Direct3D 9.)
For the SDKs it is probably easier to link you here, at the moment I can’t manage to make a linkable collection at vogonsdrivers.com:
Some of the newer wrapper threads:
Tuxality is working on PowerVR and ATI3DCIF wrappers:
blueshogun96 is planning a NVidia NV1 wrapper but has not got very far yet:
(and the newest posts)
dgVoodoo 2 is wrapping 3dfx Glide to Direct3D 11 and just last month added Direct3D 5/6/7 to Direct3D 11:
(and of course nGlide, the most tested Glide wrapper ever, to Direct3D 9: http://www.zeus-software.com/downloads/nglide )
No one has called S3 Toolkit or Rendition RRedline yet 😉
Also, we recently (re)discovered that S3 MeTaL is a true standalone proprietary 3D API, it seems.
Here’s a list of the proprietary 3D APIs we are still hunting for 🙂
Hum, interesting. Although I don’t think a PowerSGL-to-D3D/OGL wrapper would be all that useful.
Most games with SGL support would also support Glide or D3D/MiniGL.
One thing you’re not looking for, and which I’m not sure even exists, is a PowerVR SDK for DOS. It would be nice if there was an SGL library that can be used from DOS, but I’m not sure if one even exists. Perhaps they just hardcoded on the metal for those few PVR-patches they’ve done for DOS games.
Oh, and of course the monstrous “3D Accelerated Games List (Proprietary APIs – No 3DFX/Direct3D)” thread by vetz, the PowerVR section is still a work-in-progress:
Always wondered how these scrollers were made 🙂 Nice post! Thanks!
Well, this is just one way to do a scroller 🙂
I have done a CGA scroller myself, which works in a different way (and unlike this scroller it is not limited to scrolling 4 pixels per frame).
The EGA scroller in my 1991 donut works in a different way yet again. And then there’s the scroller I did for C64, which uses hardware scrolling (which is possible on EGA/VGA, but only at the top of the screen).
Wiz/Imphobia typing here 🙂
This little blog post of yours reminds me of the trouble I had making stable rasters in VGA. The technique was quite similar. IIRC, in VGA, the colours in the palette are coded on 3 bytes. To make raster bars, I was changing the palette entry of the background colour (much like what you describe). This had a lot of jitter because sending the 3 bytes to the VGA to set the colour required 3 out DX,… and this took too much time. The trick was like yours but a bit different. To get jitter-free bars, one had to set the red and green components, then wait for the scanline to reach the beginning of the blank and, only then, send the last colour component. By doing so I remember I had full screen raster bars (and a bit less if I wanted to save some CPU, ’cos as you said, waiting for the scanline is time-consuming).
Hi, have you read about the VGA palette banking that I used in 1991 donut?
I think it may be useful for rasterbars as well. You can set up 4 banks of 64 colours or 16 banks of 16 colours. Switching the bank could be done to trigger a rasterbar. Then you have the rest of the scanline to update palette entries in one of the inactive banks, and then again just switch banks at the next scanline.
ah … What you explain sounds like my Imphobia stuff indeed 🙂
But I didn’t know that all timers were synced to the same crystal… Anyway, the timer resolution was enough to cope even with unsynced stuff I guess 🙂
Now that I type I also remember making “overscan” with resolution such as 320×232, but that’s too far away (and needless to say, I’ve lost the source 😦 )
Oh, and while I’m at it, in Imphobia I wanted to have in-picture scrolling (if you remember, scrolling on EGA/VGA was easy only if you moved a band at the top or at the bottom of the screen, but not both). To do that I was actually timing the scanline. With that, I could trigger a timer interrupt at rather precise positions on a line, and from there I could reprogram some registers in the EGA/VGA (such as byte offset or palette). The nice thing with the timer is that I had plenty of time to do something else and still keep my effect very stable.
Thanks to that, I could scroll the middle of the screen and keep big static bands on top and bottom. I could also change the colour palette three times on the screen thus achieving 640x400x48 colours 🙂
That was not easy to set up, especially with a music player behind ! 🙂
I stumbled upon this trying to figure out if the graphic card can be programmed to have 16 columns in text mode (number of rows not important). Maybe with your skills you might know if it can be possible! (I was thinking about a minimal 0x88 chess program implementation :))
It seems the server hosting CGADEMO.ZIP (bohemiq.scali.eu.org) has disappeared. Put it in your dropbox?
Right, I’ve put it here: https://www.dropbox.com/s/2j0ya7ta5ibhv0u/CGADEMO.ZIP?dl=0