I would like to give some technical background information on the sprite part in 8088 MPH, but before I do that, I want to discuss the music and the musicians, as they did not seem to have gotten very much attention in my previous article. One might think that the musicians were not as much part of the project as the coders and graphician were. And in a way that is true.
Phoenix was the only musician who was actively involved with the project, but he joined at a later stage. Initially the plan was that Phoenix would do all the music for the demo. But later he said he wouldn’t have enough time, so we had to look for other musicians. Which is not easy 🙂 As I said before, we had to ‘invent’ this demo platform, and that goes for the music as well. Although MONOTONE was not written specifically for this demo, it was written by Trixter, and it had not seen widespread use yet. So on the one hand, it was difficult to find musicians who had any experience with MONOTONE at all, or with PC speaker music in general, or who were willing to give it a try. And on the other hand, MONOTONE had not seen much of a workout yet since its release, so the musicians ran into some bugs, and also requested some modifications to the user interface to make tracking a bit more efficient.
But, it all worked out in the end. Phoenix found enough time to finish one track, which is the first one in our demo. Virt did the second MONOTONE track, and Coda did the chiptune on short notice (the mod player was one of the last parts to get finished, less than two weeks before the compo). I had also asked a few DESiRE people if they could do something with MONOTONE, and they did. So we actually had more music than we could use in the end. And primitive as MONOTONE may be on PC speaker, I think even the beeper tracks in our demo are up there with the best beeper music done on PC, getting a LOT of sounds and effects out of that primitive beeper.
Let’s revisit the CGA hardware which I briefly covered earlier. We have 16k of video memory, which is just enough for a single framebuffer in 320×200 with 4 colours or 640×200 with 2 colours. The main chip is a Motorola 6845 CRTC, which is not really designed for graphics modes, and only supports up to 127 lines. Therefore, CGA uses a ‘hack’ to display 200-line graphics: it splits video memory into two halves, one for even and one for odd scanlines (strictly speaking these are interleaved banks rather than true bitplanes, but I will call them ‘bitplanes’ here). So the CRTC ‘thinks’ in 100-line modes, while we actually get 200 lines on screen.
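To make the even/odd layout concrete, here is a small Python sketch of the address calculation (illustration only, the demo itself is 8088 assembly). It assumes the standard layout with the odd-scanline half starting at offset 0x2000 and 80 bytes per scanline; the function names are made up for this example.

```python
BANK_SIZE = 0x2000       # the odd-scanline half starts 8 KiB into video memory
BYTES_PER_LINE = 80      # byte width of a scanline in the standard graphics modes


def cga_offset(x_byte, y, bytes_per_line=BYTES_PER_LINE):
    """Byte offset in video memory of byte column x_byte on scanline y."""
    bank = y & 1          # 0 = even scanlines, 1 = odd scanlines
    row = y >> 1          # the row as the CRTC sees it (100 rows for 200 lines)
    return bank * BANK_SIZE + row * bytes_per_line + x_byte


def next_line_delta(y, bytes_per_line=BYTES_PER_LINE):
    """Pointer adjustment needed to move from scanline y to scanline y + 1."""
    if y & 1 == 0:
        return BANK_SIZE                   # even -> odd: jump to the other half
    return bytes_per_line - BANK_SIZE      # odd -> even: jump back, one row down
```

Note that the two deltas differ, which is why code that walks down the screen has to know whether it is currently on an even or an odd scanline.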
Pixels are packed side-by-side in bytes, so you have the following pixel formats:
- 2-colour: 1-bit pixels, 8 per byte
- 4-colour: 2-bit pixels, 4 per byte
- 16-colour (composite artifact mode): 4-bit pixels, 2 per byte
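The packing for these three formats can be sketched in a few lines of Python. This assumes the usual CGA convention that the leftmost pixel sits in the most significant bits of the byte; `pack_pixels` is a hypothetical helper name, not something from the demo source.

```python
def pack_pixels(pixels, bits_per_pixel):
    """Pack a row of pixel values into bytes, leftmost pixel in the high bits."""
    per_byte = 8 // bits_per_pixel
    if len(pixels) % per_byte != 0:
        raise ValueError("row length must be a whole number of bytes")
    out = bytearray()
    for i in range(0, len(pixels), per_byte):
        value = 0
        for p in pixels[i:i + per_byte]:
            if not 0 <= p < (1 << bits_per_pixel):
                raise ValueError("pixel value out of range")
            value = (value << bits_per_pixel) | p   # shift earlier pixels left
        out.append(value)
    return bytes(out)
```

For example, `pack_pixels([3, 0, 1, 2], 2)` packs four 2-bit pixels into one byte, and `pack_pixels([0xA, 0x5], 4)` packs two 4-bit pixels.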
Since there is no hardware functionality whatsoever for sprites, bobs, overlays or anything remotely useful, we have to use CPU-based routines to draw sprites. The fastest possible way to draw sprites is to use ‘compiled’ sprites: you generate pieces of code which write the sprite data directly into video memory.
Because of the cumbersome pixel format you will need multiple variations of your sprite code to get pixel-exact placement in the x and y directions. In the vertical case, the separate even and odd bitplanes mean that some of the code for drawing a sprite differs depending on whether you start on an even or an odd scanline: when you switch from an even to an odd scanline you need to adjust your pointer by one offset, and when you switch from odd to even you need a different offset.
And for the horizontal case, it is quite expensive to shift the data on the pixel level at runtime, so you will want to compile different pre-shifted variations of your code.
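A rough sketch of what pre-shifting means, in Python: before compiling, each row is padded with transparent pixels so that it starts at every possible pixel position within a byte, and each padded version is then compiled separately. The function names and the list-based representation are assumptions made for this example.

```python
def preshift_row(pixels, opaque, shift, pixels_per_byte):
    """Pad a row with transparent pixels so it starts `shift` pixels into a byte."""
    if not 0 <= shift < pixels_per_byte:
        raise ValueError("shift must be smaller than the pixels per byte")
    pad_right = -(shift + len(pixels)) % pixels_per_byte   # round up to whole bytes
    shifted_pixels = [0] * shift + list(pixels) + [0] * pad_right
    shifted_opaque = [False] * shift + list(opaque) + [False] * pad_right
    return shifted_pixels, shifted_opaque


def preshifted_variants(pixels, opaque, pixels_per_byte):
    """One padded copy of the row for every sub-byte x position."""
    return [preshift_row(pixels, opaque, s, pixels_per_byte)
            for s in range(pixels_per_byte)]
```

In the 16-colour mode there are two pixels per byte, so this gives two horizontal variants; combined with the even/odd start scanline that would be four compiled routines per sprite.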
As for the code for the pixels in the sprite itself, we have a 16-bit instruction set, so we can process two bytes (one word) at a time, in the best case. However, sometimes it makes more sense to only process a single byte. I defined the following classes of pixel groups:
- All pixels in word opaque
- All pixels in word transparent
- Some pixels in word opaque/transparent
- All pixels in byte opaque
- All pixels in byte transparent
- Some pixels in byte opaque/transparent
For each class, I hand-optimized ‘templates’ of assembly-code. Then I defined heuristics for when to use which template, to always get the fastest/smallest code. The compiler will grab two bytes worth of pixels from the source bitmap, then it will select the proper template based on these heuristics. It then fills the template with the proper pixel/masking info and emits the code and data. It does not always choose to use immediate operands, unlike many examples of sprite compilers you’ll find online. Whenever possible, it uses movsb or movsw, which leads to much smaller and faster code than regular mov instructions with immediate operands.
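To give an idea of the selection step, here is a Python sketch that classifies two bytes of mask data into the classes listed above. The rule used here (use a word operation only when both bytes fall in the same class) is a placeholder heuristic invented for this example; the actual heuristics in the compiler are not spelled out in this article.

```python
def classify_byte(mask):
    """mask has a 1 bit for every opaque bit position in the byte."""
    if mask == 0xFF:
        return "opaque"
    if mask == 0x00:
        return "transparent"
    return "mixed"


def classify_pair(mask_first, mask_second):
    """Pick word or byte templates for two adjacent bytes (placeholder rule)."""
    first = classify_byte(mask_first)
    second = classify_byte(mask_second)
    if first == second:
        return [("word", first)]               # one word-sized template covers both
    return [("byte", first), ("byte", second)]  # fall back to two byte templates
```

A real compiler would then look up the assembly template for each `(size, class)` pair and patch in the pixel and mask values.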
Aside from that, there are also some optimizations to generate the best possible code. For example, transparent bytes or words will simply update a pointer. Likewise, switching to a new scanline is also just a pointer update. The compiler will group these pointer updates together. Another optimization is that the compiler will first process all even scanlines, and then all odd scanlines. The reason for this is that stepping to the next scanline within the same bitplane is then a small offset that fits in a single signed byte, which results in a shorter instruction encoding.
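The size difference can be shown with the two standard 8086 encodings of `add di, imm`. The Python sketch below merges a list of pending pointer adjustments into one and picks the shorter form when the total fits in a signed byte. Using `add di` is an assumption for illustration; the demo's generated code may update its pointer differently.

```python
def encode_add_di(delta):
    """Machine code for 'add di, delta', using the shortest 8086 encoding."""
    value = delta & 0xFFFF
    if value == 0:
        return b""                                   # nothing to emit
    signed = value - 0x10000 if value & 0x8000 else value
    if -128 <= signed <= 127:
        return bytes([0x83, 0xC7, signed & 0xFF])    # add di, imm8 (sign-extended)
    return bytes([0x81, 0xC7, value & 0xFF, value >> 8])   # add di, imm16


def merge_pointer_updates(deltas):
    """Group consecutive pointer adjustments into a single instruction."""
    return encode_add_di(sum(deltas))
```

A step of 80 bytes to the next row in the same half encodes in 3 bytes, while the 0x2000 jump to the other half needs 4. On an 8088 every instruction byte has to be fetched over the 8-bit bus, so shorter code is usually also faster code.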
The first version of the compiler would just assume that black pixels are transparent, and everything else is opaque. However, when VileR got the idea for doing the DeLorean sprite, he asked if it was possible to also have black in the sprite, because it would look much better. So, I modified the compiler to take two bitmaps as input rather than one. The second bitmap is just a black/white transparency mask, allowing you to mask any pixel as either opaque or transparent, which gives you the ability to draw opaque black pixels.
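The effect of the separate mask can be shown with the classic read-modify-write formula, here as a Python sketch. The mask byte is assumed to have 1 bits wherever the sprite is opaque; `blend_byte` is a name chosen for this example.

```python
def blend_byte(screen, sprite, mask):
    """Keep the screen where mask bits are 0, take the sprite where they are 1."""
    return (screen & ~mask & 0xFF) | (sprite & mask)
```

With a mask of `0xF0` and sprite data of `0x00`, the left 4-bit pixel becomes opaque black while the right pixel keeps whatever was on screen. That is exactly the case that cannot be expressed when black doubles as the transparent colour.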
We don’t just need to draw the sprite, we also need to erase it. So the compiler does not just generate the code to draw the sprite, but also to erase it. The first version assumed that the sprites would always be drawn on a black background. So the background colour could just be compiled into the erase routine. This erase routine is also more coarse than the sprite routine: if the whole background is black anyway, you don’t need pixel-exact drawing. You can just erase at the byte-level, which is a lot faster, because you don’t need to do any read-modify-write operations on video memory. And again, the compiler will try to use words where possible, for maximum speed and minimum code size. It basically just uses stosw and stosb, and skips any transparent bytes/words by adjusting the pointer.
When I saw that the performance of these routines was quite good, I figured I could also try to use a background image. So I extended the compiler to generate code that would copy pixels from a background image to the screen, again at the byte-level, using words where possible, basically just movsw and movsb, and again skipping transparent bytes/words with pointer adjustments.
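A byte-level erase of this kind can be sketched in Python as follows. For each row, the runs of bytes that the sprite touches at all are found from the mask, and those runs are copied back from the background. For simplicity this assumes the background buffer is laid out exactly like the screen, which is an assumption of the example rather than a statement about the demo.

```python
def touched_runs(mask_row):
    """Return (start, length) for each run of bytes with any opaque pixel."""
    runs = []
    start = None
    for i, mask in enumerate(mask_row):
        if mask != 0:
            if start is None:
                start = i                       # a new run begins here
        elif start is not None:
            runs.append((start, i - start))     # transparent byte ends the run
            start = None
    if start is not None:
        runs.append((start, len(mask_row) - start))
    return runs


def erase_row(screen, background, offset, mask_row):
    """Restore the touched bytes of one sprite row from the background."""
    for start, length in touched_runs(mask_row):
        begin = offset + start
        screen[begin:begin + length] = background[begin:begin + length]
```

In compiled form each run becomes a string of movsw/movsb instructions, and each gap between runs becomes a pointer adjustment.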
Now I had routines that were equivalent to the blitter objects (bobs) used on Amiga. Only these run in software, so they are more like software objects (sobs).
Another way to scroll
Once I had the sprites working on a background, and performance still seemed quite good, I had the idea that it would be possible to move the background by adjusting the start offset register. As long as I adjusted the target address for the sprite to compensate for the scrolling of the screen, I could have sprites floating over a scrolling background. The only problem, though: how does one scroll the background when there is no extra memory?
Well, as you have probably already noticed, the background is set up to be ‘tiled’: the building repeats the same 60-pixel high segment multiple times. This means that we only have to scroll up 60 pixels to get to a ‘wraparound’ point. Now, with a regular 160×200 16-colour mode, you wouldn’t have enough memory for that. The framebuffer takes 16000 bytes, and you have 16384 bytes. The remaining 384 bytes are just enough for 4 extra scanlines of 80 bytes each. My first idea was to just make the visible window smaller, so you’d get a 160×144 mode. But that would have a rather strange aspect ratio. So instead I set up a tweakmode of 140×170 pixels. This looks like a full screen, because the aspect ratio is about the same. Also, by reducing the width, I free up a lot of memory to scroll in a new building segment. Once I had that working, I could have ‘endless’ upward and downward smooth scrolling of the background, by just taking a scroll offset modulo 60. Mind you, because of the ‘hack’ with even and odd scanlines mentioned earlier, you can only set the start offset to an even scanline, so you can only scroll in increments of two scanlines.
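The memory budget and the wraparound are easy to check with a bit of arithmetic, sketched here in Python. The helper names are made up, and the start offset is computed in bytes; converting it to the address units the CRTC start address registers expect is left out.

```python
VIDEO_RAM = 16384          # bytes of CGA video memory
SEGMENT_HEIGHT = 60        # the building repeats every 60 scanlines


def frame_bytes(width_pixels, height, pixels_per_byte=2):
    """Memory needed for a 16-colour frame of the given size."""
    return (width_pixels // pixels_per_byte) * height


def scroll_start_offset(scroll, bytes_per_line=70):
    """Byte offset of the first visible scanline for a given scroll position."""
    line = scroll % SEGMENT_HEIGHT     # wrap around after one building segment
    line &= ~1                         # only even scanlines can start a frame
    return (line // 2) * bytes_per_line
```

`frame_bytes(160, 200)` gives 16000, leaving only 384 bytes, while `frame_bytes(140, 170 + SEGMENT_HEIGHT)` gives 16100, so the 140×170 window plus one full 60-line segment still fits in 16384 bytes.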
Putting it all together
We aren’t quite there yet… Remember that we only have a single framebuffer. Our tweakmode gives us extra off-screen memory to scroll into view, but we still need to draw the sprites directly into the visible area. So yes, we are once again racing the beam. Now, on most 80s hardware this is not that much of a problem, because the hardware was designed for this: you will have scanline counters or even raster interrupts to synchronize your code with a specific position on the screen.
Alas, CGA has none of this. There is the lightpen functionality, which could be used… but because of the ‘hack’ with even/odd scanlines, it does not give scanline-exact results. Also, it is rather cumbersome to use, and may or may not work on a wider range of hardware. Instead, I opted to use the timer. Since the timer runs off the same clock as the CGA card, they are exactly in sync. A single scanline takes 76 timer ticks. An entire screen takes 262*76 = 19912 ticks.
So, if I set the timer to 19912 ticks, and I make it restart at the bottom of the screen, I effectively have a ‘reverse’ scanline counter, except it is scaled up by a factor of 76. Once I had this set up properly, I could poll the timer to determine when the last scanline of my sprite was drawn out by the beam, and I could start erasing it and draw the position for the next screen.
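The conversion from a timer reading to a beam position can be sketched like this in Python. The timer counts down from its reload value, so the elapsed ticks are the reload value minus the current count. The `reload_line` parameter (the scanline the beam is on when the counter reloads) and the function names are assumptions for this example; the real code polls the timer hardware directly.

```python
TICKS_PER_SCANLINE = 76
SCANLINES_PER_FRAME = 262
TICKS_PER_FRAME = SCANLINES_PER_FRAME * TICKS_PER_SCANLINE


def lines_since_reload(count):
    """Scanlines the beam has completed since the down-counter last reloaded."""
    elapsed_ticks = (TICKS_PER_FRAME - count) % TICKS_PER_FRAME
    return elapsed_ticks // TICKS_PER_SCANLINE


def current_scanline(count, reload_line=0):
    """Scanline the beam is on, given where in the frame the counter reloads."""
    return (reload_line + lines_since_reload(count)) % SCANLINES_PER_FRAME


def beam_is_past(count, last_sprite_line, reload_line=0):
    """True once the beam has finished drawing the sprite's last scanline."""
    return current_scanline(count, reload_line) > last_sprite_line
```

Polling `beam_is_past` in a loop is then all the synchronization that is needed before erasing and redrawing the sprite.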
The only limitation I had was that the sprite could not take more than an entire screen to draw. When VileR sent me the DeLorean, I was a bit worried at first… It was quite a big sprite, would the compiled code be fast enough? But luckily it was, and in fact, there’s actually some rastertime left.
The fun part is that I do not actually do any synchronization to the vertical blank at all. The timer interrupt is used to play the music. The sprite is redrawn by polling the counter, and since I can never draw more than one sprite per frame, it synchronizes automatically. I also exploit the fact that the start offset register is latched: I set the new scrolling offset immediately before I start drawing my sprite. The 6845 will latch it, and it won’t become active until the next frame.
Now I had the technology in place to scroll the screen and move sprites on top. But I still had to create some kind of acceptable motion to show it off. Initially I thought I could just hardcode a few motions and glue them together with some more hardcoding. Just some movements based on some sine-functions or whatnot. But after I messed about with that idea for a while, I concluded that the movements were still too ‘robotic’, and didn’t really do the effect justice. So I would have to try something else.
So I decided to try and rig up some keyframe-based animation instead, where I could put in some sprite/scroll positions in keyframes, and have smooth spline-based interpolation in between. I used a Hermite spline interpolation in some places, which takes two points to interpolate between, and two other points as ‘control points’ for the total curve, with some extra control values for the ‘tension’ of the path around the points. In other places I used cosine interpolation instead, because it gave slightly smoother, more pleasing results.
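For reference, here is a Python sketch of the two interpolation schemes in their textbook form. The Hermite version derives its tangents from the neighbouring points and scales them by a tension value (a cardinal spline, where tension 0 gives a Catmull-Rom spline); that particular formulation is an assumption for this example and may differ from what the keyframe player in the demo uses.

```python
import math


def cosine_interpolate(a, b, t):
    """Ease from a to b along half a cosine wave, t in [0, 1]."""
    s = (1.0 - math.cos(t * math.pi)) / 2.0
    return a + (b - a) * s


def hermite_interpolate(p0, p1, p2, p3, t, tension=0.0):
    """Interpolate between p1 and p2; p0 and p3 only shape the curve."""
    m1 = (1.0 - tension) * (p2 - p0) / 2.0    # tangent at p1
    m2 = (1.0 - tension) * (p3 - p1) / 2.0    # tangent at p2
    t2 = t * t
    t3 = t2 * t
    h00 = 2.0 * t3 - 3.0 * t2 + 1.0           # cubic Hermite basis functions
    h10 = t3 - 2.0 * t2 + t
    h01 = -2.0 * t3 + 3.0 * t2
    h11 = t3 - t2
    return h00 * p1 + h10 * m1 + h01 * p2 + h11 * m2
```

Both functions work on one coordinate at a time, so a sprite position or scroll offset is interpolated by calling them separately for x and y.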
Because time was really tight by now, I wrote up the whole keyframe player and animation path in a single evening. We stuck with that rough draft for the demo, with the only difference being that I tweaked the speed of the animation somewhat by adjusting the time of some keyframes, to match the part up with the music.
As you may have noticed, this part also uses some fade-in/fade-out techniques. How is that possible? Well, this is an idea brought forth by VileR, which only works in this particular mode, which is effectively mode 6 with colorburst enabled. Mode 6 is a monochrome mode, where the palette register has a special meaning: the field that normally selects the background colour selects the foreground colour instead. The background colour is always black, and the foreground colour defaults to 15, which is white.
By adjusting the foreground colour, you are modifying the artifact colours. There are a few foreground colours you can use to gradually reduce the overall luminance, namely 15, 7, 8 and 0 (in that order). This is what I use to fade from black to colour at the start, and back to black at the end of the DeLorean part.
The fade-to-white is slightly more complicated. I have done this based on a set of gradients that VileR provided for this mode. I have created a table that maps each colour to the next brighter colour in a gradient (but leaving black black at all times, to create a more interesting effect). Then I just read all pixels on the screen and remap them with this table. Repeat this a few times, and eventually everything saturates to white.
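The table-based remap can be sketched in Python as below. Since each byte holds two 4-bit pixels, a 16-entry colour table is expanded into a 256-entry byte table so a whole byte is remapped with one lookup. The gradient used in the usage note is a dummy placeholder (each non-black colour simply steps towards 15); it is not VileR's actual gradient data, which is not reproduced in this article.

```python
def build_byte_table(next_colour):
    """Expand a 16-entry colour map into a 256-entry map for packed bytes."""
    if len(next_colour) != 16:
        raise ValueError("need one entry per 4-bit colour")
    return bytes((next_colour[b >> 4] << 4) | next_colour[b & 0x0F]
                 for b in range(256))


def fade_step(framebuffer, table):
    """Remap every byte of the framebuffer through the table once."""
    return framebuffer.translate(table)


def fade_to_saturation(framebuffer, table, max_steps=16):
    """Repeat the remap until nothing changes any more."""
    for _ in range(max_steps):
        remapped = fade_step(framebuffer, table)
        if remapped == framebuffer:
            break
        framebuffer = remapped
    return framebuffer
```

With the placeholder gradient `[0] + [min(c + 1, 15) for c in range(1, 16)]`, black stays black and every other pixel ends up at 15 after enough iterations.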
Because the CGA memory is too slow, you can’t do a whole iteration in just one frame. Which is why I opted for the ‘sideways burst’ effect instead. This is more visually pleasing, and doesn’t suffer from the problem that I can’t process the pixels fast enough for a whole screen update.
This idea was based on effects on C64, which rewrite colorram in a similar way to do fade/flash effects. Unlike the foreground-colour trick, this type of fade can be adapted to any mode, and can be used to fade to white, to black, or any other colour, as long as you construct the proper tables for it.
The vectorbobs part is using the same sprite compiler, but in a different way. Since we are dealing with 3d-sprites here, we need to sort them in z-order. This makes it very difficult to draw them based on a scanline counter while racing the beam. So instead, I opted for another tweakmode here, which is effectively 112×140. This mode takes only 8k of memory, so we have enough memory for a backbuffer. This means we don’t have to worry about racing the beam anymore. Just draw in z-order in the backbuffer, and then update the start offset register to flip the buffers (in this case I do have to wait for vsync, because otherwise I may start drawing in the ‘backbuffer’ while the pageflip is not yet active, so I may cause flicker).
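The drawing loop for this double-buffered approach boils down to a painter's algorithm, sketched here in Python. It assumes larger z means further from the viewer, and `draw_bob` stands in for the compiled sprite routine; both are assumptions of the example.

```python
def painter_order(bobs):
    """Sort (x, y, z) bobs far-to-near so nearer bobs overdraw further ones."""
    return sorted(bobs, key=lambda bob: bob[2], reverse=True)


def render_frame(bobs, back_page, draw_bob):
    """Draw all bobs into the back page, then return the new back page index."""
    for x, y, z in painter_order(bobs):
        draw_bob(back_page, x, y)
    # The caller now points the start offset register at the page just drawn
    # and waits for vsync before drawing into the page returned here.
    return 1 - back_page
```

Because nothing is drawn into the visible page, the order and timing of the individual bob draws no longer matter for flicker, only for the final image.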
Another advantage here is that I no longer need to draw at the full 60 fps, so I could throw some more vectorbobs at it. During testing I found that 32 bobs would still run quite acceptably. Then I had the idea to generate a more interesting shape, something with text. So I tried to approximate the IBM logo, which ended up at 30 bobs.
Also, since the window is now much smaller, it didn’t make a lot of sense to use a background image here. So I opted for a black background, also using the slightly faster erase-to-black routines I mentioned earlier.
Back to the future
We still have some plans for the future, with this sprite routine. There are some ideas we had, but there was not enough time to work them into the demo. For example, we wanted to support multiple shapes at the same time. This would allow multi-coloured vectorbobs, and vectorbobs that resize based on the z-coordinate, for example. We also wanted to do animated sprites. This can simply be done by calling a different sprite routine for every frame. For a large sprite, such as the DeLorean, you may not want to animate the whole sprite, but rather cut it up into subsections of animated and non-animated parts, to save on the total amount of code you need to generate.
Fun fact: I did some of the development in DOSBox. While DOSBox 0.74 does support emulation of composite video to some extent, it had a bug which prevented the tweakmodes from working properly (in composite mode, the width was hardcoded to 80 bytes, regardless of how you set up the CRTC registers). So I patched this bug and used a custom-built DOSBox for some of the development, until I had to move to real hardware to fine-tune the code.