The last effect in 8088 MPH to be discussed is the polygon renderer. As mentioned earlier, it is not a regular polygon renderer, but a ‘delta-renderer’: it does not rasterize pixels directly. Instead, it represents each scanline as spans of pixels, compares them with the previous frame, and only renders the differences or ‘deltas’. This bears some similarity to 8088 Domination, where the video is encoded as sets of differences from the previous frame. In the case of 8088 MPH the goal is not so much compression as it is reducing the number of bytes written to CGA memory, since CGA memory is a huge bottleneck: you have only about 170 KB/s of write speed. Another problem with CGA is the lack of memory, as I mentioned in the previous article as well: there is not enough memory for more than one framebuffer, so no double-buffering.
Rendering, let’s recap
I have discussed polygon rasterizing before, but let’s review the whole conventional rendering process, so we can then discuss how the renderer in 8088 MPH differs from it. In this case, I am taking the z-sorting renderer used in 1991 Donut as an example, since that is closest to what we have here:
- Clear backbuffer
- Transform vertices from object space to screen space
- Sort polygons in z-order
- Clip polygons against screen rectangle
- Render polygons to backbuffer in back-to-front order (Painter’s algorithm)
- Swap front and back buffers.
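The steps above could be sketched like this in C; this is a hypothetical minimal skeleton (the actual demo code is 8088 assembly), with the transform/sort/clip steps elided and a single pixel write standing in for polygon rasterization:

```c
#include <string.h>

#define W 320
#define H 200

static unsigned char buf_a[W * H], buf_b[W * H];
static unsigned char *front = buf_a, *back = buf_b;

/* One frame of the conventional pipeline; steps 2-4 are elided, and a
 * single pixel write stands in for the polygon rendering of step 5. */
static void render_frame(void)
{
    memset(back, 0, W * H);           /* step 1: clear the backbuffer */
    /* ... steps 2-4: transform vertices, sort polygons, clip ... */
    back[100 * W + 160] = 15;         /* step 5: render into backbuffer */
    unsigned char *t = front;         /* step 6: swap front and back */
    front = back;
    back = t;
}
```

With both buffers in video memory, step 6 would reduce to reprogramming the display start address instead of swapping pointers, as described below for 1991 Donut.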
Now, there is one thing that should already be an obvious problem: steps 1, 5 and 6 assume that we have a backbuffer. Namely, we first clear the whole screen, and then we render the polygons in a ‘random’ vertical order, because they are sorted on z. If you were to do this directly on the frontbuffer, you’d be racing the beam, and losing most of the time, so you’d get a very flickery image, with random parts of polygons that may or may not appear depending on where the raster beam was when you started drawing. So the backbuffer is very much required for a conventional polygon renderer.
In 1991 Donut I used a backbuffer in video memory, which meant that I could just change the start offset address to swap between buffers, which takes virtually no CPU time at all. On CGA, there is not enough memory, so at the full resolution it would only work if the backbuffer is done in system memory, and instead of doing a simple swap, we do a full memcpy to video memory. Since the memory speed is so low, this takes a lot of time and will give visible tearing while drawing.
As discussed previously with the vectorbobs, it is possible to do double-buffering in CGA memory, if you set up a tweak mode with a smaller window size. However, vectorbobs can be erased efficiently by just drawing black areas at the positions of the previous frame. With polygons, it is generally most efficient to just clear the whole screen. Drawing black polygons to erase the objects of the previous frame would be more complex and take longer. But even at a smaller window size, clearing the whole buffer still takes a lot of time in CGA memory. We’re still talking about 8 KB per frame, and at ~170 KB/s, you’d need ~0.05 seconds to clear that much. So even before you actually start drawing, you have already limited yourself to about 20 fps max, on just half the screen area. Yes, CGA is *very* slow.
So this is where we start to think about delta-rendering. Even if we could just avoid clearing the whole screen, and only clear the parts around the polygons, we could already save a lot of expensive CGA writes. And since we can’t realistically do much more than single-coloured polygons, a large part of the pixels on screen will remain the same colour on consecutive frames. There will only be changes around the edges of the polygons.
So I worked out a solution to find these changes based on a span-buffer approach. Normally you render polygons as spans of pixels per scanline. In this case I do not render the pixels directly, but keep a linked list of spans for each scanline. A span contains the leftmost and rightmost x-coordinates, and the pixel colour. Each span is inserted into the list, where spans are split, merged, or eliminated on insert, to eliminate any overdraw. You still want to insert the spans back-to-front, so you still use a sort of painter’s algorithm. In order to ‘clear the screen’, you insert a set of spans with the background colour, spanning the entire screen, before you start inserting your polygons. All these operations are very simple integer-based operations: just comparisons, additions and subtractions. That makes them very suitable for the 8088, which is much slower at multiplication and division than at these simpler operations.
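As an illustration, here is a minimal C sketch of such an insert-with-overdraw-elimination. The names and structure are hypothetical (the real renderer is 8088 code), and for brevity this version does not keep the list x-sorted or merge adjacent spans of the same colour:

```c
#include <stdlib.h>

typedef struct Span {
    int x0, x1;              /* inclusive leftmost/rightmost pixel */
    unsigned char color;
    struct Span *next;
} Span;

/* Insert a new span into one scanline's list, back-to-front: the new
 * span is in front, so overlapped parts of older spans are trimmed,
 * split, or removed entirely -- no pixel is ever written twice. */
static void insert_span(Span **list, int x0, int x1, unsigned char color)
{
    Span **p = list;
    while (*p) {
        Span *s = *p;
        if (s->x1 < x0 || s->x0 > x1) {          /* no overlap */
            p = &s->next;
        } else if (s->x0 >= x0 && s->x1 <= x1) { /* fully covered: remove */
            *p = s->next;
            free(s);
        } else if (s->x0 < x0 && s->x1 > x1) {   /* new span inside: split */
            Span *right = malloc(sizeof *right);
            right->x0 = x1 + 1; right->x1 = s->x1;
            right->color = s->color; right->next = s->next;
            s->x1 = x0 - 1; s->next = right;
            p = &right->next;
        } else if (s->x0 < x0) {                 /* overlaps on the left */
            s->x1 = x0 - 1;
            p = &s->next;
        } else {                                 /* overlaps on the right */
            s->x0 = x1 + 1;
            p = &s->next;
        }
    }
    Span *n = malloc(sizeof *n);  /* prepend the new (frontmost) span */
    n->x0 = x0; n->x1 = x1; n->color = color;
    n->next = *list;
    *list = n;
}
```

Inserting a single full-width background span first implements the ‘clear screen’ for that scanline; every polygon span inserted after it simply carves it up.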
Once you have inserted all your polygons, you have a complete ‘snapshot’ of your framebuffer in a compact form. If you also saved the span-buffer of your previous frame, you can now quickly determine the differences per scanline by comparing the spans of the two buffers (so in a way, we are still double-buffering, just not with framebuffers). You do a very similar insert and clip/eliminate operation as with the earlier rendering of the spans, to get a list of spans that only contains the differences. There are a number of advantages to this method, including:
- Clearing the screen is a very cheap operation
- The performance is virtually independent of the size of the polygons; it depends on the size of the changes
- The spans are grouped per scanline, so rendering top-to-bottom and left-to-right is trivial
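To make the comparison step concrete, here is a rough C sketch of a per-scanline diff. This is hypothetical code: spans are held in flat, x-sorted, gap-free arrays here for brevity, and adjacent difference spans of the same new colour are not merged:

```c
typedef struct {
    int x0, x1;              /* inclusive horizontal extent of the span */
    unsigned char color;
} Span;

/* Compare one scanline of the previous and current frame, given as
 * x-sorted, gap-free span arrays, and write out only the ranges whose
 * colour changed. Returns the number of difference spans produced. */
static int diff_scanline(const Span *prev, int n_prev,
                         const Span *cur, int n_cur, Span *out)
{
    int i = 0, j = 0, x = 0, n = 0;
    while (i < n_prev && j < n_cur) {
        /* the nearer colour boundary in either buffer ends this segment */
        int end = prev[i].x1 < cur[j].x1 ? prev[i].x1 : cur[j].x1;
        if (prev[i].color != cur[j].color) {
            out[n].x0 = x;
            out[n].x1 = end;
            out[n].color = cur[j].color;   /* draw the new colour */
            n++;
        }
        x = end + 1;
        if (prev[i].x1 == end) i++;
        if (cur[j].x1 == end) j++;
    }
    return n;
}
```

The returned spans are then the only thing the rendering routine has to write to CGA memory for that scanline.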
This means that we no longer need a backbuffer. We don’t actually clear the screen anymore, we only draw the differences, so any flicker will be very minor anyway. And on top of that, we can render in scanline-order. This is especially nice for CGA, since rendering in memory order (such as when you would copy a backbuffer to a frontbuffer with a memcpy operation) means you do the even scanlines first, and then the odd scanlines, which can cause visible interlacing effects.
And we can just use the full 160×200 resolution, and use some border graphics to spice things up.
An interesting difference with 1991 Donut is the sorting strategy. For 1991 Donut, I wanted to scale up the polycount as far as possible. Since sorting doesn’t scale well with a large number of elements, I did not want to do the sorting per polygon. It would probably not be very fast with the 128 polygons I wanted to use. So, what I did there was to subdivide the donut in 8 sections, and sort per section. This means the sorting cost is constant, regardless of polycount.
The downside to this technique, however, is that you need to construct your donut so that it can be subdivided into 8 sections. This means that you can’t really build a donut with fewer than the 40 polygons I used for the ‘low’ feature in 1991 Donut. That was still quite a lot of polygons for an 8088 at 4.77 MHz to handle. Therefore I decided to drop this sorting method in favour of a more conventional per-polygon sort, and instead construct a donut from 25 polygons, which is about as low as you can go while still having something that somewhat resembles a donut shape.
I have already mentioned the shading in an earlier post, but I will just state it again here for completeness. VileR came up with a number of gradients that were possible in various CGA modes and palettes. The common black, cyan, magenta, white palette happened to generate a blue gradient of 7 colours, and a red gradient of 6 colours. This would allow for some semi-useful flatshading, so I decided to give it a try and see how it looks.
The downside of flatshading is that your polygons will change colour, which means you get more changes per frame when they do, slowing down the routine. The choice for the pyramid at the start was to show more colour variation by using another palette, and also to start simple, keep expectations low, so that we could build up to the cube and donut with more impact.
If you recall, I did some experiments with dithering earlier, when I targeted Hercules:
Since I only had one colour to work with there, I had to create a number of patterns to simulate different levels of brightness. I chose to use 8×8 patterns, since there are 8 pixels in a single byte. So each pattern consists of 8 bytes. I select the correct byte for each scanline based on the y-coordinate modulo 8. For this code however, I could use a simpler form of dithering, since I already had multiple colours.
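In C, that Hercules pattern lookup could be sketched like this. The pattern bytes below are made up for illustration; only the indexing scheme follows the description above: one byte per scanline, selected by the y-coordinate modulo 8.

```c
/* Hypothetical 8x8 ordered-dither patterns for four brightness levels,
 * darkest to brightest; each bit is one monochrome pixel, so one byte
 * covers the 8 pixels of a pattern row. */
static const unsigned char pattern[4][8] = {
    { 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00 },  /* 0%  */
    { 0xAA, 0x00, 0xAA, 0x00, 0xAA, 0x00, 0xAA, 0x00 },  /* 25% */
    { 0xAA, 0x55, 0xAA, 0x55, 0xAA, 0x55, 0xAA, 0x55 },  /* 50% */
    { 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF },  /* 100% */
};

/* Pick the pattern byte for a scanline: y modulo 8 selects the row. */
static unsigned char pattern_byte(int brightness, int y)
{
    return pattern[brightness][y & 7];
}
```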
Adding dithering to this particular renderer was actually quite simple, given that the span-buffer already stores the colours as bytes anyway. Each pixel is only 4 bits, so when rendering a span to the screen, I copy the 4-bit colour value into both pixels that are packed in the byte. In order to speed up that process, I moved it to the start of the polygon rasterization, since the same colour value is used for all spans in the polygon anyway. So the spans already store the expanded colour value, which is used as-is by the rendering routine.
Expanding this to dithering was straightforward: instead of taking the same colour value for both pixels in the byte, I calculated the lighting with one extra bit of precision, and if the lowest bit is set, I take the two nearest colours and place them in the byte for the even scanlines. Then I swap the order of these two colours for odd scanlines. This results in an X-pattern, which is simple but effective, and doesn’t need any change to the span-buffering and rendering mechanism itself, only a simple tweak to the polygon rasterizer, with a negligible impact on the overall speed.
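A small C sketch of that dithered colour expansion: names are hypothetical, and which nibble maps to the left pixel, as well as the gradient contents, are assumptions here, not taken from the actual demo code.

```c
/* Compute the two byte values (even/odd scanlines) for a polygon's
 * lighting value. 'light' carries one extra bit of precision: light >> 1
 * indexes the gradient, and the low bit selects between a flat colour
 * and a two-colour X-pattern dither. Assumes the high nibble is the
 * left pixel of the packed pair. */
static void shade_to_bytes(int light, const unsigned char *gradient,
                           unsigned char *even_byte, unsigned char *odd_byte)
{
    unsigned char c0 = gradient[light >> 1];
    if (light & 1) {
        unsigned char c1 = gradient[(light >> 1) + 1]; /* next brighter */
        *even_byte = (unsigned char)((c0 << 4) | c1);
        *odd_byte  = (unsigned char)((c1 << 4) | c0);  /* swapped: X-pattern */
    } else {
        *even_byte = *odd_byte = (unsigned char)((c0 << 4) | c0);
    }
}
```

For even lighting values this degenerates to the flat expansion described above, with the same nibble in both halves of the byte, so the span-buffer and rendering code never need to know whether dithering is enabled.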
Dithering effectively doubles the number of colours you use, so you will get more frequent changes of colours, which slows the routine down even more. But it did look very nice, and the speed was actually quite acceptable still, so we chose to use the dithered version of the routine in the final demo.