Ah, there it is at last, part 5! In case you were wondering why the previous two posts were not just part 5 and 6 already, well that is because they were not planned originally, and they were just things that I discovered as I went along. They were just addenda, or even errata. I wanted to do part 5 as an encore-piece that improves the rendering quality a bit. I’m not sure if this is going to be the last piece in the series… that depends on how much further I would like to take it. There are tons of little tricks and things to make 3d rendering faster, especially on such old and low-spec machines, which I may or may not cover in the future.

### Corrections

As I mentioned earlier, my 486 donut has subpixel-correction. If you are not familiar with that, I’ll try to explain it in a nutshell, but I think it’s one of those things that are difficult to explain, yet easy once you get it. The name ‘subpixel’ already suggests that there is something ‘below’ the pixel, smaller than pixel resolution. And indeed, it is perfectly possible to have fractional coordinates. When you are doing rotations, scaling and other transforms (either in 2d or 3d), you will usually get coordinates with fractions, which are mapped onto integer coordinates for the pixel grid.

It is this mapping onto integer coordinates that may cause a problem. Namely, you should not just look at pixels as being infinitely small dots. You should consider each pixel to occupy a rectangular area. To demonstrate what I mean, I will show two different lines drawn on a pixel grid:

The blue line shows the actual line at 16 times higher resolution (so we have 4-bits of ‘subpixel’ information in this line). The red pixels show the pixels that the line passes through, so this is how the line looks when rendered at the actual pixel resolution. Now the key thing to note here is that the blue lines start and end in the same pixel in both images, but the different subpixel positions within the pixel result in different pixels being traversed by the line on the screen.

The naive approach would be to just round the fractional coordinates to the nearest integer. Effectively you’re just snapping every coordinate to the center of each pixel. The result is that both lines will be rendered like this:

The second line could be a result of the first line being rotated a little. Small movements are where subpixel-correct rendering is most obvious. Instead of lines or edges just hopping from one pixel to the next, you will see a more ‘flowing’ movement as the line changes even when the endpoints are still within the same pixels.

Subpixel correction is an extra step that you do before you rasterize the pixels. You correct for the subpixel position of your starting point. Instead of just ‘snapping’ pixels to the center by rounding the coordinate, you define a ‘hotspot’ inside your pixel, which will be the exact point in subpixel coordinates that will be projected to the pixels on screen. Since you generally rasterize from left-to-right and from top-to-bottom, the most common choice is to take the bottom-right corner of each pixel as the hotspot. You then calculate the distance to the hotspot from your fractional coordinates, and take a ‘pre-step’ to land exactly on the hotspot coordinate. Once you are on the hotspot, each next hotspot is exactly one pixel away, both horizontally and vertically. This means that you can just rasterize with integer steps as usual. So apart from the pre-step, there is no extra cost in subpixel-correct rendering. The bulk of the work is still the actual rasterizing. So I was wondering if it would be realistic to add subpixel correction to my low-end polygon routines.

### To rasterize, or not to rasterize

Part of the fun in this little endeavour, for me at any rate, was to experiment with different kinds of rasterizers. It must have been some 10 years since I last wrote a rasterizer. And for such low-end hardware (for PC I not only targeted CGA/EGA/VGA, but also the original 8088 CPU, so with 16-bit registers in real mode. The title of these blogs was a play on that), I wanted to try something out of the ordinary. I started out with a classic Bresenham-based approach, as I mentioned in part 4.

Bresenham has the advantage that you don’t need any divisions or multiplies at all. However, the Bresenham routine is normally used for lines, where you will always rasterize along the biggest delta. So if (x1 – x0) < (y1 – y0), you will rasterize vertically, else you will rasterize horizontally. Then Bresenham works because you know that you will always step less than 1 pixel in the other direction (the ‘fraction’, so to say). When you rasterize a polygon however, you will always rasterize vertically. So now you have the following dilemma for lines that are more horizontal:

- Will I just loop through the Bresenham routine horizontally until it takes a step vertically?
- Or will I pre-calculate how many pixels it always steps horizontally, and modify the Bresenham step to only decide on one extra pixel to step, to account for the fraction?

Originally I went with 1. because it does not require any divisions. It works well in the average case, but there are extreme cases of near-horizontal lines, where the amount of iterations can get relatively costly. I then tried 2., which requires a division and a modulo operation during the setup (which is just one division on x86 CPUs, you get the modulo for free). However, the rasterization itself becomes more straightforward now, with just one conditional jump per edge.

But how does one perform subpixel-correction on such a routine? Apparently it is not a common subject with Bresenham algorithms. It seems they are generally seen as integer-only solutions. It is possible to do, however. Namely, you are still processing a fraction in Bresenham. You have the nominator, the denominator, and the error term. You keep adding the nominator to the error term until it reaches or exceeds the denominator. At this point your fraction has stepped over the next integer boundary. With classic Bresenham, the error term is initialized with denominator/2. This means that we start our fraction ‘halfway’. Which is effectively snapping the starting point to the center of the pixel.

So the initial value of the error term is the key to subpixel-correcting a Bresenham algorithm. During the ‘pre-step’, you need to calculate the proper error term at the pixel hotspot. Calculating the nominator and denominator from the fractional coordinates does the rest. If you want to know more about this approach, you might want to read Chris Hecker’s articles on perspective texture mapping. He uses a Bresenham-derived rasterization approach with subpixel correction.

Well, I decided to implement it, just to finish the routine. I managed to get mine a bit more efficient than Chris Hecker’s. I used 12.4 fixedpoint coordinates, and managed to fit everything in 32/16-bit div/mod and just 16-bit terms for the fractional stepping. But I wasn’t happy with it. Namely, to get rid of the iterative nature of Bresenham during rasterizing, you already needed to use a div/mod operation per edge. And the subpixel-correction needed another div/mod per edge (both could be replaced with iterative solutions as well, but still relatively expensive). This completely defeated the original point of using Bresenham to avoid costly divisions and multiplies. So I figured I might as well write another rasterizer, one that is closer to the one I used in my 486 renderer, using regular 16.16 fixedpoint. It only needs one division per edge to set up, and the subpixel prestepping is more straightforward as well. Aside from that, it doesn’t need as many variables as a Bresenham-style rasterizer. You need 32-bit precision for 16.16, but it splits perfectly over 2 16-bit registers, which means you can access the integer portion of the coordinates immediately. So in retrospect this seems to be the best choice, even on low-spec machines.

To give you an example, let’s go back to the video I made with the transparent polygon. This one was still not subpixel-corrected, but it already rotates at the low speed I wanted to test the subpixel correction at:

And this is what it looks like with subpixel correction:

As you can see, you get that smoothly ‘flowing’ effect of the edges, making it look far less choppy than the all-integer variation.

Another example is from an application that Mikael Kalms made long ago (I believe it was to accompany an article on polygon rasterizing, but I cannot find it. Here is a related article he wrote though, complete with source code). This little app allows you to pick the number of subpixel bits to use, and also demonstrates the effect on lighting and texturing. Even 1 bit already makes a difference, and about 3-4 bits give excellent results:

If you played some 3d games or watched 3d demos back in the early 90s, you’ll probably recognize that ‘choppy’ look. Subpixel correction wasn’t commonplace until the mid-to-late 90s (although various demos, including Crystal Dream, would just spin their objects fast enough that it was not very obvious). At this time the 3d accelerators came into swing as well, which may have had something to do with that. Even early 3DFX VooDoo accelerators already had subpixel-correction (with 3-bit accuracy I believe). Most accelerators focused on accurate, high quality rendering. This also included perspective-correct texturing and texture filtering. The Sony PlayStation was the exception to the rule, with rather unstable rasterizing and distorted, unfiltered textures. Occasionally you will also find early accelerated software, where the coders had apparently not designed their geometry pipeline for subpixel-correct rasterization. As a result, they only pass integer data to the accelerator, making it look as choppy as a non-subpixel corrected renderer. Well-known offenders are Gods, with Toys and Incoming Future.

### Windows NT does what?

While we’re on the subject of subpixel-correct rendering… Chris Hecker mentions in his articles that Windows NT is capable of rendering subpixel-correct lines and polygons as well. He does not go into detail however… But I was curious. So I looked into the API reference to see how that would work. My guess whas that the SetWorldTransform() function would be the answer, as the drawing functions only took integer coordinates. If you would specify a scaling matrix, then all your integer coordinates would be scaled down, and apparently the subpixel-information would be used for rendering. So I made a small test application to try out that theory. By scaling everything down to 1/16, I would effectively have the 4-bit subpixel accuracy that Chris Hecker mentioned. As it turns out, you also need to use SetGraphicsMode() to set the GM_ADVANCED mode, or else the world transform will not be enabled.

Lo and behold: indeed, the GDI LineTo() and Polygon() functions now gave me subpixel-correct results! And apparently this has been part of Windows for a long time, seeing as Hecker’s article is from 1995. Doing some digging through old MSDN resources showed that apparently it has been supported since NT 3.1 (the current MSDN just reports ‘Windows 2000’ for any old functionality, because Microsoft no longer supports the older versions. Kind of a shame, in a way they have rewritten history by doing that). It does not work on any Win9x-derivative however. Although there is an alternative way to use scaled coordinates (by using SetWindowExtEx() and SetViewportExtEx()), which does work on Win9x as well, this does not result in subpixel-correct rendering. So it seems that only the NT-derivatives support it (which must be why Chris Hecker specifically mentions NT).

Which makes me wonder: Did some early Windows accelerator cards already have subpixel-correct rendering? Or even better: was this a requirement? In which case, a flatshaded subpixel-corrected polygon 3d engine could probably be accelerated even before the VooDoo cards arrived. But I have never seen it done. Perhaps because it’s just too obscure a functionality? Or is it because it was not widely supported, or just not fast enough? Oh well. It’s an interesting feature to know about.

### But what of the Amiga?

The polygon routine on the Amiga was also an integer-only one. But that one used the built-in hardware to draw the polygons. Can we make it subpixel-correct as well?

The easy way would be to use the CPU to rasterize the edges, and then use the blitter to fill it. So that was my first try. I just used the same rasterizer as the PC version, but instead of filling entire scanlines, I just plotted the endpoints. Then the blitter would fill as usual. So it is still halfway hardware-accelerated, but it does take quite a bit more CPU time. Something you do not have an awful lot of to begin with, on a classic Amiga.

Anyway, that first attempt resulted in this:

But can we do any better? How does that hardware linedrawer really work? It is some kind of Bresenham algorithm, and as we’ve seen above, there are ways to subpixel-correct them. From the Amiga Hardware Reference Manual, we get these semi-obscure formulas to initialize a blitter line:

bltapt = (APTR) (4*dy-2*dx);

bltamod = (UWORD)(4*(dy-dx));

bltbmod = (UWORD)(4*dy);

(Note however that dx and dy are not necessarily dx and dy here. You first determine the octant for your line, and after that, you sort dx and dy so that dy is always the smallest delta and dx is always the largest).

The System Programmers Guide tells us slighly different formulas:

bltapt = (APTR) (2*Sdelta-Ldelta);

bltamod = (UWORD)(2*Sdelta-2*Ldelta);

bltbmod = (UWORD)(2*Sdelta);

They replaced dx and dy with Sdelta and Ldelta (**S**hort and **L**ong respectively). Other than that, apparently they divided all terms by a factor 2, and the resulting line is still equivalent. Now, I was wondering… why are there three values? For a regular integer Bresenham routine, you would only need a nominator and denominator value, as mentioned before. The error term can be initialized with denominator/2, so it would not have to be specified explicitly. So by the looks of things, these three terms are some form of nominator, denominator and initial error term.

2*Sdelta is the smallest value, so that one is likely to be the nominator. 2*Sdelta-2*Ldelta would be a negative value, since Ldelta is larger than Sdelta by definition. If we disregard the 2*Sdelta terms for a moment, then -2*Ldelta appears to be a denominator, and -Ldelta is denominator/2. For some reason these values are negative (possibly because it counts from -denominator towards 0, since checking for error <= 0 may be easier to implement than checking for error >= denominator in hardware) The 2*Sdelta then would be an offset of 1*nominator, for some reason.

So I started to experiment with the values a bit, and indeed the value stored in bltapt is the initial error term. By replacing the Ldelta term with some other factor, I could control the starting point of the line. I have not quite perfected the routine yet, once again the problem of line endings rear their ugly head, and the blitter fill will leak at certain points. Nevertheless, I think it might be possible to make it work 100%. Currently it looks something like this:

I may get it right one day… but it is going to take some experimenting to figure out exactly what the blitter is doing when.

Neat!

For the amiga blitter details, check out the WinUAE sources, namely https://github.com/tonioni/WinUAE/blob/master/blitter.cpp – search for blitter_line_proc and some of yer questions shall be answered.

(I don’t know how to make blitter lines subpixel-correct either – if there is a way, it ought to be by modifying bltapt/bltamod/bltbmod like you’re trying here though.)

Yes, I’ve been looking at that code a bit. Not that easy to read as it processes things in many small steps, trying to be cycle-exact.

At any rate… I think I got subpixel-correct lines worked out pretty much… But it seems that it is even less suitable for polygon edge rasterizing than regular lines. It’s hard to make ends meet, so to say.

Therefore I’m currently thinking about some kind of hybrid method:

Use the blitter for vertically-oriented lines, and use the CPU for the horizontally-oriented lines (which would generally require less iterations anyway, as they usually won’t span that many scanlines).

As long as I can make the vertical blitter lines start and finish on the correct scanlines (which I think I can), I can use the CPU for the rest, using a vertical approach as well, where I can hopefully match the blitter’s rasterizing convention well enough to make it all work out.

But if that doesn’t work either, then well, I guess my initial approach of using the CPU entirely for the edges is the best option.

Success: http://www.youtube.com/watch?v=f4a7rxGL7b0

The approach I mentioned above, using blitter for lines where abs(dx) < abs(dy), and CPU for the other cases, seems to work quite well.

I've given up on the other cases, the fact that the blitter renders them horizontally rather than vertically seems to be quite problematic. It might be possible to fix those quirks as well, but this will have to do for now.

Pingback: Just keeping it real, part 5.1 | Scali's blog

Pingback: Just keeping it real, part 10 | Scali's OpenBlog™

Keep in mind the amiga was suppose to get connected to a TV screen of all thing back in those days it’s CRT beam bleed out colors choising the right palette means you somewhat free AA.

the better the screen the less this occures though

After doing some extensive reading on Chris Hecker’s algorithm I managed to get a version of it working here myself.

I am curious though at what you did however to only require one div/mod and 16.16 math. I would love to hear back about this.

Well, the 16.16 math is a result of limiting myself to 12.4 coordinates rather than 28.4.

With 12.4 I still get a range of -2048 to +2047, which is more than enough for 320x200ish resolutions with enough room for clipping.

In the DDA case, the div sets up the dx/dy stepping factor, so I need one div per edge.

The subpixel correction is very straightforward for the DDA case, because you just re-use the dx/dy term. As I said in the blog, a 16.16 dx/dy term will give you plenty of precision for these low resolutions.

In the Bresenham case, you need a div/mod for edges where dx/dy > 1 (as I said in the blog, to eliminate the iterative nature). However, Bresenham forces you to also use a div to calculate the start of the error-term for subpixel correction.

So with Bresenham you can only get it down to 1 div/mod with subpixel correction if you choose to have iterative edge-stepping (a while-loop basically, which is generally more expensive than doing one div/mod upfront).

Pingback: Graphics - Kcchao