Picking up where I left off in part 5, the subpixel-corrected polygons on Amiga:
The subpixel-correction in itself appeared to work. My analysis of the input terms for the blitter’s linedrawing appeared to be correct: you can specify the error-term in bltapt. Let us look at this formula again:
bltapt = (APTR) (2*Sdelta-Ldelta);
The Ldelta-term here represents Denominator/2, and can be replaced by a subpixel prestep, something like this:
initialNominator = fraction*2*Ldelta;
bltapt = (APTR) (2*Sdelta-initialNominator);
In other words:
initialNominator = fraction*Denominator;
bltapt = (APTR) (Nominator-initialNominator);
Where fraction would be the distance from the start of the line to the hotspot inside that pixel (so a value between 0 and 1). For completeness I’d like to point out that you have to calculate the Sdelta and Ldelta values from the coordinates at the higher resolution, rather than screen resolution, and you will need to also pre-step the x and y coordinates for your starting point before passing them on to the blitter. Important to note: the blitter can render lines in any direction (I already mentioned that in part 3, when I said I wanted to sort my lines to always render top-down, so that left and right edges would fit properly). This means that you need to perform the subpixel correction in the proper direction as well.
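To make the prestep a bit more concrete, here is a minimal sketch in C. It is only an illustration and not the code from the actual renderer: the helper name initial_error and its parameters are made up, and it assumes the fraction is stored as a 4-bit fixed-point value (0..15, representing 0/16..15/16).

```c
#include <assert.h>

/* Illustrative only: initial error term for a blitter line with a
   subpixel prestep. 'frac16' is the prestep fraction in 4-bit fixed
   point (0..15 represents 0/16..15/16). sdelta and ldelta are the
   short and long deltas of the line, both non-negative. */
static long initial_error(long sdelta, long ldelta, int frac16)
{
    assert(frac16 >= 0 && frac16 < 16);
    assert(sdelta >= 0 && ldelta >= sdelta);

    long nominator   = 2 * sdelta;                  /* Nominator   = 2*Sdelta */
    long denominator = 2 * ldelta;                  /* Denominator = 2*Ldelta */
    long initial     = (frac16 * denominator) / 16; /* fraction*Denominator   */

    return nominator - initial;  /* the value that would go into bltapt */
}
```

With frac16 = 8 (exactly half a pixel) this reduces to the original 2*Sdelta-Ldelta formula, which is a handy sanity check.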
As an aside: I already mentioned that the Amiga Hardware Reference Manual and the System Programmers Guide had slightly different formulas. For some reason the HRM scaled everything up by a factor 2, compared to the SPG. I can see why the SPG uses 2*Sdelta and 2*Ldelta terms: this means you can just use Ldelta as your Denominator/2 value, without losing any precision. I am not sure why the HRM uses 4*dx and 4*dy terms. However, it clearly shows that applying a scale factor to all terms does not affect linedrawing.
Since I use 4-bit subpixel information, my terms are scaled up by 16 already. Therefore I chose to not use the scaled-up versions from SPG or HRM, but just use the dx and dy values as-is. I do not use a Denominator/2 term, since I replace that for subpixel-correction, and I already have higher precision, so for me there is no real advantage there.
But it is still broken…
So far, nothing new. I have just given a slightly more in-depth explanation of how the idea of subpixel-correction on blitter lines works. For line-drawing, this works well enough. However, as we’ve seen above, polygons are still buggy. So what is wrong here?
Well, this again has to do with the blitter rendering lines in different directions. As long as the blitter renders them top-down (the cases where abs(dx) < abs(dy)), things work as expected for polygon edges. However, for the cases where it renders them left-to-right or right-to-left (abs(dy) < abs(dx)), there is that problem again of making the edges meet properly. The line may not end exactly on the scanline you intended, and as a result, there is either a pixel too few or a pixel too many on the scanline, causing the filler to overshoot and fill the entire scanline.
At first I tried to mess about with the line setup to try and fix this problem, but so far I have not come up with a working variation. So then I decided to make a hybrid routine instead: for the cases where abs(dy) < abs(dx), I use a CPU routine, which still draws it top-down, and puts pixels in all the right places. Luckily, these are generally the lines that require the fewest pixels (since we always draw top-down, abs(dy) is the number of pixels we draw), so the blitter still takes care of most of the workload. The CPU routine can also work in parallel with the blitter, so it is not even all that bad.
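A CPU routine of this kind could look roughly like the sketch below. This is not the actual routine, just one possible way to do it under some assumptions: coordinates are non-negative 28.4 fixed-point values (4 subpixel bits), the edge is already sorted so that y0 < y1, scanlines are sampled at integer y positions, and both x and y are rounded up to the next pixel. The names walk_edge and ceil_fx are made up for illustration. A real routine would step x incrementally rather than multiply and divide on every scanline.

```c
#include <assert.h>
#include <stdint.h>

#define SUBPIXEL_BITS 4
#define SUBPIXEL_ONE  (1 << SUBPIXEL_BITS)

/* Round a non-negative 28.4 fixed-point value up to a whole pixel. */
static int ceil_fx(int32_t v)
{
    return (int)((v + SUBPIXEL_ONE - 1) >> SUBPIXEL_BITS);
}

/* Walk an edge top-down and store exactly one x per covered scanline
   in xs[]. Returns the number of scanlines written (at most 'max').
   The division truncates toward zero, which is good enough for a sketch. */
static int walk_edge(int32_t x0, int32_t y0, int32_t x1, int32_t y1,
                     int *xs, int max)
{
    assert(y1 > y0);                       /* edge must be sorted top-down */
    assert(x0 >= 0 && x1 >= 0 && y0 >= 0); /* non-negative coordinates only */

    int32_t dx = x1 - x0;
    int32_t dy = y1 - y0;
    int y_first = ceil_fx(y0);  /* first scanline at or below the start */
    int y_end   = ceil_fx(y1);  /* first scanline NOT covered by the edge */
    int n = 0;

    for (int y = y_first; y < y_end && n < max; y++) {
        /* Subpixel distance from the start point down to this scanline. */
        int32_t prestep = y * SUBPIXEL_ONE - y0;
        int32_t x = x0 + (int32_t)(((int64_t)prestep * dx) / dy);
        xs[n++] = ceil_fx(x);
    }
    return n;
}
```

For example, an edge from (0.0, 0.5) to (10.0, 2.5) covers scanlines 1 and 2, and produces one x value for each of them.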
The only problem-case here is when the polygon is less than 2 pixels wide. If both edges draw at the same pixel, the filler will not work properly, as we’ve seen in part 3. Back then I proposed the solution of using XOR-mode when drawing pixels. This way, when two pixels are drawn on top of each other, the second pixel will turn the first pixel back off, so the filler will not do anything there.
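On the CPU side, the XOR idea can be sketched as follows. Again this is only an illustration with made-up names (xor_pixel, bytes_per_row), assuming a single bitplane with 1 bit per pixel and the leftmost pixel in the most significant bit of each byte.

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative only: toggle one pixel in a 1-bit-per-pixel bitplane.
   'bytes_per_row' is the width of the bitplane in bytes. Plotting the
   same pixel twice restores the original state, so two edges landing
   on the same pixel cancel out and the filler sees nothing there. */
static void xor_pixel(uint8_t *plane, int bytes_per_row, int x, int y)
{
    assert(x >= 0 && y >= 0 && (x >> 3) < bytes_per_row);
    plane[y * bytes_per_row + (x >> 3)] ^= (uint8_t)(0x80 >> (x & 7));
}
```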
This solution works perfectly for our hybrid subpixel-correct renderer, since we now render exactly 2 pixels on every scanline. So we use the blitter to draw in XOR-mode, and we also use an XOR operation to draw the pixels with the CPU. We do not need any other tricks, like throwing away the first pixel of a line. And there we have it then: a blitter-accelerated subpixel-correct polygon drawer on the custom hardware of a 1985 home computer:
I am getting 4-bit subpixel precision here, which is as good as early 3d accelerators from the mid-90s on PC. Quite bizarre actually. Is this just an undocumented feature? I don’t recall ever having seen subpixel-correct lines or polygons on a regular Amiga. But as usual, Amiga makes it possible!
On to some other old junk
Before I end this post, I would like to share some other small things that I have made in the meantime. Namely, on PC I had made CGA, EGA and VGA-optimized polygon fillers. But there are more early graphics standards. One of them is Hercules, which is actually the first graphics standard I ever used on PC. My first PC came with a Plantronics onboard adapter, which was compatible with both Hercules and CGA, and also had a special 16-colour mode. At first I only had a monochrome monitor, so Hercules was all I could use. It wasn’t even that bad, really. Sure, it was monochrome, but the resolution was 720×348 pixels, which was incredibly high at the time. CGA could only do 640×200, EGA did 640×350, and VGA did 640×480.
Anyway, I decided to give it a go. I tried to look at the ah=0 int10h setvideomode function to see which mode it would be… Shock! Horror! There *is* no mode for Hercules. Apparently Hercules does not have any BIOS API, so the only way to set a videomode is to manually reprogram all registers. Luckily I found the right register settings on the internet somewhere. And before long I could switch to graphics mode and back.
Then I had to figure out how to address each pixel in memory. Hercules is quite quirky that way. The scanlines are stored in a 4-way interleaved arrangement. Each scanline is just as you expect: 720 pixels packed into bytes, giving a total of 720/8 = 90 bytes. But the addressing of the scanlines is like this:
Y MOD 4 == 0 at B000:0000 + (Y/4)*90
Y MOD 4 == 1 at B000:2000 + (Y/4)*90
Y MOD 4 == 2 at B000:4000 + (Y/4)*90
Y MOD 4 == 3 at B000:6000 + (Y/4)*90
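The table above translates into a small address calculation. The sketch below uses made-up names (herc_offset, herc_mask) and assumes the leftmost pixel of each byte sits in the most significant bit. The returned offset is relative to segment B000.

```c
#include <assert.h>

#define HERC_WIDTH   720
#define HERC_HEIGHT  348
#define HERC_PITCH   90      /* 720 / 8 bytes per scanline */
#define HERC_BANK    0x2000  /* distance between the 4 interleaved banks */

/* Offset within segment B000 of the byte that contains pixel (x, y). */
static unsigned herc_offset(unsigned x, unsigned y)
{
    assert(x < HERC_WIDTH && y < HERC_HEIGHT);
    return (y & 3) * HERC_BANK + (y >> 2) * HERC_PITCH + (x >> 3);
}

/* Bit mask of pixel x within its byte (assumed: leftmost pixel = MSB). */
static unsigned char herc_mask(unsigned x)
{
    return (unsigned char)(0x80 >> (x & 7));
}
```

So scanline 1 starts at offset 2000h, and scanline 4 is back in the first bank, 90 bytes after scanline 0.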
So, now that the addressing is worked out, it’s time for the final details. Hercules uses a Motorola 6845 CRT controller, just like CGA (and EGA/VGA are near-100% compatible extensions of the 6845). The main difference is that monochrome adapters have their I/O ports based at 3B0h rather than at 3D0h for colour adapters (so that both can co-exist in the same system). Hercules comes with 64kb of memory, which means it supports 2 pages of memory. A single screen takes 720*348/8 = 31320 bytes, so just under 32kb of memory. The second page is at segment B800. This is the same segment as is normally used by CGA. Which means that you can use the second page, but only if you do not have a CGA-compatible card in your system as well (the first page is always available, so dual monitor setups are possible, as long as one is MDA/Hercules and the other is CGA/EGA/VGA-compatible).
Assuming that the Hercules card is the only one in the system, we can use double-buffering in video memory, just like on EGA and unchained VGA modes. Porting my polygon routine was quite straightforward from here on in. There was a slight problem however: the routine only supported flatshading, and Hercules has only two shades: black and white (or amber, green, or whichever other colour your monochrome display may use). So I decided to implement a simple dithering scheme, so that you could discern the individual faces:
Yes, it’s rather flickery, because the vsync does not appear to work correctly. I’m not sure if that’s dosbox’ fault, or if the vsync bit on the 6845’s status register does not work on real Hercules hardware. But it will have to do.
PCjr and Tandy
Although I have now REALLY covered every single graphics card I ever owned, there was still one graphics standard that was reasonably popular in the early days: the enhanced 16-colour mode of IBM’s PCjr, and the clones made by Tandy. Okay, I have no support for the Plantronics mode on my first PC, but I no longer have that PC, and I don’t think dosbox is compatible with it… It seems easy enough to add support for it though: It is like CGA, but with two extra even/odd bitplanes at segment BC00h. It combines the 2-bit pixels from B800 and BC00 to a 4-bit pixel.
Right, now onto PCjr/Tandy, because that mode IS supported by dosbox. This is yet another 16-colour mode, it does not work like EGA, and not like Plantronics either. Instead, it uses a packed-pixel format like CGA, so now there are two 4-bit pixels packed into each byte. And where CGA has even/odd planes at B8000 and BA000, PCjr/Tandy has 4 scanline-interleaved planes, much like Hercules, at segments B800, BA00, BC00 and BE00.
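The addressing can be sketched much like the Hercules case. Note the assumptions here, none of which are spelled out above: a 320×200 16-colour mode with 160 bytes per scanline, and the even (left) pixel of each byte in the high nibble. The names tandy_offset and tandy_put are made up for illustration. The offset is relative to segment B800.

```c
#include <assert.h>

#define TANDY_PITCH 160     /* assumed: 320 pixels * 4 bits / 8 */
#define TANDY_BANK  0x2000  /* B800, BA00, BC00, BE00 are 2000h bytes apart */

/* Offset within segment B800 of the byte that contains pixel (x, y). */
static unsigned tandy_offset(unsigned x, unsigned y)
{
    assert(x < 320 && y < 200);
    return (y & 3) * TANDY_BANK + (y >> 2) * TANDY_PITCH + (x >> 1);
}

/* Merge a 4-bit colour into an existing byte, leaving the neighbouring
   pixel untouched (assumed: even x lives in the high nibble). */
static unsigned char tandy_put(unsigned char old, unsigned x, unsigned colour)
{
    assert(colour < 16);
    return (x & 1) ? (unsigned char)((old & 0xF0) | colour)
                   : (unsigned char)((old & 0x0F) | (colour << 4));
}
```

The read-modify-write in tandy_put is only needed at the ends of a span. The middle of a span can be filled with whole bytes or words.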
So PCjr/Tandy does not lend itself very well to fast polygon filling. With just 2 pixels per byte, and no special trickery to fill multiple planes at a time, it is not going to be all that efficient. But I’ve implemented it anyway, just to complete the whole set of graphics adapters:
And well, that’s it for now. I am not sure what I am going to do next. As I already mentioned in part 1, I may explore the graphics capabilities (or lack thereof) of the Commodore 64, or I may evolve these simple polygon routines into a more complete engine, allowing some simple objects to be animated on screen.
I’ve been meaning to ask about this for a while as it’s how I originally found your blog. Would you share the code/project/etc you wrote to the Amiga sub-pixel correct renderer?
Hum, well, not really. The code is in a horrible state, and never really got past the proof-of-concept stage.
Is there any part in particular that you’re interested in, or are you just looking for a complete polygon renderer for Amiga?
Oh I don’t mind about the state of the code, more about the complete renderer. Not done any Amiga development in a lot of years and just wanted to take a look at what you’d done, and what you’d used etc.
Yea, well… my code is far too buggy at this point to be of much use, I suppose. There are bugs regarding the set up of the graphics mode and returning back to WB properly… and there are bugs in the polygon routines, which make it crash sometimes.
I think you’re better off downloading that Danish assembly course that I mention somewhere. It includes a complete polygon routine as well.
With the information in my blog you should be able to convert the routine to a subpixel-corrected one. Or if that fails, I could give you that part of the code, since that seems to be the least buggy part at this moment 😛
I can’t thank you enough for your series on this subject. I keep threatening to write a PCjr demo someday (I have all the hardware, and I know 8088 assembler well) but I never seem to find the motivation. Your series exploring fillers has nudged me a little closer, so thank you 🙂
Hum… any concrete plans on that yet? The PCjr graphics mode is not that fast, and combined with an 8088 CPU, I guess 3d polygon stuff is not going to be that much of a success. Also, do you have any idea about a music system? 8088 corruption just played back a single sample, right? I have been thinking of doing some AdLib music, which should not take much CPU. The Edlib tracker and replay code are still available from Vibrants, and it’s all 16-bit code. I just need to find a musician who would want to make an Edlib song 🙂
For 286 I have been thinking about a MOD player as well… Crystal Dream somewhat got away with that. Perhaps if the music itself is optimized to use only 2 channels (like Purple Motion’s Minimum Velocity), it would be even better. Downside is, I’d probably have to write the whole MOD routine myself, because I haven’t found any 16-bit replay routines, let alone ones that would be fast enough on a 286 or slower. Alternatively, I could just reverse Crystal Dream a bit further and ‘borrow’ their routine 🙂
But so far I think Edlib is the best option. AdLib music also has a certain charm to it.
(Sorry for the delay in replying; I didn’t have “notify me” turned on.) I don’t have concrete plans yet, I have to free up other things in my life right now. But I will let you know if I’m heading to the latest easter party and maybe at that time we can collaborate? 🙂
3D filling is slow on PCjr (and CGA) so the trick is to write less bits. PCjr has a 160×200 mode that is only 16K; that is perfectly usable. And if the 3D parts are short, there are other tricks that can massively speed things up (don’t want to give away too much before I use the tricks, sorry 🙂
Adlib is a good choice for size and fidelity, but the time it takes is horrendous. It can take up to 75% of a frame on 8088+CGA to write the Adlib registers (I’ve measured it using JCH’s replayer). The time sink is due to the required waiting for the registers to recover. SBPro 2 (OPL 3) and later don’t have this problem. The only practical realtime solution for 8088 is either changing the speaker 60 times a second (like MONOTONE) or on the PCjr you can drive the SN76496 which takes very little time. I wrote a VGM player for PCjr that takes no more than 6 scanlines to play a frame’s worth of music.
MOD playing is possible on 8088 but you must dedicate the entire machine to it, and use a Sound Blaster so that you can calc and play at the same time. Galaxy Player (look for glx212) succeeds at this and can play 12kHz 8-bit mono at 4.77MHz using an ingenious self-modifying code method. Email me for more info. I’d give my right nut to see the original source 🙂 but the author has moved on to bigger and better things and I can’t find his contact info. As for 16-bit replay routines for 286: 386s and higher were common when 16-bit cards came out so that’s primarily why you don’t see any code for it, but it shouldn’t be any slower than 8-bit code running on 8088. If anything, it should be the same speed because you’re doing the same things (mixing using source samples and volume tables), only the target is larger.
The replayer in Crystal Dream is very “dirty”, it sounds like they were trying to save time by pre-shifting all samples by 2 to save volume table lookups if no volume adjustment was necessary, or something else hideous. I’d be very curious to see the routine if you manage to pull it out.
Well, what I meant with 16-bit is the code itself, not the samples. The old DOS MOD routines that I’ve found on the net are all 32-bit code, and would require a 386 to run.
Also, I was not aware that the original Adlib was so slow. My first sound card was an SB Pro 2.0 in a 386SX-16, and that seemed to take very little time for playing music. Then again, that was an OPL3 of course. Never had any OPL2-based cards.
So the PC platform sucks even more than I thought 🙂
Anyway, I was originally planning to target the 286 platform with VGA (and SB Pro 2.0 would realistically fit in there), because it actually allows quite decent 3d stuff, rivaling Amiga demos (much like Crystal Dream itself).
Anything below the 286 is such an incredible deal slower, that I’m not sure if 3d stuff would be worthwhile. Especially if you also go below EGA graphics. Because this series originally started out with exploring special EGA and VGA tricks to get fast polygon filling.
Tandy/PCjr is by far the worst graphics standard, because each pixel takes up 4 bits, and there is no help from the graphics chip whatsoever to improve filling. So the best you can do is rep stosw, which will give you only 4 pixels at a time.
Going to CGA will already improve things a lot, since the pixels are only 2 bits each. You’ll get 8 pixels at a time, which is as good as mode X VGA (but still twice as slow as the 16 pixels you get out of EGA). And if you can’t fit a backbuffer in videomemory and flip buffers, it will take another memcpy to get things on screen.
But well, the obvious trick is to just use a smaller window on screen for the 3d, and fill the rest with a logo/scroller or such. Another simple trick could be to apply interlacing.
I’m willing to give it a try, just for the heck of it. See how far we can push the platform, and see if we can actually make a worthwhile demo within these insane limits.
Looks like I have a lot to reply to here, so I’ll do so over email 🙂 But 286+VGA+GUS is a totally doable target; you have a MUL and DIV that don’t suck, you can shift and rotate very fast, and with a GUS you don’t have to waste time mixing. Legend by Impact Studios is a great demo that shows off what a 286 can do; it’s almost as good as Crystal Dream.
As for PCjr being the worst standard, it depends on how you look at things. An STOSW can fill 4 pixels, or 8 pixels in 320x200x4, or 16 pixels in 640x200x2. The scanline format is the same. Tandy/PCjr can use a 4-color mode but remap the palette entries to any of the 16 text colors, so that can help for “cheating”.
What about the CGA 160×100 16 color mode or the CGA composite mode?
I actually did implement support for CGA composite:

It’s very similar to regular CGA modes, just half the horizontal resolution (basically every 2 2-bit pixels are now turned into 4-bit pixels, so you get 2 pixels packed into a byte instead of 4), and you need to enable NTSC color burst by setting the appropriate register value in the CRTC.
160×100 is just a tweaked textmode, which means it is quite inefficient to render graphics. You have to modify the attribute of each character separately, so it doesn’t lend itself to efficient polygon filling with multiple pixels at a time. I never bothered to experiment with this mode for that reason.
The reason BLTAPT on the Amiga has a multiplication by 2 is that bit 0 is not available in any XLPTL registers on the Amiga system (word-aligned memory addressing).
Besides that, the Amiga’s design only allows a channel combination for the ALU, by which I mean that the Channel A pointer and Channel A modulo can be added or subtracted, with the result being stored back in the Channel A pointer.
Multiplying out a fraction to remove the floating-point arithmetic all helps to reduce hardware requirements.