8088 MPH: The final version

Although we were very happy with the win at Revision, we were not entirely happy with 8088 MPH as a whole. Because of the time pressure, there were some things we could not finish as well as we wanted. Also, as others tried to run our demo on real hardware, we found that even real IBM CGA cards had problems with some of the CRTC tweaks, since not all IBM cards used the same Motorola MC6845 chip. Some used a Hitachi HD6845 instead, which does not handle an hsync width setting of 0 in the same way (the real Motorola chips interpret a value of 0 as 16). The Amstrad CPC suffers from a similar issue: over its lifespan, various 6845 chips have been used, introducing incompatibilities.

On the other hand, we also found out that there are indeed clones that are 100% cycle-exact to a real IBM PC. The Philips P3105 machine I have, which I discussed in the CGADEMO blog earlier, is one of these. In that blog I said the system was not cycle-exact, since the CGADEMO showed some problems with the scroller that were not visible on a real IBM. I also said that I suspected the ATi Small Wonder videocard. I was already working on 8088 MPH at the time, and I knew that I had to have a cycle-exact machine. So I bought an IBM 5160. Now that 8088 MPH was done, I had some time to experiment with the hardware. And indeed, it turns out that the Philips machine uses (clones of) the same Intel 82xx chipset as the IBM 5150/5155/5160 (also known as the MCS-85 set, originally developed for the 8085 CPU). Once I swapped the ATi Small Wonder for the real IBM CGA card, the Philips P3105 machine ran 8088 MPH exactly the same as the 5160. Likewise, putting the ATi Small Wonder in the 5160 made it produce the same bugs in CGADEMO and 8088 MPH as the Philips P3105.

So there is hope: there are at least SOME clones that are cycle-exact to a real IBM. The 82xx chipset is probably a feature to look out for. I have another 8088 machine, a Commodore PC20-III, but it uses a Faraday FE2010 chipset instead, which is not entirely cycle-exact, not even with the real IBM CGA card installed. In fact, it crashes on the endpart. I am not sure if there are any CGA clones that are cycle-exact to a real IBM, though. The ATi Small Wonder is not, at least. And neither is the Paradise PVC4 in the PC20-III. What’s more, even though both these cards have a composite output, neither of them produces the same artifact colours as the IBM CGA does, so the demo looks completely wrong.

Speaking of artifact colours, not all IBM CGA cards are equal either, as you might know. We distinguish two versions, known as ‘old-style’ and ‘new-style’ (but as mentioned above, there are variations in the components used as well… and apparently some versions have a green PCB while others have a purple PCB, but neither appears to have any relation to which version of the artifact colours you get). Here is a picture that demonstrates the difference between the two cards (courtesy of minuszerodegrees.net):

[image: 5150_compare_cga_adapter]

As you can see in the yellow outlines, the difference is mainly a set of resistors. These resistors are used in the generation of the composite output signal. The top card is an ‘old-style’ CGA card, and the bottom one is a ‘new-style’ card. We believe that IBM changed the circuitry because of the monochrome screen in the portable 5155 PC. This internal screen was connected to the TV modulator header of the CGA card, which basically feeds it the composite output. The problem with the old-style card was that, when used in monochrome, it produced only a few unique luminance values. The tweaked circuitry delivered 16 unique luminance values, which made it more suitable for monochrome displays.

The graphics in our demo were entirely tuned for old-style CGA, partly because we wanted to target the original 1981 PC, and partly because old-style CGA has slightly nicer colours, especially for the new 256/512/1024-colour modes.

For more in-depth information, see this excellent Nerdly Pleasures blog.

So, we decided to do something about it: we made a to-do list of things we wanted fixed or changed in the final version. Let’s go over these changes.

Overall tweaks

There were some minor glitches between effects, when videomodes were switched, and when timer interrupt handlers were installed or removed, which led to the music skipping, the screen losing sync and other tiny things like that.

We fixed all the visual and aural bugs we could (sadly, because CGA is rather crude, it is very difficult to make videomode changes entirely invisible, so we have accepted that some tiny visual glitches remain here and there).

One fix in particular I’d like to mention is the plasma effect. This runs in an 80-column textmode, which is susceptible to so-called ‘CGA snow’. Basically, because IBM used single-ported DRAM, the video output circuitry cannot access memory while the CPU does. In that case the output circuitry just receives whatever value happens to be on the data bus, rather than the byte it actually wants to read. The resulting random patterns of characters are known as ‘snow’ (the 80-column mode uses more bandwidth than the other modes; in the other modes, IBM avoided snow by inserting wait states on CPU accesses).

Our plasma routine is actually much faster than what you see in the demo, because we ‘throttle’ it: to avoid snow, we only write to the screen when the video output is not reading memory. Or at least, that was the idea. In the capture of the party version you can see one column of CGA snow on the left of the screen, which should have been hidden in the border.
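
For illustration, here is roughly what the conventional snow-avoidance technique looks like in C. This is not the demo’s approach (the plasma relies on lockstep cycle-counting rather than polling, as described below), and the helper name is made up; it assumes a Borland-style 16-bit DOS compiler, which provides inportb() and MK_FP() in dos.h:

    #include <dos.h>        /* inportb(), MK_FP() on Borland Turbo C */

    #define CGA_STATUS 0x3DA            /* CGA status register port */

    /* Write one character + attribute cell without causing snow:
       bit 0 of the status register is 1 whenever the video circuitry
       is not reading the buffer (i.e. during retrace), so wait for
       the start of such a window and then store the word quickly.   */
    static void put_cell(unsigned cell_index, unsigned char ch, unsigned char attr)
    {
        unsigned far *vram = (unsigned far *)MK_FP(0xB800, 0);
        unsigned cell = (unsigned)ch | ((unsigned)attr << 8);

        while (inportb(CGA_STATUS) & 0x01)   ;  /* wait for active display   */
        while (!(inportb(CGA_STATUS) & 0x01));  /* wait for retrace to begin */
        vram[cell_index] = cell;                /* safe to write now         */
    }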

This is because we worked under the assumption that if we set the DMA refresh timer to 19 memory cycles instead of 18 (19 being a divisor of the 76 memory cycles in a scanline) and then brought the CRTC, CGA and CPU clocks into lockstep, the entire system would be in lockstep. At the party we discovered that the plasma would sometimes work as designed and hide the snow in the border, and sometimes the snow would be visible.
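
To make the arithmetic behind that assumption explicit (using only the numbers already mentioned):

    76 / 18 = 4.22…   (refresh slots drift relative to the scanline)
    76 / 19 = 4       (exactly four refresh slots per scanline)

With a period of 18 the refresh DMA slots land in different positions on successive scanlines; with 19 every scanline gets exactly four refresh slots in the same positions, so the refresh stays locked to the CRTC.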

As we investigated the issue further, we found that the PIT’s phase relative to the CGA clock can differ on each power-cycle. That is, the system has a 14.31818 MHz base clock, from which the CPU, the PIT and the ISA bus take their clock signal (and thereby the CGA card, since unlike later graphics cards it has no timing source of its own; in fact, original IBM PCs have an adjustable capacitor on the motherboard to trim the system clock, so that the NTSC signal of the CGA card can be fine-tuned). The PIT runs at 1.19 MHz, which is 1/12th of the base clock, while the CPU runs at 4.77 MHz (and thereby the data bus/ISA bus on which the CGA card runs), which is 1/3rd of the base clock. So one PIT cycle is equal to 4 CPU cycles. However, the PIT can start on any of these 4 cycles. So there are 4 possible phases between CPU/CGA and PIT, and as far as we know, there is no way to control this from software. Instead, we have a tool which determines the relative PIT/CGA phase that the PC has powered up in. Then we just have to power-cycle until we have the right phase, after which we can run the plasma with no snow. Although there are other cycle-exact parts in the demo, they are not affected by this relative phase.
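
To spell out the divider chain (again using only the figures above):

    14.31818 MHz / 3  ≈ 4.773 MHz   (CPU clock, and thereby the ISA bus and the CGA card)
    14.31818 MHz / 12 ≈ 1.193 MHz   (PIT clock)
    (1/3) / (1/12) = 4              (one PIT cycle = 4 CPU cycles)

Since nothing guarantees which of those 4 CPU cycles the PIT’s clock edge lines up with at power-up, you end up with the 4 possible CPU/CGA-to-PIT phases mentioned above.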

End tune

The end tune was rather quiet. The reason for this lies in how the tune is composed: it just happens to be a rather quiet tune. So we added a ‘preprocessing’ step, which can be seen as a ‘peak normalize’ step: the entire tune is mixed, and the peak amplitude is recorded. Then the song data is compensated to boost the volume to the maximum for that particular tune. Since the sample technique is timer-based (pulse-width modulation), you have to be careful not to get any overflow during mixing. Whereas conventional PCM samples have no problem with overflow, other than some clipping/distortion (which is not very noticeable in small amounts), an overflow with PWM can result in a timer value of 0, which the hardware interprets as the maximum timer count of 65536. This means you will get a long pause in the music whenever that happens. Negative timer values will also be interpreted as large positive counts.
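
To illustrate the pitfall, here is a minimal sketch of such a peak-normalize pass in C. The names and structure are hypothetical (the real preprocessing works on the song data itself, and the period per sample depends on the replayer’s timing), but the clamp at the end is the important part: a PIT count of 0 means 65536, and values that would go negative wrap to large positive counts the same way.

    #include <stddef.h>

    /* Peak-normalize mixed sample data for PWM output on the PIT.
       'period' is the number of PIT counts available per sample
       (a hypothetical parameter).                                  */
    void normalize_to_pwm(const int *mixed, unsigned short *pwm_out,
                          size_t n, unsigned period)
    {
        size_t i;
        long peak = 1;

        for (i = 0; i < n; i++) {                /* pass 1: find the peak   */
            long a = mixed[i] < 0 ? -(long)mixed[i] : mixed[i];
            if (a > peak) peak = a;
        }

        for (i = 0; i < n; i++) {                /* pass 2: rescale + clamp */
            long v = (long)mixed[i] * ((long)period / 2 - 1) / peak
                   + (long)period / 2;           /* centre within the period */
            if (v < 1)                v = 1;     /* never 0 (= 65536!)       */
            if (v > (long)period - 1) v = (long)period - 1;
            pwm_out[i] = (unsigned short)v;
        }
    }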

So the end tune is now louder. This not only makes it easier to hear on a real PC speaker, it also means that the carrier whine is relatively quieter than before (the whine itself still has the same amplitude as it did before). Aside from that, the limited sample resolution we have with the PWM technique is now used more effectively, so we get slightly better effective resolution. So overall it sounds better too.

DOS 2.x

As we found out after the release, the demo didn’t work when people tried to run it on DOS 2.x. After some debugging, we found that this was a shortcoming of the loader: we retrieved the filename of the loader (8088MPH.EXE) in order to open the file and read some of the embedded data from it. However, as it turns out, this functionality was only added in DOS 3.x; in DOS 2.x it simply is not possible. So we added ‘8088MPH.EXE’ as a default value, which the loader will use in that case. The rest of the demo had no problems running under DOS 2.x, so we are now entirely 2.x-compatible, instead of just 3.x. We also looked at DOS 1.x support briefly, but that version of DOS did not have any memory management yet, nor did it have the ‘nested process’ functionality that we use with our loader to load and start each new part in the background.
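
Conceptually the fix looks something like this (a hedged C sketch with a hypothetical helper, not the actual loader code): under DOS 3.0 and later the fully-qualified program name is stored after the environment strings, which is also where a C runtime gets argv[0] from, while under DOS 2.x that field simply does not exist, so a hard-coded default is used instead.

    #include <stddef.h>

    /* Hypothetical helper: return the name to open for reading the
       embedded data.  Under DOS 3.x+ argv[0] holds the loader's path
       (taken from the environment block); under DOS 2.x it typically
       comes back empty, so fall back to the hard-coded default.      */
    const char *loader_filename(int argc, char *argv[])
    {
        if (argc > 0 && argv[0] != NULL && argv[0][0] != '\0')
            return argv[0];
        return "8088MPH.EXE";   /* default used on DOS 2.x */
    }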

So supporting DOS 1.x would require us to implement quite a bit of extra functionality, which effectively would make it trivial to run the demo directly as a booter, without any DOS. A bit too much for a final version, but perhaps for a future project, who knows…

Calibration screen

For the party version, we had just hardcoded the CRTC values for the tweaked modes to best suit the capture setup that we had. These values can be less than ideal, or even problematic, on other setups. So we decided to add a calibration screen for the final version, where you can fine-tune the CRTC values to get a stable image and the best possible colours for the tweaked modes. This also allows you to avoid an hsync width of 0, which makes the demo compatible with non-Motorola 6845 chips as well.
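
For context, programming a CRTC register on CGA is just a write to the index/data port pair, which is essentially what the calibration screen needs to do whenever you adjust a value. A minimal sketch, again assuming a Borland-style 16-bit DOS compiler with outportb() in dos.h:

    #include <dos.h>            /* outportb() on Borland Turbo C */

    #define CRTC_INDEX 0x3D4    /* 6845 address (index) register on CGA */
    #define CRTC_DATA  0x3D5    /* 6845 data register                   */

    /* Write one value into a 6845 CRTC register.  Register 3 is the
       sync width register; its low nibble is the hsync width that a
       calibration screen can move away from 0 for non-Motorola 6845
       variants.                                                       */
    static void crtc_write(unsigned char reg, unsigned char value)
    {
        outportb(CRTC_INDEX, reg);
        outportb(CRTC_DATA, value);
    }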

New-style CGA

By far the most work in the final version went into the new set of graphics. While we had to make some changes to the code to support multiple sets of graphics and palettes (and try to stay within the 360k floppy limit), the new-style CGA support is mostly VileR’s handiwork. He had to re-colour most graphics in the demo to compensate for the different colours of new-style CGA. His attention to detail is second to none. For example, he even did a new set of borders for the polygon part. I hadn’t even thought of doing that, since the border didn’t look that different on new CGA to me, and it didn’t look ‘wrong’, unlike some of the other graphics. But once I saw the subtle changes he made, new CGA really did look better that way.

Also, even though one of the reasons to target old CGA was a better selection of colours, I must say the new CGA graphics look fantastic, and they are very close to the old CGA versions. In fact, I might even prefer the blue shades on the donut in new CGA mode. I am very happy with that personally, since I only have a new-style CGA card, so I was not able to see the demo properly on my own machine until now.

The texture for the rotozoomer was also updated. The reason for this is that there was a slight miscommunication between Trixter and VileR for the party version, which resulted in VileR using a different palette for the texture than what Trixter intended. The rotozoomer runs in an interlaced screen mode. We tried to set the unused scanlines to a good background colour to compensate for that, but it did not work too well. The final texture works much better.

The plasma effect also received an update to the ’16 color’/’256 color’ banners; there was no time to finish these for the compo. And lastly, the end scroller already used some ANSI art, but only for the ‘large font’ parts. VileR has now created a whole colourful ANSI background for the scroller.

One last graphics tweak is the title screen. In order to get the gray colour for the text, we had to set the foreground colour to a lower brightness (this uses mode 6, like the sprite part with the fade-in/out). This meant that the title screen also appeared at a lower brightness than it was designed for. In the final version we have some code that adjusts the palette on the fly as the title screen rolls down, so that the text remains gray while the title screen itself is shown at full brightness.
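
The on-the-fly adjustment comes down to reprogramming the CGA colour-select register at the right point in the frame; the demo does this with carefully timed code, but the register write itself is simple. A rough sketch (same Borland-style outportb() assumption as above):

    #include <dos.h>                 /* outportb() on Borland Turbo C */

    #define CGA_COLOR_SELECT 0x3D9   /* CGA colour-select register    */

    /* In 640x200 mode (mode 6) the low 4 bits of this register set the
       foreground colour.  Switching it mid-frame, e.g. between light
       gray (7) for the text area and white (15) for the picture area,
       is roughly how the text can stay gray while the rolling title
       screen is shown at full brightness.                             */
    static void set_foreground(unsigned char colour)
    {
        outportb(CGA_COLOR_SELECT, colour & 0x0F);
    }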

For more in-depth information on the graphics changes for the final version, I direct you to VileR’s excellent blog.

Bonus

Since our demo introduces various ‘new’ videomodes, reenigne has also prepared some DOSBox patches to improve the CGA emulation, fixing some bugs (e.g. the tweaked modes in the sprite part did not work in composite mode with the official 0.74 version of DOSBox) and mostly adding a more accurate emulation of the CGA composite mode, by properly emulating the decoding of the NTSC colour signal. The patches also allow you to switch ‘monitors’ manually, since DOSBox originally implemented composite as a ‘video mode’, which it tried to auto-detect. That is not how it should work: in the real world, your CGA card has both an RGBI and a composite output, and both are always enabled. So either you get everything on a composite screen, or you get everything on an RGBI screen. Therefore you want to select which type of monitor DOSBox should emulate, which is now possible. It also allows you to switch between old-style and new-style colours.

The patches will not allow you to run 8088 MPH entirely in DOSBox, but some parts now look very close to real hardware, and should also benefit any games that make use of CGA composite modes.

In fact, the video capture we made of the final version of our demo was also done with an NTSC decoding routine, similar to the one in the DOSBox patch, but tuned for maximum quality rather than performance.

The demo was captured from real hardware in raw mode, and the raw captured frames were then sent to the NTSC decoder to generate high-quality colour frames. We feel that this capture method is closer to what you would see on a proper high-quality NTSC composite monitor than the cheap NTSC decoding built into most capture devices, which leads to annoying vertical banding issues (as you can see in the party capture video). Another small fix is that we did not use the correct aspect ratio for CGA when capturing the party version; the final version is captured with the correct aspect ratio.
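
For the curious, the core of NTSC colour decoding is quadrature demodulation of the chroma signal followed by a YIQ-to-RGB conversion. The sketch below is a deliberately tiny illustration of that idea, not the decoder used for the capture (which is tuned for maximum quality, as described above); it assumes the composite signal is sampled at exactly 4x the colour subcarrier, which is the CGA case (14.31818 MHz = 4 x 3.579545 MHz), and the hue offset parameter is a hypothetical per-setup calibration value.

    #include <math.h>

    #ifndef M_PI
    #define M_PI 3.14159265358979323846
    #endif

    typedef struct { double r, g, b; } RGB;

    /* Decode one group of four composite samples taken at 0, 90, 180
       and 270 degrees of subcarrier phase into an RGB value.          */
    RGB decode_quad(const double s[4], double hue_offset_deg)
    {
        /* Luma: the chroma carrier averages out over one full cycle.  */
        double y = (s[0] + s[1] + s[2] + s[3]) / 4.0;

        /* Quadrature demodulation: opposite-phase samples cancel the
           luma and leave the two chroma components (up to scale).     */
        double u = (s[0] - s[2]) / 2.0;
        double v = (s[1] - s[3]) / 2.0;

        /* Rotate by the burst phase so the hue comes out right.       */
        double a = hue_offset_deg * M_PI / 180.0;
        double i = u * cos(a) - v * sin(a);
        double q = u * sin(a) + v * cos(a);

        /* Standard YIQ -> RGB conversion.                             */
        RGB out;
        out.r = y + 0.956 * i + 0.621 * q;
        out.g = y - 0.272 * i - 0.647 * q;
        out.b = y - 1.106 * i + 1.703 * q;
        return out;
    }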

Live

If you haven’t seen it yet, there is a live capture of the demo at the Revision 2015 oldskool compo, recorded by Trixter:

I really like the audience reactions. We very deliberately designed our demo for an audience like this, with the intro and all: putting people on the wrong foot, hitting them with unexpected tricks, and moving on to the next part just as they figure out what is going on.

Likewise, we tried to build up the speaker music gradually, so that people can get used to the heavily arpeggiated beeps, as the complexity scales up to squeeze as much out of the speaker as possible, rather than losing people by blasting them with very complex beeping out of the gate.

The response of the audience is the confirmation that we got most parts right in that sense: people laughed at the right places, and cheered and applauded in others. If you listen closely, you can actually hear some ‘nerdgasms’: people shouting “What the…!”, “Jesus”, “Oh man!” or “Oh god!” 🙂

We didn’t know beforehand how much people would ‘get’ about this rather obscure platform and its limited graphics and music, but we couldn’t have hoped for a better response: nearly every part was greeted with cheers and applause. A truly unique experience, and it’s unlikely that we will ever match it.

Anyway, here’s our final version (old CGA). We hope you like it!
And this is a capture of new CGA:


20 Responses to 8088 MPH: The final version

  1. Pingback: Final version of 8088 MPH released « Oldskooler Ramblings

  2. Pingback: O demo do Demo pra PC fica ainda mais demoníaco | Retrocomputaria Plus

  3. John Young says:

    Thank you for all the 8088MPH write-ups. Fascinating reading, although I only understand a little of it. But thank you for it all! 🙂

  4. Mark says:

    You wrote:
    “The demo was captured from real hardware in raw mode, and then the raw captured frames were sent to the NTSC decoder to generate high-quality colour frames. We feel that this capture method is closer to what you would see on a proper high-quality NTSC composite monitor than the cheap NTSC decoding built into most capture devices, which lead to annoying vertical banding issues (as you can see in the party capture video).”

    What capture hardware did you use to do that? I’d like to know of any hardware for digitizing a composite video signal (that does not decode it).

  5. Pingback: Latch onto this, it’s all relative | Scali's OpenBlog™

  6. Pingback: PC-compatibility, it’s all relative | Scali's OpenBlog™

  7. Pingback: Any real-keeping lately? | Scali's OpenBlog™

  8. What happens if you run 8088MPH on something like an 8086 or an NEC V20/V30? Obviously it would speed up to an incorrect speed, but by how much would it do so, and would there be an easy way to write a correction patch?

    • Scali says:

      What you need to understand is that different CPUs are not just ‘faster’ or ‘slower’. The speed difference comes down to the instruction level. Not all instructions speed up by the same amount.

      Since the code in 8088 MPH is specifically tweaked so that it runs at *exactly* the right speed on a true 8088 at 4.77 MHz, that is the only CPU it runs on correctly (and actually it goes beyond just the CPU: the rest of the system, including motherboard, chipset, video card and whatnot, also needs to be cycle-exact in speed).

      An 8086 or V20/V30 will not just make the demo run faster, but it will introduce all sorts of ‘jitter’ artifacts, since some parts speed up more than others.
      For example, the end-tune is tuned so that all blocks of instructions for calculating each sample in the music run at exactly 288 cycles on an 8088 at 4.77 MHz. Each block uses a different sequence of instructions, so on a different CPU, some blocks will speed up more than others, causing the speed of the music to go up and down ‘randomly’.

      There is no easy fix for that. For each CPU (and each clockspeed), you would need to rewrite the code to match that particular CPU’s instructions.

      The only alternative is to use an absolute time source. The problem with that is that it takes a lot of overhead, so it will require a much faster CPU. It might only be possible on 286+ systems that way.

      It completely defeats the point of this demo, which is to show what you can make the original IBM PC 5150 do. If you’re not going to run it on a real IBM PC anyway, why not just stick to the YouTube capture?
      I don’t see the point in trying to make a demo for IBM PC run on faster systems. What’s impressive about that? They are faster systems.

      • I was only asking out of curiosity, mainly. But if I need an excuse… It would probably be something to do with wanting to run the demo on a real DOS machine, but not having access to an original IBM PC.

        Another question though: how much work do you think it would require if somebody wanted to port 8088 MPH to a completely different platform?

      • Scali says:

        The problem is that it’s specifically tailored to take advantage of the original PC’s hardware.
        At the very least you need an original IBM CGA card. It doesn’t work properly on most clones, and it certainly doesn’t work on EGA or VGA.

        Porting is also not possible for that reason.
        The only thing you could do is reimplement it for another system, but then you’d be using different effects and different capabilities; they’d just look the same.
        I mean, take the 1024 colour effects for example. That’s only possible because of specific quirks of the IBM CGA card.
        It won’t work on EGA or VGA, and neither EGA nor VGA have a video mode that allows 1024 colours on screen.
        So there basically is no way to port it.
        I just don’t think you understand what this demo is all about. It is written in the same way that games and demos are developed on eg an Atari 2600, NES or Commodore 64: you assume that all machines have the EXACT same hardware, and take maximum advantage of everything you know about the hardware’s quirks, timings etc.

  9. Pingback: More PC(jr) incompatibilities! | Scali's OpenBlog™

  10. Pingback: What makes the PCjr cool, and what makes it uncool? | Scali's OpenBlog™

  11. tahrey says:

    Heh, that bit about the PIT sometimes coming up in different sync to the rest of the system, and it being down to its clock being 1/4 that of the CPU, puts me in mind of the Atari ST video system. It also has what’s been retroactively labelled “wakestates”, which seem to be particular to each board revision and how the chips warm up as they go from a completely cold state to their stable running temperature. One model being turned on cold will tend towards a certain wakestate, which is different from what it may exhibit after a soft reset when half warmed, and then from fully warmed up, and all three may be different from what a different model shows when cold-started. It’s something that’s caused headaches for democoders and emulator writers alike, and even for some very timing dependent software like Spectrum512, as essentially it governs the delay between a pixel being read from memory and it appearing on-screen.

    Technically, in low resolution, there are 4 pixels to a word (well, more accurately, 4 words to a 16-pixel block, but the result is the same), with 2 million words being read per second (the most the bus can allow, in a 1:1 contention with the CPU) and 8 million pixels going to the screen. With some buffering of output pixels in the video chip, and the output potentially starting on any given bus cycle when compared to when the actual data reads had to happen (as determined by the video/CPU access gating), it follows that there could be between 0 and 3 pixels’ worth of delay from the last read of a block landing in the chip, and that block starting to be read out.

    …which, when you’re trying to implement graphical effects that rely on beam-racing palette changes being slammed into the vidchip registers as rapidly and precisely as possible, causes no end of hard to trace, often seemingly random glitches and corruption… which would probably be familiar on one system to envelope-pushing programmers of the other. Though, the ST has the additional problem that there’s no way to automatically detect the wakestate, as the crucial parts of the video shifter are essentially write-only/one-way output… you have to program in an interactive check screen similar to the 8088mph “old/new CGA” intro instead…

  12. Pingback: GPU-accelerated video decoding | Scali's OpenBlog™

  13. Pingback: Area 5150: 8088 MPH gets a successor | Scali's OpenBlog™

  14. Pingback: MartyPC: PC emulation done right | Scali's OpenBlog™

  15. Pingback: Video playback on low-end MS-DOS machines | Scali's OpenBlog™
