Trackers vs MIDI

With all the recent tinkering with audio devices and sound routines, I stumbled across various resources, old and new. One such resource was this article on OS/2 Museum, about the Gravis UltraSound. (And small world: the site is run by Michal Necasek, who is on the team behind OpenWatcom, the C/C++ compiler I use for my 16-bit DOS projects.) More specifically, the ‘flamewar’ between Rich Heimlich and the rest of the newsgroup, regarding the quality of the UltraSound patches, and general usability in games.

Now, as a long-time demoscener and Amiga guy, it probably doesn’t surprise you that I myself was an early adopter of the GUS, and it has always had a special place in my heart. So I decided to browse through that flamewar, for a trip down memory lane. I can understand both sides of the argument.

In the blue corner…

The GUS, unlike most other sound cards, was not designed to be a perfect clone of a prior standard and to build on that. E.g., the original Sound Blaster was basically an AdLib with a joystick port and a DMA-driven DAC for digital audio added. Later Sound Blasters and clones would in turn build on 100% AdLib/Sound Blaster compatibility. Likewise, the Roland MT-32 set a standard. The Roland Sound Canvas set another standard (General MIDI), and also included an MT-32 compatibility mode (which wasn’t quite 100% though). Most other MIDI devices would also try to be compatible with the MT-32 and/or Sound Canvas.

The GUS was different. Being a RAM-based wavetable synthesizer, it most closely resembles the Amiga’s Paula chip. Which is something completely alien to PCs. You upload samples, which you can then play at any pitch and volume, anywhere in the stereo image (panning). While there was a brave attempt at a software layer to make the thing compatible with Sound Blaster (SBOS), the nature of the hardware didn’t lend itself very well to simulating the Yamaha OPL2 FM chip. So the results weren’t that great.

In theory it would lend itself quite well to MIDI, and there was also an emulator available to support Roland MT-32 and Sound Canvas (Mega-Em). However, for a complete General MIDI patch set, you needed quite a lot of RAM, and that was the bottleneck here. Early cards only had 256kB. Later cards had 512kB and could be upgraded to a maximum of 1 MB. Even 1 MB is still quite cramped for a high-quality General MIDI patch set. Top quality ROM-based wavetable synthesizers would have around 4 MB of ROM to store the patches.

Since the card was rather new and not that well-known, there weren’t that many games that supported the card directly, so you often had to rely on these less-than-great emulators. Even when games did use the card natively, the results weren’t always that great. And that’s what I want to focus on in this blog, but more on that later.

I never used SBOS myself, so I suppose my perspective on the GUS is slightly different anyway. My first soundcard was a Sound Blaster Pro 2.0, and when I got a GUS some years later (being a C64/Amiga guy, the SB Pro never impressed me much. The music sounded bland and the card was very noisy), I just left the SB Pro in my system, so I had the best of both worlds: Full AdLib/SB compatibility, and GUS support (or MT-32/Sound Canvas emulation) when required.

In the red corner…

People who owned and loved the UltraSound knew what the card was capable of, if you played to its strengths rather than its weaknesses (as the emulators did).

Gravis included their own MIDI player, where you could configure the player to use specially tweaked patch sets for each song. The card could really shine there. For example, they included a solo piano piece, where the entire RAM could be used for a single high-quality piano patch:

Another demonstration they included was this one:

That works well for individual songs, because you know what instruments are and aren’t used. But for generic software like games, you have to support all instruments, so you have to cram all GM instruments into the available RAM.

And being so similar to the Amiga’s Paula, the GUS was quickly adopted by demosceners, who had just recently started to focus on the PC, and brought the Amiga’s ProTracker music to the PC. Initially just by software-mixing multiple channels and outputting on PC speaker, Covox, Sound Blaster or similar single-channel devices. So when the GUS came out, everything seemed to fall into place: This card was made to play modules. Each module contains only the samples it needs, so you make maximum use of the RAM on the card. The chip would perform all mixing in hardware, so there was very little CPU overhead for playing music, and the resulting quality was excellent.

On the Amiga, every game used tracked music. So that would be a great solution for the GUS on the PC as well, right? Well, apparently not, because in practice very few games included tracked music on PC. And of the few games that did, many of them were ported from the Amiga, and used the 4-channel 8-bit music from the Amiga as-is. That didn’t really give the GUS much of a chance to shine. It couldn’t show off its 16-bit quality or its ability to mix up to 32 channels in hardware. Mixing just 4 channels was not such a heavy load on the CPU at the time, so hardware mixing wasn’t that much of an advantage in this specific case.

Yamaha’s FM synthesis

As you may know, the Sound Blaster and AdLib cards used a Yamaha FM synthesizer chip. Originally they used the OPL2; later generations (starting with the Sound Blaster Pro 2.0 and AdLib Gold) used the more advanced OPL3. Now, Yamaha is a big name in the synthesizer world, and their FM synthesis was hugely popular in the 80s, especially with their revolutionary DX7 synthesizer, which you can hear in many hits from that era.

But I just said that I thought the Sound Blaster Pro 2.0 sounded bland. What happened here? Well, my guess is that MIDI happened. The above flamewar with Rich Heimlich seems to revolve a lot around the capability of the devices to play MIDI data. Rich Heimlich was doing QA for game developers at the time, and apparently game developers thought MIDI was very important.

Yamaha’s chips, much like the GUS, weren’t that well-suited for MIDI. For different reasons, however, although some are related. That is, if you want to play MIDI data, you need to program the proper instruments into the FM synthesizer. If you just use generic instrument patches, your music will sound… generic.

Also, you are not exploiting the fact that it is in fact an FM synthesizer, and you can modify all the operators in realtime, doing cool filter sweeps and other special effects that make old synthesizers so cool.

Why MIDI?

So what is it that made MIDI popular? Let’s define MIDI first, because MIDI seems to mean different things to different people.

I think we have to distinguish between three different ‘forms’ of MIDI:

  1. MIDI as in the physical interface to connect musical devices
  2. MIDI as in the file format to store and replay music
  3. MIDI as in General MIDI

Interfaces

The first is not really relevant here. Early MIDI solutions on PC were actually a MIDI interface. For example, the MT-32 and Sound Canvas that were mentioned earlier were actually ‘sound modules’, which is basically a synthesizer without the keyboard. So the only way to get sound out of it is to send it MIDI data. Which you could do from any MIDI source, such as a MIDI keyboard, or a PC with a MIDI interface. The Roland MPU-401 was an early MIDI interface for PC, consisting of an ISA card and a breakout box with MIDI connections. The combination of MPU-401 + MT-32 became an early ‘standard’ in PC audio.

However, Roland later released the LAPC-I, which was essentially an MPU-401 and MT-32 integrated on an ISA card. So you no longer had any physical MIDI connection between the PC and the sound module. Various later sound cards would also offer MPU-401-compatibility, and redirect the MIDI data to their onboard synthesizer (like the GUS with its MegaEm emulation, or the Sound Blaster 16 with WaveBlaster option, or the AWE32). I can also mention the IBM Music Feature Card, which was a similar concept to the LAPC-I, except that its MIDI interface was not compatible with the MPU-401, and it contained a Yamaha FB-01 sound module instead of an MT-32.

So for PCs, the physical MIDI interface is not relevant. The MPU-401 hardware became a de-facto standard ‘API’ for sending MIDI data to a sound module. Whether or not that is actually implemented with a physical MIDI interface makes no difference for PC software.

File format

Part of the MIDI standard is also a way to store the MIDI data that is normally sent over the interface to a file, officially called ‘Standard MIDI file’ or sometimes SMF. It is basically a real-time log of MIDI data coming in from an interface: a sequence of MIDI data events, with delta timestamps of very high accuracy (up to microsecond resolution). We mostly know them as ‘.MID’ files. These are also not that relevant to PC games. That is, they may be used in the early stages of composing the music, but most developers will at some point convert the MIDI data to a custom format that is more suited to realtime playback during a game on various hardware.

General MIDI

Now this is the part that affects sound cards, and the GUS in particular. Initially, MIDI was nothing more than the first two points: an interface and a file format. So where is the problem? Well, MIDI merely describes a number of ‘events’, such as note on/off, vibrato, etc. So MIDI events tell a sound module what to play, but nothing more. For example, you can send an event to select ‘program 3’, and then to play a C#4 at velocity 87.
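To make that concrete, here is what those events look like as raw bytes on the wire (a sketch, assuming channel 0, taking ‘program 3’ as the third program, which is value 2 on the wire since program numbers are 0-based there, and using the common convention that middle C (C4) is note number 60, so C#4 is 61):

#include <stdint.h>

static const uint8_t midi_example[] = {
    0xC0, 0x02,        /* Program Change, channel 0: select program 3 */
    0x90, 0x3D, 0x57,  /* Note On, channel 0: C#4 (note 61) at velocity 87 */
    /* ...and eventually: */
    0x80, 0x3D, 0x40   /* Note Off, channel 0: release the C#4 */
};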

The problem is… what is ‘program 3’? That’s not described by the MIDI events. Different sound modules could have entirely different types of instruments mapped to the same programs. And even if you map to the same instruments, the ‘piano’ of one sound module will sound different to the other, and one module may support things like aftertouch, while another module does not, so the expression is not the same.

In the PC-world, the MT-32 became the de-facto standard, because it just happened to be the first commonly available/supported MIDI device. So games assumed that you connected an MT-32, and so they knew what instruments mapped to which programs. One reason why the IBM Music Feature Card failed was because its FB-01 was very different from the MT-32, and the music had to be tweaked specifically for the unit to even sound acceptable, let alone sound good.

Roland later introduced the SC-55 Sound Canvas, as something of a ‘successor’ to the MT-32. The SC-55 was the first device to also support ‘General MIDI’, which was a standardization of the instrument map, as well as a minimum requirement for various specs, such as polyphony and multi-timbral support. It could be switched to the MT-32 instrument map for backward compatibility.

Where did it go wrong?

While the idea of standardizing MIDI instruments and specs seems like a noble cause, it never quite worked in practice. Firstly, even though you now have defined that program 1 is always a piano, and that you can always find an organ at program 17, there is still no guarantee that things will sound anything alike. Different sound modules will have different methods of sound generation, use different samples, and whatnot, so it never sounds quite the same. What’s worse, if you play an entire piece of music (as is common with games), you use a mix of various instruments. You get something that is more than the ‘sum of its parts’… as in, the fact that each individual instrument may not sound entirely like the one the composer used is amplified by the instruments not fitting together in the mix the way the composer intended.

In fact, even the SC-55 suffered from this already: while it has an MT-32 ’emulation’ mode, it does not use the same linear arithmetic method of sound generation that the real MT-32 uses, so its instruments sound different. Games that were designed specifically for the MT-32 may sound anywhere from slightly off to downright painful.

The second problem is that developers would indeed design sound specifically for the MT-32, and to that end used so-called ‘System Exclusive’ messages to reprogram the sounds of the MT-32 to better fit the composition. As the name already implies, these messages are exclusive to a particular ‘system’, and as such are ignored by other devices. So the SC-55 can play the standard MT-32 sounds, but it cannot handle any non-standard programming.

This leads to a ‘lowest common denominator’ problem: Because there are so many different General MIDI devices out there, it’s impossible to try and program custom sounds on each and every one of them. So you just don’t use it. This is always a problem with standards and extension mechanisms, and reminds me a lot of OpenGL and its extension system.

Today, many years later, General MIDI is still supported by the built-in Windows software synthesizer and most synthesizers and sound modules on the market, and the situation hasn’t really changed: if you just grab a random General MIDI file, it will sound different on all of them, and in many cases it doesn’t even sound that good. The fact that it’s ‘lowest common denominator’ also means that some of the expression and capabilities of synthesizers are lost, and they tend to sound a bit ‘robotic’.

So I think by now it is safe to say that if the goal of General MIDI was to standardize MIDI and make all MIDI sound good everywhere, all the time, it has failed miserably. Hence General MIDI files never caught on as a format for sharing music, and we stopped using them for that purpose many years ago. The ‘classic’ MIDI interface and file format/data are still being used in audio software, but things went more into the direction of custom virtual instruments with VSTi plugins and such, so I don’t think anyone bothers with standardized instrument mapping anymore. The first two parts of MIDI, the interface and the file format, did their job well, and still do to this day.

Getting back to games, various developers would build their music system around MIDI, creating their own dialect or preprocessor. Some interesting examples are IMF by id Software, which preprocesses the MIDI data into OPL2-specific statements, and HERAD by Cryo Interactive.

Doing something ‘custom’ with MIDI was required for at least two reasons:

  1. Only high-end devices like the IBM Music Feature Card and the MPU-401/MT-32/Sound Canvas could interpret MIDI directly. For other devices, such as PC speaker, PCjr/Tandy audio, AdLib or Game Blaster, you would need to translate the MIDI data to specific commands for each chip to play the right notes.
  2. Most audio devices tend to be very limited in the number of instruments they can play at a time, and how much polyphony they have.

Especially that second issue is a problem with MIDI. Since MIDI only sends note on/off commands, there is no explicit polyphony. You can just endlessly turn on notes, and have ‘infinite polyphony’ going on. Since MIDI devices tend to be somewhat ‘high-end’, they’ll usually have quite a bit of polyphony. For example, the MT-32 already supports up to 32 voices at a time. It has a simple built-in ‘voice allocation’, so it will dynamically allocate voices to each note that is played, and it will turn off ‘older’ notes when it runs out. With enough polyphony that usually works fine in practice. But if you only have a few voices to start with, even playing chords and a melody at the same time may already cause notes to drop out.

An alternative

Perhaps it’s interesting to mention the Music Macro Language (MML) here. Like the MIDI file format, it was a way to store note data in a way that was independent of the actual hardware. Various early BASIC dialects had support for it. It seems to have been especially popular in Japan, possibly because of the popularity of the MSX platform there. At any rate, where some game developers would build a music system around MIDI, others would build an MML interpreter, usually with their own extensions to make better use of the hardware. Chris Covell did an interesting analysis of the MML interpreter found in some Neo Geo games.

So, trackers then!

Right, so what is the difference between trackers and MIDI anyway? Well, there are some fundamental differences, mainly:

  1. The instrument data is stored together with the note data. Usually the instruments are actually embedded inside the tracker ‘module’ file, although some early trackers would store the instruments in separate files and reference them from the main file, so that instruments could easily be re-used by multiple songs on a single disk.
  2. Notes are entered in ‘patterns’, like a 2d matrix of note data, where a pattern is a few bars of music. These patterns are then entered in a ‘sequence’, which determines the order of the song, allowing easy re-use of patterns.
  3. The columns of the pattern are ‘channels’, where each channel maps directly to a physical voice on the audio hardware, and each channel is monophonic, like the audio hardware is.
  4. The horizontal ‘rows’ of the pattern make up the timeline. The timing is usually synchronized to the framerate (depending on the system this is usually 50, 60 or 70 Hz), and the tempo is set by how many frames each row should take.

Does that sound limited? Well yes, it does. But there is a method to this madness. Where MIDI is a ‘high-level’ solution for music-related data, which is very flexible and has very high accuracy, trackers are more ‘low-level’. You could argue that MIDI is like C, and trackers are more like assembly. Or, you could think of MIDI as HTML: it describes which components should be on the page, and roughly describes the layout, but different browsers, screen sizes, installed fonts etc will make the same page look slightly different. A tracker however is more like PostScript or PDF: it describes *exactly* what the page looks like. Let’s look at these 4 characteristics of trackers in detail.

Instruments inside/with the file

Trackers started out as being hardware-specific music editors, mainly on C64 and Amiga. As such, they were targeted at a specific music chip and its capabilities. As a result, you can only play tracker modules on the actual hardware (or an emulation thereof). But since it is a complete package of both note data and instrument data, the tracker module defines exactly how the song should sound, unlike MIDI and its General MIDI standard, which merely describe that a certain instrument should be ‘a piano’, or ‘a guitar’ or such.

The most popular form of tracker music is derived from the Amiga and its SoundTracker/NoiseTracker/ProTracker software. I have discussed the Amiga’s Paula sound chip before. It was quite revolutionary at the time in that it used 4 digital sound channels. Therefore, Amiga trackers used digital samples as instruments. Given enough CPU processing power, and a way to output at least a single digital audio stream, it was relatively easy to play Amiga modules on other platforms, so these modules were also used on PC and even Atari ST at times.

Notes entered in patterns

I more or less said it already: trackers use sequences of patterns. I think to explain what a ‘pattern’ is, an image speaks more than a thousand words:

[Image: protracker01 (ProTracker pattern view)]

If you are familiar with drum machines, they usually work in a similar way: a ‘pattern’ is a short ‘slice’ of music, usually a few bars. Then you create a song by creating a ‘sequence’ of patterns, where you can re-use the same patterns many times to save time and space.

Patterns are vertically oriented, you usually have 64 rows to place your notes on. What these rows mean ‘rhythmically’ depends on what song speed you choose (so how quickly your rows are played), and how ‘sparsely’ you fill them. For example, you could put 4 bars inside a single pattern. But if you space your notes apart twice as far, and double the speed, it sounds the same, but you only get 2 bars out of the 64 rows now. However, you have gained extra ‘resolution’, because you now have twice the amount of rows in the same time interval.
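To make this a bit more tangible, here is a rough sketch of how such a pattern could be represented in memory (a simplified illustration, not the actual packed ProTracker file format, which stores 4 bytes per cell):

#include <stdint.h>

#define ROWS     64
#define CHANNELS 4

typedef struct {
    uint8_t note;        /* 0 = no new note, otherwise a semitone index */
    uint8_t instrument;  /* sample number, 0 = keep the current one */
    uint8_t effect;      /* effect command: slide, vibrato, arpeggio, ... */
    uint8_t param;       /* effect parameter */
} Cell;

typedef Cell Pattern[ROWS][CHANNELS];  /* 64 rows by 4 channels */

As a worked example of the timing: at the default speed of 6 ticks per row on a 50 Hz PAL Amiga, a row lasts 6/50 s = 120 ms, and with 4 rows per beat that works out to 60/(4 × 0.12) = 125 BPM, the classic ProTracker default tempo.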

Pattern columns are ‘voices’

This is perhaps the biggest difference between MIDI and trackers: Any polyphony in a tracker is explicit. Each channel is monophonic, and maps directly to a (monophonic) voice on the hardware. This is especially useful for very limited sound chips that only have a handful of voices (like 3 for the C64 and 4 for the Amiga). MIDI simply sends note on/off events, and there will be some kind of interpreter that converts the MIDI data to the actual hardware, which will have to decide which voices to allocate, and may have to shut down notes when new note on-events arrive and there are no more free voices.

With a tracker, you will explicitly allocate each note you play to a given channel/voice, so you always know which notes will be enabled, and which will be disabled. This allows you to make very effective use of only a very limited number of channels. You can for example ‘weave’ together some drums and a melody or bassline. See for example what Rob Hubbard does here, at around 4:03:

He starts out with just a single channel, weaving together drums and melody. Then he adds a second channel with a bassline and even more percussion. And then the third channel comes in with the main melody and some extra embellishments. He plays way more than 3 parts all together on a chip only capable of 3 channels. That is because he can optimize the use of the hardware by manually picking where every note goes.

Here is another example, by Purple Motion (of Future Crew), using only two channels:

And another 2-channel one by Purple Motion, just because optimization is just that cool:

I suppose these songs give a good idea of just how powerful a tool a tracker can be in capable hands.

The horizontal ‘rows’ of the pattern make up the timeline

This part also has to do with efficiency and optimization, but not in the musical sense. You may recall my earlier articles regarding graphics programming and ‘racing the beam’ and such. Well, of course you will want to play some music while doing your graphics in a game or demo. But you don’t want your music routine to get in the way of your tightly timed pixel pushing. So what you want is to have a way to synchronize your music routine as well. This is why trackers will usually base their timing on the refresh-rate of the display.

For example, Amiga trackers would run at 50 Hz (PAL). That is, your game/demo engine will simply call the music routine once per frame. The speed-command would be a counter of how many frames each row would take. So if you set speed 6, that means that the music routine will count down 6 ‘ticks’ before advancing to the next row.

This allows you to choose when you call the music routine during a frame. So you can reserve a ‘slot’ in your rastertime, usually somewhere in the vertical blank interval, where you play the music. Then you know that by definition the music routine will not do anything during the rest of the frame, so you can do any cycle-exact code you like. The music is explicitly laid out in the row-format to be synchronized this way, allowing for very efficient and controlled replaying in terms of CPU time. The replay routine will only take a handful of scanlines.
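In code, the core of such a replay routine is surprisingly small. A minimal sketch of the per-frame tick logic (play_row() and update_effects() are placeholders for the actual replay code):

static int speed = 6;   /* frames ('ticks') per row */
static int tick  = 0;
static int row   = 0;

void play_row(int row);        /* placeholder: writes this row's notes to the hardware */
void update_effects(void);     /* placeholder: per-tick effect processing */

/* Called exactly once per frame, e.g. from the vertical blank 'slot'. */
void music_tick(void)
{
    if (tick == 0) {
        play_row(row);          /* trigger the new notes for this row */
        row = (row + 1) % 64;   /* 64 rows per pattern; advancing the
                                   pattern sequence is omitted here */
    } else {
        update_effects();       /* per-tick effects: slides, vibrato, ... */
    }
    if (++tick >= speed)
        tick = 0;
}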

With regular MIDI this is not possible. MIDI has very accurate timing, and if you were to just take any MIDI song, you will likely have to process multiple MIDI events during a frame. You never quite know when and where the next MIDI event may pop up. Which is why games generally quantize the MIDI down. However, quantizing it all the way down to around 50 or 60 Hz is not going to work well, so they generally still use a significantly higher frequency, like in the range 240-700 Hz. Which is an acceptable compromise, as long as you’re not trying to race the beam.

Back to the UltraSound

The specific characteristics and advantages of tracker-music should make it clear why it was so popular in the demoscene. And by extension you will probably see why demosceners loved the UltraSound so much: it seems to be ‘custom-made’ for playing sample-based tracker modules. ProTracker modules already sounded very good with 4 channels and 8-bit samples, even if on the PC you needed to dedicate quite a bit of CPU power for a software-mixing routine.

But now there was this hardware that gave you up to 32 channels, supported 16-bit samples, and even better: it did high-quality mixing in hardware, so like on the Amiga it took virtually no CPU time to play music at all. The UltraSound was like a ‘tracker accelerator’ card. If you heard the above examples with just 2 or 3 channels on primitive chips like the C64’s SID and the Amiga’s Paula, you can imagine what was possible with the UltraSound in capable hands.

Where things went wrong for the UltraSound is that trackers were not adopted by a lot of game developers. Which is strange in a way. On the Amiga, most games used one of the popular trackers, usually ProTracker. You would think that this approach would be adopted for the UltraSound as well. But for some reason, many developers treated it as a MIDI device only, and the UltraSound wasn’t nearly as impressive in games as it was in the demoscene.

So, let’s listen to two of my favourite demos from the time the UltraSound reigned supreme in the demoscene. The legendary demo Second Reality has an excellent soundtrack (arguably the highlight of the demo), using ‘only’ 8 channels:

And Triton’s Crystal Dream II also has some beautiful tracked music, again I believe it is ‘only’ 8 channels, certainly not the full 32 that the UltraSound offered (note by the way that the card pictured in the background of the setup menu is an UltraSound card):

What is interesting is that both these groups developed their own trackers. Future Crew developed Scream Tracker, and Triton developed FastTracker. They became the most popular trackers for PC and UltraSound.

So who won in the end? Well, neither did, really. The UltraSound came a bit too late. There were at least three developments that more or less rendered the UltraSound obsolete:

  1. CPUs quickly became powerful enough to mix up to 32 channels with 16-bit accuracy and linear interpolation in the background, allowing you to get virtually the same quality of tracker music from any sound card with a single stereo 16-bit DAC (such as a Sound Blaster 16 or Pro Audio Spectrum 16) as you do from the UltraSound.
  2. CD-ROMs became mainstream, and games started to just include CD audio tracks as music, which no sound card could compete with anyway.
  3. Gaming migrated from DOS to Windows. Where games would access sound hardware directly under DOS, in Windows the sound hardware was abstracted, and you had to go via an API. This API was not particularly suited to a RAM-based wavetable synthesizer like the UltraSound was, so again you were mostly in General MIDI-land.

As for MIDI, point 2 more or less sealed its fate in the end as well, at least as far as games are concerned. Soundtracks are ‘pre-baked’ to CD-tracks or at least digital audio files on a CD, and just streamed through a stereo 16-bit DAC. MIDI has no place there.

I would say that General MIDI has become obsolete altogether. It may still be a supported standard in the market, but I don’t think many people actually use it to listen to music files on their PCs anymore. It just never sounded all that good.

MIDI itself is still widely used as a basis for connecting synthesizers and other equipment together, and most digital audio workstation software will also still be able to import and export standard MIDI files, although they generally have their own internal song format that is an extension of MIDI, which also includes digital audio tracks. Many songs you hear on the radio today probably have some MIDI events in them somewhere.

Trackers are also still used, both in the demoscene, and also in the ‘chiptune’ scene, which is somewhat of a spinoff of the demoscene. Many artists still release tracker songs regularly, and many fans still listen to tracker music.

 


DMA activation

No, not the pseudoscience stuff, I am talking about Direct Memory Access. More specifically in the context of IBM PC and compatibles, which use the Intel 8237A DMA controller.

For some reason, I had never used the 8237A before. I suppose that’s because the DMA controller has very limited use. In theory it can perform memory-to-memory operations, but only between channels 0 and 1, and IBM hardwired channel 0 to perform memory refresh on the PC/XT, so using channel 0 for anything else has its consequences. Aside from that, the 8237A is a 16-bit DMA controller shoehorned into a 20-bit address space. So the DMA controller can only address 64k of memory at a time. IBM added external ‘page registers’ for each channel, so you can store the high 4 bits of the 20-bit address there, and this will be combined with the low 16-bit address from the DMA controller on the bus. This means there are only 16 pages of 64k, aligned to 64k boundaries (so you have to be careful when allocating a buffer for DMA: it must not cross a page boundary, because if the transfer runs past the end of a page, the 16-bit address simply wraps around within it). However, since channel 0 was reserved for memory refresh on the PC/XT, they did not add a page register for it. This means that you can only do memory-to-memory transfers within the same 64k page of channel 1, which is not very useful in general.
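To illustrate the addressing scheme, here is a small sketch (names are mine) of how a real-mode segment:offset pair maps onto the page register plus the 16-bit address that the 8237A itself sees, and how to check that a buffer does not cross a page boundary:

#include <stdint.h>

typedef struct {
    uint8_t  page;    /* high 4 bits of the 20-bit address (page register) */
    uint16_t offset;  /* low 16 bits, programmed into the 8237A itself */
} dma_addr_t;

dma_addr_t dma_translate(uint16_t segment, uint16_t offset)
{
    uint32_t phys = ((uint32_t)segment << 4) + offset;  /* 20-bit physical address */
    dma_addr_t a;

    a.page   = (uint8_t)(phys >> 16);
    a.offset = (uint16_t)(phys & 0xFFFF);
    return a;
}

/* Returns nonzero if [phys, phys+length) stays within a single 64k page. */
int dma_buffer_ok(uint32_t phys, uint32_t length)
{
    return (phys >> 16) == ((phys + length - 1) >> 16);
}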

On the AT however, they added separate memory refresh circuitry, so channel 0 now became available for general use. They also introduced a new page register for it (as well as a second DMA controller for 16-bit DMA transfers, as I also mentioned in this earlier article). So on an AT it may actually work. There is another catch, however: The 8237A was never designed to run at speeds beyond 5 MHz. So where the 8237A runs at the full 4.77 MHz on a regular PC/XT, it runs at half the clockspeed on an AT (either 3 or 4 MHz, depending on whether you have a 6 or 8 MHz model). So DMA transfers are actually slower on an AT than on a PC/XT. At the same time the CPU is considerably faster. Which means that most of the time, you’re better off using the CPU for memory-to-memory transfers.

Therefore, DMA is mostly used for transferring data to/from I/O devices. Its main uses are for floppy and harddrive transfers, and audio devices. Being primarily a graphics programmer, I never had any need for that. I needed memory-to-memory transfers. You generally don’t want to implement your own floppy and harddrive handling on PC, because of the variety of hardware out there. It is better to rely on BIOS or DOS routines, because they abstract the hardware differences away.

It begins

But in the past weeks/months I have finally been doing some serious audio programming, so I eventually arrived at a need for DMA: the Sound Blaster. In order to play high-quality digital audio (up to 23 kHz 8-bit mono) on the Sound Blaster, you have to set up a DMA transfer. The DSP (‘Digital Sound Processor’ in this context, not Signal) on the SB will read the samples from memory via DMA, using an internal timer to maintain a fixed sampling rate. So playing a sample is like a ‘fire-and-forget’ operation: you set up the DMA controller and DSP to transfer N bytes, and the sample will play without any further intervention from the CPU.

This is a big step up from the sample playing we have been doing so far, with the PC Speaker, Covox or SN76489 (‘Tandy/PCjr’ audio). Namely, all these devices required the CPU to output individual samples to the device. The CPU was responsible for accurate timing. This requires either cycle-counted loops or high-frequency timer interrupts. Using DMA is more convenient than a cycle-counted loop, and far more efficient than having to handle an interrupt for every single sample. You can now play back 23 kHz mono 8-bit audio at little more than the cost of the bandwidth on the data bus (which is about 23 KB/s in this case: you transfer 1 byte for each sample), so you still have plenty of processing time left to do other stuff. The DMA controller will just periodically signal a HOLD to the CPU. Once the CPU acknowledges this with a HLDA signal, the DMA controller takes over the bus from the CPU (‘stealing cycles’), and can put a word from memory onto the bus for the I/O device to consume. The CPU won’t be able to use the bus until the DMA transfer is complete (this can either be a single byte transfer or a block transfer).

It’s never that easy

If it sounds too good to be true, it usually is, right? Well, in a way, yes. At least, it is for my chosen target: the original Sound Blaster 1.0. It makes sense that when you target the original IBM PC 5150/5160, that you also target the original Sound Blaster, right? Well, as usual, this opened up a can of worms. The keyword here is ‘seamless playback’. As stated above, the DMA controller can only transfer up to 64k at a time. At 22 kHz that is about 3 seconds of audio. How do you handle longer samples?

After the DMA transfer is complete, the DSP will issue an interrupt. For longer samples you are expected to send the next buffer of up to 64k immediately. And that is where the trouble is. No matter what you try, you cannot start the next buffer quickly enough. The DSP has a ‘busy’ flag, and you need to wait for the flag to clear before you send each command byte. I have measured that on my 8088 at 4.77 MHz, it takes 316 cycles to send the 3 bytes required for a new buffer command (the 0x14 command to play a DMA buffer, then the low byte and high byte of the number of samples to play). At 4.77 MHz, a single sample at 22050 Hz lasts about 216 CPU cycles. So you just cannot start a new transfer quickly enough. There is always a small ‘glitch’. A faster CPU doesn’t help: it’s the DSP that is the bottleneck. And you have to wait for the busy-flag to clear, because if you don’t, it will not process the command properly.
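For reference, this is roughly what that command sequence looks like in code (a sketch assuming the default base address of 0x220 and OpenWatcom’s inp()/outp(); the DSP takes the number of samples minus one):

#include <conio.h>
#include <stdint.h>

#define SB_BASE   0x220
#define DSP_WRITE (SB_BASE + 0xC)  /* write: command/data, read: write-status */

static void dsp_write(uint8_t value)
{
    while (inp(DSP_WRITE) & 0x80)  /* wait for the busy flag to clear */
        ;
    outp(DSP_WRITE, value);
}

void sb_play_single_cycle(uint16_t num_samples)
{
    uint16_t count = num_samples - 1;

    dsp_write(0x14);           /* 8-bit single-cycle DMA output */
    dsp_write(count & 0xFF);   /* transfer length, low byte */
    dsp_write(count >> 8);     /* transfer length, high byte */
}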

Nagging the DSP

Some early software tried to be creative (no pun intended) with the Sound Blaster, and implemented unusual ways to output sound. One example is Crystal Dream, which uses a method that is described by Jon Campbell of DOSBox-X as ‘Nagging the DSP’. Crystal Dream does not bother with the interrupt at all. Apparently they found out that you can just send a new 0x14 command, regardless of whether you received and acknowledged the interrupt or not. In fact, you can even send it while the buffer is still playing. You will simply ‘restart’ the DSP with a new buffer.

Now, it would be great if this resulted in seamless output, but experimentation on real hardware revealed that this is not the case (I have made a small test program which people can run on their hardware here). Apparently the output stops as soon as you send the 0x14 command, and it doesn’t start again until you’ve sent all 3 bytes, which still takes those 316 cycles, so you effectively get the exact same glitching as you would with a proper interrupt handler.

State of confusion

So what is the solution here? Well, I’m afraid there is no software-solution. It is just a design-flaw in the DSP. This only affects DSP v1.xx. Later Sound Blaster 1.x cards were sold with DSP v2.00, and Creative also offered these DSPs as upgrades to existing users, as the rest of the hardware was not changed. See this old Microsoft page for more information.  The early Sound Blasters had a ‘click’ in digital output that they could not get rid of:

If a board with the versions 1.x DSP is installed and Multimedia Windows is running in enhanced mode, a periodic click is audible when playing a wave file. This is caused by interrupt latency, meaning that interrupts are not serviced immediately. This causes the Sound Blaster to click because the versions 1.x DSP produce an interrupt when the current DMA buffer is exhausted. The click is the time it takes for the interrupt to be serviced by the Sound Blaster driver (which is delayed by the 386 enhanced mode of Windows).

The click is still present in standard mode, although it is much less pronounced because the interrupt latency is less. The click is more pronounced for pure tones.

The version 2.0 DSP solves this problem by using the auto-initialize mode of the DMA controller (the 8237). In this mode, the DMA controller automatically reloads the start address and count registers with the original values. In this way, the Sound Blaster driver can allocate a 4K DMA buffer; using the lower 2K as the “ping” buffer and the upper 2K as the “pong” buffer.

While the DMA controller is processing the contents of the ping buffer, the driver can update the pong; and vice versa. Therefore, when the DMA controller auto-initializes, it will already have valid data available. This removes the click from the output sound.

What is confusing here, is the nomenclature that the Sound Blaster Hardware Programming Guide uses:

Single-cycle DMA mode

They refer to the ‘legacy’ DSP v1.xx output mode as ‘single-cycle DMA mode’. Which is true, in a sense: You program the DMA controller for a ‘single transfer mode’ read. A single-cycle transfer means that the DMA controller will transfer one byte at a time, when a device does a DMA request. After that, the bus is released to the CPU again. Which makes sense for a DAC, since it wants to play the sample data at a specific rate, such as 22 kHz. For the next byte, the DSP will initiate a new DMA request by asserting the DREQ line again. This opposed to a ‘block transfer’ where the DMA controller will fetch the next byte immediately after each transfer, so a device can consume data as quickly as possible, without having to explicitly signal a DREQ for each byte.

Auto-Initialize DMA mode

The ‘new’ DSP 2.00+ output mode is called ‘auto-initialize DMA mode’. In this mode, the DSP will automatically restart the transfer at every interrupt. This gives you seamless playback, because it no longer has to process a command from the CPU.

The confusion here is that the DMA controller also has an ‘autoinitialize mode’. This mode will automatically reload the address and count registers after a transfer is complete. So the DMA controller is immediately reinitialized to perform the same transfer again. Basically the same as what the DSP is doing in ‘auto-initialize DMA mode’. You normally want to use both the DMA controller and DSP in their respective auto-init modes. Double-buffering can then be done by setting up the DMA controller with a transfer count that is twice as large as the block size you set on the DSP. As a result, the DSP will give you an interrupt when it is halfway through the DMA buffer, and another one when it is at the end. That way you can re-fill the half of the buffer that has just finished playing at each interrupt, without any need to perform extra synchronization anywhere. The DMA controller will automatically go back to the start of the buffer, and the DSP also restarts its transfer, and will keep requesting data, so effectively you have created a ringbuffer:

[Diagram: SB DMA]

However, for the DMA controller, this is not a separate kind of transfer, but rather a mode that you can enable for any of the transfer types (single transfer, block transfer or demand). So you are still performing a ‘single transfer’ on the DMA controller (one byte for every DREQ), just with auto-init enabled.

You can also use this auto-init mode when using the legacy single-cycle mode of the DSP, because the DSP doesn’t know or care who or what programs the DMA, or what its address and count are. It simply requests the DMA controller to transfer a byte, nothing more. So by using auto-init on the DMA controller you can at least remove the overhead of having to reprogram DMA at every interrupt in single-cycle mode. You only have to send a new command to the DSP, to minimize the glitch.
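A sketch of what that looks like for DMA channel 1 (the Sound Blaster default), using the standard 8237 ports; the mode byte 0x59 selects single transfer, auto-init, read from memory, channel 1:

#include <conio.h>
#include <stdint.h>

void dma1_setup_autoinit(uint8_t page, uint16_t offset, uint16_t length)
{
    uint16_t count = length - 1;    /* the 8237 counts length - 1 */

    outp(0x0A, 0x05);               /* mask channel 1 */
    outp(0x0C, 0x00);               /* clear the lo/hi byte flip-flop */
    outp(0x0B, 0x59);               /* single transfer, auto-init, read, ch 1 */
    outp(0x02, offset & 0xFF);      /* address low byte */
    outp(0x02, offset >> 8);        /* address high byte */
    outp(0x03, count & 0xFF);       /* count low byte */
    outp(0x03, count >> 8);         /* count high byte */
    outp(0x83, page);               /* page register for channel 1 */
    outp(0x0A, 0x01);               /* unmask channel 1 */
}

For the double-buffering scheme described above, the length passed here would simply be twice the block size you program into the DSP.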

Various sources seem to confuse the two types of auto-init, thinking they are the same thing and/or that they can only be used together. Not at all. In theory you can use the single-cycle mode for double-buffering in the same way as they recommend for auto-init mode: Set the DMA transfer count to twice the block size for the DSP, so you get two interrupts per buffer.

And then there is GoldPlay… It also gets creative with the Sound Blaster. Namely, it sets up a DMA transfer of only a single byte, with the DMA controller in auto-init mode. So if you start a DSP transfer, it would just loop over the same sample endlessly, right? Well no, because GoldPlay sets up a timer interrupt handler that updates that sample at the replay frequency.

That is silly and smart at the same time, depending on how you look at it. Silly, because you basically give up the advantage of ‘Fire-and-forget’ DMA transfers, and you’re back to outputting CPU-timed samples like on a PC speaker or Covox. But smart, for exactly that reason: you can ‘retrofit’ Sound Blaster support quite easily to a system that is already capable of playing sound on a PC speaker/Covox. That is probably the reason why they did it this way. Crystal Dream also uses this approach by the way.

There is a slight flaw there, however. And that is that the DSP does not run in sync with the CPU. The DSP has its own crystal on the card. What this means is that you probably will eventually either miss a sample completely, or the same sample gets output twice, when the timer and DSP go out of sync too far. But since these early SB cards already have glitches by design, one extra glitch every now and then is no big deal either, right?

The best of both worlds

Well, not for me. I see two requirements here:

  1. We want as few glitches as possible.
  2. We want low latency when outputting audio.

For the DSP auto-init mode, it would be simple: You just set your DMA buffer to a small size to have low latency, and handle the interrupts from the DSP to update the buffers. You don’t have to worry about glitches.

For single-cycle mode, the smaller your buffers, the more glitches you get. So the two requirements seem mutually exclusive.

But they might not be. As GoldPlay and Crystal Dream show, you don’t have to match the buffer size of the DMA with the DSP at all. So you can set the DSP to the maximum length of 64k samples, to get the least amount of glitches possible.

Setting the DMA buffer to just 1 sample would not be my choice, however. That defeats the purpose of having a Sound Blaster. I would rather set up a timer interrupt to fire once every N samples, so the timer interrupt would be a replacement for the ‘real’ DSP interrupt you’d get in auto-init mode. If you choose your DSP length to be a multiple of the N samples you choose for your DMA buffer, you can reset the timer every time the DSP interrupt occurs, so that you re-sync the two. Be careful of the race condition here: in theory the DSP and the timer should fire at the same time at the end of the buffer, but since they run on different clock generators, you never know which will fire first.

One way to get around that would be to implement some kind of flag to see if the timer interrupt had already fired, eg a counter would do. You know how many interrupts to expect, so you could just check the counter in the timer interrupt, and not perform the update when the counter exceeds the expected value. Or, you could turn it around: just increment the counter in the timer interrupt. Then when the DSP interrupt fires, you check the counter, and if you see the timer had not fired yet, you can perform the last update from the DSP handler instead. That removes the branching from the timer interrupt.
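Something like this, roughly (a sketch with made-up names; refill_next_block() would update the next part of the DMA buffer, restart_dsp_block() would send the next 0x14 command):

#define UPDATES_PER_BLOCK 8   /* timer interrupts expected per DSP block */

static volatile int timer_count = 0;

void refill_next_block(void);   /* placeholder: updates part of the DMA buffer */
void restart_dsp_block(void);   /* placeholder: sends the next 0x14 command */

void __interrupt __far timer_handler(void)
{
    refill_next_block();       /* update the next part of the ring buffer */
    timer_count++;
    /* ...send EOI to the PIC here (unless auto-EOI is enabled)... */
}

void __interrupt __far dsp_handler(void)
{
    if (timer_count < UPDATES_PER_BLOCK)
        refill_next_block();   /* the timer had not fired yet: catch up here */
    timer_count = 0;           /* re-sync the timer with the DSP */
    restart_dsp_block();       /* send the next 0x14 command to the DSP */
    /* ...acknowledge the DSP interrupt and send EOI here... */
}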

Another way could be to take the ‘latched timer’ approach, as I also discussed in a previous article. You define a list of PIT count values, and update the count at every interrupt, walking down the list. You’d just set the last count in the list to a value of 0 (interpreted as 65536 ticks), so you’re sure it never fires before the DSP does.

Once you have that up and running, you’ll have the same low-latency and low CPU load as with DSP v2.00+, and your glitches will be reduced to the minimum possible. Of course I would only recommend the above method as a fallback for DSP v1.xx. On other hardware, you should just use auto-init, which is 100% glitch-free.

Update 17-04-2017: I read the following on the OSdev page on ISA DMA:

Some expansion cards do not support auto-init DMA such as Sound Blaster 1.x. These devices will crash if used with auto-init DMA. Sound Blaster 2.0 and later do support auto-init DMA.

This is what I thought was some of the ‘confusion’ I described above regarding auto-init mode on the SB DSP vs the DMA controller. But I did not want to include anything on it until I was absolutely sure. But NewRisingSun has verified this on a real Sound Blaster with DSP v1.05 over at the Vogons forum. He ran my test program, which uses auto-init DMA, while performing single-cycle buffer playback on the DSP (the only type the v1.xx DSP supports). And it plays just fine, like the DSP v2.xx and v3.xx we’ve tested with. So no crashing. The quote from OSDev is probably confusing DMA auto-init mode with the fact that Sound Blaster 1.x with DSP v1.xx do not have the auto-init command (which won’t make them crash either, they just don’t know the command, so they won’t play). In short, that information is wrong. DMA auto-init transfers work fine on any Sound Blaster, and are recommended, because they save you the CPU-overhead of reprogramming the DMA controller after every transfer. You only have to restart the DSP.


A picture says more than a thousand words

When I work out ideas, I sometimes draw things out on paper, or I use some test-images made in Paint.NET or whatnot. So I was thinking… my blog is mostly text-oriented, aside from some example images and videos. Perhaps I should try to draw some diagrams or such, to illustrate certain ideas, to help people understand them more quickly.

So here is my first try of such a diagram. I have drawn out the VGM data and how it is processed by the interrupt handlers, as discussed in my previous blog:

[Diagram: vgm-data-order]

I have used draw.io for this, which seemed to work for me so far. I will update the previous post with these diagrams. Let me know what you think.


Putting the things together

So, over time I have discussed various isolated things related to 8088-based PCs. Specifically:

  • Streaming data from disk
  • The latching timer interrupt
  • Auto-EOI mode on the 8259A
  • Sound cards and digital audio playback

These topics are not as isolated as they seem at first. Namely, I was already using the auto-EOI trick for the streaming data program, to get the best possible performance. And I streamed audio data, which is related to sound cards. When I discussed the latching timer, I also hinted at music already (and the auto-EOI feature). And again, when I discussed auto-EOI in detail, I mentioned digital audio playback.

Once I had built my Tandy sound card (using the PCB that I ordered from lo-tech, many thanks to James Pearce for making this possible), I needed software to do something with it. The easiest way to get it to play something, is to use VGM files. There are quite a few captured songs from games (mostly from Sega Master System, which used the same audio chip), and various trackers can also export songs to VGM.

VGM is a very simple file format: it simply stores the raw commands sent to the sound chip(s). Between commands, it stores delays. There are simple VGM files, which only update the music 50 times or 60 times per second (synchronized to PAL or NTSC screen updates). These are easy to play: Just set up your timer interrupt to fire at that rate, and output the commands. But then there’s the more interesting files, which contain digital samples, which play at much higher rates, and they are stored with very fine-grained delay commands. These delays are in ticks of 44.1 kHz resolution. So the format is flexible enough to support very fast updates to sound chips, eg for outputting single samples at a time, up to 44.1 kHz sample rate.

The question of course is: how do you play that? On a modern system, it’s no problem to process data at 44.1 kHz in realtime. But on an 8088 at 4.77 MHz, not so much. You have about 108 CPU cycles to process each sample. That is barely enough to just process an interrupt, let alone actually processing any logic and outputting data to the sound chip. A single write to the SN76489 takes about 42 CPU cycles by the way.

So the naïve way of just firing a timer interrupt at 44.1 kHz is not going to work. Polling a timer is also going to be difficult, because it takes quite some time to read a 16-bit value from the PIT. And those 16-bit values can only account for an 18.2 Hz rate at the lowest, which gets you about 50 ms as maximum delay, so you will want to detect wraparound and extend it to 32-bit to be able to handle longer delays in the file. This will make it difficult to keep up high-frequency data as well. It would also tie up the CPU 100%, so you can’t do anything else while playing music.

But what if we view VGM not as a file containing 44.1 kHz samples, but rather as a timeline of events, where the resolution is 44.1 kHz, but the actual event rate is generally much lower than 44.1 kHz? Now this sounds an awful lot like the earlier raster effect with the latched timer interrupt! We’ve seen there that it’s possible to reprogram the timer interrupt from inside the interrupt handler. By not resetting the timer, but merely setting a new countdown value, we avoid any jitter, so we remain at an ‘absolute’ time scale. The only downside is that the countdown value gets activated immediately after the counter goes 0, so that is before the CPU can reach your interrupt handler. Effectively that means you always need to plan 1 interrupt handler ahead.

I have tried to draw out how the note data and delay data is stored in the VGM file, and how they need to be processed by the interrupt handlers:

[Diagram: vgm-data-order]

We basically have two challenges here, with the VGM format:

  1. We need a way to ‘snoop ahead’ to get the proper delay to set in the current handler.
  2. We need to process the VGM data as quickly as possible.

For the first challenge, I initially divided up my VGM command processor into two: one that would send commands to the SN76489 until it encountered a delay command. The other would skip all data until it encountered a delay command, and returned its value.

Each processor has its own internal pointer, so in theory they could be in completely different places in the file. Making the delay processor be ‘one ahead’ in the stream was simple this way.

There was still that second challenge however: Firstly, I had to process the VGM stream byte-by-byte, and act on each command explicitly in a switch-case statement. Secondly, the delay values were in 44.1 kHz ticks, so I had to translate them to 1.19 MHz ticks for the PIT. Even though I initially tried with a look-up-table for short delays, it still wasn’t all that fast.
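For the curious, this is roughly what that first approach looked like (a sketch, only handling the VGM commands that matter for the SN76489; the port address is an assumption, and the tick conversion is split up so the intermediate products fit in 32 bits):

#include <conio.h>
#include <stdint.h>

#define SN76489_PORT 0xC0   /* PCjr/Tandy-style port; adjust for your card */

static uint32_t samples_to_pit(uint32_t samples)
{
    /* 1193182 / 44100 = 27 remainder 2482, so: */
    return samples * 27UL + (samples * 2482UL) / 44100UL;
}

/* Processes commands at *p until a delay is found; returns that delay in
   PIT ticks and leaves *p pointing just past it. */
uint32_t vgm_process(const uint8_t **p)
{
    for (;;) {
        uint8_t cmd = *(*p)++;

        switch (cmd) {
        case 0x50:                          /* SN76489 write */
            outp(SN76489_PORT, *(*p)++);
            break;
        case 0x61: {                        /* wait nnnn samples */
            uint16_t n = (uint16_t)(*p)[0] | ((uint16_t)(*p)[1] << 8);
            *p += 2;
            return samples_to_pit(n);
        }
        case 0x62: return samples_to_pit(735);  /* wait 1/60 second */
        case 0x63: return samples_to_pit(882);  /* wait 1/50 second */
        case 0x66: return 0;                    /* end of sound data */
        default:
            if (cmd >= 0x70 && cmd <= 0x7F)     /* wait 1..16 samples */
                return samples_to_pit((uint32_t)(cmd & 0x0F) + 1);
            break;                              /* other commands ignored here */
        }
    }
}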

So eventually I decided that I would just preprocess the data into my own format, and play from there. The format could be really simple, just runs of:

uint16_t delay;            /* delay, already converted to PIT ticks */
uint8_t  data_count;       /* number of command bytes that follow */
uint8_t  data[data_count]; /* bytes to send to the SN76489 command port */

Where ‘delay’ is already in PIT ticks, and since there is only one command in VGM for the SN76489, which sends a single byte to its command port, I can just group them together in a single buffer. This is nice and compact.

I have now reversed the order of the delays and note data in the stream, and in the following diagram you can see how that simplifies the processing for the interrupt handlers:

[Diagram: preprocessed-data-order]

As you can see, I can now just process the data ‘in order’: The first delay is sent at initialization, then I just process note data and delays as they occur in the stream.
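The resulting interrupt handler can then be as simple as something like this (a sketch; the stream pointer setup, ring buffer handling and end-of-data checks are omitted, the SN76489 port is assumed as before, and PIT channel 0 is assumed to have been set up for low/high byte access at initialization):

#include <conio.h>
#include <stdint.h>

#define PIT_CH0_DATA 0x40
#define SN76489_PORT 0xC0

static const uint8_t *stream;   /* points into the preprocessed data */

void __interrupt __far vgm_handler(void)
{
    uint8_t count = *stream++;          /* data_count for this slot */

    while (count--)
        outp(SN76489_PORT, *stream++);  /* send the SN76489 command bytes */

    /* Queue the delay for the next slot; the PIT picks it up when the
       current countdown expires, which is exactly the 'one handler ahead'
       behaviour described earlier. */
    outp(PIT_CH0_DATA, stream[0]);      /* low byte of the next delay */
    outp(PIT_CH0_DATA, stream[1]);      /* high byte of the next delay */
    stream += 2;

    outp(0x20, 0x20);                   /* EOI (not needed with auto-EOI) */
}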

Since I support a data_count of 0, I can get around the limitation of the PIT only being able to wait for 65536 ticks at most: I can just split up longer delays into multiple blocks with 0 commands.

I only use a byte for the data_count. That means I can only support 255 command bytes at most. Is that a problem? Well no, because as mentioned above, a single write takes 42 CPU cycles, and there are about 108 CPU cycles in a single tick at 44.1 kHz. Therefore, you couldn’t physically send more than 2 bytes to the SN76489 in a single tick. The third byte would already trickle over into the next tick. So if I were ever to encounter more than 255 bytes with no delays, then I could just add a delay of 1 in my stream, and split up the commands. In practice it is highly unlikely that you’ll ever encounter this. There are only 4 different channels on the chip, and the longest commands you can send are two bytes. You might also want to adjust the volume of each channel, which is 1 byte. So worst-case, you’d probably send 12 bytes at a time to the chip. Then you’d want a delay so you could actually hear the change take effect.

That’s all there is to it! This system can now play VGM data at the resolution of 44.1 kHz, with the only limitation being that you can’t have too many commands in too short a period of time, because the CPU and/or the SN76489 chip will not be able to keep up.

Well, not really, because there is a third challenge:

  3. VGM files (or the preprocessed data derived from them) may exceed 64k (a single segment) or even 640k (the maximum amount of conventional memory in an 8088 system).

Initially I just wanted to accept these limitations, and just load as much of the file into memory as possible, and play only that portion. But then I figured: technically this routine is a ‘background’-routine since it is entirely driven by an interrupt, and I can still run other code in the ‘foreground’, as long as the music doesn’t keep the CPU too busy.

This brought me back to the earlier experiment with streaming PWM/PCM data to PC speaker and Covox. The idea of loading the data into a ringbuffer of 64k and placing the int handler inside this ringbuffer makes a lot of sense in this scenario as well.

Since the data is all preprocessed, the actual interrupt handler is very compact and simple, and copying it around is very little overhead. The data rate should also be relatively low, unless VGMs use a lot of samples. In most cases, a HDD, or even a floppy, should be able to keep up with the data. So I gave it a try, and indeed, it works:

Or well, it would, if the floppy could keep up! This is a VGM capture of the music from Skate or Die, by Rob Hubbard. It uses samples extensively, so it is a bit of ‘worst case’ for my player. But as you can hear, it plays the samples properly, even while it is loading from disk. It only messes up when there’s a buffer underrun, but eventually recovers. Simpler VGM files play perfectly from floppy. Sadly this machine does not have a HDD, so I will need to try the Skate or Die music again some other time, when I have installed the card into a system with a HDD. I’m confident that it will then keep up and play the music perfectly.

But for now, I have other plans. They are also music-related, and I hope to have a quick demonstration of those before long.


Programming: What separates the men from the boys?

I have come up with the following list of topics:

  • Pointer arithmetic
  • Unicode vs ASCII strings
  • Memory management
  • Calling conventions
  • Basic mathematics, such as linear algebra (eg 2d rotations, translations and scaling, things you’ll regularly find in basic GUI stuff).
  • Multithreading/concurrency

Over time I have found that some programmers master such concepts at an early stage in their career, while others continue to struggle with these things for the rest of their lives.

Thoughts? Some more topics we can add to the list?

 


Any real-keeping lately?

The 5-year anniversary of my inaugural ‘Just keeping it real’-article came and went. Has it been that long already? It’s also been quite some time since I’ve last written about some oldskool coding, or even anything at all. Things have been rather busy.

However, I have still been doing things on and off. So I might as well give a short overview of things that I have been doing, or things that I may plan to do in the future.

Libraries: fortresses of knowledge

One thing that has been an ongoing process, has been to streamline my 8088-related code, and to extract the ‘knowledge’ from the effects and utilities that I have been developing into easy-to-use include and source files for assembly and C. Basically I want to create some headers for each chip that you regularly interact with, such as the 8253, the 8259A and the 6845. And at a slightly higher level, also routines for dealing with MDA, Hercules, CGA, EGA and VGA, and also audio, such as the PC speaker or the SN76489.

For example, it should be easy to set up a timer interrupt at a given frequency, to enable/disable auto-EOI mode, or to synchronize to horizontal or vertical retrace for low-level display hacking like in 8088 MPH. That is the real challenge here: the header files should be easy to use, while at the same time giving maximum control and performance.

I am somewhat inspired by the Amiga’s NDK. It contains header files that allow easy access to all the custom chip registers. For some reason, something similar does not seem to exist for PC hardware, as far as I know. There is very extensive documentation, such as Ralf Brown’s Interrupt List and the Bochs ports list, but it is not in a format that can be readily used in a programming environment. So I would like to make a list of constants describing all registers and flags, in a form that can be used immediately in a programming context (in this case assembly and C, but it should be easy to take a header file and translate it to another language; in fact, I currently tend to write the assembly headers first, then convert them to C). On top of that, I try to use macros where possible, to add basic support routines. Macros have the advantage that they are inlined in the code, so there is no calling overhead. If you design your macro just right, it can be just as efficient as hand-written code. It can even take care of basic loop unrolling and such.
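
To make this a bit more concrete, here is the kind of thing I have in mind for the 8253/8254 PIT, written out as a C header fragment. The names are made up for this example (my actual headers may use different ones), but the ports and command bits are the standard documented ones:

    /* Hypothetical excerpt of an 8253/8254 header -- illustrative names only. */
    #include <conio.h>                  /* outp() */

    #define PIT_CH0_DATA    0x40        /* counter 0 data port                  */
    #define PIT_MODE        0x43        /* mode/command register                */
    #define PIT_SEL_CH0     0x00        /* select counter 0                     */
    #define PIT_ACCESS_LOHI 0x30        /* write low byte, then high byte       */
    #define PIT_MODE2       0x04        /* mode 2: rate generator               */
    #define PIT_CLOCK       1193182UL   /* PIT input clock in Hz                */

    /* Reprogram counter 0 so that IRQ 0 fires (approximately) 'hz' times per second. */
    #define PIT_SetRate(hz)                                                 \
        do {                                                                \
            unsigned _div = (unsigned)(PIT_CLOCK / (hz));                   \
            outp(PIT_MODE, PIT_SEL_CH0 | PIT_ACCESS_LOHI | PIT_MODE2);      \
            outp(PIT_CH0_DATA, _div & 0xFF);                                \
            outp(PIT_CH0_DATA, (_div >> 8) & 0xFF);                         \
        } while (0)

Setting up, say, a 16 kHz timer tick then simply becomes PIT_SetRate(16000); and because it is a macro, the compiler can fold the divisor calculation into a constant.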

Once this library gets more mature, I might release it for everyone to use and extend.

Standards are great, you can never have too many of them

As I was creating these header files, I came to the conclusion that I was doing it wrong, at least for the graphics part. Namely, when I first started doing my oldskool PC graphics stuff, I began with VGA and then worked my way down to the older standards. I created some basic library routines in C, where I considered EGA to be a subset of VGA, and CGA in turn a subset of EGA. I tried to create a single set of routines that could work in CGA, EGA or VGA mode, depending on a #define that you could set. Aside from that, I also added a Hercules mode, which didn’t quite fit in, since Hercules is not an IBM standard and is not compatible at all.

There are two problems with that approach:

  1. As we know from software such as 8088 MPH, EGA and VGA are in fact far from fully backward compatible with CGA. Where CGA uses a real 6845 chip, EGA and VGA do not, so some of the 6845 registers are repurposed/redefined on EGA/VGA. Various special modes and tricks work entirely differently on CGA than they do on EGA or VGA (eg, you can program an 80×50 textmode on all of them, but not in the same way).
  2. If you set a #define to select the mode in which the library operates, then by definition it can only operate in one mode at a time. This doesn’t work, for example, when you want to support multiple display adapters in a single program and let the user select which mode to use (you could of course build multiple programs, one for each mode, and put them behind some menu frontend; various games actually did that, which is why you often find separate CGA, EGA, VGA and/or Tandy executables, but it is somewhat cumbersome). Another issue is that certain video cards can actually co-exist in a single system, and work at the same time (yes, early multi-monitor). For example, you can combine a ‘monochrome’ card with a ‘color’ card, because the IBM 5150 was originally designed that way, with MDA and CGA. They each use different IO and memory ranges, so both can be installed and used at the same time. By extension, Hercules can also be used together with CGA/EGA/VGA.

So now that I have seen the error of my ways, I have decided to only build header files on top of other header files when they truly are supersets. For example, I have a 6845 header file, and MDA, Hercules and CGA use this header. That is because they all use a physical 6845 chip. For EGA and VGA, I do not use it. Also, I use separate symbol names for all graphics cards. For example, I don’t just make a single WaitVBL-macro, but I make one specific for every relevant graphics card. So you get a CGA_WaitVBL, a HERC_WaitVBL etc. You can still masquerade them behind a single alias yourself, if you so please. But you can also use the symbols specific to each given card side-by-side.
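
To illustrate what that looks like in practice, here are example CGA and Hercules wait-for-vertical-blank macros side by side (illustrative definitions of my own, not the literal contents of my headers; the Hercules status bit is commonly documented as active-low, which is worth double-checking on real hardware):

    /* Card-specific macros can coexist because they poll different status ports. */
    #include <conio.h>                  /* inp() */

    #define CGA_STATUS   0x3DA          /* CGA status register            */
    #define HERC_STATUS  0x3BA          /* Hercules/MDA status register   */

    /* CGA: bit 3 of 3DAh is 1 during vertical retrace */
    #define CGA_WaitVBL()                                                   \
        do {                                                                \
            while (inp(CGA_STATUS) & 0x08) ;    /* leave the current retrace */ \
            while (!(inp(CGA_STATUS) & 0x08)) ; /* wait for the next one     */ \
        } while (0)

    /* Hercules: bit 7 of 3BAh reflects vertical sync (assumed low during retrace) */
    #define HERC_WaitVBL()                                                  \
        do {                                                                \
            while (!(inp(HERC_STATUS) & 0x80)) ; /* leave the current retrace */ \
            while (inp(HERC_STATUS) & 0x80) ;    /* wait until retrace starts  */ \
        } while (0)

If you want a generic WaitVBL() after all, you can still define it yourself as an alias for whichever card your program has detected.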

And on his farm he had some PICs, E-O, E-O-I

The last oldskool article I did focused on the 8259A Programmable Interrupt Controller and its automatic End-of-Interrupt functionality. At the time I already mentioned that it would be interesting for high-frequency timer interrupt routines, such as playing back digital audio on the PC speaker; that was the main reason I was interested in shaving off a few cycles. I have since used the auto-EOI code in a modified version of the chipmod routine from the endpart of 8088 MPH. Instead of the music player taking up the entire CPU, it can now run from a timer interrupt in the background. By reducing the mixing rate, you can free up some time to do graphics effects in the foreground.

That routine was the result of some crazy experimentation. For a foreground routine, the entire CPU is yours. But when you want to run a routine that is triggered from an interrupt, you need to save the CPU state, do your work, and then restore the CPU state. So the less CPU state you need to save, the better. One big problem with the segmented memory model of the 8088 is how to get access to your data: when the interrupt triggers, the only segment you can be sure of is the code segment. You have no idea what DS and ES are pointing to. You can have some control over SS, because you can make sure that your entire program uses a single stack segment throughout.

So one idea was to reserve some space at the top of the stack, to store data there. But then I figured that it might be easier to just use self-modifying code to store data directly in the operands of some instructions.

Then I had an even better idea: what if I were to use an entire segment for my sample data? It effectively becomes a 64k ringbuffer where wraparound is automatic: the sample pointer is only 16 bits, so it wraps around by itself, and no bounds checking is needed. And what if I put this buffer in the code segment? I only need a few bytes of code for an interrupt handler that plays a sample, increments the sample pointer, and returns from the interrupt. I can divide the ringbuffer into two halves: when the sample pointer is in the low half, I put the interrupt handler in the high half, and when the sample pointer crosses into the high half, I move the interrupt handler to the low half.

Since each half is so large (32 kB), I do not need to check this at every single sample. I can just do it in the logic of the foreground routine, once per frame or so. This makes it a very efficient approach.
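
The foreground bookkeeping for this is tiny. Here is a sketch in C, with the actual interrupt handler abstracted away as a small assembly stub (ring_seg, handler_image, handler_size and play_ofs are hypothetical symbols for this example, and I gloss over exactly where the stub sits relative to the sample data):

    /* Sketch of the once-per-frame check that keeps the handler out of the
       player's way inside the 64k code-segment ringbuffer. */
    #include <string.h>                 /* _fmemcpy()                     */
    #include <dos.h>                    /* _dos_setvect()                 */
    #include <i86.h>                    /* MK_FP(), _disable(), _enable() */

    #define HALF 0x8000u

    extern unsigned char handler_image[];       /* machine code of the assembly stub   */
    extern unsigned short handler_size;
    extern volatile unsigned short play_ofs;    /* 16-bit sample pointer, wraps at 64k */

    static void place_handler(unsigned short ring_seg, unsigned short ofs)
    {
        void (__interrupt __far *vec)();

        /* copy the stub into the half the player is NOT currently reading... */
        _fmemcpy(MK_FP(ring_seg, ofs), handler_image, handler_size);

        /* ...and repoint INT 08h (IRQ 0) at the fresh copy */
        vec = (void (__interrupt __far *)())MK_FP(ring_seg, ofs);
        _disable();
        _dos_setvect(0x08, vec);
        _enable();
    }

    /* Called once per frame from the foreground routine. */
    void check_handler_position(unsigned short ring_seg)
    {
        static unsigned char handler_in_high = 1;   /* start with the handler up high */

        if (play_ofs >= HALF && handler_in_high) {
            place_handler(ring_seg, 0x0000);        /* player entered the high half   */
            handler_in_high = 0;
        } else if (play_ofs < HALF && !handler_in_high) {
            place_handler(ring_seg, HALF);          /* player wrapped to the low half */
            handler_in_high = 1;
        }
    }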

I also had the idea of placing the interrupt handlers in segment 0. The advantage is that CS will then point to 0, which means you can modify the interrupt vector table directly, by just doing something like mov cs:[20h], offset myhandler. This allows you to have a separate handler for each sample, and to include the sample in the code itself, so the code effectively becomes the sample buffer. At the time I thought it might be too much of a hassle, but then reenigne suggested the exact same thing, so I gave it some more thought. There may be something here yet.

I ended up giving it a try. I decided to place my handlers 32 bytes apart; 32 bytes was enough for a handler that plays a sample and updates the interrupt vector. The advantage of spacing all handlers evenly in memory is that the instruction that loads the sample sits at the same offset within every handler, so the sample operands are also spaced exactly 32 bytes apart. This made it easy to address these samples and update them with self-modifying code from a mixing loop. It required some tricky code that backs up the existing interrupt vector table, then disables all interrupts except IRQ 0 (the timer interrupt), and restores everything upon exit. But after some trial and error I managed to get it working.
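
To give an impression of the mixer side of this scheme (again just a sketch; HANDLER_BASE, HANDLER_STRIDE and SAMPLE_OPERAND_OFS are made-up values, since the real layout depends on the assembly stubs), mixing then amounts to poking each fresh sample straight into the immediate operand of the ‘mov al, imm8’ inside the corresponding handler:

    /* Sketch of the mixing loop's output stage for the 'handlers in segment 0'
       scheme. Each 32-byte handler contains a mov-immediate whose operand byte
       is the sample it will output; the mixer overwrites those operand bytes. */
    #include <i86.h>                    /* MK_FP() */

    #define HANDLER_BASE        0x7000  /* made up: an area in the first 64k reserved for the handlers */
    #define HANDLER_STRIDE      32      /* handlers are spaced 32 bytes apart                          */
    #define SAMPLE_OPERAND_OFS  1       /* made up: offset of the imm8 operand within each handler     */

    void write_samples(const unsigned char *samples, unsigned count)
    {
        unsigned i;
        for (i = 0; i < count; i++)
        {
            unsigned char __far *slot = (unsigned char __far *)
                MK_FP(0x0000, HANDLER_BASE + i * HANDLER_STRIDE + SAMPLE_OPERAND_OFS);
            *slot = samples[i];         /* self-modifying code: patch the operand */
        }
    }

Each handler then ends by storing the offset of the next handler into the INT 08h vector at 0000:0020h, which is the mov cs:[20h] trick mentioned above.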

As we were discussing these routines, we wondered whether this might be good enough as a ‘replacement’ for Trixter’s Sound Blaster routines in 8088 Corruption and 8088 Domination. Namely, the Sound Blaster was the only anachronism in those productions, because streaming audio was thought to be impossible without the Sound Blaster and its DMA functionality.

So I decided to make a proof-of-concept streaming audio player for my 5160:

As you can see, or well, hear, it actually works quite well. At least with the original controller and Seagate ST-225, as in my machine. Apparently this controller uses the system’s DMA controller for its transfers, so disk I/O can nicely run in the background. It introduces a small amount of jitter in the sample playback, since the DMA steals bus cycles, but for a 4.77 MHz 8088 system it’s quite impressive just how well this works. With other disk controllers, which use PIO rather than DMA, you may get worse results. Fun fact: the floppy drive also uses DMA, and the samples are of a low enough rate that they can be played from a floppy as well, without a problem.

Where we’re going, we don’t need… Wait, where are we going anyway?

So yes, audio programming. That has been my main focus since 8088 MPH, because, aside from the endpart, the weakest link in that demo is the audio. The beeper is just a very limited piece of hardware. There must be some kind of sweet spot between the MONOTONE audio and the chipmod player of the endpart: something that sounds more advanced than MONOTONE, but doesn’t hog the entire CPU like the chipmod player, so you can still do cool graphics effects.

Since there has not yet been any 8088 production to challenge us, audio remains our biggest challenge. Aside from the above-mentioned disk streaming and background chipmod routine, I also have some other ideas. However, to actually experiment with those, I need to develop a tool that lets me compose simple music and sound effects. I haven’t gotten very far with that yet.

We could also approach it from a different angle, and use some audio hardware. One option is the Covox LPT DAC. It will require the same high-frequency timer interrupt trickery to bang out each sample. However, the main advantage is that it does not use PWM, and therefore it has no annoying carrier wave. This means that you can get more acceptable sound, even at relatively low sample rates.
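
The output side of that is about as simple as it gets. A minimal sketch, assuming the Covox hangs off LPT1 at the common port base 378h (the real base should be taken from the BIOS data area), and with the sample pointer maintained elsewhere; a real handler would of course be a handful of assembly instructions with minimal state saving, as described above:

    /* Minimal Covox output handler sketch (hypothetical symbols). */
    #include <conio.h>                  /* outp() */

    #define LPT1_DATA   0x378           /* printer port data register = the DAC */
    #define PIC_OCW2    0x20
    #define PIC_EOI     0x20

    extern unsigned char __far *covox_sample;   /* hypothetical: current sample pointer */

    void __interrupt __far covox_handler(void)
    {
        outp(LPT1_DATA, *covox_sample++);   /* 8-bit sample straight onto the data lines */
        outp(PIC_OCW2, PIC_EOI);            /* acknowledge IRQ 0 (unnecessary with auto-EOI) */
    }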

A slightly more interesting option is the Disney Sound Source. It includes a small 16-byte buffer. It is limited to 7 kHz playback, but at least you don’t need to send every sample individually, so it is less CPU-intensive.

Yet another option is to look at alternative PC-compatible systems. There’s the PCjr and Tandy, which have an SN76489 chip on board. This gives you three square wave channels and a noise channel. Aside from that, you can also make any of the square wave channels play 4-bit PCM samples relatively easily (and again without a carrier wave). Listen to one of Rob Hubbard’s tunes on it, for example:

What is interesting is that a home-brew Tandy clone card is being developed as we speak. I am building my own as well. This card allows you to add an SN76489 chip to any PC, making its audio compatible with Tandy/PCjr. It would be great if this card became somewhat of a ‘standard’ for demoscene productions.
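
For completeness, here is roughly what the 4-bit PCM trick mentioned above boils down to on the SN76489 (a sketch assuming the PCjr/Tandy port address C0h): you park one tone channel at an inaudibly high frequency and then rewrite its 4-bit attenuation register at your sample rate.

    /* Sketch of 4-bit PCM output on the SN76489 (PCjr/Tandy port C0h assumed). */
    #include <conio.h>                  /* outp() */

    #define SN_PORT      0xC0
    #define SN_CH2_TONE  0xC0           /* latch byte: 1 10 0 dddd = channel 2 tone period */
    #define SN_CH2_VOL   0xD0           /* latch byte: 1 10 1 dddd = channel 2 attenuation */

    void sn_pcm_init(void)
    {
        outp(SN_PORT, SN_CH2_TONE | 0x01);  /* tone period = 1: way above audible range */
        outp(SN_PORT, 0x00);                /* upper 6 bits of the period = 0           */
    }

    /* Call at the sample rate, e.g. from a timer interrupt handler. */
    void sn_pcm_sample(unsigned char sample)    /* 4-bit sample, 0..15 */
    {
        /* attenuation is inverted: 0 = loudest, 15 = silent */
        outp(SN_PORT, SN_CH2_VOL | (15 - (sample & 0x0F)));
    }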

(Why not just take an AdLib, you ask? Well, various reasons. For one, it was rather late to the market, and never really an integral part of 8088 culture. Also, it is quite difficult and CPU-intensive to program. Lastly, it is not as easy to play samples on as the other devices mentioned. So the SN76489 seems like a better choice for the 8088. The fact that it was also used in the 8088-based PCjr and Tandy 1000 gives it some extra ‘street cred’).

Aside from that, I also got myself an original Hercules GB102 card some time ago. I don’t think it would be interesting to do another demo on exactly the same platform as 8088 MPH; it would be more interesting to explore other hardware from the 8088 era. The Hercules is also built around a 6845 chip, so some of the trickery from 8088 MPH may translate to it. At the same time, the Hercules offers unique features, such as 64 kB of memory arranged in two pages of 32 kB, so we may be able to make it do some tricks of its own. Sadly, it would not be a ‘world’s first’ Hercules demo, because someone already beat me to it some months ago:

Posted in Oldskool/retro programming, Software development | 6 Comments

AMD Zen: a bit of a deja-vu?

AMD has released the first proper information on their new Zen architecture. Anandtech seems to have done some of the most in-depth coverage, as usual. My first impression is that of a deja-vu… in more than one way.

Firstly, it reminds me of what AMD did a few years ago on the GPU-front: they ditched their VLIW-based architecture and moved to a SIMD-based architecture, remarkably similar to nVidia’s (nVidia had been using SIMD-based architectures since the 8800GTX). In this case, Zen seems to follow Intel’s Core i7-architecture quite closely. They are moving back to high-IPC cores, just as in their K7/K8 heyday (which at the time followed Intel’s P6-architecture closely), and they seem to target lower clockspeeds, around the 3-4 GHz area where Intel also operates. They are also adopting a micro-op cache, something Intel has been doing for a long time.

Secondly, AMD is abandoning their CMT-approach, and going for a more conventional SMT-approach. This is another one of those “I told you so”-moments. Even before Bulldozer was launched, I already said that having 2 ALUs hardwired per core was not going to work well. Zen now has 4 ALUs per two logical cores, so technically they still have the same number of ALUs per ‘module’. However, like the Core i7, each thread can now use all 4 ALUs, so you get much better IPC for single threads. This again is something I said a few years ago already. AMD apparently agrees with that. Their fanbase, sadly, did not.

We can only wonder why AMD did not go for SMT right away with Bulldozer. I personally think that AMD knew all along that SMT was the better option. However, their CMT was effectively a ‘lightweight’ SMT, where only the FPU portion did proper SMT. I think it may have been a combination of two factors:

  1. SMT was originally developed by IBM, and Intel has been using their HyperThreading variation for many years. Both companies have collected various patents on the technology over the years. Perhaps for AMD it was not worthwhile to use full-blown SMT, because it would touch on too many patents and the licensing costs would be prohibitive. It could be that some of these patents have now expired, so the equation has changed in AMD’s favour. It could also be that AMD is now willing to take a bigger risk, because they have to get back in the CPU race at all costs.
  2. Doing a full-blown SMT implementation for the entire CPU may have been too much of a step for AMD in a single generation. AMD only has a limited R&D budget, so they may have had to spread SMT out over two generations. We don’t know how long it took Intel to develop HyperThreading, but we do know that even though their first implementation in the Pentium 4 worked well enough in practice, there were still various small bugs and glitches, not just in terms of stability, but also security. The concept of SMT is not that complicated, but shoehorning it into the massively complex x86 architecture, with its tons of legacy software that needs to continue working flawlessly, is an entirely different matter. This is quite a risky undertaking, and proper validation can take a long time.

At any rate, Zen looks more promising than Bulldozer ever did. I think AMD made a wise choice in going back to ‘follow the leader’-mode. Not necessarily because Intel’s architecture is the right one, but because Intel’s architecture is the most widespread one. I have said the same thing about the Pentium 4 in the past: the architecture itself was not necessarily as bad as people think. Its biggest disadvantage was that it did not handle code optimized for the P6-architecture very well, and most applications had been developed for P6. If all applications had been recompiled with Pentium 4 optimizations, it would already have made quite a different impression. Let alone if developers had actually optimized their code specifically for the Pentium 4’s strengths (something we mainly saw with video encoding/decoding and 3D rendering).

Bulldozer was facing a similar problem: it required a different type of software. If Intel couldn’t pull off a big change in software optimization with the Pentium 4, then a smaller player like AMD certainly wouldn’t either. That is the main reason why I never understood Bulldozer.

Posted in Hardware news | 26 Comments