What is software development? An art? Craft? Trade? Occupation?… Part 1

From very early on, I noticed that although some of my friends in school had computers as well, they didn’t all use them for the same things. Some just liked to game on them. Others also liked to do their homework with them. Some liked to play around with a bit of programming as well. And in fact, some had different computers altogether. Some computers couldn’t even do what other computers could, even if you wanted them to. So I started to realize that there was quite a bit of variation, both in the personalities of computer users and in the ‘personalities’ of the computer systems and their software libraries.

As time went on, computers kept changing and improving, and new uses were invented for computers all the time. Although I knew very early on that I was “good with computers”, and I wanted to do “something with computers” when I grew up, it wasn’t entirely clear what that ‘something’ was going to be. At the time I wasn’t sure if that question was even relevant at all. And in fact, there was no way to predict that really.

I had some ‘online’ experience in the early 90s, because I got an old 2400 baud modem from my cousin, and I had dialed some BBSes and downloaded some things. But shortly before I went to university, the internet (mainly in the form of the World Wide Web) started taking off. This quite literally opened up a whole new world, and as I went through university, the internet was busy changing the world of computing altogether. But the education I was receiving was not able to change as quickly, so I was learning many ‘older’ technologies and skills, such as a lot of mathematics (calculus, linear algebra, discrete mathematics, numerical methods, etc.), logic, computer organization, compiler construction, databases, and algorithms and data structures. But not much in the way of web technology. Not that it really mattered to me; it was not the sort of thing that had drawn me to computers and engineering.

So when I set out to find my first job, there was demand for a lot of things that I had no prior experience with. Things that had barely even existed only a few years earlier. And since then, this pattern has repeated itself. About a decade ago, smartphones started to emerge. I had no prior experience with developing apps, because the concept didn’t exist yet when I went to university. Likewise, new programming languages and tools have arrived in the meantime, such as C#, Go, Swift and JSON. And things started moving to ‘the cloud’.

On the other end of the spectrum, there were things that I had taught myself as a hobby, things that were no longer relevant for everyday work. Like the C64, the Amiga, and MS-DOS. Using assembly had also gone out of style, so to speak.

So, conclusion: there are a lot of different technologies out there. It is impossible to keep up with everything, so every software developer will have to focus on the technologies that are relevant to their specific situation and interests. On top of that, there are of course different levels of education for software developers these days. In the old days, software developers would have studied computer science at a university. In the really old days, they may even ‘just’ have studied mathematics, physics or such, and have come into contact with computers because of their need for and/or interest in automated computation.

Apparently there is a lot of variation in the field of ‘software engineering’, both in the ways in which it is applied and in the people working in it, calling themselves ‘software engineers’. Many different cultures. Far more variation than in other types of ‘engineering’, I would say, where the education is mostly the same, there is a certain specific culture, and the people attracted to the field tend to be similar types of people.

So what exactly is ‘software engineering’ anyway?

Software engineering has become something of a ‘household name’. In the old days, it was ‘computer programming’. Take for example Donald Knuth’s work “The Art of Computer Programming”. But at some point, it became ‘software engineering’. Where did this term come from, and why did we stop talking about just ‘programming’ and start talking about ‘developing’ and ‘engineering’ instead? Let us try to retrace history.

It would seem that the meaning of ‘computer programming’ is the process of developing software in the narrowest sense: converting a problem into a ‘computer program’. So, translating the problem into algorithms and source code.

‘Software development’ is developing software in the broad sense: not just the computer programming itself, but also related tasks such as documenting, bug fixing, testing and maintenance.

But how did ‘software development’ turn into ‘software engineering’? This was mainly a result of the ever-increasing complexity of software systems, and the poor control over this complexity, leading to poor quality and efficiency in software, and software projects failing to meet deadlines and stay within budget. By the late 1960s, the situation had become so bad that people were speaking of a “Software Crisis”.

Edsger Dijkstra explained it like this in his article “The Humble Programmer”:

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

(Note how his article also makes a point very similar to what I wrote above: Dijkstra didn’t know anything about vacuum tubes, the technology used for early computers, which was no longer relevant by the time he got into programming, because transistor technology had taken over.)

So, people started looking for ways to tackle these problems in software development. New tools and methodologies were developed. They saw analogies in various fields of engineering, where mathematics and science were being applied to solve problems, develop machines, infrastructure and such, also within large and complex projects with difficult deadlines and requirements. And so, Software Engineering was born: treat software development as a form of engineering, and apply systematic scientific and engineering principles to software development.

Oh really?

In the next part, we will see if they succeeded.


What makes the PCjr cool, and what makes it uncool?

The IBM PCjr was a huge flop in the marketplace. As such, it was only in production for about 14 months, and never even reached my part of the world. When I grew up, I had a vague notion that these machines existed, since many games offered enhanced Tandy audio and video, and some would advertise it as PCjr (which is what it originally was). I never actually saw a Tandy machine in the flesh though, let alone a PCjr. But it had always intrigued me: apparently there were these PCs that had better graphics and sound than the standard PCs and clones that I knew. A few weeks ago though, I finally got my own PCjr, a 128 kb model with floppy drive, and I would like to share a quick list of what makes it cool, and what does not. Not just as a user, but as a retro-coder/demoscener.

What is cool?

  • The video chip does not suffer from ‘CGA snow’ in 80 column mode
  • You get 16 colours in 320×200 mode and 4 colours in 640×200 mode, as opposed to 4 and 2 colours respectively on CGA
  • You get a 4-channel SN76496 sound chip
  • There is no separate system and video memory, so you can use almost the entire 128k of memory for graphics, if you want (the video memory is bank-switched, much like on a C64)
  • The machine comes in a much lighter and smaller case than a regular PC
  • The keyboard has a wireless infrared interface
  • It has a ‘sidecar’ expansion mechanism
  • It has two cartridge slots

What is uncool?

  • Because the video memory and system memory are shared, the video chip steals cycles from the CPU
  • 128k is not a lot for a machine that has to run PC DOS, especially if part of that memory is used by the video chip
  • IBM omitted the DMA controller on the motherboard
  • All connectors are proprietary, so you cannot use regular PC monitors, joysticks, expansion cards or anything
  • The keyboard has a wireless infrared interface

Let me get into the ‘uncool’ points in some more detail.

Shared video memory

Shared memory was very common on home computers in the 80s. Especially on a 6502-based system, this could be done very elegantly: the 6502 only uses the memory bus during one half of each clock cycle. So by cleverly designing your video circuitry, you could make it run almost entirely in the memory cycles left unused by the 6502. The C64 is an excellent example of this: most of the video is done in those unused cycles. There are only two exceptions: sprites and colour RAM. At the beginning of each scanline, the VIC-II chip will steal some cycles to read data for every enabled sprite. And every 8th scanline, the VIC-II will load a new line from colour RAM. Those are the only cycles it steals from the CPU.

The PCjr however does not use a 6502, it uses an 8088. And an 8088 can and will access memory at every available cycle. As a result, the video circuit slows down the CPU: it steals one in every 4 IO cycles (one IO cycle is 4 CPU cycles at 4.77 MHz). So the CPU effectively runs at only 3/4 of its speed, about 3.58 MHz.

On the bright side though, the video accesses also refresh the memory. This is also very common on home computers in the 80s. PCs are an exception however. The solution that IBM came up with for this is both creative and ugly: IBM wired the second channel of the 8253 timer to the first channel of the 8237 DMA controller. This way the timer will periodically trigger a DMA read of a single byte. This special read is used as a memory refresh trigger. By default, the timer is set to 18 IO cycles. So on a regular PC, the CPU effectively runs at about 17/18 of its speed, roughly 4.5 MHz. Considerably faster than the PCjr.

The downside of the regular PC however is that the memory refresh is not synchronized to the screen in any way. On the PCjr it is, so it is very predictable where and when cycles are stolen. It always happens in the same places on the same scanline (again, much like the C64 and other home computers). In 8088 MPH, we circumvented this by reprogramming the timer to 19 IO cycles (this means the memory is refreshed more slowly, but there should be plenty of tolerance in the DRAM chips to allow this without issue in practice). An entire scanline on CGA takes 76 IO cycles, so 19 divides the scanline exactly. The trick was just to get the timer and the CRTC synchronized: ‘in lockstep’. On a PCjr you get this ‘lockstep’ automatically; it is designed into the hardware.
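
For the curious, the timer reprogramming itself is tiny. Here is a minimal sketch (not the actual 8088 MPH code), assuming a Borland-style outportb() from <dos.h>:

    /* Retune the DRAM refresh period on a PC/XT. PIT channel 1 (port 41h)
       paces the refresh DMA; the BIOS programs it to 18 PIT ticks. */
    #include <dos.h>

    void set_refresh_period(unsigned char pit_ticks)
    {
        outportb(0x43, 0x54);       /* select channel 1, LSB only, mode 2 (rate generator) */
        outportb(0x41, pit_ticks);  /* new refresh period, in PIT ticks ('IO cycles') */
    }

Calling set_refresh_period(19) makes the refresh period divide the 76-cycle CGA scanline evenly; set_refresh_period(18) restores the BIOS default. As noted above, slower refresh leans on the tolerance of the DRAM chips, so treat it as a demo trick rather than a general-purpose setting.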

128 kb ought to be enough for anyone

The first PC had only 16kb in the minimum configuration. This was enough only for running BASIC and using cassettes. For PC DOS you would need 32kb or more. However, by 1984, when the PCjr came out, it was common for DOS machines to have much more memory than that. Since the PCjr shares its video memory with the system, you lose up to 32kb for the framebuffer, leaving only 96kb for DOS. That is not a lot.

What is worse, the unique design of the PCjr makes it difficult to even extend the memory beyond 128kb. There are two issues here:

  1. The memory is refreshed by the video circuit, so only the 128kb that is installed on the mainboard can be refreshed automatically.
  2. The video memory is placed at the end of the system memory, so in the last 32kb of the total 128kb.

It is complicated, but there are solutions to both. Memory expansions in the form of sidecars exist. These contain their own refresh logic, separate from the main memory. An interesting side-effect is that this memory is faster than the system memory: the system memory is affected by every access of the video circuit, which is a lot more than the minimum number of accesses required for refreshing. So the memory expansion steals fewer cycles from the CPU, and when you run code and data from this part of memory, the CPU will run faster. With some memory expansions (for example ones based on SRAM, which does not need refreshing at all), the CPU is actually faster than on a regular PC.

The second problem is that if you extend the memory beyond 128kb, there will be a ‘gap’ for the video memory in the first 128kb. DOS and applications expect the system memory to be a single consecutive block of up to 640kb. So DOS will just allocate the video memory as regular memory, leading to corruption and crashes.

There is a quick-and-dirty solution to this: after DOS itself is loaded, load a device driver that allocates the remaining memory up to 128kb. This driver then effectively ‘reserves’ the memory, so that it will not be re-used by DOS or applications. You will lose some of your memory, but it works.

Most games with enhanced PCjr graphics and/or audio are actually aimed at Tandy 1000 machines, and will require more than 128kb. The Tandy 1000 however is designed to take more than 128kb, and its video memory is always at the end of the system memory, regardless of the size. This means that not all games for Tandy 1000 will run on a PCjr as-is. If it’s games you want, the Tandy is the better choice hands down.

To preserve as much memory as possible, you will probably want to use the oldest possible DOS, which is PC DOS 2.10. The latest version of DOS to officially support the PCjr is PC DOS 3.30. The main feature you would be missing out on is support for 3.5″ floppies and HD floppies. But your PCjr does not support drives for those floppies anyway, so there’s no real reason to run the newer version. There was never any official support for hard disks on the PCjr either, although in recent years some hobbyists have developed the jr-IDE sidecar. Since that also gives you a full 640k memory upgrade, you can run a newer DOS with proper support for the hard drive without a problem anyway.

No DMA controller

As already mentioned, the original PC uses its DMA controller for memory refresh. On the PCjr, that task is handled by the video circuit instead. But the DMA controller is also used for other things. As I blogged earlier, it is used for digital audio playback on sound cards. That will not be a problem, since there are no ISA slots to put a Sound Blaster or compatible card in a PCjr anyway.

But the other thing that DMA is used for on PCs is floppy and hard disk transfers. And that is something that is great for demos: we can start a disk transfer in the background while we continue to play music and show moving graphics on screen, so we can get seamless transitions between effects and parts.

On the PCjr, not a chance. The floppy controller requires you to poll for every incoming byte. Together with the low memory, that is a bad combination. This will be the most difficult challenge for making interesting demos.

Proprietary connectors

This one is self-explanatory: you need special hardware, cables and adapters for PCjr. You cannot re-use hardware from other PCs.

Wireless keyboard

I listed this as both ‘cool’ and ‘uncool’. The uncool parts are:

  1. It requires batteries to operate.
  2. You can’t use a regular keyboard, only the small and somewhat awkward 62-key one.
  3. The wireless interface is very cheap. It is connected to the Non-Maskable Interrupt (as discussed earlier), and requires the CPU to decode the signals.

This means that the keyboard can interrupt anything. The most common annoyance that people reported is that you cannot get reliable data transfers via (null-)modem, since the keyboard interface will interrupt the transfer and cause you to lose bytes.

It also means that keyboard events are much slower to handle on the PCjr than on a regular PC.

And it means that the interface is somewhat different. On a real PC, the keyboard triggers IRQ 1 directly, and you can then read the scancode from the keyboard controller (port 60h). On the PCjr, the keyboard triggers the NMI instead, and the NMI handler has to decode the bits sent over the wireless interface in time-critical loops. This yields PCjr-specific scancodes, which are then translated by another routine on the CPU. Finally, the CPU generates a software interrupt to invoke the IRQ 1 handler, for backward compatibility with the PC.

Conclusion

For me personally, the PCjr definitely scores as ‘cool’ overall. I don’t think I would have liked it all that much if it were my main PC back in the day. It is very limited with so little memory, just one floppy drive, and no hard drive. But as a retro/demoscene platform, I think it offers just the right combination of capabilities and limitations.


More PC(jr) incompatibilities!

The race for PC-compatibility

Since the mid-80s, there have been many clones of the original IBM PC. IBM themselves also made new-and-improved versions of the PC, aiming for backward-compatibility. DOS itself was more or less a clone of CP/M, and in the CP/M-world, the main common feature between machines was the use of a Z80 CPU. Other hardware could be proprietary, and each machine would run a special OEM version of CP/M, which contained support for their specific hardware (mainly text display, keyboard and drives), and abstracted the differences behind an API. As long as CP/M programs would stick to using Z80 code and using the API rather than accessing hardware directly, these programs would be interchangeable between the different machines. The lowest level of this API was the Basic Input/Output System, or BIOS (note that in CP/M, this was not stored in a ROM, but was part of the OS itself).

Since DOS was based on CP/M, initially it was meant to work in the same way: a DOS machine would have an Intel 8086 or compatible CPU, and the BIOS and DOS APIs would abstract away any hardware differences between the machines. Each machine would have its own specific OEM release of DOS. As people quickly found out though, just being able to run DOS was no guarantee that all software written for the IBM PC would work. In order to get the most out of the hardware, a lot of PC software would access the BIOS or even the hardware directly.

So clone-builders started to clone the IBM PC BIOS, in order to improve compatibility. The race for 100% PC compatibility had begun. Some of the most troublesome applications of the day included Microsoft Flight Simulator and Lotus 1-2-3. These applications would become a standard test for PC compatibility.

Did they succeed?

By the late 80s, clones had matured to the point where they would generally run anything you could throw at them. The OEM versions of MS-DOS also disappeared, as a single version of MS-DOS could run on all PC-compatibles.

But how compatible were all these PCs really? Were they functionally identical? Well, no. But this was a given in the world of PCs and DOS. The different IBM machines and PC-clones were ‘close enough’, and software was written in a way that did not require 100% hardware equivalence. It was simply accepted that there were different types of CPUs, different speeds, different chipsets and different video adapters. So software would settle on a certain ‘lowest common denominator’ of compatibility.

But it is even worse than you might think at first. With our demo 8088 MPH, we have already seen that even clones that use an 8088 CPU at 4.77 MHz and a CGA-compatible video adapter aren’t compatible enough to run the entire demo. But beyond that, even IBM’s own hardware isn’t entirely consistent. There are two different types of CGA, the ‘old style’ and ‘new style’, which have differences in the colour output.

Beyond that, IBM did not always use the same 6845 chips. Some IBM CGA cards use a Motorola chip, others use 6845s from other sources, such as Hitachi or UMC. There are also different revisions of the Motorola 6845 itself. That would not be so bad, if it wasn’t for the fact that they can have slightly different behaviour. In the case of 8088 MPH, apparently all our IBM CGA cards used a Motorola 6845 chip, which accepted an hsync width of ‘0’ and translated it to 16 internally. Other 6845s do not have this behaviour, so the hsync width really was 0, which meant that there effectively was no hsync, and the monitor could not synchronize to the signal.
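
To make that concrete: the hsync width lives in CRTC register 3, which on CGA you program through the index/data port pair at 3D4h/3D5h. A minimal sketch, assuming a Borland-style outportb() from <dos.h>:

    #include <dos.h>

    #define CRTC_INDEX 0x3D4   /* CGA CRTC index register */
    #define CRTC_DATA  0x3D5   /* CGA CRTC data register */

    void set_hsync_width(unsigned char width)
    {
        outportb(CRTC_INDEX, 3);     /* R3: sync width */
        outportb(CRTC_DATA, width);  /* 0 becomes '16' on some 6845s, 'no sync' on others */
    }

The register write itself is identical on every card; it is the interpretation of the value 0 inside the particular 6845 that differs, which is exactly the kind of incompatibility you cannot see from the code.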

Another thing I have mentioned before is the 8259A Programmable Interrupt Controller. There were two main problems there:

  1. There are two revisions of the 8259A chip, models before and after 1985. Auto-EOI mode does not work on the earlier chips when in slave mode.
  2. A PC-compatible machine can have either one or two 8259A chips. The AT introduced a second 8259A to support more interrupt levels. As a result, the first 8259A had to be set to ‘cascaded’ mode. Also, in early PCs, the 8259A was used in ‘buffered’ mode, but in the ATs it was not.

And I also briefly mentioned that there is a similar issue with the 8237 DMA controller: early PCs had only one 8237; the AT introduced a second DMA controller, for 16-bit transfers.

Peanut?

I also gave the IBM PCjr (codename: Peanut) an honourable mention. Like early PC-clones, its hardware is very similar to that of the PC (8088 CPU, 8253 PIT, 8259A PIC, CGA-compatible graphics), and it runs PC DOS, but it is not fully compatible.

IBM PCjr

I have recently obtained a PCjr myself, and I have been playing around with it a bit. What I found is that IBM made an even bigger mess of things than I thought.

As you might know, the IBM PCjr has advanced audio capabilities. It uses a Texas Instruments SN76496 sound chip (also used by Tandy 1000 machines). There has been an attempt by James Pearce of lo-tech to create an ISA card to add this functionality to any PC. I have built this card, and developed some software for it, and it was reasonably successful.

One thing we ran into, however, is that IBM chose port C0h for the SN76496. But for the AT, they chose the same port C0h for the second DMA controller. This caused us some headaches, since the card would never be able to work at port C0h on any AT-compatible system. So, we have added a jumper to select some additional base addresses. Tandy had also run into this same issue, when they wanted to extend their 1000-range of PC-compatibles to AT-compatibility. Their choice was to move the sound chip from C0h to 1E0h, out of the way of the second DMA controller.

This wasn’t a very successful move however: games written for the PCjr or early Tandy 1000 were not aware that the SN76496 could be anywhere other than at port C0h, so the address was simply hardcoded, and those games would not work on the newer Tandys. So we had to patch games to make them work with other addresses.
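
Programming the chip itself shows why a configurable base address is all it takes. Below is a minimal, illustrative sketch (assuming Borland-style outportb() from <dos.h>), not code from the actual card’s software:

    #include <dos.h>

    #define SN_CLOCK 3579545L   /* SN76496 input clock on PCjr/Tandy, in Hz */

    /* Set up a tone on channel 0-2 at the given base port:
       C0h on PCjr/early Tandy 1000, 1E0h on later Tandys,
       or whatever the clone card is jumpered to. */
    void sn_tone(unsigned base, int channel, unsigned freq_hz, int attenuation)
    {
        /* tone frequency is clock / (32 * period), so period = clock / (32 * freq) */
        unsigned period = (unsigned)(SN_CLOCK / (32UL * freq_hz)) & 0x3FF;

        outportb(base, 0x80 | (channel << 5) | (period & 0x0F));      /* latch: channel + low 4 bits */
        outportb(base, (period >> 4) & 0x3F);                         /* data: high 6 bits of period */
        outportb(base, 0x90 | (channel << 5) | (attenuation & 0x0F)); /* attenuation: 0 = loud, 15 = off */
    }

sn_tone(0xC0, 0, 440, 0) and sn_tone(0x1E0, 0, 440, 0) would do the same thing on a PCjr and a later Tandy respectively; a game that hardcodes C0h instead of parameterizing it is exactly what needed patching.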

But as I experimented a bit with the real PCjr, I also ran into another issue: the keyboard. The PCjr uses a wireless keyboard, so it has an entirely different keyboard controller than a regular PC. In order to save cost, IBM implemented the decoding of the wireless signal in software: they connected the wireless receiver to the Non-Maskable Interrupt, the highest-priority interrupt in the system.

But that brings us to the next incompatibility that IBM introduced: on the PC, XT and PCjr, they put the NMI control register at port A0h. On the AT however, they moved the NMI control to port 70h, as part of the new MC146818 CMOS configuration chip. What’s worse is that they put the second 8259A PIC at addresses A0h and A1h, exactly where the old NMI control register used to be. On a regular PC or XT this is not that big of a deal; NMI is only used to report parity errors there. The PCjr however uses it all the time, since it relies on it for the keyboard.

Oh, and one last annoying re-use by IBM: the PCjr’s enhanced graphics chip is known as the Video Gate Array, or ‘VGA’. Yes, they re-used ‘VGA’ later, for the successor of the Enhanced Graphics Adapter (‘EGA’), the Video Graphics Array.

Incomplete decoding

What caused me headaches, however, is a cost-saving feature that was common back in the day: incomplete decoding of address lines. By not connecting all address lines, the same device is effectively ‘mirrored’ at multiple ports. For example, the SN76496 is not just present at C0h: since it ignores address lines A0, A1 and A2, it responds to the entire range C0h-C7h.

The same goes for the NMI register: it is not present only at A0h, but through A0h-A7h. So guess what happened when I ran my code to detect a second PIC at address A1h? Indeed, the write to A1h would also go to the NMI register, accidentally turning it off, and killing my keyboard in the process.

It took me two days to debug why my program refused to respond to the keyboard, even though it was apparent that the interrupt controller was successfully operating in auto-EOI mode. Namely, the PCjr has a vertical blank interrupt, and I wanted to experiment with this. I could clearly see that the interrupt fired at every frame, so I had not locked up the interrupt controller or the system.

While tracking down the bug, I also discussed it with Reenigne. Once I started suspecting that the NMI register is not just at A0h, but is actually mirrored over a wider range, I started to worry that the PC and XT may have the same problem. He told me that it is actually even worse on the original PC and XT. They are even sloppier in decoding address lines (ignoring A0-A4), so the NMI register is present all through A0h-BFh.

In the end I had to make my detection routine for auto-EOI more robust. I was already trying BIOS int 15h function C0h first, to get this information, but that fails on older systems such as the PCjr, since the function was not introduced until 1986. This is why my PCjr ended up in the fallback code that polls A1h to see if it responds like a PIC. I have added an extra level of safety now: if the int 15h function is not supported, I will first examine the machine model byte, located in the BIOS at address F000:FFFEh. This should at least allow me to filter out original IBM PCs, XTs and PCjrs, as well as clones that report the same byte. It may still not be 100% though.
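
In rough terms, the detection order looks like the sketch below. This is an illustration of the idea, not my actual routine; it assumes Borland-style int86x(), segread() and peekb() from <dos.h>, and the commonly documented model byte values (FFh = PC, FEh = XT, FDh = PCjr).

    #include <dos.h>

    /* Returns 1 if it seems safe to probe port A1h for a second PIC. */
    int safe_to_probe_slave_pic(void)
    {
        union REGS r;
        struct SREGS s;
        unsigned char model;

        /* Step 1: int 15h, AH=C0h - get system configuration.
           Carry set means the function is not supported (pre-1986 BIOS). */
        r.h.ah = 0xC0;
        segread(&s);
        int86x(0x15, &r, &r, &s);
        if (!r.x.cflag)
            return 1;            /* configuration table available: not a PC/XT/PCjr */

        /* Step 2: machine model byte at F000:FFFEh */
        model = peekb(0xF000, 0xFFFE);
        if (model == 0xFF || model == 0xFE || model == 0xFD)
            return 0;            /* PC, XT or PCjr: a write to A1h may hit the NMI register */

        /* Step 3: anything else falls through to the (last-resort) PIC probe */
        return 1;
    }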

Sound engineering

This might be a good point to mention a similar issue I encountered some weeks earlier. Namely, I have a machine with an IBM Music Feature Card. Recently, I built a Game Blaster clone, designed by veovis. When I put it in the same machine, the IMFC started acting up.

What is the problem here? Well, the IMFC is configured to a base address of 2A20h. The first time I saw this, it already struck me as odd. That is, most hardware is limited to the range 200h-3FFh (the range 0-1FFh is originally documented as ‘reserved’, although the Tandy clone card proves that at least writing to C0h works on the ISA bus). But the IO range of an 8086 is 16-bit, so indeed any address from 0-FFFFh is valid. There is no reason to limit yourself to 3FFh, aside from saving some cost on the address decoding circuitry.

The problem is that the Game Blaster clone only decodes the lower 10 address bits (A0-A9). I configured it to the default address of 220h, which would be mirrored at all higher addresses of the form x220h (xxxxxx1000100000b). And indeed, that also includes 2A20h (0010101000100000b).
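
In other words, a card that only decodes A0-A9 sees every port number modulo 400h. A trivial illustrative check (not from any real driver):

    #include <stdio.h>

    /* A card decoding only address lines A0-A9 responds to any port
       whose lower 10 bits match its base address. */
    int aliases_with(unsigned card_base, unsigned other_port)
    {
        return (card_base & 0x3FF) == (other_port & 0x3FF);
    }

    int main(void)
    {
        /* 2A20h (IMFC default) lands on 220h (Game Blaster/Sound Blaster default) */
        printf("%d\n", aliases_with(0x220, 0x2A20));   /* prints 1 */
        return 0;
    }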

Now, was this a flaw in veovis’ design? Not at all. He made a clone of the Game Blaster, and the original Game Blaster does exactly the same thing, as do many other cards of that era (including IBM’s own joystick adapter for example). In fact, many later Sound Blasters still do this. So, this is a bit of a shame. Using a Game Blaster or Sound Blaster at the default base address of 220h will conflict with using an IMFC at its default base address of 2A20h.

 


The Covox years

I have covered various early sound solutions from the DOS/8088 era recently, including AdLib, Sound Blaster, PCjr/Tandy’s SN76489 chip, and the trusty old PC speaker itself. One device that has yet to be mentioned, however, is the Covox Speech Thing.

It is a remarkably simple and elegant device. Trixter has made a very nice video showing the Covox Speech Thing and the included speakers.

So it is basically an 8-bit DAC that plugs into the printer port of a PC (or in theory other computers with a compatible printer port). The DAC is of the ‘resistor ladder‘ type, which is interesting, because a resistor ladder is a passive device. It does not require a power source at all. The analog signal coming from the DAC is not very powerful though, so there is an amplifier integrated into the speakers. You can also run the output into a regular amplifier or recording equipment or such, but the output is not ‘line-level’, it is closer to ‘microphone-level’, so you may want to use a small pre-amplifier between the Covox and your other equipment.

So how does one program sound on a Covox? Well, it is very similar to outputting samples via PWM on the PC speaker, or outputting 4-bit samples by adjusting the volume register on the SN76489. That is, there is no DMA, buffering or timing inside the device whatsoever. The CPU has to write each sample at the exact time it should be output, so you will either be using a timer interrupt running at the sampling frequency, or a cycle-counted inner loop which outputs a sample every N CPU cycles. This means that the sound quality is at least partly determined by the replay code being used: the more accurate the timing is, the less temporal jitter the resulting analog signal will have.
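
As a rough idea of the timer-interrupt approach, here is a minimal sketch. It assumes a Borland-style compiler (<dos.h>, outportb, getvect/setvect, interrupt functions) and LPT1 at the usual 378h; a real player would chain to the old timer handler to keep the time-of-day count intact, stream from disk, and so on.

    #include <dos.h>

    #define LPT_DATA  0x378        /* assumed LPT1 data port */
    #define PIT_FREQ  1193182L     /* PIT input clock in Hz */

    static void interrupt (*old_int8)(void);
    static const unsigned char *samples;
    static volatile unsigned sample_pos, sample_count;

    static void interrupt covox_tick(void)
    {
        if (sample_pos < sample_count)
            outportb(LPT_DATA, samples[sample_pos++]);  /* one 8-bit sample per tick */
        outportb(0x20, 0x20);                           /* EOI; we replace the BIOS handler entirely */
    }

    void covox_play(const unsigned char *buf, unsigned len, unsigned rate_hz)
    {
        unsigned divisor = (unsigned)(PIT_FREQ / rate_hz);

        samples = buf;
        sample_count = len;
        sample_pos = 0;

        old_int8 = getvect(0x08);
        setvect(0x08, covox_tick);

        outportb(0x43, 0x36);               /* PIT channel 0, LSB+MSB, mode 3 */
        outportb(0x40, divisor & 0xFF);
        outportb(0x40, divisor >> 8);

        while (sample_pos < sample_count)
            ;                               /* playback happens in the handler */

        outportb(0x43, 0x36);               /* restore the default 18.2 Hz timer */
        outportb(0x40, 0x00);
        outportb(0x40, 0x00);
        setvect(0x08, old_int8);
    }

Note how the only ‘hardware’ access is the single outportb() to the printer port: everything else is timing, which is exactly why the quality depends so much on the replay code.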

So, such a simple device, with so few components (the real Covox has only one component inside: a resistor ladder in a DIP package), and relying so much on the CPU, how good can this really sound? Well, my initial expectations were not that high. Somewhere in the early 90s, before I had my first sound card, I found schematics for building your own Covox-compatible device. As said, it’s just a simple resistor ladder, so it can be built with just a few components. Various PC demos, trackers and MOD players included Covox support, and schematics. There were (and still are) also various cheap clones of the Covox available.

The one I built was very minimal, and I didn’t use resistors with extremely low tolerance. It produced sound, and it wasn’t that bad, but it wasn’t that great either. Not ever having heard any other Covox, either the real thing or a clone, I had no idea what it should sound like, and given the low cost and simplicity of the device, I figured it was nice that it produced any recognizable sound at all.

Fast forward to a few months ago, when there was talk of building a Covox clone on the Vogons forum. User ‘dreamblaster’ had made a clone, and the recordings he made sounded quite impressive to me. Way better than what I remember from my own attempt back in the day. But then Trixter came round, and said the real Covox sounded even better. And as you can hear at the end of the video above, he is right: as far as 8-bit DACs go, the Covox sounds excellent. It is crisp and noise-free, and actually sounds better than early Sound Blasters if you ask me (they are quite noisy and have very aggressive low-pass filtering to reduce aliasing, presumably because they were mostly expecting people to use them with samples of 11 kHz or less). You would never guess that this great sound quality would come from just a simple passive device hanging off the printer port of a PC.

So we set off to investigate the Covox further, and try to get to the bottom of its design, aiming to make a ‘perfect’ clone. I made a simple application to stream samples from disk to the printer port (basically a small adaptation of the streaming player for PC speaker that I showed here), so we could play back any kind of music. Trixter then selected a number of audio samples, and we did comparisons between the Covox clones and the real thing.

One thing that stood out was that the Covox DAC had a very linear response, where the clones had a much more limited dynamic range. Aside from that, the clones also produced a much louder signal.

The first place where clones go wrong is that there are various ways to construct a resistor ladder. Which type of ladder is the ‘correct’ one for the Covox? Luckily, Covox patented their design, and that meant they had to include the schematic of their resistor pack DAC as well:

[Figure 3 from patent US4812847]

So this tells us what the circuit should look like for an exact clone. The patent further explains what values should be used for the different parts:

Nominal resistor values are 200K ohms each for resistors R1 through R8, 100K ohms for R9 through R15, and 15K ohms for R16.

Capacitor C1 has a value of about 0.005 microfarads, yielding a low-pass filter bandwidth of about 3000 hertz.

NB: The part of the schematic on the right, with resistors R30-R37, is part of an example printer circuit with pull-up resistors, to show how the Covox could work when used in combination with a printer connected to its pass-through printer port. They are not part of the Covox itself. There are also two related schematics in the patent, one with the pull-up resistors added to the pass-through port on the Covox itself, and another with active buffer amplifiers. The only variations of the Covox Speech Thing that we’ve seen all use the most simple schematic, with only the resistor ladder and a pass-through port without pull-up resistors.

To make sure, however, we did some measurements on the real Covox to try and verify whether the actual unit was exactly like the patent or not. It would appear that indeed, the ladder is as designed, with 100k and 200k resistors. We are not entirely sure of the 15k ‘load’ resistor though; the measurements seem to indicate that it is probably closer to a 25k resistor. This resistor is used to bring the signal down from the initial +5v TTL levels of the printer port to about ~1v (measured), which would be an acceptable ‘line level’ input for audio equipment (when connected to an actual device, the impedance of the device will pull it down a bit further; the effective maximum amplitude should be around +0.350v in practice).

It appears that many clones did not copy the Covox exactly, possibly to reduce cost by choosing a simpler design with fewer parts, and possibly to avoid the patent. The result however is that they generally sound considerably worse than the real thing (and in fact, perhaps because of this, the Covox may have inadvertently gotten a reputation as a low-quality solution, because most people used clones that didn’t sound very good, and not many people were aware of just how good the real Covox sounded. I certainly fall into that category, as I said above). For example, there is a schematic included in the readme file that comes with Triton’s Crystal Dream demo:

[Covox schematic from the Crystal Dream readme]

As you can see, it is a simplified circuit and not really a ‘ladder’ as such. It uses fewer components, but is also less accurate. One interesting characteristic of a resistor ladder is that you can build it from batches of the same resistor values (especially considering that you only need R and 2R values, and 2R can be constructed from two R resistors in series). If you buy resistors in a batch, their tolerance in an absolute sense will be as advertised (you can buy them with e.g. 5%, 1% or 0.1% tolerance). However, the relative tolerance of resistors in the same batch is likely much smaller. And in the case of a resistor ladder, the relative tolerance is more important than the absolute tolerance.
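
The textbook transfer function of an ideal 8-bit R-2R ladder shows why (this is the generic formula, not something taken from the Covox documentation):

    V_{\mathrm{out}} = V_{\mathrm{ref}} \cdot \frac{D}{2^8}, \qquad D = \sum_{i=0}^{7} b_i \, 2^i

Each bit’s weight depends only on the R:2R ratio being exactly 1:2, not on the absolute resistance, which is why matching within a batch is what keeps the steps evenly spaced and the response linear.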

Since this schematic uses resistors of various values, it cannot exploit the advantage of resistors in the same batch. Also, the values of these resistors do not correspond with the values in the real Covox circuit. Aside from that, the load resistor is missing, and they chose a different value for the capacitor.

Another popular one came with Mark J. Cox’ Modplay:

[Covox schematic from the Modplay documentation]

This schematic at least appears to be closer to the Covox, although not entirely the same. Again, the resistor and capacitor values are different from the Covox.

In general, what happens is that the response of these clone DACs is nowhere near linear. We’ve found that the clones tend to have much higher output levels than the real Covox, but especially the dynamic range is far worse. You hear that especially when there is a fade-out in music: the actual level doesn’t drop off very much, and as a result the 8-bit quantization noise becomes very obvious, and the sound is perceived as ‘grainy’ and low-quality. The real Covox has a very linear response, so it sounds about as good as you can expect from an 8-bit DAC.

Our aim was to make one that sounds as close to the real thing as possible, or possibly even better. The end-result is the CVX4, which you can find here: http://www.serdashop.com/CVX4

It has a number of DIP switches so you can fine-tune the filtering and output level to suit your taste. This of course includes a setting that is completely true to the original Covox. Be sure to check out the example videos and reviews that are posted on the shop page; you can hear just how good it sounds. I will post one video here, which uses CvxPlay to play a number of samples selected by Trixter, which we used to compare the real Covox with the CVX4 and fine-tune the design.

If you are looking for a Covox clone for your retro gaming/demo/tracker PC, then look no further, the CVX4 is as good as it gets!


Scali on Agile Development

I recently wrote The Pessimist Pamphlet. I said there:

I agreed with most of the Agile principles and the underlying reasoning

‘Most of’? So not everything? Well, perhaps I should clarify this. Let’s first walk through the four main values of the Agile Manifesto:

Individuals and interactions over processes and tools

Certainly! Tap into the knowledge and experience of people, those are the most valuable assets in your team, or even better, the entire company. Borrow from other teams if you must. Apply critical thinking to find the best solutions, rather than following processes and tools like a Cargo Cult.

Working software over comprehensive documentation

‘Comprehensive’ being the operative word here. I have argued before that source code is not documentation. For trivial stuff, yes, having the source code will be enough to understand what is going on, and why. So you will be able to maintain and extend the code.

But there is another class of code, where there is more to it than meets the eye. I had already given the example of Marching Cubes in the aforementioned article. In that case, the code mainly does some table lookups. The actual cleverness of the algorithm is in how these tables are constructed. And you cannot explain that with just the code and a few lines of comments. You will want some pictures of the base cases to really make sense of it.

I could give our demo 8088 MPH as another example of code that is more than meets the eye. Some parts are cycle-exact. If you were to change even a single instruction, the entire effect may fall apart, because the timing is thrown off ever so slightly. You really need to understand the timing of the instructions and how they interact with the other hardware to understand why certain things work the way they do.

So you certainly want, or even need, to have documentation in these cases. Of course, you will want the documentation to be as clear and concise as possible. But you do want good documentation.

Customer collaboration over contract negotiation

This one seems obvious enough. If you want to be agile in handling changes and unforeseen problems, then you don’t want everything set in stone in a contract.

Responding to change over following a plan

This one is also quite obvious. If there are indications that your current plan is not working properly, you try to make changes for the better.

And now, let’s look at the 12 principles of Agile Software:

Our highest priority is to satisfy the customer
through early and continuous delivery
of valuable software.

This not only satisfies the customer, it is also more satisfying for the team to see the software coming to life. It also helps to weed out problems early, especially those related to the delivery. You are forced to build, test and distribute your software early on, so it doesn’t just have to work on the developer’s machine, it has to work everywhere.

The ‘highest priority’-part also seems to imply that delivering valuable software is more important than any kind of management overhead or other red tape.

Welcome changing requirements, even late in
development. Agile processes harness change for
the customer’s competitive advantage.

Changing requirements… they are something like Murphy’s Law in software development: the client never really knows what they want. I think of this as a form of defensive programming. Expect changes to happen at any point, and try to develop your software in such a way that you can incorporate changes wherever and whenever.

Deliver working software frequently, from a
couple of weeks to a couple of months, with a
preference to the shorter timescale.

This seems to be very similar to the first principle.

Business people and developers must work
together daily throughout the project.

Very important. Business people need to understand that software development is very much a creative and unpredictable process. It does not let itself be managed that easily. Perhaps a case of “Keep your friends close, keep your enemies closer”?

Build projects around motivated individuals.
Give them the environment and support they need,
and trust them to get the job done.

Again, this seems to be a hint to the managers. They shouldn’t just try to push the developers, but work with them to find solutions. Especially for managers with no background in software development it is important to trust that developers are not there to frustrate you. If they signal problems, there are problems. If they try to advise you to change course, they have good reasons to do so, because it will benefit their work, and ultimately the end product.

The most efficient and effective method of
conveying information to and within a development
team is face-to-face conversation.

Yes, face-to-face conversation allows direct dialogue. Discussions over email generally do not work that well, because it is mostly an exchange of monologues.

Working software is the primary measure of progress.

I don’t think ‘working software’ is a strong enough criterion. I can think of plenty of examples of software that ‘works’, but which is in no way properly designed and maintainable. You are fooling yourself if you think that the fact that it ‘works’ is a measure of ‘progress’ in this case.

You may have to end up doing huge refactorings or big rewrites to fix the code later in the process, because you have painted yourself in a corner. So all the progress you thought you had made is undone.

Agile processes promote sustainable development.
The sponsors, developers, and users should be able
to maintain a constant pace indefinitely.

This is a very important one. People can only be in ‘crunch mode’ for so long, and fatigue will mean that their progress will slow down, and the quality of the work will degrade. You should avoid any ‘crunch mode’ or overtime or whatnot, and try to keep everyone in the team as fresh as possible, so that they can continue to deliver their best.

Remember, developing software is a creative process. Make sure to keep those creative juices flowing. You can’t manage inspiration.

Continuous attention to technical excellence
and good design enhances agility.

This is similar to what I said earlier about defensive programming. Also, as I said in The Pessimist Pamphlet, this is a huge catch-22: how does the team know what ‘technical excellence’ and ‘good design’ are?

Simplicity–the art of maximizing the amount
of work not done–is essential.

This one I am not sure about. Sometimes it pays off to invest some extra time in developing a more elaborate solution, so that you can more easily extend the functionality at a later point. So this is somewhat in conflict with the earlier points about defensive programming. You will want the simplest solution that does NOT affect technical excellence and good design. But that again is a very vague, subjective thing.

Also, there are cases where other characteristics, such as performance or scalability, are more important than simplicity. A well-optimized, parallelized, scalable solution is inherently more complex than the simplest, most naïve solution that just gets the job done. But sometimes, getting the job done isn’t the requirement. Getting the job done in the shortest amount of time is.

So, the simplest solution that does NOT affect technical excellence, good design, or any requirements (functional or non-functional)?

I could also argue that sometimes finding the simplest solution is the hardest. That is, you first have to create a solution and analyze and study it thoroughly before you can ‘connect the dots’ and start seeing ways to generalize, abstract and simplify the solution. It is a case of iterative refinement.

The best architectures, requirements, and designs
emerge from self-organizing teams.

Again, this one I am not sure of. As I argued in The Pessimist Pamphlet, this assumes that the people who can build the best architectures, requirements and designs are already present inside the team. This is not necessarily the case.

Another objection I have is that we are still talking about science, computer science. In the history of science, breakthroughs were more often than not the result of a brilliant leap of mind of a single individual, rather than some team working towards a solution. Which brings me back to the earlier objection about working software and documentation: In the trivial case it may be true, but not all problems are trivial.

Sometimes a single person can make all the difference, by viewing the problem in a unique new way, and opening the door to solutions that the others were blind to until then. This person might not necessarily be on your team.

This is also what is argued in the No Silver Bullet article: great designers are very rare. You can’t expect every software team to always have just the right ‘great designer’ on board for the specific problems they run into.

At regular intervals, the team reflects on how
to become more effective, then tunes and adjusts
its behavior accordingly.

I think this last one is obvious enough. Apply the Agile principles not only to software development, but also to the structure and workflow of the team itself: don’t just follow a plan, welcome changes for the better.

So, in conclusion, I suppose the trick (or catch-22 if you will) is to build a team that has all the relevant skills and insights to be able to properly reflect on their own work, know what technical excellence and good design are, and how to get there. This calls for a heterogeneous team buildup. A team consisting entirely of younger, less experienced programmers, especially if they all have similar education and interests, is a recipe for disaster. As they say: in the land of the blind, the one-eyed man is king.

Also, you should see the Agile Manifesto (or any other kind of process/methodology) as a set of guidelines, not strict rules. The key is to use common sense in when to follow the rules, and when to pick a better option in a specific situation.


Experience and skill in software development

I just spotted a number of hits from Ars Technica to my blog. It happens regularly that one of my blog posts gets linked in some online discussion, causing a noticeable spike in my statistics. When it does, I usually check out that discussion. This was a rare occasion where I actually enjoyed the discussion. It also reminds me directly of a post I made only a few weeks ago: The Pessimist Pamphlet.

You can find this particular discussion here on Ars Technica. In short, it is about a news item on one of Microsoft’s recent patches, namely to the Equation Editor. The remarkable thing here is that they did a direct binary patch, rather than patching the source code and rebuilding the application.

The discussion that ensued seemed to split the crowd into two camps: one camp that was blown away by the fact that you can actually do that, and another camp that had done the same thing on a regular basis. My blog was linked because I have discussed patching binaries on various occasions as well. In this particular case, the Commander Keen 4 patch was brought up (which was done by VileR, not by me).

Anyway, the latter camp seemed to be the ‘old warrior’/oldskool type of software developer, which I could identify with. As such, I could also identify with various statements made in the thread. Some of them closely related to what I said in the aforementioned Pessimist Pamphlet. I will pick out a few relevant quotes:

(In response to someone mentioning various currently popular processes/’best practices’ such as unit tests, removing any compiler warnings etc):

I know people who do all this and still produce shitty code, as in it doesn’t do what its supposed to do or there are some holes that users’ can exploit, etc. There’s no easy answer to it as long as its a human that is producing the code.

I said virtually the same thing in another discussion the other day:

That has always been my objection against “unit-test everything”.
If you ask me, that idea is mainly propagated by people who aren’t mathematically inclined, so to say.
For very simple stuff, a unit-test may work. For complicated calculations, algorithms etc, the difficulty is in finding every single corner-case and making tests for those. Sometimes there are too many corner-cases for this to be a realistic option to begin with. So you may have written a few unit-tests, but how much of the problem do they really cover? And does it even cover relevant areas in the first place?

I think in practice unit-tests give you a false sense of security: the unit-tests that people write are generally the trivial ones that test things that people understand anyway, and will not generally go wrong (or are trivial to debug when they do). It’s often the unit-tests that people don’t write, where the real problems are.

(People who actually had an academic education in computer science should be familiar both with mathematics and with the studies in trying to formally prove correctness of software. And it is indeed a science.)

On to the next:

What you consider “duh” practices are learned. Learned through the trials and efforts of our elders. 20 years from now, a whole generation of developers will wonder why we didn’t do baby-simple stuff like pointing hostile AIs at all our code for vulnerability testing. You know, a thing that doesn’t exist yet.

This touches on my Pessimist Pamphlet, and why something like Agile development came into existence in the first place. Knowing where something came from and why is very important.

The one process that I routinely use is coding standards. Yes, including testing for length required before allocating the memory and for verifying that the allocation worked.

The huge process heavy solutions suck. They block innovation, slow development and still provide plenty of solutions for the untrained to believe their work is perfect – because The Holiest of Processes proclaims it to be so.

Try getting somewhat solid requirements first. That and a coding standard solves nearly every bug I’ve even seen. The others, quite honestly, were compiler issues or bad documentation.

Another very important point: ‘best practices’ often don’t really work out in reality, because they tend to be very resource-heavy, and the bean counters want you to cut corners. The only thing that REALLY gives you better code quality is having humans write better code. Which is not done with silly rules like ‘unit tests’ or ‘don’t allow compiler warnings’, but by having a proper understanding of what your code is supposed to do, and how you can achieve this. Again, as the Pessimist Pamphlet says: make sure that you know what you’re doing. Ask experienced people for their input and guidance, get trained.

Another one that may be overlooked often:

There’s also the problem that dodgy hacks today are generally responses to the sins of the past.

“Be meticulous and do it right” isn’t fun advice; but it’s advice you can heed; and probably should.

“Make whoever was on the project five years ago be meticulous and do it right” is advice that people would generally desperately like to heed; but the flow of time simply doesn’t work that way; and unless you can afford to just burn down everything and rewrite, meticulous good practice takes years to either gradually refactor or simply age out the various sins of the past.

Even if you have implemented all sorts of modern processes today, you will inevitably run into older/legacy code, which wasn’t quite up to today’s standards, but which your system still relies on.

And this one:

You can write shit in any language, using any process.

Pair programming DOES tend to pull the weaker programmer up, at least at first, but a weird dynamic in a pair can trigger insane shit-fails (and associated management headaches).

There’s no silver bullet.

Exactly: no silver bullet.

The next one is something that I have also run into various times, sadly… poor management of the development process:

Unfortunately in the real world, project due dates are the first thing set, then the solution and design are hammered out.

I’m working coding on a new project that we kicked off this week that is already “red” because the requirements were two months behind schedule, but the due date hasn’t moved.

And the reply to that:

It’s sadly commonplace for software project to allot zero time for actual code implementation. It’s at the bottom of the development stack, and every level above it looks at the predetermined deadline and assumes, “well, that’s how long I’VE got to get MY work done.” It’s not unusual for implementation to get the green light and all their design and requirements documents AFTER the original delivery deadline has passed. Meanwhile, all those layers – and I don’t exclude implementation in this regard – are often too busy building their own little walled-off fiefdoms rather than working together as an actual team.

Basically, these are the managers who think they’re all-important: once they have some requirements, they’ll just shove them into a room with developers, and the system will magically come out on the other end. Both Agile development and the No Silver Bullet article try to teach management that software development is a team sport, and management should work WITH the developers/architects, not against them. As someone once said: software development is not rocket science. If only it were that simple.

Another interesting one (responding to the notion that machine language and assembly are ‘outdated’ and not a required skill for a modern developer):

The huge difference is that we no longer use punchcards, so learning how punchcards work is mostly a historic curiosity.

On the other hand every single program you write today, be it Haskell, JavaScript, C#, Swift, C++, Python, etc, would all ultimately be compiled to or run on top of some code that still works in binary/assembly code. If you want to fully understand what your program is doing, it’s wise to understand to at least read assembly. (And if you can read and understand it it’s not a big stretch to then be able to modify it)

And really, most of the skill in reading assembly isn’t the assembly itself. It’s in understanding how computers and OS actually work, and due to Leaky Abstraction (https://en.wikipedia.org/wiki/Leaky_abstraction) it’s often abstractions can be broken, and you need to look under the curtain. This type of skill is still pretty relevant if you do computer security related work (decompiling programs would be your second nature), or if you do performance-sensitive work like video games or VR or have sensitive real-time requirements (needing to understand the output of the compiler to see why your programs are not performing well).

Very true! We still use machine code and assembly language in today’s systems. And every now and then some abstraction WILL leak such details. I have argued that before in this blogpost.

Which brings me to the next one:

We can celebrate the skill involved without celebrating the culture that makes those skills necessary. I’d rather not have to patch binaries either, but I can admire the people who can do it.

A common misunderstanding about the blogpost I mentioned above is that people mistook my list of skills for a list of 'best practices'. No, I'm not saying you should base all your programming work around these skills. I'm saying that these are concepts you should master to truly understand all the important aspects of developing and debugging software.

This is also a good point:

My point is: software engineering back in the days might not have all those fancy tools and “best practises” in place: but it was an art, and required real skills. Software engineering skills, endurance, precision and all that. You had your 8 KB worth of resources and your binary had to fit into that, period.

I am not saying that I want to switch my code syntax highlighter and auto-completion tools and everything, and sure I don’t want to write assembler ;) But I’m just saying: don’t underestimate the work done by “the previous generations”, as all the lessons learned and the tools that we have today are due to them.

If you learnt coding ‘the hard way’ in the past, you had to hone your skills to a very high level to even get working software out of the door. People should still strive for such high levels today, but sadly, most of them don’t seem to.

And again:

Just as frustrating is that quite a few developers have this mania with TDD, Clean Architecture, code reviews processes etc. without really understanding the why. They just repeat the mantras they’ve learnt from online and conference talks by celebrities developers. Then they just produced shitty code anyway.

And the response to that:

A thousand times this. Lately I have a contractor giving me grief (in the form of hours spent on code reviews) because his code mill taught him the One True Way Of Coding.. sigh.

As said before: understand the ideas behind the processes. Understanding the processes and the thinking behind them makes you a much better developer, and allows you to apply the processes and ideas in the spirit their initiators intended, for the best effect. And I cannot repeat it often enough: there is no silver bullet! No One True Way Of Coding!

Well, that’s it for now. I can just say that I’m happy to see I’m not quite alone in my thoughts on software development. On some forums you only see younger developers, and they generally all have the same, dare I say, naïve outlook on development. I tend to feel out-of-place there. I mostly discuss programming on vintage/retro-oriented forums these days, since they are generally populated with older people and/or people with a more ‘oldskool’ view on development, and years of hands-on experience. They’ve seen various processes and tools come and go, usually without yielding much in the way of results. The common factor in quality has always been skilled developers. It is nice to see so many ‘old warriors’ also hanging out on Ars Technica.

And again, I’d like to stress that I’m not saying that new tools or processes are bad. Rather, there is no silver bullet, no One True Way of Coding. Even with the latest tools and processes, humans can and will find ways to make horrible mistakes (and conversely, long before current languages, tools and processes had been developed, there were people writing great software as well). Nothing will ever replace experience, skill and just common sense.


Software: How to parallel?

Software optimization has always been one of my favourite tasks in software development. However, the hardware you are optimizing for is a moving target (unless of course you are doing retroprogramming/oldskool demoscening, where you have nicely fixed targets). That nice algorithm that you fine-tuned last year? It may not be all that optimal for today’s CPUs anymore.

One area where this was most apparent was the move from single-core CPUs to multi-core CPUs in the early 2000s. Before the first commonly available consumer dual-core x86 CPUs from Intel and AMD, multi-core/multi-CPU systems were very expensive and were mainly used in the server and supercomputer markets. Most consumers would just have single-core systems, and this is also what most developers would target.

The thing with optimizing tasks for single-core systems is that it is a relatively straightforward process (pun intended). That is, in most cases, you can concentrate on just getting a single job done as quickly as possible. This job will consist of a sequence of 1 or more processing steps, which will be executed one after another in a single thread. Take for example the decoding of a JPEG image. There are a number of steps in decoding an image, roughly:

  • Huffman decoding
  • Zero run-length decoding
  • De-zigzag the 8×8 blocks
  • Dequantization
  • Inverse DCT
  • Upscaling U and V components
  • YUV->RGB conversion

There can be only one

For a single-threaded solution, most of the optimization will be about making each individual step run as quickly as possible. The other thing to optimize is the transition from one step to the next: choosing the most efficient data flow, with as little copying or transforming of data between steps as possible.

But that is about it. If you want to decode 10 JPEG images, or even 100 JPEG images, you will process them one after another. Regardless of how many images you want to process, the code is equally optimal in every case. There are some corner-cases though, since even a single-core system consists of various bits of hardware which may be able to process in parallel. For example, you could have disk transfers that can be performed in the background via DMA. Your OS might provide APIs to perform asynchronous disk access with this functionality. Or your system may have a GPU that can run in parallel with the CPU. But let us stick to just the CPU-part for the sake of this article.
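To make the later comparisons a bit more concrete, here is a minimal C++ sketch of such a single-threaded decoder. The types and step functions are hypothetical stand-ins of my own (not from any real JPEG library); the point is only the shape of the code: one step after another, one image after another.

#include <string>
#include <vector>

struct Block { int coeffs[64]; };                  // one 8x8 block of coefficients
struct Image { std::vector<unsigned char> rgb; };  // final RGB output

// Hypothetical step functions; real implementations omitted.
std::vector<unsigned char> load_file(const std::string& path) { return {}; }
std::vector<Block> entropy_decode(const std::vector<unsigned char>& data) { return {}; }  // Huffman + run-length + de-zigzag
void dequantize_and_idct(std::vector<Block>& blocks) {}
Image upsample_and_convert(const std::vector<Block>& blocks) { return {}; }               // chroma upsampling + YUV->RGB

// One image: all steps run one after another, on a single thread.
Image decode_jpeg(const std::string& path)
{
    std::vector<unsigned char> data = load_file(path);
    std::vector<Block> blocks = entropy_decode(data);
    dequantize_and_idct(blocks);
    return upsample_and_convert(blocks);
}

// A batch is then simply a loop: one image after another.
std::vector<Image> decode_batch_sequential(const std::vector<std::string>& paths)
{
    std::vector<Image> result;
    for (const auto& path : paths)
        result.push_back(decode_jpeg(path));
    return result;
}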

How does one parallelize their code? That is the big question. And there is a reason why the title is a question. I do not pretend to have an answer. What I do have however, is various ideas and concepts, which I would like to discuss. These may or may not apply to the problem you are currently trying to solve.

When you are used to optimizing for single-core systems, then intuitively you might take the same approach to parallelization: you will try to make each step of the process as fast as possible, by applying parallel algorithms where possible, and trying to use as many threads/cores as you can, to maximize performance. This is certainly a valid approach in some cases. For example, if you want to decode a single JPEG image as quickly as possible, then this is the way to do it. You will get the lowest latency for a single image.

However, you will quickly run into Amdahl’s law this way: not every step can be parallelized to the same extent. After the zero run-length decoding, you can process the data in 8×8 blocks, which is easy to parallelize. However, the Huffman decoding is very difficult to parallelize, for the simple reason that each code in the stream has a variable length, so you do not know where the next code starts until you have decoded the previous one. This means that you will not make full use of all processing resources in every step. Another issue is that you now need explicit synchronization between the different steps. For example, the 8×8 blocks are separated into Y, U and V components. But at the end, when you want to convert from YUV to RGB, you need to have all three components decoded before you can do the conversion. Instead of just waiting for a single function to return a result, you may now need to wait for all threads to have completed their processing, causing extra overhead/inefficiency.
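A hedged sketch of what this per-image approach could look like, again with hypothetical stand-ins rather than a real decoder: the entropy decoding stays serial, and the independent 8×8 blocks are split over all cores, with an explicit join at the end.

#include <algorithm>
#include <functional>
#include <string>
#include <thread>
#include <vector>

struct Block { int coeffs[64]; };

// Hypothetical stand-ins: entropy decoding is inherently serial (variable-length
// codes); processing the 8x8 blocks is independent per block.
std::vector<Block> entropy_decode(const std::string& path) { return {}; }
void process_blocks(std::vector<Block>& blocks, size_t first, size_t last) {}

void decode_image_horizontal(const std::string& path)
{
    // Serial part: runs on one core only (this is where Amdahl's law bites).
    std::vector<Block> blocks = entropy_decode(path);

    // Parallel part: split the independent 8x8 blocks over all cores.
    const size_t numThreads = std::max<size_t>(1, std::thread::hardware_concurrency());
    const size_t chunk = (blocks.size() + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (size_t t = 0; t < numThreads; ++t)
    {
        const size_t first = t * chunk;
        const size_t last  = std::min(first + chunk, blocks.size());
        if (first >= last)
            break;
        workers.emplace_back(process_blocks, std::ref(blocks), first, last);
    }
    for (auto& w : workers)
        w.join();  // explicit synchronization point before the next step
}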

When you want to decode more than one image, this may not be the fastest way. The well-parallelized parts may be able to use all cores at the same time, but the serial parts will have very inefficient use of the cores. So the scaling will be less than linear with core count. You will not be getting the best possible throughput.

Batch processing

People from the server world are generally used to exploiting parallelism in another way: if they want to process multiple items at the same time, they will just start multiple processes. In this case, you can run as many of the single-core optimized JPEG decoders side-by-side as you have cores in your system. Generally this is the most efficient way, if you want to decode at least as many JPEG images as you have cores. You mostly avoid Amdahl’s law here, because each core runs very efficient code, and all cores can be used at all times. The main way in which Amdahl’s law will manifest itself in such a situation is in the limitations of shared resources in the system, such as cache, memory and disk I/O. For this reason, scaling will still not be quite linear in most cases, but it generally is as good as it gets, throughput-wise.
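In code, the thread-based version of this batch approach could look something like the sketch below. The decoder function is again a hypothetical stand-in, and the wave-based scheduling is deliberately naive (a finished core waits for the rest of its wave); it is only meant to show the idea of running one single-threaded decoder per core.

#include <algorithm>
#include <string>
#include <thread>
#include <vector>

// Hypothetical stand-in: the full single-threaded pipeline for one image.
void decode_jpeg_singlethreaded(const std::string& path) {}

void decode_batch_vertical(const std::vector<std::string>& paths)
{
    const size_t numCores = std::max<size_t>(1, std::thread::hardware_concurrency());

    // Naive wave-based scheduling: at most one decoder per core at a time.
    for (size_t i = 0; i < paths.size(); i += numCores)
    {
        const size_t end = std::min(i + numCores, paths.size());
        std::vector<std::thread> workers;
        for (size_t j = i; j < end; ++j)
            workers.emplace_back(decode_jpeg_singlethreaded, paths[j]);
        for (auto& w : workers)
            w.join();  // a finished core waits here for the rest of the wave
    }
}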

However, if you have, say, 16 cores but only 10 images to decode, then 6 cores will sit idle. So like the earlier per-image parallelized approach, you are still not making full use of all processing resources in that case, and again your throughput is not optimal.

You could try to run the above parallelized solution for multiple images in parallel, but then you run into other problems: since the parallel parts are designed to use as many resources as are available for a single image, running two or more instances in parallel will be very inefficient, because the instances will be fighting for the resources and end up starving each other.

It gets even more complicated if we add a time component: say you want to decode a batch of JPEG images, but they are not all available at the same time. The images are coming in from some kind of external source, and you do not know exactly when they are coming in, or how many you need to process at a time. If you expect a batch of 100 images, you may get 3 images at a time, then nothing for a while, then another 10 images, etc. So you never know how many images you want to process at the same time, or how many cores you may have available. How would you make this as efficient as possible in the average case? In other words: do you optimize for lowest latency, highest throughput, or some balance between the two?

Take a cue from hardware

I think it is interesting to look at CPUs and GPUs at this point, because they need to solve very similar problems, but at a lower level. Namely, they have a number of execution units, and batches of instructions and data arriving under unpredictable circumstances. Their job is to allocate the execution units as efficiently as possible.

An interesting parallel to draw between the above two software solutions of parallelizing the decoding of JPEG images and GPU technology is the move from VLIW to scalar SIMD processing in GPUs.

To give some background: GPUs traditionally processed either 3d (XYZ) or 4d (XYZW) vectors, or RGB/ARGB colours (effectively also 3d or 4d vectors). So it made sense to introduce vectorized SIMD instructions (much like MMX, SSE and AVX on x86 CPUs):

add vec1.xyzw, vec2.xyzw, vec3.xyzw

So a single add-instruction can add vectors of up to 4 elements. However, in some cases, you may only want to add some of the elements, perhaps just 1 or 2:

add vec1.x, vec2.x, vec3.x

add vec1.xy, vec2.xy, vec3.xy

The below image is a nice illustration of this I believe:

VLIW

What you see here is an approach where a single processing unit can process vectors up to 5 elements wide. You can see that the first instruction is 5 wide, so you get full utilization there. Most other instructions however are only 1d or 2d, and there is one more 4d one near the end. So most of the time, the usage of the 5d processing unit is very suboptimal. This is very similar to the above example where you try to parallelize a JPEG decoder and optimize for a single image: some parts may be ‘embarrassingly parallel’ and can use all the cores in the CPU. Other parts can extract only limited or even no parallelism, leaving most of the units idle. Let’s call this ‘horizontal’ parallelization. In the case of the GPU, it is instruction-level parallelism.

The solution with GPUs was to turn the processing around by 90 degrees. Instead of trying to extract parallelism from the instructions themselves, you treat all code as if it is purely scalar. So if your shader code looked like this:

add vec1.xyzw, vec2.xyzw, vec3.xyzw

The compiler would actually compile it as a series of scalar operations, like this:

add vec1.x, vec2.x, vec3.x

add vec1.y, vec2.y, vec3.y

add vec1.z, vec2.z, vec3.z

add vec1.w, vec2.w, vec3.w

The parallelism comes from the fact that the same shader is run on a large batch of vertices or pixels, so you can place many of these ‘scalar threads’ side-by-side, running on a SIMD unit. For example, if you take a unit like the above 5d vector unit, you could pack 5 scalar threads this way, and always make full use of the execution unit. It is also easy to make it far wider than just 5 elements and still have great efficiency, as long as you have enough vertices or pixels to feed it. Let’s call this ‘vertical’ parallelization. In the case of the GPU, this is thread-level parallelism.

Now, you can probably see the parallel with the above two examples of the JPEG decoding. One tries to extract as much parallelism from each step as possible, but will not reach full utilization of all cores at all times, basically a ‘horizontal’ approach. The other does not try to extract parallelism from the decoding code itself, but instead parallelizes by running multiple decoders side-by-side, ‘vertically’.

Sharing is caring

The ‘horizontal’ and ‘vertical’ approaches here are two extremes. My example above with images coming in ‘randomly’ shows that you may not always want to use one of these extremes. Are there some hybrid forms possible?

In hardware, there certainly are. On CPUs we have SMT/HyperThreading, to share the execution units of a single core between multiple threads. The idea behind this is that the instructions in a single thread will not always keep all execution units busy. There will be ‘bubbles’ in the execution pipeline, for example when an instruction has to wait for data to become available from cache or memory. By feeding instructions from multiple threads at a time, the execution units can be used more efficiently, and bubbles can be reduced.

GPUs have recently acquired very similar functionality, known as asynchronous compute shaders. This allows you to feed multiple workloads, both graphics and compute tasks, simultaneously, so that (if things are balanced out properly) the GPU’s execution units can be used more efficiently, because one task can use resources that would otherwise remain an idle ‘bubble’ during another task.

Threadpool

The software equivalent of this is the threadpool: a mechanism where a number of threads are always available (usually the same number as you have cores in your machine), and these threads can receive any workload. This has some advantages:

  • Creating or destroying threads on demand is quite expensive; a threadpool has lower overhead.
  • Dependencies can be handled by queuing a next task to start when a current task completes.
  • The workloads are scheduled dynamically, so as long as you have enough workloads, you can always keep all threads/cores busy. You do not have to worry about how many threads you need to run in parallel at a given time.

That last one might require an example to clarify. Say you have a system with 8 cores. Normally you will want to run 8 tasks in parallel. If you were to manually handle the threads, then you could create 8 threads, but that only works if you’re running only one instance of that code. If you were to run two, then it would create 16 threads, and they would be fighting for the 8 cores. You could try to make it smart, but then you’d probably quickly come to the conclusion that the proper way to do that is to… create a threadpool.

Because if you use a threadpool, it will always have 8 threads running for your 8 cores. If you create 8 independent workload tasks, it can run them all in parallel. If you create 16, however, it will run the first 8 in parallel, and then start the 9th as soon as the first task completes. So it will always keep 8 tasks running (as long as there is work queued).
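For illustration, here is a minimal threadpool sketch in C++ (assuming C++11 or later): a fixed set of worker threads pulling tasks from a shared queue. It is only meant to show the core idea; there is no error handling and no way to get results back from tasks.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool
{
public:
    // One worker per core by default (a real implementation would also guard
    // against hardware_concurrency() returning 0).
    explicit ThreadPool(size_t numThreads = std::thread::hardware_concurrency())
    {
        for (size_t i = 0; i < numThreads; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            stopping = true;
        }
        queueCondition.notify_all();
        for (auto& worker : workers)
            worker.join();
    }

    // Queue any callable; it runs on the first worker thread that becomes free.
    void enqueue(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(queueMutex);
            tasks.push(std::move(task));
        }
        queueCondition.notify_one();
    }

private:
    void workerLoop()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(queueMutex);
                queueCondition.wait(lock, [this] { return stopping || !tasks.empty(); });
                if (stopping && tasks.empty())
                    return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();  // run the task outside the lock, so other workers can dequeue
        }
    }

    std::vector<std::thread> workers;
    std::queue<std::function<void()>> tasks;
    std::mutex queueMutex;
    std::condition_variable queueCondition;
    bool stopping = false;
};

In practice you would usually not roll your own: platforms and libraries such as the .NET thread pool, Java’s ExecutorService or Intel TBB already provide this, but the idea is the same.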

Another advantage is that you can run any type of task in parallel. So when images do not all come in at the same time, different images can be in different steps. Instead of the massively parallel steps hogging all the CPU cores, the non-parallelized steps can run in parallel with the massively parallel ones, striking a decent balance of resource sharing.

In this case, I suppose the key is to find the right level of granularity. You could in theory create a separate task for each 8×8 block for every step, but that would create a lot of overhead in starting, stopping and scheduling each individual task. So you may want to group large batches of 8×8 blocks together into single tasks. You might also want to group multiple decoding steps together on the same batch of 8×8 blocks, to reduce the total number of tasks further.
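A hedged sketch of what that grouping could look like, reusing the ThreadPool from the previous sketch and the hypothetical Block/process_blocks stand-ins from earlier; blocksPerTask is just a tuning knob I made up, not a recommended value.

// Group 8x8 blocks into coarse-grained tasks for the pool.
// Note: 'blocks' must stay alive until all queued tasks have finished.
void enqueue_block_tasks(ThreadPool& pool, std::vector<Block>& blocks)
{
    const size_t blocksPerTask = 256;  // too small: scheduling overhead; too large: poor load balancing

    for (size_t first = 0; first < blocks.size(); first += blocksPerTask)
    {
        const size_t last = std::min(first + blocksPerTask, blocks.size());
        pool.enqueue([&blocks, first, last] { process_blocks(blocks, first, last); });
    }
}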

Anyway, these are just some ideas on how you can parallelize your code in various ways on modern multi-core systems. Which way is best depends on your specific needs. Perhaps there are also other approaches that I have not mentioned yet, or that I am not even aware of. As I said, I don’t have all the answers, just some ideas to try. Feel free to share your experiences in the comments!
