Bugs you can’t fix

Although I generally want to avoid the usual car analogy, in this case I am speaking from real-world experience that happened to be car-related, so you will have to excuse me.

No car is ‘perfect’… Every car has certain ‘bugs’, as in design-flaws. There may be excessive wear on certain parts, or some things may just fail unreasonably quickly. You can go to the dealership and try to have it fixed, but if they only have replacement parts for you, the wear or failure will occur again and again. They have a nice term for this: “product characteristic”. It’s not quite the same as “it’s not a bug, it’s a feature!”, although it may feel like it. It’s just that there are inherent flaws in the design that cause the excessive wear or failure. And the mechanic can replace parts or make adjustments, but he can’t redesign your car.

Over the years, I’ve encountered a handful of situations where I ran into software bugs which, as I progressed in my debugging, turned out to be ‘unsolvable’, much like the design flaws in the car example above. In my experience they are very rare, luckily. But they do pop up every now and then, and when they do, it’s a huge mess. I thought that would make them an interesting topic to discuss.

Shared code is great

The first example I want to give was in a suite of applications that were distributed over a number of workstations, and connected together via a network-based messaging system.

A bug report came in, and I was asked to investigate it: an application that printed out graphs of sensor-data in realtime would often print random empty pages in between, but would otherwise continue to function fine.

So I started debugging on the side of the printing app. I found that the empty pages were actually correct: sometimes it would receive empty messages. It just printed the data as it received it; all the code appeared to function as expected.

Then I approached it from the other side, to see if there were empty messages being sent by the sensor. But no, the sensor was working fine, and didn’t have any drop-out. So… the problem is somewhere between the sensor and the printing app. What is happening in the messaging system?

And that’s where I found the problem: the messaging system was designed so that you could register a message of a certain ‘type’ under a unique name. It would allocate a receive-buffer for each registered message. Do you see the problem already? There is only one buffer, which is re-used for every message sent under that type and name. For small messages you can usually get away with that. However, this sensor was sending large batches of data in each message, and at a relatively high frequency.

This led to a race-condition: the printing app would have to finish printing the data before the next message came in, because the new message would simply overwrite the buffer.

There was no locking mechanism in place, so there was no way for the printing app to tell the messaging system to hold off on the new message until it was finished with the previous one. So the only thing I could do in the printing app was to copy every message to a new internal buffer as quickly as possible, to minimize the ‘critical time’ during which the data needs to remain valid in the shared buffer.
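To illustrate the workaround, here is a minimal sketch in C of what such a handler could look like. All names are hypothetical (I obviously cannot show the actual messaging API); the point is simply that the copy is the very first thing the handler does.

```c
#include <stdlib.h>
#include <string.h>

void print_graph(const unsigned char *data, size_t size);  /* hypothetical: the actual printing code */

static unsigned char *private_copy = NULL;
static size_t private_capacity = 0;

/* Hypothetical callback invoked by the messaging system; 'shared_buffer'
   is the single receive buffer that the next message will overwrite. */
void on_sensor_message(const void *shared_buffer, size_t size)
{
    if (size > private_capacity) {
        unsigned char *p = realloc(private_copy, size);
        if (p == NULL)
            return;                 /* out of memory: drop the message */
        private_copy = p;
        private_capacity = size;
    }

    /* Copy out of the shared buffer immediately, to keep the 'critical time'
       as short as possible... */
    memcpy(private_copy, shared_buffer, size);

    /* ...and only then do the slow work, on our own copy. */
    print_graph(private_copy, size);
}
```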

This improved the situation, but it did not fix it completely. The occasional empty page still slipped through, probably because of network congestion: the last packet of one message would be followed immediately by the first packet of the next message, overwriting the buffer before the app could copy it.

Why was this bug unfixable? Because the API and protocol design of the messaging system were simply flawed. Taking care of the race-condition would require a rewrite of the messaging system and of all applications using it. In theory it could be done, but it meant that you could not just roll out an update of the printing app. You’d have to roll out updates for all applications in the entire suite, because they all share the same messaging system, and need to be working with the same version of the protocol and API to avoid problems. This was just not economically viable. So the bug couldn’t be fixed.

The wrong data in the wrong structure

The second example is one that I already briefly mentioned before: a system that was designed around FIFO queues, while the requirements demanded far more flexibility than plain FIFO to achieve the required routing and prioritization.

Again this is a case where someone made a fundamental design decision that was downright wrong. Since it is so fundamental to the system, the only way to fix it is to do a complete redesign of the core functionality. ‘Fixing’ the system is basically the same as just scrapping it and restarting from scratch.

Basically they spent a few months designing and building a bicycle. Which does the job (checks the ‘working software’ Agile-box) for short-distance trips. But they did not read the requirements carefully, which clearly stated that they had to be able to reach places like Tokyo. What they should have built was a plane. And a plane is so fundamentally different from a bicycle that virtually no part of the bicycle’s design or implementation can be reused.

Two for the price of one

That same system also had another ‘interesting’ problem: The queue sizes that it reported on its dashboard were never accurate. How these queue sizes are calculated takes a bit of explanation, so I hope I can get the point across.

The system is designed to handle a number of queues, where each queue can have hundreds to thousands of items. Or at least, the ‘queue’ as the user thinks of it: the number of items they are actually trying to process with the system.

The implementation, however, was built up of a number of processes, connected via a network, and each process had a small internal queue. These processes could not hold all the data in memory at any one time, and it was also undesirable to do so, since a system crash would mean that all the data would be lost. So the philosophy was to keep the number of in-memory items to a minimum.

What this meant was that there were a small number of items ‘in flight’ in these processes, and there was a larger ‘offline buffer’ that was still waiting to be fed to the system. In the initial version, this ‘offline buffer’ was not visible at all, so all you saw was the number of ‘in flight’ items, which generally was an insignificant amount (perhaps in the range of 8-96 items) compared to the offline buffer.

So, the customers wanted to see the offline buffer as well. This is where things started to go wrong… The system was built on top of an existing protocol, which was meant for local use only: items would be fed directly from one machine to one workstation, and there was really no concept of any queues. For some reason, the same protocol was still used now that it had become a cloud-based application, and items would be processed remotely in an asynchronous way, and on a much larger scale (many machines sending items, and many workstations processing them)… so that now the items would indeed queue up.

So they created a very nasty hack to try and get the offline buffer size into the cloud system: each item contains an XML message, and they added a new field to the header part of the XML, so that an item can contain the current offline buffer size. The system can then parse the header, add this size to its own ‘in flight’ count, and show the total on the dashboard.

Do you already see why this can go horribly wrong? Well, it’s subtle, but the results are disastrous… There are two basic flaws here:

  1. The protocol only transfers data whenever a new item is fetched. As long as no item is processed at the back, no new item is fetched at the front.
  2. The value in the XML header is static, so it is ‘frozen in time’ when the XML is generated.

The first flaw could be worked around with a very nasty hack: use small time-outs on items, so that even when nothing is being processed, items will time-out, leading to the fetching of a new item, so that its XML header can be parsed and the offline buffer size can be updated.

The second flaw is a bigger problem: New items can be added to the offline buffer continuously. So the first item you add would have an offline buffer size of 1. It was the first item. By the time the system fetches it for processing, perhaps hundreds of new items have been added. But still, the XML of the first item will contain ‘1’. Likewise, if the last item was added while the offline buffer had say 3000 items, its XML header would read ‘3000’. So the system will fetch it, and it will update its dashboard to show ‘3000’, even though the buffer is now empty.

The workaround for the first flaw doesn’t exactly make things better: you can use short time-outs, but these items need to be processed. So you make another workaround to re-feed these items into the system. But now you are re-feeding items with offline buffer sizes that do not even reflect the current system state. They still read the size from the time they were created.

I just can’t get over what a huge brainfart this whole thing is. This ‘system’ can never even be remotely accurate. The problem is similar to the first example with the messaging system though: the same protocol is used in various codebases in various forms for various clients. Trying to change it now is opening a huge can of worms. There are so many components that you’d need to modify, re-test and update with various clients that it’s not economically viable.

What surprised me most is that the company actually got away with this. Or at least, they did, until they sold it to one customer that indeed WAS picky, as the queue sizes on the dashboard were very relevant to the way they optimized their business process. How many items do we have? How many new items can we still feed? How quickly should we process the items?


What is software development? An art? Craft? Trade? Occupation?… Part 2

Software Engineering seemed like a good idea at the time, and the analogy was extended further to Software Architecture around the 1990s: first design a high-level abstraction of the complex system, in order to reduce complexity and risk, and improve quality by dealing with the most important design choices and requirements first.

And while proper Software Engineering and Software Architecture can indeed deliver very high-quality software, there is a huge practical problem: time. It takes a lot of time to do everything ‘right’, and things can also get very bureaucratic, with lots of documentation and rules and such. A related problem is that requirements may change during the design, so by the time the design is done, the problem it was designed to solve may have changed into an entirely different problem, and the design is no longer adequate.

So, Software Engineering/Architecture, once again not the Silver Bullet that people were hoping for. Which led to new approaches, such as Design Patterns, and slogans such as ‘design for change’, ‘best practices’ and all that. One thing that strikes me when I read books or articles about such topics, or object-oriented programming in general, is that everyone seems to use different terminology. They may use different names for the same concepts, or even worse, they use the same names for different concepts. And we are talking about very basic concepts such as ‘encapsulation’, ‘implementation’, ‘aggregation’ and things like that.

This shows at the very least, that Software Engineering is still a very new and immature field. It might also show a different aspect, namely that the people within the Software Engineering community are not communicating and working together very well.

If you read David Parnas’ work on Software Aging, he already mentioned this back in 1994:

Software is used in almost every industry, e.g. aircraft, military, automotive, nuclear power, and telecommunications. Each of these industries developed as an intellectual community before they became dependent upon software. Each has its own professional organisations, trade organisations, technical societies and technical journals. As a result, we find that many of these industries are attacking their software problems without being aware of the efforts in other industries. Each industry has developed its own vocabulary and documents describing the way that software should be built. Some have developed their own specification notations and diagraming conventions. There is very little cross-communication. Nuclear Industry engineers discuss their software problems at nuclear industry meetings, while telecommunications engineers discuss very similar problems at entirely different meetings. To reach its intended audience, a paper on software engineering will have to be published in many different places. Nobody wants to do that (but promotion committees reward it).
This intellectual isolation is inappropriate and costly. It is inappropriate because the problems are very similar. Sometimes the cost structures that affect solutions are different, but the technical issues are very much the same. It is costly because the isolation often results in people re-inventing wheels, and even more often in their re-inventing very bumpy and structurally weak wheels. For example, the telecommunications industry and those interested in manufacturing systems, rarely communicate but their communication protocol problems have many similarities. One observes that the people working in the two industries often do not realise that they have the same problems and repeat each other’s mistakes. Even the separation between safety-critical and non safety-critical software (which might seem to make sense) is unfortunate because ideas that work well in one situation are often applicable in the others.
We need to build a professional identity that extends to people in all industries. At the moment we reach some people in all industries but we don’t seem to be reaching the typical person in those industries.

The paper itself ironically enough proves his very point: the term Software Aging has since taken on a different meaning. Parnas meant aging of code/applications because of maintenance, adding features, and needs and expectations changing:

There are two, quite distinct, types of software aging. The first is caused by the failure of the product’s owners to modify it to meet changing needs; the second is the result of the changes that are made. This “one-two punch” can lead to rapid decline in the value of a software product.

These days, the term Software Aging instead is used for software that exhibits problems after running for an extended period of time, such as resource leaks or data corruption. So the term was ‘reinvented’ by other ‘software engineers’ to mean something entirely different than what Parnas meant by it.

When is a crisis not a crisis

Parnas also points out that the so-called ‘software crisis’ had been going on for some 25 years already, at the time of writing. And despite advancements in ‘software engineering’, apparently the ‘crisis’ has not been solved. So this is not really a crisis, it is a long-term problem. And there are apparently other problems than just the ones that ‘software engineering’ has tried to address so far.

He goes on to explain that ‘engineering’ is something quite different from what people understand in terms of ‘software engineering’. It is also about a certain professional ethic; a certain professional/academic education, professional standards, a professional identity. In that sense it doesn’t really help that many people who develop software aren’t actually formally educated in computer science, but rather in some other field, such as physics or electrical engineering, and happen to be writing software related to their field.

I personally would describe this as something like “cheapening of the trade”. It can be annoying to have to work with people who haven’t ‘read the classics’, and aren’t familiar with various concepts or customs that proper computer science people take for granted. It can be very difficult to communicate with these people, because they are not working from the same common basis of knowledge. Yet, they are seen as ‘software engineers’ as much as those of us who studied CS at a university.

So is Agile Development the answer?

In recent years, there has been more criticism of Software Engineering/Architecture, mainly from the realm of Extreme Programming and Agile Development. Their philosophy argues that proper Software Engineering/Architecture is too ‘rigid’, has a lot of overhead, and cannot deal with change effectively. So instead, a more lightweight and ‘agile’ way of development is proposed.

But is that the answer? Well, not entirely, as I have argued before. In fact, I would go as far as to say that Agile Development has mostly been adopted by people outside of Computer Science, such as project managers. To them it is attractive that Agile Development, and in particular Scrum, appears to give a very good overview of progress and use of resources. I say ‘appears to give’, because it is a false sense of accuracy: all the ‘progress’ you see is based on estimates, which may or may not be realistic.

While I agree that Extreme Programming and Agile Development make certain good points, and provide some useful methodologies, the obvious problem is that they tend to completely ignore the process of developing, testing and debugging software itself.

In the next part I want to go into these areas, and introduce some movements in software development that focus more on these topics.


Putting the things together, part 2: MIDI and other problems

Remember a few months ago, when I explained my approach to playing VGM files? Well, VGM files are remarkably similar to Standard MIDI files. In a way, MIDI files are also just time-stamped captures of data sent to a sound device. MIDI however is an even stronger case for my approach than VGM is, since MIDI has even higher resolution (up to microsecond resolution, that is 1 MHz).

So when I was experimenting with some old MIDI hardware, I developed my own MIDI player. I then decided to integrate it with the VGM preprocessor, and use the same technology and file format. This of course opened up a can of worms…

(For more background information on MIDI and its various aspects, see also this earlier post).

You know what they say about assumptions…

The main assumption I made with the VGM replayer is that all events are on an absolute timeline with 44.1 kHz resolution. The VGM format has delay codes, where each delay is relative to the end of the previous delay. MIDI is very similar; the main difference is that MIDI is more of a ‘time-stamped event’ format. This means that each individual event has a delay, and in the case of multiple events occurring at the same time, a delay value of 0 is supported. VGM on the other hand supports any number of events between delays.

So implicitly, you assume here that the events/commands do not take any time whatsoever to perform, since the delays do not take any processing time for the events/commands into account. This means that in theory, you could have situations where there is a delay shorter than the time it takes to output all data, so the next event starts while the previous data is still in progress:

Overlapping data

In practice, this should not be a problem with VGM. Namely, VGM was originally developed as a format for capturing sound chip register writes in emulators. Since the captured software was written to run on actual hardware, the register writes implicitly never overlap. As long as the emulator accurately emulates the hardware and accurately generates the delay-values, you should never have any ‘physically impossible’ VGM data.

MIDI is different…

With MIDI, there are a number of reasons why you actually can get ‘physically impossible’ MIDI data. One reason is that MIDI is not necessarily just captured data. It can be edited in a sequencer, or even generated altogether. Aside from that, a MIDI file is not necessarily just a single part, but can be a combination of multiple captures (multi-track MIDI files).

On top of that, not all MIDI interfaces are necessarily the same speed. The original serial MIDI interface is specified as 31.25 kbps, one start bit, one stop bit, and no parity. This means that every byte is transmitted as a frame of 10 bits, so you can send 3125 bytes per second over a serial MIDI link. However, there are other ways to transfer MIDI data. For example, if you use a synthesizer with a built-in sequencer, it does not necessarily have to go through a physical MIDI link, but the keyboard input can be processed directly by the sequencer, via a faster bus. Or instead of a serial link, you could use a more modern connection, such as USB, FireWire, ethernet or WiFi, which are much faster as well. Or you might not even use physical hardware at all, but virtual instruments with a VSTi interface or such.

In short, it is certainly legal for MIDI data to have delays that are ‘impossible’ to play on certain MIDI interfaces, and I have actually encountered quite a few of these MIDI files during my experiments.

But what is the problem?

We have established that ‘impossible’ delays exist in the MIDI world. But apparently this is not usually a problem, since people use MIDI all the time. Why is it not a problem for most people? And why is it a problem for this particular method?

The reason it is not a problem in most cases is that the timing is generally decoupled from the sending of data. That is, the data is generally put into some FIFO buffer, so you can buffer some data while it is waiting for the MIDI interface to finish sending the earlier data.

Another thing is that timing is generally handled by dedicated hardware. If you implement the events with a simple timer that is polled, with each event being processed as soon as the timer has passed its delay-point, then the timing remains absolute, and it will automatically correct itself as soon as all data has been sent. The timer just continues to run at the correct speed at all times.

Why is this not the case with this specific approach? It is because this approach relies on reprogramming the timer at every event, making use of the latched properties of the timer to avoid any jitter, as explained earlier. This only works however if the timer is in the rate-generator mode, so it automatically restarts every time the counter reaches 0.

This means that we have to write a new value to the timer before it reaches 0 again, otherwise it will repeat the previous value. And this is where our problem is: when the counter reaches 0, an interrupt is generated. In the handler for this interrupt, I output the data for the event, and then write the new counter value (actually for two interrupts ahead, not the next one). If I were to write a counter value that is too small, the next interrupt would fire while we are still in the interrupt handler for the previous event. Interrupts will still be disabled, so this timer event will be missed, and the timer will restart with the same value, meaning that our timing is now thrown off, and is no longer on the absolute scale.
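For illustration, here is roughly what such a handler could look like in C. This is only a sketch under my own assumptions: the real player is written in assembly, the event layout and names are mine, and I use an MPU-401-style output routine as a stand-in for whatever interface is being driven. The PIT is assumed to be programmed for mode 2 with LSB+MSB access, and the handler would be installed on IRQ 0 with setvect(8, ...).

```c
#include <dos.h>

#define MIDI_DATA   0x330   /* assumption: MPU-401 compatible interface in UART mode */
#define MIDI_STATUS 0x331

struct event {
    unsigned int  delta;     /* MIDI-style delta: ticks between the previous event and this one */
    unsigned char len;       /* number of raw bytes to send for this event */
    unsigned char data[8];
};

static struct event *cur;    /* the event that is due at this interrupt */

static void midi_send(unsigned char b)
{
    while (inportb(MIDI_STATUS) & 0x40)   /* wait until the interface accepts a byte */
        ;
    outportb(MIDI_DATA, b);
}

void interrupt timer_handler(void)
{
    unsigned char i;

    /* 1. Output the data of the event that is due right now. */
    for (i = 0; i < cur->len; i++)
        midi_send(cur->data[i]);

    /* 2. Write the delta of the event two interrupts ahead. Because the PIT
          (mode 2) latches this value, it only takes effect when the current
          count reaches 0, which is what keeps the timing jitter-free. If the
          count reaches 0 while we are still in this handler, the old period
          is repeated and the absolute timing is lost: the failure mode
          described above. */
    outportb(0x40, cur[2].delta & 0xFF);
    outportb(0x40, cur[2].delta >> 8);

    cur++;

    outportb(0x20, 0x20);    /* EOI (can be omitted when auto-EOI is enabled) */
}
```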

Is there a solution?

Well, that is a shame… we had this very nice and elegant approach to playing music data, and now everything is falling apart. Or is it? Well, we do know that worst-case, we can send data at 3125 bytes per second. We also know how many bytes we need to send for each event. Which means that we can deduce how long it takes to process each event.

This means that we can mimic the behaviour of ‘normal’ FIFO-buffered MIDI interfaces: When an event has an ‘impossible’ delay, we can concatenate its data onto the previous event. Furthermore, we can add up the delay values, so that the absolute timing is preserved. This way we can ensure that the interrupt will never fire while the previous handler is still busy.

So, taking the problematic events in the diagram above, we fix it like this:

Regrouped data

The purple part shows the two ‘clashing events’, which have now been regrouped to a single event. The arrows show that the delays have been added together, so that the total delay for the event after that is still absolute. This means that we do not trade in any accuracy either, since a ‘real’ MIDI interface with a FIFO buffer would have treated it the same way as well: the second MIDI event would effectively be concatenated to the previous data in the FIFO buffer. It wouldn’t physically be possible to send it any faster over the MIDI interface.

This regrouping can be done for more than just two events: you can keep concatenating data until eventually you reach a delay that is ‘possible’ again: one that fires after the data has been sent.
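In code, such a preprocessing pass could look something like the sketch below. The data layout and names are my own assumptions (the real preprocessor uses its own file format); the essential part is comparing an event’s delay against the time needed to transmit the data accumulated so far.

```c
#include <string.h>

#define TICKS_PER_SECOND 1000000L                     /* e.g. microsecond timebase */
#define TICKS_PER_BYTE   (TICKS_PER_SECOND / 3125)    /* worst case: serial MIDI, 3125 bytes/s */

struct event {
    unsigned long delay;       /* ticks until the next event */
    unsigned int  len;         /* number of data bytes */
    unsigned char data[512];   /* a real implementation should check this bound */
};

/* Merge 'impossible' events in place; returns the new event count. */
unsigned int regroup(struct event *ev, unsigned int count)
{
    unsigned int src, dst = 0;

    if (count == 0)
        return 0;

    for (src = 1; src < count; src++) {
        /* Time needed to physically send the data accumulated so far. */
        unsigned long send_time = (unsigned long)ev[dst].len * TICKS_PER_BYTE;

        if (ev[dst].delay < send_time) {
            /* The next event would fire while we are still sending: concatenate
               its data, and add its delay so the absolute timing is preserved. */
            memcpy(ev[dst].data + ev[dst].len, ev[src].data, ev[src].len);
            ev[dst].len   += ev[src].len;
            ev[dst].delay += ev[src].delay;
        } else {
            /* This delay is 'possible' again: start a new output event. */
            ev[++dst] = ev[src];
        }
    }
    return dst + 1;
}
```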

Here is an example of the MIDI player running on an 8088 machine at 4.77 MHz. The MIDI device is a DreamBlaster S2P (a prototype from Serdaco), which connects to the printer port. This requires the CPU to trigger the signal lines of the printer port at the correct times to transfer each individual MIDI byte:


What is software development? An art? Craft? Trade? Occupation?… Part 1

From very early on, I noticed that although some of my friends in school had computers as well, they didn’t all use them for the same things. Some just liked to game on them. Others also liked to do their homework with them. Some liked to play around with a bit of programming as well. And in fact, some had different computers altogether. Some computers couldn’t even do what other computers could, even if you wanted to. So I started to realize that there was quite a bit of variation both between the personalities of computer users, as well as the ‘personalities’ of the computer systems and their software libraries.

As time went on, computers kept changing and improving, and new uses were invented for computers all the time. Although I knew very early on that I was “good with computers”, and I wanted to do “something with computers” when I grew up, it wasn’t entirely clear what that ‘something’ was going to be. At the time I wasn’t sure if that question was even relevant at all. And in fact, there was no way to predict that really.

I had some ‘online’ experience in the early 90s, because I got an old 2400 baud modem from my cousin, and I had dialed some BBSes and downloaded some things. But shortly before I went to university, the internet (mainly in the form of the World Wide Web) started taking off. This quite literally opened up a whole new world, and as I went through university, the internet was busy changing the world of computing altogether. But the education I was receiving could not change as quickly, so I was learning many ‘older’ technologies and skills, such as a lot of mathematics (calculus, linear algebra, discrete mathematics, numerical methods etc.), logic, computer organization, compiler construction, databases, algorithms and datastructures, etc. But not much in the sense of web technology. Not that it really mattered to me; it was not the sort of stuff that interested me in computers and engineering.

So when I set out to find my first job, there was demand for a lot of things that I had no prior experience with. Things that had barely even existed only a few years earlier. And since then, this pattern has repeated itself. About a decade ago, smartphones started to emerge. I had no prior experience with developing apps, because the concept didn’t exist yet when I went to university. Likewise, new programming languages and tools have arrived in the meantime, such as C#, Go, Swift and JSON. And things started moving to ‘the cloud’.

On the other end of the spectrum, there were things that I had taught myself as a hobby, things that were no longer relevant for everyday work. Like the C64, the Amiga, and MS-DOS. Using assembly had also gone out of style, so to speak.

So, conclusion: there are a lot of different technologies out there. It is impossible to keep up with everything, so every software developer will have to focus on the technologies that are relevant to their specific situation and interests. On top of that, there are of course different levels of education for software developers these days. In the old days, software developers would have studied computer science at a university. In the really old days, they may even ‘just’ have studied mathematics, physics or such, and have come into contact with computers because of their need for and/or interest in automated computation.

Apparently there is a lot of variation in the field of ‘software engineering’, both in the ways in which it is applied, and in the people working in these fields, calling themselves ‘software engineers’. Many different cultures. Far more varied than in other types of ‘engineering’, I would say, where the education is mostly the same, there is a certain specific culture, and the people attracted to that field are more similar types of people.

So what exactly is ‘software engineering’ anyway?

Software engineering has become something of a ‘household name’. In the old days, it was ‘computer programming’. Take for example Donald Knuth’s work “The Art of Computer Programming”. But at some point, it became ‘software engineering’. Where did this term come from, and why did we stop talking about just ‘programming’, and start talking about ‘developing’ and ‘engineering’ instead? Let us try to retrace history.

It would seem that the meaning of ‘computer programming’ is the process of developing software in the narrowest sense: converting a problem into a ‘computer program’. So, translating the problem into algorithms and source code.

‘Software development’ is developing software in the broader sense: not just the computer programming itself, but also related tasks such as documenting, bug fixing, testing and maintenance.

But how did ‘software development’ turn into ‘software engineering’? This was mainly a result of the ever-increasing complexity of software systems, and the poor control over this complexity, leading to poor quality and efficiency in software, and software projects failing to meet deadlines and stay within budget. By the late 1960s, the situation had become so bad that people were speaking of a “Software Crisis“.

Edsger Dijkstra explained it as such, in his article “The Humble Programmer“:

The major cause of the software crisis is that the machines have become several orders of magnitude more powerful! To put it quite bluntly: as long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a mild problem, and now we have gigantic computers, programming has become an equally gigantic problem.

(Note also how his article makes a point very similar to what I wrote above: Dijkstra didn’t know anything about vacuum tubes, the technology used for early computers, which was no longer relevant by the time Dijkstra got into programming, because transistor technology had taken over.)

So, people started looking for ways to tackle these problems in software development. New tools and methodologies were developed. They saw analogies in various fields of engineering, where mathematics and science were being applied to solve problems, develop machines, infrastructure and such, also within large and complex projects with difficult deadlines and requirements. And so, Software Engineering was born: treat software development as a form of engineering, and apply systematic scientific and engineering principles to software development.

Oh really?

In the next part, we will see if they succeeded.


What makes the PCjr cool, and what makes it uncool?

The IBM PCjr was a huge flop in the marketplace. As a result, it was only in production for about 14 months, and it never even reached my part of the world. When I grew up, I had a vague notion that these machines existed, since many games offered enhanced Tandy audio and video, and some would advertise it as PCjr (which is what it originally was). I never actually saw a Tandy machine in the flesh though, let alone a PCjr. But it had always intrigued me: apparently there were these PCs that had better graphics and sound than the standard PCs and clones that I knew. A few weeks ago though, I finally got my own PCjr, a 128 kb model with floppy drive, and I would like to share a quick list of what makes it cool, and what does not. Not just as a user, but as a retro-coder/demoscener.

What is cool?

  • The video chip does not suffer from ‘CGA snow’ in 80 column mode
  • You get 16 colours in 320×200 mode and 4 colours in 640×200 mode, as opposed to 4 and 2 colours respectively on CGA
  • You get a 4-channel SN76496 sound chip
  • There is no separate system and video memory, so you can use almost the entire 128k of memory for graphics, if you want (the video memory is bank-switched, much like on a C64)
  • The machine comes in a much lighter and smaller case than a regular PC
  • The keyboard has a wireless infrared interface
  • It has a ‘sidecar’ expansion mechanism
  • It has two cartridge slots

What is uncool?

  • Because the video memory and system memory are shared, the video chip steals cycles from the CPU
  • 128k is not a lot for a machine that has to run PC DOS, especially if part of that memory is used by the video chip
  • IBM omitted the DMA controller on the motherboard
  • All connectors are proprietary, so you cannot use regular PC monitors, joysticks, expansion cards or anything
  • The keyboard has a wireless infrared interface

Let me get into the ‘uncool’ points in some more detail.

Shared video memory

Shared memory was very common on home computers in the 80s. Especially on a 6502-based system, this could be done very elegantly: the 6502 only uses the memory bus during half of each clock cycle, so by cleverly designing your video circuitry, you could make it run almost entirely in the memory cycles left unused by the 6502. The C64 is an excellent example of this: most of the video is done in the unused cycles. There are only two exceptions: sprites and colorram. At the beginning of each scanline, the VIC-II chip will steal some cycles to read data for every enabled sprite. And every 8th scanline, the VIC-II will load a new line from colorram. Those are the only cycles it steals from the CPU.

The PCjr however, does not use a 6502, it uses an 8088. And an 8088 can and will access memory at every cycle. As a result, the video circuit will slow down the CPU. It will steal one in every 4 IO cycles (one IO cycle is 4 CPU cycles at 4.77 MHz). As a result, the CPU runs at only 3/4th of the effective speed, about 3.57 MHz effectively.

On the bright side though, the video accesses also refresh the memory. This is also very common on home computers in the 80s. PCs are an exception however. The solution that IBM came up with for this is both creative and ugly: IBM wired the second channel of the 8253 timer to the first channel of the 8237 DMA controller. This way the timer will periodically trigger a DMA read of a single byte. This special read is used as a memory refresh trigger. By default, the timer is set to 18 IO cycles. So on a regular PC, the CPU runs at about 17/18th of the effective speed, about 4.5 MHz. Considerably faster than the PCjr.

The downside of the regular PC however is that the memory refresh is not synchronized to the screen in any way. On the PCjr it is, so it is very predictable where and when cycles are stolen. It always happens in the same places on the same scanline (again, much like the C64 and other home computers). In 8088 MPH, we circumvented this by reprogramming the timer to 19 IO cycles (this means the memory is refreshed more slowly, but there should be plenty of tolerance in the DRAM chips to allow this without issue in practice). An entire scanline on CGA takes 76 IO cycles, so 19 is a perfect divisor of the scanline length. The trick was just to get the timer and the CRTC synchronized: ‘in lockstep’. On a PCjr you get this ‘lockstep’ automatically; it is designed into the hardware.
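For what it’s worth, the reprogramming itself is trivial; the hard part in 8088 MPH was getting the lockstep with the CRTC, which is not shown here. A minimal sketch, using Turbo C-style port I/O:

```c
#include <dos.h>   /* outportb() */

/* Set the period of the DRAM refresh timer (PIT channel 1) in IO cycles.
   The BIOS default is 18; 8088 MPH uses 19 so that it divides the
   76-IO-cycle CGA scanline evenly. */
void set_refresh_period(unsigned char io_cycles)
{
    outportb(0x43, 0x54);        /* channel 1, write LSB only, mode 2 (rate generator) */
    outportb(0x41, io_cycles);
}
```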

128 kb ought to be enough for anyone

The first PC had only 16kb in the minimum configuration. This was enough only for running BASIC and using cassettes. For PC DOS you would need 32kb or more. However, by 1984, when the PCjr came out, it was common for DOS machines to have much more memory than that. Since the PCjr shares its video memory with the system, you lose up to 32kb for the framebuffer, leaving only 96kb for DOS. That is not a lot.

What is worse, the unique design of the PCjr makes it difficult to even extend the memory beyond 128kb. There are two issues here:

  1. The memory is refreshed by the video circuit, so only the 128kb that is installed on the mainboard can be refreshed automatically.
  2. The video memory is placed at the end of the system memory, so in the last 32kb of the total 128kb.

It is complicated, but there are solutions to both. Memory expansions in the form of sidecars exist. These contain their own refresh logic, separate from the main memory. An interesting side-effect is that this memory is faster than the system memory. Namely, the system memory is affected by every access of the video circuit, which is a lot more than the minimum number of accesses required for refreshing. So the memory expansion steals fewer cycles from the CPU, and when you use code and data in this part of the memory, the CPU will run faster. With some memory expansions (for example ones based on SRAM, which does not need refresh at all), the CPU is actually faster than on a regular PC.

The second problem is that if you extend the memory beyond 128kb, there will be a ‘gap’ for the video memory in the first 128kb. DOS and applications expect the system memory to be a single consecutive block of up to 640kb. So it will just allocate the video memory as regular memory, leading to corruption and crashes.

There is a quick-and-dirty solution to this: after DOS itself is loaded, load a device driver that allocates the remaining memory up to 128kb. This driver then effectively ‘reserves’ the memory, so that it will not be re-used by DOS or applications. You will lose some of your memory, but it works.

Most games with enhanced PCjr graphics and/or audio are actually aimed at Tandy 1000 machines, and will require more than 128kb. The Tandy 1000 however is designed to take more than 128kb, and its videomemory is always at the end of the system memory, regardless of the size. This means that not all games for Tandy 1000 will run on a PCjr as-is. If it’s games you want, the Tandy is the better choice hands down.

To preserve as much memory as possible, you will probably want to use the oldest possible DOS, which is PC DOS 2.10. The latest version of DOS to officially support the PCjr is PC DOS 3.30. The main feature you would be missing out on is support for 3.5″ floppies and HD floppies. But your PCjr does not support drives for those floppies anyway, so there’s no real reason to run the newer version. There was never any support for hard disks for the PCjr either, although in recent years, some hobbyists have developed the jr-IDE sidecar. Since this also gives you a full 640k memory upgrade, you can run a newer DOS with proper support for the hard drive without a problem anyway.

No DMA controller

As already mentioned, the original PC uses its DMA controller for memory refresh. That part is solved by using the video chip on the PCjr. But the DMA controller is also used for other things. As I blogged earlier, it is used for digital audio playback on sound cards. That will not be a problem, since there are no ISA slots to put a Sound Blaster or compatible card in a PCjr anyway.

But the other thing that DMA is used for on PCs is floppy and harddisk transfer. And that is something that is great for demos. Namely, we can start a disk transfer in the background, while we continue to play music and show moving graphics on screen, so we can get seamless transitions between effects and parts.

On the PCjr, not a chance. The floppy controller requires you to poll for every incoming byte. Together with the low memory, that is a bad combination. This will be the most difficult challenge for making interesting demos.

Proprietary connectors

This one is self-explanatory: you need special hardware, cables and adapters for PCjr. You cannot re-use hardware from other PCs.

Wireless keyboard

I listed this as both ‘cool’ and ‘uncool’. The uncool parts are:

  1. It requires batteries to operate.
  2. You can’t use a regular keyboard, only the small and somewhat awkward 62-key one.
  3. The wireless interface is very cheap. It is connected to the Non-Maskable Interrupt (as discussed earlier), and requires the CPU to decode the signals.

This means that the keyboard can interrupt anything. The most common annoyance that people reported is that you cannot get reliable data transfers via (null-)modem, since the keyboard interface will interrupt the transfer and cause you to lose bytes.

It also means that keyboard events are much slower to handle on the PCjr than on a regular PC.

And it means that the interface is somewhat different. On a real PC, the keyboard triggers IRQ 1 directly, and you can then read the scancode from the keyboard controller (port 60h). On the PCjr, the hardware triggers the NMI, and the NMI handler has to decode the bits sent via the wireless interface with time-critical loops. This yields PCjr-specific scancodes, which are then translated by another routine on the CPU. And finally the CPU generates a software interrupt to trigger the IRQ 1 handler, for backward compatibility with the PC.


For me personally, the PCjr definitely scores as ‘cool’ overall. I don’t think I would have liked it all that much if it were my main PC back in the day. It is very limited with so little memory, just one floppy drive, and no hard drive. But as a retro/demoscene platform, I think it offers just the right combination of capabilities and limitations.


More PC(jr) incompatibilities!

The race for PC-compatibility

Since the mid-80s, there have been many clones of the original IBM PC. IBM themselves also made new-and-improved versions of the PC, aiming for backward-compatibility. DOS itself was more or less a clone of CP/M, and in the CP/M-world, the main common feature between machines was the use of a Z80 CPU. Other hardware could be proprietary, and each machine would run a special OEM version of CP/M, which contained support for their specific hardware (mainly text display, keyboard and drives), and abstracted the differences behind an API. As long as CP/M programs would stick to using Z80 code and using the API rather than accessing hardware directly, these programs would be interchangeable between the different machines. The lowest level of this API was the Basic Input/Output System, or BIOS (note that in CP/M, this was not stored in a ROM, but was part of the OS itself).

Since DOS was based on CP/M, initially it was meant to work in the same way: a DOS machine would have an Intel 8086 or compatible CPU, and the BIOS and DOS APIs would abstract away any hardware differences between the machines. Each machine would have their own specific OEM release of DOS. As people quickly found out though, just being able to run DOS would be no guarantee that all software written for the IBM PC would work. In order to get the most out of the hardware, a lot of PC software would access the BIOS or even the hardware directly.

So clone-builders started to clone the IBM PC BIOS, in order to improve compatibility. The race for 100% PC compatibility had begun. Some of the most troublesome applications of the day included Microsoft Flight Simulator and Lotus 1-2-3. These applications would become a standard test for PC compatibility.

Did they succeed?

By the late 80s, clones had reached the maturity that they would generally run anything that you could throw at them. The OEM versions of MS-DOS would also disappear, as a single version of MS-DOS could run on all PC-compatibles.

But how compatible were all these PCs really? Were they functionally identical? Well no. But this was a given in the world of PCs and DOS. The different IBM machines and PC-clones were ‘close enough’, and software was written in a way that 100% hardware equivalence was not required. It was a given that there were different types of CPUs, different speeds, different chipsets and different video adapters. So software would settle on a certain ‘lowest common denominator’ of compatibility.

But it is even worse than you might think at first. With our demo 8088 MPH, we have already seen that even clones that use an 8088 CPU at 4.77 MHz and a CGA-compatible video adapter aren’t compatible enough to run the entire demo. But beyond that, even IBM’s own hardware isn’t entirely consistent. There are two different types of CGA, the ‘old style’ and ‘new style’, which have differences in the colour output.

Beyond that, IBM did not always use the same 6845 chips. Some IBM CGA cards use a Motorola chip, others may use 6845s from other sources, such as Hitachi or UMC. On top of that, there are different revisions of the 6845 chip from Motorola as well. Which would not be that bad, if it wasn’t for the fact that they may have slightly different behaviour. In the case of 8088 MPH, apparently all our IBM CGA cards used a Motorola 6845 chip, which supported an hsync width of ‘0’, which it translated to 16 internally. Other 6845s would not have this behaviour, and as a result, the hsync width actually was 0, which meant that there effectively was no hsync, and the monitor could not synchronize to the signal.

Another thing I have already mentioned before is the 8259A Programmable Interrupt Controller. There were two main problems there:

  1. There are two revisions of the 8259A chip, models before and after 1985. Auto-EOI mode does not work on the earlier chips when in slave mode.
  2. A PC-compatible machine can have either one or two 8259A chips. The AT introduced a second 8259A to support more interrupt levels. As a result, the first 8259A had to be set to ‘cascaded’ mode. Also, in early PCs, the 8259A was used in ‘buffered’ mode, but in the ATs it was not.

And I also briefly mentioned that there is a similar issue with the 8237 DMA controller: early PCs had only one 8237; the AT introduced a second DMA controller, for 16-bit transfers.


I also gave the IBM PCjr (codename: Peanut) an honourable mention. Like early PC-clones, its hardware is very similar to that of the PC (8088 CPU, 8253 PIT, 8259A PIC, CGA-compatible graphics), and it runs PC-DOS, but it is not fully compatible.


I have recently obtained a PCjr myself, and I have been playing around with it a bit. What I found is that IBM made an even bigger mess of things than I thought.

As you might know, the IBM PCjr has advanced audio capabilities. It uses a Texas Instruments SN76496 sound chip (also used by Tandy 1000 machines). There has been an attempt by James Pearce of lo-tech to create an ISA card to add this functionality to any PC. I have built this card, and developed some software for it, and it was reasonably successful.

One thing we ran into, however, is that IBM chose port C0h for the SN76496. But for the AT, they chose the same port C0h for the second DMA controller. This caused us some headaches, since the card would never be able to work at port C0h on any AT-compatible system. So, we have added a jumper to select some additional base addresses. Tandy had also run into this same issue, when they wanted to extend their 1000-range of PC-compatibles to AT-compatibility. Their choice was to move the sound chip from C0h to 1E0h, out of the way of the second DMA controller.

This wasn’t a very successful move however: games written for the PCjr or early Tandy 1000 were not aware of the fact that the SN76496 could be anywhere other than at port C0h, so it was just hardcoded, and would not work on the new Tandy. So we had to patch games to make them work with other addresses.

But as I experimented a bit with the real PCjr, I also ran into another issue: the keyboard. The PCjr uses a wireless keyboard, so it has an entirely different keyboard controller than a regular PC. In order to save cost, IBM implemented the decoding of the wireless signal in software. They have connected the wireless receiver to the Non-Maskable Interrupt, the highest-level interrupt in the system.

But that brings us to the next incompatibility that IBM introduced: on the PC, XT and PCjr, they put the NMI control register at port A0h. On the AT however, they moved the NMI control to port 70h, as part of the new MC146818 CMOS configuration chip. What’s worse though, is that they put the second 8259A PIC at addresses A0h and A1h, so exactly where the old NMI control register used to be. On a regular PC or XT it is not that big of a deal, since NMI is only used to report parity errors. The PCjr however uses it all the time, since it relies on it for the keyboard.

Oh, and one last annoying re-use by IBM: the PCjr’s enhanced graphics chip is known as the Video Gate Array, or ‘VGA’. Yes, they re-used ‘VGA’ later, for the successor of the Enhanced Graphics Adapter (‘EGA’), the Video Graphics Array.

Incomplete decoding

What caused me headaches, however, is a cost-saving feature that was common back in the day: incomplete decoding of address lines. By not connecting all address lines, the same device is actually ‘mirrored’ at multiple ports. For example, the SN76496 is not just present at C0h: since it ignores address lines A0, A1 and A2, it is present at C0h-C7h.

The same goes for the NMI register: it is not present only at A0h, but through A0h-A7h. So guess what happened when I ran my code to detect a second PIC at address A1h? Indeed, the write to A1h would also go to the NMI register, accidentally turning it off, and killing my keyboard in the process.

It took me two days to debug why my program refused to respond to the keyboard, even though it was apparent that the interrupt controller was successfully operating in auto-EOI mode. Namely, the PCjr has a vertical blank interrupt, and I wanted to experiment with this. I could clearly see that the interrupt fired at every frame, so I had not locked up the interrupt controller or the system.

While tracking down the bug, I also discussed it with Reenigne. Once I started suspecting that the NMI register is not just at A0h, but is actually mirrored over a wider range, I started to worry that the PC and XT may have the same problem. He told me that it is actually even worse on the original PC and XT. They are even more sloppy in decoding address lines (ignoring A0-A4), so the NMI register is present all through A0h-BFh.

In the end I had to make my detection routine for auto-EOI more robust. I already tried to use BIOS int 15h function C0h first, to get the information, but that fails on older systems such as the PCjr, since it was not introduced until 1986. This is why my PCjr got into the fallback code that tries to poll A1h to see if it responds like a PIC. I have added an extra level of safety now: if the int 15h function is not supported, I will first try to examine the machine model byte, located in the BIOS at address F000:FFFEh. This should at least allow me to filter out original IBM PCs, XTs and PCjrs, as well as clones that report the same byte. It may still not be 100% though.
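A minimal sketch of that extra check (Turbo C-style; the model byte values are the commonly documented IBM ones, and as said, clones may report the same byte, so this is a hint rather than proof):

```c
#include <dos.h>

/* Returns non-zero for machines where probing port A1h may hit the NMI
   mask register instead of a second PIC. */
int nmi_mask_may_be_at_a0(void)
{
    unsigned char model = *(unsigned char far *)MK_FP(0xF000, 0xFFFE);

    switch (model) {
    case 0xFF:    /* original IBM PC */
    case 0xFE:    /* IBM XT / Portable */
    case 0xFD:    /* IBM PCjr */
        return 1;
    default:
        return 0;
    }
}
```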

Sound engineering

This might be a good point to mention a similar issue I encountered some weeks earlier. Namely, I have a machine with an IBM Music Feature Card. Recently, I built a Game Blaster clone, designed by veovis. When I put it in the same machine, the IMFC started acting up.

What is the problem here? Well, the IMFC is configured to a base address of 2A20h. The first time I saw this, it already struck me as odd. That is, most hardware is limited to a range of 200h-3FFh (range 0-1FFh is originally documented as ‘reserved’, although the Tandy Clone card proves that at least writing to C0h works on the ISA bus). But, the IO range of an 8086 is 16-bit, so indeed any address from 0-FFFFh is valid. There is no reason to limit yourself to 3FFh, aside from saving some cost on the address decoding circuitry.

The problem is that the Game Blaster clone only decodes the lower 10 address bits (A0-A9). I configured it to the default address of 220h, which would be mirrored at all higher addresses of the form x220h (xxxxxx1000100000b). And indeed, that also includes 2A20h (0010101000100000b).
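You can verify the overlap with a one-line check: a card that only decodes A0-A9 responds whenever the lower 10 bits of an address match its base address. A trivial illustration:

```c
#include <stdio.h>

int main(void)
{
    unsigned int base = 0x220;    /* Game Blaster / Sound Blaster default */
    unsigned int imfc = 0x2A20;   /* IBM Music Feature Card default */

    /* The lower 10 bits of 2A20h are 220h, so a write intended for the IMFC
       also hits the card configured at 220h. */
    printf("conflict: %s\n", ((imfc & 0x3FF) == (base & 0x3FF)) ? "yes" : "no");
    return 0;
}
```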

Now, was this a flaw in veovis’ design? Not at all. He made a clone of the Game Blaster, and the original Game Blaster does exactly the same thing, as do many other cards of that era (including IBM’s own joystick adapter for example). In fact, many later Sound Blasters still do this. So, this is a bit of a shame. Using a Game Blaster or Sound Blaster at the default base address of 220h will conflict with using an IMFC at its default base address of 2A20h.



The Covox years

I have covered various early sound solutions from the DOS/8088 era recently, including AdLib, Sound Blaster, PCjr/Tandy’s SN76489 chip, and the trusty old PC speaker itself. One device that has yet to be mentioned however, is the Covox Speech Thing.

It is a remarkably simple and elegant device. Trixter has made a very nice video showing you the Covox Speech Thing, and the included speakers:

So it is basically an 8-bit DAC that plugs into the printer port of a PC (or in theory other computers with a compatible printer port). The DAC is of the ‘resistor ladder‘ type, which is interesting, because a resistor ladder is a passive device. It does not require a power source at all. The analog signal coming from the DAC is not very powerful though, so there is an amplifier integrated into the speakers. You can also run the output into a regular amplifier or recording equipment or such, but the output is not ‘line-level’, it is closer to ‘microphone-level’, so you may want to use a small pre-amplifier between the Covox and your other equipment.

So how does one program sound on a Covox? Well, it is very similar to outputting samples via PWM on the PC speaker, or outputting 4-bit samples by adjusting the volume register on the SN76489. That is, there is no DMA, buffering or timing inside the device whatsoever. The CPU has to write each sample at the exact time that it should be output, so you will either be using a timer interrupt running at the sampling frequency, or a cycle-counted inner loop which outputs a sample at every N CPU cycles. So this means that the sound quality is at least partly related to the replay code being used: the more accurate the timing is, the less temporal jitter the resulting analog signal will have.
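As a rough sketch of the timer-interrupt variant (Turbo C-style; the printer port address is an assumption, and for simplicity the old IRQ 0 handler is not chained, so the BIOS clock is not updated while this runs):

```c
#include <dos.h>

#define LPT_DATA 0x378   /* assumption: data register of the first printer port */
#define PIT_HZ   1193182L

static const unsigned char far *samples;
static unsigned long sample_count;
static volatile unsigned long sample_pos;
static void interrupt (*old_irq0)(void);

static void interrupt covox_irq0(void)
{
    if (sample_pos < sample_count)
        outportb(LPT_DATA, samples[sample_pos++]);   /* the latched byte IS the DAC level */
    outportb(0x20, 0x20);                            /* EOI */
}

void play_covox(const unsigned char far *data, unsigned long count, unsigned int sample_rate)
{
    unsigned int divisor = (unsigned int)(PIT_HZ / sample_rate);

    samples = data;
    sample_count = count;
    sample_pos = 0;

    old_irq0 = getvect(8);
    setvect(8, covox_irq0);

    outportb(0x43, 0x34);             /* PIT channel 0: LSB+MSB, mode 2 (rate generator) */
    outportb(0x40, divisor & 0xFF);
    outportb(0x40, divisor >> 8);

    while (sample_pos < sample_count)
        ;                             /* all output happens in the interrupt handler */

    outportb(0x43, 0x34);             /* restore the default ~18.2 Hz timer rate */
    outportb(0x40, 0);
    outportb(0x40, 0);
    setvect(8, old_irq0);
}
```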

So, for such a simple device, with so few components (the real Covox has only one component inside: a resistor ladder in a DIP package), and relying so much on the CPU, how good can this really sound? Well, my initial expectations were not that high. Somewhere in the early 90s, before I had my first sound card, I found the schematics for building your own Covox-compatible device. As said, it’s just a simple resistor ladder, so it can be built with just a few components. Various PC demos, trackers and MOD players included Covox support, and schematics. There were (and still are) also various cheap clones of the Covox available.

The one I built was very minimal, and I didn’t use resistors with extremely low tolerance. It produced sound, and it wasn’t that bad, but it wasn’t that great either. Not ever having heard any other Covox, either the real thing or a clone, I had no idea what it should sound like, and given the low cost and simplicity of the device, I figured it was nice that it produced any recognizable sound at all.

Fast forward to a few months ago, when there was talk of building a Covox clone on the Vogons forum. User ‘dreamblaster’ had made a clone, and the recordings he made sounded quite impressive to me. Way better than what I remember from my own attempt back in the day. But then Trixter came round, and said the real Covox sounded even better. And as you can hear at the end of the video above, he is right: as far as 8-bit DACs go, the Covox sounds excellent. It is crisp and noise-free, and actually sounds better than early Sound Blasters if you ask me (they are quite noisy and have very aggressive low-pass filtering to reduce aliasing, presumably because they were mostly expecting people to use them with samples of 11 kHz or less). You would never guess that this great sound quality would come from just a simple passive device hanging off the printer port of a PC.

So we set off to investigate the Covox further, and try to get to the bottom of its design, aiming to make a ‘perfect’ clone. I made a simple application to stream samples from disk to the printer port (basically a small adaptation of the streaming player for PC speaker that I showed here), so we could play back any kind of music. Trixter then selected a number of audio samples, and we did comparisons between the Covox clones and the real thing.

One thing that stood out was that the Covox DAC had a very linear response, where the clones had a much more limited dynamic range. Aside from that, the clones also produced a much louder signal.

The first place where various clones go wrong is that there are various ways to construct a resistor ladder. Which type of ladder is the ‘correct’ one for the Covox? Luckily, Covox patented their design, and that meant that they had to include the schematic of their resistor pack DAC as well:

So this tells us what the circuit should look like for an exact clone. The patent further explains what values should be used for the different parts:

Nominal resistor values are 200K ohms each for resistors R1 through R8, 100K ohms for R9 through R15, and 15K ohms for R16.

Capacitor C1 has a value of about 0.005 microfarads, yielding a low-pass filter bandwidth of about 3000 hertz.

NB: The part of the schematic on the right, with resistors R30-R37, is part of an example printer circuit with pull-up resistors, to show how the Covox could work when it was used in combination with a printer connected to its pass-through printer port. They are not part of the Covox itself. There are also two related schematics in the patent, one with the pull-up resistors added to the pass-through port on the Covox itself, and another with active buffer amplifiers. The only variations of Covox Speech Things that we’ve seen all use the most simple schematic, with only the resistor ladder and the pass-through port without pull-up resistors.

To make sure, however, we did some measurements on the real Covox to try and verify whether the actual unit was exactly like the patent or not. It would appear that indeed, the ladder is as designed, with 100k and 200k resistors. We are not entirely sure of the 15k ‘load’ resistor though. The measurements seem to indicate that it is probably closer to a 25k resistor. This resistor is used to bring the signal down from the initial +5v TTL levels of the printer port to about ~1v (measured), which would be an acceptable ‘line level’ input for audio equipment (when connected to an actual device, the impedance of the device will pull it down a bit further; the effective maximum amplitude should be around +0.350v in practice).

It appears that many clones did not copy the Covox exactly, possibly to reduce cost with a simpler design using fewer parts, and possibly to avoid the patent. The result however is that they generally sound considerably worse than the real thing (and in fact, perhaps because of this, the Covox may have inadvertently gotten a reputation as a low-quality solution, because most people would use clones that didn’t sound very good, and not many people were aware of just how good the real Covox sounded. I certainly fall into that category, as I said above). For example, there is a schematic included in the readme-file that comes with Triton’s Crystal Dream demo:


As you can see, it is a simplified circuit and not really a ‘ladder’ as such. It uses fewer components, but is also less accurate. One interesting characteristic of a resistor ladder is that you can build it from batches of the same resistor values (especially considering the fact that you only need R and 2R values, and 2R can be constructed from two R resistors in series). If you buy resistors in a batch, then their tolerance in an absolute sense will be as advertised (you can buy them e.g. with 5% tolerance, 1% tolerance or 0.1% tolerance). However, the relative tolerance of resistors in the same batch is likely much smaller. And in the case of a resistor ladder, the relative tolerance is more important than the absolute tolerance.

Since this schematic uses resistors of various values, it cannot exploit the advantage of resistors in the same batch. Also, the values of these resistors do not correspond with the values in the real Covox circuit. Aside from that, the load resistor is missing, and they chose a different value for the capacitor.

Another popular one came with Mark J. Cox’ Modplay:


This schematic at least appears to be closer to the Covox, although not entirely the same. Again, the resistor and capacitor values are different from the Covox.

In general, what happens is that the response of the DAC is nowhere near linear. We’ve found that the clones tend to have much higher output levels than the real Covox, but especially the dynamic range is far worse. You hear that especially when there is a fade-out in music: the actual level doesn’t drop off very much, and as a result, the 8-bit quantization noise becomes very obvious, and the sound is perceived as ‘grainy’ and low-quality. The real Covox has a very linear response, so it sounds about as good as you can expect from an 8-bit DAC.

Our aim was to make one that sounds as close to the real thing as possible, or possibly even better. The end-result is the CVX4, which you can find here: http://www.serdashop.com/CVX4

It has a number of DIP switches so you can fine-tune the filtering and output level to suit your taste. This of course includes a setting that is completely true to the original Covox. Be sure to check out the example videos and reviews that are posted on the shop page. You can hear just how good it sounds. I will post one video here, which uses CvxPlay to demonstrate a number of samples selected by Trixter, which we used to compare the real Covox with the CVX4, and fine-tune the design:

If you are looking for a Covox clone for your retro gaming/demo/tracker PC, then look no further, the CVX4 is as good as it gets!
