Bugs you can’t fix

Although I generally try to avoid the usual car analogy, in this case I am speaking from real-world experience that happened to be car-related, so you will have to excuse me.

No car is ‘perfect’… Every car has certain ‘bugs’, as in design-flaws. There may be excessive wear on certain parts, or some things may just fail unreasonably quickly. You can go to the dealership and try to have it fixed, but if they only have replacement parts for you, the wear or failure will occur again and again. They have a nice term for this: “product characteristic”. It’s not quite the same as “it’s not a bug, it’s a feature!”, although it may feel like it. It’s just that there are inherent flaws in the design that cause the excessive wear or failure. And the mechanic can replace parts or make adjustments, but he can’t redesign your car.

Over the years, I’ve encountered a handful of situations where I ran into software bugs, which, as I progressed in my debugging, turned out to be ‘unsolvable’, much like the above example in the car design. In my experience they are very rare, luckily. But they do pop up every now and then, and when they do, it’s a huge mess. I thought that would make them an interesting topic to discuss.

Shared code is great

The first example I want to give was in a suite of applications that were distributed over a number of workstations, and connected together via a network-based messaging system.

A bug report came in, and I was asked to investigate it: An application that printed out graphs of sensor data in real time would often print random empty pages in between, but would otherwise continue to function fine.

So I started debugging on the side of the printing app. I found that the empty pages were actually correct: sometimes it would receive empty messages. It just printed the data as it received it, and all the code appeared to function as expected.

Then I approached it from the other side, to see if there were empty messages being sent by the sensor. But no, the sensor was working fine, and didn’t have any drop-out. So… the problem is somewhere between the sensor and the printing app. What is happening in the messaging system?

And that’s where I found the problem: The message system was designed so that you could register a message of a certain ‘type’ under a unique name. It would allocate a receive-buffer for each registered message. Do you see the problem already? There is only one buffer that is being re-used for each message sent under that type and name. For small messages you can usually get away with it. However, this sensor was sending large batches of data in each message, and it also had a relatively high frequency.

This led to a race-condition: The printing app would have to finish printing the data before the next message comes in, because the new message would just overwrite the buffer.

There was no locking mechanism in place, so there was no way for the printing app to tell the messaging system to wait with the new message until it was finished with the previous one. So the only thing I could do in the printing app was to just copy every message to a new internal buffer as quickly as possible, so that I minimize the ‘critical time’ that the data needs to remain valid in the buffer.

This improved the situation, but still, it did not fix it completely. There was still the occasional empty page that slipped through, probably because of network congestion: the last packet of one message would be immediately followed by the first packet of the next message, which overwrote the buffer before the printing app had a chance to copy it.
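To make the race a bit more concrete, here is a minimal Python sketch. It is not the actual messaging system: the names `SharedBufferChannel`, `deliver`, `print_page` and `run` are made up for illustration, and I model "the next message starts arriving" as the buffer simply being cleared. The timing is forced by hand, so the result is deterministic.

```python
class SharedBufferChannel:
    """Hypothetical model: ONE receive buffer per registered message name."""

    def __init__(self, size):
        self.buffer = bytearray(size)  # allocated once, reused for every message

    def deliver(self, payload):
        # A new message overwrites the buffer in place. There is no lock,
        # so the receiver cannot ask the channel to wait.
        if len(payload) > len(self.buffer):
            raise ValueError("payload larger than receive buffer")
        self.buffer[:] = payload.ljust(len(self.buffer), b"\x00")


def print_page(data):
    """Stand-in for the printing app: returns what would end up on paper."""
    return bytes(data).rstrip(b"\x00").decode("ascii")


def run(copy_first):
    channel = SharedBufferChannel(16)
    channel.deliver(b"sensor batch 1")

    if copy_first:
        data = bytes(channel.buffer)  # workaround: private copy, taken right away
    else:
        data = channel.buffer         # original behaviour: read the shared buffer

    # Assumption: the next message starts arriving before printing is done,
    # modelled here as the buffer being cleared.
    channel.deliver(b"")

    return print_page(data)


if __name__ == "__main__":
    print(repr(run(copy_first=False)))  # prints ''
    print(repr(run(copy_first=True)))   # prints 'sensor batch 1'
```

Copying early only shrinks the window. If the overwrite lands before the copy is taken, the copy is just as empty as the shared buffer, which matches the occasional empty page that still slipped through.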

Why was this bug unfixable? Because the API and protocol design of the messaging system were just flawed. It would require a rewrite of the messaging system and all applications using it to take care of the race-condition. In theory it can be done, but it meant that you could not just roll out an update of the printing app. You’d have to roll out updates for all applications in the entire suite, because they all share the same messaging system, and need to be working with the same version of the protocol and API to avoid problems. This was just not economically viable. So the bug couldn’t be fixed.

The wrong data in the wrong structure

The second example is one that I already briefly mentioned before: A system that was designed around FIFO queues, while the requirements called for far more flexibility than just FIFO to get the required routing and prioritization.

Again this is a case where someone made a fundamental design decision that was downright wrong. Since it is so fundamental to the system, the only way to fix it is to do a complete redesign of the core functionality. ‘Fixing’ the system is basically the same as just scrapping it and restarting from scratch.

Basically they spent a few months designing and building a bicycle. Which does the job (checks the ‘working software’ Agile-box) for short-distance trips. But they did not read the requirements carefully, which clearly stated that they had to be able to reach places like Tokyo. What they should have built was a plane. And a plane is so fundamentally different from a bicycle, that there are virtually no parts of the design or implementation that can be shared with a plane.

Two for the price of one

That same system also had another ‘interesting’ problem: The queue sizes that it reported on its dashboard were never accurate. How these queue sizes are calculated takes a bit of explanation, so I hope I can get the point across.

The system is designed to handle a number of queues, where each queue can have hundreds to thousands of items. Or at least, the ‘queue’ as the user thinks of it: the number of items they are actually trying to process with the system.

The implementation, however, was built up of a number of processes, which were connected via a network, and each process had a small internal queue. They could not hold all data in memory at any one time. It was also undesirable to do so, since a system crash would mean that all the data would be lost. So the philosophy was to keep the number of in-memory items to a minimum.

What this meant was that there were a small number of items ‘in flight’ in these processes, and there was a larger ‘offline buffer’ that was still waiting to be fed to the system. In the initial version, this ‘offline buffer’ was not visible at all, so all you saw was the number of ‘in flight’ items, which generally was an insignificant amount (perhaps in the range of 8-96 items) compared to the offline buffer.

So, the customers wanted to see the offline buffer as well. This is where things started to go wrong… The system was built on top of an existing protocol, which was meant for local use only: items would be fed directly from one machine to one workstation, and there was really no concept of any queues. For some reason, the same protocol was still used now that it had become a cloud-based application, and items would be processed remotely in an asynchronous way, and on a much larger scale (many machines sending items, and many workstations processing them)… so that now the items would indeed queue up.

So they created a very nasty hack to try and get the offline buffer size into the cloud system: Each item contains an XML message. They added a new field to the header part of the XML, so that an item can contain the current offline buffer size. The system can then parse the header, and add this size to its own ‘in flight’ buffer, and show this on the dashboard.

Do you already see why this can go horribly wrong? Well, it’s subtle, but the results are disastrous… There are two basic flaws here:

  1. The protocol only transfers data whenever a new item is fetched. As long as no item is processed at the back, no new item is fetched at the front.
  2. The value in the XML header is static, so it is ‘frozen in time’ when the XML is generated.

The first flaw could be worked around with a very nasty hack: use small time-outs on items, so that even when nothing is being processed, items will time-out, leading to the fetching of a new item, so that its XML header can be parsed and the offline buffer size can be updated.

The second flaw is a bigger problem: New items can be added to the offline buffer continuously. So the first item you add would have an offline buffer size of 1. It was the first item. By the time the system fetches it for processing, perhaps hundreds of new items have been added. But still, the XML of the first item will contain ‘1’. Likewise, if the last item was added while the offline buffer had say 3000 items, its XML header would read ‘3000’. So the system will fetch it, and it will update its dashboard to show ‘3000’, even though the buffer is now empty.

The workaround for the first flaw doesn’t exactly make things better: you can use short time-outs, but these items need to be processed. So you make another workaround to re-feed these items into the system. But now you are re-feeding items with offline buffer sizes that do not even reflect the current system state. They still read the size from the time they were created.
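The second flaw is easy to reproduce in a few lines of Python. This is only a sketch under assumptions: the XML element names (`item`, `header`, `offlineBufferSize`) and the helper functions are invented for illustration, and the ‘in flight’ count is left out so that only the offline buffer is shown.

```python
import xml.etree.ElementTree as ET
from collections import deque


def make_item(item_id, offline_size):
    # The size is written into the header once, when the XML is generated.
    # After that it never changes, whatever happens to the buffer.
    return (
        "<item><header>"
        f"<offlineBufferSize>{offline_size}</offlineBufferSize>"
        "</header>"
        f"<body id='{item_id}'/></item>"
    )


def reported_size(xml_text):
    # What the cloud side reads from the header when it fetches an item.
    return int(ET.fromstring(xml_text).findtext("header/offlineBufferSize"))


def simulate(n_items):
    """Fill the offline buffer, then drain it.

    Returns a list of (size reported by header, real remaining size) pairs.
    """
    offline = deque()
    for i in range(n_items):
        offline.append(make_item(i, len(offline) + 1))

    readings = []
    while offline:
        xml_text = offline.popleft()
        readings.append((reported_size(xml_text), len(offline)))
    return readings


if __name__ == "__main__":
    readings = simulate(3000)
    print(readings[0])   # prints (1, 2999)
    print(readings[-1])  # prints (3000, 0)
```

The first fetched item reports 1 while 2999 items are still waiting, and the last one reports 3000 while the buffer is empty. Re-feeding timed-out items does not help, since each re-fed item still carries the value from the moment its XML was generated.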

I just can’t get over what a huge brainfart this whole thing is. This ‘system’ can never even be remotely accurate. The problem is similar to the first example with the messaging system though: the same protocol is used in various codebases in various forms for various clients. Trying to change it now is opening a huge can of worms. There are so many components that you’d need to modify, re-test and update with various clients that it’s not economically viable.

What surprised me most is that the company actually got away with this. Or at least, they did, until they sold it to one customer that WAS picky, because the queue sizes on the dashboard were very relevant to the way they optimized their business process. How many items do we have? How many new items can we still feed? How quickly should we process the items?


3 Responses to Bugs you can’t fix

  1. Fascinating problems. Old design decisions are very hard to change. And the funny thing is that the old decisions weren’t necessarily wrong at the time they were made, although I think in your examples the problems should have been obvious more or less from the beginning.

    I’m probably lucky to be working on software which is installed on end users’ systems and a new version has a considerable leeway in changing the internal design. But even there, compatibility with existing files makes certain changes almost impossible.

    Still, correcting old design mistakes is an awful lot of work even when it is doable.

    • Scali says:

      True, there are cases where an old system isn’t necessarily bugged, but it just cannot be used for newer situations. I suppose operating systems are a good example. DOS did what it was supposed to do. It wasn’t a very buggy OS. But as time went on, people started expecting more from an OS. Things like protected mode, pre-emptive multitasking, a GUI, multi-user support, networking etc. And while there have been attempts to extend DOS in these ways, in the end it was just better to use an OS that was designed for all these tasks in the first place.

      I think in this case, the example I gave with the message system was a design flaw. Re-using the same buffer without any locking was never going to work 100%. That was just a bad design decision (and I assume it wasn’t a conscious decision, but rather an oversight).
      If you know the context of the second example, it’s a bit worse.
      Namely, the first generation of that system was designed to have FIFO-behaviour with per-queue priority only: the system would consume items from a collection of queues, and you could assign a priority to each queue, so that it would consume items more often from queues with higher priority.
      This was a conscious decision, and I suppose it was one that fit the original requirements just fine.

      However, they then sold the same system to another client, who wanted a very different type of priority: they wanted to specify a ‘priority time’ for each item they inserted into the system.
      Now, I made an analysis of this problem, and wrote a document explaining how FIFO queues could never solve this problem, especially not in the way they were implemented in our system. The queues only had limited capacity, so only an X number of items could be ‘in flight’ at any given time. This means that once the queues are full, you can’t even insert new items into the system in the first place, let alone insert items with a lower priority time than the X items already ‘in flight’, so that the new items ‘overtake’ the items already in the system (a very realistic use-case).

      So instead I drafted up a design that used a simple database instead of queues, so that items could be inserted at all times, and you could query them on priority time (and some other parameters).
      But what happened, when they set up a team to build a new system? I gave them my documentation, explained how the database-driven priority could be implemented and everything… But they ended up building a system with FIFO-queues anyway!
      Basically they just copied the old system for some reason. And what’s worse, they didn’t even implement ANY prioritizing at all, not even the per-queue priority that the old system had.

      So this was not even an old system. It was brand new. I don’t know what happened exactly, but apparently nobody understood my explanation of why FIFO buffers aren’t going to work, and they didn’t bother to ask me to explain either. They just figured they could build a system without understanding the problem, I suppose. And then 4 months later they suddenly find that they painted themselves into a corner, and they can start fresh. I built a working proof-of-concept of the database-driven prioritizer (and everything that goes with it, including an interface to display correct buffer counts at all times, and handle timeouts in a reliable and robust manner, things the old system never could) singlehandedly in only 12 days. My design wasn’t all that complex. If only they had done that right away (or asked me to do it, if they couldn’t), instead of wasting 4 months on something that was never going to work.
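For illustration, the idea behind the database-driven approach can be sketched in a few lines of Python with the standard `sqlite3` module. This is not the proof-of-concept mentioned above: the table layout and the function names (`open_store`, `insert_item`, `fetch_next`, `pending_count`) are assumptions made up for this sketch, and ‘priority time’ is treated as a plain number where lower means ‘process sooner’.

```python
import sqlite3


def open_store():
    # In-memory database for the sketch; a real system would use a file or server.
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE items ("
        " id INTEGER PRIMARY KEY,"
        " priority_time REAL NOT NULL,"
        " payload TEXT NOT NULL)"
    )
    return db


def insert_item(db, priority_time, payload):
    # Items can be inserted at any time; there is no fixed 'in flight' capacity.
    db.execute(
        "INSERT INTO items (priority_time, payload) VALUES (?, ?)",
        (priority_time, payload),
    )


def fetch_next(db):
    # Lowest priority time first; insertion order breaks ties.
    row = db.execute(
        "SELECT id, payload FROM items ORDER BY priority_time, id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("DELETE FROM items WHERE id = ?", (row[0],))
    return row[1]


def pending_count(db):
    # The count is queried from the store itself, so it is always current.
    return db.execute("SELECT COUNT(*) FROM items").fetchone()[0]


if __name__ == "__main__":
    db = open_store()
    insert_item(db, 100.0, "a")
    insert_item(db, 200.0, "b")
    insert_item(db, 50.0, "urgent")  # inserted last, but overtakes the others
    print(fetch_next(db))     # prints urgent
    print(pending_count(db))  # prints 2
```

The late ‘urgent’ item overtakes the items that were already waiting, which a fixed-capacity FIFO queue cannot do, and the pending count comes from a query rather than from a value frozen inside an item.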
