Although I generally try to avoid the usual car analogy, in this case I am speaking from real-world experience that happened to be car-related, so you will have to excuse me.
No car is ‘perfect’… Every car has certain ‘bugs’, as in design-flaws. There may be excessive wear on certain parts, or some things may just fail unreasonably quickly. You can go to the dealership and try to have it fixed, but if they only have replacement parts for you, the wear or failure will occur again and again. They have a nice term for this: “product characteristic”. It’s not quite the same as “it’s not a bug, it’s a feature!”, although it may feel like it. It’s just that there are inherent flaws in the design that cause the excessive wear or failure. And the mechanic can replace parts or make adjustments, but he can’t redesign your car.
Over the years, I’ve encountered a handful of situations where I ran into software bugs, which, as I progressed in my debugging, turned out to be ‘unsolvable’, much like the above example in the car design. In my experience they are very rare, luckily. But they do pop up every now and then, and when they do, it’s a huge mess. I thought that would make them an interesting topic to discuss.
Shared code is great
The first example I want to give was in a suite of applications that were distributed over a number of workstations and connected together via a network-based messaging system.
A bug report came in, and I was asked to investigate it: an application that printed out graphs of sensor data in real time would often print random empty pages in between, but would otherwise continue to function fine.
So I started debugging on the side of the printing app. I found that the empty pages were actually correct: sometimes it would receive empty messages. It just printed the data as it received it, and all the code appeared to function as expected.
Then I approached it from the other side, to see if there were empty messages being sent by the sensor. But no, the sensor was working fine, and didn’t have any drop-out. So… the problem is somewhere between the sensor and the printing app. What is happening in the messaging system?
And that’s where I found the problem: The message system was designed so that you could register a message of a certain ‘type’ under a unique name. It would allocate a receive-buffer for each registered message. Do you see the problem already? There is only one buffer that is being re-used for each message sent under that type and name. For small messages you can usually get away with it. However, this sensor was sending large batches of data in each message, and it also had a relatively high frequency.
This led to a race condition: the printing app had to finish printing the data before the next message came in, because the new message would just overwrite the buffer.
There was no locking mechanism in place, so there was no way for the printing app to tell the messaging system to hold back the new message until it was finished with the previous one. So the only thing I could do in the printing app was to copy every message to a new internal buffer as quickly as possible, to minimize the ‘critical time’ during which the data needs to remain valid in the shared buffer.
This improved the situation, but it still did not fix it completely. The occasional empty page still slipped through, probably because of network congestion: the last packet of one message would be followed immediately by the first packet of the next message, which overwrote the buffer before the copy could be made.
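To make the overwrite concrete, here is a minimal, single-threaded sketch of the idea. It is not the actual messaging system: the `SharedSlot` class and its `deliver` method are hypothetical names I made up for illustration, and the real system worked over a network with packets rather than a plain function call.

```python
class SharedSlot:
    """Hypothetical stand-in for one registered message name:
    a single receive buffer that is reused for every message."""

    def __init__(self, size: int) -> None:
        self.buffer = bytearray(size)

    def deliver(self, payload: bytes) -> None:
        # No lock and no handshake with the reader: the new message
        # simply replaces whatever was in the buffer before.
        if len(payload) > len(self.buffer):
            raise ValueError("payload larger than receive buffer")
        self.buffer[:] = payload.ljust(len(self.buffer), b"\x00")


slot = SharedSlot(8)
slot.deliver(b"AAAA")

borrowed = slot.buffer        # reader works directly on the shared buffer
copied = bytes(slot.buffer)   # reader takes a private copy right away

slot.deliver(b"BB")           # next message arrives before printing is done

# The borrowed data has changed under the reader; the copy has not.
lost = bytes(borrowed[:4]) != b"AAAA"
kept = copied[:4] == b"AAAA"
```

The copy only helps if it completes before the next `deliver`. In this sketch that is guaranteed by the order of the statements, but in the real system nothing enforced that order, which is why the workaround narrowed the window without closing it.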
Why was this bug unfixable? Because the API and protocol design of the messaging system were just flawed. It would require a rewrite of the messaging system and all applications using it to take care of the race-condition. In theory it can be done, but it meant that you could not just roll out an update of the printing app. You’d have to roll out updates for all applications in the entire suite, because they all share the same messaging system, and need to be working with the same version of the protocol and API to avoid problems. This was just not economically viable. So the bug couldn’t be fixed.
The wrong data in the wrong structure
The second example is one that I already briefly mentioned before: a system that was designed around FIFO queues, even though the requirements called for far more flexibility than plain FIFO to get the required routing and prioritization.
Again this is a case where someone made a fundamental design decision that was downright wrong. Since it is so fundamental to the system, the only way to fix it is to do a complete redesign of the core functionality. ‘Fixing’ the system is basically the same as just scrapping it and restarting from scratch.
Basically, they spent a few months designing and building a bicycle, which does the job (checks the ‘working software’ Agile-box) for short-distance trips. But they did not read the requirements carefully, which clearly stated that they had to be able to reach places like Tokyo. What they should have built was a plane. And a plane is so fundamentally different from a bicycle that virtually no parts of the bicycle's design or implementation can be reused for it.
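For readers who want to see the difference in behaviour, here is a small sketch contrasting a plain FIFO queue with a priority queue. The item names and priority values are invented for illustration, and the real requirements also involved routing, which this sketch does not attempt to cover.

```python
import heapq
from collections import deque
from itertools import count


def fifo_order(items):
    """Return item names in strict arrival order; priority is ignored."""
    queue = deque(items)
    return [queue.popleft()[1] for _ in range(len(queue))]


def priority_order(items):
    """Return item names by priority (lower number first).
    A sequence counter keeps arrival order among equal priorities."""
    seq = count()
    heap = [(prio, next(seq), name) for prio, name in items]
    heapq.heapify(heap)
    ordered = []
    while heap:
        ordered.append(heapq.heappop(heap)[2])
    return ordered


# (priority, name) pairs in arrival order; values are made up.
arrivals = [(2, "routine"), (1, "urgent"), (2, "routine-2")]

fifo_result = fifo_order(arrivals)
prio_result = priority_order(arrivals)
```

With FIFO the urgent item has to wait behind whatever arrived first. Getting it out earlier needs a different data structure, not a tweak to the existing one, and that is the point of the bicycle and the plane.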
Two for the price of one
That same system also had another ‘interesting’ problem: The queue sizes that it reported on its dashboard were never accurate. How these queue sizes are calculated takes a bit of explanation, so I hope I can get the point across.
The system is designed to handle a number of queues, where each queue can have hundreds to thousands of items. At least, that is the ‘queue’ as the user thinks of it: the number of items they are actually trying to process with the system.
The implementation, however, was built from a number of processes, which were connected via a network, and each process had a small internal queue. They could not hold all data in memory at any one time, and it was also undesirable to do so, since a system crash would mean that all the data would be lost. So the philosophy was to keep the number of in-memory items to a minimum.
What this meant was that there were a small number of items ‘in flight’ in these processes, and there was a larger ‘offline buffer’ that was still waiting to be fed to the system. In the initial version, this ‘offline buffer’ was not visible at all, so all you saw was the number of ‘in flight’ items, which generally was an insignificant amount (perhaps in the range of 8-96 items) compared to the offline buffer.
So, the customers wanted to see the offline buffer as well. This is where things started to go wrong… The system was built on top of an existing protocol, which was meant for local use only: items would be fed directly from one machine to one workstation, and there was really no concept of any queues. For some reason, the same protocol was still used now that it had become a cloud-based application, and items would be processed remotely in an asynchronous way, and on a much larger scale (many machines sending items, and many workstations processing them)… so that now the items would indeed queue up.
So they created a very nasty hack to try and get the offline buffer size into the cloud system: Each item contains an XML message. They added a new field to the header part of the XML, so that an item can contain the current offline buffer size. The system can then parse the header, and add this size to its own ‘in flight’ buffer, and show this on the dashboard.
Do you already see why this can go horribly wrong? Well, it’s subtle, but the results are disastrous… There are two basic flaws here:
- The protocol only transfers data whenever a new item is fetched. As long as no item is processed at the back, no new item is fetched at the front.
- The value in the XML header is static, so it is ‘frozen in time’ when the XML is generated.
The first flaw could be worked around with a very nasty hack: use short time-outs on items, so that even when nothing is being processed, items will time out. That leads to a new item being fetched, so that its XML header can be parsed and the offline buffer size can be updated.
The second flaw is a bigger problem: New items can be added to the offline buffer continuously. So the first item you add would have an offline buffer size of 1. It was the first item. By the time the system fetches it for processing, perhaps hundreds of new items have been added. But still, the XML of the first item will contain ‘1’. Likewise, if the last item was added while the offline buffer had say 3000 items, its XML header would read ‘3000’. So the system will fetch it, and it will update its dashboard to show ‘3000’, even though the buffer is now empty.
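The following sketch reproduces the ‘frozen in time’ effect on a tiny scale. The XML layout and the `offlineBufferSize` field name are my own assumptions for illustration; the actual header format is not shown here.

```python
import xml.etree.ElementTree as ET
from collections import deque


def make_item(payload: str, offline_size: int) -> str:
    """Serialize an item. The buffer size is written once, at creation."""
    item = ET.Element("item")
    header = ET.SubElement(item, "header")
    # Hypothetical field name; the value never changes after this point.
    ET.SubElement(header, "offlineBufferSize").text = str(offline_size)
    ET.SubElement(item, "body").text = payload
    return ET.tostring(item, encoding="unicode")


def reported_size(xml_text: str) -> int:
    """What the dashboard would read from the item's header."""
    return int(ET.fromstring(xml_text).findtext("header/offlineBufferSize"))


def simulate(n: int) -> list[tuple[int, int]]:
    """Add n items, then drain them one by one.
    Returns (size reported by header, actual remaining size) per fetch."""
    offline = deque()
    for i in range(n):
        offline.append(make_item(f"item-{i}", len(offline) + 1))
    readings = []
    while offline:
        xml_text = offline.popleft()
        readings.append((reported_size(xml_text), len(offline)))
    return readings


readings = simulate(3)
```

With three items, the first fetch reports 1 while two items are still waiting, and the last fetch reports 3 while the buffer is empty. The reported number moves in the opposite direction from the real one, and no amount of parsing on the receiving side can correct for that.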
The workaround for the first flaw doesn’t exactly make things better: you can use short time-outs, but these items need to be processed. So you make another workaround to re-feed these items into the system. But now you are re-feeding items with offline buffer sizes that do not even reflect the current system state. They still read the size from the time they were created.
I just can’t get over what a huge brainfart this whole thing is. This ‘system’ can never even be remotely accurate. The problem is similar to the first example with the messaging system though: the same protocol is used in various codebases in various forms for various clients. Trying to change it now is opening a huge can of worms. There are so many components that you’d need to modify, re-test and update with various clients that it’s not economically viable.
What surprised me most is that the company actually got away with this. Or at least, they did, until they sold it to one customer that WAS picky about it, because the queue sizes on the dashboard were very relevant to the way they optimized their business process. How many items do we have? How many new items can we still feed? How quickly should we process the items?