Read any good books lately?

Recently, I have written a few things about Agile methods and Scrum. I feel that they are not the solution to everything, because in certain scenarios they have severe risks and drawbacks. I warned about how certain (sub-)problems cannot easily be solved in an iterative fashion, and require proper thinking, design and perhaps even prototyping before you move into Scrum.

Now, I have been reading the book “Requirements Engineering” by Axel van Lamsweerde. I did not expect the topic to pop up, but there it was. And in retrospect, it makes sense: requirements engineering is very closely related to the problem I was describing. You would not generally take a Scrum-approach, but you’d rather have a more spiral approach, where you refine your results in ever larger iterations (much like how you would first develop a prototype with a small team, and then refine it to the end-product with a larger team).

Van Lamsweerde does one better than that, however. He specifically goes into Agile development and points out that Agile methods only work under specific assumptions. In other words: they should only be applied in specific scenarios, and are not the solution to everything. He lists the following assumptions:

  1. All stakeholder roles, including the customer and user roles, can be reduced to a single role.
  2. The project is small enough to assign to a single small-sized team at a single location.
  3. The user is available for prompt and efficient interaction.
  4. The project is simple and non-critical enough to give little or no consideration to non-functional aspects, environmental assumptions, underlying objectives, alternative options and risks.
  5. The user can provide functional increments quickly, consistently (so that no conflict management is required) and gradually from essential to less important requirements (so that no prioritization is required).
  6. The project requires little documentation for work coordination and subsequent product maintenance. Precise requirement specification before coding is not an issue.
  7. Requirements verification before coding is less important than early release.
  8. New or changing requirements are not likely to require major code refactoring and rewrite, and it is likely that the people who develop the product will also maintain it.

He specifically points out that mission-critical software is one area where agility is a poor fit. He also points out that ‘agility is not a binary notion’: for each phase of the project, you can choose to use more or less agility, depending on what is required for that phase. I think this is quite interesting, because Scrum consultants seem to claim the exact opposite: any deviation from the strict rules of Scrum is considered a ‘Scrumbut’, and they always advise applying Scrum more strictly.

Now, getting back to the example I gave earlier, where the team developed a system based on queues when they should have used a database… Quite a few of the eight assumptions above did not apply in that scenario, so Scrum was a poor choice from the start. The queues-vs-databases example itself immediately points to a violation of assumption 8: major code refactoring was required because of the ‘new’ requirements for prioritization. But the project was also mission-critical for the customer, and it was a fixed-price, fixed-deadline project, which violates assumption 4. The customer was in another country, so assumption 3 did not apply either. I would say that at least assumptions 1 and 6 were violated as well.

When I first heard of Agile and Scrum, it was mostly related to web-application development. And in retrospect, this makes sense. I suppose (small) web-applications tend to be a reasonable fit for Scrum. However, as we also discussed in the comments of the earlier articles, there are also projects that are highly technical and experimental in nature, and those are a very poor fit for Scrum. A spiral-like approach is much more natural there: one or two people start by collecting the requirements and iteratively working out the risks and problematic areas, perhaps developing some proof-of-concepts/prototypes along the way. They then refine the requirements, design and so on, so that the project gradually matures. Only then does it start making sense to add more people to the project, because there are now reasonably well-defined goals and workloads to be distributed.

The irony of it all is that Scrum never claimed to be the solution to everything. Take the original paper by Ken Schwaber: Figure 6 clearly shows the Scrum Methodology, defining three phases:

[Figure 6 from Schwaber’s paper: the Scrum methodology]

  1. Planning & System Architecture (Pregame)
  2. Sprints (Game)
  3. Closure (Postgame)

The way phases 1 and 3 are drawn in the figure shows that they are not really part of what Scrum is about (the sprints). They are drawn in a way similar to the Waterfall methodology in Figure 1.

Schwaber further remarks:

The first and last phases (Planning and Closure) consist of defined processes, where all processes, inputs and outputs are well defined. The knowledge of how to do these processes is explicit. The flow is linear, with some iterations in the planning phase.

So basically, Scrum was never meant to solve everything with sprints. Not the planning, not the (initial) requirements, and not the system architecture. For some reason, everyone seems to have forgotten about this, and thinks Scrum is about doing everything in sprints, even all the design. That is a recipe for disaster with any non-trivial project.

Anyway, it’s always nice when your findings are confirmed independently by academic sources.


Intel tries its hand at a discrete GPU again?

As you may have heard, a few months ago Intel hired Raja Koduri, former GPU designer at AMD’s Radeon division. Back then, the statement already read:

In this position, Koduri will expand Intel’s leading position in integrated graphics for the PC market with high-end discrete graphics solutions for a broad range of computing segments.

But at the time it was uncertain what exactly they meant by ‘discrete graphics solutions’, or what timeframe we were talking about.
Now there is news that Intel has also hired Chris Hook, former Senior Director of Global Product Marketing at AMD.
And again, the statement says:

I’ll be assuming a new role in which I’ll be driving the marketing strategy for visual technologies and upcoming discrete graphics products.

So, there really is something discrete coming out of Intel, and probably sooner rather than later, if they are thinking about how to market this technology.

See also this article at Tom’s Hardware.

I am quite excited about this for various reasons. It would be great if NVIDIA faced new competition on the (GP)GPU front. Also, Intel was and still is Chipzilla: they have the biggest and most advanced chip production facilities in the world. I’ve always wondered what an Intel GPU would be like. Even if their GPU design isn’t quite as good as NVIDIA’s, their manufacturing advantage could tilt things in their favour. I’ve also said that although Intel GPUs aren’t that great in terms of performance, you have to look at what these chips are. Intel always optimized their GPUs for minimum power consumption and minimum transistor count, so they only had a handful of processing units, compared to the thousands of units found on high-end discrete GPUs. The real question for me has always been: what would happen if you were to take Intel’s GPU design and scale it up to high-end discrete GPU transistor counts?

Perhaps we will be seeing the answer to this in the coming years. One other thing I pointed out some years ago was that Intel appeared to have changed course in terms of drivers and feature support. In the late 90s and early 2000s, Intel really had very minimal GPUs in every sense of the word. However, when DirectX 10 came around, Intel was reasonably quick to introduce GPUs with support for the new featureset. Sadly it still took months to get the first DX10 drivers, but they did eventually arrive. It would appear that Intel had ramped up their driver department: DX11 was a much smoother transition. And when DX12 came around, Intel was involved in the development of the API, and had development drivers publicly available quite soon (way sooner than AMD). Intel also gave early demonstrations of DX12 on their hardware. And their hardware actually was the most feature-complete at the time (DX12_1, with some higher tier support than NVIDIA).

Let’s wait and see what they will come up with.


The “Visual Basic Effect”

The BASIC programming language has long been a popular option for beginners. BASIC is an acronym for “Beginner’s All-purpose Symbolic Instruction Code”, so its designers were clearly targeting beginner programmers. As a result, the language is quite easy to understand for people with little or no prior experience with computers: its syntax is close to simple English phrases, and users generally do not need to worry much about data types.

On the Windows platform, Microsoft offered Visual Basic, in two flavours:

  1. The ‘full’ Visual Basic for stand-alone Windows programs.
  2. Visual Basic for Applications, a ‘scripting’ language for use inside MS Office and various other Windows applications.

Visual Basic was mostly used by inexperienced programmers, and people from outside the world of software development. Especially VBA, which was used by (non-programmer) people to add functionality to their Word and Excel documents, for example. Some even made ‘complete’ (and sometimes business-critical) applications this way. This resulted in poor average quality of Visual Basic solutions. This is what I would describe as the Visual Basic Effect.

Mind you, in the hands of an experienced and properly educated software developer, Visual Basic can be used for decent software development and can produce applications of good quality, even though Visual Basic obviously has its limitations (then again, don’t all programming languages? To a significant degree, software development is all about working within limitations).

This is not limited to just Visual Basic; I think there is some ‘Visual Basic Effect’ in every programming language or environment. The most obvious example these days is JavaScript. A lot of JavaScript is written by people who aren’t necessarily software engineers, but rather web designers who have learnt a bit of JavaScript to implement their designs.

As a result, many websites have highly inefficient, unstable, poorly written and bloated scripts. Now, I grew up with 8-bit computers running at just 1 MHz, with a handful of kilobytes of memory, and those machines could respond to user input ‘instantly’. I simply can’t get my head around the fact that today there are many websites that take a second or more to respond to a simple button-click, even on a computer with many cores, many MHz, and many gigabytes of memory. And certainly, there are fine examples of JavaScript-driven web pages that are very elegant and efficient. Of course, these are designed and implemented by proper software architects and engineers, not necessarily ‘people who build websites and know JavaScript’.

A related thing I encountered was with image processing, where it seems common to use Python to interact with a library such as OpenCV. Now, OpenCV itself is written in C++, by people who understand the intricacies of image processing, and know how to optimize it at a low level. The people who use OpenCV with Python are likely similar to the BASIC and JavaScript-crowd: People who aren’t necessarily software engineers themselves, but use Python as a simple tool to access OpenCV as a toolkit.

When people at work were looking to hire, they found that it was easier to find Python programmers with OpenCV experience than C/C++ programmers. I pointed out the ‘Visual Basic Effect’ and said: “Are you sure the Python programmers are what you want, though? They will know how to use OpenCV, but if what you are looking for is actually a custom re-implementation of certain functionality that is in, or similar to, OpenCV, these Python programmers are probably not going to be up to the task.” You want the type of programmer that can develop something like OpenCV, and those are usually C/C++ people. These people think at a different level of abstraction, the level required to implement and optimize image processing routines such as those found in OpenCV. Simply using OpenCV and rigging some processing routines together in a Python script is a completely different affair. Clearly, anyone well-versed in C/C++ is capable of picking up Python in a matter of days and doing those tasks as well. But not the other way around.

I suppose that brings me back to the ‘aptitude’ that Joel Spolsky pointed out in his Guerrilla Guide to Interviewing. People who interview others for a job by just asking whether they have experience with certain tools, or by asking a lot of basic facts, like a ‘quiz show’, are missing the point. You don’t want to know if the person you are interviewing happens to be familiar with the buzzword-of-the-day in the niche that your company’s latest project happens to be operating in, or if he or she happens to be familiar with the flavour-of-the-month tools and languages that your company is currently using. No, you want to know if the person has the aptitude to pick up the relevant knowledge and apply it in ways that generate value for your company.

And in that sense, it may often be better to hire people who are smart and experienced in other areas, than those who happen to know the tools you’re currently interested in, but have limited understanding and experience otherwise. In the case of the Visual Basic effect, you’re probably better off hiring non-VB programmers, because the developers that use more difficult and advanced languages such as C++ and C# will generally also be able to design and implement better VB code than people whose only exposure to software development has been VB.

There is just this huge difference between people who ‘know how to do X’, and people who have a solid understanding of relevant technologies, and can apply them to a wide range of tools and problems.

 


What is software development? An art? Craft? Trade? Occupation?… Part 3

In the previous two parts we have seen that software development is a very large collection of technologies and methodologies, and an area that has been in constant development since the advent of computers. However, even though every few years there is a new idea that is all the rage, presented as the ‘silver bullet’ that will solve all our problems and massively improve quality and productivity, it seems to fall flat on its face every time, once the initial hype dies down. In this part I would like to look at this from a different angle, and try to find some things that DO actually work.

Are new things really new?

In past blogs, I have given some examples of ideas that aren’t really new, but can be found in papers or books published decades ago. I recently found this presentation by Kevlin Henney, which goes much deeper into this, and gives a great perspective on these ‘new’ ideas:

He gives a number of interesting historical examples of things that people consider ‘new’ today, but apparently were already being discussed back in the 1960s and 1970s. Some examples include abstract datatypes and objects, testing and iterative development.

This brings us back to Parnas (and in fact, a quote from Parnas from the 1970s appears in the above presentation by Henney). In his ‘Software Aging’ article that I mentioned last time, Parnas describes exactly the type of thing that Henney discusses: a lot of ideas are already around, but they are not being communicated well throughout the software development community. The ideas are either not known at all, or they are not understood properly, in the context in which they were meant. As a result, they are also not taught to new developers.

Now, when I talk to the average Agile/Scrum fanatic, it appears that they think they have finally found the answer to software development, and that all those old developers were simply messing around with waterfall models and didn’t know what they were doing.

I think Henney proves a very important point in his presentation: these people DID know what they were doing. It is the people who think that these ‘new’ ideas are really new who don’t know what they are doing. They are ignorant of the past, and are reinventing the wheel to a certain extent. And it seems to come down to the usual ignorant-arrogant pattern that we often see with the Dunning-Kruger effect: in their ignorance, they assume that they know more than the others, so they assume that their ideas are new, and that the people before them weren’t as smart as they are.

It had to come from somewhere…

Now, if you think about it some more… Object-Oriented Programming (OOP) has been around for a long time now. The whole concept of Object-Oriented Programming had to come from somewhere. People didn’t just get together and think: “Let’s design some new programming language… here’s an idea! Let’s put objects in there!”. No, the concept of objects existed long before that. OOP languages are basically syntactic sugar for a way of programming that people already practised in Pascal, C and assembly: working with objects.

It just seems that these ideas were forgotten or poorly understood, so when new generations of developers started learning object-oriented programming, they were not taught the right way. They understand how to write a working program in an object-oriented language, but they do not know the languages that came before, or what problems object orientation tried to solve, let alone how best to solve them.

Now, Design Patterns will give you some examples of good solutions, but I am still missing something: nobody seems to tell you what the idea behind them is at the lower level. Let me try to give you a quick (and somewhat simplified) example:

An object can be seen as two sets of data:

  1. The actual data that is stored inside the object
  2. The methods of that object, laid out in a table

Methods are data? Well yes, they are function pointers. Methods are nothing other than procedural functions where the first argument is a pointer to the object. That is how inheritance and (virtual) methods work: You can store a function pointer inside the object itself. So if you pass an object to a method as a parameter, you are also passing the functions that should be called. So the object decides which function is called. The interface decides how the table of functions is laid out, so that index N in the table is always DoSomething(A, B, C) for all objects that implement that interface: it is a ‘contract’.

And when you understand objects at that basic level, it makes sense to just use objects and interfaces to pass function pointers around, so you can perform specific callbacks on that object from other methods and objects. And then Design Patterns such as Strategy and Bridge may suddenly make a lot more sense.
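
To make this concrete, here is a minimal sketch in plain C (my own illustration, not taken from any particular book or language runtime) of an ‘object’ as data plus a table of function pointers, where the table layout acts as the interface/contract:

```c
#include <stdio.h>

struct shape;                                    /* forward declaration */

/* The 'interface': a fixed table layout, so slot 0 is always area(),
   slot 1 is always print(), for every type that implements it. */
struct shape_vtable {
    double (*area)(const struct shape *self);
    void   (*print)(const struct shape *self);
};

/* The 'object': its actual data, plus a pointer to its method table. */
struct shape {
    const struct shape_vtable *vtbl;
    double w, h;
};

static double rect_area(const struct shape *s)  { return s->w * s->h; }
static void   rect_print(const struct shape *s) { printf("rectangle %g x %g\n", s->w, s->h); }

static const struct shape_vtable rect_vtbl = { rect_area, rect_print };

/* This function only knows the interface; the object it receives decides
   which functions actually get called (the 'callback' idea from above). */
static void describe(const struct shape *s)
{
    printf("area = %g\n", s->vtbl->area(s));
    s->vtbl->print(s);
}

int main(void)
{
    struct shape r = { &rect_vtbl, 3.0, 4.0 };
    describe(&r);
    return 0;
}
```

An OOP language generates essentially this plumbing for you, and patterns like Strategy boil down to choosing which table (or which individual function pointer) an object carries.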

And of course you will also understand that people were doing this long before the Gang of Four published their book in 1994, and long before object-oriented programming languages were around.

As for hype… I have mentioned the No Silver Bullet article by Fred Brooks many times already. Even way back in 1986, he warned about ‘new’ technologies and methodologies, surrounded by a lot of hype, which never seemed to deliver on their promises.

So clearly, there are some very good lessons that can be learnt from older sources. They are out there. You just need to look them up. Stand on the shoulders of those giants.

Get to the point already!

Brooks describes the ‘great designer’. Donald Knuth titled his most famous work ‘The Art of Computer Programming’, and has described programming as an art and a science. There is a very interesting study by Barry W. Boehm in his book ‘Software Engineering Economics’, where he asserts that the key driver of software cost is the capability of the software development team. There is the concept of the ‘10x programmer’, first seen in the paper ‘Exploratory experimental studies comparing online and offline programming performance’ by Sackman, Erikson and Grant in 1968. And Joel Spolsky has pointed out that you should be looking for an ‘aptitude’ for programming, in his Guerrilla Guide to Interviewing.

The constant here appears to be the ‘human factor’, not some kind of ‘magic’ technology that you have to use, or some kind of ‘magic’ methodology that makes everyone a great developer. It is about individual people: their talents, and their knowledge, skills and experience. The talent is something you cannot control. Knowledge, skills and experience you can improve. You can train yourself to become a better developer, and you can train others to become better developers.

And that brings me to the actual point here. As the title suggests, you can view software development in many ways. Viewing it as an art or a craft makes a lot of sense in this particular context: just like other arts and crafts, you can improve, and aim to master your craft. And you can guide less experienced developers to also become better by giving them proper advice and guidance.

This brings me to another recent movement in the world of software development. One that perhaps does not quite have the marketing skills of the Agile/Scrum movement, but their ideas are certainly no less valuable. On the contrary. I am talking about ‘Software craftsmanship’.

They invite you to view the software development profession as something like a medieval guild. And of course they have also drawn up their own Manifesto. For now I will leave you with Kevlin Henney once again, with a talk he gave on this subject. It aligns nicely with some of the things I have said on this blog before, and some things I haven’t:


Bugs you can’t fix

Although I generally want to avoid the usual car analogy, in this case I am speaking from real-world experience that happened to be car-related, so you will have to excuse me.

No car is ‘perfect’… Every car has certain ‘bugs’, as in design-flaws. There may be excessive wear on certain parts, or some things may just fail unreasonably quickly. You can go to the dealership and try to have it fixed, but if they only have replacement parts for you, the wear or failure will occur again and again. They have a nice term for this: “product characteristic”. It’s not quite the same as “it’s not a bug, it’s a feature!”, although it may feel like it. It’s just that there are inherent flaws in the design that cause the excessive wear or failure. And the mechanic can replace parts or make adjustments, but he can’t redesign your car.

Over the years, I’ve encountered a handful of situations where I ran into software bugs which, as I progressed in my debugging, turned out to be ‘unsolvable’, much like the car-design example above. Luckily, in my experience they are very rare. But they do pop up every now and then, and when they do, it’s a huge mess. I thought that would make them an interesting topic to discuss.

Shared code is great

The first example I want to give was in a suite of applications that were distributed over a number of workstations, and connected together via a network-based messaging system.

A bug report came in, and I was asked to investigate it: an application that printed out graphs of sensor data in real time would often print random empty pages in between, but continued to function fine otherwise.

So I started debugging on the side of the printing app. I found that the empty pages were actually correct: sometimes it would receive empty messages. It just printed the data as it received it; all the code appeared to function as expected.

Then I approached it from the other side, to see if there were empty messages being sent by the sensor. But no, the sensor was working fine, and didn’t have any drop-out. So… the problem is somewhere between the sensor and the printing app. What is happening in the messaging system?

And that’s where I found the problem: the messaging system was designed so that you could register a message of a certain ‘type’ under a unique name. It would allocate a receive buffer for each registered message. Do you see the problem already? There is only one buffer, which is re-used for every message sent under that type and name. For small messages you can usually get away with it. However, this sensor was sending large batches of data in each message, and at a relatively high frequency.

This led to a race condition: the printing app would have to finish printing the data before the next message came in, because the new message would simply overwrite the buffer.

There was no locking mechanism in place, so there was no way for the printing app to tell the messaging system to hold off the next message until it was finished with the previous one. So the only thing I could do in the printing app was to copy every message into an internal buffer as quickly as possible, to minimize the ‘critical time’ during which the data needs to remain valid in the shared buffer.
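
To illustrate the flawed pattern, here is a hedged sketch (the names and sizes are made up; this is not the actual messaging API) of what the printing app was reduced to doing:

```c
#include <stddef.h>
#include <string.h>

#define MSG_BUF_SIZE 65536

/* One receive buffer per registered message name, reused for every message. */
static char shared_buf[MSG_BUF_SIZE];

/* The printing app's own copy of the most recent message. */
static char private_copy[MSG_BUF_SIZE];

/* Hypothetical callback invoked by the messaging layer on arrival. */
void on_sensor_message(size_t len)
{
    /* Copy out of the shared buffer as quickly as possible, to shrink the
       window in which the next incoming message can overwrite it. Without
       a lock or per-message buffers, that window can never be closed
       completely, which is why the empty pages never fully went away. */
    memcpy(private_copy, shared_buf, len);

    /* ...format and print from private_copy at leisure... */
}
```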

This improved the situation, but it did not fix it completely. The occasional empty page still slipped through, probably because of network congestion, so that the last packet of one message was immediately followed by the first packet of the next message, overwriting it right away.

Why was this bug unfixable? Because the API and protocol design of the messaging system were simply flawed. Taking care of the race condition would require a rewrite of the messaging system and of all applications using it. In theory it could be done, but it would mean that you could not just roll out an update of the printing app: you’d have to roll out updates for all applications in the entire suite, because they all share the same messaging system, and need to work with the same version of the protocol and API to avoid problems. This was just not economically viable, so the bug couldn’t be fixed.

The wrong data in the wrong structure

The second example is one that I briefly mentioned before: a system that was designed around FIFO queues, when the requirements demanded far more flexibility than plain FIFO to achieve the required routing and prioritization.

Again this is a case where someone made a fundamental design decision that was downright wrong. Since it is so fundamental to the system, the only way to fix it is to do a complete redesign of the core functionality. ‘Fixing’ the system is basically the same as just scrapping it and restarting from scratch.

Basically they spent a few months designing and building a bicycle, which does the job (checks the ‘working software’ Agile box) for short-distance trips. But they did not read the requirements carefully, which clearly stated that they had to be able to reach places like Tokyo. What they should have built was a plane. And a plane is so fundamentally different from a bicycle that virtually no parts of the bicycle’s design or implementation can be reused.

Two for the price of one

That same system also had another ‘interesting’ problem: The queue sizes that it reported on its dashboard were never accurate. How these queue sizes are calculated takes a bit of explanation, so I hope I can get the point across.

The system is designed to handle a number of queues, where each queue can hold hundreds to thousands of items. Or at least, that is the ‘queue’ as the user thinks of it: the number of items they are actually trying to process with the system.

The implementation, however, was built up of a number of processes connected via a network, each with a small internal queue. They could not hold all the data in memory at any one time, and it was also undesirable to do so, since a system crash would mean that all that data would be lost; so the philosophy was to keep the number of in-memory items to a minimum.

What this meant was that a small number of items were ‘in flight’ in these processes, while a larger ‘offline buffer’ was still waiting to be fed into the system. In the initial version, this offline buffer was not visible at all, so all you saw was the number of in-flight items, which was generally insignificant (perhaps in the range of 8-96 items) compared to the offline buffer.

So, the customers wanted to see the offline buffer as well. This is where things started to go wrong… The system was built on top of an existing protocol, which was meant for local use only: items would be fed directly from one machine to one workstation, and there was really no concept of any queues. For some reason, the same protocol was still used now that it had become a cloud-based application, and items would be processed remotely in an asynchronous way, and on a much larger scale (many machines sending items, and many workstations processing them)… so that now the items would indeed queue up.

So they created a very nasty hack to try and get the offline buffer size into the cloud system: each item contains an XML message, and they added a new field to the header part of the XML, so that an item can carry the current offline buffer size. The system can then parse the header, add this size to its own in-flight count, and show the total on the dashboard.

Do you already see why this can go horribly wrong? Well, it’s subtle, but the results are disastrous… There are two basic flaws here:

  1. The protocol only transfers data whenever a new item is fetched. As long as no item is processed at the back, no new item is fetched at the front.
  2. The value in the XML header is static, so it is ‘frozen in time’ when the XML is generated.

The first flaw could be worked around with another very nasty hack: use small time-outs on items, so that even when nothing is being processed, items will time out, which leads to a new item being fetched, so that its XML header can be parsed and the offline buffer size updated.

The second flaw is a bigger problem: new items can be added to the offline buffer continuously. So the first item you add would have an offline buffer size of 1: it was the first item. By the time the system fetches it for processing, perhaps hundreds of new items have been added, but the XML of that first item will still contain ‘1’. Likewise, if the last item was added while the offline buffer held, say, 3000 items, its XML header would read ‘3000’. So the system will fetch it and update its dashboard to show ‘3000’, even though the buffer is now empty.

The workaround for the first flaw doesn’t exactly make things better: you can use short time-outs, but these items need to be processed. So you make another workaround to re-feed these items into the system. But now you are re-feeding items with offline buffer sizes that do not even reflect the current system state. They still read the size from the time they were created.
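
Here is a tiny sketch (my own illustration, with invented names, not the actual code) of why the reported number can never be right: the size is captured when the item is created, not when it is displayed:

```c
#include <stdio.h>

struct item {
    int offline_size_at_creation;  /* written into the XML header once */
    /* ... actual payload ... */
};

/* Producer side: the size is 'frozen in time' here. */
struct item make_item(int current_offline_size)
{
    struct item it = { current_offline_size };
    return it;
}

/* Consumer side, possibly hours later: the offline buffer may meanwhile be
   empty or hold thousands of items, but the dashboard shows whatever was
   true at creation time, plus the handful of in-flight items. */
void update_dashboard(const struct item *it, int in_flight)
{
    printf("queue size: %d\n", in_flight + it->offline_size_at_creation);
}

int main(void)
{
    struct item first = make_item(1);   /* buffer held 1 item back then */
    /* ...3000 items get added before 'first' is finally fetched... */
    update_dashboard(&first, 8);        /* still reports 1 + 8 = 9 */
    return 0;
}
```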

I just can’t get over what a huge brainfart this whole thing is. This ‘system’ can never even be remotely accurate. The problem is similar to the first example with the messaging system though: the same protocol is used in various codebases in various forms for various clients. Trying to change it now is opening a huge can of worms. There are so many components that you’d need to modify, re-test and update with various clients that it’s not economically viable.

What surprised me most is that the company actually got away with this. Or at least, they did, until they sold it to one customer that indeed WAS picky, as the queue sizes on the dashboard were very relevant to the way they optimized their business process. How many items do we have? How many new items can we still feed? How quickly should we process the items?


What is software development? An art? Craft? Trade? Occupation?… Part 2

Software Engineering seemed like a good idea at the time, and the analogy was extended further to Software Architecture around the 1990s: first design a high-level abstraction of the complex system, trying to reduce complexity and risk, and improve quality by dealing with the most important design choices and requirements first.

And while proper Software Engineering and Software Architecture can indeed deliver very high-quality software, there is a huge practical problem: time. It takes a lot of time to do everything ‘right’, and things can get very bureaucratic, with lots of documentation and rules and such. A related problem is that requirements may change during the design phase, so by the time the design is done, the problem it was designed to solve may have changed into an entirely different problem, and the design is no longer adequate.

So Software Engineering/Architecture was, once again, not the Silver Bullet that people were hoping for. This led to new approaches, such as Design Patterns, and slogans such as ‘design for change’, ‘best practices’ and all that. One thing that strikes me when I read books or articles about such topics, or about object-oriented programming in general, is that everyone seems to use different terminology. They may use different names for the same concepts, or even worse, the same names for different concepts. And we are talking about very basic concepts such as ‘encapsulation’, ‘implementation’, ‘aggregation’ and things like that.

This shows, at the very least, that Software Engineering is still a very new and immature field. It might also show a different aspect, namely that the people within the Software Engineering community are not communicating and working together very well.

If you read David Parnas’ work on Software Aging, he already mentioned this back in 1994:

Software is used in almost every industry, e.g. aircraft, military, automotive, nuclear power, and telecommunications. Each of these industries developed as an intellectual community before they became dependent upon software. Each has its own professional organisations, trade organisations, technical societies and technical journals. As a result, we find that many of these industries are attacking their software problems without being aware of the efforts in other industries. Each industry has developed its own vocabulary and documents describing the way that software should be built. Some have developed their own specification notations and diagraming conventions. There is very little cross-communication. Nuclear Industry engineers discuss their software problems at nuclear industry meetings, while telecommunications engineers discuss very similar problems at entirely different meetings. To reach its intended audience, a paper on software engineering will have to be published in many different places. Nobody wants to do that (but promotion committees reward it).
This intellectual isolation is inappropriate and costly. It is inappropriate because the problems are very similar. Sometimes the cost structures that affect solutions are different, but the technical issues are very much the same. It is costly because the isolation often results in people re-inventing wheels, and even more often in their re-inventing very bumpy and structurally weak wheels. For example, the telecommunications industry and those interested in manufacturing systems, rarely communicate but their communication protocol problems have many similarities. One observes that the people working in the two industries often do not realise that they have the same problems and repeat each other’s mistakes. Even the separation between safety-critical and non safety-critical software (which might seem to make sense) is unfortunate because ideas that work well in one situation are often applicable in the others.
We need to build a professional identity that extends to people in all industries. At the moment we reach some people in all industries but we don’t seem to be reaching the typical person in those industries.

The paper itself, ironically enough, proves his very point: the term Software Aging has since taken on a different meaning. Parnas meant the aging of code/applications because of maintenance, adding features, and changing needs and expectations:

There are two, quite distinct, types of software aging. The first is caused by the failure of the product’s owners to modify it to meet changing needs; the second is the result of the changes that are made. This “one-two punch” can lead to rapid decline in the value of a software product.

These days, the term Software Aging is instead used for software that exhibits problems after running for an extended period of time, such as resource leaks or data corruption. So the term was ‘reinvented’ by other ‘software engineers’ to mean something entirely different from what Parnas meant by it.

When is a crisis not a crisis?

Parnas also points out that the so-called ‘software crisis’ had already been going on for some 25 years at the time of writing. And despite advancements in ‘software engineering’, the ‘crisis’ apparently has not been solved. So this is not really a crisis; it is a long-term problem. And there are apparently other problems than just the ones that ‘software engineering’ has tried to address so far.

He goes on to explain that ‘engineering’ is something quite different from what people understand by ‘software engineering’. It is also about a certain professional ethic, a certain professional/academic education, professional standards, and a professional identity. In that sense it doesn’t really help that many people who develop software aren’t actually formally educated in computer science, but rather in some other field, such as physics or electrical engineering, and happen to be writing software related to that field.

I personally would describe this as something like “cheapening of the trade”. It can be annoying to have to work with people who haven’t ‘read the classics’, and aren’t familiar with various concepts or customs that proper computer science people take for granted. It can be very difficult to communicate with these people, because they are not working from the same common basis of knowledge. Yet, they are seen as ‘software engineers’ as much as those of us who studied CS at a university.

So is Agile Development the answer?

In recent years, there has been more criticism of Software Engineering/Architecture, mainly from the realm of Extreme Programming and Agile Development. Their philosophy argues that proper Software Engineering/Architecture is too ‘rigid’, has a lot of overhead, and cannot deal with change effectively. So instead, a more lightweight and ‘agile’ way of development is proposed.

But is that the answer? Well, not entirely, as I have argued before. In fact, I would go as far as to say that Agile Development has mostly been adopted by people outside of Computer Science, such as project managers. To them it is attractive that Agile Development, and in particular Scrum, appears to give a very good overview of progress and use of resources. I say ‘appears to give’, because it is a false sense of accuracy: all the ‘progress’ you see is based on estimates, which may or may not be realistic.

While I agree that Extreme Programming and Agile Development make certain good points, and provide some useful methodologies, the obvious problem is that they tend to completely ignore the process of developing, testing and debugging software itself.

In the next part I want to go into these areas, and introduce some movements in software development that focus more on these topics.


Putting the things together, part 2: MIDI and other problems

Remember a few months ago, when I explained my approach to playing VGM files? Well, VGM files are remarkably similar to Standard MIDI files. In a way, MIDI files are also just time-stamped captures of data sent to a sound device. MIDI however is an even stronger case for my approach than VGM is, since MIDI has even higher resolution (up to microsecond resolution, that is 1 MHz).

So when I was experimenting with some old MIDI hardware, I developed my own MIDI player. I then decided to integrate it with the VGM preprocessor, and use the same technology and file format. This of course opened up a can of worms…

(For more background information on MIDI and its various aspects, see also this earlier post).

You know what they say about assumptions…

The main assumption I made with the VGM replayer is that all events are on an absolute timeline with 44.1 kHz resolution. The VGM format has delay codes, where each delay is relative to the end of the previous delay. MIDI is very similar; the main difference is that MIDI is more of a ‘time-stamped event’ format. This means that each individual event has a delay, and in the case of multiple events occurring at the same time, a delay value of 0 is supported. VGM, on the other hand, supports any number of events between delays.

So implicitly, you assume here that the events/commands do not take any time whatsoever to perform, since the delays do not take any processing time for the events/commands into account. This means that, in theory, you could have situations where a delay is shorter than the time it takes to output all the data, so the next event starts while the previous data is still being sent:

[Diagram: overlapping data]

In practice, this should not be a problem with VGM. VGM was originally developed as a format for capturing sound chip register writes in emulators. Since the software was written to run on actual hardware, the register writes implicitly never overlap. As long as the emulator accurately emulates the hardware and accurately generates the delay values, you should never get any ‘physically impossible’ VGM data.

MIDI is different…

With MIDI, there are a number of reasons why you actually can get ‘physically impossible’ MIDI data. One reason is that MIDI is not necessarily just captured data: it can be edited in a sequencer, or even generated from scratch. Aside from that, a MIDI file is not necessarily just a single part, but can be a combination of multiple captures (multi-track MIDI files).

Furthermore, not all MIDI interfaces run at the same speed. The original serial MIDI interface is specified as 31.25 kbps, with one start bit, one stop bit, and no parity. This means that every byte is transmitted as a frame of 10 bits, so you can send 3125 bytes per second over a serial MIDI link. However, there are other ways to transfer MIDI data. For example, if you use a synthesizer with a built-in sequencer, the data does not necessarily have to go through a physical MIDI link; the keyboard input can be processed directly by the sequencer, via a faster bus. Or instead of a serial link, you could use a more modern connection, such as USB, FireWire, Ethernet or WiFi, which are much faster as well. Or you might not even use physical hardware at all, but virtual instruments with a VSTi interface or such.

In short, it is certainly legal for MIDI data to have delays that are ‘impossible’ to play on certain MIDI interfaces, and I have actually encountered quite a few of these MIDI files during my experiments.

But what is the problem?

We have established that ‘impossible’ delays exist in the MIDI world. But apparently this is not usually a problem, since people use MIDI all the time. Why is it not a problem for most people? And why is it a problem for this particular method?

The reason it is not a problem in most cases is that the timing is generally decoupled from the sending of data. That is, the data is generally put into some FIFO buffer, so new data can queue up while the MIDI interface is still busy sending the earlier data.

Another thing is that timing is generally handled by dedicated hardware. If you implement the events with a simple timer that is polled, and process each event as soon as the timer has passed its delay point, then the timing will remain absolute, and it will automatically correct itself as soon as all data has been sent. The timer just continues to run at the correct speed at all times.

Why is this not the case with this specific approach? Because this approach relies on reprogramming the timer at every event, making use of the latched properties of the timer to avoid any jitter, as explained earlier. This only works, however, if the timer is in rate-generator mode, so that it automatically restarts every time the counter reaches 0.

This means that we have to write a new value to the timer before it reaches 0 again, otherwise it will repeat the previous value. And this is where our problem lies: when the counter reaches 0, an interrupt is generated. In the handler for this interrupt, I output the data for the event, and then write the new counter value (actually for two interrupts ahead, not the next one). If I were to write a counter value that is too small, the next interrupt would fire while we are still in the interrupt handler for the previous event. Interrupts are still disabled at that point, so this timer event is missed, and the timer restarts with the same value, meaning that our timing is now thrown off, and is no longer on the absolute scale.
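
For readers who want to picture it, here is a rough sketch in DOS-era C (Borland-style, with made-up helper names; not my actual player code) of the handler structure, and of where a too-small counter value breaks it:

```c
#include <dos.h>

#define PIT_CH0_DATA 0x40          /* PIT channel 0 data port      */
#define PIC_CMD      0x20          /* master PIC command port      */

void send_event_data(void);        /* hypothetical: writes the event's bytes */
unsigned fetch_next_delay(void);   /* hypothetical: delay for a later event  */

static volatile unsigned next_count;   /* precomputed reload value */

void interrupt timer_handler(void)
{
    /* 1. Output the bytes for the event that is due right now. */
    send_event_data();

    /* 2. Write the reload value for a future interval. In rate-generator
       mode the PIT keeps counting its current period undisturbed and only
       loads this value when the current count expires, which is what keeps
       the timeline free of jitter. */
    outportb(PIT_CH0_DATA, next_count & 0xFF);
    outportb(PIT_CH0_DATA, (next_count >> 8) & 0xFF);

    /* 3. Prepare the value for the interval after that. If a delay is so
       short that the counter expires before this handler returns, that
       interrupt is lost and the PIT silently reuses the previous reload
       value: the timing drifts off the absolute scale. This is the case
       the regrouping described below has to prevent. */
    next_count = fetch_next_delay();

    outportb(PIC_CMD, 0x20);       /* send EOI to the master PIC */
}
```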

Is there a solution?

Well, that is a shame… we had this very nice and elegant approach to playing music data, and now everything is falling apart. Or is it? Well, we do know that, worst case, we can send data at 3125 bytes per second. We also know how many bytes we need to send for each event, which means that we can deduce how long it takes to process each event.

This means that we can mimic the behaviour of ‘normal’ FIFO-buffered MIDI interfaces: When an event has an ‘impossible’ delay, we can concatenate its data onto the previous event. Furthermore, we can add up the delay values, so that the absolute timing is preserved. This way we can ensure that the interrupt will never fire while the previous handler is still busy.

So, taking the problematic events in the diagram above, we fix it like this:

[Diagram: regrouped data]

The purple part shows the two ‘clashing’ events, which have now been regrouped into a single event. The arrows show that the delays have been added together, so that the total delay for the event after that is still absolute. This means that we do not lose any accuracy either, since a ‘real’ MIDI interface with a FIFO buffer would have treated it the same way: the second MIDI event would effectively be concatenated to the previous data in the FIFO buffer. It wouldn’t physically be possible to send it any faster over the MIDI interface.

This regrouping can be done for more than just two events: you can keep concatenating data until eventually you reach a delay that is ‘possible’ again: one that fires after the data has been sent.
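
As a sketch of this preprocessing pass (the structures and names are assumptions, not the actual file format), using the 44.1 kHz tick base and the worst-case 3125 bytes per second from above:

```c
#include <string.h>

#define TICK_RATE      44100UL   /* ticks per second on the absolute timeline */
#define MIDI_BYTES_SEC 3125UL    /* 31250 bps / 10 bits per byte              */
#define MAX_EVENT_DATA 4096

struct event {
    unsigned long delay;                  /* ticks until the *next* event      */
    unsigned long length;                 /* number of MIDI bytes in this event */
    unsigned char data[MAX_EVENT_DATA];
};

/* Smallest delay (in ticks, rounded up) in which 'bytes' can physically be
   pushed through a serial MIDI link. */
static unsigned long min_ticks(unsigned long bytes)
{
    return (bytes * TICK_RATE + MIDI_BYTES_SEC - 1) / MIDI_BYTES_SEC;
}

/* Merge events whose delay is 'impossible', just as a FIFO-buffered MIDI
   interface would: concatenate the data, add the delays, keep the absolute
   timeline intact. Returns the new event count. */
int regroup(struct event *ev, int count)
{
    int src = 0, dst = 0;

    while (src < count) {
        ev[dst] = ev[src++];

        /* Keep pulling in following events while the accumulated data
           cannot be sent within the accumulated delay. */
        while (src < count &&
               ev[dst].delay < min_ticks(ev[dst].length) &&
               ev[dst].length + ev[src].length <= MAX_EVENT_DATA) {
            memcpy(ev[dst].data + ev[dst].length, ev[src].data, ev[src].length);
            ev[dst].length += ev[src].length;
            ev[dst].delay  += ev[src].delay;   /* preserve absolute timing */
            src++;
        }
        dst++;
    }
    return dst;
}
```

Since this can be done by the preprocessor, the cost is paid once, up front, rather than in the interrupt handler at playback time.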

Here is an example of the MIDI player running on an 8088 machine at 4.77 MHz. The MIDI device is a DreamBlaster S2P (a prototype from Serdaco), which connects to the printer port. This requires the CPU to trigger the signal lines of the printer port at the correct times to transfer each individual MIDI byte:
