Experience and skill in software development

I just spotted a number of hits from Ars Technica to my blog. It is a regular occurrence that one of my blog posts gets linked in some online discussion, causing a noticeable spike in my statistics. When that happens, I usually check out the discussion. This was a rare occasion where I actually enjoyed it. It also reminded me directly of a post I made only a few weeks ago: The Pessimist Pamphlet.

You can find this particular discussion here on Ars Technica. In short, it is about a news item on one of Microsoft’s recent patches, namely to the Equation Editor. The remarkable thing here is that they did a direct binary patch, rather than patching the source code and rebuilding the application.

The discussion that ensued seemed to split the crowd into two camps: one camp that was blown away by the fact that you can actually do that, and another camp that had done the same thing on a regular basis. My blog was linked because I have discussed patching binaries on various occasions as well. In this particular case, the Commander Keen 4 patch was brought up (which was done by VileR, not myself).

Anyway, the latter camp seemed to be the ‘old warrior’/oldskool type of software developer, which I could identify with. As such, I could also identify with various statements made in the thread. Some of them closely related to what I said in the aforementioned Pessimist Pamphlet. I will pick out a few relevant quotes:

(In response to someone mentioning various currently popular processes/’best practices’ such as unit tests, removing any compiler warnings etc):

I know people who do all this and still produce shitty code, as in it doesn’t do what its supposed to do or there are some holes that users’ can exploit, etc. There’s no easy answer to it as long as its a human that is producing the code.

I have said virtually the same thing in another discussion the other day:

That has always been my objection against “unit-test everything”.
If you ask me, that idea is mainly propagated by people who aren’t mathematically inclined, so to say.
For very simple stuff, a unit-test may work. For complicated calculations, algorithms etc, the difficulty is in finding every single corner-case and making tests for those. Sometimes there are too many corner-cases for this to be a realistic option to begin with. So you may have written a few unit-tests, but how much of the problem do they really cover? And does it even cover relevant areas in the first place?

I think in practice unit-tests give you a false sense of security: the unit-tests that people write are generally the trivial ones that test things that people understand anyway, and will not generally go wrong (or are trivial to debug when they do). It’s often the unit-tests that people don’t write, where the real problems are.

(People who have had an academic education in computer science should be familiar with both the mathematics and the research into formally proving the correctness of software. And it is indeed a science.)

On to the next:

What you consider “duh” practices are learned. Learned through the trials and efforts of our elders. 20 years from now, a whole generation of developers will wonder why we didn’t do baby-simple stuff like pointing hostile AIs at all our code for vulnerability testing. You know, a thing that doesn’t exist yet.

This touches on my Pessimist Pamphlet, and why something like Agile development came into existence in the first place. Knowing where something came from and why is very important.

The one process that I routinely use is coding standards. Yes, including testing for length required before allocating the memory and for verifying that the allocation worked.

The huge process heavy solutions suck. They block innovation, slow development and still provide plenty of solutions for the untrained to believe their work is perfect – because The Holiest of Processes proclaims it to be so.

Try getting somewhat solid requirements first. That and a coding standard solves nearly every bug I’ve even seen. The others, quite honestly, were compiler issues or bad documentation.

Another very important point: ‘best practices’ often don’t really work out in reality, because they tend to be very resource-heavy, and the bean counters want you to cut corners. The only thing that REALLY gives you better code quality is having humans write better code. That is not achieved with silly rules like ‘unit tests’ or ‘don’t allow compiler warnings’, but by having a proper understanding of what your code is supposed to do, and how you can achieve this. Again, as the Pessimist Pamphlet says: make sure that you know what you’re doing. Ask experienced people for their input and guidance, get trained.

Another one that may be overlooked often:

There’s also the problem that dodgy hacks today are generally responses to the sins of the past.

“Be meticulous and do it right” isn’t fun advice; but it’s advice you can heed; and probably should.

“Make whoever was on the project five years ago be meticulous and do it right” is advice that people would generally desperately like to heed; but the flow of time simply doesn’t work that way; and unless you can afford to just burn down everything and rewrite, meticulous good practice takes years to either gradually refactor or simply age out the various sins of the past.

Even if you have implemented all sorts of modern processes today, you will inevitably run into older/legacy code, which wasn’t quite up to today’s standards, but which your system still relies on.

And this one:

You can write shit in any language, using any process.

Pair programming DOES tend to pull the weaker programmer up, at least at first, but a weird dynamic in a pair can trigger insane shit-fails (and associated management headaches).

There’s no silver bullet.

This is something that I have also run into various times, sadly… poor management of the development process:

Unfortunately in the real world, project due dates are the first thing set, then the solution and design are hammered out.

I’m working coding on a new project that we kicked off this week that is already “red” because the requirements were two months behind schedule, but the due date hasn’t moved.

And the reply to that:

It’s sadly commonplace for software project to allot zero time for actual code implementation. It’s at the bottom of the development stack, and every level above it looks at the predetermined deadline and assumes, “well, that’s how long I’VE got to get MY work done.” It’s not unusual for implementation to get the green light and all their design and requirements documents AFTER the original delivery deadline has passed. Meanwhile, all those layers – and I don’t exclude implementation in this regard – are often too busy building their own little walled-off fiefdoms rather than working together as an actual team.

Basically, these are managers who think they’re all-important: once they have some requirements, they just shove them into a room with developers, and the system will magically come out on the other end. Both Agile development and the No Silver Bullet article try to teach management that software development is a team sport, and that management should work WITH the developers/architects, not against them. As someone once said: software development is not rocket science. If only it were that simple.

Another interesting one (responding to the notion that machine language and assembly are ‘outdated’ and not a required skill for a modern developer):

The huge difference is that we no longer use punchcards, so learning how punchcards work is mostly a historic curiosity.

On the other hand every single program you write today, be it Haskell, JavaScript, C#, Swift, C++, Python, etc, would all ultimately be compiled to or run on top of some code that still works in binary/assembly code. If you want to fully understand what your program is doing, it’s wise to understand to at least read assembly. (And if you can read and understand it it’s not a big stretch to then be able to modify it)

And really, most of the skill in reading assembly isn’t the assembly itself. It’s in understanding how computers and OS actually work, and due to Leaky Abstraction (https://en.wikipedia.org/wiki/Leaky_abstraction) it’s often abstractions can be broken, and you need to look under the curtain. This type of skill is still pretty relevant if you do computer security related work (decompiling programs would be your second nature), or if you do performance-sensitive work like video games or VR or have sensitive real-time requirements (needing to understand the output of the compiler to see why your programs are not performing well).

Very true! We still use machine code and assembly language in today’s systems. And every now and then some abstraction WILL leak such details. I have argued that before in this blogpost.

Which brings me to the next one:

We can celebrate the skill involved without celebrating the culture that makes those skills necessary. I’d rather not have to patch binaries either, but I can admire the people who can do it.

A common misunderstanding of the blogpost I mentioned above is that people mistook my list of skills for a list of ‘best practices’. No, I’m not saying you should base all your programming work around these skills. I’m saying that these are concepts you should master to truly understand all important aspects of developing and debugging software.

This is also a good point:

My point is: software engineering back in the days might not have all those fancy tools and “best practises” in place: but it was an art, and required real skills. Software engineering skills, endurance, precision and all that. You had your 8 KB worth of resources and your binary had to fit into that, period.

I am not saying that I want to switch my code syntax highlighter and auto-completion tools and everything, and sure I don’t want to write assembler ;) But I’m just saying: don’t underestimate the work done by “the previous generations”, as all the lessons learned and the tools that we have today are due to them.

If you learnt coding ‘the hard way’ in the past, you had to hone your skills to a very high level to even get working software out of the door. People should still strive for such high levels today, but sadly, most of them don’t seem to.

And again:

Just as frustrating is that quite a few developers have this mania with TDD, Clean Architecture, code reviews processes etc. without really understanding the why. They just repeat the mantras they’ve learnt from online and conference talks by celebrities developers. Then they just produced shitty code anyway.

And the response to that:

A thousand times this. Lately I have a contractor giving me grief (in the form of hours spent on code reviews) because his code mill taught him the One True Way Of Coding.. sigh.

As said before, understand the ideas behind the processes. Understanding the processes and the thinking behind them makes you a much better developer, and allows you to apply the processes and ideas in the spirit their initiators intended, for best effect. And I cannot repeat it often enough: There is no silver bullet! No One True Way Of Coding!

Well, that’s it for now. I can just say that I’m happy to see I’m not quite alone in my thoughts on software development. On some forums you only see younger developers, and they generally all have the same, dare I say, naïve outlook on development. I tend to feel out of place there. I mostly discuss programming on vintage/retro-oriented forums these days, since they are generally populated with older people and/or people with a more ‘oldskool’ view on development, and years of hands-on experience. They’ve seen various processes and tools come and go, usually without yielding much in the way of results. The common factor in quality has always been skilled developers. It is nice to see so many ‘old warriors’ also hanging out on Ars Technica.


Software: How to parallel?

Software optimization has always been one of my favourite tasks in software development. However, the hardware you are optimizing for is a moving target (unless of course you are doing retroprogramming/oldskool demoscening, where you have nicely fixed targets). That nice algorithm that you fine-tuned last year? It may not be all that optimal for today’s CPUs anymore.

One area where this was most apparent was in the move from single-core CPUs to multi-core CPUs, in the early 2000s. Prior to the first commonly available consumer dual-core x86 CPUs from Intel and AMD, multi-core/multi-CPU systems were very expensive and were mainly used in the server and supercomputer markets. Most consumers would just have single-core systems, and this is also what most developers would target.

The thing with optimizing tasks for single-core systems is that it is a relatively straightforward process (pun intended). That is, in most cases, you can concentrate on just getting a single job done as quickly as possible. This job will consist of a sequence of 1 or more processing steps, which will be executed one after another in a single thread. Take for example the decoding of a JPEG image. There are a number of steps in decoding an image, roughly:

  • Huffman decoding
  • Zero run-length decoding
  • De-zigzag the 8×8 blocks
  • Dequantization
  • Inverse DCT
  • Upscaling U and V components
  • YUV->RGB conversion

There can be only one

For a single-threaded solution, most of the optimization will be about making each individual step run as quickly as possible. Another thing to optimize is the transition from one step to the next, using an efficient data flow with as little copying or transforming of data between steps as possible.
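
To make the shape of that single-threaded pipeline concrete, here is a minimal C++ sketch. The types and step functions are hypothetical placeholders (not taken from any real JPEG library, and stubbed out so the example compiles); the point is only that each step consumes the previous step’s output in one straight line.

#include <cstdint>
#include <vector>

// Placeholder types; a real decoder would carry more state (quantization
// tables, subsampling info, image dimensions, and so on).
struct Coefficients { std::vector<int16_t> data; };
struct Blocks       { std::vector<int16_t> data; };
struct Planes       { std::vector<uint8_t> y, u, v; };
struct Image        { std::vector<uint8_t> rgb; };

// Hypothetical step functions, stubbed out here; each one consumes the
// output of the previous step.
static Coefficients decodeHuffmanRunLength(const std::vector<uint8_t>&) { return {}; }
static Blocks       dezigzagDequantize(const Coefficients&)             { return {}; }
static Blocks       inverseDct(const Blocks&)                           { return {}; }
static Planes       upsampleChroma(const Blocks&)                       { return {}; }
static Image        yuvToRgb(const Planes&)                             { return {}; }

// Single-threaded pipeline: one straight line of steps, where the main
// optimization targets are the steps themselves and the hand-over of
// buffers between them (avoiding needless copies).
Image decodeJpeg(const std::vector<uint8_t>& jpegData)
{
    Coefficients coeffs = decodeHuffmanRunLength(jpegData);
    Blocks       blocks = inverseDct(dezigzagDequantize(coeffs));
    Planes       planes = upsampleChroma(blocks);
    return yuvToRgb(planes);
}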

But that is about it. If you want to decode 10 JPEG images, or even 100 JPEG images, you will process them one after another. Regardless of how many images you want to process, the code is equally optimal in every case. There are some corner-cases though, since even a single-core system consists of various bits of hardware which may be able to process in parallel. For example, you could have disk transfers that can be performed in the background via DMA. Your OS might provide APIs to perform asynchronous disk access with this functionality. Or your system may have a GPU that can run in parallel with the CPU. But let us stick to just the CPU-part for the sake of this article.

How does one parallelize their code? That is the big question. And there is a reason why the title is a question. I do not pretend to have an answer. What I do have however, is various ideas and concepts, which I would like to discuss. These may or may not apply to the problem you are currently trying to solve.

When you are used to optimizing for single-core systems, then intuitively you might take the same approach to parallelization: you will try to make each step of the process as fast as possible, by applying parallel algorithms where possible, and trying to use as many threads/cores as you can, to maximize performance. This is certainly a valid approach in some cases. For example, if you want to decode a single JPEG image as quickly as possible, then this is the way to do it. You will get the lowest latency for a single image.

However, you will quickly run into Amdahl’s law this way: not every step can be parallelized to the same extent. After the zero run-length decoding, you can process the data in 8×8 blocks, which is easy to parallelize. However, the Huffman decoding is very difficult to parallelize, for the simple reason that each code in the stream has a variable length, so you do not know where the next code starts until you have decoded the previous one. This means that you will not make full use of all processing resources in every step. Another issue is that you now need explicit synchronization between the different steps. For example, the 8×8 blocks are separated into Y, U and V components. But at the end, when you want to convert from YUV to RGB, you need to have all three components decoded before you can do the conversion. Instead of just waiting for a single function to return a result, you may now need to wait for all threads to have completed their processing, causing extra overhead/inefficiency.
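
A minimal sketch of this ‘parallelize a single image’ approach, with the per-block work as a hypothetical placeholder: the Huffman/run-length stage stays serial, the per-block stages are spread over worker threads, and the colour conversion has to wait at an explicit synchronization point until all of them are done.

#include <algorithm>
#include <cstdint>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical per-block work (dequantization + inverse DCT) for the 8x8
// blocks in [firstBlock, lastBlock); stubbed out here.
static void processBlockRange(std::vector<int16_t>& coeffs,
                              size_t firstBlock, size_t lastBlock)
{
    (void)coeffs; (void)firstBlock; (void)lastBlock;
}

void decodeOneImageInParallel(std::vector<int16_t>& coeffs)
{
    // Serial part: producing 'coeffs' (Huffman/run-length decoding) cannot
    // easily be split, because the codes have variable length.

    // Parallel part: divide the 8x8 blocks over the available cores.
    const unsigned numThreads = std::max(1u, std::thread::hardware_concurrency());
    const size_t   numBlocks  = coeffs.size() / 64;
    const size_t   perThread  = (numBlocks + numThreads - 1) / numThreads;

    std::vector<std::thread> workers;
    for (unsigned t = 0; t < numThreads; ++t) {
        const size_t first = t * perThread;
        const size_t last  = std::min(numBlocks, first + perThread);
        if (first >= last) break;
        workers.emplace_back(processBlockRange, std::ref(coeffs), first, last);
    }

    // Synchronization point: the YUV->RGB conversion needs all Y, U and V
    // blocks finished, so we must wait for every worker here.
    for (std::thread& w : workers) w.join();
}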

When you want to decode more than one image, this may not be the fastest way. The well-parallelized parts may be able to use all cores at the same time, but the serial parts will have very inefficient use of the cores. So the scaling will be less than linear with core count. You will not be getting the best possible throughput.

Batch processing

People from the server world are generally used to exploiting parallelism in another way: if they want to process multiple items at the same time, they will just start multiple processes. In this case, you can run as many of the single-core optimized JPEG decoders side-by-side as you have cores in your system. Generally this is the most efficient way, if you want to decode at least as many JPEG images as you have cores. You mostly avoid Amdahl’s law here, because each core runs very efficient code, and all cores can be used at all times. The main way in which Amdahl’s law will manifest itself in such a situation is in the limitations of shared resources in the system, such as cache, memory and disk I/O. For this reason, scaling will still not be quite linear in most cases, but it generally is as good as it gets, throughput-wise.

However, in this case, if you have say 16 cores, but you only want to decode 10 images, then you will have 6 idle cores. So, like the earlier parallelized approach, you are still not making full use of all processing resources in that case, and your throughput is not optimal.

You could try to run the above parallelized solution for multiple images in parallel, but then you run into other problems: since the parallel parts are designed to use as many resources as are available for a single image, running two or more instances in parallel will be very inefficient, because the instances will be fighting for the resources and end up starving each other.

It gets even more complicated if we add a time component: say you want to decode a batch of JPEG images, but they are not all available at the same time. The images are coming in from some kind of external source, and you do not know exactly when they are coming in, or how many you need to process at a time. If you expect a batch of 100 images, you may get 3 images at a time, then nothing for a while, then another 10 images, etc. So you never know how many images you want to process at the same time, or how many cores you may have available. How would you make this as efficient as possible in the average case? The question becomes: do you optimize for lowest latency, highest throughput, or some balance between the two?

Take a cue from hardware

I think it is interesting to look at CPUs and GPUs at this point, because they need to solve very similar problems, but at a lower level. Namely, they have a number of execution units, and they have batches of instructions and data coming in under unpredictable circumstances. Their job is to allocate the execution units as efficiently as possible.

An interesting parallel to draw between the above two software solutions of parallelizing the decoding of JPEG images and GPU technology is the move from VLIW to scalar SIMD processing in GPUs.

To give some background: GPUs traditionally processed either 3d (XYZ) or 4d vectors (XYZW), or RGB/ARGB colours (effectively also 3d or 4d vectors). So it makes sense to introduce vectorized SIMD instructions (much like MMX, SSE and AVX on x86 CPUs):

add vec1.xyzw, vec2.xyzw, vec3.xyzw

So a single add-instruction can add vectors of up to 4 elements. However, in some cases, you may only want to add some of the elements, perhaps just 1 or 2:

add vec1.x, vec2.x, vec3.x

add vec1.xy, vec2.xy, vec3.xy

I believe the image below is a nice illustration of this:

[Image: VLIW instruction slot utilization]

What you see here is an approach where a single processing unit can process vectors of up to 5 elements wide. You can see that the first instruction is 5 wide, so you get full utilization there. Most other instructions however are only 1d or 2d, and there is one more 4d one near the end. So most of the time, the usage of the 5d processing unit is very suboptimal. This is very similar to the above example where you try to parallelize a JPEG decoder and optimize for a single image: some parts may be ’embarrassingly parallel’ and can use all the cores in the CPU. Other parts can extract only limited or even no parallelism, leaving most of the units idle. Let’s call this ‘horizontal’ parallelization. In the case of the GPU, it is instruction-level parallelism.

The solution with GPUs was to turn the processing around by 90 degrees. Instead of trying to extract parallelism from the instructions themselves, you treat all code as if it is purely scalar. So if your shader code looked like this:

add vec1.xyzw, vec2.xyzw, vec3.xyzw

The compiler would actually compile it as a series of scalar operations, like this:

add vec1.x, vec2.x, vec3.x

add vec1.y, vec2.y, vec3.y

add vec1.z, vec2.z, vec3.z

add vec1.w, vec2.w, vec3.w

The parallelism comes from the fact that the same shader is run on a large batch of vertices or pixels, so you can place many of these ‘scalar threads’ side-by-side, running on a SIMD unit. For example, if you take a unit like the above 5d vector unit, you could pack 5 scalar threads this way, and always make full use of the execution unit. It is also easy to make it far wider than just 5 elements, and still have great efficiency, as long as you have enough vertices or pixels to feed. Let’s call this ‘vertical’ parallelization. In the case of the GPU, this is thread-level parallelism.

Now, you can probably see the parallel with the above two examples of the JPEG decoding. One tries to extract as much parallelism from each step as possible, but will not reach full utilization of all cores at all times, basically a ‘horizontal’ approach. The other does not try to extract parallelism from the decoding code itself, but instead parallelizes by running multiple decoders side-by-side, ‘vertically’.

Sharing is caring

The ‘horizontal’ and ‘vertical’ approaches here are two extremes. My example above with images coming in ‘randomly’ shows that you may not always want to use one of these extremes. Are there some hybrid forms possible?

In hardware, there certainly are. On CPUs we have SMT/HyperThreading, to share the execution units of a single core between multiple threads. The idea behind this is that the instructions in a single thread will not always keep all execution units busy. There will be ‘bubbles’ in the execution pipeline, for example when an instruction has to wait for data to become available from cache or memory. By feeding instructions from multiple threads at a time, the execution units can be used more efficiently, and bubbles can be reduced.

GPUs have recently acquired very similar functionality, known as asynchronous compute shaders. This allows you to feed multiple workloads, both graphics and compute tasks, simultaneously, so that (if things are balanced out properly) the GPU’s execution units can be used more efficiently, because one task can use resources that would otherwise remain an idle ‘bubble’ during another task.

Threadpool

The software equivalent of this is the threadpool: a mechanism where a number of threads are always available (usually the same number as you have cores in your machine), and these threads can receive any workload. This has some advantages:

  • Creating or destroying threads on demand is quite expensive; a threadpool has lower overhead.
  • Dependencies can be handled by queuing a next task to start when a current task completes.
  • The workloads are scheduled dynamically, so as long as you have enough workloads, you can always keep all threads/cores busy. You do not have to worry about how many threads you need to run in parallel at a given time.

That last one might require an example to clarify. Say you have a system with 8 cores. Normally you will want to run 8 tasks in parallel. If you were to manually handle the threads, then you could create 8 threads, but that only works if you’re running only one instance of that code. If you were to run two, then it would create 16 threads, and they would be fighting for the 8 cores. You could try to make it smart, but then you’d probably quickly come to the conclusion that the proper way to do that is to… create a threadpool.

Because if you use a threadpool, it will always have 8 threads running for your 8 cores. If you create 8 independent workload tasks, it can run them all in parallel. If you create 16 however, it will run the first 8 in parallel, and then start on the 9th as soon as the first task completes. So it will always keep 8 tasks running.
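
Here is a minimal sketch of such a threadpool in C++. This is not any particular library’s API, just the general shape: a fixed set of worker threads, one per core by default, pulling tasks from a shared queue.

#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// A minimal fixed-size threadpool: a queue of tasks, and one worker thread
// per core pulling tasks from that queue.
class ThreadPool {
public:
    explicit ThreadPool(unsigned numThreads = std::thread::hardware_concurrency())
    {
        if (numThreads == 0) numThreads = 1;
        for (unsigned i = 0; i < numThreads; ++i)
            workers_.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wakeup_.notify_all();
        for (std::thread& w : workers_) w.join();  // workers drain the queue first
    }

    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push(std::move(task));
        }
        wakeup_.notify_one();
    }

private:
    void workerLoop()
    {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wakeup_.wait(lock, [this] { return stopping_ || !tasks_.empty(); });
                if (tasks_.empty()) return;  // only reached when stopping
                task = std::move(tasks_.front());
                tasks_.pop();
            }
            task();  // run outside the lock, so other workers can pick up tasks
        }
    }

    std::vector<std::thread>          workers_;
    std::queue<std::function<void()>> tasks_;
    std::mutex                        mutex_;
    std::condition_variable           wakeup_;
    bool                              stopping_ = false;
};

int main()
{
    ThreadPool pool;  // defaults to one worker thread per core

    // On an 8-core machine, 8 of these tasks run in parallel; the 9th starts
    // as soon as one of the first 8 has finished, and so on.
    for (int i = 0; i < 16; ++i)
        pool.submit([i] { std::printf("decoding image %d\n", i); });

    // The destructor lets the workers finish the queued tasks and joins them.
    return 0;
}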

Another advantage is that you can run any type of task in parallel. So in the case that the images do not come in at the same time, the different images can be in different steps. Instead of the massively parallel steps hogging all the CPU cores, the non-parallelized steps can be run in parallel with the massively parallel ones, finding a decent balance of resource sharing.

In this case, I suppose the key is to find the right level of granularity. You could in theory create a separate task for each 8×8 block for every step. But that would create a lot of overhead in starting, stopping and scheduling each individual task. So you may want to group large batches of 8×8 blocks together into single tasks. You might also want to group multiple decoding steps together on the same batch of 8×8 blocks, to reduce the total number of tasks further.
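
As a rough illustration of that trade-off, reusing the hypothetical ThreadPool and processBlockRange from the sketches above: instead of one task per 8×8 block, you submit one task per batch of blocks, and the batch size becomes the tuning knob between scheduling overhead and load balancing.

// Reuses the hypothetical ThreadPool and processBlockRange from the sketches above.
void submitBlockTasks(ThreadPool& pool, std::vector<int16_t>& coeffs)
{
    const size_t numBlocks     = coeffs.size() / 64;
    const size_t blocksPerTask = 256;  // tuning knob: fewer, bigger tasks vs. finer-grained scheduling

    for (size_t first = 0; first < numBlocks; first += blocksPerTask) {
        const size_t last = std::min(numBlocks, first + blocksPerTask);
        // Each task covers a whole range of blocks (and could also chain
        // several decoding steps for that range) to keep task overhead low.
        pool.submit([&coeffs, first, last] { processBlockRange(coeffs, first, last); });
    }
}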

Anyway, these are just some ideas of how you can parallelize your code in various ways on modern multi-core systems. Whichever is the best way depends on your specific needs. Perhaps there are also other approaches that I have not mentioned yet, or which I am not even aware of. As I said, I don’t have all the answers, just some ideas to try. Feel free to share your experiences in the comments!


nVidia makes good on their promise: DX12 support for Fermi

You might remember that in the months leading up to Mantle, DX12 and Vulkan, I mentioned that all of nVidia’s cards from Fermi and up would support DX12. This was also officially confirmed by nVidia on this page, and also here. However, you can see the small print there already:

Fermi will receive DX12 support later this year (expected around the first wave of DX12 content).

And indeed, nVidia’s initial release of DX12 drivers had support for all GPUs, except Fermi. However, the promised Fermi drivers never appeared that year.

nVidia later made a statement that they would not support Vulkan on Fermi. People extrapolated from this that the elusive DX12 drivers for Fermi would never materialize either.

But nVidia silently made good on their promise. I still have an old Core2 Duo machine around, with my old GTX460 in there. I put Windows 10 on there to have another DX12 test box, and I ran into exactly this problem: no drivers.

However, I just upgraded it to the Windows 10 Fall Update, and while I was at it, I also installed the latest GeForce drivers, namely 388.00. And lo and behold:

[Screenshot: GTX460_DX12]

There it is! Direct3D DDI version 12! And driver model WDDM 2.3! These are fully up-to-date drivers, exposing the DX12 driver interface to applications. I don’t know how long this has been in nVidia drivers, but it can’t be more than a few driver releases since I last checked (previous drivers reported only DDI 11).

If I were to hazard a guess, I would think that the 384.76 drivers were the first. Previous release notes say this for DirectX 12 support:

DirectX 12 (Windows 10, for Kepler, Maxwell, and Pascal GPUs)

But now there is no mention of specific GPUs anymore, implying that Fermi is also supported.

Of course I wanted to make absolutely sure, so I ran one of the DirectX 12 samples on it. And indeed, it works fine (and it is not running the WARP software emulation: the samples mention this in the title if they do, and it was compiled with WARP set to ‘false’):

[Screenshot: GTX460_DX12Sample]

I also tried the only DX12 demoscene production I know of so far, Farbrausch’s FR-087: Snake Charmer. This also works:

[Screenshot: GTX460_SnakeCharmer]

So there we have it, DX12 on Fermi is finally a thing! Kudos to nVidia for delivering on their promise at last.

Update: Apparently this was already discovered on Guru3D, which confirms the 384.76 driver release as the first: http://www.guru3d.com/news-story/nvidia-fermi-cards-get-d3d-12-support.html

I found that link while looking at the Wikipedia page for nVidia GPUs, which someone had already updated to DirectX 12 back in July.

 


The Pessimist Pamphlet

When I first heard of Agile development, I got a feeling of catch-22. That is, I agreed with most of the Agile principles and the underlying reasoning. However, I felt that the reason why I agreed is because of my personal experience, and having already learnt (often the hard way) to approach software development in a way very similar to Agile. I figure that most experienced developers would feel the same. The catch-22 then, is in how you would implement Agile development. For many of the ideas they promote, you need at least one or two of these very experienced developers in your team, who have this sort of Fingerspitzengefühl. Because if you look at the 12 principles, they name esoteric things like ‘sustainable development’, ‘technical excellence’, ‘good design’, ‘simplicity’ and reflecting on their own functioning. If you don’t have a certain level of experience and insight in software development, you will not be able to properly recognize and identify such things within the team and its software.

What I think is the most important of the 4 main values in the Agile manifesto is this:

Individuals and interactions over processes and tools

To me, this statement says that you should always trust the ‘Fingerspitzengefühl’, that is, the insight and experience of developers over some ‘rules’ on how you should or shouldn’t approach a particular area of software development.

In practice however, we mainly see one type of ‘Agile’ development being implemented, which is Scrum. The irony of it all is that Scrum is a very rigid set of rules. While it is considerably more ‘Agile’ than traditional hierarchical/waterfall-like business processes, and allows for more flexible changes in requirements and such, in practice it is often not quite as ‘Agile’ as I believe the people of the Agile Manifesto had envisioned. Scrum has become more of a Cargo Cult thing than something actually Agile.

One of my personal objections against Scrum is that people seem to interpret it as “the team can solve everything”, “anything can be cut up into smaller pieces”, “we’ll cross that bridge when we get there”, and “architecture is a dirty word”.

Sometimes, there are things that you simply cannot solve iteratively. If you have a single component that has various highly dependent requirements, and you choose to solve them iteratively, you can shoot yourself in the foot quite badly. For example, I have seen a project where items could come in from various sources, and had to be processed via a web-based client. The initial design used queues for the items, so that they could be processed first-in, first-out. Prioritization was possible by assigning different weights to the queues of each of the sources.

However, there were two more requirements, which had been postponed until later, that threw a huge spanner in the works: The client wanted to assign a ‘priority time’ to each item, and the client wanted an item to be processed (as soon as possible) by a specific web-client user. This meant that FIFO queues were completely inadequate. Since each item can now have a completely unpredictable priority time, it is no longer a queueing problem, it is a sorting problem. Likewise, routing an item directly to a particular client is not something you can do with just FIFO queues.
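
To make the technical impact of those two requirements concrete, here is a minimal sketch (with invented field names, nothing from the actual project): once every item carries its own priority time, the natural structure is something ordered on that time rather than a FIFO queue, and routing to a specific user additionally needs a lookup that a plain queue cannot provide.

#include <ctime>
#include <queue>
#include <string>
#include <vector>

// Invented item type, for illustration only.
struct WorkItem {
    std::string source;        // where the item came from
    std::time_t priorityTime;  // arbitrary, per-item priority time
    std::string assignedUser;  // empty if any web-client user may pick it up
};

// With plain FIFO queues per source, items can only be handed out in arrival
// order, with priorities expressed as weights between the queues. With a
// per-item priority time, the "next" item is the one with the earliest time,
// regardless of arrival order or source: a sorting problem, not a queueing one.
struct EarliestFirst {
    bool operator()(const WorkItem& a, const WorkItem& b) const {
        return a.priorityTime > b.priorityTime;  // min-heap on priority time
    }
};
using SortedWorkPool =
    std::priority_queue<WorkItem, std::vector<WorkItem>, EarliestFirst>;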

I believe this is a case where Scrum failed horribly: you never know exactly where you are going, you are just solving the problem one sprint at a time. You don’t realize that you took a wrong decision at the start until you’ve already built a large part of the application, and then you need to redesign and reimplement much of it. Nobody in the team recognized how much of a technical impact these particular requirements would have. They thought they were Scrumming away quite nicely. I suppose the reason is that the main form of interaction that the Scrum method proposes is the daily standup. But that only discusses things that you are currently working on; you are not looking beyond the current sprint.

The Fingerspitzengefühl here should have said “Wait, this is a very complex application, we need to analyze enough of the requirements, and do enough preparation and design work, with perhaps some Proof-of-Concept to at least know that we are starting our sprints in the right direction.” You have to look before you leap. If nothing else, the ancient and outdated waterfall-methodology at least made sure you did that much.

My idea, then, is that the above value of Agile should be re-formulated into a stronger version:

It is the wrong way until you have determined the right way

This is what I would like to call the Pessimist Pamphlet. It is stronger because it starts from the negative, somewhat similar to the philosophy of science with its verification and falsification. The idea is to inspire more interaction between individuals, more thought about the problem you’re solving and how you can best solve it, and critical thinking about the processes and tools you’re using. Are they really the right ones in this particular case? What other options are there? Which experienced developers (inside your own team or not) could you ask for input? Instead of thinking “I’m probably doing things right” until you find out that you didn’t, you should question yourself beforehand and try to avoid problems that you could not see, but others could have, if you just had the interaction. Are you absolutely sure? Don’t assume you’re sure, make sure you’re sure.

Hopefully that will guard you against some of the pitfalls of Scrum, such as never trying to solve any problem as a whole, and assuming that there is always someone in the team who will come up with the correct solutions and insights. Some problems need to be solved as a whole, not one requirement at a time. The key is to properly identify the dependencies between the requirements, and group all dependent requirements into a single ‘problem’ to solve. I am still not talking about designing the entire application at once, obviously, merely about identifying the parts that need to be designed as a whole, because the set of dependent requirements pushes you towards a certain minimum featureset that your data model, choice of algorithms etc. need to support (as in this example: don’t choose queues when you might need more complex queries or sorting at a later stage, in which case e.g. a relational database may be more appropriate). And sometimes you really do need to consult someone from outside the team, who may have relevant insights and knowledge that your team does not have.

It would seem that I am not the only one who has their doubts about Scrum and how it is applied in practice. This is a very nice article discussing some pitfalls and shortcomings of Scrum: https://www.aaron-gray.com/a-criticism-of-scrum/

I believe this also brings us back to that valuable lesson that Fred Brooks taught us in the 1980s: there is no silver bullet for software development. I suppose what I have discussed here comes down to determining what he calls the ‘essential complexity’ of a problem. In his famous article, he talks about creating working prototypes and developing them incrementally towards a final product. One of the arguments he gives is that it is almost impossible for the client to give a complete and accurate specification upfront, so the requirements are subject to change. You could say he laid the foundation for Agile development in that article. However, he also makes a clear point that there is a difference between good and great designers. You should cherish these great designers, because they can simply do things that good designers cannot, no matter what methodology you apply. These great designers can make the biggest difference.

 


Trackers vs MIDI, part deux

The previous post got rather long(winded) already. But shortly after I posted it, I realized that I had not yet said all that I wanted to say. Namely, I mentioned the Yamaha FM synthesizer chips at the start, and I wanted to get back to them later, when discussing trackers. However, I solely focused on sample-based trackers and the UltraSound there. So this time, let’s look at trackers for other types of synthesizers.

It mostly revolves around what I briefly mentioned before:

But I just said that I thought the Sound Blaster Pro 2.0 sounded bland. What happened here? Well, my guess is that MIDI happened.

What is the most ‘characteristic’ thing about synthesizers and synthesizer music in general? I would say it is the fact that synthesizers generate the sound in realtime, and that you can modify the parameters of the generated sound in realtime. A common effect is the ‘filter sweep’, where the sound changes from bright to dark and back again. You can listen to music from Jean-Michel Jarre for many examples of that. His sounds are constantly ‘morphing’:

You will find plenty of such examples in music on the Commodore 64 as well, since the SID chip is also a ‘subtractive’ synthesizer like these early synths from the 70s and 80s, which first generates a basic waveform with an oscillator, and then runs it through a filter with cutoff and resonance parameters to shape the final tone. This is one of the defining features of the SID chip: its contemporaries lacked the filter. Another ‘synth’ feature it has is ring-modulation. A third feature is that you can adjust the duty-cycle of the pulse wave, to control its timbre. Using these features and manipulating their parameters allowed the SID to sound far more ‘synth-music’-like than any other computer at the time:

Now, Yamaha was of course also a big name in the synthesizer world of the early 80s, most notably with the DX7. The DX7 introduced the world to FM synthesis. Purists will say that no filters are used in FM synthesis, and that unlike the SID and other early synthesizers, which work with analog signals, FM is actually implemented mostly as digital algorithms, with only a D/A converter at the end of the chain.

However, in practice, the concept is much the same: the sound is generated in realtime, and you can adjust parameters in realtime as well, for various effects. Even on an FM synthesizer you can get quite convincing realtime controlled filter-like sounds, very ‘synth-like’:

This brings us back to the world of PC soundcards, as the popular AdLib and Sound Blaster cards also used FM synthesizer chips from Yamaha, namely the OPL2 and OPL3 as mentioned before.

While these chips may not be as advanced as a real DX7 synthesizer, the basic concept still holds: the sound is generated in realtime, and various parameters can be tweaked to modify the sound in realtime, creating a number of effects. The problem here is that each synthesizer has its own unique sound generation engine, with its own unique parameters to tweak in realtime.

MIDI allows you to tweak these parameters, but the problem is that there are only a handful of standardized messages defined in MIDI:

  • Note on/off (with velocity)
  • Aftertouch (with pressure)
  • Pitch bend change

These messages allow you to start or stop a note, to set the volume and pitch, and that’s basically it.

Anything else is done via generic messages for ‘control change’, ‘program change’ or with System Exclusive (SysEx) messages. The problem is: there is no standard for how these MIDI messages should map to the synthesizer. Or well, SysEx messages are well-defined, but only for a specific synthesizer, so you need customized software to support them.

And that is more or less the clash between MIDI and PC sound cards: MIDI is trying to be a very generic solution for recording and replaying musical data. It works well when you have a dedicated synthesizer hooked up, and any realtime changes you make on the synthesizer are sent as MIDI messages, which can be recorded and replayed by a MIDI sequencer.

The problem with the AdLib and Sound Blaster cards was that such a ‘development station’ didn’t really exist: the cards did not natively support MIDI, and there was no synthesizer to easily generate MIDI commands for all sorts of parameter changes. So if you wanted to go the MIDI route, you’d first need to write your own MIDI interpreter to drive the OPL chip, and then set up a MIDI controller and sequencer to control the OPL chip and compose music for it.

It seems that there is somewhat of a disconnect between the two worlds here. You would have musicians who were at home with MIDI, but who weren’t programmers themselves, and could not make a MIDI interpreter for the OPL chip. And then there were programmers who would be able to make a MIDI interpreter. But they would rather choose a more straightforward tracker-like solution. This led to MIDI mostly being used with simple standard MIDI drivers, with generic instrument presets and no realtime control of any synth parameters.

However, a few brave souls did in fact build trackers for the OPL chips. And they did actually play around with the instruments and showed off what the OPL chip was really capable of. One such tracker is EdLib, made by JCH. You might vaguely remember that name, since I also used a tune composed in EdLib in the 1991 Donut.

What is interesting is that JCH converted some C64 songs to EdLib. Listen to The Alibi by Laxity, first the original:

And then the EdLib version:

As you can hear, the C64 version modifies the sounds in realtime, and this AdLib tune does pretty much the same. What I like about the conversion is that it does not sound like JCH wanted to make the AdLib-version as close as possible to the C64 version. But rather, he tried to really adapt it to the AdLib and make it sound as good as possible on the OPL chip. The result is possibly one of the best AdLib tunes ever made, and certainly way better than the generic MIDI sound that is usually associated with the AdLib.

Another nice example is the soundtrack from the game Dune:

My favourite track is probably ‘Water’. It really shows off the metallic percussion sounds that FM can do so well. It makes it sound very bright and fresh, way different from the ‘muffled’ sounds of most 8-bit sound chips.

The Dune music was made with the HERAD system, which was loosely based on MIDI, but specifically targets the AdLib, so it is probably closest to the custom MIDI solution I meant above. The Dune-music is a great demonstration of what HERAD can do with an AdLib in capable hands.

Lastly, I also want to mention the game Tyrian. This one also really stood out on the AdLib back in the day:

Again, this game uses its own custom software for AdLib, known as Loudness Sound System (LDS).

 

Some more examples of outstanding AdLib tunes can be found here on the Crossfire Designs site. I also discovered the obscure ‘Easy AdLib’ tracker there, which also includes some very nice AdLib music:

So the story of FM synthesis on the PC is generally a sad one, with but a few highlights. If you knew how to program it, you could get some fantastic sounds from it. But it was not an easy chip to program. As a result, it seems that most game developers were just happy to get any sound from it at all. For a card that has been the standard in PC audio for such a long time, remarkably little software for composing music on it has been released. I suppose most game developers neither had the tools nor the skills to really make the AdLib shine.

Which is a shame, since FM chips were also used in quite a few arcade machines, consoles and home computers, mostly Japanese. And there’s lots of great FM music out there on chips that are quite similar to the OPL2 and OPL3 used on PC sound cards.

So I would like to close today’s blog with a recent demo from Titan for the Sega Mega Drive. The Mega Drive uses two sound chips: the SN76489, which we also know from the PCjr and Tandy, and a Yamaha YM2612, an FM synthesizer.

 

 


Trackers vs MIDI

With all the recent tinkering with audio devices and sound routines, I stumbled across various resources, old and new. One such resource was this article on OS/2 Museum, about the Gravis UltraSound. (And small world: the site is by Michal Necasek, who is on the team of OpenWatcom, the C/C++ compiler I use for my 16-bit DOS projects.) More specifically, the ‘flamewar’ between Rich Heimlich and the rest of the newsgroup, regarding the quality of the UltraSound patches, and the card’s general usability in games.

Now, as a long-time demoscener and Amiga guy, it probably doesn’t surprise you that I myself was an early adopter of the GUS, and it has always had a special place in my heart. So I decided to browse through that flamewar, for a trip down memory lane. I can understand both sides of the argument.

In the blue corner…

Unlike most other sound cards, the GUS was not designed to be a perfect clone of a prior standard that it could then build on. E.g., the original Sound Blaster was basically an AdLib with a joystick port and a DMA-driven DAC for digital audio added. Later Sound Blasters and clones would in turn build on 100% AdLib/Sound Blaster compatibility. Likewise, the Roland MT-32 set a standard. The Roland Sound Canvas set another standard (General MIDI), and also included an MT-32 compatibility mode (which wasn’t quite 100% though). Most other MIDI devices would also try to be compatible with the MT-32 and/or Sound Canvas.

The GUS was different. Being a RAM-based wavetable synthesizer, it most closely resembles the Amiga’s Paula chip. Which is something completely alien to PCs. You upload samples, which you can then play at any pitch and volume, anywhere in the stereo image (panning). While there was a brave attempt at a software layer to make the thing compatible with Sound Blaster (SBOS), the nature of the hardware didn’t lend itself very well to simulating the Yamaha OPL2 FM chip. So the results weren’t that great.

In theory it would lend itself quite well to MIDI, and there was also an emulator available to support Roland MT-32 and Sound Canvas (Mega-Em). However, for a complete General MIDI patch set, you needed quite a lot of RAM, and that was the bottleneck here. Early cards only had 256kB. Later cards had 512kB and could be upgraded to a maximum of 1 MB. Even 1 MB is still quite cramped for a high-quality General MIDI patch set. Top quality ROM-based wavetable synthesizers would have around 4 MB of ROM to store the patches.

Since the card was rather new and not that well-known, there weren’t that many games that supported the card directly, so you often had to rely on these less-than-great emulators. Even when games did use the card natively, the results weren’t always that great. And that’s what I want to focus on in this blog, but more on that later.

I never used SBOS myself, so I suppose my perspective on the GUS is slightly different anyway. My first soundcard was a Sound Blaster Pro 2.0, and when I got a GUS some years later (being a C64/Amiga guy, I was never much impressed by the SB Pro: the music sounded bland and the card was very noisy), I just left the SB Pro in my system, so I had the best of both worlds: full AdLib/SB compatibility, and GUS support (or MT-32/Sound Canvas emulation) when required.

In the red corner…

People who owned and loved the UltraSound knew what the card was capable of when you played to its strengths rather than its weaknesses (as the emulators did).

Gravis included their own MIDI player, where you could configure the player to use specially tweaked patch sets for each song. The card could really shine there. For example, they included a solo piano piece, where the entire RAM could be used for a single high-quality piano patch:

Another demonstration they included was this one:

That works well for individual songs, because you know what instruments are and aren’t used. But for generic software like games, you have to support all instruments, so you have to cram all GM instruments into the available RAM.

And being so similar to the Amiga’s Paula, the GUS was quickly adopted by demosceners, who had just recently started to focus on the PC, and brought the Amiga’s ProTracker music to the PC. Initially just by software-mixing multiple channels and outputting on PC speaker, Covox, Sound Blaster or similar single-channel devices. So when the GUS came out, everything seemed to fall into place: This card was made to play modules. Each module contains only the samples it needs, so you make maximum use of the RAM on the card. The chip would perform all mixing in hardware, so there was very little CPU overhead for playing music, and the resulting quality was excellent.

On the Amiga, every game used tracked music. So that would be a great solution for the GUS on the PC as well, right? Well, apparently not, because in practice very few games included tracked music on PC. And of the few games that did, many of them were ported from the Amiga, and used the 4-channel 8-bit music from the Amiga as-is. That didn’t really give the GUS much of a chance to shine. It couldn’t show off its 16-bit quality or its ability to mix up to 32 channels in hardware. Mixing just 4 channels was not such a heavy load on the CPU at the time, so hardware mixing wasn’t that much of an advantage in this specific case.

Yamaha’s FM synthesis

As you may know, the Sound Blaster and AdLib cards used a Yamaha FM synthesizer chip. Originally they used the OPL2; later generations (starting with the Sound Blaster Pro 2.0 and AdLib Gold) used the more advanced OPL3. Now, Yamaha is a big name in the synthesizer world. And their FM synthesis was hugely popular in the 80s, especially with their revolutionary DX7 synthesizer, which you can hear in many hits from that era.

But I just said that I thought the Sound Blaster Pro 2.0 sounded bland. What happened here? Well, my guess is that MIDI happened. The above flamewar with Rich Heimlich seems to revolve a lot around the capability of the devices to play MIDI data. Rich Heimlich was doing QA for game developers at the time, and apparently game developers thought MIDI was very important.

Yamaha’s chips, much like the GUS, weren’t that well-suited to MIDI, albeit for different (though related) reasons. That is, if you want to play MIDI data, you need to program the proper instruments into the FM synthesizer. If you just use generic instrument patches, your music will sound… generic.

Also, you are not exploiting the fact that it is in fact an FM synthesizer, and you can modify all the operators in realtime, doing cool filter sweeps and other special effects that make old synthesizers so cool.

Why MIDI?

So what is it that made MIDI popular? Let’s define MIDI first, because MIDI seems to mean different things to different people.

I think we have to distinguish between three different ‘forms’ of MIDI:

  1. MIDI as in the physical interface to connect musical devices
  2. MIDI as in the file format to store and replay music
  3. MIDI as in General MIDI

Interfaces

The first is not really relevant here. Early MIDI solutions on PC were actually a MIDI interface. For example, the MT-32 and Sound Canvas that were mentioned earlier were actually ‘sound modules’, which is basically a synthesizer without the keyboard. So the only way to get sound out of it is to send it MIDI data. Which you could do from any MIDI source, such as a MIDI keyboard, or a PC with a MIDI interface. The Roland MPU-401 was an early MIDI interface for PC, consisting of an ISA card and a breakout box with MIDI connections. The combination of MPU-401 + MT-32 became an early ‘standard’ in PC audio.

However, Roland later released the LAPC-I, which was essentially an MPU-401 and MT-32 integrated on an ISA card. So you no longer had any physical MIDI connection between the PC and the sound module. Various later sound cards would also offer MPU-401-compatibility, and redirect the MIDI data to their onboard synthesizer (like the GUS with its MegaEm emulation, or the Sound Blaster 16 with WaveBlaster option, or the AWE32). I can also mention the IBM Music Feature Card, which was a similar concept to the LAPC-I, except that its MIDI interface was not compatible with the MPU-401, and it contained a Yamaha FB-01 sound module instead of an MT-32.

So for PCs, the physical MIDI interface is not relevant. The MPU-401 hardware became a de-facto standard ‘API’ for sending MIDI data to a sound module. Whether or not that is actually implemented with a physical MIDI interface makes no difference for PC software.

File format

Part of the MIDI standard is also a way to store the MIDI data that is normally sent over the interface to a file, officially called ‘Standard MIDI file’ or sometimes SMF. It is basically a real-time log of MIDI data coming in from an interface: a sequence of MIDI data events, with delta timestamps of very high accuracy (up to microsecond resolution). We mostly know them as ‘.MID’ files. These are also not that relevant to PC games. That is, they may be used in the early stages of composing the music, but most developers will at some point convert the MIDI data to a custom format that is more suited to realtime playback during a game on various hardware.

General MIDI

Now this is the part that affects sound cards, and the GUS in particular. Initially, MIDI was nothing more than the first two points: an interface and a file format. So where is the problem? Well, MIDI merely describes a number of ‘events’, such as note on/off, vibrato, etc. So MIDI events tell a sound module what to play, but nothing more. For example, you can send an event to select ‘program 3’, and then to play a C#4 at velocity 87.
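
To make that concrete, here is roughly what those events look like as raw MIDI bytes (using the common convention where middle C is note 60, so C#4 is 61; note that program numbers are 0-based on the wire, so ‘program 3’ goes out as the value 2):

#include <cstdint>

// MIDI channel messages on channel 1 (the low nibble of the status byte is
// the channel number, 0-based).
const uint8_t programChange[] = { 0xC0, 0x02 };        // program change: "program 3" (sent 0-based)
const uint8_t noteOn[]        = { 0x90, 0x3D, 0x57 };  // note on: C#4 (61), velocity 87
const uint8_t noteOff[]       = { 0x80, 0x3D, 0x40 };  // matching note off (release velocity 64)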

The problem is… what is ‘program 3’? That’s not described by the MIDI events. Different sound modules could have entirely different types of instruments mapped to the same programs. And even if you map to the same instruments, the ‘piano’ of one sound module will sound different to the other, and one module may support things like aftertouch, while another module does not, so the expression is not the same.

In the PC-world, the MT-32 became the de-facto standard, because it just happened to be the first commonly available/supported MIDI device. So games assumed that you connected an MT-32, and so they knew what instruments mapped to which programs. One reason why the IBM Music Feature Card failed was because its FB-01 was very different from the MT-32, and the music had to be tweaked specifically for the unit to even sound acceptable, let alone sound good.

Roland later introduced the SC-55 Sound Canvas, as something of a ‘successor’ to the MT-32. The SC-55 was the first device to also support ‘General MIDI’, which was a standardization of the instrument map, as well as a minimum requirement for various specs, such as polyphony and multi-timbral support. It could be switched to the MT-32 instrument map for backward compatibility.

Where did it go wrong?

While the idea of standardizing MIDI instruments and specs seems like a noble cause, it never quite worked in practice. Firstly, even though you now have defined that program 1 is always a piano, and that you can always find an organ at program 17, there is still no guarantee that things will sound anything alike. Different sound modules will have different methods of sound generation, use different samples, and whatnot, so it never sounds quite the same. What’s worse, if you play an entire piece of music (as is common with games), you use a mix of various instruments. You get something that is more than the ‘sum of its parts’… as in, the fact that each individual instrument may not sound entirely like the one the composer had used, is amplified by them not fitting together in the mix like the composer intended.

In fact, even the SC-55 already suffered from this: while it has an MT-32 ’emulation’ mode, it does not use the same linear arithmetic method of sound generation that the real MT-32 uses, so its instruments sound different. Games that were designed specifically for the MT-32 may sound anywhere from slightly off to downright painful.

The second problem is that developers did indeed design sound specifically for the MT-32, and would use so-called ‘System Exclusive’ messages to reprogram the MT-32’s sounds to better fit the composition. As the name already implies, these messages are exclusive to a particular ‘system’, and as such are ignored by other devices. So the SC-55 can play the standard MT-32 sounds, but it cannot handle any non-standard programming.

This leads to a ‘lowest common denominator’ problem: Because there are so many different General MIDI devices out there, it’s impossible to try and program custom sounds on each and every one of them. So you just don’t use it. This is always a problem with standards and extension mechanisms, and reminds me a lot of OpenGL and its extension system.

Today, many years later, General MIDI is still supported by the built-in Windows software synthesizer and most synthesizers and sound modules on the market, and the situation hasn’t really changed: if you just grab a random General MIDI file, it will sound different on all of them, and in many cases it doesn’t even sound that good. The fact that it’s ‘lowest common denominator’ also means that some of the expression and capabilities of synthesizers are lost, and they tend to sound a bit ‘robotic’.

So I think by now it is safe to say that if the goal of General MIDI was to standardize MIDI and make all MIDI sound good everywhere, all the time, it has failed miserably. Hence General MIDI never caught on as a way to share music files, and we stopped using it for that purpose many years ago. The ‘classic’ MIDI interface and file format/data are still being used in audio software, but things moved more in the direction of custom virtual instruments with VSTi plugins and such, so I don’t think anyone bothers with standardized instrument mapping anymore. The first two parts of MIDI, the interface and the file format, did their job well, and still do to this day.

Getting back to games, various developers would build their music system around MIDI, creating their own dialect or preprocessor. Some interesting examples are IMF by id Software, which preprocesses the MIDI data into OPL2-specific commands, and HERAD by Cryo Interactive.

Doing something ‘custom’ with MIDI was required for at least two reasons:

  1. Only high-end devices like the IBM Music Feature Card and the MPU-401/MT-32/Sound Canvas could interpret MIDI directly. For other devices, such as PC speaker, PCjr/Tandy audio, AdLib or Game Blaster, you would need to translate the MIDI data to specific commands for each chip to play the right notes.
  2. Most audio devices tend to be very limited in the number of instruments they can play at a time, and how much polyphony they have.

Especially that second issue is a problem with MIDI. Since MIDI only sends note on/off commands, there is no explicit polyphony. You can just endlessly turn on notes, and have ‘infinite polyphony’ going on. Since MIDI devices tend to be somewhat ‘high-end’, they’ll usually have quite a bit of polyphony. For example, the MT-32 already supports up to 32 voices at a time. It has a simple built-in ‘voice allocation’, so it will dynamically allocate voices to each note that is played, and it will turn off ‘older’ notes when it runs out. With enough polyphony that usually works fine in practice. But if you only have a few voices to start with, even playing chords and a melody at the same time may already cause notes to drop out.
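
To illustrate the kind of dynamic voice allocation I mean, here is a minimal sketch of an ‘oldest note gets stolen’ allocator. This is my own illustration of the concept, not the MT-32’s actual algorithm; the Voice structure and allocate_voice() are hypothetical.

#define NUM_VOICES 32

typedef struct {
    int active;
    int note;
    unsigned long started;   /* timestamp of the note-on */
} Voice;

static Voice voices[NUM_VOICES];
static unsigned long clock_ticks = 0;

int allocate_voice(int note)
{
    int i, chosen = 0;

    clock_ticks++;
    for (i = 0; i < NUM_VOICES; i++) {
        if (!voices[i].active) {                 /* free voice: take it */
            chosen = i;
            break;
        }
        if (voices[i].started < voices[chosen].started)
            chosen = i;                          /* remember the oldest note */
    }
    /* Either a free voice, or the oldest note gets cut off ('stolen'). */
    voices[chosen].active  = 1;
    voices[chosen].note    = note;
    voices[chosen].started = clock_ticks;
    return chosen;
}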

An alternative

Perhaps it’s interesting to mention the Music Macro Language (MML) here. Like the MIDI file format, it was a way to store note data independently of the actual hardware. Various early BASIC dialects had support for it. It seemed to be especially popular in Japan, possibly because of the popularity of the MSX platform there. At any rate, where some game developers would build a music system around MIDI, others would build an MML interpreter, usually with their own extensions to make better use of the hardware. Chris Covell did an interesting analysis of the MML interpreter found in some Neo Geo games.

So, trackers then!

Right, so what is the difference between trackers and MIDI anyway? Well, there are some fundamental differences, mainly:

  1. The instrument data is stored together with the note data. Usually the instruments are actually embedded inside the tracker ‘module’ file, although some early trackers would store the instruments in separate files and reference them from the main file, so that instruments could easily be re-used by multiple songs on a single disk.
  2. Notes are entered in ‘patterns’, like a 2d matrix of note data, where a pattern is a few bars of music. These patterns are then entered in a ‘sequence’, which determines the order of the song, allowing easy re-use of patterns.
  3. The columns of the pattern are ‘channels’, where each channel maps directly to a physical voice on the audio hardware, and each channel is monophonic, like the audio hardware is.
  4. The horizontal ‘rows’ of the pattern make up the timeline. The timing is usually synchronized to the framerate (depending on the system this is usually 50, 60 or 70 Hz), and the tempo is set by how many frames each row should take.

Does that sound limited? Well yes, it does. But there is a method to this madness. Where MIDI is a ‘high-level’ solution for music-related data, which is very flexible and has very high accuracy, trackers are more ‘low-level’. You could argue that MIDI is like C, and trackers are more like assembly. Or, you could think of MIDI as HTML: it describes which components should be on the page, and roughly describes the layout, but different browsers, screen sizes, installed fonts etc will make the same page look slightly different. A tracker however is more like PostScript or PDF: it describes *exactly* what the page looks like. Let’s look at these 4 characteristics of trackers in detail.

Instruments inside/with the file

Trackers started out as being hardware-specific music editors, mainly on C64 and Amiga. As such, they were targeted at a specific music chip and its capabilities. As a result, you can only play tracker modules on the actual hardware (or an emulation thereof). But since it is a complete package of both note data and instrument data, the tracker module defines exactly how the song should sound, unlike MIDI and its General MIDI standard, which merely describe that a certain instrument should be ‘a piano’, or ‘a guitar’ or such.

The most popular form of tracker music is derived from the Amiga and its SoundTracker/NoiseTracker/ProTracker software. I have discussed the Amiga’s Paula sound chip before. It was quite revolutionary at the time in that it used 4 digital sound channels. Therefore, Amiga trackers used digital samples as instruments. Given enough CPU processing power, and a way to output at least a single digital audio stream, it was relatively easy to play Amiga modules on other platforms, so these modules were also used on PC and even Atari ST at times.

Notes entered in patterns

I more or less said it already: trackers use sequences of patterns. I think to explain what a ‘pattern’ is, an image speaks more than a thousand words:

[Screenshot: a pattern in ProTracker]

If you are familiar with drum machines, they usually work in a similar way: a ‘pattern’ is a short ‘slice’ of music, usually a few bars. Then you create a song by creating a ‘sequence’ of patterns, where you can re-use the same patterns many times to save time and space.

Patterns are vertically oriented: you usually have 64 rows to place your notes on. What these rows mean ‘rhythmically’ depends on what song speed you choose (so how quickly your rows are played), and how ‘sparsely’ you fill them. For example, you could put 4 bars inside a single pattern. But if you space your notes twice as far apart, and play the rows twice as fast, it sounds the same, yet you only get 2 bars out of the 64 rows now. However, you have gained extra ‘resolution’, because you now have twice the number of rows in the same time interval.
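
A minimal sketch of that arithmetic, assuming the common convention of 4 rows per beat (that mapping is a composer’s choice, not something the format dictates):

#include <stdio.h>

int main(void)
{
    double tick_rate = 50.0;  /* PAL frame rate in Hz */
    int    speed     = 6;     /* frames ("ticks") per row */

    double row_ms = 1000.0 * speed / tick_rate;        /* 120 ms per row */
    double bpm    = tick_rate * 60.0 / (speed * 4.0);  /* 125 BPM at 4 rows/beat */

    printf("row duration: %.1f ms, tempo: %.1f BPM\n", row_ms, bpm);
    printf("64 rows at this speed = %.2f seconds\n", 64.0 * row_ms / 1000.0);
    return 0;
}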

Pattern columns are ‘voices’

This is perhaps the biggest difference between MIDI and trackers: Any polyphony in a tracker is explicit. Each channel is monophonic, and maps directly to a (monophonic) voice on the hardware. This is especially useful for very limited sound chips that only have a handful of voices (like 3 for the C64 and 4 for the Amiga). MIDI simply sends note on/off events, and there will be some kind of interpreter that converts the MIDI data into commands for the actual hardware, which has to decide which voices to allocate, and may have to shut down notes when new note-on events arrive and there are no more free voices.

With a tracker, you will explicitly allocate each note you play to a given channel/voice, so you always know which notes will be enabled, and which will be disabled. This allows you to make very effective use of only a very limited number of channels. You can for example ‘weave’ together some drums and a melody or bassline. See for example what Rob Hubbard does here, at around 4:03:

He starts out with just a single channel, weaving together drums and melody. Then he adds a second channel with a bassline and even more percussion. And then the third channel comes in with the main melody and some extra embellishments. He plays way more than 3 parts all together on a chip only capable of 3 channels. That is because he can optimize the use of the hardware by manually picking where every note goes.

Here is another example, by Purple Motion (of Future Crew), using only two channels:

And another 2-channel one by Purple Motion, simply because optimization is just that cool:

I suppose these songs give a good idea of just how powerful a tool a tracker can be in capable hands.

The horizontal ‘rows’ of the pattern make up the timeline

This part also has to do with efficiency and optimization, but not in the musical sense. You may recall my earlier articles regarding graphics programming and ‘racing the beam’ and such. Well, of course you will want to play some music while doing your graphics in a game or demo. But you don’t want your music routine to get in the way of your tightly timed pixel pushing. So what you want is to have a way to synchronize your music routine as well. This is why trackers will usually base their timing on the refresh-rate of the display.

For example, Amiga trackers would run at 50 Hz (PAL). That is, your game/demo engine will simply call the music routine once per frame. The speed-command would be a counter of how many frames each row would take. So if you set speed 6, that means that the music routine will count down 6 ‘ticks’ before advancing to the next row.

This allows you to choose when you call the music routine during a frame. So you can reserve a ‘slot’ in your rastertime, usually somewhere in the vertical blank interval, where you play the music. Then you know that by definition the music routine will not do anything during the rest of the frame, so you can do any cycle-exact code you like. The music is explicitly laid out in the row-format to be synchronized this way, allowing for very efficient and controlled replaying in terms of CPU time. The replay routine will only take a handful of scanlines.
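
To illustrate, here is a minimal sketch of such a per-frame tick counter (not ProTracker’s actual replayer); play_row() and update_effects() are hypothetical placeholders for the hardware-specific work:

static int speed = 6;   /* frames ("ticks") per row, set by the speed command */
static int tick  = 0;
static int row   = 0;

static void play_row(int r)      { (void)r; /* trigger the notes on row r */ }
static void update_effects(void) { /* vibrato, portamento, volume slides  */ }

void music_tick(void)   /* called once per frame, e.g. from the vblank handler */
{
    if (tick == 0) {
        play_row(row);              /* start of a new row */
        row = (row + 1) % 64;       /* 64 rows per pattern */
    } else {
        update_effects();           /* in-between ticks: continuous effects */
    }
    tick = (tick + 1) % speed;
}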

With regular MIDI this is not possible. MIDI has very accurate timing, and if you were to just take any MIDI song, you will likely have to process multiple MIDI events during a frame. You never quite know when and where the next MIDI event may pop up. Which is why games generally quantize the MIDI down. However, quantizing it all the way down to around 50 or 60 Hz is not going to work well, so they generally still use a significantly higher frequency, like in the range 240-700 Hz. Which is an acceptable compromise, as long as you’re not trying to race the beam.

Back to the UltraSound

The specific characteristics and advantages of tracker-music should make it clear why it was so popular in the demoscene. And by extension you will probably see why demosceners loved the UltraSound so much: it seems to be ‘custom-made’ for playing sample-based tracker modules. ProTracker modules already sounded very good with 4 channels and 8-bit samples, even if on the PC you needed to dedicate quite a bit of CPU power for a software-mixing routine.

But now there was this hardware that gave you up to 32 channels, supported 16-bit samples, and even better: it did high-quality mixing in hardware, so like on the Amiga it took virtually no CPU time to play music at all. The UltraSound was like a ‘tracker accelerator’ card. If you heard the above examples with just 2 or 3 channels on primitive chips like the C64’s SID and the Amiga’s Paula, you can imagine what was possible with the UltraSound in capable hands.

Where things went wrong for the UltraSound is that trackers were not adopted by a lot of game developers. Which is strange in a way. On the Amiga, most games used one of the popular trackers, usually ProTracker. You would think that this approach would be adopted for the UltraSound as well. But for some reason, many developers treated it as a MIDI device only, and the UltraSound wasn’t nearly as impressive in games as it was in the demoscene.

So, let’s listen to two of my favourite demos from the time the UltraSound reigned supreme in the demoscene. The legendary demo Second Reality has an excellent soundtrack (arguably the highlight of the demo), using ‘only’ 8 channels:

And Triton’s Crystal Dream II also has some beautiful tracked music, again I believe it is ‘only’ 8 channels, certainly not the full 32 that the UltraSound offered (note by the way that the card pictured in the background of the setup menu is an UltraSound card):

What is interesting is that both these groups developed their own trackers. Future Crew developed Scream Tracker, and Triton developed FastTracker. They became the most popular trackers for PC and UltraSound.

So who won in the end? Well, neither did, really. The UltraSound came a bit too late. There were at least three developments that more or less rendered the UltraSound obsolete:

  1. CPUs quickly became powerful enough to mix up to 32 channels with 16-bit accuracy and linear interpolation in the background, allowing you to get virtually the same quality of tracker music from any sound card with a single stereo 16-bit DAC (such as a Sound Blaster 16 or Pro Audio Spectrum 16) as you do from the UltraSound.
  2. CD-ROMs became mainstream, and games started to just include CD audio tracks as music, which no sound card could compete with anyway.
  3. Gaming migrated from DOS to Windows. Where games would access sound hardware directly under DOS, in Windows the sound hardware was abstracted, and you had to go via an API. This API was not particularly suited to a RAM-based wavetable synthesizer like the UltraSound was, so again you were mostly in General MIDI-land.

As for MIDI, point 2 more or less sealed its fate in the end as well, at least as far as games are concerned. Soundtracks are ‘pre-baked’ to CD-tracks or at least digital audio files on a CD, and just streamed through a stereo 16-bit DAC. MIDI has no place there.

I would say that General MIDI has become obsolete altogether. It may still be a supported standard in the market, but I don’t think many people actually use it to listen to music files on their PCs anymore. It just never sounded all that good.

MIDI itself is still widely used as a basis for connecting synthesizers and other equipment together, and most digital audio workstation software will also still be able to import and export standard MIDI files, although they generally have their own internal song format that is an extension of MIDI, which also includes digital audio tracks. Many songs you hear on the radio today probably have some MIDI events in them somewhere.

Trackers are also still used, both in the demoscene, and also in the ‘chiptune’ scene, which is somewhat of a spinoff of the demoscene. Many artists still release tracker songs regularly, and many fans still listen to tracker music.

 


DMA activation

No, not the pseudoscience stuff, I am talking about Direct Memory Access. More specifically in the context of IBM PC and compatibles, which use the Intel 8237A DMA controller.

For some reason, I had never used the 8237A before. I suppose that’s because the DMA controller has very limited use. In theory it can perform memory-to-memory operations, but only between channels 0 and 1, and IBM hardwired channel 0 to perform memory refresh on the PC/XT, so using channel 0 for anything else has its consequences. Aside from that, the 8237A is a 16-bit DMA controller shoehorned into a 20-bit address space. So the DMA controller can only address 64k of memory. IBM added external ‘page registers’ for each channel, so you can store the high 4 bits of the 20-bit address there, and this will be combined with the low 16-bit address from the DMA controller on the bus. This means there are only 16 pages of 64k, aligned to 64k boundaries (so you have to be careful when allocating a buffer for DMA: you need to align it properly so you do not cross a page boundary, because beyond 64k the addressing just wraps around). However, since channel 0 was reserved for memory refresh on the PC/XT, they did not add a page register for it. This means that you can only do memory-to-memory transfers within the same 64k page of channel 1, which is not very useful in general.
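
A minimal sketch of that address split: turn a real-mode segment:offset pointer into a 20-bit physical address, send the top 4 bits to the page register and the low 16 bits to the 8237, and check that the buffer does not cross a page boundary. The function name and layout are my own:

int dma_addr(unsigned int seg, unsigned int off, unsigned int length,
             unsigned char *page, unsigned int *offset)
{
    unsigned long phys = ((unsigned long)seg << 4) + off;  /* 20-bit address */

    *page   = (unsigned char)(phys >> 16);    /* high 4 bits -> page register */
    *offset = (unsigned int)(phys & 0xFFFF);  /* low 16 bits -> 8237 address  */

    /* The 8237 wraps within its 64k page, so the transfer must not
       cross a page boundary; otherwise realign or split the buffer. */
    if ((unsigned long)*offset + length > 0x10000UL)
        return 0;
    return 1;
}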

On the AT however, they added separate memory refresh circuitry, so channel 0 now became available for general use. They also introduced a new page register for it (as well as a second DMA controller for 16-bit DMA transfers, as I also mentioned in this earlier article). So on an AT it may actually work. There is another catch, however: The 8237A was never designed to run at speeds beyond 5 MHz. So where the 8237A runs at the full 4.77 MHz on a regular PC/XT, it runs at half the clockspeed on an AT (either 3 or 4 MHz, depending on whether you have a 6 or 8 MHz model). So DMA transfers are actually slower on an AT than on a PC/XT. At the same time the CPU is considerably faster. Which means that most of the time, you’re better off using the CPU for memory-to-memory transfers.

Therefore, DMA is mostly used for transferring data to/from I/O devices. Its main uses are for floppy and harddrive transfers, and audio devices. Being primarily a graphics programmer, I never had any need for that. I needed memory-to-memory transfers. You generally don’t want to implement your own floppy and harddrive handling on PC, because of the variety of hardware out there. It is better to rely on BIOS or DOS routines, because they abstract the hardware differences away.

It begins

But in the past weeks/months I have finally been doing some serious audio programming, so I eventually arrived at a need for DMA: the Sound Blaster. In order to play high-quality digital audio (up to 23 kHz 8-bit mono) on the Sound Blaster, you have to set up a DMA transfer. The DSP (‘Digital Sound Processor’ in this context, not Signal) on the SB will read the samples from memory via DMA, using an internal timer to maintain a fixed sampling rate. So playing a sample is like a ‘fire-and-forget’ operation: you set up the DMA controller and DSP to transfer N bytes, and the sample will play without any further intervention from the CPU.

This is a big step up from the sample playing we have been doing so far, with the PC Speaker, Covox or SN76489 (‘Tandy/PCjr’ audio). Namely, all these devices required the CPU to output individual samples to the device, and the CPU was responsible for accurate timing. This requires either cycle-counted loops or high-frequency timer interrupts. Using DMA is more convenient than a cycle-counted loop, and far more efficient than having to handle an interrupt for every single sample. You can now play back 23 kHz mono 8-bit audio at little more than the cost of the bandwidth on the data bus (which is about 23 KB/s in this case: you transfer 1 byte for each sample), so you still have plenty of processing time left to do other stuff. The DMA controller will just periodically signal a HOLD to the CPU. Once the CPU acknowledges this with a HLDA signal, the DMA controller has taken over the bus from the CPU (‘stealing cycles’), and can put a byte from memory onto the bus for the I/O device to consume. The CPU won’t be able to use the bus until the DMA transfer is complete (this can either be a single-byte transfer or a block transfer).
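
Here is a minimal sketch of such a fire-and-forget transfer, assuming the usual defaults of base port 0x220 and DMA channel 1, and Borland-style outportb()/inportb() from <dos.h>. The page/offset values are the ones computed in the earlier sketch; this is a sketch, not a drop-in driver:

#include <dos.h>

#define DSP_WRITE 0x22C   /* write command/data; bit 7 of a read = busy */

static void dsp_write(unsigned char value)
{
    while (inportb(DSP_WRITE) & 0x80)
        ;                              /* wait for the busy flag to clear */
    outportb(DSP_WRITE, value);
}

void play_buffer(unsigned char page, unsigned int offset, unsigned int length)
{
    unsigned int count = length - 1;   /* both 8237 and DSP want length-1 */

    /* 8237 channel 1: mask, set single-cycle read mode, program address,
       count and page register, then unmask. */
    outportb(0x0A, 0x05);              /* mask channel 1                   */
    outportb(0x0C, 0x00);              /* clear the byte flip-flop         */
    outportb(0x0B, 0x49);              /* single transfer, read, channel 1 */
    outportb(0x02, offset & 0xFF);     /* address low byte                 */
    outportb(0x02, offset >> 8);       /* address high byte                */
    outportb(0x03, count & 0xFF);      /* count low byte                   */
    outportb(0x03, count >> 8);        /* count high byte                  */
    outportb(0x83, page);              /* page register for channel 1      */
    outportb(0x0A, 0x01);              /* unmask channel 1                 */

    /* DSP: set the sample rate via a time constant, enable the speaker,
       and start the single-cycle transfer with command 0x14. */
    dsp_write(0x40);                   /* set time constant                */
    dsp_write(256 - 1000000L / 22050); /* roughly 22 kHz                   */
    dsp_write(0xD1);                   /* speaker on                       */
    dsp_write(0x14);                   /* 8-bit single-cycle DMA output    */
    dsp_write(count & 0xFF);
    dsp_write(count >> 8);
}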

It’s never that easy

If it sounds too good to be true, it usually is, right? Well, in a way, yes. At least, it is for my chosen target: the original Sound Blaster 1.0. It makes sense that when you target the original IBM PC 5150/5160, that you also target the original Sound Blaster, right? Well, as usual, this opened up a can of worms. The keyword here is ‘seamless playback’. As stated above, the DMA controller can only transfer up to 64k at a time. At 22 kHz that is about 3 seconds of audio. How do you handle longer samples?

After the DMA transfer is complete, the DSP will issue an interrupt. For longer samples you are expected to send the next buffer of up to 64k immediately. And that is where the trouble is. No matter what you try, you cannot start the next buffer quickly enough. The DSP has a ‘busy’ flag, and you need to wait for the flag to clear before you send each command byte. I have measured that on my 8088 at 4.77 MHz, it takes 316 cycles to send the 3 bytes required for a new buffer command (the 0x14 command to play a DMA buffer, then the low byte and high byte of the number of samples to play). At 4.77 MHz, a single sample at 22050 Hz lasts about 216 CPU cycles. So you just cannot start a new transfer quickly enough. There is always a small ‘glitch’. A faster CPU doesn’t help: it’s the DSP that is the bottleneck. And you have to wait for the busy-flag to clear, because if you don’t, it will not process the command properly.

Nagging the DSP

Some early software tried to be creative (no pun intended) with the Sound Blaster, and implemented unusual ways to output sound. One example is Crystal Dream, which uses a method that is described by Jon Campbell of DOSBox-X as ‘Nagging the DSP’. Crystal Dream does not bother with the interrupt at all. Apparently they found out that you can just send a new 0x14 command, regardless of whether you received and acknowledged the interrupt or not. In fact, you can even send it while the buffer is still playing. You will simply ‘restart’ the DSP with a new buffer.

Now, it would be great if this resulted in seamless output, but experimentation on real hardware revealed that this is not the case (I have made a small test program which people can run on their hardware here). Apparently the output stops as soon as you send the 0x14 command, and it doesn’t start again until you’ve sent all 3 bytes, which still takes those 316 cycles, so you effectively get the exact same glitching as you would with a proper interrupt handler.

State of confusion

So what is the solution here? Well, I’m afraid there is no software-solution. It is just a design-flaw in the DSP. This only affects DSP v1.xx. Later Sound Blaster 1.x cards were sold with DSP v2.00, and Creative also offered these DSPs as upgrades to existing users, as the rest of the hardware was not changed. See this old Microsoft page for more information.  The early Sound Blasters had a ‘click’ in digital output that they could not get rid of:

If a board with the versions 1.x DSP is installed and Multimedia Windows is running in enhanced mode, a periodic click is audible when playing a wave file. This is caused by interrupt latency, meaning that interrupts are not serviced immediately. This causes the Sound Blaster to click because the versions 1.x DSP produce an interrupt when the current DMA buffer is exhausted. The click is the time it takes for the interrupt to be serviced by the Sound Blaster driver (which is delayed by the 386 enhanced mode of Windows).

The click is still present in standard mode, although it is much less pronounced because the interrupt latency is less. The click is more pronounced for pure tones.

The version 2.0 DSP solves this problem by using the auto-initialize mode of the DMA controller (the 8237). In this mode, the DMA controller automatically reloads the start address and count registers with the original values. In this way, the Sound Blaster driver can allocate a 4K DMA buffer; using the lower 2K as the “ping” buffer and the upper 2K as the “pong” buffer.

While the DMA controller is processing the contents of the ping buffer, the driver can update the pong; and vice versa. Therefore, when the DMA controller auto-initializes, it will already have valid data available. This removes the click from the output sound.

What is confusing here, is the nomenclature that the Sound Blaster Hardware Programming Guide uses:

Single-cycle DMA mode

They refer to the ‘legacy’ DSP v1.xx output mode as ‘single-cycle DMA mode’. Which is true, in a sense: You program the DMA controller for a ‘single transfer mode’ read. A single-cycle transfer means that the DMA controller will transfer one byte at a time, when a device does a DMA request. After that, the bus is released to the CPU again. Which makes sense for a DAC, since it wants to play the sample data at a specific rate, such as 22 kHz. For the next byte, the DSP will initiate a new DMA request by asserting the DREQ line again. This is as opposed to a ‘block transfer’, where the DMA controller will fetch the next byte immediately after each transfer, so a device can consume data as quickly as possible, without having to explicitly signal a DREQ for each byte.

Auto-Initialize DMA mode

The ‘new’ DSP 2.00+ output mode is called ‘auto-initialize DMA mode’. In this mode, the DSP will automatically restart the transfer at every interrupt. This gives you seamless playback, because it no longer has to process a command from the CPU.

The confusion here is that the DMA controller also has an ‘autoinitialize mode’. This mode will automatically reload the address and count registers after a transfer is complete. So the DMA controller is immediately reinitialized to perform the same transfer again. Basically the same as what the DSP is doing in ‘auto-initialize DMA mode’. You normally want to use both the DMA controller and DSP in their respective auto-init modes. Double-buffering can then be done by setting up the DMA controller with a transfer count that is twice as large as the block size you set on the DSP. As a result, the DSP will give you an interrupt when it is halfway through the DMA buffer, and another one when it is at the end. That way you can re-fill the half of the buffer that has just finished playing at each interrupt, without any need to perform extra synchronization anywhere. The DMA controller will automatically go back to the start of the buffer, and the DSP also restarts its transfer, and will keep requesting data, so effectively you have created a ringbuffer:

[Diagram: SB DMA ring buffer]

However, for the DMA controller, this is not a separate kind of transfer, but rather a mode that you can enable for any of the transfer types (single transfer, block transfer or demand). So you are still performing a ‘single transfer’ on the DMA controller (one byte for every DREQ), just with auto-init enabled.
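
Putting the two auto-init modes together, here is a minimal sketch of the double-buffering setup described above (again assuming base port 0x220 and DMA channel 1, reusing the busy-wait dsp_write() helper from the earlier sketch; a DSP v2.00+ is required for the 0x1C command):

#include <dos.h>

void dsp_write(unsigned char value);   /* busy-wait write, as in the earlier sketch */

void start_autoinit(unsigned char page, unsigned int offset,
                    unsigned int buffer_len)
{
    unsigned int block = buffer_len / 2;       /* DSP block = half the buffer */

    outportb(0x0A, 0x05);                      /* mask channel 1               */
    outportb(0x0C, 0x00);                      /* clear the byte flip-flop     */
    outportb(0x0B, 0x59);                      /* single transfer + auto-init, read, ch 1 */
    outportb(0x02, offset & 0xFF);
    outportb(0x02, offset >> 8);
    outportb(0x03, (buffer_len - 1) & 0xFF);   /* count covers the full buffer */
    outportb(0x03, (buffer_len - 1) >> 8);
    outportb(0x83, page);
    outportb(0x0A, 0x01);                      /* unmask channel 1             */

    dsp_write(0x48);                           /* set block transfer size      */
    dsp_write((block - 1) & 0xFF);
    dsp_write((block - 1) >> 8);
    dsp_write(0x1C);                           /* auto-init 8-bit DMA output   */
}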

You can also use this auto-init mode when using the legacy single-cycle mode of the DSP, because the DSP doesn’t know or care who or what programs the DMA, or what its address and count are. It simply requests the DMA controller to transfer a byte, nothing more. So by using auto-init on the DMA controller you can at least remove the overhead of having to reprogram DMA at every interrupt in single-cycle mode. You only have to send a new command to the DSP, to minimize the glitch.

Various sources seem to confuse the two types of auto-init, thinking they are the same thing and/or that they can only be used together. Not at all. In theory you can use the single-cycle mode for double-buffering in the same way as they recommend for auto-init mode: Set the DMA transfer count to twice the block size for the DSP, so you get two interrupts per buffer.

And then there is GoldPlay… It also gets creative with the Sound Blaster. Namely, it sets up a DMA transfer of only a single byte, with the DMA controller in auto-init mode. So if you start a DSP transfer, it would just loop over the same sample endlessly, right? Well no, because GoldPlay sets up a timer interrupt handler that updates that sample at the replay frequency.

That is silly and smart at the same time, depending on how you look at it. Silly, because you basically give up the advantage of ‘Fire-and-forget’ DMA transfers, and you’re back to outputting CPU-timed samples like on a PC speaker or Covox. But smart, for exactly that reason: you can ‘retrofit’ Sound Blaster support quite easily to a system that is already capable of playing sound on a PC speaker/Covox. That is probably the reason why they did it this way. Crystal Dream also uses this approach by the way.
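
A minimal sketch of the trick as I read it (not GoldPlay’s actual code): the 8237 runs in auto-init mode with a count of 1, so it keeps re-sending the same byte, and a Borland-style timer ISR running at the replay rate overwrites that byte with the next sample. next_sample() is a hypothetical placeholder for the software mixer.

#include <dos.h>

unsigned char next_sample(void);          /* hypothetical software mixer */

static unsigned char far *dma_byte;       /* points at the 1-byte DMA buffer */

void interrupt sample_timer_isr(void)
{
    *dma_byte = next_sample();            /* CPU-timed, like Speaker/Covox */
    outportb(0x20, 0x20);                 /* EOI to the master PIC */
}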

There is a slight flaw there, however: the DSP does not run in sync with the CPU, since the DSP has its own crystal on the card. What this means is that eventually you will probably either miss a sample completely or output the same sample twice, when the timer and the DSP drift too far out of sync. But since these early SB cards already have glitches by design, one extra glitch every now and then is no big deal either, right?

The best of both worlds

Well, not for me. I see two requirements here:

  1. We want as few glitches as possible.
  2. We want low latency when outputting audio.

For the DSP auto-init mode, it would be simple: You just set your DMA buffer to a small size to have low latency, and handle the interrupts from the DSP to update the buffers. You don’t have to worry about glitches.

For single-cycle mode, the smaller your buffers, the more glitches you get. So the two requirements seem mutually exclusive.

But they might not be. As GoldPlay and Crystal Dream show, you don’t have to match the buffer size of the DMA with the DSP at all. So you can set the DSP to the maximum length of 64k samples, to get the least amount of glitches possible.

Setting the DMA buffer to just 1 sample would not be my choice, however. That defeats the purpose of having a Sound Blaster. I would rather set up a timer interrupt to fire once every N samples, so the timer interrupt would be a replacement for the ‘real’ DSP interrupt you’d get in auto-init mode. If you choose your DSP length to be a multiple of the N samples you choose for your DMA buffer, you can reset the timer every time the DSP interrupt occurs, so that you re-sync the two. Be careful of the race condition here: in theory, the DSP and the timer should fire at the same time at the end of the buffer, and since they run on different clock generators, you never know which will fire first.

One way to get around that would be to implement some kind of flag to see if the timer interrupt had already fired, e.g. a counter would do. You know how many interrupts to expect, so you could just check the counter in the timer interrupt, and not perform the update when the counter exceeds the expected value. Or, you could turn it around: just increment the counter in the timer interrupt. Then when the DSP interrupt fires, you check the counter, and if you see that the timer had not fired yet, you can perform the last update from the DSP handler instead. That removes the branching from the timer interrupt.
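
A minimal sketch of that counter hand-off (my own reading of the scheme, not a complete player); refill_chunk(), restart_dsp_buffer() and reset_timer() are hypothetical placeholders, and base port 0x220 with an IRQ below 8 is assumed:

#include <dos.h>

void refill_chunk(int index);     /* hypothetical: fill one chunk of the buffer */
void restart_dsp_buffer(void);    /* hypothetical: send the next 0x14 sequence  */
void reset_timer(void);           /* hypothetical: re-sync the timer to the DSP */

#define CHUNKS_PER_BUFFER 16      /* timer interrupts expected per DSP buffer */

static volatile int chunks_done = 0;

void interrupt timer_isr(void)    /* fires once every N samples */
{
    refill_chunk(chunks_done);            /* fill the chunk that just finished */
    chunks_done++;
    outportb(0x20, 0x20);                 /* EOI to the master PIC */
}

void interrupt dsp_isr(void)      /* fires at the end of the DSP buffer */
{
    inportb(0x22E);                       /* acknowledge the DSP interrupt */
    if (chunks_done < CHUNKS_PER_BUFFER)  /* the timer lost the race */
        refill_chunk(chunks_done);
    chunks_done = 0;
    restart_dsp_buffer();
    reset_timer();
    outportb(0x20, 0x20);                 /* EOI */
}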

Another way could be to take the ‘latched timer’ approach, as I also discussed in a previous article. You define a list of PIT count values, and update the count at every interrupt, walking down the list. You’d just set the last count in the list to a value of 0 (interpreted as 65536 ticks), so you’re sure it never fires before the DSP does.
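
And a minimal sketch of the latched-timer variant, assuming PIT channel 0 has already been programmed for lobyte/hibyte access; the count values in the list are just example numbers:

#include <dos.h>

static unsigned int counts[] = { 1192, 1192, 1192, 0 };   /* example values */
static int next_count = 0;

void load_next_count(void)   /* called from the timer ISR */
{
    unsigned int c = counts[next_count];
    if (c != 0)
        next_count++;              /* stay on the final 0 entry (65536 ticks) */
    outportb(0x40, c & 0xFF);      /* PIT channel 0: low byte  */
    outportb(0x40, c >> 8);        /*                high byte */
}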

Once you have that up and running, you’ll have the same low-latency and low CPU load as with DSP v2.00+, and your glitches will be reduced to the minimum possible. Of course I would only recommend the above method as a fallback for DSP v1.xx. On other hardware, you should just use auto-init, which is 100% glitch-free.

Update 17-04-2017: I read the following on the OSdev page on ISA DMA:

Some expansion cards do not support auto-init DMA such as Sound Blaster 1.x. These devices will crash if used with auto-init DMA. Sound Blaster 2.0 and later do support auto-init DMA.

This is what I suspected was part of the ‘confusion’ I described above regarding auto-init mode on the SB DSP vs the DMA controller, but I did not want to include anything on it until I was absolutely sure. NewRisingSun has since verified this on a real Sound Blaster with DSP v1.05 over at the Vogons forum. He ran my test program, which uses auto-init DMA, while performing single-cycle buffer playback on the DSP (the only type the v1.xx DSP supports). And it plays just fine, like the DSP v2.xx and v3.xx we’ve tested with. So no crashing. The quote from OSDev probably confuses DMA auto-init mode with the fact that Sound Blaster 1.x cards with DSP v1.xx do not have the auto-init DSP command (which won’t make them crash either; they just don’t know the command, so they won’t play). In short, that information is wrong. DMA auto-init transfers work fine on any Sound Blaster, and are recommended, because they save you the CPU overhead of reprogramming the DMA controller after every transfer. You only have to restart the DSP.
