My most ‘DailyWTF’ moment?

I’d like to tell a little anecdote, something that happened a few years ago. Things went wrong in so many ways…

Let me try to paint the situation here: A large worldwide company, with development centers around the world (in Europe, the US, India and Australia). We were in Europe, and the chief architect was literally on the other side of the world, in Australia.

One could ask whether it is a good idea to opt for an Agile development strategy with an organizational structure like that, where direct, personal contact is physically impossible… but that’s something I may get into some other time.

In this particular case it was about the upgrade of our build system. Namely, we had just moved to Git, a distributed version control system. However, we were not using it in a very distributed way yet: so far, there was a central repository in Australia, and they had the only working build server, because they were the ones who led the upgrade to Git and the related build environment.

The next step was for each location to have their own local Git repository, so that developers could have local branches, and test them on a local build server first, before things were pushed to the central server in Australia. So a colleague of mine had installed all the required software on our build server. Things worked, sort of, but the build times were incredibly long, and the build would randomly throw exceptions, especially during the unit tests.

Of course, the people in Australia would not bother looking into the issue. After all, it worked on *their* server, right? (Okay, their server had better specs, but our server was not completely useless either. If I recall correctly, we had two dual-core CPUs, and 16 GB of internal memory). So there couldn’t possibly be a problem with the build script they wrote, right? Except that they also had some minor issues with unit tests occasionally. However, they ‘solved’ that by reducing the number of threads that the application’s threadpool used (keep that in mind for later).

Well, that meant we were left to our own devices. My colleague had documented his steps of the installation, then asked me if I could do a complete reinstall, and see if I could find any problems. The installation itself seemed to go just fine. However, the build issues did not disappear.

So I decided to monitor what was going on exactly during a build, especially during the unit test part. As it turned out, the unit tests spawned an awful lot of processes and used up all memory, overloading the system and driving it into heavy swapping. As a result, the unit tests slowed to a crawl, causing random timeouts and other failures. So at least we now knew what was wrong. It was not just a hardware problem or an installation issue. But now I still had to find out why the build was starting so many unit tests at a time.

As it turned out, the build system used some scripts written in Ruby, a language I was not familiar with. But, apparently the scripts were written by the chief architect in Australia… and they couldn’t possibly have any issues, right? (Are you starting to see a pattern yet? Git, Ruby… especially if I also tell you that they recently moved from a Direct3D9 renderer to Ogre? Indeed, it seems the chief architect was some kind of Linux/open source fanboy, even though we were developing a Windows-only application, mostly with C#/.NET technology. In fact, the Ogre-thing has another nice snafu… When the first Ogre-versions of the application arrived, things started to break on many of our test-PCs. The problem? Although Ogre can do both Direct3D and OpenGL, they defaulted to OpenGL. And although the rendering was far too trivial to require shaders, they had written everything with GLSL. Well, obviously that’s not going to work on systems with a DX9-class Intel IGP, such as the ones in our test-PCs. They don’t support GLSL. At best they support the ARB assembly shader language. I could have told them that, but I guess they thought they knew better than to ask for any input from an experienced 3D developer…)

What’s worse… the scripts were part of the application source code in the repository itself. The idea being that you would always build with the latest version of the script. While this is a nice idea in theory, it does not lend itself very well to debugging/experimenting, or just having customized scripts for each individual build server. Even worse… Although we are supposed to be Agile, this script was off-limits to any developer. While I could create my own branch to test modified scripts on our own build server, in the end the modifications had to be approved and committed by Mr. Chief Architect. Which would not be such a problem if he wasn’t such a stubborn know-it-all.

Anyway, so my only option was to learn some Ruby (what a waste of time) and look into modifying the script myself, since I was not going to get any help from the original author. Once I figured out what it actually did, I didn’t quite know whether to laugh or cry… The guy had made a proper fork-bomb! In an ill-fated attempt at exploiting all the cores in the build server, he had just made a loop that started a new thread for each unit test, and then created the unit test process (which itself would spawn multiple threads during the unit testing). I have no idea how someone that incompetent can ever become chief architect, but oh well…
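For illustration, the pattern described above looks roughly like the following Ruby sketch. This is not the original script, which is not reproduced here. The method name `run_all_unbounded` and the placeholder commands are made up for the example.

```ruby
require 'rbconfig'

# Hypothetical reconstruction of the problematic pattern, NOT the original script.
# Each command stands in for one unit test executable.
def run_all_unbounded(commands)
  # One thread per unit test, all started at once: with N tests, N processes
  # run concurrently, and each process may start threads of its own.
  threads = commands.map do |command|
    Thread.new { system(*command) }
  end

  # Wait for every process; true means the process exited with status 0.
  threads.map(&:value)
end

# Example: harmless placeholder commands instead of real test runners.
placeholder_tests = Array.new(4) { [RbConfig.ruby, '-e', 'exit 0'] }
p run_all_unbounded(placeholder_tests)
```

With four placeholder commands this is harmless. With hundreds of real unit tests, nothing stops hundreds of processes from being started at the same moment, which is how a server ends up swapping.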

So THAT is why our server gets bogged down! Apparently their server got away with it (for now) because it has more cores and more memory, so it doesn’t quite come to a screeching halt (yet). But it still runs far more concurrent processes than is good for it. This also explains why they had to modify the thread count in the threadpool to avoid problems during unit testing. They never bothered to look into the actual cause of the issue; they just cured the symptoms.

Great, so with my newfound Ruby skills I had to create a system where the number of concurrent threads was limited, while all unit tests still ran, and the script waited for all of them to complete, logging the data. Right, well, after a bit of experimenting, I got things to work quite nicely, and even made the threadcount dependent on an environment variable (and a default value if no variable was set). This way we could have per-server customization of the build process. Not only could our server reliably build all projects and run unit tests now, its performance was also increased dramatically. Where a full build and test would take about 45 minutes at first, it was now reduced to less than 20 minutes.
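A minimal sketch of that approach, under stated assumptions: the environment variable name `UNIT_TEST_MAX_PROCESSES`, the default of 4, and the method names are all hypothetical, since the real script is not shown here. It keeps a list of threads that each wait on one unit test process, polls for finished ones, and only starts a new one when there is room.

```ruby
require 'rbconfig'

# Hypothetical names: the real variable name and default are not given here.
MAX_PROCESSES_ENV = 'UNIT_TEST_MAX_PROCESSES'
DEFAULT_MAX_PROCESSES = 4

# Reads the limit from the environment, falling back to the default when the
# variable is missing or not a positive integer.
def max_test_processes(env = ENV)
  value = env[MAX_PROCESSES_ENV].to_i
  value.positive? ? value : DEFAULT_MAX_PROCESSES
end

# Runs every command, but never more than `limit` at the same time.
# Returns [command, success] pairs in order of completion.
def run_tests_bounded(commands, limit = max_test_processes)
  pending = commands.dup
  unit_test_processes = [] # threads, each one waiting on a single test process
  results = []

  until pending.empty? && unit_test_processes.empty?
    # Collect the results of threads whose process has finished.
    finished, unit_test_processes = unit_test_processes.partition { |t| !t.alive? }
    finished.each { |t| results << t.value }

    # Top up to the limit with new test processes.
    while unit_test_processes.size < limit && !pending.empty?
      unit_test_processes << Thread.new(pending.shift) { |command| [command, system(*command)] }
    end

    sleep 0.05 # poll interval; an arbitrary value chosen for this sketch
  end

  results
end
```

Calling `run_tests_bounded(commands)` without a second argument picks up the per-server limit from the environment, so each build server can be tuned without touching the script.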

So, I presented my solution to the chief architect. When testing it on their system, they also found a significant speedup. But, did he thank me for all the work I put in, learning some nasty scripting language just so I could fix his beginner mistakes, and throwing in some optimizations in the process? Well no, of course not. Instead, before accepting my code, he ‘corrected’ it…

Semantic variable naming

The issue is this: I had created a list of running threads for the unit test processes. This was required so that the Ruby script could limit the number of threads to a maximum and periodically check if threads were complete, so that new threads could be started. I had given this list a name like ‘unitTestProcesses’, which was of type Thread[].

Now, he had ‘corrected’ this to ‘unitTestThreads’, “… because they are threads”. NO NO NO NO NO! I am not an idiot, Mr. Chief Architect. I *know* they are threads, that is obvious from the type, is it not? So it is quite redundant to put that in the variable name. No, the reason why I called them processes is because of what these threads *represent*, semantically. Each thread represents a unit test process. When you use the name ‘process’, the variable name implicitly tells you that it is creating processes, and by extension you will know that a single process may have multiple threads. So by no means are you controlling the number of *threads* on the system (which is what you’d want, ideally, but you don’t have that level of control), but merely the number of *processes* on the system. When using the name ‘thread’, you are more or less implying that you control the number of threads, which you don’t. So that name is wrong on many levels. Variable names that contain redundant information are a pet peeve of mine anyway… but this one takes the cake. I guess he still doesn’t quite understand it?

Well, he was kind enough to take away any last doubt in that respect. Namely, he mentioned that the environment variable that I had introduced should be set to the number of cores in the system. Again, NO NO NO NO NO! I thought it’s multiprocessing 101, but I don’t know what education a chief architect has… Having the number of threads (well, let’s ignore the fact that we can’t control the actual number of threads here for a moment) equal to the number of cores in a system only makes sense if these threads are ‘worker threads’. The kind of thread that will process all of the time. Many threads in a system just sit idle most of the time, and wait for I/O events and such. As a result, you can have multiple threads running on the same core without losing any performance.

Our unit tests were mostly like that as well: many of them were quite dependent on disk I/O and such. To find the optimum value for the number of unit test processes, you will have to experiment. I found that on our build server, we could easily run three times as many concurrent unit tests as we had cores in the system. You’d think that a chief architect would know about such things… but apparently not… That might also explain why the application ran like a dog anyway.
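The kind of experiment meant here can be sketched in a few lines of Ruby. Everything in it is an assumption made for illustration: `sleep` stands in for a unit test that mostly waits on I/O, and the jobs are handed out through a queue to a fixed set of worker threads, which is simpler than the polling loop in the actual build script. Real tests would have to be timed on the real server.

```ruby
# Stand-in for an I/O-bound unit test: the thread waits instead of computing.
def simulated_io_bound_test(seconds)
  sleep(seconds)
end

# Runs job_count simulated tests with the given concurrency and returns the
# elapsed wall-clock time in seconds.
def time_batch(job_count, concurrency, seconds_per_job)
  queue = Queue.new
  job_count.times { queue << seconds_per_job }
  queue.close # pop returns nil once the closed queue is empty

  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  workers = Array.new(concurrency) do
    Thread.new do
      while (seconds = queue.pop)
        simulated_io_bound_test(seconds)
      end
    end
  end
  workers.each(&:join)
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Try a few concurrency levels and compare the timings.
[1, 2, 4, 8].each do |concurrency|
  printf("%d workers: %.2f s\n", concurrency, time_batch(8, concurrency, 0.05))
end
```

Because the simulated tests only wait, the elapsed time keeps dropping as workers are added, regardless of how many cores the machine has. Real unit tests also do some computing, so at some point the gains level off, and that point is what you have to find by measuring.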

Anyway, I’m a professional, and a team player, so when he mailed me about the ‘unitTestProcesses’ that he had changed, I just kindly explained my idea behind my name, but then added that I did not mind which name he would pick. I would be happy either way, as long as the code got committed, so our build server would get its fix. That is the most important thing here, not some ego-fight over whose name is better.


5 Responses to My most ‘DailyWTF’ moment?

  1. Great post.
    I admire your courage writing about this experience, we don’t see much of that in my homeland Israel.

  2. Klimax says:

    Just few things. Have you already sent this to The Daily WTF? I think they could use another Chief Architect is Know-it-all yet is incompetent…

    Second, at least GLSL is relatively easy to translate to HLSL. (Import by OpenGL?)

    Anyway, this has sort of happier ending then most WTF like this…

  3. Luke W says:

    As Abraham said, certainly was a courageous move. Great to see a resolution rather than an ego war resulting. GTD! Thanks for sharing!
