It is time to address another long-running pet peeve of mine: the notion that source code accurately documents a piece of software. On the surface this might seem plausible, but if you look beyond that surface, it is in fact a fallacy.
I think the best way to approach this is to point out that there are multiple phases in writing software. Here we can distinguish the design-phase and the implementation-phase. I will start with the latter.
The implementation-phase is the phase where the actual source code is written. By its very nature, the source code is an accurate record of this phase. As a result, the source code is also a good way to find and fix any errors that were introduced during the implementation. However, there is more to writing software than just implementing it…
The design-phase is where the REAL programming happens. Programming is not just about writing code; it is about solving a problem. Writing the code (the implementation-phase) is trivial once the problem has been solved. If writing the code turns out not to be trivial, then apparently the problem has not been solved completely yet, and the design was not detailed enough.
The design-phase can be taken quite broadly, including such details as what the problem is that is to be solved, and even why that problem is chosen to be solved. You will generally find such details in software documentation, explaining what the software is supposed to do, and who or what it is aimed at. However, you will rarely find such information in the source code, and it is often impossible to derive the intended use or uses from the source code, unless you are already familiar with the problem (in which case you implicitly already know about the design ‘documentation’).
At a more technical level, the design-phase will also specify details of what solution was taken, and why this solution was taken. It might also touch upon other possible solutions, and explain why these have been passed up. Looking at the source code will only show you the solution that was implemented, but nothing about any possible alternatives, or the reasons behind the choice for this particular solution.
Getting even more technical, the implementation may make a number of implicit assumptions. Think about the size of certain buffers, or upper and lower bounds for certain variables, and other ‘constants’ like that. When looking at the source code, you may be able to tell what these ‘constants’ are, but it will generally not tell you why they were designed that way. Were they chosen to match a certain protocol? Do they have to do with the total memory assumed to be available on the machine? Perhaps the CPU’s caches? Again, you need more background information, more documentation to really understand what the source code is doing there, and why. If you understand these things simply by reading the source code, then you apparently already have some prior knowledge about this. You may have read some documentation from similar software at an earlier stage.
Source code is the ‘what’, not the ‘why’
To summarize this, I can put it in a single sentence:
Source code is the ‘what’, not the ‘why’
In other words: it tells you WHAT the software is doing, but not WHY it is doing it this way. If there are errors in the design, they will be hard, if not impossible, to find by just studying the source code. It's similar to semantic errors: the code appears to be doing exactly what it is supposed to do, it's just not what you really wanted it to do.
I can give a few examples that I think will demonstrate quite clearly why source code is not a good means of documentation (in the sense that it does not teach you what you really need to know about the software).
An interesting area of software is that of video compression and decompression. Many popular compression schemes for video rely on lossy compression: they ‘leave out’ unimportant details. A lot of study has gone into determining the different levels of detail and encoding them as efficiently as possible. You could take the source code of an MPEG codec, but you are not likely to understand WHY it works the way it does without any external documentation. For example, MPEG uses the Huffman algorithm for entropy encoding. However, the Huffman codes are pre-determined by the standard, and encoded in a pre-optimized table. As a result, the code looks nothing like ‘textbook’ Huffman, where you use a tree to look up a code bit by bit (not that any proper implementation of Huffman does that anyway: often such an optimized table is constructed at runtime). It is just a simple lookup in a table that is hardcoded into the source. There is no way of telling where this table comes from, and it may even be hard to understand what the table is supposed to do, if you only look at the source code and have no further knowledge of the MPEG standard.
Another classic example I like to use is that of Marching Cubes: an algorithm to polygonize an isosurface for a 3D scalar field. Even more than with MPEG’s Huffman example, most of the design of the Marching Cubes algorithm is pre-encoded into tables.
The basic idea is that you divide the scalar field into an even 3D grid. This grid can be seen as a set of cubes, through which we will be ‘marching’ while building the isosurface. The idea is that if you can approximate the surface within a cube with a few polygons, you can do this for all cubes, and you have your polygonized surface (divide-and-conquer).
The problem is now reduced to approximating the surface within a single cube. This is done by looking at the 8 vertices of the cube and classifying them: is the field value at the vertex above or below the isovalue (respectively outside or inside the isosurface volume)? This yields 8 binary values, and thus 256 different cases for which a surface has to be constructed. By exploiting symmetry (both rotational and inverse), these can be reduced to a limited set of 15 base cases. A transition from outside to inside means that a polygon vertex will be created at the center of the cube’s edge between an outside vertex and an inside vertex. This way the polygons approximating the isosurface are formed.
The implementation will generally do little more than classify the cube’s vertices and perform a table lookup. The table itself is hard-coded into the source code (generally just the base cases, expanded to the full 256 cases at runtime, to save space). What this table means exactly, or how it was designed and constructed, is impossible to determine from the source alone (and also quite impossible to document with source code comments; you will want images of the base cases to make sense of it all).
The joke is that there is a problem with the original algorithm. Most popular implementations will not simply place the polygon vertices at the center of each edge, but rather use linear interpolation along the edge to better approximate the surface. This does not work properly, because the inverse symmetry affects the ‘bias’: if a cube was ‘mostly inside’ the isosurface, inverting the classification of its vertices results in a case that is ‘mostly outside’ the isosurface. When you then move the vertices around by interpolation, this creates holes in the surface.
Instead, the inverse symmetry needs to be dropped from the table, and an additional 8 base cases with the proper bias need to be added. This problem is impossible to solve if you only have the source code of an implementation with the original base cases. And yes, these flawed implementations DO exist. Marching Cubes is not that hard to implement if you understand the algorithm. But it is virtually impossible to derive the algorithm from the source code, let alone fix a common bug that creates holes in the surface. You need to know the ‘why’.
I would go as far as saying that source code is only required for two reasons:
- To find and fix bugs in the implementation, as explained above.
- As a last resort when proper documentation does not exist.
With proper documentation for the design-phase, the implementation should be trivial, and therefore the actual source code provides no extra informational value about the inner workings of the software (aside from flawed implementations, as mentioned in the first point above). More often than not, hundreds of lines of source code can be explained with just a few sentences and/or a few simple diagrams. A picture is worth a thousand words. Naturally, you should not take this the wrong way: you should still aim to make the source code as easy to read and understand as possible. Use proper variable names, and structure your code nicely for readability (use indentation and blank lines to group and separate code). And use comments for anything that is not immediately obvious. After all, you may have to maintain the implementation at a later stage.