Just keeping it real… bugfixing like it’s 1991

As you may have noticed, the 1991 donut intro did not have any music. I did not cover it in the previous blog, but there was some music planned for this small intro. I chose to use EdLib, because AdLib was one of the few sound cards available for PC back in 1991, and the EdLib replayer is relatively light on CPU, and contains only 16-bit code, so it would work on a 286 like the rest of my code.

However, at the time I was having problems getting the EdLib code to work together with the rest of the intro. I could get the EdLib code to work in a standalone program, but the whole system would crash when I called the same EdLib routines from the intro. I tried to debug it at the party place, but I could not pinpoint the cause at the time, or find a good workaround.

Over the weekend, I tried to give it another look. I had already arrived at the point where I suspected that malloc()’s heap was getting corrupted. And it seemed unlikely that the EdLib code was causing this, since there is nothing suspect going on in the EdLib code. It doesn’t make heavy use of the stack, and it does not call any kind of DOS or BIOS interrupts either, certainly nothing to do with allocating memory. Besides, it would often crash on the first call into the player code, so it looked like the player code was getting corrupted by something else.

I have had a discussion over the Second Reality code with some other demo coders recently. And one of the things we discussed was that they constructed a sort of ‘loader’, which provided various functions to other programs, including the music, through an interrupt handler. So I figured I could apply that idea here: if I write a loader in asm, I know 100% sure that it is not doing anything weird to the memory. If that loader then loads my C program, the C program should stay within its own memory as well, and the two would not corrupt eachother.

So I tried that… but I introduced a different bug there, which I overlooked. Initially I thought it was the same bug, and I was just blind to it, as I’d been looking at this code for far too long. So I asked Andrew Jenner and Trixter if they could have a look, because I was about to give up. Andrew found the bug, I forgot to call the function that set up the int handler. Apparently that got deleted while I was experimenting and fixing other things, and I overlooked that. Once I put the line back in there, things started working as expected: the intro code would just call into the music player once a frame, and the two processes would live happily side-by-side. Finally I had a way to play music for my intro!

However, we still had not found the actual bug, we merely had a workaround at this point. So we were not satisfied yet, there was still a challenge to overcome. I decided to set up a minimal program in C, which only loads the music from disk and plays it, to see if we could figure out exactly what goes wrong, and why. I then sent the program off to Andrew and Trixter with my initial analysis:

As far as I could trace it, it seems to be a bug in the linker.
I printed out the address of Player, and the address that malloc() returned.
I also printed out the bytes for the entry point of Player (not Player itself, but offset 0x62e that it jumps to).
This is what happens:
Player: 05110000
Player bytes: 1E060E1F
pSongData: 03D91704
Player bytes: 000C8019

So apparently it mallocs memory somewhat below the player… And after the fread(), the player got overwritten.
So we have 0x5110 for our player, and 0x5494 for our allocated memory. Which is slap-bang in the middle of the player code, right?
So obviously things die when you try to load your song there (or a torus for that matter).

So the question is: why is malloc() returning a block of memory that is part of the mplayer.obj in memory? The song I’m loading is only 4kb, and that is the only data there is. We have separate data and code segments, since it’s a small model, so in theory I should be able to allocate close to 64k before I need to worry about stack trouble.
So to me it looks like there is just something broken in the generated MZ header or something, causing malloc() to place the heap in the wrong place. It is probably placed after the code segment generated by the C compiler, but it does not seem to pay attention to the segment in other objs.
In which case I guess there are 3 possible locations for the bug:
1) The .obj file has an incorrect header, causing the linker to generate incorrect information -> assembling with the version of tasm included with TC++ 3.1 may solve that
2) The .obj file is correct, but the linker generates incorrect information anyway -> linking with a different linker (Microsoft?) may solve that
3) The headers are correct, but there is a bug in the libc causing malloc to interpret the headers wrongly -> roll your own malloc()?

I then tried to rebuild the code as stated, but that did not solve it. I also tried to use the Microsoft linker on the code, but although I managed to create a binary, it did not run. It would probably be quite a chore to figure out how exactly a Turbo C++ binary is set up. Then Andrew responded with his analysis:

The problem is as follows:
* The malloc() implementation looks at a variable called __brklvl to decide where to start allocating memory.
* The startup code (c0.asm) initializes the stack and __brklvl using the value of a symbol called edata@ which is in the segment _BSSEND.
* mplayer.obj doesn’t use the normal _TEXT and _DATA segments but instead has a single segment called MUSICPLAYER for both its code and its data. MUSICPLAYER has no segment class.
* tlink places segments without a segment class after the normal _TEXT, _DATA, _BSS, _BSSEND and _STACK segments – i.e. in the very place the startup code assumes is empty.

So the fix to the problem is a one-liner – in mplayer.asm just change the line that says:
musicplayer     segment public
to:
musicplayer     segment public ‘far_data’

Then MUSICPLAYER will be placed by the linker after _TEXT and before _DATA, so it won’t collide with anything. I suggest using ‘far_data’ instead of ‘code’ or ‘data’ so that MUSICPLAYER doesn’t take up any space in your normal code and data segments (which are limited to 64kB).

And there we have it! Our answer at last! After changing the segment class of the EdLib player, the code would finally work properly, and I can build a single-file intro with music. It seems I was on the right track with my initial analysis, but it seems that there is not really a conclusive answer as to what the bug really is. You could look at it from various angles:

  1. There was indeed wrong information in the .obj, because it was confusing the linker as to where the _BSSEND should be placed. Rebuilding it (after adding ‘far_data’) fixed it.
  2. The linker was in error, because if it had linked the segments in a different order, _BSSEND would have ended up in the right place after all.
  3. Libc is in error, since you cannot reliably assume that _BSSEND is the last bit of used memory in the binary. Perhaps it should have tried to parse the MZ header fields instead, to work out where the memory is.
  4. One should not link additional segments to a small model program, because that is not in line with the small model definition.

I personally don’t really agree with #4. In my interpretation the model only applies to the code that the compiler generates. Since you have far pointers and far function definitions, and even farmalloc() and farfree(), there should be no reason why you can’t interface code and data outside your own code and data segments.

I am leaning mostly towards #2 myself. Namely, #1 would not be a conclusive solution. If you are writing the code yourself, you have control over the segments. But in this case, MPLAYER.OBJ was supplied by a third party, and normally, changing the segment class would not be an option.
And if #2 is fixed, then the _BSSEND assumption will always be correct. The linker already seems to have some kind of predefined order for segments with known classes. If it would just place class-less segments at the front of that order, rather than at the end, the problem would be solved.
#3 would also be a robust solution, but it would make libc larger and more complex, so #2 would be preferred, especially in that era.

These are the most annoying, but at the same time interesting bugs. Bugs you just don’t see, because they aren’t in your code. You seem to be doing everything right, but it just does not work. Anyway, I may release a final version of the 1991 donut intro, with music, and perhaps a few other small tweaks. But at any rate, the next release WILL have music!

Advertisements
This entry was posted in Oldskool/retro programming and tagged , , , , , , , , , , , , , , , , , , , , , . Bookmark the permalink.

2 Responses to Just keeping it real… bugfixing like it’s 1991

  1. As far as I know, Microsoft, Borland and Watcom linkers simply order segments differently. For example, Watcom has an “option dosseg”:

    http://www.users.pjwstk.edu.pl/~jms/qnx/help/watcom/compiler-tools/wlink.html#DoSSegOption

    Which emulates the segment ordering done by the Microsoft linker. It can also be enabled by a magic record in one of the .obj files (usually the one, containing the RTL startup code). It is even supported by TLink, I think, but as far as I know, neither Borland C, nor Borland Pascal uses it, so it still is something specific to the Microsoft (and Watcom?) compilers. If the ‘dosseg’ ordering was used, it would have put the segment without a class _before_ the data and bss segments. But Borland compilers don’t use ‘dosseg’ and may probably break if you enable it.

  2. Pingback: 1991 donut – final release | Scali's OpenBlog™

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s