Let’s explore the Commodore 64 some more. The topic today is: mathematics. And again, we are going to keep it real, VERY real. I have to admit, originally the ‘keeping it real’ was mainly a play on words on the 16-bit ‘real mode’ of x86 processors, as I wanted to explore not only the old CGA, EGA and VGA graphics standards, but also their natural habitat: pre-386 machines, which only worked in the original 16-bit segmented memory mode.

But another limitation I had, with virtually all the platforms discussed here (C64, Amiga, PC, GP2X), is that they do not have an FPU to process real numbers either (in fact, the example of handwritten machine and assembly code by Steve Wozniak that I used in a previous article is actually a library for processing floating point on a secondary 6502 processor in an early Apple machine).

And lastly, especially with the Commodore 64, having to write cycle-exact assembly code, self-modifying machine code, or even having to organize your code byte-exact in some ways, we are clearly within the realm of Real Programmers, such as Mel. There are actually tons of examples of Real Programming for 6502 and C64, some of which can be found on Codebase64, or on Pagetable.com. I’m sure this is one that Mel would have liked: the ‘Ninja’ way of creating a stable raster. In short, it sets up the timers so that the memory-mapped values of their registers read a valid jmp opcode, and then carefully sets up code so that each address contains a delay routine to remove the amount of jitter that the timer indicated.

I might get into other interesting hacks as well at some point. It should be clear at this point that, much like Mel, C64 programmers are masters of their domain. Other interesting topics include tape or disk fastloading routines. People would basically replace the slow standard IO routines with their own custom cycle-exact code using much more efficient handshaking, which could boost throughput up to about 8 times. The 1541 disk drive is an interesting beast anyway, since it contains its own 6502 processor, and has 2k of RAM. It is possible to upload code to the drive, and have the drive execute it. For example, if you had two or more drives, you could upload a disk copy program such as Fast Hack’em to the drives, and they would copy the disk with no further intervention from the computer. You could unplug the drives from the computer once the copying started. Another example is to use the drive as a sort of co-processor. For example, you could do some 3D animation where the drive calculates object rotations for you, and let the main CPU worry only about the rasterizing.

## Six degrees of separation

Anyway, I digress… mathematics was today’s topic. At this point, I realize I have been holding out on you. The 286/EGA routines I developed earlier on in this series have evolved further since I last covered them (okay, they appeared briefly in the GP2X demo The Chalcogens). I have since added an efficient sorting routine, so I can also handle non-convex objects such as a donut, and I have also added a polygon clipper. So it currently looks like this:

I have been planning to explain some things about the mathematics involved, how to deal with not having an FPU, making it fit into 16-bit integer operations and such (not to mention the limitations of 64k segments, when using approximation tables). But I will save that for another day. The C64 and its limitations pose some related problems, but also some new ones: it does not have a multiply or a divide instruction.

There are various degrees of ‘realness’ to programming, before you hit ‘rock bottom’, so to speak (my definition of ‘rock bottom’ is the venerable Atari 2600, which is so primitive that it does not even have a framebuffer, and ‘racing the beam’ is the only way to get graphics on screen. The Commodore 64 is not quite THAT primitive, but it is not too far off).

- Programmers who have not programmed in assembly will say: “You’re doing everything in assembly language?!”
- Programmers who have done some assembly, but only on modern CPUs will say: “You’re doing everything without an FPU?!”
- Or: “You’re doing everything with just 16-bit instructions and 64k memory segments!?”
- Or: “You’re doing everything on a machine at less than 100 MHz!?”
- Or: “You’re accessing all the hardware directly!?”
- And now perhaps even people who have programmed on ‘primitive’ 16-bit MS-DOS machines might go: “You’re doing everything on a machine that doesn’t even have mul and div instructions?! And with only 8-bit instructions!?”

Well yes, that is what we will be doing. But luckily, we have hit ‘rock bottom’ at that point. There isn’t much lower to go from here, realistically. The 6502 is a very early 8-bit microprocessor, similar to the Zilog Z80, Motorola 6800 or Intel 8080 found in early home/personal computers. You will not likely run into a machine with a less capable processor than this. So once you know how to work around its limitations, you should be able to program on any machine. This early technology is ‘genesis’: some of the earliest computers available to regular people, and also where computer games and the demoscene originated. An interesting piece of trivia: various scientific calculators are also based on such simple CPUs. Texas Instruments built a number of graphing calculators using the Z80 CPU, for example. So that’s something to think about: the calculator only has a very basic 8-bit CPU, which does not even have multiply and division implemented in hardware. Everything is done in software: multiplication and division, handling numbers much larger than 8 bits, handling floating point numbers, trigonometry and all that.

(Note also that self-proclaimed Linux ‘hackers’ generally barely reach the upper layer of only using compiled languages. Or not even that. Just compiling your own kernel does not impress Mel in the least. You are WAY removed from what Real Programmers can do with a machine, any machine).

How limited is 8-bit really? 8 bits, so 8 binary digits. This gives us 2^{8} different values. So a range of 0..255 for unsigned numbers, or -128..127 for signed numbers. When I started my simple scroller/music/rasterbar intro on C64, I was confronted with the limitations almost immediately. I wanted to clear the screen, which consists of 40×25 = 1000 characters. Let’s build a loop then, of 1000 iterations. Oh wait! We can’t count to 1000 in an 8-bit register! Now, I could use a 16-bit loop counter, and implement a 16-bit subtraction, or I could use two nested loops, with an outer loop of 4 iterations, and an inner loop of 250 iterations. However, K.I.S.S., so I chose to just do a single loop of 250 iterations, and write to 4 different locations each iteration.
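For the curious, here is a rough C sketch of that unrolled clear loop, with a plain array standing in for the C64’s screen RAM (the name `screen` is purely illustrative; on the real machine you would `sta` into $0400 and friends, indexed by the X register):

```c
#include <stdint.h>

/* Hypothetical 1000-byte buffer standing in for the 40x25 screen RAM. */
static uint8_t screen[1000];

/* Clear the screen the way the 6502 loop does it: one 8-bit counter
   of 250 iterations, writing to 4 locations spaced 250 bytes apart. */
void clear_screen(uint8_t fill)
{
    for (uint8_t i = 0; i < 250; i++)
    {
        screen[i]       = fill;  /* sta screen,x     */
        screen[i + 250] = fill;  /* sta screen+250,x */
        screen[i + 500] = fill;  /* sta screen+500,x */
        screen[i + 750] = fill;  /* sta screen+750,x */
    }
}
```

The four stores per iteration cover the full 1000 bytes without ever needing a counter larger than 8 bits.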

On to the mathematics then. They aren’t particularly hard really, nothing that you wouldn’t be able to handle with a few years of grammar school. I suppose it is more about getting in the right mindset, applying your knowledge to actually solve problems. It reminds me of the N*9 problem that I brought up in an earlier blog. People tend to get stumped by it. Not because it’s hard, but because people aren’t used to solving such problems.

## Addition and subtraction

Let’s start simple, and extend our 8-bit addition and subtraction to 16-bit and beyond. This is where the carry flag comes in. The carry flag is technically the 9th bit of the result (do not call it an overflow flag, because overflow is a term that is generally used with signed numbers, and most CPUs have a separate overflow flag in addition to the carry flag).

First, I suppose a small refresher on binary numbers may be useful. I covered it earlier, but here it is in a nutshell:

The first bit represents 1, the second bit represents the double of that: 2. The third bit is again the double of that: 4. And so on.

In general: The bit at position n represents value 2 to the power n.

For the following, it makes sense to think of numbers as combinations of powers of two. For example, the number 15 is 2^{0} + 2^{1} + 2^{2} + 2^{3}. But we don’t necessarily have to get down to the bit level. In this case it generally makes more sense to think of numbers one byte at a time. This is often easy to do with hexadecimal.

Let’s take the 16-bit number 624. In hex this is 270h. Each hex digit corresponds to 4 bits, so we can see immediately that the high byte is 2h and the low byte is 70h. So the number can be seen as 2h*2^{8} + 70h, or in decimal: 2*2^{8} + 112 = 512 + 112 = 624.

Now let’s add another 16-bit number to that, say 400, or 190h. We only have an 8-bit add, so we can only add one byte at a time. First we add the low bytes: 70h + 90h = 100h, or in decimal: 112 + 144 = 256. So, we have a 9-bit result. The low 8 bits are stored in the result register, so 00h. The most significant bit is stored in the carry flag. So carry is now set.

Now we add the high bytes: 2 + 1 = 3. So our intermediate 16-bit result is now 300h, or 768. But we have not added the 9th bit of the first addition yet. We add this to the result of the high byte, so we get 4 instead of 3. Our 16-bit result is now 400h, or 1024 in decimal. And that is correct! We have done a full 16-bit addition while only using an 8-bit add instruction.
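That walkthrough can be sketched in C, modelling the carry flag explicitly (the function name `add16` is just illustrative, this is not cycle-exact code):

```c
#include <stdint.h>

/* 16-bit addition using only 8-bit adds, the way the 6502's adc works:
   add the low bytes, record the carry, then add the high bytes plus
   that carry. */
uint16_t add16(uint16_t x, uint16_t y)
{
    uint8_t xl = x & 0xFF, xh = x >> 8;
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint16_t lo = (uint16_t)xl + yl;  /* 9-bit intermediate result */
    uint8_t carry = lo >> 8;          /* the 9th bit: carry flag   */
    uint8_t zl = lo & 0xFF;           /* low byte of the result    */
    uint8_t zh = xh + yh + carry;     /* adc on the high bytes     */

    return ((uint16_t)zh << 8) | zl;
}
```

Running our example through it: add16(270h, 190h) indeed gives 400h.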

Subtraction works very much the same way. You do a ‘subtract-with-borrow’. You start with the low byte again. If you do x – y where x < y, then the result will be less than 0, meaning that the subtraction has to ‘borrow’ a bit from the high byte. This will again be reflected by the CPU setting the carry flag. And this time you subtract the carry from the high byte after the second subtraction. So if we do 400h – 270h, we get the low byte first: 00h – 70h. This will ‘borrow’ a bit, so it actually does 100h – 70h, which results in 90h, and the carry set. Now we do the high byte: 4 – 2 = 2. And subtract carry: 2 – carry = 1. So our result is 190h. Which again is correct (actually, the 6502 does it exactly the other way around, and clears carry when a bit was borrowed, and also subtracts (1-carry) with the sbc instruction).
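The 6502’s inverted borrow convention can be modelled in C like this (the name `sub16` is illustrative; on the real CPU you would do `sec` followed by two `sbc` instructions):

```c
#include <stdint.h>

/* 16-bit subtraction with the 6502's convention: carry is SET when no
   borrow occurred, and sbc computes a - b - (1 - carry). We start with
   the carry set, as a real routine does with sec. */
uint16_t sub16(uint16_t x, uint16_t y)
{
    uint8_t xl = x & 0xFF, xh = x >> 8;
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint8_t carry = 1;                       /* sec                    */
    int16_t lo = (int16_t)xl - yl - (1 - carry);
    carry = (lo >= 0) ? 1 : 0;               /* cleared carry = borrow */
    uint8_t zl = (uint8_t)lo;                /* wraps, like hardware   */

    int16_t hi = (int16_t)xh - yh - (1 - carry);
    uint8_t zh = (uint8_t)hi;

    return ((uint16_t)zh << 8) | zl;
}
```

Again checking the example: sub16(400h, 270h) gives 190h.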

Since each addition and each subtraction updates the carry flag, this idea can be extended to numbers of any size. So you can do 24-bit, 32-bit and beyond just as easily. In fact, the 6502 is such a minimalistic CPU that it only has adc and sbc instructions, which always add or subtract carry. If you want a ‘normal’ add or sub, you have to make sure that the carry flag is not set.

The actual code is quite simple, for z = x + y:

```
clc
lda x
adc y
sta z
lda x+1
adc y+1
sta z+1
```

and z = x – y:

```
sec
lda x
sbc y
sta z
lda x+1
sbc y+1
sta z+1
```

(Note: we assume Little Endian here, so the low byte is stored at the lowest address (x, y, z), and the high byte is stored at the highest address, (x+1, y+1 and z+1). In theory you could store each byte anywhere you like, but under normal circumstances it makes sense to just group them like this.)

So far this only covered unsigned numbers. However, because the 6502 uses two’s complement notation for signed numbers (like all popular CPUs), there is no need for a specific signed addition or subtraction, as two’s complement numbers behave the same for addition and subtraction, regardless of sign.
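To illustrate that point, a small C sketch (the function name is illustrative): interpreting the same bit patterns as signed or unsigned does not change how the addition itself works.

```c
#include <stdint.h>

/* Two's complement means the exact same unsigned addition also works
   for signed numbers: we reinterpret the bits, add as plain unsigned,
   and reinterpret back. No separate signed add is needed. */
int16_t signed_add_demo(int16_t a, int16_t b)
{
    uint16_t ua = (uint16_t)a;   /* same bits, viewed as unsigned */
    uint16_t ub = (uint16_t)b;
    uint16_t sum = ua + ub;      /* the exact same add as before  */
    return (int16_t)sum;         /* same bits, viewed as signed   */
}
```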

Okay, that was easy, and you may already have been familiar with instructions like adc/sbc from other CPUs. Now let’s move on to something that really only applies to early CPUs: the missing multiply instruction. So, we don’t have an instruction for it… does that stop us? Well no, because these CPUs are still Turing-complete. So, there must be a way for them to perform these operations.

## Multiplication

Okay, let’s say we want to calculate z = x * y, where x and y are unsigned 8-bit numbers. The result will then be 8+8 = 16-bit (again, this follows from treating the bits as powers of two. If you multiply two powers, you can add the exponents).

Right, how do you solve this one? A naïve approach might be to just build a loop, something like this:

```
z = 0;
for (i = 0; i < x; i++)
z += y;
```

Now that you know how to do 16-bit addition, you could at least implement this solution, and it will work. You might even think of an optimization where you swap x and y if x > y, so that you minimize the number of iterations.

However, worst-case it will still require 255 iterations, with 16-bit additions every time. It is not going to be fast. Let’s take a closer look at the powers of two again. Say we want to calculate 6 * 50. If you look at 6 in binary, it is 110b. Now let’s write 6 as powers of two:

6 = 110b = 0*2^{0} + 1*2^{1} + 1*2^{2}

So 6 * 50 becomes:

(2^{1} + 2^{2}) * 50 = 2^{1} * 50 + 2^{2} * 50

So we can look at each binary digit separately, and bit n will tell us whether we need to add 50*2^{n} either 0 times or 1 time. So we only need to be able to multiply 50 by powers of 2 now. In general, every term will look like y*2^{n}. This is a lot easier, and can be done with a simple logical shift left operation.

We can check which powers of 2 to add to the result by simply shifting the bits of the number 6 to the right. The lowest bit will then be transferred to carry, and we can simply do a conditional jump to decide whether to add or not.

In high-level code, the algorithm looks like this now:

```
z = 0;
while (x > 0)
{
    x >>= 1;
    if (carry)
        z += y;
    y <<= 1;
}
```

Note that the shifting of y requires a 16-bit shift again. Much like the adc/sbc above, we can chain shifts, since bit 7 is shifted into the carry by the 6502’s asl instruction. If we then use rol for the high byte instead of asl, it will shift the carry into bit 0. And now we have a 16-bit shift:

```
asl y
rol y+1
```
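Putting the pieces together, here is a C sketch of the full shift-and-add multiply (the name `mul8` is illustrative; y is widened to 16 bits up front, playing the role of the asl/rol pair):

```c
#include <stdint.h>

/* Shift-and-add multiply: 8-bit x times 8-bit y, 16-bit result.
   Each 1-bit in x selects a shifted copy of y, as in the pseudocode
   above. */
uint16_t mul8(uint8_t x, uint8_t y)
{
    uint16_t z = 0;
    uint16_t yy = y;            /* 16-bit y, like the asl/rol pair */

    while (x > 0)
    {
        uint8_t carry = x & 1;  /* the bit shifted out of x       */
        x >>= 1;
        if (carry)
            z += yy;            /* add y * 2^n for this bit       */
        yy <<= 1;               /* advance to the next power of 2 */
    }
    return z;
}
```

At most 8 iterations, instead of up to 255 for the naïve loop.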

I will get back to multiplication later, but for now, I would like to move on to division first.

## Division

Division is not much harder than multiplication. Let’s take z = x / y, where x is an unsigned 16-bit number and y is an unsigned 8-bit number. Again, a naïve approach might be to subtract in a loop such as this:

```
z = 0;
while (x > 0)
{
x -= y;
z++;
}
```

And again, it will work, but it will not be very fast. We still have a bad worst case. And this time we cannot swap x and y either, to make it faster.

However, we can apply a very similar algorithm to the one used above in the multiplication, based on powers of two. Namely, as we have seen, each term y*2^{n} is only added 0 or 1 times. Which means we can do the reverse operation reasonably easily as well: We start with y*2^{n} for the largest n, and check if it is smaller or larger than the current value of x. If x >= y*2^{n}, then bit n of the quotient MUST be set. Because all the lower terms y*2^{k} added together would only add up to y*(2^{n} – 1), which is less than y*2^{n}.

So if x is larger or equal, we subtract the y*2^{n} from x, and add 2^{n} to the current quotient. Then we move to n = n-1, until x is smaller than y. The value of x is then the remainder.

Or, in high-level code:

```
z = 0;
d = y << 8;
if (y == 0)
    division by zero!!
else if (x >= d)
    divide overflow!!
else
{
    d >>= 1;
    p = 1 << 7;
    while (x >= y)
    {
        if (x >= d)
        {
            x -= d;
            z += p;
        }
        d >>= 1;
        p >>= 1;
    }
}
```

Note that we limit z to an 8-bit result in this routine, which means that 2^{7} is the largest power we can test for. For larger quotients we would have a divide overflow. Clearly we cannot divide by 0 either, so we need to avoid that situation as well (it would result in an endless loop).
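In C, the routine might look like this (the name `div16by8`, the out-parameters and the error codes are illustrative choices, not part of any real API):

```c
#include <stdint.h>

/* Shift-and-subtract division: 16-bit x divided by 8-bit y, giving an
   8-bit quotient in *q and the remainder in *r. Returns nonzero for
   division by zero or divide overflow, mirroring the pseudocode above. */
int div16by8(uint16_t x, uint8_t y, uint8_t *q, uint8_t *r)
{
    if (y == 0)
        return -1;                 /* division by zero       */

    uint16_t d = (uint16_t)y << 8; /* y * 2^8                */
    if (x >= d)
        return -2;                 /* quotient would not fit */

    uint8_t z = 0;
    uint8_t p = 1 << 7;            /* start at y * 2^7       */
    d >>= 1;

    while (x >= y)
    {
        if (x >= d)
        {
            x -= d;                /* subtract y * 2^n       */
            z += p;                /* set bit n of quotient  */
        }
        d >>= 1;
        p >>= 1;
    }
    *q = z;
    *r = (uint8_t)x;               /* what remains of x      */
    return 0;
}
```

At most 8 test-and-subtract steps, instead of up to 255 subtractions for the naïve loop.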

I think I will leave it at that for today. We have our basic addition, subtraction, multiply and divide. Next time I will want to revisit the multiply routine, and perhaps cover some other mathematical topics as well.
