Last time, I covered the basics of doing multiply and divide with 8-bit numbers. So far, these routines only worked for unsigned numbers. This time I will show how to add support for signed numbers, some ways to extend these routines to numbers of more than 8 bits, and also some optimizations.

Another thing I have not yet mentioned specifically is a compare operation for more than 8 bits. You will need one for the division routine, so I will just start there.

## Comparing more than 8 bits

Let’s say we want to compare two 16-bit numbers x and y. We can start by just looking at the high bytes. If high(x) > high(y), then it must follow that x > y. Likewise, if high(x) < high(y), it follows that x < y. If high(x) == high(y), then we have to check the low bytes. In that case, if low(x) < low(y), then x < y, if low(x) > low(y), then x > y, and if low(x) == low(y), then x == y.

So we could use some code like this:

```
    lda x+1
    cmp y+1
    bne done
    lda x
    cmp y
done:
```

Quite simple, is it not? Once we have executed this code, the status register reflects the comparison of x and y just as if they were both 8-bit numbers. And as you can see, this can again be extended to as much precision as you want, just like the addition and subtraction routines. Note also that a compare is technically the same as a subtraction, except that only the status flags are set, and the result is discarded. So in some cases you can combine the subtraction and comparison steps into one, since after a subtraction the status flags are the same as a comparison.
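To make the byte-by-byte logic explicit, here is a minimal C sketch of the same comparison. The function name and the strcmp-style return value are my own additions; the 6502 code above only sets the status flags.

```c
#include <stdint.h>

/* Compare two 16-bit values byte by byte, the way the 6502 code does:
   check the high bytes first, and only fall through to the low bytes
   when the high bytes are equal. Returns <0, 0 or >0 like strcmp. */
int cmp16(uint16_t x, uint16_t y)
{
    uint8_t xh = x >> 8, yh = y >> 8;
    if (xh != yh)
        return (xh < yh) ? -1 : 1;   /* high bytes decide */
    uint8_t xl = x & 0xFF, yl = y & 0xFF;
    if (xl != yl)
        return (xl < yl) ? -1 : 1;   /* low bytes decide */
    return 0;                        /* fully equal */
}
```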

## Signed division and multiply

In order to make our routines handle signed numbers as well, we could start with a naïve approach again. Namely, it is easy to determine what sign our result should have, for both multiply and divide:

```
 X | Y | X*Y | X/Y
---+---+-----+----
 + | + |  +  |  +
 + | - |  -  |  -
 - | + |  -  |  -
 - | - |  +  |  +
```

It is just an exclusive-or relationship. So, we could simply determine the sign that our result should get, then perform abs(x)*abs(y) or abs(x)/abs(y), and then apply the correct sign to the result.

For multiply, we can solve it in a slightly more clever way though. Namely, all we have been doing are additions. Normally, addition behaves the same for signed and unsigned numbers. In this case however, we started out with 8-bit signed numbers, and they have been promoted to a 16-bit result. The low byte is still always the same, regardless of signed or unsigned numbers. But the high byte requires some adjustment now.

Namely, if the numbers are signed, we should interpret the most significant bit differently. Instead of having the value 2^{7}, it has the value -2^{7}. So, if x < 0, then we should have added y*-2^{7} rather than y*2^{7}. We can correct this by subtracting y*2^{7} and then adding y*-2^{7}, which amounts to -2*y*2^{7} == -y*2^{8}.

In other words, if we just subtract y from the high byte of the result, we have corrected for the sign of x. Then we do the same for y: if y < 0, subtract x*2^{8} from the high byte:

```
z = unsignedmul(x, y);
if (x < 0)
    z -= (y << 8);
if (y < 0)
    z -= (x << 8);
```

That’s all, which is slightly less work than getting the resulting sign first, then abs(x) and abs(y), and applying the sign to the result.
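As a sanity check, here is the correction scheme worked out in C. The `unsignedmul` helper stands in for the shift-and-add routine from the previous part; the casts make explicit that the unsigned routine operates on the raw bit patterns of the signed inputs.

```c
#include <stdint.h>

/* Unsigned 8x8 -> 16 multiply, standing in for the shift-and-add loop. */
static uint16_t unsignedmul(uint8_t x, uint8_t y)
{
    return (uint16_t)x * (uint16_t)y;
}

/* Signed 8x8 -> 16 multiply: run the unsigned routine on the raw bit
   patterns, then subtract the other operand from the high byte for each
   negative input, as derived above. */
int16_t signedmul(int8_t x, int8_t y)
{
    uint16_t z = unsignedmul((uint8_t)x, (uint8_t)y);
    if (x < 0)
        z -= (uint16_t)(uint8_t)y << 8;   /* correct for sign of x */
    if (y < 0)
        z -= (uint16_t)(uint8_t)x << 8;   /* correct for sign of y */
    return (int16_t)z;
}
```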

## Going beyond 8-bit

The multiply and divide routines consist only of shifting, adding, subtracting and comparing. Since we know how to extend each of these operations to more than 8-bit, it is trivial to extend the current multiply and divide routines to work on 16-bit numbers or more. Depending on the exact situation, this could be a good approach. However, there are other ways to process larger numbers. I would like to show some here.

With multiplication, we can break it up into multiple smaller multiply operations, known as a long multiplication. Let’s say we want to calculate z = x * y, where x and y are 16-bit numbers, and z is a 32-bit result.

We can write x and y as a combination of byte values and scale factors:

x = high(x)*2^{8} + low(x)

y = high(y)*2^{8} + low(y)

Substitute that, and we get:

z = (high(x)*2^{8} + low(x)) * (high(y)*2^{8} + low(y))

z = high(x)*high(y)*2^{16} + high(x)*low(y)*2^{8} + low(x)*high(y)*2^{8} + low(x)*low(y)

Now z is rewritten as a combination of 8-bit multiplies and some powers-of-two, which can translate to bit-shifts:

z = (high(x)*high(y) << 16) + (high(x)*low(y) << 8) + (low(x)*high(y) << 8) + low(x)*low(y)

And these are all byte-sized shifts, so we really only need to add the relevant bytes to the relevant place in the result. We don’t need to physically shift the bits in memory.

This idea can be extended to any number of bits easily. It might not seem all that efficient right now, since we need to perform 4 separate multiplies and then some extra adds, where we could just have extended the logic inside the multiply loop to 32-bit instead. But what if we know a faster way to perform the 8-bit multiply? I will get back to that soon.
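A C sketch of this long multiplication, using four 8x8 partial products exactly as in the expansion above (the function name is my own):

```c
#include <stdint.h>

/* 16x16 -> 32 long multiplication built from four 8x8 -> 16 partial
   products. The byte-sized shifts only express where each partial
   product is added into the 32-bit result. */
uint32_t longmul16(uint16_t x, uint16_t y)
{
    uint8_t xl = x & 0xFF, xh = x >> 8;
    uint8_t yl = y & 0xFF, yh = y >> 8;

    uint32_t z = (uint32_t)((uint16_t)xl * yl);   /* bits 0..15  */
    z += (uint32_t)((uint16_t)xh * yl) << 8;      /* bits 8..23  */
    z += (uint32_t)((uint16_t)xl * yh) << 8;      /* bits 8..23  */
    z += (uint32_t)((uint16_t)xh * yh) << 16;     /* bits 16..31 */
    return z;
}
```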

## Division extended

First, let’s look at ways to extend the division routine. Our routine does not only perform a division. After executing, the value of x also represents the remainder (so effectively we also perform a modulo operation at the same time). This remainder will prove useful when we want to divide larger numbers.

Our division routine, performing z = x / y, has three limitations:

1. x is limited to 16-bit, and x < y*2^{8}
2. z is limited to 8-bit
3. y is limited to 8-bit

1 and 2 are closely related. That is, if we want to divide a larger number by an 8-bit value, it follows that we get a larger result as well. This can be solved by a long division.

An example: take x = 65535, the maximum 16-bit value, and y = 4, a small 8-bit value (so we do not meet the x < 4*2^{8} requirement). Then z = 16383, remainder = 3, but how do we calculate this? We divide the numbers up into low and high bytes again, and look at the powers of 2 involved. We will first divide the high byte:

high(x) / y = high(z)

255 / 4 = 63, remainder = 3

But actually the high byte was scaled by a factor 2^{8}, so we should also scale our result and remainder by that factor, so 63*2^{8} and 3*2^{8} respectively.

Now we add that remainder to the low byte, and then divide it as well:

(remainder*2^{8} + low(x)) / y = low(z)

(768 + 255) / 4 = 255, remainder = 3

Note: it is always true that (remainder*2^{8} + low(x)) < y*2^{8}, since remainder < y by definition.

Now, we have both high(z) and low(z), so we can construct the result:

z = high(z)*2^{8} + low(z) = 63*256 + 255 = 16383

And there we go. The remainder from the last division was also the correct remainder for the entire division, so we also have x mod y.

This idea can also be expanded easily to larger sizes of x, by simply performing an extra division for each extra byte in x, and chaining them together in the same way, using the remainder, with this long division approach.
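The chaining can be sketched in C like this (function and variable names are mine; the 8-bit divisions stand in for the routine from the previous part):

```c
#include <stdint.h>

/* Divide 16-bit x by 8-bit y using two chained divisions with 8-bit
   results: divide the high byte first, then feed its remainder into the
   division of the low byte, as in the worked example above. The
   remainder of the last step is the remainder of the whole division. */
void longdiv16by8(uint16_t x, uint8_t y, uint16_t *z, uint8_t *r)
{
    uint8_t xh = x >> 8, xl = x & 0xFF;

    uint8_t zh  = xh / y;                 /* high(x) / y              */
    uint8_t rem = xh % y;                 /* scaled by 2^8 implicitly */

    uint16_t t = ((uint16_t)rem << 8) | xl;
    uint8_t zl = (uint8_t)(t / y);        /* fits: t < y*2^8          */
    rem = (uint8_t)(t % y);

    *z = ((uint16_t)zh << 8) | zl;
    *r = rem;
}
```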

Now we only have problem 3 left to solve. What if y is larger than 8-bit? In that case, we can reduce both x and y until y fits within 8-bit (we normalize to 8-bit). Let’s take x = 65535 and y = 1024. Then z = 63, remainder = 1023.

We have to reduce y to fit in the range of 0..255. So we divide it by 2 (shift right) until it fits. We need to divide by 2^{3}, so we get:

y’ = y / 2^{3} = y >> 3 = 128

x’ = x / 2^{3} = x >> 3 = 8191

Now we can calculate a guess:

z’ = x’ / y’ = 8191 / 128 = 63

As long as x’ fits the requirement of x’ < y’*2^{8}, our guess z’ can be at most 1 off, and more specifically, z’ is either z or z+1, since both x’*2^{3} <= x and y’*2^{3} <= y.

So, we can do a test by calculating z’ * y. If z’ * y > x, then apparently our z’ must have been z+1. Else z’ = z. So we can simply correct for that.

Some pseudo-code for this:

```
y' = y;
x' = x;
while (y' >= 256) {
    y' >>= 1;
    x' >>= 1;
}
z = div( x', y' );
if (z > 0) {
    z--;
    x2 = mul(z, y);
    x3 = x - y;
    if (x3 <= x)
        if (x2 <= x3) {
            z++;
            x2 += y;
        }
    r = x - x2;
} else
    r = x;
```

Note that in this pseudo-code I had to circumvent some issues: mul(z, y) can only give a 16-bit value at most. So in the case where z’ is actually z+1, it may overflow. So instead we turn it around and test for z’-1, which always fits in 16-bit. This won’t work if z’ is 0, because z’-1 will underflow then. Then again, if z’ was 0, the initial guess must have been correct, because else z would have to be -1. This is not possible, since we have assumed non-negative numbers as input. So we only need to test cases where z’ > 0, which means we can always use z’-1 as a guess.

Note also that z’ is 8-bit by definition, and only y is 16-bit. When implementing this in assembly, you can do an optimized multiply routine, using the 8-bit value in the loop. And if you implement it in assembly, you could detect overflow with the carry flag, so you wouldn’t necessarily need to use z’-1 in the guess, but you can just use z’ directly.

In that case you do not have to use x3 = x - y to compare the guess against either, but you can just use x directly. This avoids the other problem here: if x < y, then x3 will underflow. The if (x3 <= x) is an extra check for underflow situations.

This code also takes a bit of extra care to calculate the proper remainder in the r variable after the guess has been corrected.
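For clarity, here is the same scheme as a compilable C function. The helper names mirror the pseudo-code above; C's `/` replaces the 8-bit-result `div` and `mul` routines, and the wrap-around of unsigned arithmetic plays the role of the underflow checks discussed above.

```c
#include <stdint.h>

/* 16/16 division by normalizing y into 8-bit range, making an initial
   guess, then correcting it, following the pseudo-code above.
   Returns the quotient; the remainder is stored in *r. */
uint16_t div16(uint16_t x, uint16_t y, uint16_t *r)
{
    uint16_t xs = x, ys = y;
    while (ys >= 256) {              /* normalize y to 8-bit */
        ys >>= 1;
        xs >>= 1;
    }
    uint16_t z = xs / ys;            /* guess: either z or z+1 */
    if (z > 0) {
        z--;                         /* test z'-1 so the mul cannot overflow */
        uint16_t x2 = z * y;         /* (z'-1)*y always fits in 16-bit */
        uint16_t x3 = x - y;         /* may wrap; checked below */
        if (x3 <= x && x2 <= x3) {   /* guess z' was correct after all */
            z++;
            x2 += y;
        }
        *r = x - x2;
    } else {
        *r = x;                      /* z' == 0: guess is always correct */
    }
    return z;
}
```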

## Multiplication tables

The above long multiplication and division using a multiply to check and correct an initial guess would be more interesting if we could make our regular multiply faster than the loop-method we have used so far.

Using lookup tables might be an interesting approach. However, if you just want a naïve table for 8-bit numbers, you would require 256*256 = 65536 table entries. And since you need 16-bit to store each result, that amounts to 128 KB of memory. A lot more than what a C64 has. You could say: let’s use a smaller table then, and construct larger multiplies from that with a long multiply approach. That is a possibility, taking a 4-bit table for example, then you’d only need 256 entries, or 512 bytes of memory. But 4-bit values are not as easy to manipulate as bytes, so the extra overhead for table lookups and the long multiply itself don’t make it a very attractive option.

But, what if we can build a smarter table? There is a lot of redundancy in a multiply-table, since x * y == y * x. Perhaps there is a simple way to create a more efficient table for 8-bit numbers.

We can rewrite x*y as a combination of (x+y)^{2} and (x-y)^{2} terms. Namely:

(x+y)^{2} = x^{2} + 2xy + y^{2}

(x-y)^{2} = x^{2} - 2xy + y^{2}

So if we take (x+y)^{2} - (x-y)^{2}, we get:

x^{2} + 2xy + y^{2} - x^{2} + 2xy - y^{2} = 4xy

Which means:

xy = ((x+y)^{2} - (x-y)^{2})/4 = ((x+y)^{2})/4 - ((x-y)^{2})/4

(Trivia: this is known as quarter-square multiplication, and was already known in Babylonian times)

So now we only need a table for the function f(n) = (n^{2})/4. Since x and y are both 8-bit, the table only needs to go from 0..(255+255), so only 511 entries are required. This will fit in 1 KB of memory, and our calculation reduces to:

x*y = f(x+y) - f(abs(x-y))

Note: we use abs(x-y) to avoid issues with negative table indices, because x-y will be in the -255..255 range. Because the function is just (n^{2})/4, the sign makes no difference: (-n)^{2} == n^{2}.

Since abs(n) is still rather clumsy to do on C64, you could modify the indexing. You could use 255-x+y as an index instead, which moves it to 0..511 range again. Then you could create a second table built around this indexing system: g(n) = ((n-255)^{2})/4. Then you get:

x*y = f(x+y) - g(255-x+y)

So just two table lookups and a 16-bit subtraction is all we need for our multiply now. And 2 KB worth of tables, of course.
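A C sketch of the quarter-square method, using the abs() form since C has a cheap abs (on the C64 you would use the second table as described above; the names here are my own):

```c
#include <stdint.h>
#include <stdlib.h>   /* abs */

/* Quarter-square multiply: one 511-entry table of f(n) = (n*n)/4 and
   two lookups replace the whole shift-and-add loop. The fractional
   parts of the two integer divisions cancel, because x+y and x-y
   always have the same parity. */
static uint16_t f[511];

static void init_table(void)
{
    for (uint32_t n = 0; n < 511; n++)
        f[n] = (uint16_t)(n * n / 4);
}

uint16_t qsmul(uint8_t x, uint8_t y)
{
    return f[x + y] - f[abs((int)x - (int)y)];
}
```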

With this information, you should be able to figure out how the highly optimized math routines at Codebase64 work, and how to use them in your own 6502 code, and how to build optimized versions for specific cases in your code (one interesting optimization is reusing the first operand of the multiply, which is useful for matrix multiply for example).

I think that about covers the mathematical ‘deficiencies’ of the C64. We can now perform add, sub, mul and div, and do it with 16-bit or even 32-bit precision if required. This opens up the same possibilities that the 286 and 68000 processors have, which I’ve used earlier in this series.
