The myth of linux/open source security

I have tried to warn about the myth that is security in the open source world before. Today there is another big security issue in Linux: http://arstechnica.com/security/2013/05/critical-linux-vulnerability-imperils-users-even-after-silent-fix/

As the article points out: the problem is not just the vulnerability itself, but also how the kernel developers (including Torvalds himself) deal with security issues, the obscurity (oh the irony), and lack of security advisories when releasing patches.

This once again proves that “many eyes inspecting the source” is no guarantee for secure and robust code.
MythBusted

Posted in Software news | Tagged , , , , , , , , , , , , , | 4 Comments

Just keeping it Metro

Yes, Metro, not Retro. Or actually I should say Windows Store App now. As in, the new interface for Windows 8, mainly aimed at tablets and smartphones. Namely, I got myself a Windows 8 Pro update early on, at a low price. However, I wanted to install it on a separate partition, and keep my current installations in tact, with multi-boot. So I had to do some repartitioning in order to install Windows 8 properly. I didn’t get round to that until recently. And so I finally have a Windows 8 machine at my disposal now, both to test my code on, and to play around with and see what is new.

I did some reading on what’s new in the Windows 8 SDK. I also installed Visual Studio 2012 Express for Windows 8, and figured I’d have a look at creating Windows Store Apps. I noticed that their Direct3D examples were basically just using regular C++ code for the D3D part. They then used managed C++ to link it to the new Core UI environment, which is .NET-based. It is the exact same Direct3D 11.1 as for regular desktop apps, with only one change really: instead of attaching a swap chain to a handle of a desktop window (hWnd), you now create a swap chain for a CoreWindow. Effectively, it comes down to changing only a single function call, the one that creates the swap chain: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404559(v=vs.85).aspx.

I already have a Direct3D 11 engine written in C++, so this looks promising. Let’s see what it takes to make it work inside a Windows Store App environment.

Phase one: Windows 8 SDK

First things first. Since Windows 8 is still fully backward-compatible with earlier versions of Windows (despite some people trying to cloud the issue), my code just worked as-is on Windows 8 in desktop mode. This version was still built with Visual Studio 2010, using the Windows 7 SDK and the old DirectX SDK from June 2010. The first step towards a Windows Store App would be to import the project into Visual Studio 2012, and make it work with the Windows 8 SDK (still as a desktop app).

One of the first things I ran into, was that I had to fix some QuickTime headers, because they defined a GetProcessInformation() function, which is also in the Windows 8 API, and so it clashed: http://msdn.microsoft.com/en-us/library/windows/desktop/hh448381(v=vs.85).aspx. The QuickTime SDK also came with its own stdint.h header, which interfered with the Windows 8 SDK one. The QuickTime SDK has not been updated since 2007 though (and there still is no support for x64), so I don’t expect Apple to release an update anytime soon.

Phase two: D3DX

Now I could build the code with VS2012, but I was still dependent on the old DirectX SDK, because my code made use of D3DX, which is not in the Windows 8 SDK. Since the DirectX SDK does not support ARM, I would never be able to port the engine to ARM devices as long as the D3DX code is there, so now would be a good time to replace that legacy code with the new libraries.

There are basically three different tasks I perform with D3DX:

  • Shader compilation
  • Texture loading
  • Mathematics

Shaders should now be compiled with the D3DCompiler library, which is included in the Windows 8 SDK: http://msdn.microsoft.com/en-us/library/windows/desktop/dd607340(v=vs.85).aspx.

There is no library for textures included in the Windows 8 SDK. However, Microsoft has released an open source project by the name of DirectXTK (Tool kit): http://directxtk.codeplex.com/. This handles DDS loading for all devices, and it handles the ‘legacy’ formats as well (jpg, png, gif, tga etc), with the exception of Windows Phone. On Windows Phone, the codecs for these formats are not included.

For mathematics, there is the DirectXMath library, which is included in the Windows 8 SDK: http://msdn.microsoft.com/en-us/library/windows/desktop/hh437833(v=vs.85).aspx. DirectXMath is entirely based on C++ inline functions and templates, and will use intrinsics for SSE2 (x86/x64) or NEON (ARM). You can also disable the use of intrinsics, if you still want to remain compatible with pre-SSE2 CPUs (Pentium 4 and Athlon64 were the first to have SSE2). I think I will want that for the x86 build of my engine. For x64, SSE2 is supported by definition. Likewise, there will not be any Windows RT devices with ARM CPUs that do not have NEON.

Since my engine is still a multi-API engine, supporting Direct3D 9 and 10/10.1 as well, it was slightly more difficult to get rid of the D3DX dependencies than with just Direct3D 11 code. I can use the new shader compiler for D3D10 and higher, but D3D9 will still need to rely on the old D3DX compiler. The DirectXTK texture functions will only work with D3D11, and I will need to use D3DX for the older versions (unless I want to extend the DirectXTK code myself, to make it support the other APIs, but I don’t really see much of a point in that).

Windows Store Apps only support D3D11.1 anyway, so they are not affected by what happens in the code for the other APIs. So it is not a problem when there are still some D3DX-ties there.

The mathematics are a slightly different story, since this code is shared by all APIs. D3DXMath was originally built as a set of extensions to the native D3DVECTOR and D3DMATRIX structures in Direct3D 9. In Direct3D 10 and newer, there are no native structures at all in the API. I used the same D3DXMath types as I did in D3D9, to keep things simple. But now it means that if I want to replace D3DXMath for D3D11, then I need to replace it for all APIs.

Since DirectXMath still use vectors and matrices, the memory layout is still very similar, and it is still quite easy to convert between Direct3D9/D3DX types and DirectXMath. Microsoft has an overview page on mixing the two: http://msdn.microsoft.com/en-us/library/windows/desktop/ff729728(v=vs.85).aspx

I decided to go for a quick hack initially. Namely, a lot of my functions and objects use D3DX structures and functions directly. If I were to rewrite all that with DirectXMath at once, it would take quite some time. Instead, I removed the D3DXMath.h from my project entirely, and typedef’ed some compatible DirectXMath types to D3D9/D3DX types. So now my code would actually use DirectXMath types, instead of D3DX ones. Something like this:

#if !defined(D3D9)
typedef DirectX::XMFLOAT3 D3DVECTOR;
typedef DirectX::XMFLOAT4X4 D3DMATRIX;
#endif
typedef DirectX::XMFLOAT2 D3DXVECTOR2;
typedef DirectX::XMFLOAT3 D3DXVECTOR3;
typedef DirectX::XMFLOAT4 D3DXVECTOR4;
typedef DirectX::XMFLOAT4X4 D3DXMATRIX;
typedef DirectX::XMFLOAT4 D3DXPLANE;
typedef DirectX::XMFLOAT4 D3DXQUATERNION;

Then I made simple replacement functions for all the D3DXMath functions that I used, implementing them with simple DirectXMath statements. Something like this:

inline D3DXMATRIX* D3DXMatrixMultiply(
  _Inout_  D3DXMATRIX *pOut,
  _In_     const D3DXMATRIX *pM1,
  _In_     const D3DXMATRIX *pM2
)
{
    DirectX::XMStoreFloat4x4( pOut,
       DirectX::XMMatrixMultiply( DirectX::XMLoadFloat4x4(pM1),
                                  DirectX::XMLoadFloat4x4(pM2) )
    );

    return pOut;
}

So I now had my own ‘D3DXMath’ implementation, which used DirectXMath underneath.

The code still compiled and worked the same as before, except I no longer had to link it against any D3DX libraries for D3D11. D3D9/10/10.1 would still use D3DX for some operations, but the mathematics were still done with DirectXMath.

Now I can just rewrite some of the code with more optimal DirectXMath at a later time (using aligned types everywhere would be nice for starters). It may already be slightly more efficient than the original D3DX code anyway, since the compiler is now emitting SSE2 intrinsics inline, rather than just linking to precompiled code.

Phase three: ARM

Now that I had removed one of the largest x86-only dependencies from the code, it was time to try and compile it for ARM, and more specifically, the new Windows Runtime environment. The Windows Runtime (WinRT) environment is basically a subset of the old Win32 and .NET environments. Microsoft provides a number of pages on MSDN to explain the differences: http://msdn.microsoft.com/en-us/library/windows/apps/br205757.aspx

The Windows Phone 8 is more or less a subset of WinRT, also still supporting some of the Win32 API: http://msdn.microsoft.com/en-us/library/windowsphone/develop/jj662956(v=vs.105).aspx

The .NET environment implements a subset of the regular .NET framework, combined with new libraries such as the Core UI: http://msdn.microsoft.com/en-us/library/windows/apps/br230302.aspx

Ironically enough, the .NET code in my codebase turned out to be the most difficult to port to WinRT. There have been a number of changes in C++/CLR, which means that the .NET wrapper I have made, would not compile for a Windows Store App. Likewise, the C# code depends on some third-party libraries such as SlimDX, which will not work either, so I could not use any of the existing .NET code. So for now I concentrated only on the C++ engine itself, and hooking that up to a minimal user-interface.

Anyway, after having cleaned out the remaining x86-only code, mainly DirectShow and QuickTime (these should be replaced with Media Foundation code for the future), I only had a small number of compile errors left. One thing is that the unsafe functions in the CRT now generate errors rather than just warnings. So I decided to finally fix these by using the *_s() functions for string processing and such.

Aside from that, there are a number of legacy functions that were no longer supported. Things like InitializeCriticalSection(). There is an InitializeCriticalSectionEx() that has been around since Windows Vista, which is supported however, so it was easy to fix. The same goes for CreateEvent() vs CreateEventEx() for example. And CreateFile() is not available either, but there is CreateFile2(). So these were all just quite straightforward fixes.

Sadly I do not have an ARM-powered Windows RT/Phone device yet, and the simulator included with Visual Studio is just a virtualized x86 environment. So although I can compile my engine for ARM now, I have not actually tested it yet. The x86 version works like a charm though. I’ve made a simple test application where my engine creates a textured donut (what else?), and I’ve tiled it with the Windows Store, so you can see that it is in fact running in the new UI of Windows 8:

TileDonut

Another interesting tidbit in the new Win8 APIs was this little flag: http://msdn.microsoft.com/en-us/library/windows/desktop/hh404455(v=vs.85).aspx. Yes, apparently Microsoft now allows you to detect whether you run on a tile-based deferred renderer (read: PowerVR hardware), so you can make special optimizations. Very nice!

And what do you think of Windows 8?

There has been a lot of discussion about the new user interface of Windows 8. And also about the new Windows Store, and the alleged doom for independent software vendors and all that. And how this is the end of Windows, and how people should move to linux instead. Well, that idea is laughable, when you think about it. Because even if you think Windows 8 is not as nice as Windows 7, it still makes a whole lot more sense to just stick to Windows 7 than to move to linux. The Windows 8 desktop looks and feels more similar to Windows 7 than any linux desktop environment. And Windows 8 is more compatible with Windows 7 than any linux environment. So I don’t quite see why Windows 8′s slightly different UI should move anyone to an OS with an even MORE different UI, especially since it is also a completely different OS, which is not compatible with your existing Windows applications. Some people are quite twisted.

I personally don’t really care about UIs. I have used tons of different systems over the years, and still do. While there are Start Menu apps for Windows 8, I don’t see the fun in that. I prefer to use Windows 8 as its designers meant it, so without the classic start menu. And indeed, once you get the hang of the new UI, it is not bad really. The key to Windows 8 for power users is not the new Start page, but rather the new shortcut keys. Windows has always had a number of clever shortcut keys (one of my favourites being ctrl+shift+esc to pop up the Task Manager. Which has been massively improved in Windows 8 by the way, and allows you to manage start-up programs as well). But Windows 8 introduces a number of new ones: http://windows.microsoft.com/en-us/windows-8/new-keyboard-shortcuts#1TC=t1

Just press the Windows key and start typing the app you want to start. That’s what you probably did since Windows Vista anyway, since Start Menus tend to be full of crud and slow to navigate by mouse. Or press Windows+Q to get to your Apps tile immediately. Anyway, just check the list and play around with it, there’s some interesting stuff for any power user.

Aside from that, Windows 8 worked fine for me so far. It’s fast, and the experience is generally quite smooth. There are a few minor glitches, but you get that with any new OS. It’s been fine for daily use for me, even better than Windows 7.

But you want to know what bothers me? No Solitaire! That’s right! That game has been part of Windows for a long, LONG time, but now it is gone (as are the other games, such as MineSweeper). I liked to play Solitaire every now and then, get your mind off things for a few moments, then get back to work with a fresh outlook on things. They offer some cardgames in the Windows Store now, but they are fullscreen. And that’s just not very nice on a large monitor. Luckily I am not the only one who missed these games. I first tried to start the games from Windows 7, but they refused. I suspected that there may be some version check in there (why?!). I figured I’d Google around first, and if all else fails, I’d take a stab at removing the version check. But, apparently I wasn’t the only one who was annoyed with this, and I found that there were some ready-made patches available already: http://www.howtogeek.com/122145/what-happened-to-solitaire-and-minesweeper-in-windows-8/

Excellent!

Another key thing that is missing, for some users at least, is the XP mode. I occasionally used it for older hardware. Then again, I have replaced most of that hardware since, so I don’t know if I will ever need it again.

Posted in Direct3D, Software development, Software news | Tagged , , , , , , , , , , , , , , , , , , , , , , , , , , | 6 Comments

Revision 2013 is keeping it real

This easter weekend the demo party Revision 2013 was held. And they were keeping it real. SvOlli was there, for a seminar on the Atari 2600 (‘rock bottom’ in terms of game/demo platforms):

It should give you an impression of just how crazy it is to write code for the Atari 2600.

Then again, these guys may have found a platform even MORE primitive than the 2600:

And if that wasn’t enough, lft held a seminar there as well. You may have heard of lft before, with his wild demos running on microcontroller-based platforms of his own design. He won the wild compo with one the previous year. And the year before that. And before Revision, he did it at Breakpoint as well, in 2009. And in 2008… What happened in 2010 then? He must have had an off day, he only came second that year (then again, getting beaten by a C64 playing video… how could anyone possibly top that?). But recently, he has moved onto the Commodore 64. And that is what his talk is about.

This ties in nicely with the blogs I have written recently. As I have been trying to do, he explains how different C64 coding is to modern programming. He draws some interesting parallels between cycle-exact coding and writing poems in special formats such as the sonnet. And he also compares the VIC-II bugs with… real bugs! Hence the title: “Poems for bugs”.

He also entered a C64 demo at the Revision compo (the main feature is actually his novel trackloading routine):

And speaking of demos… Another very nice demo at Revision was the winning oldskool Amiga 500 demo by Lemon, a ‘redux’ of their 1993 demo Rink-a-Dink:

With some new vector parts, which seem to be a tribute to the 3D part in the legendary Commodore 64 demo Edge of Disgrace, by Booze Design (starting at about 7:30):

And speaking of Amiga demos, Ghostown and Loonies won the regular Amiga demo compo:

This was probably the best demo of the entire party.

All in all, a lot of nice retro/oldskool things going on at Revision 2013!

Posted in Hardware news, Oldskool/retro programming, Software development, Software news | Tagged , , , , , , , , , , , , , , , , | 4 Comments

Latency and stuttering with 3D acceleration

The other day, I read this interesting article on Anandtech about the GPU stuttering issues that were recently discovered with AMD cards. I wanted to do a quick write-up about that, since it relates to what I am doing.

As I’ve written a while ago, I develop software with 3D acceleration for multiple screens at a time, with multiple GPUs. This poses very similar problems to the ones explained here with FRAPS: I can only synchronize the screens up to the Present()-call, and I am dependent on the underlying driver queues from there.

For me it is very good news that this has now become an important issue in driver development, because it will mean that different vendors will probably diverge less in latency in the future. So synchronization will be easier and more reliable for me in the future.

Note also that I wrote at the time that nVidia had some issues in OpenGL when rendering multiple windows, which AMD did not. This could actually have been a result of nVidia’s latency handling scheme. It appeared like there was some kind of scheduling going on, but it only worked on a process-wide basis, rather than scheduling per-thread, and balancing out the performance equally between all windows. And if that theory is correct, then it would also explain why AMD did not suffer from it: they simply did not try to be clever at all, so they could not get it wrong either.

And well, it has to be said: it can’t really be coincidence that once again it happens to be AMD where these issues were detected. nVidia said that they’d been working on latency issues since Fermi, and Intel probably has been working on them as well, since their drivers did not seem to show big issues either. But for some reason it never even occured to the AMD driver team that this kind of latency issue exists, which is a bit sad.

Posted in Direct3D, Hardware news, OpenGL, Software development | Tagged , , , , , , , , , , , | 2 Comments

Just keeping it real, part 10.1

Last time, I covered the basics of doing multiply and divide with 8-bit numbers. So far, these routines only worked for unsigned numbers. This time I will show how to add support for signed numbers, some ways to extend these routines to numbers of more than 8 bits, and also some optimizations.

Another thing that I have not mentioned specifically was a compare-operation for more than 8-bit. For the division you would need one. I will just start there.

Comparing more than 8 bits

Let’s say we want to compare two 16-bit numbers x and y. We can start by just looking at the high bytes. If high(x) > high(y), then it must follow that x > y. Likewise, if high(x) < high(y), it follows that x < y. If high(x) == high(y), then we have to check the low bytes. In that case, if low(x) < low(y), then x < y, if low(x) > low(y), then x > y, and if low(x) == low(y), then x == y.

So we could use some code like this:

lda x+1
cmp y+1
bne done

lda x
cmp y

done:

Quite simple, is it not? Once we have executed this code, the status register reflects the comparison of x and y just as if they were both 8-bit numbers. And as you can see, this can again be extended to as much precision as you want, just like the addition and subtraction routines. Note also that a compare is technically the same as a subtraction, except that only the status flags are set, and the result is discarded. So in some cases you can combine the subtraction and comparison steps into one, since after a subtraction the status flags are the same as a comparison.

Signing division and multiply

In order to make our routines handle signed numbers as well, we could start with a naïve approach again. Namely, it is easy to determine what sign our result should have, for both multiply and divide:

 X | Y | X*Y | X/Y
---+---+-----+----
 + | + |  +  |  +
 + | - |  -  |  -
 - | + |  -  |  -
 - | - |  +  |  +

It is just an exclusive-or relationship. So, we could simply determine the sign that our result should get, then perform abs(x)*abs(y) or abs(x)/abs(y), and then apply the correct sign to the result.

For multiply, we can solve it in a slightly more clever way though. Namely, all we have been doing are additions. Normally, addition behaves the same for signed and unsigned numbers. In this case however, we started out with 8-bit signed numbers, and they have been promoted to a 16-bit result. The low byte is still always the same, regardless of signed or unsigned numbers. But the high byte requires some adjustment now.

Namely, if the numbers are signed, we should interpret the most significant bit differently. Instead of it having the value of 27, it has the value of -27. So, if x < 0, then we should have added y*-27 rather than y*27. So, we can correct it by subtracting y*27, and then adding y*-27. Which is -2*y*27 == -y*28.

In other words, if we just subtract y from the high byte of the result, we have corrected for the sign of x. Then we do the same for y: if y < 0, subtract x*28 from the high byte:

z = unsignedmul(x, y);

if (x < 0)
    z -= (y << 8);
if (y < 0)
    z -= (x << 8);

That’s all, which is slightly less work than getting the resulting sign first, then abs(x) and abs(y), and applying the sign to the result.

Going beyond 8-bit

The multiply and divide routines consist only of shifting, adding, subtracting and comparing. Since we know how to extend each of these operations to more than 8-bit, it is trivial to extend the current multiply and divide routines to work on 16-bit numbers or more. Depending on the exact situation, this could be a good approach. However, there are other ways to process larger numbers. I would like to show some here.

With multiplication, we can break it up into multiple smaller multiply operations, known as a long multiplication. Let’s say we want to calculate z = x * y, where x and y are 16-bit numbers, and z is a 32-bit result.

We can write x and y as a combination of byte values and scale factors:

x = high(x)*28 + low(x)

y = high(y)*28 + low(y)

Substitute that, and we get:

z = (high(x)*28 + low(x)) * (high(y)*28 + low(y))

z = high(x)*high(y)*216 + high(x)*low(y)*28 + low(x)*high(y)*28 + low(x)*low(y)

Now z is rewritten as a combination of 8-bit multiplies and some powers-of-two, which can translate to bit-shifts:

z = (high(x)*high(y) << 16) + (high(x)*low(y) << 8) + (low(x)*high(y) << 8) + low(x)*low(y)

And these are all byte-sized shifts, so we really only need to add the relevant bytes to the relevant place in the result. We don’t need to physically shift the bits in memory.

This idea can be extended to any number of bits easily. It might not seem all that efficient right now, since we need to perform 4 separate multiplies and then some extra adds, where we could just have extended the logic inside the multiply loop to 32-bit instead. But what if we know a faster way to perform the 8-bit multiply? I will get back to that soon.

Division extended

First, let’s look at ways to extend the division routine. Our routine does not only perform a division. After executing, the value of x also represents the remainder (so effectively we also perform a modulo operation at the same time). This remainder will prove useful when we want to divide larger numbers.

Our division routine, performing z = x / y, has three limitations:

  1. x is limited to 16-bit, and x < y*28
  2. z is limited to 8-bit
  3. y is limited to 8-bit

1 and 2 are closely related. That is, if we want to divide a larger number by an 8-bit value, it follows that we get a larger result as well. This can be solved by a long division.

An example using x = 65535 and y = 4 which is the maximum 16-bit value, divided by a small 8-bit value (so we don’t meet the x < 4*28 requirement). z = 16383, remainder = 3, but how do we calculate this? We divide the numbers up in low and high bytes again, and look at the powers of 2 involved. We will first divide the high byte:

high(x) / y = high(z)

255 / 4 = 63, remainder = 3

But actually the high byte was scaled by a factor 28, so we should also scale our result and remainder by that factor, so 63*28 and 3*28 respectively.

Now we add that remainder to the low byte, and then divide it as well:

(remainder*28 + low(x)) / y = low(z)

(768 + 255) / 4 = 255, remainder = 3

Note: it is always true that (remainder*28 + low(x)) < y*28, since remainder < y by definition.

Now, we have both high(z) and low(z), so we can construct the result:

z = high(z)*28 + low(z) = 63*256 + 255 = 16383

And there we go. The remainder from the last division was also the correct remainder for the entire division, so we also have x mod y.

This idea can also be expanded easily to larger sizes of x, by simply performing an extra division for each extra byte in x, and chaining them together in the same way, using the remainder, with this long division approach.

Now we only have problem 3 left to solve. What if y is larger than 8-bit? In that case, we can reduce both x and y until y fits within 8-bit (we normalize to 8-bit). Let’s take x = 65535 and y = 1024. Then z = 63, remainder = 1023.

We have to reduce y to fit in the range of 0..255. So we divide it by 2 (shift right) until it fits. We need to divide by 23, so we get:

y’ = y / 23 = y >> 3 = 128

x’ = x / 23 = x >> 3 = 8191

Now we can calculate a guess:

z’ = x’ / y’ = 8191 / 128 = 63

As long as x’ fits the requirement of x’ < y’*28, our guess z’ can be at most 1 off, and more specifically, z’ is either z or z+1, since both x’*23 <= x and y’*23 <= y.

So, we can do a test by calculating z’ * y. If z’ * y > z, then apparently our z’ must have been z+1. Else z’ = z. So we can simply correct for that.

Some pseudo-code for this:

y' = y;
x' = x;

while (y' >= 256)
{
    y' >>= 1;
    x' >>= 1;
}

z = div( x', y' );
if (z > 0)
{
    z--;

    x2 = mul(z, y);
    x3 = x - y;

    if (x3 <= x)
        if (x2 <= x3)
        {
            z++;
            x2 += y;
        }

    r = x - x2;
}
else
    r = x;

Note that in this pseudo-code I had to circumvent some issues: mul(z, y) can only give a 16-bit value at most. So in the case where z’ is actually z+1, it may overflow. So instead we turn it around and test for z’-1, which always fits in 16-bit. This won’t work if z’ is 0, because z’-1 will underflow then. Then again, if z’ was 0, the initial guess must have been correct, because else z would have to be -1. This is not possible, since we have assumed non-negative numbers as input. So we only need to test cases where z’ > 0, which means we can always use z’-1 as a guess.

Note also that z’ is 8-bit by definition, and only y is 16-bit. When implementing this in assembly, you can do an optimized multiply routine, using the 8-bit value in the loop. And if you implement it in assembly, you could detect overflow with the carry flag, so you wouldn’t necessarily need to use z’-1 in the guess, but you can just use z’ directly.

In that case you do not have to use x3 = x - y to compare the guess against either, but you can just use x directly. This avoids the other problem here: if x < y, then x3 will underflow. The if (x3 <= x) is an extra check for underflow situations.

This code also takes a bit of extra care to calculate the proper remainder after the guess has been corrected.

Multiplication tables

The above long multiplication and division using a multiply to check and correct an initial guess would be more interesting if we could make our regular multiply faster than the loop-method we have used so far.

Using lookup tables might be an interesting approach. However, if you just want a naïve table for 8-bit numbers, you would require 256*256 = 65536 table entries. And since you need 16-bit to store each result, that amounts to 128 KB of memory. A lot more than what a C64 has. You could say: let’s use a smaller table then, and construct larger multiplies from that with a long multiply approach. That is a possibility, taking a 4-bit table for example, then you’d only need 256 entries, or 512 bytes of memory. But 4-bit values are not as easy to manipulate as bytes, so the extra overhead for table lookups and the long multiply itself don’t make it a very attractive option.

But, what if we can build a smarter table? There is a lot of redundancy in a multiply-table, since x * y == y * x. Perhaps there is a simple way to create a more efficient table for 8-bit numbers.

We can rewrite x*y as a combination of (x+y)2 and (x-y)2 terms. Namely:

(x+y)2 = x2 + 2xy + y2

(x-y)2 = x2 -2xy + y2

So if we take (x+y)2 – (x-y)2, we get:

x2 + 2xy + y2 – x2 + 2xy – y2 = 4xy

Which means:

xy = ((x+y)2 – (x-y)2)/4 = ((x+y)2/4) - ((x-y)2)/4)

(Trivia: this is known as quarter-square multiplication, and was already known in Babylonian times)

So now we only need a table for the function f(n) = (n2)/4. Since x and y are both 8-bit, the table only needs to go from 0..(255+255), so only 511 entries are required. This will fit in 1 KB of memory, and our calculation reduces to:

x*y = f(x+y) - f(abs(x-y))

Note: we use abs(x-y) to avoid issues with negative table indices, because x-y will be in -255..255 range. Since the function is just (n2)/4, it makes no difference, since (-n)2 == n2.

Since abs(n) is still rather clumsy to do on C64, you could modify the indexing. You could use 255-x+y as an index instead, which moves it to 0..511 range again. Then you could create a second table built around this indexing system: g(n) = ((n-255)2)/4. Then you get:

x*y = f(x+y) – g(255-x+y)

So just two table lookups and a 16-bit subtraction is all we need for our multiply now. And 2 KB worth of tables, of course.

With this information, you should be able to figure out how the highly optimized math routines at Codebase64 work, and how to use them in your own 6502 code, and how to build optimized versions for specific cases in your code (one interesting optimization is reusing the first operand of the multiply, which is useful for matrix multiply for example).

I think that about covers the mathematical ‘deficiencies’ of the C64. We can now perform add, sub, mul and div, and do it with 16-bit or even 32-bit precision if required. Which opens up the same possibilities that the 286 and 68000 processor have, which I’ve used earlier in the series.

Posted in Oldskool/retro programming, Software development | Tagged , , , , , , , , , , , , , , , , , | 1 Comment

Just keeping it real, part 10

Let’s explore the Commodore 64 some more. The topic today is: mathematics. And again, we are going to keep it real, VERY real. I have to admit, originally the ‘keeping it real’ was mainly a play of words on the 16-bit ‘real mode’ of x86 processors, as I wanted to explore not only the old CGA, EGA and VGA graphics standards, but I also wanted to explore them in their natural habitat, which would be pre-386 machines, which only worked in the original 16-bit segmented memory mode.

But another limitation I had, with virtually all the platforms discussed here (C64, Amiga, PC, GP2X), is that they do not have an FPU to process real numbers either (in fact, the example of handwritten machine and assembly code of Steve Wozniak that I used in a previous article, is actually a library for processing floating point on a secondary 6502 processor in an early Apple machine).

And lastly, especially with the Commodore 64, and having to write cycle-exact assembly code, self-modifying machine-code, or even having to organize your code byte-exact in some ways, we are clearly within the realm of Real Programmers, such as Mel. There are actually tons of examples of Real Programming for 6502 and C64, some of which can be found on Codebase64, or on Pagetable.com. I’m sure this is one that Mel would have liked: the ‘ninja’ way of creating a stable raster. In short, it sets up the timers so that the memory-mapped values read a valid jmp opcode, and then carefully sets up code so that each address contains a delay routine to remove the amount of jitter that the timer indicated.

I might get into other interesting hacks as well at some point. It should be clear at this point, that much like Mel, the C64 programmers are masters of their domain. Other interesting topics include tape or disk fastloading routines. People would basically replace the slow standard IO routines with their own custom cycle-exact code using much more efficient handshaking, which could boost throughput up to about 8 times. The 1541 diskdrive is an interesting beast anyway, since it contains its own 6502 processor, and has 2k of RAM. It is possible to upload code to the drive, and have it execute it. For example, if you had two or more drives, you could upload a disk copy program such as Fast Hack’em to the drives, and they would copy the disk with no further intervention from the computer. You could unplug the drives from the computer once the copying started. Another example is to use the drive as a sort of co-processor. For example, you could do some 3d animation where the drive will calculate object rotations for you, and let the main CPU worry only about the rasterizing.

Six degrees of separation

Anyway, I digress… mathematics was today’s topic. At this point, I realize I have been holding out on you. The 286/EGA routines I have developed earlier on in this series, have evolved further since I last covered them (okay, they appeared briefly in the GP2X demo The Chalcogens). I have since added an efficient sorting routine, so I can also handle inconvex objects such as a donut, and I have also added a polygon clipper. So it currently looks like this:

I have been planning to explain some things about the mathematics involved, how to deal with not having an FPU, making it fit into 16-bit integer operations and such (not to mention the limitations of 64k segments, when using approximation tables). But I will save that for another day. The C64 and its limitations pose some related problems, but also some new ones: it does not have a multiply or a divide instruction.

There are various degrees of ‘realness’ to programming, before you hit ‘rock bottom’, so to say (my definition of ‘rock bottom’ is the venerable Atari 2600, which is so primitive, it does not even have a framebuffer, and ‘racing the beam’ is the only way to get graphics on screen. Commodore 64 is not quite THAT primitive, but it is not too far off).

  1. Programmers who have not programmed in assembly will say: “You’re doing everything in assembly language?!”
  2. Programmers who have done some assembly, but only on modern CPUs will say: “You’re doing everything without an FPU?!”
  3. Or: “You’re doing everything with just 16-bit instructions and 64k memory segments!?”
  4. Or: “You’re doing everything on a machine at less than 100 MHz!?”
  5. Or: “You’re accessing all the hardware directly!?”
  6. And now perhaps even people who have programmed on ‘primitive’ 16-bit MS-DOS machines might go: “You’re doing everything on a machine that doesn’t even have mul and div instructions?! And with only 8-bit instructions!?”

Well yes, that is what we will be doing. But luckily, we have hit ‘rock bottom’ at that point. There isn’t much lower to go from here, realistically. The 6502 is a very early 8-bit microprocessor, similar to the Zilog Z80, Motorola 6800 or Intel 8080 found in early home/personal computers. You will not likely run into a machine with a less capable processor than this. So once you know how to work around its limitations, you should be able to program on any machine. This early technology is ‘genesis’: some of the earliest computers available to regular people. And also where computer games and the demoscene originated. An interesting piece of trivia: various scientific calculators are also based on such simple CPUs. Texas Instruments built a number of graphing calculators using the Z80 CPU for example. So that’s something to think about: the calculator only has a very basic 8-bit CPU, which does not even have multiply and division implemented in hardware. Everything is done in software, doing multiply and division, handling numbers much larger than 8 bits, handling floating point numbers, trigonometry and all that.

(Note also that self-proclaimed linux ‘hackers’ generally are barely in the upper layer of only using compiled languages. Or not even that. Just compiling your own kernel does not impress Mel in the least. You are WAY removed from what Real Programmers can do with a machine, any machine).

How limited is 8-bit really? 8 bits, so 8 binary digits. This gives us 28 different values. So a range of 0..255 for unsigned numbers, or -128..127 for signed numbers. When I started my simple scroller/music/rasterbar intro on C64, I was confronted with the limitations almost immediately. I wanted to clear the screen, which consists of 40×25 = 1000 characters. Let’s build a loop then, of 1000 iterations. Oh wait! We can’t count to 1000 in an 8-bit register! Now, I could use a 16-bit loop counter, and implement a 16-bit subtraction, or I could use two nested loops, with an outer loop of 4 iterations, and an inner loop of 250 iterations. However, K.I.S.S., so I chose to just do a single loop of 250 iterations, and write to 4 different locations each iteration.

On to the mathematics then. They aren’t particularly hard really, nothing that you wouldn’t be able to handle with a few years of grammar school. I suppose it is more about getting in the right mindset, applying your knowledge to actually solve problems. It reminds me of the N*9 problem that I brought up in an earlier blog. People tend to get stumped by it. Not because it’s hard, but because people aren’t used to solving such problems.

Addition and subtraction

Let’s start simple, and extend our 8-bit addition and subtraction to 16-bit and beyond. This is where the carry flag comes in. The carry flag is technically the 9th bit of the result (do not call it an overflow flag, because overflow is a term that is generally used with signed numbers, and most CPUs have a separate overflow flag in addition to the carry flag).

First, I suppose a small refresher on binary numbers may be useful. I covered it earlier, but here it is in a nutshell:

The first bit represents 1, the second bit represents the double of that: 2. The third bit is again the double of the that: 4. And so on.

In general: The bit at position n represents value 2 to the power n.

For the following, it makes sense to think of numbers as combinations of powers of two. For example, the number 15 is 20 + 21 + 22 + 23. But we don’t necessarily have to get down to the bit level. In this case it generally makes more sense to think of numbers one byte at a time. This is often easy to do with hexadecimal.

Let’s take the 16-bit number 624. In hex this is 270h. Each hex digit corresponds to 4 bits, so we can see immediately that the high byte is 2h and the low byte is 70h. So the number can be seen as 2h*28 + 70h, or in decimal: 2*28 + 112 = 512 + 112 = 624.

Now let’s add another 16-bit number to that, say 400, or 190h. We only have an 8-bit add, so we can only add one byte at a time. First we add the low bytes: 70h + 90h = 100h, or in decimal: 112 + 144 = 256. So, we have a 9-bit result. The low 8 bits are stored in the result register, so 00h. The most significant bit is stored in the carry flag. So carry is now set.

Now we add the high bytes: 2 + 1 = 3. So our intermediate 16-bit result is now 300h, or 768. But we have not added the 9th bit of the first addition yet. We add this to the result of the high byte, so we get 4 instead of 3. Our 16-bit result is now 400h, or 1024 in decimal. And that is correct! We have done a full 16-bit addition while only using an 8-bit add instruction.

Subtraction works very much the same way. You do a ‘subtract-with-borrow’. You start with the low byte again. If you do x - y where x < y, then the result will be less than 0, meaning that the subtraction has to ‘borrow’ a bit from the high byte. This will again be reflected by the CPU setting the carry flag. And this time you subtract the carry from the high byte after the second subtraction. So if we do 400h – 270h, we get the low byte first: 00h – 70h. This will ‘borrow’ a bit, so it actually does 100h – 70h, which results in 90h, and the carry set. Now we do the high byte: 4 – 2 = 2. And subtract carry: 2 - carry = 1. So our result is 190h. Which again is correct (actually, the 6502 does it exactly the other way around, and clears carry when a bit was borrowed, and also subtracts (1-carry) with the sbc instruction).

Since each addition and each subtraction will set the carry flag, this idea can be extended to numbers of any size. So you can do 24-bit, 32-bit and beyond just as easily. In fact, the 6502 is such a minimalistic CPU, that it only has adc and sbc instructions, which always add or subtract carry. If you want a ‘normal’ add or sub, you have to make sure that the carry flag is not set.

The actual code is quite simple, for z = x + y:

clc
lda x
adc y
sta z
lda x+1
adc y+1
sta z+1

and z = x - y:

sec
lda x
sbc y
sta z
lda x+1
sbc y+1
sta z+1

(Note: we assume Little Endian here, so the low byte is stored at the lowest address (x, y, z), and the high byte is stored at the highest address, (x+1, y+1 and z+1). In theory you could store each byte anywhere you like, but under normal circumstances it makes sense to just group them like this.)

So far this only covered unsigned numbers. However, because the 6502 uses two’s complement notation for signed numbers (like all popular CPUs), there is no need for a specific signed addition or subtraction, as two’s complement numbers behave the same for addition and subtraction, regardless of sign.

Okay, that was easy, and you may already have been familiar with instructions like adc/sbc from other CPUs. Now let’s move on to something that really only applies to early CPUs: the missing multiply instruction. So, we don’t have an instruction for it… does that stop us? Well no, because these CPUs are still Turing-complete. So, there must be a way for them to perform these operations.

Multiplication

Okay, let’s say we want to calculate z = x * y, where x and y are unsigned 8-bit numbers. The result will then be 8+8 = 16-bit (again, this follows from treating the bits as powers of two. If you multiply two powers, you can add the exponents).

Right, how do you solve this one? A naïve approach might be to just build a loop, something like this:

z = 0;

for (i = 0; i < x; i++)
    z += y;

Now that you know how to do 16-bit addition, you could at least implement this solution, and it will work. You might even think of an optimization where you swap x and y if x > y, so that you minimize the number of iterations.

However, worst-case it will still require 255 iterations, with 16-bit additions everytime. It is not going to be fast. Let’s take a closer look at the powers-of-two again. Say we want to calculate 6 * 50. If you look at 6 in binary, it is 101b. Now let’s write 6 as powers of two:

6 = 101b = 1*20 + 0*21 + 1*22

So 6 * 50 becomes:

(20 + 22) * 50 = 20 * 50 + 22 * 50

So we can look at each binary digit separately, and bit n will tell us whether we need to add 50*2n either 0 times or 1. So we only need to be able to multiply 50 by powers of 2 now. In general, every term will look like a form of y*2n. This is a lot easier, and can be done with a simple logical shift left operation.

We can check which powers of 2 to add to the result by simply shifting the bits of the number 6 to the right. The lowest bit will then be transferred to carry, and we can simply do a conditional jump to decide whether to add or not.

In high-level code, the algorithm looks like this now:

z = 0;

while (x > 0)
{
    x >>= 1;

    if (carry)
        z += y;

    y <<= 1;
}

Note that the shifting of y requires a 16-bit shift again. Much like the above adc/sbc, we can do the same with shifts, since bit 7 is shifted into the carry by the 6502′s asl instruction. If we then use rol for the high byte instead of asl, it will set the carry to bit 0. And now we have a 16-bit shift:

asl y
rol y+1

I will get back to multiplication later, but for now, I would like to move on to division first.

Division

Division is not much harder than multiplication. Let’s take z = x / y, where x is an unsigned 16-bit number and y is an unsigned 8-bit number. Again, a naïve approach might be to subtract in a loop such as this:

z = 0;

while (x > 0)
{
    x -= y;
    z++;
}

And again, it will work, but it will not be very fast. We still have a bad worst case. And this time we cannot swap x and y either, to make it faster.

However, we can apply a very similar algorithm to the one used above in the multiplication, based on powers of two. Namely, as we have seen, each term y*2n is only added 0 or 1 times. Which means we can do the reverse operation reasonably easy as well: We start with y*2n for the largest n, and check if it is smaller or larger than the current value for x. If x >= y*2n, then y*2n MUST be a factor in x. Because all lower powers of 2 added together would only add up to 2n – 1.

So if x is larger or equal, we subtract the y*2n from x, and add 2n to the current quotient. Then we move to n = n-1, until x is smaller than y.

Or, in high-level code:

z = 0;

d = y << 8;

if (y == 0)
    division by zero!!
else if (x >= d)
    divide overflow!!
else
{
    d >>= 1;
    p = 1 << 7;

    while (x >= y)
    {
        if (x >= d)
        {
            x -= d;
            z += p;
        }

        d >>= 1;
        p >>= 1;
    }
}

Note that we limit z to an 8-bit result in this routine, which means that 27 is the largest factor we can test for. For larger numbers we would have a divide overflow. Clearly we can not divide by 0 either, so we need to avoid that situation as well (it would result in an endless loop).

I think I will leave it at that for today. We have our basic addition, subtraction, multiply and divide. Next time I will want to revisit the multiply routine, and perhaps cover some other mathematical topics as well.

Posted in Oldskool/retro programming, Software development | Tagged , , , , , , , , , , , , , | 2 Comments

Just keeping it real, part 9

Right, the topic is still the Commodore 64. This time I want to get more technical and explain how to actually program some things. But first I would like to paint a picture of C64 programming in general.

When the C64 was introduced in 1982, game development was still in its infancy. In fact, the whole concept of home computers was still relatively new. Only a few years earlier, ‘home computing’ consisted of either designing and building a microcomputer yourself (such as the Homebrew Computer Club in Silicon Valley, which is the birthplace of Apple Computers), or ordering a computer in kit-form such as the Altair, or the Apple I, and putting it together yourself. Computers were as much about assembling and modifying hardware as they were about software, in those early days.

When pre-built, off-the-shelf home computers became readily available, this would also draw in a new breed of computer users, who were not necessarily interested in the hardware engineering side, but concentrated entirely on software engineering. Initially, most programs were still quite primitive, and generally written by one person. When it came to games, this meant that a single person would do everything from writing the code to drawing the graphics and doing the music. When you look at early C64 games, it is usually quite obvious that one man made the entire game. The graphics and sound or music are not very sophisticated.

Sound programming

In early C64 games, some primitive sound effects may be the only sound there is. If there is music, it is often an adaptation of a classical piece, or a classic pop song, rather than an own composition. In some cases it is quite obvious that the music was made by a programmer rather than a musician, because the adaptation will contain some off-key notes or bad timing, indicating that the programmer may have been somewhat tonedeaf.

A classic example of this is the VIC-20 version of Radar Rat Race, with a dodgy note:

Compare to the C64 version:

It seems that the person who did the music in the VIC-20 version was too tonedeaf to hear the problem. The C64 version, which was done later, fixes this.

But things changed for the better. One person who has played,  shall we say, an instrumental role here, is Rob Hubbard. He is somewhat of a legend when it comes to C64 music. He was among the first programmers to specialize in music. He already had a background as a professional session musician before he learnt how to program. In these early days, there were no tools for composing music on a C64 yet, so he had to write his own routines for editing and replaying music on the C64. Since his routines were far more advanced than anything before him, his music also sounded much richer and more mature than anything anyone had ever heard from a C64.

Many other C64 musicians would study Rob Hubbard’s routines and create their own routines based on his ideas. Just like Rob Hubbard, these people were still programmers as well. You will find that other famous composers such as Jeroen Tel and Chris Hülsbeck also had their own routines, giving their music their own ‘fingerprint’. Chris Hülsbeck also released Soundmonitor, which was one of the first ‘tracker’ tools for editing music.

The fact that these composers also wrote their own replay routines often makes their tunes instantly recognizable. These replay routines are still assembly routines at the lowest level, tweaking the registers of the SID chip directly, with exact timing. Each programmer/composer’s routine had its own distinct instruments and effects, which resulted in a signature sound and style. The epic soundtracks these people managed to create from a simple 3-channel soundchip from 1982 is all the more impressive if you also factor in the fact that the C64 only had a 1 MHz processor, and only 64k of memory. The music routines also had to be very compact, and very efficient. They had to use a minimum of CPU time, because most of the CPU time was spent on the game logic and the updating of the screen.

Therefore most music routines are called once per frame, so 50 times per second (on PAL). They are generally processed in the small timeslot between the last visible scanline of the current screen and the first visible scanline of the next screen, with the music routine taking only a few hundred CPU cycles at most.

The SID music file format can be seen as ‘compiled’ music: it consists of actual 6502 code, which updates the SID’s registers, usually 50 times per second (although there are also advanced ‘multispeed’ tunes which update more than once per frame, for even tighter timing, allowing for special sounds and effects). A SID player for other platforms actually emulates the 6510 and the SID chip of the C64, and runs the code ‘as is’.

This also means that it’s very easy for a programmer to use a tune in a C64 program. You can just load a SID file into memory, and call its entrypoint once per frame. Et voila, music is playing.

This period only lasted a short while though, as in the Amiga days music software would become readily available (NoiseTracker, ProTracker and various other ‘trackers’ were available for free), and no programming was required anymore. Musicians could concentrate entirely on composing their music. This music was stored as a data file, and played back through a standard replay routine which usually came with the tracker software. In a way this made compositions less personal, less unique.

Racing the beam

So, apparently programming music on the C64 requires an intimate knowledge of the sound chip and the CPU. With graphics it is no different. I already discussed copper/raster bars briefly in part 3, where I explained how carefully timed changes to the palette can draw horizontal bars or more elaborate patterns on screen. Techniques like these are known as “racing the beam”, as in: your drawing is timed to positions of the beam (cathode ray) of the display, and affect the screen directly. This is in sharp contrast with more modern systems, where you generally draw an entire frame in a so-called ‘back buffer’, which is not visible on screen. Once the entire frame has been drawn, the back buffer and front buffer are swapped, to display the new image.

‘Racing the beam’ is generally very easy on the Amiga, since the copper can do fairly accurately timed updates to the registers of the custom chipset. On the Commodore 64 however, there is no copper, so you have to carefully synchronize your CPU with the raster position manually. This requires a deeper level of understanding of the hardware than anything I’ve discussed so far. You need to count exactly how many cycles each instruction takes, in order to determine the raster position after each instruction.

And that is just the CPU-side…

Doing bad lines

To make things even more difficult, not every cycle is available to the CPU. This was also the case on the Amiga, where things like bitplanes, sprites and the blitter would steal cycles from the CPU, since they were all using the same memory. The exact timings for all DMA operations by the custom chipsets are also documented very well in the Amiga Hardware Reference Manual. However, on the Amiga the timing is not as relevant as on the C64, and generally you only have to worry about overall performance (in a nutshell: more and/or larger bitplanes and longer blit operations steal more cycles from the CPU). The copper would take care of critically timed routines.

On the C64, the memory is shared by the VIC and the CPU as well. The 6510 is so slow that it can only access memory at every other cycle. So the CPU takes the even cycles, leaving the odd cycles to the VIC. However, to make things more complicated, the VIC needs to use the even cycles as well, every now and then. The VIC can halt the CPU during these cycles.

On normal scanlines, the CPU gets 63 cycles (for a PAL system at least). However, every 8th scanline is a so-called ‘bad line’. As mentioned in part 8, the graphics modes of the VIC are closely related to the 8×8 character set. This also goes for the colour information. At every 8th scanline, a new line of characters starts, and the VIC-II will fetch data from the ‘color RAM’, which encodes the colour data for each character on screen. The VIC-II needs 40 extra cycles for this, leaving only 23 CPU cycles on a bad line.

A sprite stole my cycle!

The VIC chip can also display up to 8 sprites. For each sprite that is enabled on the current scanline, the VIC also needs to take two extra cycles from the CPU. In fact, it is more complicated than that, because the VIC will signal the CPU 3 cycles in advance that it wants to use the bus. So the actual amount of cycles it takes also depends somewhat on which sprites are enabled, and what the CPU is doing exactly when it is first signaled. If you really want to know, here is more information on it.

So, when you want to have cycle-exact routines on the C64, you not only need to time your code to be exactly 63 cycles on a normal scanline, but you also need to factor in the cycles lost by the sprites that are enabled, and handle the special case of a bad line. So you really REALLY have to know what your hardware is doing, at every given point in time. You can’t keep it any more real than this.

To make matters even more complicated, the shortest possible instruction on the 6510 processor takes 2 cycles. So it is impossible to delay your code by just 1 cycle. You can only get into cycle-exact sync by using delays of either 2 or 3 cycles.

Raster stability is a virtue

Why is it important to be cycle-exact anyway? Well, this is mainly for various hacks with the VIC-chip. These hacks generally consist of writing a certain value to a certain register at the right time, which overrides the value that the VIC-chip updated earlier. This will ‘confuse’ the VIC into thinking it is in a different phase of building up the screen than it really is. The side-effects could be very useful, such as avoiding a bad line, or in fact, triggering extra bad lines (which means you can update the colours more often than just per 8 scanlines, removing some of the limitation of the native graphics modes), or having sprites appear in the borders of the screen. For other effects, such as raster bars, it might not be necessary to be completely cycle-exact, but still you don’t want too much jitter, and you will need to put in delays of a few cycles to get to the start of a scanline.

I suppose the ‘Hello World’ of C64 graphics coding is doing raster bars. Let’s start there, and take it one step further, and set up a routine that is synchronized cycle-exact to the raster position. From there, you can perform a number of classic effects. Setting up such cycle-exact code is known as a “stable raster” in C64 jargon. How does one set up such a stable raster? Well, there are a number of different approaches. The one I have chosen here is the one I found to be most intuitive.

We will start out by setting up a raster interrupt. When our program starts, we have no idea where we are on the screen. So all we can do is pick a scanline to wait for (do not pick a bad line), set up the interrupt for it, and perform an endless loop, waiting for the raster interrupt to trigger.

Once the raster interrupt triggers, all we know is that we were in an endless loop, where each branch instruction takes 3 cycles. The interrupt was handled as soon as the last branch instruction had completed. Since we don’t know in which of the 3 cycles the instruction was when the interrupt occurs, we don’t know if we have 0, 1 or 2 extra cycles of ‘jitter’ since the interrupt was triggered (which may be good enough for some effects). Aside from that, there are 9 cycles of overhead for the CPU to finish the last instruction (at least 2 cycles) and push the context on the stack and call the interrupt routine (7 cycles). So we know that we are already quite far from the start of the scanline by the time we can execute our first instruction.

We can improve the accuracy by setting up a new raster interrupt for the next line, and fill up the remaining cycles on the scanline with nop-instructions. Each nop-instruction takes 2 cycles, so when the next interrupt triggers, we can only be 1 cycle off at most.

From here, we can poll the register of the raster position directly. We will assume that we are one cycle off, and delay our code so that we read the raster position at exactly the point when the next scanline starts. If we were indeed one cycle off, then the raster position has updated as expected. If we were already in sync, then we still see the previous scanline. Now we know whether to correct for 1 cycle or not, which is easily done with a conditional branch (a taken branch is 3 cycles, not taken is 2 cycles… although, it’s not quite as simple as that… is it ever? When the branch crosses a page boundary, it takes 4 cycles rather than 3. So if your branch happens to fall exactly on a page boundary, it will not work. It’s not likely, but you could disassemble your code to see the exact position of each instruction). And from this point on, we know EXACTLY where we are. We know on which scanline, and in which cycle on this scanline (namely, we are on the 3rd cycle of the current scanline after the last branch). So we know all we need to know about our raster. We can count cycles to the end of the scanline, and since we know on which scanline we are, we also know when we are on a badline or not (and adjust our timing for that).

So basically, if you want to perform cycle-exact code on scanline N, you’ll have to start with a raster interrupt for scanline N-2. This raster interrupt will then setup an interrupt for N-1 with a jitter of 1 cycle. Then we correct for that last cycle of jitter, and arrive cycle-exact on the 3rd cycle of scanline N (so if you need to be on the first or second cycle of a scanline, you’ll have to delay for an extra line, meaning you’ll need to start at N-3 instead).

Once you have everything under control, you should be able to do raster bars over the entire width of each scanline (including badlines), which is a good way to visualize your stable raster. It will result in something like this:

Now, for rasterbars this is overkill. This video was recorded with Vice, where the borders were set to ‘debug’. This means you see the full 63 cycles on screen. In reality the first part of the scanline is lost in the horizontal retrace which happens ‘on the left’ of the left border. So for rasterbars it is not required to be this cycle-exact. Even a few cycles of jitter could be hidden in this invisible part of the screen. But, it does show that we indeed have a jitter-free raster now, and we can count out cycles for any position on the screen, to perform any kind of VIC-II trickery we can think of.

Up close and personal

Well, that concludes this first encounter with C64 programming. A kind of programming that is very lowlevel. Not only are you doing everyhing in assembly language, but you also have to keep track of the cycles spent by your CPU and your VIC-II chip. You need to get to know your hardware inside-out, even more so than with the Amiga. Let alone modern computers. It isn’t even possible to know exactly what a modern computer is doing at every cycle anymore (and to people who think they’re retro when they code rasterbars by just drawing horizontal lines in an RGB backbuffer on a modern system: that has absolutely nothing to do with what the effect is about, as my explanation should have shown).

This ‘up close and personal’ style of programming probably also explains why cracking, training games, and modifying other people’s programs in various other ways were such popular pastimes on the C64. Programmers had a very good idea of what was happening inside the machine, and had no problem reverse-engineering other people’s code. Some programmers did not even use an assembler at all, but just coded directly with a monitor. They were used to ‘cryptic’ code, where memory addresses and things were just coded directly in hex, rather than using some ‘nice’ symbolic names. I noticed that Amiga code was also still quite ‘primitive’ and hardcore in this respect. Probably because many Amiga users started out on C64, and just continued to code that way on the Amiga later.

Racing the beam also confronts you directly with just how slow a 1 MHz 6510 CPU really is. Even a single cycle is already visible as 8 pixels at 320×200. And just storing a byte to screen memory already takes 4 cycles at the least. The 286 and Amiga that I’ve covered before, were already too slow to just do regular software rendering, and required all sorts of trickery to be able to fill polygons quickly. But on the C64 it seems even less likely that it will do any kind of 3d graphics. The CPU cannot even perform multiply or divide operations either (which I will probably cover in a future blog)! Nevertheless, people have been doing quite impressive 3d graphics on the C64.

Posted in Oldskool/retro programming, Software development | Tagged , , , , , , , , , , , , , , , | 1 Comment