Test Drive’s pixelized transition effect

Today I want to talk about the transition effect in the game Test Drive from 1987. The effect will switch from one image to the next by replacing it pixel for pixel in a somewhat randomized pattern:

I discussed this effect with Trixter, as I was thinking of ‘borrowing’ the idea for a future demo. I described how I thought the effect worked, roughly, and that I might want to reverse-engineer it to make sure.
He beat me to it however, and sent me this snippet as the core of the effect (comments are his):

lodsb           ; load packed pixels from source
and al, ah      ; mask some pixels out
not ah          ; invert mask
mov dl, es:[di] ; dl = existing byte onscreen
and dl, ah      ; mask some screen pixels out
or al, dl       ; combine source and screen pixels
mov es:[di], al ; store combination on screen
stosb           ; someone doesn't understand what stosb does!
not ah          ; invert mask
ror ah, 1       ; rotate mask by single bit

It was more or less what I expected, although it was simpler in nature, more linear. This code is run for an entire scanline. The order of the scanlines is randomized, but as you can see, it merely rotates the mask linearly along each scanline. But since this is done per-byte, and there are multiple pixels in a byte, it doesn’t appear to be as linear as it is. Apparently the mask is a byte value with 1 bit set, which is then rotated to the right one bit at a time. The most likely starting value for the mask would be 80h.

As you can also see from Trixter’s comments, even at first glance, the code does already look suboptimal. The mov es:[di], al is completely redundant, since stosb writes al to es:[di] as well, before increasing di.
Which is interesting in a way. Such an old game, which is not carefully hand-optimized. Not what I would have expected. It is even possible that this code was actually generated by a compiler.

Looking at it some more, I started spotting other optimizations. Firstly, if you were to take a second register for the inverted mask, you only need one extra rotate, and you can remove the two not-instructions:

lodsb           ; load packed pixels from source
and al, ah      ; mask some pixels out
mov dl, es:[di] ; dl = existing byte onscreen
and dl, dh      ; mask some screen pixels out
or al, dl       ; combine source and screen pixels
stosb           ; store combination on screen
ror ah, 1       ; rotate mask by single bit
ror dh, 1       ; rotate inverted mask by single bit

And once you’ve come this far, it becomes apparent that by choosing your registers wisely, you can combine the two 8-bit ands into a single 16-bit and:

lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, dx      ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
ror dl, 1       ; rotate mask by single bit
ror dh, 1       ; rotate inverted mask by single bit

This reinforces my belief that it is not hand-coded. Because if you write this code in a language like C, you would do it ‘by the book’. And the above optimizations are not trivial to see from C code. From assembly however, they just leap out at you.

We could take the idea of going 16-bit even further:

lodsw           ; load packed pixels from source
and ax, dx      ; mask some pixels out
not dx          ; invert mask
mov bx, es:[di] ; dl = existing byte onscreen
and bx, dx      ; mask some screen pixels out
or ax, bx       ; combine source and screen pixels
stosw           ; store combination on screen
not dx          ; invert mask
ror dl, 1       ; rotate mask by single bit
ror dh, 1       ; rotate mask by single bit

Note however that for this version I chose to go back to using the not-instructions. The reason for this is that the rotates have to be done per byte. This would mean 4 rotate-instructions. The nots however can be done per word, so I still only need 2 of them in this 16-bit version.

These rotates however, they are quite a limiting factor on the overall optimization. So I wanted to try and get rid of them. I had the following idea: since you rotate a byte one bit at a time, you only have 8 unique values for the masks, a sequence which keeps repeating.
So if you unroll the code 8 times, you can hardcode all the masks as immediate operands, and remove the rotate-instructions altogether.

There is a catch however: The code has to be able to start with any given mask. The simplest way around that is to determine which mask you get, and then use a jumptable to jump to the right routine:

tdTable dw td0, td1, td2, td3, td4, td5, td6, td7

; Determine position of bit
xor bx, bx

@@bitLoop:
inc bx
add ah, ah
jnz @@bitLoop

add bx, bx
jmp cs:[tdTtable][bx][-2]

I first implemented it with byte-oriented routines.
I will just show the first and the second, so you can see how they are the same, except for the immediate operands being pre-rotated differently. The other 6 routines work the same way:

td0 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 7F80h   ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0BF40h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0DF20h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0EF10h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0F708h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FB04h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FD02h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FE01h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
loop @@loopScanLine

pop cx
loop @@loopHeight

ret
td0 ENDP

td1 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0BF40h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0DF20h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0EF10h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0F708h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FB04h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FD02h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 0FE01h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; dl = existing byte onscreen
and ax, 7F80h   ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
loop @@loopScanLine

pop cx
loop @@loopHeight

ret
td1 ENDP

Quite a lot of code… However, we can ‘re-roll’ it somewhat, by using 16-bit instructions instead. That way you process 2 bytes at a time, so the unrolled code only has to do 4 iterations:

td0 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 4080h   ; mask some pixels out
and dx, 0BF7Fh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 1020h   ; mask some pixels out
and dx, 0EFDFh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 0408h   ; mask some pixels out
and dx, 0FBF7h  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 0102h   ; mask some pixels out
and dx, 0FEFDh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
loop @@loopScanLine

pop cx
loop @@loopHeight

ret
td0 ENDP

td1 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 2040h   ; mask some pixels out
and dx, 0DFBFh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 0810h   ; mask some pixels out
and dx, 0F7EFh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 0204h   ; mask some pixels out
and dx, 0FDFBh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dl = existing byte onscreen
and ax, 8001h   ; mask some pixels out
and dx, 07FFEh  ; mask some pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
loop @@loopScanLine

pop cx
loop @@loopHeight

ret
td1 ENDP

So, and what was the point of all this? Well, nothing really! Namely, the transition only does a few scanlines per frame, and is tuned to perform a transition in 3 seconds. So, even though we now have a faster routine, it only means that we have more idle time, if we want to stick to the 3 second transition speed. Going faster could ruin the effect. But hey, at least it was fun, and we got to flex our creative muscle, so it’s all good!

4 Responses to Test Drive’s pixelized transition effect

Jason Knight says:

December 25, 2014 at 4:49 am

Since you’re simply undoing the NOT, why not not store the not? Looks like you’ve got some regs free there so…

mov dx, bx
lodsw
and ax, dx
not dx
and dx, [es : di]
or ax, dx
stosw
ror bx, 3

So long as it’s odd you’ll hit all the bits. Laughably, doing the rotate and using the registers MAY in fact be faster — even rolled up — than your use of immediates as well due to those extra bytes being fetched.

It might also be more efficient to process all the same data at the same time. If you KNOW that on each scanline you’re going to be hitting 20 characters of 2040, 20 characters of 0810, 20 characters of 0204, and 20 characters of 8001, put those values into a free register and hit them en-masse. The extra ‘add’ will likely take less time than all those immediates — especially if you can scrounge another register for the value to be added

; just repeat the following four times for a scanline
; assumes bp is 3 and BX is the mask
mov cx, 20
.loop:
mov dx, bx
lodsw
add si, bp
and ax, dx
not dx
and dx, [es : di]
or ax, dx
stosw
add di, bp
loop .loop
; I’d skip the following on the last one
ror bx, 3
sub di, 78
sub si, 78

Basically get as much out of inside the loop as possible. The two add reg,reg should have a lower footprint than those immediates, and using DX to store a copy of the current mask…

Skipping the ror on the last set of 20 would leave you with a nice mask for the next line.

Though this effect often annoyed me back in the day as three seconds is just too damned long… So many games that used this type of effect (Arena and Daggerfall come to mind) it’s annoying waiting for some goofy effect you can’t even bypass with a keypress. UI design 101, never take more than a second to do some silly animation that is not part of actually using whatever program it is.

- Scali says:
  
  December 25, 2014 at 7:11 pm
  
  Thanks for sharing your ideas, Jason.
  However, I don’t think your code is entirely correct. You have replaced the two byte-rors of 1 bit with a single word-ror of 3 bits. But that is not equivalent.
  The idea of replacing a not with a mov will work indeed. While you won’t win in size, the mov can be 2 cycles when it is run from the prefetch buffer, while not is 3 cycles.
  Likewise, I think your ideas of unrolling further and processing the bytes out-of-order (while using registers instead of immediates) could also work and save a bit here and there. I’d have to write out the code to see exactly how it would work out.
  
Krister Nordvall says:

December 26, 2014 at 6:16 am

You never mentioned which CPU this is optimized for but I assume it’s for 8088 processors. If so, you should use shl reg,1 instead of add reg,reg.

A simple improvement on that last routine is to load the mask to DX first and then do the ANDing as you load the pixel data. So for example, instead of this;
mov dx, es:[di]
and dx, 0DFBFh

do it like this;
mov dx, 0DFBFh
and dx, es:[di]

1 byte less which means 32 bytes less in total.

- Scali says:
  
  December 26, 2014 at 9:40 am
  
  Yes, the original game targets 8088, so I thought I would try to stick to that as well, even though I did not mention that explicitly.
  
  I suppose that is another reason why the suggested ‘ror bx, 3’ above is not applicable: it would require a 286.
  
  Thanks for the suggestions, they would shave a few cycles/bytes off indeed!
  The mov/and reversal is interesting… it works because mov has no special accumulator version, all immediate forms are 1 byte opcodes. An immediate and is only 1 byte when you use the accumulator.
  So if the code used ax instead of dx, you wouldn’t be able to save a byte, but for all other registers you do.

	equipthering on An Amiga can’t do Wolfen…
	Mike Dawson on Running anything Remedy/Future…
	.NET Core: the small… on Migrating to .NET Core: the fu…
	Scali on Video playback on low-end MS-D…
	IBM PCjr \| ancientel… on What makes the PCjr cool, and…