Triton’s Crystal Dream – the final chapter

A few years ago, I patched some of the code in Triton’s Crystal Dream demo from 1992:

Although my code made it work on my real 486DX2-80 machine with Sound Blaster Pro (the ‘ideal’ setup for this demo), it did not work in DOSBox at the time:

It still does not work properly in DOSBox though, but I think that is related to the way it uses DMA. It also won’t run properly under Windows 95, only from pure DOS. It might be because it does not reset the PIC timer rate to the default value… But I don’t know if and when I’ll ever look into that.

Well, the other day I was fixing some other bugs in DOSBox… namely, the screenwidth was hardcoded for the CGA composite emulation mode. I figured, since I had a working configuration to build the DOSBox source code anyway, I might as well give Crystal Dream another look.
As I said in the earlier blog, I thought the issue was related to the PIC, because DOSBox prints the following messages on the console during the demo:
PIC:ICW4: 1f, special fully-nested mode not handled
PIC:ICW4: 1d, special fully-nested mode not handled
PIC:ICW4: 1f, special fully-nested mode not handled

This turned out to be a wild goose chase however. I thought that the demo would use at least two interrupts, namely the timer interrupt, and the SB DMA interrupt. Since DOSBox cannot handle the special fully-nested mode, I thought perhaps some interrupts got lost.

This was not the case however. I opened up my old project, where I had reversed most of the sound playing code in order to fix the SB detection. And lo and behold: there was no interrupt handler at all for the SB!
Normally you’d get an interrupt when the DMA transfer is complete, so you set up a new buffer. So what is it exactly that Crystal Dream does then?

Well, one part of it is a rather curious DMA mode: it sets up a DMA transfer for 1 byte at a time, and uses the timer interrupt to replace this byte at the replay rate. This apparently works in DOSBox, because you can hear the first buffer playing (it sounds rather distorted though). However, the DMA then stops, while it continues playing on a real system. I looked into the SB emulation code of DOSBox somewhat, and found that if I forced the ‘autoinit’ DMA mode, it would continue playing without requiring an interrupt handler to restart the DMA. However, the code does not seem to set up the SB for autoinit, which makes sense, since only SB 2.0 and later support it, and this demo would target early hardware as well.

I happened to read about the single-byte DMA technique, when looking into DOSBox-X, a patched version of DOSBox aiming to emulate the hardware more correctly. It is known as ‘goldplay’ there, since Goldplay is an early MOD player which used this technique. I tried using ‘goldplay’ mode on Crystal Dream, but although it made the sound clean (regular DOSBox DMA code does not process the DMA quickly enough, so the sound is very noisy and distorted), it still stopped after the first buffer.

I looked into the code somewhat more, and noticed that there were some comments about Crystal Dream in the code. Apparently this issue has been looked at. And indeed, once I used the correct settings (the patches only apply for regular SB/SBPro configurations, not for the default SB16), DOSBox-X could play back the sound properly.

There is a document on how exactly Crystal Dream handles its Sound Blaster audio, and why it wouldn’t work in DOSBox: https://github.com/joncampbell123/dosbox-x/blob/master/NOTES/Triton%20-%20Crystal%20Dream.txt
Here is also a discussion on the issue and the fixes in DOSBox-X: http://www.vogons.org/viewtopic.php?f=41&t=31881&start=480#p362451
In short, it doesn’t rely on the SB’s interrupts at all. It simply polls the SB status and resets the playback from inside the timer interrupt.
While this works on real hardware, the DOSBox emulation wasn’t accurate enough to allow for such polling. But DOSBox-X contains some patches to handle the situation.
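
To make the idea of “no SB interrupt, just polling from the timer” a bit more tangible, here is a toy model in C. It is only a sketch under loud assumptions: `ToyCard`, `toy_timer_tick`, `toy_card_fetch` and `toy_play` are hypothetical names, nothing here corresponds to real Sound Blaster registers or to DOSBox internals, and the “status” is reduced to a single flag. It merely shows why playback stops after the first 1-byte block if nothing restarts it, and keeps going if the timer handler polls and restarts.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: hypothetical names, not real SB or DOSBox state. */
typedef struct {
    uint8_t dma_byte;   /* the one byte the DMA channel keeps pointing at */
    int     playing;    /* 1 while the card still has a block to fetch */
} ToyCard;

/* Timer handler, called at the replay rate. */
static void toy_timer_tick(ToyCard *card, uint8_t next_sample,
                           int poll_and_restart)
{
    card->dma_byte = next_sample;            /* replace the single DMA byte */
    if (poll_and_restart && !card->playing)  /* poll status, no SB IRQ used */
        card->playing = 1;                   /* start the next 1-byte block */
}

/* Card side: fetch one sample if a block is pending. Returns 1 on output. */
static int toy_card_fetch(ToyCard *card, uint8_t *out)
{
    if (!card->playing)
        return 0;
    *out = card->dma_byte;
    card->playing = 0;  /* 1-byte block done; without auto-init it stops */
    return 1;
}

/* Run `len` timer ticks; returns how many samples reached the output. */
static size_t toy_play(const uint8_t *song, size_t len, uint8_t *out,
                       int poll_and_restart)
{
    ToyCard card = { 0, 1 };  /* assume the first transfer was started */
    size_t produced = 0;

    assert(song != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        toy_timer_tick(&card, song[i], poll_and_restart);
        if (toy_card_fetch(&card, &out[produced]))
            produced++;
    }
    return produced;
}
```

Calling `toy_play(song, len, out, 1)` delivers every sample in this model, while `toy_play(song, len, out, 0)` delivers only the first one, which loosely mirrors the “stops after the first buffer” symptom described above.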

Posted in Oldskool/retro programming, Software development

Test Drive’s pixelized transition effect

Today I want to talk about the transition effect in the game Test Drive from 1987. The effect will switch from one image to the next by replacing it pixel for pixel in a somewhat randomized pattern:

I discussed this effect with Trixter, as I was thinking of ‘borrowing’ the idea for a future demo. I described how I thought the effect worked, roughly, and that I might want to reverse-engineer it to make sure.
He beat me to it however, and sent me this snippet as the core of the effect (comments are his):

lodsb           ; load packed pixels from source
and al, ah      ; mask some pixels out
not ah          ; invert mask
mov dl, es:[di] ; dl = existing byte onscreen
and dl, ah      ; mask some screen pixels out
or al, dl       ; combine source and screen pixels
mov es:[di], al ; store combination on screen
stosb           ; someone doesn't understand what stosb does!
not ah          ; invert mask
ror ah, 1       ; rotate mask by single bit

It was more or less what I expected, although it was simpler in nature, more linear. This code is run for an entire scanline. The order of the scanlines is randomized, but as you can see, it merely rotates the mask linearly along each scanline. But since this is done per-byte, and there are multiple pixels in a byte, it doesn’t appear to be as linear as it is. Apparently the mask is a byte value with 1 bit set, which is then rotated to the right one bit at a time. The most likely starting value for the mask would be 80h.
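
For readers who prefer C over assembly, the snippet can be modelled roughly as follows. This is a sketch, not the game’s code: `ror8` and `blend_scanline` are names made up for illustration, and the starting mask of 80h is only the guess mentioned above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Rotate an 8-bit value right by one bit, like `ror ah, 1`. */
static uint8_t ror8(uint8_t v)
{
    return (uint8_t)((v >> 1) | (v << 7));
}

/* One pass over a scanline (hypothetical helper).
 * Each byte takes the source bits selected by the mask and keeps the
 * remaining bits already on screen; the mask rotates after every byte.
 * Returns the mask that the next byte would have used. */
static uint8_t blend_scanline(uint8_t *screen, const uint8_t *src,
                              size_t n, uint8_t mask)
{
    assert(screen != NULL && src != NULL);
    for (size_t i = 0; i < n; i++) {
        screen[i] = (uint8_t)((src[i] & mask) | (screen[i] & (uint8_t)~mask));
        mask = ror8(mask);
    }
    return mask;
}
```

With an all-zero screen, an all-ones source and a starting mask of 80h, the first three bytes become 80h, 40h and 20h: one new bit per byte, shifted along by one position each time. In this model, eight passes over the same scanline with eight different starting masks would replace every bit; whether the game schedules its passes exactly like that is not something the snippet itself shows.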

As you can also see from Trixter’s comments, even at first glance, the code does already look suboptimal. The mov es:[di], al is completely redundant, since stosb writes al to es:[di] as well, before increasing di.
Which is interesting in a way. Such an old game, which is not carefully hand-optimized. Not what I would have expected. It is even possible that this code was actually generated by a compiler.

Looking at it some more, I started spotting other optimizations. Firstly, if you were to take a second register for the inverted mask, you only need one extra rotate, and you can remove the two not-instructions:

lodsb           ; load packed pixels from source
and al, ah      ; mask some pixels out
mov dl, es:[di] ; dl = existing byte onscreen
and dl, dh      ; mask some screen pixels out
or al, dl       ; combine source and screen pixels
stosb           ; store combination on screen
ror ah, 1       ; rotate mask by single bit
ror dh, 1       ; rotate inverted mask by single bit

And once you’ve come this far, it becomes apparent that by choosing your registers wisely, you can combine the two 8-bit ands into a single 16-bit and:

lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, dx      ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
ror dl, 1       ; rotate mask by single bit
ror dh, 1       ; rotate inverted mask by single bit

This reinforces my belief that it is not hand-coded. Because if you write this code in a language like C, you would do it ‘by the book’. And the above optimizations are not trivial to see from C code. From assembly however, they just leap out at you.
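
The register trick of folding the two 8-bit ands into one 16-bit and can be checked exhaustively with a small C model. The function names are made up for this sketch; the register roles in the comments follow the listings above (al = source, ah = screen, dl = mask, dh = inverted mask).

```c
#include <assert.h>
#include <stdint.h>

/* Two separate 8-bit ANDs, as in the version with a second mask register. */
static uint8_t blend_two_ands(uint8_t src, uint8_t scr, uint8_t mask)
{
    uint8_t al = (uint8_t)(src & mask);            /* and al, ah */
    uint8_t dl = (uint8_t)(scr & (uint8_t)~mask);  /* and dl, dh */
    return (uint8_t)(al | dl);                     /* or al, dl  */
}

/* One 16-bit AND on the register pair. */
static uint8_t blend_one_and(uint8_t src, uint8_t scr, uint8_t mask)
{
    uint16_t ax = (uint16_t)((scr << 8) | src);               /* ah:al */
    uint16_t dx = (uint16_t)(((uint8_t)~mask << 8) | mask);   /* dh:dl */
    ax &= dx;                                                 /* and ax, dx */
    return (uint8_t)((ax & 0xFF) | (ax >> 8));                /* or al, ah  */
}

/* Compare both variants for every source byte, screen byte and mask. */
static void check_all_inputs(void)
{
    for (unsigned m = 0; m < 256; m++)
        for (unsigned s = 0; s < 256; s++)
            for (unsigned d = 0; d < 256; d++)
                assert(blend_two_ands((uint8_t)s, (uint8_t)d, (uint8_t)m) ==
                       blend_one_and((uint8_t)s, (uint8_t)d, (uint8_t)m));
}
```

`check_all_inputs()` runs through all 2^24 combinations, so the equivalence does not depend on the mask having a single bit set.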

We could take the idea of going 16-bit even further:

lodsw           ; load packed pixels from source
and ax, dx      ; mask some pixels out
not dx          ; invert mask
mov bx, es:[di] ; bx = existing word onscreen
and bx, dx      ; mask some screen pixels out
or ax, bx       ; combine source and screen pixels
stosw           ; store combination on screen
not dx          ; invert mask
ror dl, 1       ; rotate mask by single bit
ror dh, 1       ; rotate mask by single bit

Note however that for this version I chose to go back to using the not-instructions. The reason for this is that the rotates have to be done per byte. This would mean 4 rotate-instructions. The nots however can be done per word, so I still only need 2 of them in this 16-bit version.
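
One thing worth checking with a word-wise loop is which mask each byte actually gets. The C sketch below lists the masks per byte for the byte loop and for a word loop, assuming (the listing does not show the initial value of dx) that dl holds the mask for the first byte and dh starts one bit further along. All names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Rotate right by n bits (n taken modulo 8). */
static uint8_t ror8n(uint8_t v, unsigned n)
{
    n &= 7;
    return (uint8_t)((v >> n) | (v << ((8 - n) & 7)));
}

/* Masks seen by consecutive bytes in the byte-wise loop. */
static void byte_masks(uint8_t start, uint8_t *out, size_t n_bytes)
{
    for (size_t i = 0; i < n_bytes; i++)
        out[i] = ror8n(start, (unsigned)i);
}

/* Masks seen by consecutive bytes in a word-wise loop: dl covers the
 * first byte of each word, dh the second; both advance `step` bits
 * after every word. */
static void word_masks(uint8_t start, unsigned step, uint8_t *out,
                       size_t n_words)
{
    uint8_t dl = start;
    uint8_t dh = ror8n(start, 1);  /* assumption: dh starts one bit ahead */
    for (size_t i = 0; i < n_words; i++) {
        out[2 * i]     = dl;
        out[2 * i + 1] = dh;
        dl = ror8n(dl, step);
        dh = ror8n(dh, step);
    }
}

/* Returns 1 if the word loop visits the same masks as the byte loop. */
static int same_pattern(uint8_t start, unsigned step, size_t n_words)
{
    uint8_t a[64], b[64];
    assert(n_words <= 32);
    byte_masks(start, a, 2 * n_words);
    word_masks(start, step, b, n_words);
    return memcmp(a, b, 2 * n_words) == 0;
}
```

In this model the word loop only reproduces the byte-wise pattern when both mask bytes advance two bits per word (`step` = 2). That is also what the immediates in the unrolled word routines further down imply (4080h, 1020h, 0408h, 0102h). With a single rotate per mask byte per word (`step` = 1) the pattern is similar but not identical.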

These rotates however, they are quite a limiting factor on the overall optimization. So I wanted to try and get rid of them. I had the following idea: since you rotate a byte one bit at a time, you only have 8 unique values for the masks, a sequence which keeps repeating.
So if you unroll the code 8 times, you can hardcode all the masks as immediate operands, and remove the rotate-instructions altogether.

There is a catch however: The code has to be able to start with any given mask. The simplest way around that is to determine which mask you get, and then use a jumptable to jump to the right routine:

tdTable dw td0, td1, td2, td3, td4, td5, td6, td7

; Determine position of bit
xor bx, bx

@@bitLoop:
inc bx
add ah, ah
jnz @@bitLoop

add bx, bx
jmp cs:[tdTable][bx][-2]
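
The bit-position loop is compact enough that it is easy to misread, so here is the same logic as a C sketch. `mask_to_routine` is a made-up name; the assert on a single set bit is a precondition of this model, not something the assembly checks.

```c
#include <assert.h>
#include <stdint.h>

/* Mirrors: xor bx, bx / inc bx / add ah, ah / jnz.
 * Returns the index into the jump table (0 for td0 ... 7 for td7). */
static unsigned mask_to_routine(uint8_t mask)
{
    unsigned bx = 0;                              /* xor bx, bx */

    assert(mask != 0 && (mask & (mask - 1)) == 0); /* exactly one bit set */
    do {
        bx++;                                     /* inc bx */
        mask = (uint8_t)(mask + mask);            /* add ah, ah */
    } while (mask != 0);                          /* jnz @@bitLoop */

    /* The assembly doubles bx for word-sized table entries and uses a
     * displacement of -2; here that is simply "minus one entry". */
    return bx - 1;
}
```

A mask of 80h needs one doubling to overflow to zero, so it selects td0; 01h needs eight and selects td7. Note that the loop destroys the mask in ah, which is fine because the unrolled routines carry their masks as immediates.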

I first implemented it with byte-oriented routines.
I will just show the first and the second, so you can see how they are the same, except for the immediate operands being pre-rotated differently. The other 6 routines work the same way:

td0 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 7F80h   ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0BF40h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0DF20h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0EF10h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0F708h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FB04h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FD02h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FE01h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
loop @@loopScanline

pop cx
loop @@loopHeight

ret
td0 ENDP

td1 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0BF40h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0DF20h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0EF10h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0F708h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FB04h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FD02h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 0FE01h  ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
lodsb           ; load packed pixels from source
mov ah, es:[di] ; ah = existing byte onscreen
and ax, 7F80h   ; mask some pixels out
or al, ah       ; combine source and screen pixels
stosb           ; store combination on screen
loop @@loopScanline

pop cx
loop @@loopHeight

ret
td1 ENDP

Quite a lot of code… However, we can ‘re-roll’ it somewhat, by using 16-bit instructions instead. That way you process 2 bytes at a time, so the unrolled code only has to do 4 iterations:

td0 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 4080h   ; mask some source pixels out
and dx, 0BF7Fh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 1020h   ; mask some source pixels out
and dx, 0EFDFh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 0408h   ; mask some source pixels out
and dx, 0FBF7h  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 0102h   ; mask some source pixels out
and dx, 0FEFDh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
loop @@loopScanline

pop cx
loop @@loopHeight

ret
td0 ENDP

td1 PROC
mov cx, 100

@@loopHeight:
push cx
mov cx, 80/8

@@loopScanline:
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 2040h   ; mask some source pixels out
and dx, 0DFBFh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 0810h   ; mask some source pixels out
and dx, 0F7EFh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 0204h   ; mask some source pixels out
and dx, 0FDFBh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
lodsw           ; load packed pixels from source
mov dx, es:[di] ; dx = existing word onscreen
and ax, 8001h   ; mask some source pixels out
and dx, 07FFEh  ; mask some screen pixels out
or ax, dx       ; combine source and screen pixels
stosw           ; store combination on screen
loop @@loopScanline

pop cx
loop @@loopHeight

ret
td1 ENDP
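
Since only td0 and td1 are shown, here is a small C sketch that derives the immediates for any of the eight routines, for both the byte-oriented and the word-oriented variants. The function names are invented for this example; the values they produce for td0 and td1 match the listings above, which is the only ground truth used.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Rotate right by n bits (n taken modulo 8). */
static uint8_t ror8n(uint8_t v, unsigned n)
{
    n &= 7;
    return (uint8_t)((v >> n) | (v << ((8 - n) & 7)));
}

/* Byte variant: immediate for step k (0..7) of routine n (0..7).
 * Low byte = mask for the source in al, high byte = inverted mask for
 * the screen byte in ah. */
static uint16_t td_byte_immediate(unsigned n, unsigned k)
{
    assert(n < 8 && k < 8);
    uint8_t mask = ror8n(0x80, n + k);
    return (uint16_t)(((uint8_t)~mask << 8) | mask);
}

/* Word variant: source mask for step k (0..3) of routine n (0..7).
 * Low byte masks the first byte of the word, high byte the second.
 * The screen mask is simply the bitwise inverse of this value. */
static uint16_t td_word_source_mask(unsigned n, unsigned k)
{
    assert(n < 8 && k < 4);
    uint8_t lo = ror8n(0x80, n + 2 * k);
    uint8_t hi = ror8n(0x80, n + 2 * k + 1);
    return (uint16_t)((hi << 8) | lo);
}

/* Print the operands of routine n in assembler notation. A leading 0 is
 * always emitted, which is harmless when the value starts with a digit. */
static void print_td_routine(unsigned n)
{
    for (unsigned k = 0; k < 8; k++)
        printf("and ax, 0%04Xh\n", (unsigned)td_byte_immediate(n, k));
    for (unsigned k = 0; k < 4; k++) {
        uint16_t src = td_word_source_mask(n, k);
        printf("and ax, 0%04Xh / and dx, 0%04Xh\n",
               (unsigned)src, (unsigned)(uint16_t)~src);
    }
}
```

Calling `print_td_routine(2)` through `print_td_routine(7)` would give the operands for the six routines not shown here. Generating them this way also avoids hand-rotating 48 constants.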

So, what was the point of all this? Well, nothing really! Namely, the transition only does a few scanlines per frame, and is tuned to perform a transition in 3 seconds. So, even though we now have a faster routine, it only means that we have more idle time, if we want to stick to the 3 second transition speed. Going faster could ruin the effect. But hey, at least it was fun, and we got to flex our creative muscle, so it’s all good!

Posted in Oldskool/retro programming, Software development

CGADEMO by Codeblasters

Today I want to talk about a rather obscure, yet interesting demo, namely CGADEMO by Codeblasters, from 1992:

As you can read from the scroller, what’s interesting about this demo is that it runs at full framerate (60 Hz) even on the original IBM PC (8088 at 4.77 MHz with CGA). And that there are 16 colours on screen at the same time.

Unstable rasters

To start with the 16 colours… They use a trick similar to my palette switching in the 1991 donut. Namely, they change the background colour of the CGA palette at every scanline, which gives a rasterbar effect. They can pick any of the 16 CGA colours for the background, which allows them to show all 16 colours on screen at a time. This is very similar to what I have discussed on C64. This demo does not use a stable raster however, since that is very difficult to achieve on a PC anyway. Instead, they use polling of the ‘display enable’ status bit to determine when a scanline is finished (and the rasterbeam enters the horizontal retrace). The downside of this technique is that you’re burning up valuable CPU time in the polling loop, which you could have used for other effects.

When I discussed the rasterbars in INTROjr, I said this:

In INTROjr, the rasterbars may not be perfectly stable, but that is understandable, given that the CPU is far too slow to do any remotely accurate polling for the horizontal blank interval

Was I right? Well, I was… but I wasn’t… Namely, with CGADEMO, this happens with the rasterbars on the left of the screen:

Visible jitter. Sometimes the polling is so slow that the background colour is switched in the visible part of the scanline, so you see the old colour on the left. I figured that this would happen on a PC. However, when I wrote the above, I thought it would be worse than it actually is (on a C64 you have very fast access to the mem-mapped registers of the VIC-II chip. I figured communicating with the CRT controller on a CGA card with port I/O over an 8-bit ISA bus would be much slower). It’s actually not even that bad. You might not even notice at first.

Trixter originally sent me the CGADEMO binary, and he mentioned something about removing the noise. So I figured I’d have a look at the code myself, and see what he was talking about.

After reverse-engineering the code a bit, I isolated the rasterbar code:

waitHBL1:				; CODE XREF: start_0+70j start_0+7Fj
		in	al, dx		; Video	status bits:
					; 0: retrace.  1=display is in vert or horiz retrace.
					; 1: 1=light pen is triggered; 0=armed
					; 2: 1=light pen switch	is open; 0=closed
					; 3: 1=vertical	sync pulse is occurring.
		test	al, 1
		jnz	short waitHBL1

waitHBL2:				; CODE XREF: start_0+75j
		in	al, dx		; Video	status bits:
					; 0: retrace.  1=display is in vert or horiz retrace.
					; 1: 1=light pen is triggered; 0=armed
					; 2: 1=light pen switch	is open; 0=closed
					; 3: 1=vertical	sync pulse is occurring.
		test	al, 1
		jz	short waitHBL2
		mov	dx, 3D9h
		lodsb
		out	dx, al
		mov	dx, 3DAh
		loop	waitHBL1

The first HBL-loop waits until a new scanline starts. The second HBL-loop waits until the scanline ends, and we enter the horizontal retrace/blank interval. This gives us a small period of time to modify the background colour before the new scanline starts drawing. So it is important to change it as soon as possible (you could still insert some other effect code between the two loops… or if you are not concerned about the code running on machines that are ‘too fast’, you could remove the first loop altogether, if you can assume that your effect code will make sure that by the time you enter the polling loop for start-of-hbl, you will not still be in the previous hbl).
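
The structure of the two polling loops can also be written down in C against a simulated status port. This is only a sketch: `ReadStatusFn`, `WriteColourFn`, `FakeCrt` and the fake timing (8 status reads of visible scanline, then 2 of retrace, with time advancing only on reads) are all invented for illustration and say nothing about real CGA timing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for `in al, dx` (3DAh) and `out dx, al` (3D9h). */
typedef uint8_t (*ReadStatusFn)(void *ctx);
typedef void (*WriteColourFn)(void *ctx, uint8_t colour);

/* One background colour per scanline, written at the start of retrace. */
static void raster_bars(const uint8_t *colours, size_t lines,
                        ReadStatusFn read_status,
                        WriteColourFn write_colour, void *ctx)
{
    assert(colours != NULL && read_status != NULL && write_colour != NULL);
    for (size_t i = 0; i < lines; i++) {
        while (read_status(ctx) & 1) { }     /* waitHBL1: still in retrace  */
        while (!(read_status(ctx) & 1)) { }  /* waitHBL2: still displaying  */
        write_colour(ctx, colours[i]);       /* lodsb + out dx, al          */
    }
}

/* Fake display: bit 0 is clear for 8 reads, then set for 2 reads. */
typedef struct {
    unsigned tick;               /* number of status reads so far */
    unsigned writes;             /* colour writes seen */
    unsigned writes_in_retrace;  /* ...of which landed during retrace */
    uint8_t  last_colour;
} FakeCrt;

static uint8_t fake_status(void *ctx)
{
    FakeCrt *crt = ctx;
    unsigned phase = crt->tick++ % 10;
    return (uint8_t)(phase >= 8 ? 1 : 0);
}

static void fake_write(void *ctx, uint8_t colour)
{
    FakeCrt *crt = ctx;
    crt->writes++;
    if ((crt->tick - 1) % 10 >= 8)  /* phase of the most recent status read */
        crt->writes_in_retrace++;
    crt->last_colour = colour;
}
```

In this idealised model every write lands in the retrace window. On the real machine the instructions between detecting the retrace and the `out` take time, and that time is what the rest of this section is about.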

The problem here is the ‘lodsb’: the rasterbar colours are pre-calced in a table, and loaded at the very last moment to change the colour (the out dx, al immediately after it). On an 8088, memory access is very slow. Each byte takes 4 cycles to load (this goes for both instructions and data). While lodsb in itself is a very efficient instruction, it is not the best choice in this particular case. Namely, we have plenty of time to load from memory while waiting for our scanline to finish drawing. If we perform the memory access there, we can save those 4 cycles later.

But, there is a problem: both the in and out instructions are hardwired to use dx and al (this is the sort of thing why x86 is hated by so many assembly programmers. Most other systems don’t hardwire their registers to instructions like that, and/or just use memory-mapped I/O, so you don’t need special instructions to access hardware registers). So, we can’t just move the lodsb out of the loop, since al will get overwritten by the in. What we can do however, is to store it in a different register, and then just copy it to al before the out.

So, you figure… bx is still free, shall we use that? Just mov al, bl and that’s that? Well, then the joke is on you: the shortest form of a mov-instruction is still 2 bytes, where lodsb was only 1 byte. So you are only moving the memory access from the byte you wanted to read to an additional byte for your instruction. There is a ‘loophole’ however: some instructions on x86 have a special short encoding for the accumulator register. xchg is such an instruction. It only exists for ax though, not for al. But that doesn’t matter: we can just do xchg ax, bx.
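
The byte-counting argument can be put into numbers with a few lines of C. This is a deliberately crude sketch: it counts only the bytes that have to come over the bus between leaving the second wait loop and the `out`, at the 4 cycles per byte quoted above, and ignores execution time, the prefetch queue and the I/O write itself (which is the same in all three variants). The instruction sizes are the standard 8086 encodings; the type and array names are made up.

```c
#include <assert.h>
#include <stddef.h>

enum { CYCLES_PER_BYTE = 4 };  /* 8088 bus figure quoted in the text */

typedef struct {
    const char *text;      /* instruction, for reference only */
    unsigned code_bytes;   /* opcode and operand bytes to fetch */
    unsigned data_bytes;   /* memory operand bytes read */
} Insn;

/* Original: colour is loaded right before the out. */
static const Insn WITH_LODSB[] = {
    { "mov dx, 3D9h", 3, 0 },
    { "lodsb",        1, 1 },
    { "out dx, al",   1, 0 },
};

/* Naive fix: preload into bl, copy with a 2-byte register mov. */
static const Insn WITH_MOV[] = {
    { "mov dx, 3D9h", 3, 0 },
    { "mov al, bl",   2, 0 },
    { "out dx, al",   1, 0 },
};

/* Fix used here: preload, then the 1-byte accumulator xchg. */
static const Insn WITH_XCHG[] = {
    { "mov dx, 3D9h", 3, 0 },
    { "xchg ax, bx",  1, 0 },
    { "out dx, al",   1, 0 },
};

/* Bus cycles spent fetching code and data for a sequence. */
static unsigned fetch_cycles(const Insn *seq, size_t count)
{
    unsigned bytes = 0;
    assert(seq != NULL);
    for (size_t i = 0; i < count; i++)
        bytes += seq[i].code_bytes + seq[i].data_bytes;
    return bytes * CYCLES_PER_BYTE;
}
```

Under these assumptions the original and the mov variant both come out at 24 cycles, while the xchg variant needs 20: the 4 cycles for the data byte only disappear if the replacement instruction is itself a single byte.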

I noticed another place where we can have a small gain however: test al, 1. Although we cannot shave a byte off this instruction… what we CAN do is to replace it with test al, ah. Using a register takes one less cycle than using an immediate operand. We are not using ah anyway, except that it gets swapped with bh as part of the xchg ax, bx. So we can easily load it with 1, and use the slightly faster test al, ah.

So my polling loop now looks like this:

		mov	bh, 1

loopHBL:		
		lodsb
		xchg	ax, bx

waitHBL1:				; CODE XREF: start_0+70j start_0+7Fj
		in	al, dx		; Video	status bits:
					; 0: retrace.  1=display is in vert or horiz retrace.
					; 1: 1=light pen is triggered; 0=armed
					; 2: 1=light pen switch	is open; 0=closed
					; 3: 1=vertical	sync pulse is occurring.
		test	al, ah
		jnz	short waitHBL1

waitHBL2:				; CODE XREF: start_0+75j
		in	al, dx		; Video	status bits:
					; 0: retrace.  1=display is in vert or horiz retrace.
					; 1: 1=light pen is triggered; 0=armed
					; 2: 1=light pen switch	is open; 0=closed
					; 3: 1=vertical	sync pulse is occurring.
		test	al, ah
		jz	short waitHBL2
		mov	dx, 3D9h
		xchg	ax, bx
		out	dx, al
		mov	dx, 3DAh
		loop	loopHBL

The few cycles we save are enough to move the jitter of the colour change outside of the visible area. Not the same approach as the stable raster I demonstrated on C64 earlier, but the result looks just as good in practice. So I have to correct myself on what I said earlier: even the slowest PC *is* fast enough to perform accurate enough rasterbars (but mind you, INTROjr runs on a PCjr, which is even slower than a regular PC, since it shares videomemory with the CPU, causing extra waitstates on reading instructions/data from memory).

Unsmooth scrolling

A second problem I noticed was with the scroller:

What is happening here? Well, I have discussed ‘racing the beam’ earlier. In this case, the scroller is losing the race to the beam. You can see the part where the beam overtakes the scroll routine.

Trixter says the scroller looks fine on his real IBM PC 5150 with real IBM CGA. My PC is a clone, a Philips P3105 with an ATi Small Wonder CGA-compatible videocard. It does have an original Intel 8088 CPU at 4.77 MHz. But apparently the system as a whole is even slower than the real IBM PC. I think the Small Wonder is the culprit.

As you may know, there is no scrolling hardware whatsoever on a CGA card, so what we have here is a bruteforce software scroller. Just a memcpy() routine basically. The original code looks like this:

doScroll	proc near		; CODE XREF: start_0+A0p
		mov	ax, 0B800h
		mov	ds, ax
		assume ds:nothing
		mov	es, ax
		mov	si, 1A91h
		mov	di, 1A90h
		mov	cx, 4Fh
		rep	movsb
		...
		retn
doScroll	endp

The problem here is rep movsb: the 8088 CPU executes this literally. So it performs 79 movsb instructions in a loop, moving 1 byte at a time.

However, the 8088 is a 16-bit CPU (it is basically an 8086 on an 8-bit data bus), so it also has the movsw instruction, which moves a word (2 bytes) at a time. If we use that, we only need half the iterations, and this cuts some of the extra overhead, so we save some valuable cycles. Why did they choose rep movsb? Not sure. Perhaps they assumed there would be no performance difference, since the 8088 has an 8-bit data bus (they also used rep movsb to copy the logo to videoram)… Or they used it because 79 is an odd number. But that is easy enough to solve: Just move 78 bytes with 39 movsw instructions, and then add an extra movsb for the 79th byte:

doScroll	proc near		; CODE XREF: start_0+A0p
		mov	ax, 0B800h
		mov	ds, ax
		assume ds:nothing
		mov	es, ax
		mov	si, 1A91h
		mov	di, 1A90h
		mov	cx, 27h
		rep	movsw
		movsb
		...
		retn
doScroll	endp
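
Because source and destination overlap by all but one byte, it is worth convincing yourself that copying by words gives the same result as copying by bytes. The C sketch below models both variants on a plain byte array; the function names are invented, and the offsets are relative to the start of the row rather than the actual 1A90h/1A91h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Model of `rep movsb` with the direction flag clear. */
static void copy_bytes_forward(uint8_t *mem, size_t di, size_t si,
                               size_t count)
{
    assert(mem != NULL && si >= di);  /* forward copy needs source ahead */
    while (count--)
        mem[di++] = mem[si++];
}

/* Model of `rep movsw` followed by a single `movsb` for an odd count. */
static void copy_words_then_byte(uint8_t *mem, size_t di, size_t si,
                                 size_t count)
{
    size_t words = count / 2;

    assert(mem != NULL && si >= di);
    while (words--) {
        uint8_t lo = mem[si];      /* movsw reads the whole word first... */
        uint8_t hi = mem[si + 1];
        mem[di]     = lo;          /* ...and then writes it */
        mem[di + 1] = hi;
        si += 2;
        di += 2;
    }
    if (count & 1)
        mem[di] = mem[si];         /* the trailing movsb */
}
```

For an 80-byte row scrolled left by one byte (`di` = 0, `si` = 1, `count` = 79), both functions leave the array in the same state, since each word is read completely before any byte of it is overwritten.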

This bought me just enough time to keep the scroller ahead of the beam (an alternative solution would be to move the scroller position down a few scanlines, to buy some extra time. Or, the area of rasterbars could be reduced by 1 or 2 scanlines. But luckily that was not required, so the improved version looks exactly like the original).

Open Sores

I have done some other small tweaks and modifications to the source code. I have restructured some code, to avoid some jmps over data and such. I have also removed some code and data which was duplicated or unused. So the improved version is not only faster, it is also some 400 bytes smaller than the original. If you are interested, you can download the original, the improved version, and my reverse-engineered/semi-commented source code here.

Posted in Oldskool/retro programming, Software development

No love for some OSes?

As a developer, I always like to stay up-to-date with OS technology. Firstly, to learn about new features and technologies, which may or may not be useful for my products. And secondly, for making sure my software is compatible with these new variations (as I argued before). As a result, I was one of the relatively few people who used Windows XP x64. And I also was one of the relatively few people who didn’t skip Vista.

Especially with Windows XP x64 I noticed there just wasn’t a lot of interest for the OS from most developers. Which in itself is a tad strange, since it was the first desktop-version of Windows that was 64-bit capable. Not many companies bothered to port their software to 64-bit at an early stage. And in later years, although there were 64-bit versions of some applications, they would not always support XP x64. Even though XP itself was still supported by the same application. And XP and XP x64 were still being supported by Microsoft.

The same with Vista: even though it is still under extended support by Microsoft, we see that some applications simply don’t bother to support it at all. I think a good example is Adobe Reader. For some reason, version XI is available for XP and Windows 7/8/8.1, but Vista is stuck at version X. However, if you manually download and install Adobe Reader XI for Windows 7, it will install and work on Vista.

Another example is Google Chrome: there is a 64-bit version now, but it will only install on Windows 7 or higher (aside from the fact that they install it in “Program Files (x86)”, but that’s another story). So, I have tried to manually get it running on a Vista x64 system, and I found that it worked just fine. Only the auto-update doesn’t work properly, because it will update back to the 32-bit version. I then tried it on a Windows XP x64 system, and somewhat to my surprise it even worked there (I was not entirely sure whether the GDI fallback path was present in the 64-bit version, since it is only for Windows 7 and higher, and the default renderer uses Direct2D/DirectWrite, which XP x64 does not support).

So, while I understand that these applications were not tested properly under these OSes, and there may be some minor issues… It is perfectly possible to get them working 100%, and I think it is a bit sad that these OSes just don’t appear to get any attention from developers. Some OSes just don’t get a lot of love, sadly.

Posted in Software development, Software news

Direct3D 11.3 (and nVidia’s Maxwell Mark 2)

A few days ago, nVidia introduced the new GTX970 and GTX980, based on the Maxwell architecture. A bit of a surprise however, is that this is a ‘Mark 2’ version of the Maxwell architecture, which has some new features that the original Maxwell in the GTX750 series did not have. Related to that surprise was another surprise, namely that these features will also be supported in Direct3D 11.3, not just Direct3D 12. It appears that Microsoft wants to support Direct3D 11.3 and Direct3D 12 side-by-side, at least for the short term.

The new features include conservative rasterization, which was already mentioned earlier, with the presentation of Direct3D 12. Also, rasterizer ordered views, which were also mentioned before, which make efficient order-independent translucency possible, for example. Typed UAV loads are also new, which have not been mentioned before as far as I know. And finally, tiled resources are now expanded from 2d to 3d volume textures. These new features (along with nVidia’s new multiple-projection acceleration) allow for more efficient voxel-based rendering, especially interesting for efficient global illumination. This is something that nVidia is promoting heavily with their new cards, something they call Voxel accelerated Global Illumination (VXGI). It will be interesting to see how this will be implemented in actual games. It could be another big step up in realistic lighting.

At any rate, this shows once again that the claims made by the pro-Mantle crowd are certainly not true: we most definitely have not reached the end of development as far as rendering methods are concerned. APIs and GPUs are still being updated to support new ways of rasterizing, as I already said before. As I said: if AMD does not see it that way, that could just mean a lack of vision on their behalf. nVidia has once again shown that they are committed to pushing graphics technology ever further, by once again being the first to introduce new features.
We will have to see when and how AMD will respond. When will they introduce their next architecture? And will it include features such as conservative rasterizing (volume tiled resources should already be supported by GCN, although for some reason D3D11.2 only received support for 2d textures)? Even more so: can AMD also improve the efficiency as much as nVidia has? Because another interesting feature of Maxwell is that despite the larger transistor count, and still being 28 nm, it is significantly less power-hungry than Kepler.

Posted in Direct3D, Hardware news, OpenCL, OpenGL, Software development

Intel disables TSX in Haswell

I was going to do a blog about this earlier, but the timing was rather unfortunate, because I had just published another blog. Then it slipped my mind, until the news of the new Haswell EP-based Xeon CPUs that is. So here it is after all (so people can stop saying I only publish things about AMD’s bugs).

So, a while ago, Intel published an erratum relating to the new TSX instructions in Haswell. As you might recall, TSX was one of the most interesting things about Haswell in my opinion. Apparently there is a bug in the implementation, which causes unpredictable behaviour in some cases. Sounds like some kind of race condition. Since there apparently is no way to fix this in microcode, TSX will just be disabled by default. The upcoming Haswell-EX CPU should have a fixed TSX implementation, at which time it can be enabled again.

As for the new Xeons… well, I don’t think I’ll do a writeup on them. There are some interesting things, such as how the cache is organized in ‘clusters’, and how turbo mode is now so effective that even the 18-core model can perform very well in single/low-threaded workloads, making it the best of both worlds. But, all that is already explained in great detail in the various reviews, so I suggest those for further reading.

Posted in Hardware news

1991 donut – final release

No-XS has composed a special EdLib track for me to use in the 1991 donut, so I could finish it properly and release the final version:

There are a number of small changes and tweaks, I will just quote the nfo file:

The final version differs from the party version in a number of ways:
– AdLib music by No-XS
– Improved logo
– Moving lightsource and a tiny bit of ambient light
– Reduced precalc time
– Code is a lot faster (should no longer crash on very slow machines, not even a 4.77 MHz 8088)
– Instead of separate binaries for 286 and 8088, there is now one binary for all
– Specify ‘low’ on the commandline to switch to the low-poly donut for very slow machines

If you are interested in reading more about this intro, I will place the link to the two relevant blogposts here:
Just keeping it real… like it’s 1991
Just keeping it real… bugfixing like it’s 1991

Posted in Oldskool/retro programming, Software development