BHM File Format release 0.3b

I have made a new release of the BHM File Format project.

In release 0.3b there have been some minor code refactorings and bugfixes, but the most important new thing is the BHM Visualizer tool:


This C# application (which should be Mono-compatible for multiplatform support) allows you to inspect the contents of a BHM file in a user-friendly graphical interface. It is a great help for debugging BHM import and export routines.

The preview tab is a special case. It dynamically tries to load a .NET assembly by the name of ‘BHMViewer.dll’, and expects this assembly to contain an implementation of the IBHMViewer interface. This interface allows the BHM Visualizer to pass a Stream with BHM content to the plugin.

The plugin also receives the Control handle of the preview panel, and its Update() method is called periodically. This allows the plugin to implement any kind of visualization of the BHM data.

As an example, I have taken the BHM3DSample OpenGL code, and wrapped the IBHMViewer interface around it. This way you can view the BHM files created by the 3dsmax exporter directly in the BHM Visualizer:


As usual, all code is included under the BSD license. So feel free to use, extend and modify these tools in any way you like.


Latch onto this, it’s all relative

Right, a somewhat cryptic title perhaps, but don’t worry. It’s just the usual 8088-retroprogramming talk again. I want to talk about how some values in PC hardware are latched, and how you can use that to your advantage.

Latched values, in this context, are values that are ‘buffered’ in an internal register: they do not become active right away, but only after a certain event occurs.
As you might recall from my discussion of 8088 MPH’s sprite part, I exploited the fact that the start offset register is latched for the scrolling of the background image. You can write a new start offset to the CRTC at any time, and it won’t become active until the frame is finished.

This has both downsides and upsides. The downside is that you can’t change the start offset anywhere on the screen, to do C64/Amiga-like bitmap stretching effects and such (although as you may know, reenigne found a way around that, which is how he pulled off the Kefrens bars in 8088 MPH, among other things). The upside is that you don’t have to explicitly write the value during the vertical blank interval when you want to perform smooth scrolling or page flipping. Which is what I did in the sprite part: I only had to synchronize the drawing of the sprite to avoid flicker, and I could fire off the new scroll offset immediately.
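For the sake of illustration, updating the scroll position then looks roughly like this (just a sketch, assuming the outp()-style byte port I/O helpers of old DOS compilers):

    // Write a new start offset to the CGA CRTC (index port 3D4h, data port 3D5h).
    // The 6845 latches the value and only applies it when the current frame is
    // finished, so this can safely be called at any point during the frame.
    void set_start_offset(unsigned int offset)   // offset in character cells (words)
    {
        outp(0x3D4, 0x0C);                  // CRTC register 12: start address high
        outp(0x3D5, (offset >> 8) & 0xFF);
        outp(0x3D4, 0x0D);                  // CRTC register 13: start address low
        outp(0x3D5, offset & 0xFF);
    }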

Another interesting latched register is in the 8253 Programmable Interval Timer. As you know, we don’t have any raster interrupt on the IBM PC. And as you know, we’ve tried to work around that by exploiting the fact that the timer and the CGA card are running off the same base clock signal, and then carefully setting up a timer interrupt at 60 Hz (19912 ticks), synchronized to vsync.

In the sprite part, I just left the timer running. The music was played in the interrupt handler. The sprite routine just had to poll the counter values to get an idea of the beam position on screen. In the final version of 8088 MPH, reenigne used a slightly more interesting trick to adjust the brightness of the title screen as it rolled down, because the polling wasn’t quite accurate enough.

The trick exploits the fact that when the 8253 is in ‘rate generator’ or ‘square wave’ mode, if you write a new ‘initial count’ value to the 8253, without sending it a command first, it will latch this value, but continue counting down the old value. When the value reaches 0, it will use the new ‘initial count’ that was latched earlier.
The advantage here is that there are no cycles spent on changing the rate of the timer, which means there is no jitter, and the results are completely predictable. You can change the timer value any number of times during a frame, and you know that as long as the counts all add up to 19912 ticks, you will remain in sync with the screen.
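In code, queuing up the next interval is as simple as this (again a sketch with outp()-style port I/O, assuming channel 0 was left in the usual low byte/high byte access mode):

    // Queue the *next* interval for PIT channel 0 (port 40h). Because no mode
    // command is written to port 43h first, the 8253 only latches this count:
    // it keeps counting down the current interval and loads the new value the
    // moment the current count reaches 0.
    void queue_next_interval(unsigned int ticks)
    {
        outp(0x40, ticks & 0xFF);           // low byte of the new initial count
        outp(0x40, (ticks >> 8) & 0xFF);    // high byte of the new initial count
    }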

Here is a demonstration of this trick:

What you see here is that the timer is fired a number of times per screen, changing the background colour every time, to paint some raster bars. The timing of the start of the red bar is modified every frame, adding 76 ticks (one scanline), shifting it downward, which makes the cyan bar grow larger. To compensate, the timing for the purple bar is modified as well, subtracting 76 ticks.
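To sketch the bookkeeping behind those bars (hypothetical tick values, reusing queue_next_interval() from above, plus the ‘interrupt’ keyword of old DOS compilers), the handler just walks a small table; the only hard requirement is that the entries add up to 19912, and the animation simply moves 76 ticks from one entry to another each frame:

    // Hypothetical per-frame schedule: five intervals that must always sum to
    // 19912 ticks (one 60 Hz CGA frame), so the bars stay locked to the screen.
    unsigned int counts[5]   = { 5000, 4000, 3000, 4000, 3912 };
    unsigned char colours[5] = { 0, 3, 4, 5, 0 };   // black, cyan, red, purple, black
    int state = 0;

    void interrupt timer_handler()      // 'interrupt' keyword as in old DOS compilers
    {
        outp(0x3D9, colours[state]);    // CGA colour-select register: background/border colour
        state = (state + 1) % 5;
        queue_next_interval(counts[state]);   // latched now, starts counting at the next reload
        outp(0x20, 0x20);               // EOI to the 8259 (not needed with auto-EOI, see below)
    }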

The result is that all the bars remain perfectly stable and in sync with the screen. If you reinitialized the timer every time, you would get some drift, because it’s difficult to predict the number of cycles it takes to send the new commands to the timer, and to compensate for that.
In 1991 donut, I was not aware of the latching trick yet, so I reinitialized the timer every time. As a result, there is a tiny bit of jitter depending on how fast your system is. This is mostly hidden by black scanlines in the area where the jitter would occur, but on very slow systems, you may see that the palette changes on the scroller are a scanline off, or more.
Because 1991 donut is aimed at VGA systems, I would have to reinitialize the timer at the end of every frame anyway, because I have to re-sync it to the VGA card, which runs on its own clock. But with the latch trick, I could at least have made the palette changes independent of CPU-speed (and there are various other tricks I’ve picked up since then, which could speed up 1991 donut some more).

Always think one step ahead

The catch with this trick is that you always have to be able to think one step ahead: the value will not become active until the timer reaches 0. So you can’t change the current interval, only the next one. In most cases that should not be a problem though, such as with drawing sprites, raster bars and such tricks.

The actual inspiration for this article was not anything graphics-related, but rather analog joysticks, which may be another application of this trick. Namely, to determine the position on a joystick axis, you have to initialize the joystick status register to 1, and then poll it to see how long it takes until it turns into a 0. This ties up the CPU completely.
So, my idea was to set up a system where you’d use a few timer interrupts to poll the joystick during each frame. If you don’t need a lot of accuracy (e.g. only ‘digital’ movement), you’d only have to fire it a few times: theoretically 3 would be enough to get left-middle-right or top-middle-bottom readings, although in practice you’d probably want to fire a few more. Still, it would probably cost less CPU time than polling.
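A rough sketch of both approaches (the game port lives at 201h; writing any value restarts the one-shot timers, and each axis bit then reads 1 until its RC timer expires):

    // Classic busy-wait read of joystick axis 0 (bit 0 of port 201h): ties up the CPU.
    unsigned int read_axis0_polling()
    {
        unsigned int count = 0;
        outp(0x201, 0);                     // any write starts the one-shot timers
        while (inp(0x201) & 0x01)           // bit 0 stays 1 until the axis times out
            count++;
        return count;                       // roughly proportional to the stick position
    }

    // Timer-driven alternative (the idea described above): restart the one-shot
    // once per frame with a single write to port 201h, then let a few latched
    // timer interrupts each just sample the bit instead of busy-waiting.
    volatile int samples_high = 0;

    void sample_axis0()                     // call this from a handful of timer interrupts
    {
        if (inp(0x201) & 0x01)
            samples_high++;                 // still counting: a larger axis value
    }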

Another application that may be interesting is related to music/sound effects. Instead of having just a single rate, you can multiplex multiple frequencies this way, by modifying the counter value at each interrupt, and keeping track of how often it has fired to determine which state you are currently in (more or less like the coloured bars representing different ‘states’ in the above example).

Bonus trick

To finish off for today, I also want to share another timer-related trick. This trick came to us because someone showed interest in 8088 MPH, and had some suggestions for improvements. Now this is exactly the kind of thing we had been hoping for! We wanted to inspire other people to also do new and exciting things on this platform, and push the platform further and further. Hopefully we will be seeing more PC demos (as in original IBM PC specs, so 8088+CGA+PC speaker).

This particular trick is the automatic end-of-interrupt functionality in the 8259 Programmable Interrupt Controller. This functionality is not enabled by default on the PC, which means that you have to manually send an EOI command to the 8259 in your interrupt handler every time. If you reinitialize the 8259 and enable the auto-EOI bit in ICW4, you no longer have to do this, which saves a few instructions and cycles every time. Interrupts on the 8088 are quite expensive, so we can use all the help we can get. The above rasterbar example is actually using this trick.
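For reference, the reinitialization looks something like this on a PC/XT-class machine with its single 8259 (command port 20h, data port 21h); the ICW values here follow the standard XT BIOS setup with only the AEOI bit in ICW4 changed, but do verify them against your own machine:

    // Reprogram the 8259 with auto-EOI enabled, so interrupt handlers no longer
    // have to write an EOI command (out 20h, 20h) themselves.
    void enable_auto_eoi()
    {
        // Disable interrupts around this sequence. Note that initialization also
        // clears the interrupt mask, so rewrite OCW1 (port 21h) afterwards if you
        // want some IRQs to stay masked.
        outp(0x20, 0x13);   // ICW1: edge triggered, single 8259, ICW4 needed
        outp(0x21, 0x08);   // ICW2: IRQ0 maps to interrupt vector 08h
        outp(0x21, 0x0B);   // ICW4: 8086 mode + buffered mode + auto-EOI (bit 1)
    }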


DirectX 12 is out, let’s review

Well, DirectX 12 has launched, together with a new version of Windows (Windows 10) and an improved driver system (WDDM 2.0).

Because, if you remember, we also had:

  • Windows Vista + DirectX 10 + WDDM 1.0
  • Windows 7 + DirectX 11 + WDDM 1.1
  • Windows 8 + DirectX 11.1 + WDDM 1.2
  • Windows 8.1 + DirectX 11.2 + WDDM 1.3

So, this has more or less been the pattern since 2006. Now, if we revisit AMD and Mantle, their claims were that introducing Mantle triggered Microsoft to work on DirectX 12. But, since DX12 was released together with Windows 10 and WDDM 2.0, AMD is effectively claiming that they triggered Microsoft to rush Windows 10.

Also, since WDDM 2.0 is quite a big overhaul of the driver system, and both Windows 10 and DirectX 12 are built around this new driver system, Microsoft would have had to work out WDDM 2.0 first, before they could design the DirectX 12 runtime on top of it.

An interesting tweet in this context is by Microsoft’s Phil Spencer:

We knew what DX12 was doing when we built Xbox One.

This clearly implies that the ideas behind DX12 existed prior to the development of the Xbox One, and probably also the idea of using a single OS and API on all devices, from phones to desktops to game consoles (we already saw this trend with Windows 8.x/RT/Phone).

Perhaps the real story here is that Sony rushed Microsoft with the PS4, so Microsoft had to develop Direct3D 11.x as an ‘in-between’ API until Windows 10 and DirectX 12 were ready, because they couldn’t afford to delay the Xbox One that long. I guess we will never know. We do see signs of this ‘one Windows’ approach in early news about the Xbox One (then still known as ‘Xbox 720’), in articles going back as far as 2012 though. And we know that Direct3D 11 was used on Windows Phone as-is, which makes Windows 8 already mostly a ‘single OS on every platform’, with just the Xbox One being the exception (the Xbox actually does run a stripped-down version of Windows 8 for apps and general UI, but not for games).

Now, if we move from the software to the hardware, there are some other interesting peculiarities. As I mentioned earlier, AMD does not have support for feature level 12_1 in their latest GPUs, which were launched as their official DirectX 12 line. I think even more telling is the fact that they do not support HDMI 2.0 either, and all their ‘new’ GPUs are effectively rehashes of the GCN1.1 or GCN1.2 architecture. The only new feature is HBM support; there is nothing new in the GPU architecture itself.

I get this nasty feeling that after 5 years of AMD downplaying tessellation in DX11, we are now in for a few years of AMD downplaying conservative raster and other 12_1 features.

It seems that the AMD camp has already started with the anti-12_1 offensive. I have already read people claiming that “DX12.1” as nVidia advertises it, is not an official standard:

I’ve yet to find anything about DX12.1 that isn’t from nVidia, so it’s either an nVidia-specific extension to DX12 (e.g. like DX9a) or it’s a minor addendum (e.g. like DX10.1). Either way it appears the GeForce 900 series are the only thing that support it, and if that’s the case, it’s unlikely to be very important in the long run as obtuse/narrowly supported features tend to be passed over (e.g. like DX9a or 10.1, or other things like TerraScale or Ultra Shadow). Of course history may prove this assumption wrong, but that’s my guess. The link above includes slides from an nVidia PR presentation that shows a few features for DX12 and 12.1; perhaps others can find more about this.

Or this little gem on Reddit:

the guy in that blog surely show an anti amd bias. I would not dig too deep into his comments.

He spent a whole blog post complaining that amd does not have conservative rasterization to a standard that has not been released….

Erm, yes… AMD released new videocards less than a month before the official release of DirectX 12, so of course they are going to come up with ANOTHER new line of videocards right now, with 12_1 support? Well no, we’ll be stuck with these GPUs for quite some time. If AMD had GPUs with 12_1 support just around the corner, they wouldn’t have put all that effort into releasing the 300/Fury line now. A new architecture is likely still more than a year away. AMD’s roadmap does not say much, other than HBM2 support in 2016.

The plot thickens, as it seems that Intel’s Skylake GPU actually does support 12_1 as well, and in fact, supports even higher tiers than nVidia does (Intel does not advertise with ‘DX12.1’ as nVidia does, but they do advertise with ‘DX11.3’, which as we know is a special update including the new DX12_1 features, as mentioned earlier with the introduction of Maxwell v2). Clearly AMD has dropped the ball with DX12. It looks like they simply do not have the resources to develop new GPUs in time for new APIs anymore. Which might explain the Mantle-offensive. AMD knew they couldn’t deliver new hardware in time. But they could deliver a ‘DX12-lite’ API before Microsoft was ready with the real DX12.

Lower CPU overhead, for whom?

The main point of Mantle was supposed to be lower CPU overhead. But is that all that relevant for the desktop? It doesn’t seem that way. Mantle didn’t exactly revolutionize gaming. What about consoles then? Well no, consoles always had special APIs anyway, with low-level access, so they wouldn’t really need Mantle or DX12 either.
But, there are other devices out there, with GPUs and even slower CPUs: mobile devices.
That might have been Microsoft’s main goal: phones and tablets, getting higher performance and better battery life out of these devices. This is also what we see with Apple’s Metal: they launched it primarily as a new API for iOS. That is not just a coincidence.

Mantle is DX12?

And what of these claims that Microsoft would have copied Mantle? There was even some claim that the documentation was mostly the same, with screenshots of alleged documentation of both, and alleged similarities. Now that the final documentation is out, it is clear that the two are not all that similar at the API level: DX12 uses a lightweight COM approach, where Mantle is a flat procedural API. But most importantly, DX12 has some fundamental differences from Mantle, such as the distinction between bundles and command lists; Mantle only has ‘command buffers’. Again, it looks like Mantle is just a simplified version, rather than Microsoft cloning Mantle.
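To make that distinction a bit more concrete, here is a minimal D3D12 sketch (error handling omitted): a bundle is just a command list created with a different type, recorded once and then replayed from inside direct command lists with ExecuteBundle().

    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // 'device' is assumed to be an already-created ID3D12Device.
    void create_lists(ID3D12Device* device,
                      ComPtr<ID3D12GraphicsCommandList>& directList,
                      ComPtr<ID3D12GraphicsCommandList>& bundle)
    {
        ComPtr<ID3D12CommandAllocator> directAlloc, bundleAlloc;
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_DIRECT,
                                       IID_PPV_ARGS(&directAlloc));
        device->CreateCommandAllocator(D3D12_COMMAND_LIST_TYPE_BUNDLE,
                                       IID_PPV_ARGS(&bundleAlloc));

        // A 'normal' command list: recorded and submitted to a queue every frame.
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_DIRECT,
                                  directAlloc.Get(), nullptr, IID_PPV_ARGS(&directList));

        // A bundle: same interface, but recorded once and replayed many times via
        // directList->ExecuteBundle(bundle.Get()).
        device->CreateCommandList(0, D3D12_COMMAND_LIST_TYPE_BUNDLE,
                                  bundleAlloc.Get(), nullptr, IID_PPV_ARGS(&bundle));
    }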

Time for some naming and shaming?

Well, we all know Richard Huddy by now, I suppose. He’s made many sorts of claims about Mantle and DX11, changing his story over time. But what about some of the other people involved in this Mantle marketing scheme? I get this very bad taste in my mouth with all these ‘developers’ involved with AMD-related promotions.

First there is Johan ‘repi’ Andersson (Dice). I wonder if he really believed the whole Mantle thing, even including the early claims that there would be no DX12 at all. He sure played along with the whole charade. I wonder how he feels now that AMD has pulled the plug on Mantle after little more than a year, with only a handful of games supporting Mantle at all, some of them not even faster than DX11. It appears he has lived in an AMD vacuum as well, with claims such as that DX11 multithreading was broken.

What he really meant to say was that AMD’s implementation of DX11 multithreading was broken, which you can see in Futuremark’s API Overhead test.

As you can see, there is virtually no scaling on AMD hardware. nVidia however gets quite reasonable scaling out of DX11 multithreading. Sure, Mantle and DX12 are better, but nevertheless, DX11 multithreading is not completely broken. The problem is in AMD’s implementation: AMD’s drivers cannot prepare a native command buffer beforehand. So the command queue for each thread is saved, and by the time the actual draw command is issued on the main thread, AMD’s driver needs to patch up the native command buffer with the current state. This effectively makes it a single-threaded implementation. As nVidia shows, this is not a DX11 limitation: it *is* possible to make DX11 multithreading scale (and in fact, even single-threaded DX11 scales somewhat on CPUs with more cores, so it seems that nVidia also does some multithreading of their own at a lower level).
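For reference, ‘DX11 multithreading’ here means deferred contexts: worker threads record their draw calls on an ID3D11DeviceContext of their own, and the resulting command lists are played back on the immediate context. A minimal sketch (error handling omitted):

    #include <d3d11.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // Worker thread: record state changes and draw calls on a deferred context.
    ComPtr<ID3D11CommandList> record_work(ID3D11Device* device)
    {
        ComPtr<ID3D11DeviceContext> deferred;
        device->CreateDeferredContext(0, &deferred);

        // ... issue state changes and Draw() calls on 'deferred' here ...

        ComPtr<ID3D11CommandList> cmdList;
        deferred->FinishCommandList(FALSE, &cmdList);   // close the recorded list
        return cmdList;
    }

    // Main thread: play the recorded lists back on the immediate context.
    void submit(ID3D11DeviceContext* immediate, ID3D11CommandList* cmdList)
    {
        immediate->ExecuteCommandList(cmdList, FALSE);
    }

The scaling question is whether the driver can already translate the recorded list into its native command buffer format inside FinishCommandList() on the worker thread, or whether it has to postpone that work to ExecuteCommandList() on the main thread.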

Then I ran into another developer, named Jasper ‘PrisonerOfPain’ Bekkers (Dice). He was active on some forums, and was doing some promotion of Mantle there as well, by making claims about DX11 that were simply not true, claiming that DX11 could not do certain things. When I pointed out the DX11 features that do the things he claimed were not possible, he changed his story somewhat, down to claims that Mantle would be able to do the same more efficiently, in theory. Which is something I never denied, as you know. I merely said that the gains would not be of revolutionary proportions. Which we now know is true.

And a third Mantle developer I ran into on some forums was Sylvester ‘.oisyn’ Hesp (Nixxes). He also made various claims about Mantle, DX11 and DX12, none of which held up in the end, as more became known about DX12 and the future of Mantle. He also made some very dubious claims, which make me wonder how well he even understands the hardware in the first place (I suppose us oldskool coders have a slightly different idea of what ‘understanding the hardware’ really means than the newer generation). He literally claimed that an API design such as DX12 could even have been used back in the DX8 era. Firstly, such a claim is quite preposterous, because you’re basically saying that Microsoft and the IHVs involved with the development of DX have been completely clueless for all these years, and with DX12 they suddenly ‘saw the light’… Secondly, it demonstrates a clear lack of understanding of what problem DX12 is actually trying to solve.

That problem is about managing resources and pipeline states, in order to reduce CPU-overhead on the API/driver side. In the world of DX8, we had completely different usage patterns of resources and pipeline states. We had much slower GPUs with much less memory, and much more primitive pipelines and programmability. So the problems we faced back then were quite different from those today, and DX12 would probably be less efficient at handling GPUs and workloads of that era than DX8 was.

And there are more developers, or at least, people who pretend to be developers, who have made false claims about AMD and Mantle. Such as the comment from someone calling himself ‘Tom’, on an earlier blog of mine about DirectX 11.3 and nVidia’s Maxwell v2. In that blog I pointed out that there had been no indication of current or future AMD hardware being capable of these new features. ‘Tom’ made the claim that conservative rasterization and raster ordered views would be possible on existing GCN hardware through Mantle.
Well, DirectX 12 is out now, and apparently AMD could not make it work in their drivers, because they do not expose this functionality.

Or Angelo Pesce with his ‘C0DE517E’ blog, whom I covered in an earlier post. Well, on the desktop, GCN has not been very relevant at all since the introduction of Maxwell. AMD has been losing market share like mad, is at an all-time low currently, and dropping fast:

And don’t get me started on Oxide… First they had their Star Swarm benchmark, which was made only to promote Mantle (AMD sponsors them via the Gaming Evolved program), by showing that bad DX11 code is bad. Really, they show DX11 code which runs at single-digit framerates on most systems, while not exactly producing world-class graphics. Why isn’t the first response of most people as sane as: “But wait, we’ve seen tons of games doing similar stuff in DX11 or even older APIs, running much faster than this. You must be doing it wrong!”?

But here Oxide is again, in the news… This time they have another ‘benchmark’ (do these guys ever make any actual games?), namely “Ashes of the Singularity”.
And, surprise surprise, again it performs like a dog on nVidia hardware. Again, in a way that doesn’t make sense at all… The figures show it is actually *slower* in DX12 than in DX11. But somehow this is spun into a DX12 hardware deficiency on nVidia’s side. Now, if the game can get a certain level of performance in DX11, clearly that is the baseline of performance that you should also get in DX12, because that is simply what the hardware is capable of, using only DX11-level features. Using the newer API, and optionally using new features should only make things faster, never slower. That’s just common sense.

Now, Oxide actually goes as far as claiming that nVidia does not actually support asynchronous shaders. Oh really? Well, I’m quite sure that there is hardware in Maxwell v2 to handle this (nVidia has had asynchronous shader support in CUDA for years, via a technology they call HyperQ, long before AMD had any such thing. The only change in DX12 is that a graphics shader should be able to run in parallel with the compute shaders. That is not something that would be very difficult to add to nVidia’s existing architecture, so it is quite implausible that nVidia didn’t do this properly, or even ‘forgot’ about it). This is what nVidia’s drivers report to the DX12 API, and it is also well-documented in the various hardware reviews on the web.
It is unlikely for nVidia to expose functionality to DX12 applications if it is only going to make performance worse. That just doesn’t make any sense.
There’s now a lot of speculation out there on the web, by fanboys/’developers’, trying to spin whatever information they can find into an ‘explanation’ of why nVidia allegedly would be lying about their asynchronous shaders (they’ve been hacking away at Ryan Smith’s article on AnandTech for ages now, claiming it has false info). The bottom line is: nVidia’s architecture is not the same as AMD’s. You can’t just compare things such as ‘engine’ and ‘queue’ without taking into account that they mean vastly different things depending on which architecture you’re talking about (it’s similar to AMD’s poorly scaling tessellation implementation: just because it doesn’t scale well doesn’t mean it’s ‘fake’ or whatever. It’s just a different architecture, which cannot handle certain workloads as well as nVidia’s).
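For context, what ‘asynchronous shaders’ (async compute) means at the API level is simply that an application can create a compute queue next to its direct (graphics) queue and feed work to both; whether and how the GPU actually overlaps that work is entirely up to the hardware and driver. A minimal sketch:

    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    // Create a graphics queue plus a separate compute queue on the same device.
    // The API only exposes the queues; how the workloads get scheduled and
    // overlapped differs per architecture.
    void create_queues(ID3D12Device* device,
                       ComPtr<ID3D12CommandQueue>& graphicsQueue,
                       ComPtr<ID3D12CommandQueue>& computeQueue)
    {
        D3D12_COMMAND_QUEUE_DESC desc = {};
        desc.Type = D3D12_COMMAND_LIST_TYPE_DIRECT;      // graphics + compute + copy
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&graphicsQueue));

        desc.Type = D3D12_COMMAND_LIST_TYPE_COMPUTE;     // compute + copy only
        device->CreateCommandQueue(&desc, IID_PPV_ARGS(&computeQueue));
    }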

What Oxide is doing is probably the same thing as they did with Star Swarm: they feed it a workload that they KNOW will choke on a specific driver/GPU (in the case of Star Swarm they sent extremely long command lists to DX11. This mostly taxed the memory management in the driver, which was never designed to handle lists of that size. nVidia fixed up their drivers to deal with it though. It was never really an API issue; they just sent a workload that was completely unrepresentative of any realistic game workload). Again a case of bad code being bad. When you optimize a rendering pipeline for an actual game, you will look for a way to get the BEST performance from the hardware, not the worst. So worst case you don’t use asynchronous shaders, and you should get DX11-level performance as a minimum (there is no way to explicitly use asynchronous shaders in DX11). Best case you use a finely tuned workload to make use of new features such as asynchronous shaders to boost performance.

It sounds like Oxide is just quite clueless in general, and that isn’t the first time. Remember this?

With relatively little effort by developers, upcoming Xbox One games, PC Games and Windows Phone games will see a doubling in graphics performance.

Suddenly, that Xbox One game that struggled at 720p will be able to reach fantastic performance at 1080p. For developers, this is a game changer.

The results are spectacular. Not just in theory but in practice (full disclosure: I am involved with the Star Swarm demo which makes use of this kind of technology.)

Microsoft never claimed any performance benefits for DX12 on Xbox at all, and pointed out that DX11.x on the Xbox One already gave you these performance benefits over regular DX11. Even so, DX12 gives you performance benefits on the CPU-side, while making the Xbox One go from 720p to 1080p would require more fillrate on the GPU-side. Not something any API can deliver (if the Xbox One was CPU-limited, then you could just bump up the resolution to 1080p for free in the first place). Oxide has a pretty poor track record here, spreading dubious benchmarks, and outright wrong information.

What is interesting though, is that AMD’s Robert Hallock has FINALLY admitted that DirectX 12 is not just Microsoft stealing Mantle, but Microsoft’s own creation:

DX12 it’s Microsoft’s own creation, but we’re hugely enthusiastic supporters of any low-overhead API. :)

Glad we got that settled.
So basically, not a lot of what we heard about AMD and Mantle turned out to be true. As I have been saying all along. Welcome to the era of Windows 10 and DirectX 12. These are going to be interesting times for game engines and rendering technology!

Edit: There have been some updates on the async compute shader story between nVidia, AMD and Oxide. See ExtremeTech’s coverage for the details. The short story is exactly as I said above: nVidia’s and AMD’s approach cannot be compared directly. nVidia does indeed support async compute shaders on Maxwell v2, and indeed, there are workloads where nVidia is faster than AMD, and workloads where AMD is faster than nVidia. So Oxide did indeed (deliberately?) pick a workload that runs poorly on nVidia. Their claim that nVidia did not support it at all is a blatant lie. As are claims of “software emulation” that go around.

The short story is that nVidia’s implementation has less granularity than AMD’s, and nVidia also relies on the CPU/driver to handle some of the scheduling work. It looks like nVidia is still working on optimizing this part of the driver, so we may see improvements in async shader performance with future drivers.

So as usual, you read the truth here first :)


Fixing Genesis Project’s GP-01

Shortly before our release of 8088 MPH at Revision 2015, another 8088+CGA production surfaced at Gubbdata 2015:

This production was GP-01, by Genesis Project. On the one hand we were happy to see more people targeting the 8088+CGA platform. On the other hand, we tested it on our real machines, and it didn’t actually work: it’s a ‘DOSBox demo’.
Here is a quick-and-dirty video of what happens on a real IBM 5160 with real IBM CGA (this is my machine, the one used for capturing 8088 MPH at the party):

Because it was so close to Revision 2015, and we did not want to draw any negative attention prior to the release of 8088 MPH, we did not report this at the time.

However, once Revision was over, and the attention around 8088 MPH had died down somewhat, I decided to look into this demo some more, and see what the problems were, and if I could fix them.
In short, we have the following issues:

  1. Music doesn’t play, it just generates one constant tone
  2. Screen is wrapping around vertically
  3. Keyboard locks up
  4. CGA snow

So, I reverse-engineered the demo, and inspected each of these issues, and tried to fix them. Let’s go over them one-by-one.


Music

When I looked at the timing routines, it quickly became obvious why they weren’t working: the demo uses int 70h and ports 70h/71h. That is the CMOS real-time clock, which requires a second interrupt controller; neither of these exist on 8088-class machines, as they weren’t added until the later AT standard, introduced with the 286 CPU. So on a PC/XT-class machine, the code simply enabled the speaker (causing it to play at whatever frequency it last used, apparently the bootup beep in this case), and then nothing happened, because the interrupt handler for the timer was never called: that specific timer hardware doesn’t exist.
So I rewrote timing based on the standard timer interrupt, the same way as CGADEMO and 8088 MPH do it.
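In broad strokes that comes down to hooking int 8 and programming PIT channel 0 yourself; a sketch, using the interrupt-vector and port I/O helpers of old DOS compilers (exact names vary per compiler):

    #include <dos.h>                        // getvect()/setvect(), outp()/inp() (compiler-dependent)

    void interrupt (*old_int8)();           // original BIOS timer handler, restore it on exit

    void interrupt timer_handler()
    {
        // ... advance the music player one tick here ...
        outp(0x20, 0x20);                   // EOI to the 8259
    }

    void install_timer(unsigned int ticks)  // e.g. 19912 ticks = one 60 Hz CGA frame
    {
        old_int8 = getvect(0x08);
        setvect(0x08, timer_handler);
        outp(0x43, 0x36);                   // PIT channel 0, low/high byte access, square wave mode
        outp(0x40, ticks & 0xFF);           // low byte of the count
        outp(0x40, (ticks >> 8) & 0xFF);    // high byte of the count
    }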

Screen wrapping

The problem with the screen wrapping was that the demo tried to set up 80×50 and 80×100 tweaked textmodes, but it only initialized a few registers. Apparently that is enough to make DOSBox work, but not enough for actual hardware, where you have to take care to generate a proper 60 Hz image with 262 scanlines in total.
In this case, what happened is that they changed the character height from 8 to 2 lines (which results in all character-based vertical timing being 1/4th of what it was before), but they did not adjust the rest of the CRTC values to make sure the visible area still has 200 lines, followed by 62 lines for border and vertical blank. As a result you only got 50 lines in total, then a few lines of border+vbl, and then the frame started over, effectively making it refresh at 4*60 Hz. Since the monitor still works at 60 Hz, it just displayed 4 of these frames on top of each other. So I fixed up that code as well, to get a proper image on real hardware.
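To give an idea of what a complete mode setup involves, this is a sketch of an 80×100 tweak with illustrative register values (outp()-style port I/O again); the constraint is simply that, at 2 scanlines per character row, everything still has to add up to a 262-line, 60 Hz frame:

    // Tweak BIOS 80x25 colour text mode (mode 3) into 80x100.
    // 200 visible lines + border + vertical blank = 262 scanlines in total.
    void set_80x100(void)
    {
        static const unsigned char crtc[5][2] = {
            { 0x09, 0x01 },   // max scanline address: 2 lines per character row
            { 0x04, 0x7F },   // vertical total: 128 rows = 256 scanlines
            { 0x05, 0x06 },   // vertical total adjust: +6 scanlines -> 262 in total
            { 0x06, 0x64 },   // vertical displayed: 100 rows = 200 visible lines
            { 0x07, 0x70 },   // vsync position: row 112, inside the blanking area
        };
        for (int i = 0; i < 5; i++) {
            outp(0x3D4, crtc[i][0]);
            outp(0x3D5, crtc[i][1]);
        }
    }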


Keyboard lockup

The keyboard routine hooks into int 9 and then checks port 60h of the keyboard controller for a keypress, but it did not disable and re-enable the keyboard controller so that it would generate an interrupt again for the next keypress, which is why it got stuck after one keypress. So I fixed that as well.
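On PC/XT-class hardware that comes down to pulsing bit 7 of port 61h after reading the scancode, which clears the keyboard logic so it can raise IRQ1 again; something along these lines (sketch):

    void interrupt keyboard_handler()       // hooked on int 9
    {
        unsigned char scancode = inp(0x60); // read the scancode from the keyboard port

        unsigned char ctrl = inp(0x61);     // pulse bit 7 of port 61h to acknowledge the
        outp(0x61, ctrl | 0x80);            // keypress, which clears the keyboard's shift
        outp(0x61, ctrl & 0x7F);            // register and re-arms it for the next key

        // ... store 'scancode' somewhere for the main loop to pick up ...

        outp(0x20, 0x20);                   // EOI to the 8259
    }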

So far, that led to this quick-fix version:

Now it was fixed just far enough to notice some new issues:

  1. Music plays low buzzing sound when it should be silent
  2. Logo has weird wraparound issues when scrolling into view
  3. Scroller is not smooth

Let’s fix some more!

Buzzing music

Looking at the music code again, it does not handle silence at all. The code is set up to just play note+duration data, and silence is just a note of 1 Hz. Apparently that isn’t audible in DOSBox, but you get this buzz on real hardware. So I rewrote the code to turn off the speaker when a note value of ‘1’ is encountered (value ‘0’ was already taken for ‘end of music’, making the music loop).
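The fix itself is tiny: the speaker is gated by bits 0 and 1 of port 61h, so ‘silence’ just means clearing those bits until the next real note (sketch):

    void speaker_off(void)
    {
        outp(0x61, inp(0x61) & 0xFC);   // clear bit 0 (PIT channel 2 gate) and bit 1 (speaker data)
    }

    void speaker_on(void)
    {
        outp(0x61, inp(0x61) | 0x03);   // gate the timer back onto the speaker for the next note
    }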

Logo wraparound

This is another thing that apparently happens to work in DOSBox, but not on real hardware: the logo is just being drawn to the screen with no clipping whatsoever. Because real CGA memory wraps around, you see the logo at the top of the screen when it is being drawn below the visible area. Namely, we are using 80×100 textmode, where each character is two bytes: attribute byte and ASCII value byte. So we are using the full 16KB of CGA memory for the onscreen buffer. On real hardware, this physical 16KB is mapped into the B800 segment multiple times, so writing beyond the first 16KB will cause the wraparound. Another thing that was not too hard to fix: just add some clipping to the blit-routine. Since the logo just scrolls vertically, it was simple enough to just check after each scanline to see if the value of the DI register would exceed (160*100).
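In C-like pseudocode the clipping amounts to something like this (a sketch; the real routine is assembly, with DI as the destination offset into the B800 segment):

    #include <dos.h>                        // MK_FP() and 'far' as in old DOS compilers

    // Copy the logo line by line into the 80x100 text buffer (160 bytes per row:
    // 80 attribute/character pairs), skipping rows that fall outside the visible
    // 160*100 bytes so the hardware wraparound never becomes visible.
    void blit_logo(const unsigned char far* src, int dest_row, int rows)
    {
        unsigned char far* screen = (unsigned char far*)MK_FP(0xB800, 0);
        for (int y = 0; y < rows; y++, src += 160) {
            long offset = (long)(dest_row + y) * 160;   // the equivalent of DI in the asm
            if (offset < 0 || offset >= 160L * 100)     // clip instead of wrapping
                continue;
            for (int x = 0; x < 160; x++)
                screen[(unsigned)offset + x] = src[x];
        }
    }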

Unsmooth scroller

The scroller was somewhat weird: there was some delay code to control the speed of scrolling, but it was just a countdown, so it was not trying to sync to the display at all. Again, this was very easy to fix: even in textmode you can just poll the CRTC registers for vsync status. The text was small enough and the screen position was low enough to get perfect 60 Hz scrolling on the frontbuffer with no flicker (in this case the snow is actually somewhat useful: you can see exactly where the CPU accesses video memory, so you see snow in the top part, where it updates the scroller, and no snow from the moment it starts polling for vbl).
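Waiting for vertical retrace on CGA is just polling bit 3 of the status register at port 3DAh (sketch):

    void wait_vbl(void)
    {
        while (inp(0x3DA) & 0x08);      // if we are already in vertical retrace, wait for it to end
        while (!(inp(0x3DA) & 0x08));   // then wait for the next vertical retrace to begin
    }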

This got us as far as this version:

Well, not bad… except for all that snow!


CGA snow

Is there a way around snow? Well yes, in theory, as you have seen in 8088 MPH’s plasma effect. The problem is that it requires you to throttle the writes to video memory, making all effects much slower. I decided not to bother with that, because it would be a lot of trouble to rewrite the effects that way, and the demo would no longer look as intended.
However, you’re in luck: many CGA clones do not suffer from snow. Here is what it looks like on such a clone, an ATi Small Wonder:

So there we are, we are running it on actual hardware, and it looks and sounds pretty much as intended, based on the DOSBox capture.

If you’re interested, you can download the fixed version, including the reverse-engineered assembly listing, here.
I hope there will be more people doing 8088+CGA productions!


nVidia does support D3D_FEATURE_LEVEL_12_0

There has been some FUD going around, in response to the fact that AMD’s latest hardware does not support D3D_FEATURE_LEVEL_12_1. The claim was that D3D_FEATURE_LEVEL_12_0 required some features that only AMD supports, namely resource binding tier 3.

Now that the final version of Windows 10 is released, and with that, the final version of the Windows 10 SDK and proper drivers, I set up my development machine for the ‘real stuff’, instead of the early access stuff I had been playing with up to now. When I ran DXDIAG, something caught my attention right away:


It reports both feature levels 12.1 and 12.0. This is what I originally expected anyway, since DirectX versions/feature levels are always incremental/fully backward-compatible, to avoid complications when developing software.

So, I decided to look it up in the final DirectX 12 documentation. And indeed, they explicitly mention it:

A feature level always includes the functionality of previous or lower feature levels.

Now, if you look at the table below on that page, you see that indeed, 12.0 does not require resource binding tier 3, as was claimed by the AMD camp earlier. It requires tier 2+. Although AMD hardware supports tier 3, this is NOT a requirement for 12.0, and therefore nVidia’s 12.1 hardware also supports 12.0.

Just to be sure, I also did a quick test with actual code, to create a 12_0 device on my GTX970:


And as expected, this code succeeded without a problem, and the application ran just fine.
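For reference, such a test boils down to little more than this (a sketch along the same lines, not the literal code from the screenshot):

    #include <d3d12.h>
    #include <wrl/client.h>
    using Microsoft::WRL::ComPtr;

    bool supports_feature_level_12_0()
    {
        ComPtr<ID3D12Device> device;
        // Passing nullptr selects the default adapter; the call fails if the
        // adapter/driver cannot offer feature level 12_0 or higher.
        HRESULT hr = D3D12CreateDevice(nullptr, D3D_FEATURE_LEVEL_12_0,
                                       IID_PPV_ARGS(&device));
        return SUCCEEDED(hr);
    }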

Well, that is settled then.


8088 MPH: The final version

Although we were very happy with the win at Revision, we were not entirely happy with 8088 MPH as a whole. Because of the time pressure, there were some things we could not finish as well as we wanted. Also, as others tried to run our demo on real hardware, we found that even real IBM CGA cards had problems with some of the CRTC tweaks, since not all IBM cards used the same Motorola MC6845 chip. Some used a Hitachi HD6845 instead, which does not handle an hsync width setting of 0 in the same way (the real Motorola chips interpret a value of 0 as 16; the Amstrad CPC suffers from the same issue, since various 6845 chips were used during its lifespan, introducing incompatibilities).

On the other hand, we also found out that there are indeed clones that are 100% cycle-exact to a real IBM PC. The Philips P3105 machine I have, which I discussed in the CGADEMO blog earlier, is one of these. In that blog I said the system was not cycle-exact, since the CGADEMO showed some problems with the scroller that were not visible on a real IBM. I also said that I suspected the ATi Small Wonder videocard. I was already working on 8088 MPH at the time, and I knew that I had to have a cycle-exact machine. So I bought an IBM 5160. Now that 8088 MPH was done, I had some time to experiment with the hardware. And indeed, it turns out that the Philips machine uses (clones of) the same Intel 82xx chipset as the IBM 5150/5155/5160 (also known as the MCS-85 set, originally developed for the 8085 CPU). Once I swapped the ATi Small Wonder for the real IBM CGA card, the Philips P3105 machine ran 8088 MPH exactly the same as the 5160. Likewise, putting the ATi Small Wonder in the 5160 made it produce the same bugs in CGADEMO and 8088 MPH as the Philips P3105.

So there is hope: there are at least SOME clones that are cycle-exact to a real IBM. The 82xx-chipset is probably a feature to look out for. I have another 8088 machine, a Commodore PC20-III, but it uses a Faraday FE2010 chipset instead, which is not entirely cycle-exact, not even with the real IBM CGA card installed. In fact, it crashes on the endpart. I am not sure if there are any CGA clones that are cycle-exact to a real IBM though. The ATi Small Wonder is not, at least. And neither is the Paradise PVC4 in the PC20-III. What’s more, even though both these cards have a composite output, neither of them produce the same artifact colours as the IBM CGA does. So the demo looks completely wrong.

Speaking of artifact colours, not all IBM CGA cards are equal either, as you might know. We distinguish two versions, known as ‘old-style’ and ‘new-style’ (but as mentioned above, there are variations on the components used as well… and apparently some versions have a green PCB while others have a purple PCB, but neither appear to have any relation to which version of the artifact colours you get). Here is a picture that demonstrates the difference between the two cards (image: 5150_compare_cga_adapter):

As you can see in the yellow outlines, the difference is mainly a set of resistors. These resistors are used in the generation of the composite output signal. The top card is an ‘old-style’ CGA card, and the bottom one is a ‘new-style’. We believe that IBM changed the circuitry because of the monochrome screen in the portable 5155 PC. This internal screen was connected to the TV modulator header of the CGA card, which basically feeds it the composite output. The problem with the old style card is that when using it in monochrome, there were only a few unique luminance values. The tweak to the circuitry delivered 16 unique luminance values, which made it more suitable for monochrome displays.

The graphics in our demo were entirely tuned for old-style CGA, partly because we wanted to target the original 1981 PC, and partly because old-style CGA has slightly nicer colours, especially for the new 256/512/1024-colour modes.

For more in-depth information, see this excellent Nerdly Pleasures blog.

So, we decided to do something about it: we made a to-do list of things we wanted fixed or changed in the final version. Let’s go over these changes.

Overall tweaks

There were some minor glitches between effects, when videomodes were switched, and timer interrupt handlers were installed or removed, which led to the music skipping, the screen losing sync and other tiny things like that.

We fixed all the visual and aural bugs we could (sadly because CGA is rather crude, it is very difficult to make videomode changes entirely invisible, so we have accepted that there are still some tiny visual glitches here and there).

One in particular I’d like to mention is the plasma effect. This runs in an 80-column textmode, which is susceptible to so-called ‘CGA snow’. Basically, because IBM used single-ported DRAM memory, there is a problem in 80-column mode: when the CPU accesses memory, the video output circuitry cannot. This results in the output circuitry just receiving whatever value is on the data bus, rather than the byte it actually wants to read. The resulting random patterns of characters are known as ‘snow’ (the 80-column mode uses more bandwidth than the other modes; in the other modes, IBM solved the snow issues by inserting wait states on CPU access).

Our plasma routine is actually much faster than what you see in the demo, because we ‘throttle’ it, and write to the screen only when there is no memory being read by the video output, to avoid snow. Or at least, that was the idea. In the capture of the party version you see one column of CGA snow on the left of the screen, which should have been hidden in the border.

This is because we worked under the assumption that if we set the DMA refresh timer to 19 memory cycles instead of 18 (19 being a divisor of the 76 memory cycles on a scanline), and then brought the CRTC, CGA and CPU clocks into lockstep, the entire system would be in lockstep. At the party we discovered that the plasma would sometimes work as designed, and hide the snow in the border, and sometimes the snow would be visible.

As we investigated the issue further, we found that the PIT’s relative phase to the CGA clock can differ on each power-cycle. That is, the PIT runs at 1.193 MHz, which is 1/4th the frequency of the 4.77 MHz base clock of the system, from which the CPU and the ISA bus take their clock signal (and thereby the CGA card, since unlike later graphics cards, it has no timing source of its own. In fact, original IBM PCs have an adjustable capacitor on the motherboard to fine-tune the system clock, in order to fine-tune the NTSC signal of the CGA card). However, the PIT can start on any system cycle. So there are 4 possible phases between CGA and PIT, and as far as we know, there is no way to control this from software. So instead, we have a tool which determines the relative PIT/CGA phase that the PC has powered-up in. Then we just have to power-cycle until we have the right phase, and then we can run the plasma with no snow. Although there are other cycle-exact parts in the demo, they are not affected by this relative phase.

End tune

The end tune was rather quiet. The reason for this is in how the tune is composed: it just happens to be a rather quiet tune. So, we added a ‘preprocessing’ step, which can be seen as a ‘peak normalize’ step: the entire tune is mixed, and the peak amplitude is recorded. Then the song data is compensated to boost the volume to the maximum for that particular tune. Since the sample technique is timer-based (pulse-width modulation), you have to be careful not to get any overflow during mixing. Where conventional PCM samples have no problem with overflow, other than some clipping/distortion (which is not very noticeable in small amounts), an overflow with PWM can result in a timer value of 0, which is interpreted by the hardware as the maximum timer count of 65536. This means you will get a long pause in the music whenever this happens. Negative timer values will also be interpreted as large positive counts.

So the end tune is now louder. Which not only makes it easier to hear on a real PC speaker, it also means that the carrier whine is relatively more quiet than before (the carrier whine is still the same amplitude as it was before). Aside from that, the limited sample resolution we have with the PWM technique is now used more effectively, so we have slightly better resolution. So overall it sounds better too.

DOS 2.x

As we found out after the release, when people tried to run the demo on DOS 2.x, it didn’t work. After some debugging, we found that this was a shortcoming of the loader. Namely, we retrieved the filename of the loader (8088MPH.EXE) in order to open the file and read some of the embedded data from it. However, as we found out, this functionality was only added in DOS 3.x. In DOS 2.x this simply is not possible. So, we added ‘8088MPH.EXE’ as a default value, which it will pick in that case. The rest of the demo did not have any problems running under DOS 2.x, so we are now entirely 2.x-compatible, instead of just 3.x. We also looked at DOS 1.x-support briefly, but that version of DOS did not have any memory management yet, nor did it have the ‘nested process’ functionality that we use with our loader to load and start each new part in the background.

So supporting DOS 1.x would require us to implement quite a bit of extra functionality, which effectively would make it trivial to run the demo directly as a booter, without any DOS. A bit too much for a final version, but perhaps for a future project, who knows…

Calibration screen

For the party version, we had just hardcoded the CRTC-values for the tweaked modes to best suit the capture setup that we had. These values can be less than ideal, or even problematic on other setups. So we decided to add a calibration screen for the final version, where you can fine-tune the CRTC-values to get a stable image and the best possible colours for the tweaked modes. This also allows you to avoid the hsync width of 0, so this makes the demo compatible with non-Motorola 6845 chips as well.

New-style CGA

By far the most work in the final version is the new set of graphics. While we had to make some changes to the code to support multiple sets of graphics and palettes (and try to stay within the 360k floppy limit), the new-style CGA support is mostly VileR’s handiwork. He had to re-colour most graphics in the demo to compensate for the different colours of new-style CGA. His attention to detail is second to none. For example, he even did a new set of borders for the polygon part. I hadn’t even thought of doing that, since the border didn’t look that different on new CGA to me, and it didn’t look ‘wrong’ unlike some of the other graphics. But once I saw the subtle changes he made, it really did make new CGA look better that way.

Also, even though one of the reasons to target old CGA was a better selection of colours, I must say the new CGA graphics look fantastic, and they are very close to the old CGA versions. In fact, I might even prefer the blue shades on the donut in new CGA mode. I am very happy with that personally, since I only have a new-style CGA card, so I was not able to see the demo properly on my own machine until now.

The texture for the rotozoomer was also updated. The reason for this is that there was a slight miscommunication between Trixter and VileR for the party version, which resulted in VileR using a different palette for the texture than what Trixter intended. The rotozoomer runs in an interlaced screen mode. We tried to set the unused scanlines to a good background colour to compensate for that, but it did not work too well. The final texture works much better.

The plasma effect also received an update to the banners for ’16 color’/’256 color’. There was no time to finish these for the compo. And lastly, the end scroller already used some ANSI art, but only for the ‘large font’ parts; VileR has now created a whole colourful ANSI background for the scroller.

One last graphics tweak is the title screen. In order to get the gray colour for the text, we had to set the foreground colour to a lower brightness (this is using mode 6, like in the sprite part with the fade-in/out). This meant that the title screen also appeared at a lower brightness than it was designed for. In the final version we have some code that adjusts the palette on-the-fly as the title screen rolls down, so that the text remains gray, while the title screen is shown at full brightness.

For more in-depth information on the graphics changes for the final version, I direct you to VileR’s excellent blog.


DOSBox patches

Since our demo introduces various ‘new’ videomodes, reenigne has also prepared some DOSBox patches to improve the CGA emulation, fixing some bugs (e.g. the tweaked modes in the sprite part did not work in composite mode with the official 0.74 version of DOSBox) and, mainly, adding a more accurate emulation of the CGA composite mode, by properly emulating the decoding of the NTSC colour signal. The patches also allow you to switch ‘monitors’ manually, since DOSBox originally implemented composite as a ‘video mode’, which it tried to auto-detect. That is not how it should work: in the real world, your CGA card has both an RGBI and a composite output, and both are always enabled. So either you get everything on a composite screen, or you get everything on an RGBI screen. You therefore want to select which type of monitor DOSBox should emulate, which is now possible. It also allows you to switch between old-style and new-style colours.

The patches will not allow you to run 8088 MPH entirely in DOSBox, but some parts now look very close to real hardware, and should also benefit any games that make use of CGA composite modes.

In fact, the video capture we made of the final version of our demo is also done with an NTSC decoding routine, similar to the one in the DOSBox patch, but tuned for maximum quality rather than performance.

The demo was captured from real hardware in raw mode, and then the raw captured frames were sent to the NTSC decoder to generate high-quality colour frames. We feel that this capture method is closer to what you would see on a proper high-quality NTSC composite monitor than the cheap NTSC decoding built into most capture devices, which leads to annoying vertical banding issues (as you can see in the party capture video). Another small fix is that we didn’t use the correct aspect ratio for CGA when capturing the party version. The final version is captured with the correct aspect ratio.


If you haven’t seen it yet, there is a live capture of the demo at the Revision 2015 oldskool compo, recorded by Trixter:

I really like the audience reactions. We clearly tried to design our demo for an audience like this, with the intro and all, and putting people on the wrong foot, hitting them with unexpected tricks, and just as they figure out what is going on, we move on to the next part.

Likewise, we tried to build up the speaker music gradually, so that people can get used to the heavily arpeggiated beeps, as the complexity scales up to squeeze as much out of the speaker as possible, rather than losing people by blasting them with very complex beeping out of the gate.

The response of the audience is the confirmation that we got most parts right in that sense: people laughed at the right places, and cheered and applauded in others. If you listen closely, you can actually hear some ‘nerdgasms’: people shouting “What the…!”, “Jesus”, “Oh man!” or “Oh god!” :)

We didn’t know beforehand how much people would ‘get’ about this rather obscure platform and its limited graphics and music, but we couldn’t have hoped for a better response: nearly every part was greeted with cheers and applause. A unique experience, and it’s unlikely that we will ever match this.

Anyway, here’s our final version (old CGA). We hope you like it!
And this is a capture of new CGA:


No DX12_1 for upcoming Radeon 3xx and Fury series?

A few days ago it was confirmed that none of AMD’s current GPUs are capable of supporting the new rendering features in DirectX 12 level 12_1, namely Conservative Rasterization and Rasterizer Ordered Views.
I already mentioned that back when Maxwell v2 and DX11.3 were introduced.
Back then a guy named ‘Tom’ tried to push the idea that these features are supported by Mantle/GCN already. When I asked for some specifics, it became somewhat clear that it would simply be some kind of software implementation rather than direct hardware support.
When I asked for some actual code to see how complex it would actually be to do this in Mantle, there was no reply.
Software approaches in general are no surprise. AMD demonstrated a software version of order-independent-transparency at the introduction of the Radeon 5xxx series.
And a software implementation for conservative rasterization was published in GPU Gems 2 back in 2005.
So it’s no surprise that you can implement these techniques in software on modern GPUs. But the key to DX12_1 is efficient hardware support.

That should have already confirmed that AMD was missing out on these features, at least on the current hardware.
Rumours have it that a large part of the new 3xx series will actually be rebranded 2xx GPUs, so nothing changes there, feature-wise.

So the only unknown is what the new Fiji chip is going to be capable of, for the Radeon Fury series.
The most info I’ve found on the Fiji chip so far is a set of leaked slides posted at Beyond3D.
But it still isn’t clear whether it supports conservative rasterization and rasterizer ordered views. The slides do not mention it, which is not a good sign. They do mention “Full DX12 support”, but only in combination with resource binding tier 3 (a feature that Maxwell v2 does not support).
I think it is safe to assume that if they supported conservative rasterization and ROVs, we would have heard of it by now, and it would definitely be mentioned on the slides, since it is a far more interesting feature than resource binding tier 2 vs tier 3.

So I contacted Richard Huddy about this. His reply pretty much confirmed that conservative rasterization and ROVs are missing.
Some of the responses of AMD’s Robert Hallock on Twitter also point to downplaying the new DX12_1 features, and just pretending that supporting DX12 is supporting DX12, regardless of feature level.
Clearly that is not the case: 12_1 includes some rasterizing features that 12_0 does not. But if AMD needs to spread that message, with a new chip being launched in just a few days, I think we know enough.

Oh wow, there goes the hate campaign again:

Does this mean that Nvidia didn’t support DX 11 because they couldn’t do DX 11.1 or feature level 11_1? Does the fact that Nvidia downplayed the benefits of 11.1/11_1 mean that they didn’t support DX 11? That guy just sounds like an idiot/fanboy when he says that even if it may not have been his intention.

Of course if he said the same things about Nvidia back then, well, then he’s just an idiot. :)

Is it just me, or is this logic broken beyond repair? I don’t know what to make of this… “nVidia didn’t support DX11 because they couldn’t do DX11.1”?
How is that an analogy to this? Nobody said AMD doesn’t support DX12. AMD’s statement is that “DX12 is DX12, regardless of whether you support level 12_0 or 12_1”, and I’m just pointing out the obvious flaw in that statement: there are different feature levels in DX12 because they are different.

Not to mention that the situation with DX10/DX10.1 or DX11/DX11.1 was very different. Those were updates to the API that were done after the API had been on the market for years already. Back when DX10 and DX11 were launched, there was no concept of DX10.1 or DX11.1 respectively, and neither nVidia nor AMD had DX10.1/DX11.1 hardware at the launch of DX10/11.
In this case, there are different feature levels for DX12 introduced right at the launch of the API. So that does not compare at all to the situation with earlier versions of DX.

So who is sounding like an idiot/fanboy here?
