No DX12_1 for upcoming Radeon 3xx and Fury series?

A few days ago it was confirmed that none of AMD’s current GPUs are capable of supporting the new rendering features in DirectX 12 level 12_1, namely Conservative Rasterization and Rasterizer Ordered Views.
I already mentioned that back when Maxwell v2 and DX11.3 were introduced.
Back then a guy named ‘Tom’ tried to push the idea that these features are supported by Mantle/GCN already. When I asked for some specifics, it became somewhat clear that it would simply be some kind of software implementation rather than direct hardware support.
When I asked for some actual code to see how complex it would actually be to do this in Mantle, there was no reply.
Software approaches in general are no surprise. AMD demonstrated a software version of order-independent-transparency at the introduction of the Radeon 5xxx series.
And a software implementation for conservative rasterization was published in GPU Gems 2 back in 2005.
So it’s no surprise that you can implement these techniques in software on modern GPUs. But the key to DX12_1 is efficient hardware support.

That should have already confirmed that AMD was missing out on these features, at least on the current hardware.
Rumours have it that a large part of the new 3xx series will actually be rebranded 2xx GPUs, so nothing changes there, feature-wise.

So the only unknown is what the new Fiji chip is going to be capable of, for the Radeon Fury series.
The most info I’ve found on the Fiji chip so far is a set of leaked slides posted at Beyond3D.
But it still isn’t clear whether it supports conservative rasterization and rasterizer ordered views. The slides do not mention it, which is not a good sign. They do mention “Full DX12 support”, but only in combination with resource binding tier 3 (a feature that Maxwell v2 does not support).
I think it is safe to assume that if they supported conservative rasterization and ROVs, we would have heard of it by now, and it would definitely be mentioned on the slides, since it is a far more interesting feature than resource binding tier 2 vs tier 3.

So I contacted Richard Huddy about this. His reply pretty much confirmed that conservative rasterization and ROVs are missing.
Some of the responses of AMD’s Robert Hallock on Twitter also point to downplaying the new DX12_1 features, and just pretending that supporting DX12 is supporting DX12, regardless of feature level.
Clearly that is not the case: 12_1 includes some rasterizing features that 12_0 does not. But if AMD needs to spread that message, with a new chip being launched in just a few days, I think we know enough.

Oh wow, there goes the hate campaign again:

Does this mean that Nvidia didn’t support DX 11 because they couldn’t do DX 11.1 or feature level 11_1? Does the fact that Nvidia downplayed the benefits of 11.1/11_1 mean that they didn’t support DX 11? That guy just sounds like an idiot/fanboy when he says that even if it may not have been his intention.

Of course if he said the same things about Nvidia back then, well, then he’s just an idiot. :)

Is it just me, or is this logic broken beyond repair? I don’t know what to make of this…. “nVidia didn’t support DX11 because they couldn’t do DX11.1?”
How is that an analogy to this? Nobody said AMD doesn’t support DX12. I’m just saying that the statement AMD is making is “DX12 is DX12, regardless of whether you support level 12_0 or 12_1”.
I’m just pointing out the obvious flaw in AMD’s statement: there are different levels in DX12 because they are different.

Not to mention that the situation with DX10/DX10.1 or DX11/DX11.1 is very different from the current situation. Namely, those were updates to the API that were done after the API had been on the market for years already. Back when DX10 and DX11 were launched, there was no concept of DX10.1 and DX11.1 respectively. And neither nVidia nor AMD had DX10.1/DX11.1 hardware at the launch of DX10/11.
In this case, there are different feature levels for DX12 introduced right at the launch of the API. So that does not compare at all to the situation with earlier versions of DX.

So who is sounding like an idiot/fanboy here?

Posted in Direct3D, Hardware news | 8 Comments

Food for thought

Question: Give at least 4 numbers in the range of 1 to 10.

Answer 1: 3, 5
Answer 2: 4, 5, 6, 20
Answer 3: 7, 7, 7, 7
Answer 4: 3, 4, 5, 6
Answer 5: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Answer 1 is incorrect, because there are fewer than 4 numbers.
Answer 2 is incorrect, because not all numbers are in the range of 1 to 10.
Answer 3 is correct (the question does not specify that the numbers need to be unique)
Answer 4 is correct
Answer 5 is correct
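
The grading above can be written down as a tiny check. This is only an illustrative Python sketch; the function name `is_correct` and its parameters are made up for the example.

```python
def is_correct(numbers, minimum_count=4, low=1, high=10):
    """True if there are at least `minimum_count` numbers, all within [low, high]."""
    return len(numbers) >= minimum_count and all(low <= n <= high for n in numbers)

# The five answers from the list above.
answers = {
    1: [3, 5],
    2: [4, 5, 6, 20],
    3: [7, 7, 7, 7],
    4: [3, 4, 5, 6],
    5: [1, 2, 3, 4, 5, 6, 7, 8, 9, 10],
}

for number, answer in answers.items():
    print(f"Answer {number}:", "correct" if is_correct(answer) else "incorrect")
```

Note that the function only ever returns True or False.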

But, is answer 4 ‘more correct’ than answer 3? And is answer 5 ‘more correct’ than answer 4?

Posted in Uncategorized | 7 Comments

8088 MPH: The polygons

The last effect in 8088 MPH to be discussed is the polygon renderer. As already mentioned earlier, it is not a regular polygon renderer, but actually a ‘delta-renderer’. This bears some similarity to 8088 Domination, where the video is encoded as sets of differences to the previous frame. In the case of 8088 MPH the goal is not so much compression as it is reduction of the number of bytes written to CGA memory, since CGA memory is a huge bottleneck. You have about 170 KB/s write speed. Another problem with CGA is the lack of memory, as I mentioned in the previous article as well: there is no memory for more than one framebuffer, so no double-buffering.

Rendering, let’s recap

I have discussed polygon rasterizing before, but let’s review the whole conventional rendering process, so we can then discuss how the renderer in 8088 MPH is different from this process. In this case, I am taking the z-sorting renderer used in 1991 Donut as an example, since that is closest to what we have here:

  1. Clear backbuffer
  2. Transform vertices from object space to screen space
  3. Sort polygons in z-order
  4. Clip polygons against screen rectangle
  5. Render polygons to backbuffer in back-to-front order (Painter’s algorithm)
  6. Swap front and back buffers.

Now, there is one thing that should already be an obvious problem: Steps 1, 5 and 6 assume that we have a backbuffer. Namely, we first clear the whole screen, and then we render the polygons in a ‘random’ vertical order, because they are sorted on z. If you were to do this directly on the frontbuffer, you’d be racing the beam, and losing most of the time, so you will get a very flickery image, with random parts of polygons which may or may not appear depending on where the raster beam was at the point where you started drawing. So the backbuffer is very much required for a conventional polygon renderer.

In 1991 Donut I used a backbuffer in video memory, which meant that I could just change the start offset address to swap between buffers, which takes virtually no CPU time at all. On CGA, there is not enough memory, so at the full resolution it would only work if the backbuffer is done in system memory, and instead of doing a simple swap, we do a full memcpy to video memory. Since the memory speed is so low, this takes a lot of time and will give visible tearing while drawing.

As discussed previously with the vectorbobs, it is possible to do double-buffering in CGA memory, if you set up a tweak mode with a smaller window size. However, vectorbobs can be erased efficiently by just drawing black areas at the positions of the previous frame. With polygons, it is generally most efficient to just clear the whole screen. Drawing black polygons to erase the objects of the previous frame would be more complex and take longer. But even at a smaller window size, clearing the whole buffer still takes a lot of time in CGA memory. We’re still talking about 8 KB per frame, and at ~170 KB/s, you’d need ~0.05 seconds to clear that much. So even before you actually started drawing, you already limited yourself to about 20 fps max, on just half the screen area. Yes, CGA is *very* slow.
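
The arithmetic behind that estimate, as a small Python sketch. The two constants are the rough figures quoted above; the function name `clear_time` is just a label for this example.

```python
CGA_WRITE_SPEED = 170 * 1024   # bytes per second, the approximate figure quoted above
BUFFER_SIZE = 8 * 1024         # bytes cleared per frame in the smaller tweak mode

def clear_time(buffer_size=BUFFER_SIZE, write_speed=CGA_WRITE_SPEED):
    """Seconds needed just to clear the buffer, before anything is drawn."""
    return buffer_size / write_speed

seconds = clear_time()
print(f"clear time: {seconds:.3f} s, frame rate ceiling: {1 / seconds:.0f} fps")
```

That works out to a little under 0.05 seconds per clear, so a ceiling of roughly 20 fps before a single polygon has been drawn.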

So this is where we start to think about delta-rendering. Even if we could just avoid clearing the whole screen, and only clear the parts around the polygons, we could already save a lot of expensive CGA writes. And since we can’t realistically do much more than single-coloured polygons, a large part of the pixels on screen will remain the same colour on consecutive frames. There will only be changes around the edges of the polygons.


So I worked out a solution to find these changes based on a span-buffer approach. Normally you render out polygons as spans of pixels per scanline. In this case I do not actually render the pixels directly, but I use a linked list of spans for each scanline. A span contains the leftmost and rightmost x-coordinates, and the pixel colour. Each span is then inserted into the list, where spans are split, merged, or eliminated on insert, to eliminate any overdraw. Therefore you still want to insert the spans back-to-front, so you still use a sort of painter’s algorithm. In order to ‘clear the screen’, you insert a set of spans with the background colour which span the entire screen before you start inserting your polygons. All these operations are very simple integer-based operations, just comparisons, additions or subtractions, so that makes it very suitable for the 8088, which is much slower at multiply and division than these simpler operations.
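
To make the span insertion concrete, here is a minimal Python sketch of one scanline of such a span-buffer. It is not the code from the demo (that is 8088 assembly working on linked lists); it assumes a sorted Python list of `(x0, x1, colour)` tuples with inclusive coordinates, and the names `insert_span` and `merge_spans` are made up for the example.

```python
def merge_spans(spans):
    """Join neighbouring spans that touch and have the same colour."""
    merged = []
    for x0, x1, colour in spans:
        if merged and merged[-1][2] == colour and merged[-1][1] + 1 == x0:
            merged[-1] = (merged[-1][0], x1, colour)
        else:
            merged.append((x0, x1, colour))
    return merged

def insert_span(spans, x0, x1, colour):
    """Insert [x0, x1] into a sorted list of non-overlapping spans.
    The new span overwrites whatever it covers, so there is no overdraw."""
    result = []
    placed = False
    for sx0, sx1, scolour in spans:
        if sx1 < x0:                      # entirely left of the new span
            result.append((sx0, sx1, scolour))
            continue
        if not placed:                    # first span that is not left of it
            if sx0 < x0:                  # keep the uncovered left part
                result.append((sx0, x0 - 1, scolour))
            result.append((x0, x1, colour))
            placed = True
        if sx1 > x1:                      # keep the uncovered right part
            result.append((max(sx0, x1 + 1), sx1, scolour))
        # spans that are completely covered are simply dropped
    if not placed:
        result.append((x0, x1, colour))
    return merge_spans(result)

# 'Clear the screen' with one background span, then insert back-to-front.
scanline = [(0, 159, 0)]
scanline = insert_span(scanline, 40, 99, 3)
scanline = insert_span(scanline, 80, 119, 5)
print(scanline)
```

Everything here is comparisons, additions and subtractions on integers, which is the property that makes the approach attractive on an 8088.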

Once you have inserted all your polygons, you have a complete ‘snapshot’ of your framebuffer in a compact form. If you also saved the span-buffer of your previous frame, you can now quickly determine the differences per scanline by comparing the spans of the two buffers (so in a way, we are still double-buffering, just not with framebuffers). You do a very similar insert and clip/eliminate operation as with the earlier rendering of the spans, to get a list of spans that only contains the differences. There are a number of advantages to this method, including:

  • Clearing the screen is a very cheap operation
  • The performance is virtually independent of the size of the polygons, it depends on the size of the changes
  • The spans are grouped per scanline, so rendering top-to-bottom and left-to-right is trivial

This means that we no longer need a backbuffer. We don’t actually clear the screen anymore, we only draw the differences, so any flicker will be very minor anyway. And on top of that, we can render in scanline-order. This is especially nice for CGA, since rendering in memory order (such as when you would copy a backbuffer to a frontbuffer with a memcpy operation) means you do the even scanlines first, and then the odd scanlines, which can cause visible interlacing effects.

And we can just use the full 160×200 resolution, and use some border graphics to spice things up.
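
The comparison of two span-buffers can be sketched in the same way. This is a simplified Python illustration rather than the routine from the demo: it assumes both lists describe the same scanline, are sorted, and cover it completely with inclusive `(x0, x1, colour)` spans. The name `diff_spans` is hypothetical.

```python
def diff_spans(old, new):
    """Return only those parts of `new` whose colour differs from `old`."""
    changes = []
    i = 0
    for nx0, nx1, ncolour in new:
        x = nx0
        while x <= nx1:
            while old[i][1] < x:          # find the old span that contains x
                i += 1
            _, ox1, ocolour = old[i]
            end = min(nx1, ox1)           # stretch where both spans overlap
            if ocolour != ncolour:
                if changes and changes[-1][2] == ncolour and changes[-1][1] + 1 == x:
                    changes[-1] = (changes[-1][0], end, ncolour)
                else:
                    changes.append((x, end, ncolour))
            x = end + 1
    return changes

# A 60-pixel wide polygon span moves 4 pixels to the right between frames.
previous = [(0, 39, 0), (40, 99, 3), (100, 159, 0)]
current = [(0, 43, 0), (44, 103, 3), (104, 159, 0)]
print(diff_spans(previous, current))
```

In this example only the two 4-pixel strips at the edges of the polygon need to be written to the screen, out of 160 pixels on the scanline.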


An interesting difference with 1991 Donut is the sorting strategy. For 1991 Donut, I wanted to scale up the polycount as far as possible. Since sorting doesn’t scale well with a large number of elements, I did not want to do the sorting per polygon. It would probably not be very fast with the 128 polygons I wanted to use. So, what I did there was to subdivide the donut in 8 sections, and sort per section. This means the sorting cost is constant, regardless of polycount.

The downside to this technique is, however, that you need to create your donut so that it can be subdivided into 8 sections. This means that you can’t really construct a donut with fewer than the 40 polygons I used for the ‘low’ feature in 1991 Donut. This was still quite a lot of polygons for an 8088 at 4.77 MHz to handle. Therefore I decided to drop this sorting method in favour of a more conventional per-polygon sort, and instead construct a donut from 25 polygons, which is about as low as you can go, while still having something that somewhat resembles a donut shape.


I have already mentioned the shading in an earlier post, but I will just state it again here for completeness. VileR came up with a number of gradients that were possible in various CGA modes and palettes. The common black, cyan, magenta, white palette happened to generate a blue gradient of 7 colours, and a red gradient of 6 colours. This would allow for some semi-useful flatshading, so I decided to give it a try and see how it looks.

The downside of flatshading is that your polygons will change colour, which means you get more changes per frame when they do, slowing down the routine. The choice for the pyramid at the start was to show more colour variation by using another palette, and also to start simple, keep expectations low, so that we could build up to the cube and donut with more impact.


If you recall, I did some experiments with dithering earlier, when I targeted Hercules:

Since I only had one colour to work with there, I had to create a number of patterns to simulate different levels of brightness. I chose to use 8×8 patterns, since there are 8 pixels in a single byte. So each pattern consists of 8 bytes. I select the correct byte for each scanline based on the y-coordinate modulo 8. For this code however, I could use a simpler form of dithering, since I already had multiple colours.

Adding dithering to this particular renderer was actually quite simple, if you figure that the span-buffering works with bytes to store the colours anyway. Each pixel is only 4 bits, so I copy the 4-bit colour value to both pixels that are packed in the byte, when I render the span to the screen. In order to speed up that process, I moved that to the start of the polygon rasterization, since the same colour value is used for all spans in the polygon anyway. So the spans already store the expanded colour value, which is used as-is by the rendering routine.

Expanding this to dithering was quite simple: Instead of just taking the same colour value for both pixels in the byte, I just calculated the lighting with one extra bit of precision, and if the lowest bit is set, I take the two nearest colours, and place them in the byte for the even scanlines. Then I swap the order of these two colours for odd scanlines. This results in an X-pattern, which is simple but effective, and doesn’t need any change to the span-buffering and rendering mechanism itself, only a simple tweak to the polygon rasterizer, with a negligible impact on the overall speed.
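
As an illustration of that tweak, here is a small Python sketch of how the colour byte for a polygon could be chosen per scanline. The `GRADIENT` values are placeholders, not VileR’s actual gradient, and `span_colour_byte` is a made-up name. The shade value carries the one extra bit of precision, so it runs from 0 to 2 × (number of colours − 1).

```python
# Placeholder 4-bit colour indices from dark to bright (not the real gradient).
GRADIENT = [0, 1, 9, 3, 11, 7, 15]

def span_colour_byte(shade, y, gradient=GRADIENT):
    """Return the byte (two packed 4-bit pixels) to use on scanline y."""
    index = shade >> 1                    # which gradient colour we are at
    first = gradient[index]
    # Lowest bit set: we are halfway between two colours, so use both.
    second = gradient[index + 1] if shade & 1 else first
    if y & 1:                             # swap the pair on odd scanlines
        first, second = second, first
    return (first << 4) | second

print(hex(span_colour_byte(4, 0)), hex(span_colour_byte(4, 1)))  # solid colour
print(hex(span_colour_byte(5, 0)), hex(span_colour_byte(5, 1)))  # dithered pair
```

Since the two pixels in the byte swap places on every other scanline, the in-between shades come out as the crossed pattern described above.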


Dithering effectively doubles the number of colours you use, so you will get more frequent changes of colours, which slows the routine down even more. But it did look very nice, and the speed was actually quite acceptable still, so we chose to use the dithered version of the routine in the final demo.

Posted in Oldskool/retro programming, Software development | 3 Comments

8088 MPH: Sprites? Where we’re going, we don’t need… sprites!

I would like to give some technical background information on the sprite part in 8088 MPH, but before I do that, I want to discuss the music and the musicians, as they did not seem to have gotten very much attention in my previous article. One might think that the musicians were not as much part of the project as the coders and graphician were. And in a way that is true.

Phoenix was the only musician who was actively involved with the project, but he joined at a later stage. Initially the plan was that Phoenix would do all the music for the demo. But later he said he wouldn’t have enough time, so we had to look for other musicians. Which is not easy :) As I said before, we had to ‘invent’ this demo platform, and that goes for the music as well. Although MONOTONE was not written specifically for this demo, it was written by Trixter, and it had not seen widespread use yet. So on the one hand, it was difficult to find musicians who had any experience with MONOTONE at all, or with PC speaker music in general, or who were willing to give it a try. And on the other hand, MONOTONE had not seen much of a workout yet since its release, so the musicians ran into some bugs, and also requested some modifications to the user interface to make tracking a bit more efficient.

But, it all worked out in the end. Phoenix found enough time to finish one track, which is the first one in our demo. Virt did the second MONOTONE track, and Coda did the chiptune on short notice (the mod player was one of the last parts to get finished, less than two weeks before the compo). I had also asked a few DESiRE people if they could do something with MONOTONE, and they did. So we actually had more music than we could use in the end. And primitive as MONOTONE may be on PC speaker, I think even the beeper tracks in our demo are up there with the best beeper music done on PC, getting a LOT of sounds and effects out of that primitive beeper.

No sprites

Let’s revisit the CGA hardware which I briefly covered earlier. We have 16k of video memory, which is just enough for a single framebuffer in 320×200 with 4 colours or 640×200 with 2 colours. The main chip is a Motorola 6845 CRTC, which is not really designed for graphics modes, and only supports up to 127 lines. Therefore, CGA uses a ‘hack’ to display 200-line graphics: it has two bitplanes, one for even and one for odd scanlines. So the CRTC ‘thinks’ in 100-line modes, while we actually get 200 lines on screen.

Pixels are packed side-by-side in bytes, so you have the following pixel formats:

  • 2-colour: 1-bit pixels, 8 per byte
  • 4-colour: 2-bit pixels, 4 per byte
  • 16-colour (composite artifact mode): 4-bit pixels, 2 per byte
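
A short Python sketch of these three packings. It assumes the usual CGA convention that the leftmost pixel sits in the most significant bits of the byte; the function name `pack_pixels` is made up for this example.

```python
def pack_pixels(pixels, bits_per_pixel):
    """Pack pixel values into bytes, leftmost pixel in the highest bits."""
    per_byte = 8 // bits_per_pixel
    if len(pixels) % per_byte:
        raise ValueError("pixel count must fill whole bytes")
    mask = (1 << bits_per_pixel) - 1
    packed = bytearray()
    for start in range(0, len(pixels), per_byte):
        value = 0
        for pixel in pixels[start:start + per_byte]:
            value = (value << bits_per_pixel) | (pixel & mask)
        packed.append(value)
    return bytes(packed)

print(pack_pixels([1, 0, 0, 0, 0, 0, 0, 1], 1).hex())  # 2-colour: 8 pixels per byte
print(pack_pixels([3, 0, 1, 2], 2).hex())              # 4-colour: 4 pixels per byte
print(pack_pixels([0xA, 0x5], 4).hex())                # 16-colour: 2 pixels per byte
```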

Since there is no hardware functionality whatsoever for sprites, bobs, overlays or anything remotely useful, we have to use CPU-based routines to draw sprites. The fastest possible way to draw sprites is to use ‘compiled’ sprites: you generate pieces of code which write the sprite data directly into video memory.

Because of the cumbersome pixel format you will need multiple variations of your sprite-code to have pixel-exact placement in the x and y directions. Namely, because of the separate even and odd bitplanes, some of the code for drawing a sprite is different depending on whether you start on an even or an odd scanline. Because when you want to switch from even to odd scanlines, you need to adjust your pointer with a certain offset. And when you want to switch from odd to even, you need a different offset.
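
The two different offsets can be shown with a few lines of Python. This sketch assumes the standard CGA graphics layout: 80 bytes per scanline, with the odd scanlines starting 0x2000 bytes after the even ones. Tweak modes use a different row width, which is why it is a parameter. The function names are hypothetical.

```python
BYTES_PER_ROW = 80        # standard CGA graphics modes
ODD_OFFSET = 0x2000       # odd scanlines start 8 KB after the even ones

def scanline_offset(y, bytes_per_row=BYTES_PER_ROW):
    """Offset in video memory of the first byte of scanline y."""
    return (y >> 1) * bytes_per_row + (y & 1) * ODD_OFFSET

def next_scanline_delta(y, bytes_per_row=BYTES_PER_ROW):
    """Pointer adjustment needed to step from scanline y to y + 1."""
    return scanline_offset(y + 1, bytes_per_row) - scanline_offset(y, bytes_per_row)

print(next_scanline_delta(0))   # even -> odd
print(next_scanline_delta(1))   # odd -> even
```

Stepping from an even to an odd scanline adds 0x2000; stepping from an odd to the next even scanline subtracts 0x2000 and adds one row. A sprite that starts on an odd scanline therefore needs the two adjustments in the opposite order, hence the separate code variants.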

And for the horizontal case, it is quite expensive to shift the data on the pixel level at runtime, so you will want to compile different pre-shifted variations of your code.

As for the code for the pixels in the sprite itself, we have a 16-bit instruction set, so we can process two bytes (one word) at a time, in the best case. However, sometimes it makes more sense to only process a single byte. I defined the following classes of pixel groups:

  • All pixels in word opaque
  • All pixels in word transparent
  • Some pixels in word opaque/transparent
  • All pixels in byte opaque
  • All pixels in byte transparent
  • Some pixels in byte opaque/transparent

For each class, I hand-optimized ‘templates’ of assembly-code. Then I defined heuristics for when to use which template, to always get the fastest/smallest code. The compiler will grab two bytes worth of pixels from the source bitmap, then it will select the proper template based on these heuristics. It then fills the template with the proper pixel/masking info and emits the code and data. It does not always choose to use immediate operands, unlike many examples of sprite compilers you’ll find online. Whenever possible, it uses movsb or movsw, which leads to much smaller and faster code than regular mov instructions with immediate operands.
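
To give an idea of the classification step, here is a simplified Python sketch. It is not the actual compiler: the real heuristics are not described here, so the rule used below (use a word when both bytes fall in the same class, otherwise fall back to a single byte) is only a stand-in, and the names `classify` and `plan_row` are made up. The input is one row of a transparency mask, with 1-bits for opaque pixels.

```python
def classify(mask):
    """Classify a group of mask bytes (1 bits = opaque pixels)."""
    if all(b == 0xFF for b in mask):
        return "opaque"
    if all(b == 0x00 for b in mask):
        return "transparent"
    return "mixed"

def plan_row(mask_row):
    """Split one row of mask bytes into (size, class) groups.
    Stand-in heuristic: use a word when both bytes share a class."""
    plan = []
    i = 0
    while i < len(mask_row):
        pair = mask_row[i:i + 2]
        if len(pair) == 2 and classify(pair[:1]) == classify(pair[1:]):
            plan.append(("word", classify(pair)))
            i += 2
        else:
            plan.append(("byte", classify(pair[:1])))
            i += 1
    return plan

for group in plan_row(bytes([0xFF, 0xFF, 0x00, 0x00, 0xF0, 0xFF, 0x00])):
    print(group)
```

A real compiler would then pick an assembly template for each group: a plain copy for opaque groups, a pointer update for transparent ones, and a read-modify-write with a mask for mixed ones.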

Aside from that, there are also some optimizations to generate the best possible code. For example, transparent bytes or words will simply update a pointer. Likewise, switching to a new scanline is also just a pointer update. The compiler will group these pointer updates together. Another optimization is that the compiler will first process all even scanlines, and then all odd scanlines. The reason for this is that switching to the next scanline will then be an offset that fits in a single byte, which results in a shorter instruction encoding.


The first version of the compiler would just assume that black pixels are transparent, and everything else is opaque. However, when VileR got the idea for doing the DeLorean sprite, he asked if it was possible to also have black in the sprite, because it would look much better. So, I modified the compiler to take two bitmaps as input rather than one. The second bitmap is just a black/white transparency mask, allowing you to mask any pixel as either opaque or transparent, which gives you the ability to draw opaque black pixels.


We don’t just need to draw the sprite, we also need to erase it. So the compiler does not just generate the code to draw the sprite, but also to erase it. The first version assumed that the sprites would always be drawn on a black background. So the background colour could just be compiled into the erase routine. This erase routine is also more coarse than the sprite routine: if the whole background is black anyway, you don’t need pixel-exact drawing. You can just erase at the byte-level, which is a lot faster, because you don’t need to do any read-modify-write operations on video memory. And again, the compiler will try to use words where possible, for maximum speed and minimum code size, it basically just uses stosw and stosb, and skips any transparent bytes/words by adjusting the pointer.

When I saw that the performance of these routines was quite good, I figured I could also try to use a background image. So I extended the compiler to generate code that would copy pixels from a background image to the screen, again at the byte-level, using words where possible, basically just movsw and movsb, and again skipping transparent bytes/words with pointer adjustments.
Now I had routines that were equivalent to the blitter objects (bobs) used on Amiga. Only these run in software, so they are more like software objects (sobs).

Another way to scroll


Once I had the sprites working on a background, and performance still seemed quite good, I had the idea that it would be possible to move the background by adjusting the start offset register. As long as I adjusted the target address for the sprite to compensate for the scrolling of the screen, I could have sprites floating over a scrolling background. The only problem is though: how does one scroll the background when there is no extra memory?

Well, as you have probably already noticed, the background is set up to be ‘tiled’: the building repeats the same 60-pixel high segment multiple times. This means that we only have to scroll up 60 pixels to get to a ‘wraparound’ point. Now, with a regular 160×200 16-colour mode, you wouldn’t have enough memory for that. The framebuffer takes 16000 bytes, and you have 16384 bytes. That is just enough for 4 extra scanlines. My first idea was to just make the visible window smaller, so you’d get a 160×144 mode. But that would have a rather strange aspect ratio. So instead I set up a tweakmode of 140×170. This looks like a full screen, because the aspect ratio is about the same. Also, by reducing the width, I free up a lot of memory to scroll in a new building segment. Once I had that working, I could have ‘endless’ upward and downward smooth scrolling of the background, by just taking a scroll offset modulo 60. Mind you, because of the ‘hack’ with even and odd scanlines mentioned earlier, you can only set the start offset to any of the even scanlines, so you can only scroll with two scanline increments.
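
The memory budget can be checked with a bit of Python. This sketch only counts bytes: it assumes 2 pixels per byte (the 16-colour mode) and that one extra 60-line segment has to fit next to the visible window. The function name `bytes_needed` is made up for the example.

```python
VIDEO_MEMORY = 16384      # bytes of CGA video memory
SEGMENT_HEIGHT = 60       # height of the repeating building segment

def bytes_needed(width, height, extra_lines=0):
    """Bytes for a window of width x height plus extra off-screen lines."""
    bytes_per_row = width // 2            # 2 pixels per byte in 16-colour mode
    return bytes_per_row * (height + extra_lines)

for width, height in [(160, 200), (160, 144), (140, 170)]:
    total = bytes_needed(width, height, SEGMENT_HEIGHT)
    verdict = "fits" if total <= VIDEO_MEMORY else "does not fit"
    print(f"{width}x{height} + {SEGMENT_HEIGHT} lines: {total} bytes, {verdict}")

# Spare scanlines in the regular 160x200 mode.
print((VIDEO_MEMORY - bytes_needed(160, 200)) // (160 // 2))
```

With these assumptions the 160×200 window plus a segment needs 20800 bytes, while 160×144 and 140×170 come out at 16320 and 16100 bytes respectively, both within the 16384 bytes available.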

Putting it all together

We aren’t quite there yet… Remember, that we only have a single framebuffer. Our tweakmode gives us extra off-screen memory to scroll into view, but we still need to draw the sprites directly into the visible area. So yes, we are once again racing the beam. Now, on most 80s hardware this is not that much of a problem, because the hardware was designed for this: you will have scanline counters or even raster interrupts to synchronize your code with a specific position on the screen.

Alas, CGA has none of this. There is the lightpen functionality, which could be used… but because of the ‘hack’ with even/odd scanlines, it does not give scanline-exact results. Also, it is rather cumbersome to use, and may or may not work on a wider range of hardware. Instead, I opted to use the timer. Since the timer runs off the same clock as the CGA card, they are exactly in sync. A single scanline takes 76 timer ticks. An entire screen takes 262*76 = 19912 ticks.
So, if I set the timer to 19912 ticks, and I make it restart at the bottom of the screen, I effectively have a ‘reverse’ scanline counter, except it is scaled up by a factor of 76. Once I had this set up properly, I could poll the timer to determine when the last scanline of my sprite was drawn out by the beam, and I could start erasing it and draw the position for the next screen.
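
The conversion from timer count to scanlines is simple enough to show in Python. This is only a model of the arithmetic, not the polling code from the demo: it assumes the counter counts down from 19912 and was reloaded at the bottom of the screen, and the function names are made up.

```python
TICKS_PER_SCANLINE = 76
SCANLINES_PER_FRAME = 262
TICKS_PER_FRAME = TICKS_PER_SCANLINE * SCANLINES_PER_FRAME   # 19912

def scanlines_left(counter):
    """Whole scanlines remaining until the timer reloads."""
    return counter // TICKS_PER_SCANLINE

def scanlines_elapsed(counter):
    """Whole scanlines the beam has covered since the timer reloaded."""
    return (TICKS_PER_FRAME - counter) // TICKS_PER_SCANLINE

def beam_has_passed(counter, last_sprite_line):
    """True once the beam is past the given line, counted from the reload point."""
    return scanlines_elapsed(counter) > last_sprite_line

print(TICKS_PER_FRAME, scanlines_left(19912), scanlines_elapsed(19912 - 76 * 100))
```

Here `last_sprite_line` is counted from the point where the timer restarts, so on real hardware it would still have to be offset by the distance between that point and the top of the visible area.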

The only limitation I had was that the sprite could not take more than an entire screen to draw. When VileR sent me the DeLorean, I was a bit worried at first… It was quite a big sprite, would the compiled code be fast enough? But luckily it was, and in fact, there’s actually some rastertime left.

The fun part is that I do not actually do any synchronization to the vertical blank at all. The timer interrupt is used to play the music. The sprite is redrawn by polling the counter, and since I can never draw more than one sprite per frame, it synchronizes automatically. I also exploit the fact that the start offset register is latched: I set the new scrolling offset immediately before I start drawing my sprite. The 6845 will latch it, and it won’t become active until the next frame.

Moving around

Now I had the technology in place to scroll the screen and move sprites on top. But I still had to create some kind of acceptable motion to show it off. Initially I thought I could just hardcode a few motions and glue them together with some more hardcoding. Just some movements based on some sine-functions or whatnot. But after I messed about with that idea for a while, I concluded that the movements were still too ‘robotic’, and didn’t really do the effect justice. So I would have to try something else.

So I decided to try and rig up some keyframe-based animation instead, where I could put in some sprite/scroll positions in keyframes, and have smooth spline-based interpolation in between. I used a Hermite spline interpolation in some places, which takes two points to interpolate between, and two other points as ‘control points’ for the total curve, with some extra control values for the ‘tension’ of the path around the points. In other places I used cosine interpolation instead, because it gave slightly smoother, more pleasing results.
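
For reference, both interpolation types look roughly like this in Python. These are the common textbook formulations for a single coordinate, not the code from the keyframe player; in particular the exact tension handling used in the demo is not specified here, so the `tension` parameter below is an assumption.

```python
import math

def cosine_interpolate(p1, p2, t):
    """Ease from p1 to p2 as t goes from 0 to 1."""
    t2 = (1.0 - math.cos(t * math.pi)) / 2.0
    return p1 * (1.0 - t2) + p2 * t2

def hermite_interpolate(p0, p1, p2, p3, t, tension=0.0):
    """Interpolate between p1 and p2; p0 and p3 shape the curve.
    tension=0 gives a Catmull-Rom style curve, tension=1 flattens the tangents."""
    m1 = (1.0 - tension) * (p2 - p0) / 2.0    # tangent at p1
    m2 = (1.0 - tension) * (p3 - p1) / 2.0    # tangent at p2
    t2 = t * t
    t3 = t2 * t
    return ((2 * t3 - 3 * t2 + 1) * p1 + (t3 - 2 * t2 + t) * m1
            + (t3 - t2) * m2 + (-2 * t3 + 3 * t2) * p2)

# Sprite x-position between two keyframes, sampled at a few points in time.
for step in range(5):
    t = step / 4
    print(t, cosine_interpolate(20, 100, t), hermite_interpolate(0, 20, 100, 90, t))
```

Both functions pass exactly through the keyframe values at t = 0 and t = 1; they differ in how they ease in and out in between.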

Because time was really tight by now, I wrote up the whole keyframe player and animation path in a single evening. We stuck with that rough draft for the demo, with the only difference being that I tweaked the speed of the animation somewhat by adjusting the time of some keyframes, to match the part up with the music.


As you may have noticed, this part also uses some fade-in/fade-out techniques. How is that possible? Well, this is an idea brought forth by VileR, which only works in this particular mode, which is effectively mode 6 with colorburst enabled. Namely, mode 6 is a monochrome mode, where the palette register has a special meaning: The background colour is actually interpreted as the foreground colour in this mode. The background colour is always black, and the foreground colour defaults to 15, which is white.
By adjusting the foreground colour, you are modifying the artifact colours. There are a few foreground colours you can use to gradually reduce the overall luminance, namely 15, 7, 8 and 0 (in that order). This is what I use to fade from black to colour at the start, and back to black at the end of the DeLorean part.

The fade-to-white is slightly more complicated. I have done this based on a set of gradients that VileR provided for this mode. I have created a table that maps each colour to the next brighter colour in a gradient (but leaving black black at all times, to create a more interesting effect). Then I just read all pixels on the screen and remap them with this table. Repeat this a few times, and eventually everything saturates to white.
Because the CGA memory is too slow, you can’t do a whole iteration in just one frame. Which is why I opted for the ‘sideways burst’ effect instead. This is more visually pleasing, and doesn’t suffer from the problem that I can’t process the pixels fast enough for a whole screen update.
This idea was based on effects on C64, which rewrite colorram in a similar way to do fade/flash effects. Unlike the foreground-colour trick, this type of fade can be adapted to any mode, and can be used to fade to white, to black, or any other colour, as long as you construct the proper tables for it.
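
A table-driven fade of this kind can be sketched in Python as follows. The `NEXT_BRIGHTER` table is a placeholder that simply steps to the next colour index; the real table follows VileR’s gradients, which are not reproduced here. The sketch assumes the 16-colour format with two 4-bit pixels per byte, and expands the 16-entry colour table to a 256-entry byte table so that both pixels are remapped with a single lookup.

```python
# Placeholder mapping: colour -> 'next brighter' colour. Black stays black,
# everything else creeps towards 15 (white). The real table differs.
NEXT_BRIGHTER = [0] + [min(colour + 1, 15) for colour in range(1, 16)]

# One lookup per byte remaps both packed pixels at once.
BYTE_TABLE = bytes(
    (NEXT_BRIGHTER[value >> 4] << 4) | NEXT_BRIGHTER[value & 0x0F]
    for value in range(256)
)

def fade_step(pixels):
    """Apply one fade iteration to a buffer of packed 4-bit pixels."""
    return pixels.translate(BYTE_TABLE)

screen = bytes([0x00, 0x12, 0xEF])
for _ in range(14):
    screen = fade_step(screen)
print(screen.hex())
```

Fading to black or to any other colour only requires building a different table.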



The vectorbobs part is using the same sprite compiler, but in a different way. Since we are dealing with 3d-sprites here, we need to sort them in z-order. This makes it very difficult to draw them based on a scanline counter while racing the beam. So instead, I opted for another tweakmode here, which is effectively 112×140. This mode takes only 8k of memory, so we have enough memory for a backbuffer. This means we don’t have to worry about racing the beam anymore. Just draw in z-order in the backbuffer, and then update the start offset register to flip the buffers (in this case I do have to wait for vsync, because otherwise I may start drawing in the ‘backbuffer’ while the pageflip is not yet active, so I may cause flicker).
Another advantage here is that I no longer need to draw at the full 60 fps, so I could throw some more vectorbobs at it. During testing I found that 32 bobs would still run quite acceptably. Then I had the idea to generate a more interesting shape, something with text. So I tried to approximate the IBM logo, which ended up at 30 bobs.
Also, since the window is now much smaller, it didn’t make a lot of sense to use a background image here. So I opted for a black background, also using the slightly faster erase-to-black routines I mentioned earlier.

Back to the future

We still have some plans for the future, with this sprite routine. There are some ideas we had, but there was not enough time to work them into the demo. For example, we wanted to support multiple shapes at the same time. This would allow multi-coloured vectorbobs, and vectorbobs that resize based on the z-coordinate, for example. We also wanted to do animated sprites. This can simply be done by calling a different sprite routine for every frame. For a large sprite, such as the DeLorean, you may not want to animate the whole sprite, but rather cut it up into subsections of animated and non-animated parts, to save on the total amount of code you need to generate.

Fun fact: I did some of the development in DOSBox. While DOSBox 0.74 does support emulation of composite video to some extent, it had a bug which prevented the tweakmodes from working properly (in composite mode, the width was hardcoded to 80 bytes, regardless of how you set up the CRTC registers). So I patched this bug and used a custom-built DOSBox for some of the development, until I had to move to real hardware to fine-tune the code.

Posted in Oldskool/retro programming, Software development | 5 Comments

8088 MPH: How it came about

Well, you’ve already seen the demo in my previous post. And Trixter, reenigne and VileR have already covered most of the technical details in these articles:
8088 MPH: We Break All Your Emulators
1K colours on CGA: How it’s done
8088 PC Speaker MOD player: How it’s done
More 8088 MPH how it’s done
CGA in 1024 Colors – a New Mode: the Illustrated Guide
I have done some technical articles myself:
8088 MPH: Sprites? Where we’re going, we don’t need… sprites!
8088 MPH: The polygons

So I think I will approach it from a different angle.

If you have been following this blog closely, you’ll know that I’ve already been in contact with Trixter and reenigne for a few years now, and we have already been sharing some code, knowledge and such regarding oldskool PC programming.

However, about a year ago, the idea of doing a serious 8088+CGA production together started taking shape. In my experience this has mostly been Trixter’s idea. Perhaps even a long-time dream of his, to put the PC on the map as a serious demo platform. He ‘recruited’ VileR because of his discovery of a 512-colour trick, and his excellent skills in working within the limitations of CGA. He ‘recruited’ reenigne because of his cycle-exact CGA hacking, which unlocked new videomodes and made new effects possible for the first time. And he ‘recruited’ me, because he wanted the demo to also have some 3d parts.

I feel it has been an honour to have been part of this project. I’m not sure if he selected me because he thought I could deliver the goods, or just because I was the only one crazy enough to try 3d on 8088+CGA, but it’s all the same :)

Since Trixter asked me to do some polygons, I started writing a sprite compiler. Erm what? Yes… I had played with the idea of a sprite compiler before, and this would be an excellent target platform. So I finally had a good excuse to write it. Also, my initial experiments in trying to port the 3d renderer of 1991 donut to CGA were not too successful:

I figured that I wouldn’t be able to do much more than a small cube at a handful of frames, because of the extremely limited CGA memory write speed (about 170 kb/s). So doing some vectorbobs might be more realistic.
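To put that write speed in perspective, here is a rough back-of-the-envelope calculation. It assumes that "about 170 kb/s" means roughly 170,000 bytes per second, and that a full CGA graphics screen is 16,000 bytes (200 lines of 80 bytes). The function name is made up for illustration, and the estimate ignores all the CPU time needed for transforming and rasterizing.

```c
/* Rough estimate only: how many redraws per second fit in a given
   video memory write speed. Both constants are assumptions, see above. */
#define CGA_WRITE_BYTES_PER_SEC 170000UL  /* "about 170 kb/s" read as bytes */
#define CGA_SCREEN_BYTES        16000UL   /* 200 lines x 80 bytes */

/* Returns redraws per second multiplied by 10, to avoid floating point. */
unsigned long redraws_per_sec_x10(unsigned long bytes_per_frame)
{
    return CGA_WRITE_BYTES_PER_SEC * 10UL / bytes_per_frame;
}
```

Calling `redraws_per_sec_x10(CGA_SCREEN_BYTES)` gives 106, so about 10.6 full-screen redraws per second if the CPU did nothing but write to video memory. An object covering a quarter of the screen (4,000 bytes) would top out at about 42.5.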

I would like to stress how much of a team-effort this demo has been. We set up a mailing list, and started bouncing ideas around, sharing code, debugging and optimizing each other’s code, and getting inspired to do new effects or expand on them.

One example I’d like to share is what led to the final shading on the 3d objects. It started out with a discussion on creating fade-ins or fade-outs. I wanted to try a fade-to-white effect, which is commonly seen on C64. Now, we only have 16 fixed colours, but if you group them into gradients, you can get quite decent results. So I asked VileR if he had some suggestions for gradients that I could use. He came up with a whole list of gradients, where some were unexpectedly large. In one particular graphics mode, he had a blue gradient of 7 colours, and a red gradient of 6 colours.

Initially I was not planning on doing any lighting on the polygons, because the palettes seemed too restricted. But once I saw this, I thought: “Okay, shading may actually look good in that particular palette”. And Trixter had been asking me to add dithering for a long time. Initially I was reluctant to do so, again because the palettes just didn’t seem to have any usable colours to make dithering look good. But now I thought: “If I were to dither between the two nearest colours in these gradients, I’d actually have quite a selection of shades, and it may actually look good!”
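As a minimal sketch of what dithering between the two nearest colours in a gradient could look like: the gradient table below is a placeholder (the values are not the colours VileR actually picked), the function name is hypothetical, and the checkerboard pattern is just one simple choice of dither, not necessarily what the demo uses.

```c
/* Hypothetical 7-entry gradient. The values are placeholders,
   not the colours VileR actually picked. */
static const unsigned char gradient[7] = { 0, 1, 2, 3, 4, 5, 6 };

/* shade: 0 .. 2*(7-1) = 12. Even values are solid colours, odd values
   are a checkerboard of the two neighbouring gradient entries. */
unsigned char dither_pixel(int shade, int x, int y)
{
    int parity = (x ^ y) & 1;            /* checkerboard pattern */
    return gradient[(shade + parity) >> 1];
}
```

With this scheme a gradient of 7 colours yields 13 distinguishable shades: the 7 solid colours plus 6 half-and-half mixes in between.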

The only problem that was left was that shading made the renderer slower, and dithering even more so. Namely, as already explained in Trixter’s coverage, the renderer is designed to only render the changes relative to the previous frame, to save fillrate. If you don’t do any shading, it is basically only updating the changes at the edges. But every time a polygon changes shade, the entire polygon needs to be filled on screen again. And by adding dithering, you effectively double the number of shades you have, so changes in polygon colour will be more frequent. So I implemented it, and shared the results. We voted that the shading and dithering looked good enough to use despite the hit in framerate, so we kept them.
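The cost difference can be illustrated with a simplified sketch for a single scanline. This is not the actual renderer from the demo: the `Span` type, the function names and the one-byte-per-pixel row are all assumptions made to keep the example short.

```c
typedef struct { int x0, x1; unsigned char colour; } Span; /* covers [x0, x1) */

static int inside(Span s, int x) { return x >= s.x0 && x < s.x1; }

/* Updates one scanline from the previous span to the current one and
   returns the number of pixel writes that were needed. */
int update_scanline(unsigned char *row, Span prev, Span cur, unsigned char background)
{
    int writes = 0;
    int x;

    /* Erase pixels the polygon no longer covers. */
    for (x = prev.x0; x < prev.x1; x++)
        if (!inside(cur, x)) { row[x] = background; writes++; }

    /* Fill pixels that are new, or all of them if the shade changed. */
    for (x = cur.x0; x < cur.x1; x++)
        if (cur.colour != prev.colour || !inside(prev, x)) { row[x] = cur.colour; writes++; }

    return writes;
}
```

A 40-pixel span that moves 2 pixels to the right in the same colour costs 4 writes in this model. The same span changing shade costs at least 40 writes, which is why more frequent shade changes hurt the framerate.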

The fade-in/out routines also made it into the demo by the way, they are used in the DeLorean part.

Tools used during production

As you can see in that early cube video, I initially used a Philips XT clone. I got this machine from BokanoiD, and it was very useful during early development. However, as I already discussed earlier with the CGADEMO, it is not a cycle-exact clone of the IBM PC/XT.

Another problem with that machine was that although the ATi Small Wonder videocard had a composite out, it did not appear to generate a signal. Upon closer inspection, it appeared that a lot of components were missing from the card, mostly resistors, a transistor and some caps. These components were basically what makes up the RGBI->NTSC composite DAC ladder. Reenigne and I studied some online photos of other ATi Small Wonders, which did have all the components, and we decided to try and solder them on to see what happens. So reenigne sent me the parts, which I then soldered onto my card… and indeed, the NTSC composite output started working! I am not sure why the components were left off… Perhaps because it was a machine sold in a PAL country, and the NTSC signal wouldn’t work anyway, and might even damage equipment?

However, as I noticed, it only worked correctly in regular RGBI-oriented stuff. The fixed 16 colours in 40×25 textmode worked correctly, and the 4-colour CGA palettes also worked. But when using the colorburst in 640×200 mode, the 16 artifact colours were all wrong compared to a real IBM CGA (either old style or new style).

So, I knew that I had two reasons to locate a real IBM machine: for cycle-exact effects and for proper composite output. I eventually had to buy an IBM PC/XT 5160 machine second-hand. I could not find any with a CGA card installed though. So Trixter offered to send me one of his cards. That way we at least knew that it was an original, tested and compatible IBM CGA card. He only had one old style card though, which he needed at home for captures, so he sent me a new style card. It would be cycle-exact, but its colours would be slightly off. Not too much of a problem for software development. He would take his old style CGA card to Revision, so we could put it in my machine for the final capture on-site. Which is exactly what happened: the capture shown during the compo, which is now also on YouTube, is taken from my machine with Trixter’s old style CGA card.

During development, I initially used a Philips CM8833 monitor. This was the RGBI CGA monitor I originally used back in the late 80s on my Commodore PC10-III. Sadly, it broke down after a while. Luckily Ikilledher had a Commodore 1084S, which he gave to me, so I could continue development with a real monitor. Neither of these monitors did composite in colour though, so I used a Samsung LCD TV for that.

Another problem I ran into was with OpenWatcom. Initially I was using Turbo C++ 3.1 for all my DOS retroprogramming needs. But I ran into some issues, such as that it does not seem to handle uninitialized data properly. Even uninitialized data takes up space in your binary. Since we were on a tight budget of a single 360KB floppy, this was a problem. Also, the compiler is far from state-of-the-art. So I tried looking at OpenWatcom, since that is a more modern cross-compiler, which supports 8088/8086.

After modifying my codebase to be compatible with OpenWatcom, I found that the code would run fine in DOSBox or PCem, but it locked up on a real 8088. I spent a number of hours debugging the exact issue, and eventually pinpointed it to the FPU detection routine in their libc. Namely, on 286 and newer, you can execute an fwait instruction if no FPU is present, and it will just complete immediately. On an 8088/8086 however, the CPU will wait endlessly for the FPU to signal that the bus is free. If you do not have an actual FPU installed, the code will just lock up. So, once I found what was causing this problem, I modified the code in the libc to be compatible with 8088, while maintaining compatibility with newer systems. I have filed a bug report, and made the patched libc available here.

Once I solved this issue, I noticed that my code was still not behaving properly. As I found out, OpenWatcom defaults to unsigned char, where Turbo C++ and MS C/C++ default to signed char. Once I had fixed the datatypes to signed char, I could move to OpenWatcom C for development. The 3d polygon part is done with OpenWatcom C and inline assembly. The sprite part was done in Turbo Assembler.
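A small illustration of why the default matters. The helper names are made up for this example. Converting an out-of-range value to `signed char` is formally implementation-defined in C, but on two's complement compilers such as the ones mentioned here it wraps as shown.

```c
#include <limits.h>

/* 1 if plain char is signed with this compiler and settings, 0 if unsigned. */
int plain_char_is_signed(void) { return CHAR_MIN < 0; }

/* The same byte, widened to int under both interpretations. */
int as_signed(unsigned char byte)   { return (signed char)byte; }
int as_unsigned(unsigned char byte) { return byte; }
```

The byte 0xF0 becomes -16 through `as_signed` and 240 through `as_unsigned`. Code that stores small negative deltas in a plain `char` silently gets the second behaviour when the compiler defaults to unsigned, so spelling out `signed char` wherever the sign matters makes the code independent of the compiler default.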

At the party

As Trixter has already mentioned, we came to the party with a working set of effects, but the two days of development at the party really transformed it into a polished product. The first real response we had was when gasman came to see the demo running on real hardware. We saw a smile on his face in all the right places. That was the first sign that this demo might actually work the way we hoped. Since nobody had ever done a major production on this platform for a major demoparty before, we had no idea how people would respond to it, and whether they would ‘get’ the platform. Which is why we designed the intro sequence to explain that this was not just a PC (286/386/486, VGA, Sound Blaster etc), but actually THE IBM PC, the original from 1981, which is no match for a C64 hardware-wise. We thought our main competitors would be some very strong C64 demos, a platform that has been explored for demos for some 30 years now, with some very experienced demo makers in the scene.

Since the organizers were behind on schedule, and we had a rather ‘unique’ platform (NTSC composite, which is always a gamble in a PAL country, and IBM CGA signals are not 100% perfect anyway, so you need capture equipment that is somewhat forgiving), gasman agreed to let us provide our own capture, since we had a working capture setup already. He did not even ask to inspect the inside of our machines. I think that was a nice sign of respect and trust.

The demo turned out to be a huge success at the party. People loved it, despite the rather crude hardware and rough soundtrack. I think some of the best responses we could have gotten are the ones that say that this is what the demoscene is all about: pushing hardware to the limits and beyond. And how it inspires other people to push on as well.

Loader Text screens

I have seen complaints that the loader font was difficult to read, so I will just give you the texts of all loader screens, in case you had trouble reading them:

welcome to 4.77 mhz! welcome to cga! what’s a bitplane?

and now i see with eyes serene the very heart of the machine

no copper! no vic-ii! what are we supposed to do?

dots are my favours… except when they’re saviour’s.

sprites? where we’re going, we don’t need sprites

you may want to close your eyes for this

race the beam  on your mark… get set… go!

if my calculations are correct, when this baby hits 8088 miles per hour, you’re going to see some serious *!?*

and now we must bid you farewell  no paula… no sid… no problem

Common misconceptions

This demo seems to have become bigger than just the demoscene, and has found its way to various other tech-related newssites, blogs and forums. It even made its way to Adafruit, where a very appropriate quote from Teller happened to appear underneath the article:

Sometimes, magic is just someone spending more time on something than anyone else might reasonably expect.

I suppose that is certainly true for this production. A lot of work went into this production. We are the first to do such a production on this platform, so we had to research the hardware thoroughly, and write our own tools for everything. Reenigne’s blog does a good job of explaining just how much thought went into the new display modes, or into the mod player for example. Some people, even Sylvester Hesp, game developer at Nixxes, think it’s just a case of getting together for a few days, writing some assembly, ‘counting some cycles’, and that’s it. But writing and optimizing the actual code is only a small part of what made this demo possible. We first had to figure out what to write, and how to write it. The same goes for my polygon routines for example, they work very differently from the polygon renderers I have discussed on this blog earlier, and in fact I have never written a polygon routine like this before.

A lot of research went into this demo, in every single part of it. Some people seem to think that you do a demo like this top-down: “I want to do this-and-that effect”, and then code it. But instead, this demo is very much a bottom-up affair: we started off by just studying the hardware, and exploring the limits. As we became more familiar with the hardware, and got more control over it, we got inspired to try new effects, or to add extra features to existing effects (e.g., the sprite part started out as just moving sprites, but later we found a way to combine sprites and scrolling. And initially the sprite would just move near the bottom of the screen, but after some experimenting we managed to make it move over the entire screen without flicker).

Some people also seem to take the comparison with C64 the wrong way. As far as I am concerned, “PC SUXX!” is not a joke. If you have been following my blog, you know that I grew up with a C64, and am still quite fond of the machine and the scene surrounding it. This demo does not intend to prove that the PC is better than the C64. We know it is not, and we know that there are some things the C64 does, that we can never do. Also, the C64 has a head start of about 30 years on us. It would be unrealistic to think that we could close that gap completely with just one demo. That level of refinement takes many years to reach. We were heavily inspired by the C64 scene, and took various ideas from C64 and adapted them to our platform. We approached the PC as a fixed hardware platform, allowing special cycle-exact code and tricks, much like what people do on C64.

Some people also think that our system is a lot faster because we have a 4.77 MHz CPU, and the C64 has a 1 MHz CPU. This is a case of the MHz myth, and Trixter has already covered that in an earlier blog. The short version is that the DRAM modules used in all early 1980s microcomputers are more or less the same speed, so all 8-bit systems have more or less the same memory bandwidth. This dictates most of the performance (there was no caching yet). CPUs may run at vastly different clockspeeds, but performance-wise, they are very similar. The Z80 also ran at 3.5 MHz in most implementations, but a ZX Spectrum still isn’t exactly a computing powerhouse compared to a C64 either.
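A quick calculation makes the point. The clocks-per-access figures used below are not from Trixter's article. They are the textbook minimums for each CPU (4 clocks per bus cycle on the 8088, 1 on the 6502 family, 3 for a plain memory read on the Z80), and they ignore DRAM refresh, wait states and video contention, so treat the results as ballpark numbers only.

```c
/* Peak memory accesses per second = clock rate / clocks per access.
   The clocks-per-access values are textbook minimums and ignore
   DRAM refresh, wait states and video contention. */
unsigned long peak_accesses_per_sec(unsigned long clock_hz, unsigned clocks_per_access)
{
    return clock_hz / clocks_per_access;
}
```

For the 8088 at 4.77 MHz this gives 1,192,500 byte accesses per second, for a 1 MHz C64 it gives 1,000,000, and for a 3.5 MHz Z80 it gives 1,166,666. Despite the very different clock speeds, all three land in the same neighbourhood.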

So where does that leave us? Well, we did not just want to do the first serious demo on the original IBM PC with CGA for the novelty and the ‘1k hack’. We wanted to give it the best we’ve got, so we have optimized all routines in our demo to the best of our abilities, much like many contemporary C64 demos. So we tried to set the standard for this platform as high as we could. I am sure it can be done better, but until someone else makes a demo for this same platform, we won’t really know just how good or bad our attempt was. At some point I said though: “Some stuff looks or sounds so smooth and effortless, that you can’t tell anymore how difficult it was to make”.

And lastly, some people even think the demo is fake. Sadly, it is rather difficult to find the right hardware, but I have found two people who recorded the demo running on their IBM PC 5150:

The first video shows the whole demo, and there are two problems:

1) The high-colour modes do not work, because the IBM CGA card used there has an HD6845 instead of the MC6845 that we used during development. This can be fixed with a small tweak in the code however.

2) The Kefrens bars are unstable. As you can read in the comments, this was caused by a network driver. Doing a clean boot without loading the driver was enough to stabilize that.

The second video does not show the whole demo, but it does show the 256 colour plasma and the 1024 colour girl working, so apparently this PC has a compatible IBM CGA card. Put the two together, and there is the proof that the whole demo works on a real IBM PC 5150 with IBM CGA.

I don’t think that part has quite sunk in yet… We actually won the oldskool demo compo with an IBM PC from 1981 with CGA! That’s just crazy!

Posted in Oldskool/retro programming, Software development

Just keeping it real at Revision 2015

The demo I have made together with Trixter, reenigne, VileR, coda, virt and Phoenix has won the oldskool compo at Revision 2015:

I will discuss some of it in more detail at a later time. Trixter has already done a global write-up of the demo. Reenigne has done pieces on the 1024-colour tweak, the mod player and some other stuff, and VileR has also done a piece on the new CGA colours.

(Update: I have done pieces on the sprite and polygon parts as well now, and some background info on overall development)

For now, I’ll just leave you with the comments in the #revision IRC channel at the time our demo was shown:

00:34 <+Harekiet_rev> CGA!
00:34 < Raylight-PwL> lost for words
00:34 <+vscd[NPL]> WORD!!!! Platfor 8Bit
00:34 < Ham^Sfl> 1981 8088 PC
00:34 < Ham^Sfl> :D
00:34 < pinchy> hehe
00:34 < wullon> Interesting
00:34 <@Wayne^AWY> oh no It’s IBM
00:34 < HellMood> instant love! ^^
00:35 < Ham^Sfl> beware of great audio beepers :D
00:35 < platosha> didn’t know that pc had composite video
00:35 < HellMood> !love
00:35  * SceneSat spots that HellMood simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:35 < Shad0w> :o
00:35 < Ham^Sfl> CGA ! :D
00:35 <+KeyJ> WT…F!
00:35 < joga> \,,/
00:35 <@[shd]> dafuq?
00:35 < cyb> wait WHAT!?!?!?!
00:35 <+KamikaTze> CGA rules!
00:35 <+HopperTSQ>
00:35 < kowalski> O_o
00:35 < mruczek> whoa
00:35 < joga> aww yiss
00:35 <+Krill> blast from the past :)
00:35 < ne7> arg :D i’m spoiling the surprise hehe
00:36 <+jdb78> now THIS is oldskool :D
00:36 < ne7> :)
00:36 < wullon> Love this
00:36 -!- merry_sofascenr [] has quit [Ping timeout]
00:36 < wullon> !love
00:36  * SceneSat spots that wullon simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:36 < Ham^Sfl> LOLve this :D
00:36 < sylto> !love
00:36 < pinchy> the 8088 heart on fire
00:36  * SceneSat spots that sylto simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:36 < cyb> JAW: DROPPED
00:36 < rIO_sK> fuck!
00:36 < Cubed> !likes
00:36  * SceneSat sees that Cubed likes Hornet + CRTC + DESiRE – 8088 MPH!
00:36 < noby> ahh trixter you… trixter!
00:36 <+KamikaTze> Trollmaster 3000!!!
00:36 < Ham^Sfl> in fact, this 1-bit tune is awesome
00:36 < Ham^Sfl> !massive
00:36  * SceneSat agrees with Ham^Sfl that Hornet + CRTC + DESiRE – 8088 MPH is a MASSIVE tune!
00:36 < joga> i love the beeper
00:37 < Sofa6434> amazing
00:37 < noby> !like
00:37  * SceneSat sees that noby likes Hornet + CRTC + DESiRE – 8088 MPH!
00:37 < kowalski> !love
00:37 < AMiiiiiGAAAAA> terrible sound – but nice gfx
00:37  * SceneSat spots that kowalski simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:37 < ne7> catching up now :D
00:37 < ne7> !loves
00:37  * SceneSat spots that ne7 simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:37 < Shad0w> i like the beeps
00:37 < cyb> o_O
00:37 < Ham^Sfl> suddenly I wanna try this one on an old old PC
00:37 <+air2k^rev> bleepcore.
00:37 < rIO_sK> from a pc speaker this isn’t that bad  :D
00:37 < Sofa3554> !massive
00:37  * SceneSat agrees with Sofa3554 that Hornet + CRTC + DESiRE – 8088 MPH is a MASSIVE tune!
00:37 < sylto> oooooook
00:37 <+Urtie> I would straight up not have believed this if I saw this back then
00:37 < Ham^Sfl> 1-bit beepers is all you need
00:37 < cyb> how.the.fuck
00:38 < yessopie> greetz to the virgins who were ritually sacrificied to make this black magic possible
00:38 < Ham^Sfl> back to the past
00:38 <+anti_> I guess I’m already sleeping and this is only a dream…
00:38 < rIO_sK> yessopie: :D
00:38 < sylto> how?
00:38 < HellMood> M A G I C ;)
00:38 <+anti_> hardware hack?
00:38 < noby> well.. :)
00:38 < HellMood> yep
00:38 < Ham^Sfl> 1-bit arppeggios sounds nice
00:38 < Sofa6434> hahaha
00:38 <+vscd[NPL]> nice gfx
00:38 <+KeyJ> I bet that most of these multicolor trick only work over composite
00:38 < pinchy> damn
00:38 < Ham^Sfl> KeyJ: sure
00:39 < yessopie> putting the color into color graphics adapter
00:39 <@[shd]> i gues something like changing the palette with everx x-th scanline
00:39 < AMiiiiiGAAAAA> the muzak is killing my nerves
00:39 <+KamikaTze> my ears hurts
00:39 < noby> KeyJ: that was listed in the platform details on the composlide indeed
00:39 <+Krill> cga over composite? o.O
00:39 < sylto> thanks for the explanation
00:39 < ne7> :) u can do a lot with phasing sounds quick in 1bit :)
00:39 <+KeyJ> Krill: original CGA has a composite output
00:39 < platosha> is pal timing thing may be?
00:39 < kowalski> zx spectrum-like multicolor tricks? )
00:39 <+Harekiet_rev> !awesome
00:39  * SceneSat sees the awesomeness that Hornet + CRTC + DESiRE – 8088 MPH holds, according to Harekiet_rev!
00:39 <@s7evens> !like
00:39 < gizmo__> wtf
00:39  * SceneSat sees that s7evens likes Hornet + CRTC + DESiRE – 8088 MPH!
00:39 < Ham^Sfl> this one is really great
00:40 < ne7> insanely good hehe
00:40 < Ham^Sfl> !likes
00:40 < sw4mp> !insane.
00:40  * SceneSat sees that Ham^Sfl likes Hornet + CRTC + DESiRE – 8088 MPH!
00:40 < cyb> !awesome
00:40  * SceneSat sees the awesomeness that Hornet + CRTC + DESiRE – 8088 MPH holds, according to cyb!
00:40 < joga> !awesome
00:40  * SceneSat sees the awesomeness that Hornet + CRTC + DESiRE – 8088 MPH holds, according to joga!
00:40 <+Krill> KeyJ: wow, didn’t know. only know them with vga-style output
00:40 <+Krill> guess i’m too young :)
00:40 < FLD> !massive
00:40 < Sofa3554> I, for one, welcome our new 8088 overlords
00:40  * SceneSat agrees with FLD that Hornet + CRTC + DESiRE – 8088 MPH is a MASSIVE tune!
00:40 <+KamikaTze> S O U N D
00:40 < gizmo__> how is that even possible
00:40 < easy_john> awesome
00:40 < sylto> innovative and interesting, too 1337 for me
00:40 < pinchy> come on torus
00:40 < gizmo__> interlacing?
00:40 < pinchy> hehe
00:40 < rIO_sK> composite artifacts at their best…awesome
00:40 < FLD> man im glad i stayed up for the last entry :)
00:40 <@velo^aprx>
00:40 < gizmo__> awesome
00:41 < Ham^Sfl> too much Alan Silvestri… GEMA alert!
00:41 < Ham^Sfl> :D
00:41 < yessopie> Ham- I recognized that, too
00:41 <+vscd[NPL]> Seems they have computer which can beat our computers ;)
00:41 < platosha> hmm, artifacts…
00:41 < Sofa6434> 1bit musicdecoder
00:41 < Ham^Sfl> this is a great demo
00:41 < cyb> dosbox needs to be rewritten after this ;)
00:42 <+Harekiet_rev> well that was way more awesome then expected :)
00:42 < Sofa3554> !love
00:42 < anix> good shit
00:42  * SceneSat spots that Sofa3554 simply loooooves Hornet + CRTC + DESiRE – 8088 MPH! Ah, l’amour!
00:42 < gizmo__> wow wow wow
00:42 <+Harekiet_rev> noooo dosbox is fine….
00:42 < Ham^Sfl> I wonder if it would work on DosBOX (probably no)
00:42 < dexxo> great stuff, even with the pc speaker spasms
00:42 <+maep> <3 virt
00:42 <+KeyJ> Ham^Sfl: no way
00:42 < joga> speaker’s supposed to taze you a bit
00:42 <+anti_> We talked about something like this before the BBSs started to die … I think … my brain … too long ago …
00:43 <+KeyJ> I would be totally surprised if that worked on ANY emulator, at all
00:43 <+vscd[NPL]> The last thing I made on my 8088 was Monkey Island ;)
00:43 < rIO_sK> Ham^Sfl: if I’m ight the trick isto use composite output artifacts, so i think not
00:43 < Ham^Sfl> KeyJ: they didnt emulate CGA to the required level
00:43 < noby> PEEEZEEEEE!
00:43 < corpsicle> nice
00:43 < wullon> Winner
00:43 < sylto> wowzerz
00:43 < scenept> brutal
00:43 <+KeyJ> Ham^Sfl: that (especially composite), and I guess they do some timing-sensitive stuff
00:43 < Ham^Sfl> great compos
00:43 <+OergWin98> Scgali: fucking great work
00:43 < joga> tight compos
00:43 < platosha> if it’s a pal encoding trick, i doubt dosbox would ever be capable to emulate it
00:43 <+OergWin98> you guys rock
00:43 < SaphirJD> holy hell, that one was epic
00:44 < gizmo__> fantastic
00:44 < legend_> epic compos
00:44 <+shaderpig> impressive!
00:44 < Piru> uh. compo overdose
00:44 <+KeyJ> racing the beam without interrupts = things gonna fuck up if the cpu speed is the slightest bit off
00:44 < Ham^Sfl> KeyJ: yeah
00:44 <+shaderpig> awesome compos today.
00:44 <+Harekiet_rev> the scroller texts were a bit unclear :(
00:44 <+OergWin98> platosha: It somewhat emulates CGA color trickery, but surely not this awesome shit :D
00:44 < ne7> fabulous :D
00:45 < Salinga> awesome compos. 2015 was a VERY good year so far. :)
00:46 < [adam]> :)
00:46 < Raylight-PwL> Totally awesome!
00:46 < Cubed> Fantastic start, definitely feel inspired to make something now
00:47 < Sofa6434> Wondering if this can be topped in todays compoblock
00:47 < Toffeeman> Very good demos today.

Posted in Oldskool/retro programming, Software development

Triton’s Crystal Dream – the final chapter

A few years ago, I patched some of the code in Triton’s Crystal Dream demo from 1992:

Although my code made it work on my real 486DX2-80 machine with Sound Blaster Pro (the ‘ideal’ setup for this demo), it did not work in DOSBox at the time:

It still does not work properly in Dosbox though, but I think that is related to the way it uses DMA. It also won’t run properly under Windows 95, only from pure DOS. It might be because it does not reset the PIC timer rate to the default value… But I don’t know if and when I’ll ever look into that.

Well, the other day I was fixing some other bugs in DOSBox… namely, the screenwidth was hardcoded for the CGA composite emulation mode. I figured, since I had a working configuration to build the DOSBox source code anyway, I might as well give Crystal Dream another look.
As I said in the earlier blog, I thought the issue was related to the PIC, because DOSBox prints the following messages on the console during the demo:
PIC:ICW4: 1f, special fully-nested mode not handled
PIC:ICW4: 1d, special fully-nested mode not handled
PIC:ICW4: 1f, special fully-nested mode not handled

This turned out to be a wild goose chase however. I thought that the demo would use at least two interrupts, namely the timer interrupt, and the SB DMA interrupt. Since DOSBox cannot handle the special fully-nested mode, I thought perhaps some interrupts got lost.

This was not the case however. I opened up my old project, where I had reversed most of the sound playing code in order to fix the SB detection. And lo and behold: there was no interrupt handler at all for the SB!
Normally you’d get an interrupt when the DMA transfer is complete, so you set up a new buffer. So what is it exactly that Crystal Dream does then?

Well, one part of it is a rather curious DMA mode: It sets up a DMA transfer for 1 byte at a time, and uses the timer interrupt to replace this byte at the replay rate. This apparently works in DOSBox, because you can hear the first buffer playing (it sounds rather distorted though). However, the DMA then stops, while it continues playing on a real system. I looked into the SB emulation code of DOSBox somewhat, and found that if I forced the ‘autoinit’ DMA mode, it would continue playing, without requiring an interrupt handler to restart the DMA. However, the code does not seem to set up the SB for autoinit, which makes sense, since only SB 2.0 and later support it, and this demo would target early hardware as well.

I happened to read about the single-byte DMA technique, when looking into DOSBox-X, a patched version of DOSBox aiming to emulate the hardware more correctly. It is known as ‘goldplay’ there, since Goldplay is an early MOD player which used this technique. I tried using ‘goldplay’ mode on Crystal Dream, but although it made the sound clean (regular DOSBox DMA code does not process the DMA quickly enough, so the sound is very noisy and distorted), it still stopped after the first buffer.

I looked into the code somewhat more, and noticed that there were some comments about Crystal Dream in the code. Apparently this issue has been looked at. And indeed, once I used the correct settings (the patches only apply for regular SB/SBPro configurations, not for the default SB16), DOSBox-X could play back the sound properly.

There is a document on how exactly Crystal Dream handles its Sound Blaster audio, and why it wouldn’t work in DOSBox:
Here is also a discussion on the issue and the fixes in DOSBox-X:
In short, it doesn’t rely on the SB’s interrupts at all. It simply polls the SB status and resets the playback from inside the timer interrupt.
While this works on real hardware, the DOSBox emulation wasn’t accurate enough to allow for such polling. But DOSBox-X contains some patches to handle the situation.
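To make the idea a bit more concrete, here is a toy model of that scheme in plain C. It does not touch real hardware and is not taken from Crystal Dream or DOSBox-X: the `ToyCard` structure, its fields and all function names are invented for illustration. It only mimics the two ingredients described above, namely a one-byte buffer that the timer handler keeps replacing, and a status poll inside that same handler that restarts playback when the card has stopped.

```c
#include <stddef.h>

/* Toy stand-in for the sound card: it consumes `block_len` samples from the
   one-byte DMA buffer and then goes idle until told to start again. */
typedef struct {
    unsigned char dma_byte;  /* the single byte the DMA channel points at */
    unsigned block_len;      /* samples per programmed transfer (hypothetical) */
    unsigned remaining;      /* samples left in the current transfer; 0 = idle */
    unsigned restarts;       /* how often playback had to be restarted */
} ToyCard;

/* What the card does once per sample period: fetch the byte via "DMA". */
int card_fetch(ToyCard *c, unsigned char *out)
{
    if (c->remaining == 0) return 0;   /* idle: nothing is played */
    *out = c->dma_byte;
    c->remaining--;
    return 1;
}

/* What the timer interrupt does once per sample period. */
void timer_tick(ToyCard *c, unsigned char next_sample)
{
    c->dma_byte = next_sample;         /* replace the byte in the buffer */
    if (c->remaining == 0) {           /* poll: has the card stopped? */
        c->remaining = c->block_len;   /* restart playback, no SB interrupt */
        c->restarts++;
    }
}

/* Plays `n` samples; returns how many actually reached the output. */
size_t play(ToyCard *c, const unsigned char *song, unsigned char *out, size_t n)
{
    size_t played = 0;
    size_t i;
    for (i = 0; i < n; i++) {
        timer_tick(c, song[i]);
        if (card_fetch(c, &out[played])) played++;
    }
    return played;
}
```

If you remove the poll from `timer_tick`, the model stops producing output after the first block, which is the same symptom as the first buffer playing and then going silent.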

Posted in Oldskool/retro programming, Software development