Intel disables TSX in Haswell

I was going to blog about this earlier, but the timing was rather unfortunate, because I had just published another post. Then it slipped my mind, until the news of the new Haswell EP-based Xeon CPUs came out, that is. So here it is after all (and people can stop saying I only publish things about AMD’s bugs).

So, a while ago, Intel published an erratum relating to the new TSX instructions in Haswell. As you might recall, TSX was one of the most interesting things about Haswell in my opinion. Apparently there is a bug in the implementation, which causes unpredictable behaviour in some cases; it sounds like some kind of race condition. Since there is apparently no way to fix this in microcode, TSX will just be disabled by default. The upcoming Haswell-EX CPUs should have a fixed TSX implementation, at which point it can be enabled again.

As for the new Xeons… well, I don’t think I’ll do a writeup on them. There are some interesting things, such as how the cache is organized in ‘clusters’, and how turbo mode is now so effective that even the 18-core model performs very well in single/low-threaded workloads, making it the best of both worlds. But all of that is already explained in great detail in the various reviews, so I suggest those for further reading.


1991 donut – final release

No-XS composed a special EdLib track for me to use in the 1991 donut, so I could finish it properly and release the final version:

There are a number of small changes and tweaks; I will just quote the nfo file:

The final version differs from the party version in a number of ways:
– AdLib music by No-XS
– Improved logo
– Moving lightsource and a tiny bit of ambient light
– Reduced precalc time
– Code is a lot faster (should no longer crash on very slow machines, not even a 4.77 MHz 8088)
– Instead of separate binaries for 286 and 8088, there is now one binary for all
– Specify ‘low’ on the commandline to switch to the low-poly donut for very slow machines

If you are interested in reading more about this intro, here are the links to the two relevant blogposts:
Just keeping it real… like it’s 1991
Just keeping it real… bugfixing like it’s 1991


8088 Domination by Trixter/Hornet

If you’re into the retro demoscene, you have probably already seen this demo, but just to be safe, I’ll do a quick blog on it anyway, because it’s just that cool.

It started with 8088 Corruption back in 2004:

Basically it is a full-motion video player for the original 4.77 MHz 8088 PC, with CGA, a harddisk and a Sound Blaster card. Due to the limitations of the platform, Trixter decided to use textmode rather than graphics mode, since framerate tends to be more important than overall resolution/detail in terms of reproducing realistic video. He did an interesting talk on that, explaining it in more detail.

Anyway, fast-forward to 2014, and Trixter had some new ideas on how to encode video so that it can be replayed fast enough even in graphics mode, on the exact same platform, which became 8088 Domination:

Trixter has also put a lot of work into explaining the inner workings, so I will just link you to that:

8088 Domination Post-Mortem, Part 1

8088 Domination Post-Mortem, Conclusion

A website documenting the video codec used in the demo, with full source code

Hope you enjoy it!


When old and new meet: lessons from the past

As you may know, I like to dabble in retro-computing. I have held on to most computers I’ve used over the years, and I play with them from time to time. However, an interesting change has happened over the years: in the early days, a computer was more or less designed as a disposable item. As a result, the software was also disposable. By that I mean that a Commodore 64 is just that: a Commodore 64. It is not backward-compatible with the VIC-20 or other Commodore machines that went before it, nor is it forward-compatible with later machines, such as the C16, the Plus/4 or the Amiga. As a result, the software you write on a Commodore 64 is useful for that machine only (the C128 being the exception to the rule, but only because it was designed to have a C64 mode). So once you upgrade to a new machine, you also have to rewrite your software.

However, companies such as Apple and IBM were mainly selling to businesses, rather than to consumers. And for businesses it was much more important that they could continue to use the software and hardware they had invested in when they bought new machines. So in this market, we would see better backward and forward compatibility with hardware and software. Especially on PCs, things seem to have stabilized quite a bit since the introduction of the 386 with its 32-bit mode, and the Win32 API, first introduced with Windows NT 3.1, but mostly made popular by Windows 95. We still use the Win32 API today, and most code is still 32-bit. As a result, the code you write is no longer ‘disposable’. I have code that I wrote in the late 90s that still works today, and in fact, some of it is still in use in my current projects. Of course, the UNIX world has had this mindset for a long time already, at least as far as software goes. Its programming language, C, was specifically designed for easy sharing of code across different platforms. So UNIX itself, and many of its applications, have been around for many years, on various different types of machines. Portability is part of the UNIX culture (or was, rather… Linux did not really adopt that culture).

With Windows NT, Microsoft set out to have a similar culture of portability. Applications written in C, against a stable API, which would easily compile on a variety of platforms. Since I came from a culture of ‘disposable’ software and hardware, I was used to writing code that would only get a few years of usage at most, and did not really think about it much. Focusing on computer graphics did not really help either, since it meant that new graphics APIs would come out every few years, and you wanted to make use of the latest features, so you’d throw out your old code and start fresh with the latest and greatest. Likewise, you’d optimize code by hand in assembly, which also meant that every few years, new CPUs would come out with new instructions and new optimization rules, so you would rewrite that code as well. And then there were new programming languages such as Java and C#, where you had to start from scratch again, because your existing C/C++ code would not translate trivially to these languages.

However, over the years, I started to see that there are some constant factors in what I do. Although escapades in Java and C# were nice at times, C/C++ was a constant factor. C especially could be used on virtually every platform I have ever used. Some of the C/C++ code I wrote more than 15 years ago is still part of my current projects. Which means that the machines I originally wrote it on can now be considered ‘oldskool/retro’ as well. I have kept most of these machines around. The first machine I used for Win32 programming was my 486DX2-80. I also used a Pentium Pro 200 at one time, which was also featured in my write-up about the PowerVR PCX2. And I have a Pentium II 350, with a Radeon 8500. And then there is my work with 16-bit MS-DOS machines, and the Amiga. Both have C/C++ compilers available, so in theory they can use ‘modern’ code. So let’s look at this from both sides: using old code on new platforms, and using new code on old platforms, and see what we can learn from this.

Roll your own SDK

SDKs are nice, when they are available… But they will not be available forever. For example, Microsoft removed everything but Direct3D 9 and higher from its SDK a few years ago. And the DirectX SDK itself is deprecated as well, its last update dating from June 2010. The main DirectX headers have now been moved into the Windows SDK, and extra libraries such as D3DX have been removed. Microsoft also removed the DirectShow code earlier; the base classes used by the samples are especially interesting to keep around, since a lot of DirectShow code you find online is based on them. So, it is wise to store your SDKs in a safe place if you still want to use them in the future.

Another point is: SDKs are often quite large, and cumbersome to install. Especially when you are targeting older computers, with smaller harddisks, and slower hardware. So, for DirectX I have collected all the documentation and the latest header and library files for each major API version, and created my own “DirectX Legacy SDK” for retro-programming. The August 2007 DirectX SDK is the last SDK to contain headers and libraries for all versions of DirectX, as far as I could see. So that is a good starting point. However, some of the libraries may not be compatible with Visual C++ 6.0 anymore (the debug information in dxguid.lib is not compatible with VC++ 6.0 for example), so you may have to dig into older SDKs for that. I then added the documentation from earlier SDKs (the newer SDKs do not document the older things anymore, so you’ll need to get the 8.1 SDK to get documentation on DX8.1, a DX7 SDK for documentation on DX7 etc). And I have now also added some of the newer includes, libraries and headers from the DirectX June 2010 SDK, so it is a complete companion package to the current Windows SDK. When you want to do this, you have to be a bit careful, since some headers in the June 2010 SDK require a recent version of the Windows headers, which cannot be used with older versions of Visual C++. I tried to make sure that all the header files and libraries are compatible with Visual C++ 6.0, so that it is ‘legacy’ both in the sense that it supports all old versions of DirectX, and also supports at least Visual C++ 6.0, which can be used on all versions of Windows with DirectX support (back to Windows 95).

For several other APIs and libraries that I use, such as QuickTime and FMod, I have done the same: I have filtered out the things I needed (and in the case of QuickTime I have also made a few minor modifications to make it compatible with the latest Windows SDK), and I keep this safe in a Mercurial repository. I can quickly and easily install these files on any PC, and get my code to compile on there, without having to hunt down and install all sorts of SDKs and other third-party things. I have even extracted the relevant files from the Windows 8.1 SDK, and use that on Windows XP (you cannot install the SDK on an XP or Vista system, but the SDK is compatible with Visual Studio 2010, so you can use the headers and libraries on an XP system and build XP-compatible binaries, if you configure the include and library paths manually).

So basically: Try to archive any SDKs that you use, and keep note of what OSes they run on, and what versions of Visual Studio and the Windows SDK they are compatible with. Anytime a newer SDK drops support for certain features or APIs, or for older versions of Visual Studio or Windows or such, you may want to take note of that, and keep the earlier SDK around as well, possibly merging the two. And it makes a lot of sense to put the relevant files of the SDK into a code repository (at least the headers and libraries, optionally docs and even sample code). Having version control will help you if you are merging from two or more SDKs, and/or modifying some headers manually (such as the QuickTime example), and allows you to go back to earlier files if something accidentally went wrong, and you run into some kind of compatibility problems. It’s also convenient if you are developing on multiple machines/with multiple developers, because setting up the proper build environment is as simple as cloning the repository on the machine, and adding the proper include and library paths to the configuration. It keeps things nice and transparent, with everything in one place, set up the same way, rather than your SDKs scattered through various subdirectories in Program Files or whatnot.

Working code is not bugfree code

Both my OpenGL and Direct3D codebases have quite a long history. The first versions started somewhere in the late 1990s. Although there have been quite significant updates/rewrites, some parts of the code have been in there since the beginning. My current Direct3D codebase supports Direct3D 9, 10 and 11. The Direct3D 9 code can be traced back to Direct3D 8.1. I started from scratch there, since my Direct3D 7 code was not all that mature anyway, and the API had changed quite a bit. Direct3D 9 was very similar to 8.1, so I could upgrade my code to the new API with mostly just search & replace. In those days, I mainly targeted fixedfunction hardware, and only started to experiment with shaders later. Initially the shaders were written in assembly language; later, HLSL was introduced.

As a result, my Direct3D 9 code has been used on a variety of hardware, from fixedfunction to SM3.0. When I originally started with D3D10-support, I thought it was best to start from scratch, since the APIs are so different. But once I had a reasonably complete D3D10-engine, I figured that the differences could be overcome with some #ifdef statements and some helper functions, as long as I would stick to a shader-only engine. So I ended up merging the D3D9 code back in (D3D9 was still very important at that time, because of Windows XP).
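
To give an idea of what that looks like, here is a minimal sketch of the kind of #ifdef/helper-function glue I mean. It is just an illustration: the USE_D3D10 define, the type aliases and the CreateStaticVertexBuffer() helper are made-up names for this example, not my engine’s actual code.

#include <string.h>
#ifdef USE_D3D10
#include <d3d10.h>
typedef ID3D10Device            GraphicsDevice;
typedef ID3D10Buffer            VertexBuffer;
#else
#include <d3d9.h>
typedef IDirect3DDevice9        GraphicsDevice;
typedef IDirect3DVertexBuffer9  VertexBuffer;
#endif

// A small helper hides the API difference behind a single call site
HRESULT CreateStaticVertexBuffer(GraphicsDevice* dev, const void* data, UINT size, VertexBuffer** out)
{
#ifdef USE_D3D10
    D3D10_BUFFER_DESC desc = {};
    desc.ByteWidth = size;
    desc.Usage = D3D10_USAGE_IMMUTABLE;
    desc.BindFlags = D3D10_BIND_VERTEX_BUFFER;
    D3D10_SUBRESOURCE_DATA init = {};
    init.pSysMem = data;
    return dev->CreateBuffer(&desc, &init, out);
#else
    HRESULT hr = dev->CreateVertexBuffer(size, D3DUSAGE_WRITEONLY, 0, D3DPOOL_DEFAULT, out, NULL);
    if (FAILED(hr)) return hr;
    void* p = NULL;
    hr = (*out)->Lock(0, size, &p, 0);
    if (FAILED(hr)) return hr;
    memcpy(p, data, size);
    return (*out)->Unlock();
#endif
}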

And it turned out to be a good thing. My engine runs in D3D9Ex mode on Vista or higher, which means it will use the new-and-improved resource management of the D3D10 drivers. No more separate managed pool, and no need to reset lost devices. I periodically run it on an old XP machine, to verify that the resource management still works on regular D3D9. And indeed, that is where I caught two minor bugs.
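
For reference, this is roughly how such a D3D9/D3D9Ex split can be set up (a generic sketch, not my engine’s actual code; the CreateD3D9() name is made up for the example). Direct3DCreate9Ex is resolved dynamically, so the same binary still loads on XP, where d3d9.dll does not export that function:

#include <windows.h>
#include <d3d9.h>

typedef HRESULT (WINAPI *PFN_Direct3DCreate9Ex)(UINT, IDirect3D9Ex**);

IDirect3D9* CreateD3D9(bool* isEx)
{
    *isEx = false;
    HMODULE d3d9 = LoadLibraryA("d3d9.dll");
    if (d3d9 == NULL)
        return NULL;

    PFN_Direct3DCreate9Ex createEx =
        (PFN_Direct3DCreate9Ex)GetProcAddress(d3d9, "Direct3DCreate9Ex");
    if (createEx != NULL)
    {
        IDirect3D9Ex* d3dEx = NULL;
        if (SUCCEEDED(createEx(D3D_SDK_VERSION, &d3dEx)))
        {
            *isEx = true;           // Vista or higher: D3D9Ex resource management
            return d3dEx;           // IDirect3D9Ex derives from IDirect3D9
        }
    }
    // Pre-Vista (or Ex creation failed): plain D3D9 with the classic managed pool
    return Direct3DCreate9(D3D_SDK_VERSION);
}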

I then decided to look at pre-DX9 graphics cards. As I said, my D3D9-code was originally used on fixedfunction hardware, and I later merged that code into my D3D10/11-engine. This posed the following dilemma: D3D10+ is fully shader-based. Do I limit my D3D9-code to be shader-only as well, or do I somehow maintain the fixedfunction-code as a D3D9-only feature?

I decided that it would not be too invasive to leave the fixedfunction code in there, so I did. It would just sit there, dormant, since the actual code I was developing had to be shader-based anyway, in order to be compatible with D3D10/11. Since that code was still in there, I decided to grab an old laptop with a Radeon IGP340M, a DX7-class GPU (no pixelshaders). There were a handful of places where the code assumed that a shader would be loaded, so I had to fix up the code there to handle nullpointers gracefully. And indeed, now I could run my code on that machine. I could either use fixedfunction vertexprocessing (based on the old FVF flags) if the shaders were set to null, or use CPU-emulated vertex shaders (the engine has always had an automatic fallback to software vertexprocessing when you try to use shaders; I actually blogged about a bug related to that. The message of that blog was similar: testing on diverse hardware allows you to detect and fix some peculiar bugs). And if the pixelshaders are set to null, the fixedfunction pixel pipeline is enabled as well. This meant that I could at least see the contours of 3d objects, because the vertex processing was done correctly, and the triangles were rasterized with the default pixel processing settings. I then added some texture stage states to my materials to configure the pixel processing, as simplified replacements for the pixel shaders I was using. Et voilà, the application worked more or less acceptably on this old hardware.
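
The fallback boils down to checking for null shaders when applying a material. A simplified sketch (the ApplyMaterial() function and its parameters are just an illustration, not the engine’s real interface):

#include <d3d9.h>

void ApplyMaterial(IDirect3DDevice9* dev, IDirect3DVertexShader9* vs, IDirect3DPixelShader9* ps, IDirect3DTexture9* tex)
{
    // A null vertex shader enables fixedfunction vertex processing, driven by the FVF
    dev->SetVertexShader(vs);
    if (vs == NULL)
        dev->SetFVF(D3DFVF_XYZ | D3DFVF_NORMAL | D3DFVF_TEX1);

    // A null pixel shader enables the fixedfunction pixel pipeline,
    // configured through texture stage states instead
    dev->SetPixelShader(ps);
    if (ps == NULL)
    {
        dev->SetTexture(0, tex);
        dev->SetTextureStageState(0, D3DTSS_COLOROP,   D3DTOP_MODULATE);
        dev->SetTextureStageState(0, D3DTSS_COLORARG1, D3DTA_TEXTURE);
        dev->SetTextureStageState(0, D3DTSS_COLORARG2, D3DTA_DIFFUSE);
        dev->SetTextureStageState(1, D3DTSS_COLOROP,   D3DTOP_DISABLE);
    }
}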

Then it was on to another interesting configuration: my Pentium II 350 with a Radeon 8500. The first thing the Pentium II did was remind me about SSE2 support… I had moved my codebase from D3DXMath to DirectXMath a while ago, as I mentioned earlier. I also mentioned that I was thinking of disabling SSE2 for the x86 build of my engine. Well, apparently that had slipped my mind, but the PII reminded me again: since it does not support SSE2, the code simply crashed. So the first thing I had to do was add the _XM_NO_INTRINSICS_ flag in the right place.
Take note that you also have to recompile DirectXTK with that flag, since it uses DirectXMath as well, and you will not be able to link SSE2-based DirectXMath code to vanilla x86 DirectXMath code, since the compiler will not see the types as equivalent.
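
The change itself is small; the define just has to be seen before DirectXMath is included (in practice you would define it project-wide in the build settings rather than per file). A minimal sketch, with a made-up helper function just to show the usage:

#define _XM_NO_INTRINSICS_      // force the scalar (non-SSE/SSE2) DirectXMath code paths
#include <DirectXMath.h>

using namespace DirectX;

XMMATRIX BuildWorldViewProj(const XMMATRIX& world, const XMMATRIX& view, const XMMATRIX& proj)
{
    // Evaluated with plain FPU code under _XM_NO_INTRINSICS_, so it also runs on a Pentium II
    return XMMatrixMultiply(XMMatrixMultiply(world, view), proj);
}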

Another SSE2-related issue was that the June 2010 DirectX runtime cannot be installed on a non-SSE2 machine. This is because the XAudio component requires SSE2, and once the installer reaches that component, it will run into an error and roll back. However, my engine does not use that component. In fact, I only need D3DX9_43.dll and D3DCompiler_43.dll. Simply copying these files to the \Windows\system32 directory, or the application directory, is enough to make my code run on a non-SSE2 machine.

Now that I had made the code compatible with the Pentium II CPU, it was time to see if it actually worked. And well, it did, but it didn’t, if you know what I mean… That is, the code did what it was supposed to do, but that is not what I expected it to do. Namely, the application ran, and compiled its shaders, but then it appeared to use the same fixedfunction fallback-path as on the Radeon IGP340M. Now, the Radeon 8500 is from a rather peculiar era of programmable GPUs. It is a DirectX 8.1 card, with pixelshader 1.4 functionality. DirectX 9 supports PS1.4 (unlike OpenGL, where there is no standardized API or extension for any shader hardware before SM2.0), so what exactly is happening here?

Well, DirectX 10, that’s what’s happening! Microsoft updated the HLSL syntax for DirectX 10, in order to support the new constant buffers, along with some other tweaks, such as new names for the input and output semantics. Since the first DirectX 10-capable SDK (February 2007), DirectX 9 also defaults to this new compiler. The advantage is that you can write shaders that compile for both APIs. The fine print, however, tells us that this compiler will only compile for PS2.0 or higher (VS1.1 is still supported though). So what happened in my code? My engine detected that the hardware supports PS1.4, and asked the compiler to compile a PS1.4 shader. The compiler does so, but in the process it silently promotes it to PS2.0. So it returns a properly compiled shader, which however is not compatible with my hardware. When I later try to use this shader, the code ignores it, because it is not compatible with my hardware, and so I get the fallback to fixedfunction. Now isn’t that interesting? Everything works as it should, the shader gets compiled, and the code runs without error. You just don’t get to use the shader you compiled.

So, back to the fine print then. I noticed the following flag: D3DXSHADER_USE_LEGACY_D3DX9_31_DLL. Interesting! This makes the function fall back to the compiler of the October 2006 SDK, which was the last SDK before DirectX 10. In fact, that SDK already featured a preview version of D3D10, and also included a beta version of the new compiler. So with this flag you get access to a compiler that supports PS1.x as well. I added a few lines of code to make the engine fall back to this compiler automatically for hardware that is below PS2.0. Then I ran into the problem that my HLSL shaders had been converted to the new D3D10 syntax some years ago, so I did some quick modifications to get them back to the old syntax.
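
The automatic fallback amounts to checking the caps and adding the flag before compiling. A simplified sketch of the idea (the CompilePixelShader() wrapper and its error handling are illustrative, not my engine’s actual code):

#include <windows.h>
#include <d3d9.h>
#include <d3dx9.h>

LPD3DXBUFFER CompilePixelShader(IDirect3DDevice9* dev, const char* src, UINT srcLen,
                                const char* entry, const char* profile)
{
    D3DCAPS9 caps;
    dev->GetDeviceCaps(&caps);

    DWORD flags = 0;
    if (caps.PixelShaderVersion < D3DPS_VERSION(2, 0))
        flags |= D3DXSHADER_USE_LEGACY_D3DX9_31_DLL;    // old (October 2006) compiler, supports ps_1_x

    LPD3DXBUFFER code = NULL, errors = NULL;
    HRESULT hr = D3DXCompileShader(src, srcLen, NULL, NULL, entry, profile,
                                   flags, &code, &errors, NULL);
    if (FAILED(hr))
    {
        if (errors) { OutputDebugStringA((const char*)errors->GetBufferPointer()); errors->Release(); }
        return NULL;
    }
    if (errors) errors->Release();
    return code;        // compiled bytecode, ready for CreatePixelShader()
}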

And there we go! At this point my D3D9 code once again runs on everything it can run on: DX7-class non-shader hardware, SM1.x hardware, and of course the SM2.0+ hardware it normally runs on. Likewise, it once again runs on older CPUs as well, now that it no longer needs extensions such as SSE2. And in the end, it really was only a few lines of code that I had to change. That is how I tend to look at things anyway: I don’t like to think in terms of “This machine/OS is too old to bother supporting”, but rather: “Is it theoretically possible to run this code on such a machine/OS, and if so, how much effort is required?” In this case, it really was not a lot of effort at all, and the experience taught me some of the more intricate details of how the DirectX SDK and D3DX evolved over the years.

So basically: Never assume that your code is bugfree, just because it works. Slightly different machines and/or different use-cases may trigger bugs that you don’t test for in day-to-day use and development. Tinkering with your code on different machines, especially machines that are technically too old for your software, can be fun, and is a great way to find hidden bugs and glitches.

Less is more

I have already covered this issue somewhat in the section on SDKs: newer code is not always backward-compatible with older tools. Especially C++ can be a problem sometimes. Older platforms may not have a C++ compiler at all, or the C++ support may be limited, which means you cannot use things like the STL. This made me rethink some of my code: does it really need to be C++? I decided that most of my GLUX project might as well be regular C, since it is procedural in nature anyway. So I converted the code, which only required a few minor changes. Now it should be compatible with more compilers across more platforms. Which is nice, since I want to use some of the math routines on the Amiga as well.

I noticed that in my BHM project, I had moved to using the STL a while ago, by introducing the BHMContainer class. However, the BHMFile class was still preserved as well, so I could still use that on platforms with no STL support, such as the Amiga, or DOS. This gave me something to think about… In C you do not have standard container classes, so I had written my own linkedlist, vector, hashtable and such, back in the day. In some of my C++ code I still use these container classes, partly because I am very familiar with their performance characteristics. But another advantage is that these container classes can be used with very limited C++ compilers as well, and I could even strip them back down to their original C-only form, to get even better compatibility. So there is something to be said for using your own code instead of the STL.
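
To give an idea of what such a hand-rolled container looks like, here is a minimal sketch (written in the C-compatible subset, so it builds with very limited compilers; the List_* names are just for this example):

#include <stdlib.h>

typedef struct ListNode {
    void*            data;
    struct ListNode* next;
} ListNode;

typedef struct LinkedList {
    ListNode* head;
} LinkedList;

void List_Init(LinkedList* list)
{
    list->head = NULL;
}

int List_PushFront(LinkedList* list, void* data)
{
    ListNode* node = (ListNode*)malloc(sizeof(ListNode));
    if (node == NULL)
        return 0;               /* out of memory */
    node->data = data;
    node->next = list->head;
    list->head = node;
    return 1;
}

void List_Free(LinkedList* list)
{
    ListNode* node = list->head;
    while (node != NULL) {
        ListNode* next = node->next;
        free(node);
        node = next;
    }
    list->head = NULL;
}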

Another thing is the use of the standardized datatypes found in stdint.h/stdbool.h/stddef.h and related files. The most useful here are the fixed-size datatypes. C/C++ have always had the problem that the size of each datatype is defined by the architecture/platform, rather than being an absolute size. Although integers are 32-bit on most contemporary platforms, you may run into 16-bit integers on older platforms. Especially for something like a file format, the right way to define your datastructures is to use typedef’ed types, and make sure that they are the same size on all platforms. This has always been good practice anyway, but since the introduction of the fixed-size datatypes (int8_t, int16_t, int32_t etc), you no longer have to do the typedefs yourself.

Old compilers may not have these headers, or may only have a limited implementation of them. But it is not a lot of work to just do the typedefs manually, and then you can use all the ‘modern’ code that makes use of them. Since my BHM code did not yet use proper typedefs, I have updated the code. That is actually a very important update to the code: finally the datastructures are guaranteed to be exactly the same, no matter what compiler or architecture you use. This means that I can now use BHM for my MS-DOS retro-projects as well, for example.
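
The manual typedefs look something like this (a sketch: the HAVE_STDINT_H guard is a name I picked for the example, the sizes assume a 16-bit int / 32-bit long target such as a 16-bit DOS compiler, and the chunk header only illustrates the idea, it is not the actual BHM layout):

#ifdef HAVE_STDINT_H
#include <stdint.h>
#else
typedef signed char    int8_t;
typedef unsigned char  uint8_t;
typedef signed short   int16_t;
typedef unsigned short uint16_t;
typedef signed long    int32_t;
typedef unsigned long  uint32_t;
#endif

/* A file-format structure now has the same layout everywhere
   (byte order and struct packing still need separate attention) */
typedef struct ChunkHeader {
    uint32_t id;        /* four-character chunk identifier */
    uint32_t size;      /* payload size in bytes */
} ChunkHeader;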

Lastly, neither my MS-DOS nor my Amiga C++ compiler supports namespaces. So that was another thing to think about. In some cases, the old C-style ‘namespace’ technique may be better: you just prefix your symbol names with some kind of abbreviation. For example, in my CPUInfo project I prefix some things with ‘CI_’. This should avoid name clashes in most cases, and is compatible with virtually all compilers.
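
In practice it is as simple as this (illustrative declarations, not CPUInfo’s actual interface):

#define CI_VENDOR_STRING_LEN 13

int  CI_DetectCPU(void);
int  CI_HasCPUID(void);
void CI_GetVendorString(char* buffer);   /* buffer of at least CI_VENDOR_STRING_LEN bytes */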

So basically: if you want to go for maximum compatibility and portability, less is more, especially with C++. If you are writing code that is mostly procedural, you might as well use regular C instead of C++. And if you do want to use C++, you can re-use your code on more platforms and compilers if you stick to a relatively basic subset of C++ functionality (e.g. not using namespaces, not using the STL, and only using relatively basic classes and templates).

Conclusion

Apparently there are some things we can learn from using old code on new systems, and from using new code on old systems. It is often not too hard to get the two to meet. I hope the experiences I have written down here have given you some new insights into how to write and maintain code that is not ‘disposable’, but is as flexible, portable and compatible as can be. In general, this idea of ‘digital longevity’ still seems to be quite new. With the disappearance of Windows XP, a lot of people are only now starting to face the problem that the programs they are using may not work on new systems. And they probably wish they had archived their code and applications at an earlier stage, and mapped out the compatibilities and incompatibilities with machines, OSes and tools better.


More hate fail…

Someone posted a link to my blog on the AnandTech forums… It’s funny how none of the responses discuss any of the blog’s contents (sad, since the post was mainly meant as a discussion piece: I pose a number of facts, but do not draw any conclusions either way; I leave that up for discussion). They are quick to discuss me personally though. About how I was banned there… Well, the reason I was banned there was quite simple: I was openly criticizing John Fruehe’s statements there. Apparently some of the moderators bought into Fruehe’s story at the time, so they just saw me as a ‘troublemaker’, and thought they had to ban me. And as the internet goes, of course the ban was never undone (let alone my reputation restored) once the ugly truth about Fruehe became known.

Another guy seems to remember another discussion, about ATi vs nVidia anisotropic filtering. Funny how he still insists that I don’t know what I’m talking about. The reality, of course, is that his argument was flawed, because of a lack of understanding on his part. I never claimed ATi’s AF is ‘correct’, by the way. In fact, my argument was about how arbitrary the whole concept of AF is in general, so ‘correctness’ does not really come into play at all. Apparently he never understood that point. I merely pointed out that the arguments he tried to use to support his case were flawed (such as claiming that you can never have gray areas when using mipmapped checkerboard textures), and that the filtering can be classified as ‘perfectly angle-independent’ (which does not equate to ‘correct’… angle-dependency is just one aspect of filtering. The argument he wanted to start was about how the mipmaps may be filtered and/or biased, resulting in undersampling and/or oversampling. Which, as I said in that blog, may or may not result in better perceived quality, even with more angle-dependency. In his case ‘quality’ seemed to equate to ‘fewer gray areas’, but as I said, from a theoretical standpoint, gray areas can be considered ‘correct’ when you are at the limit of your mipmap stack).

Well, my blog is still up, and I still stand by what I said at the time about ATi’s filtering (and what I didn’t say: I didn’t say it is ‘correct’… nor would slightly different implementations be ‘incorrect’; it is merely ‘within specifications’). I still say he is wrong, and lacks understanding. And if you disagree, you can still comment.

But well, apparently people don’t want any kind of technical discussions, they don’t want to understand technology. They just want to attack people. Quite ironic by the way that I am both attacked for being anti-AMD, and for defending AMD/ATi for having better angle-independency than nVidia at the time, in the same thread.

Update: BFG10K thought he had to respond and display his cluelessness again:

Quote:

Scali says: May 29, 2010 at 1:09 pm Ofcourse the gray area is correct for the Radeon 5770.

I am talking about the gray area, not about AF as a whole. Which is ‘correct’ for the Radeon 5770, as in, the way it is implemented, that is what it should yield. Other cards also yield gray areas near the middle, as you can see in this review for example. They just have slightly different transitions.

It isn’t correct, never was, and never will be. To state otherwise reveals a shocking lack of understanding, especially when reference versions are readily available to compare.

Ah, no technical argument whatsoever, only appeal to authority (as before). I however *did* give technical arguments: filter down a checkerboard texture, and your smallest mipmap will be gray. That’s just what you get when you filter down a texture that is 50% black and 50% white. So, it is correct that when you sample the smallest mipmap, you will sample only gray pixels. The only thing that is left up for debate is when and where in your image these gray pixels will become dominant. Which depends on things such as LOD biasing and what kind of approach you are taking with your mipmap sampling (eg, do you only take the two nearest mipmaps for all samples, or do you select the mipmap for each sample individually? Somewhat arbitrary yes, but ‘flawed’, no). With a checkerboard pattern, NOT seeing gray areas would actually indicate a sampling problem in certain areas (you are sampling from mipmaps that have more detail than is warranted given the texel:pixel mapping). And as I said, the ‘moire pattern’ that would be painted by the texture noise may be *perceived* as better quality (it gives the impression that the actual checkerboard texture can be seen even at great distances), while from a technical point of view it is not.
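
As a small worked example of the gray-mipmap point (just an illustration, with arbitrary texture and checker sizes): box-filter a checkerboard down its mipmap chain, and once a texel covers an equal number of black and white squares, the values settle at 0.5, so the smallest levels are uniformly gray.

#include <stdio.h>
#include <vector>

int main()
{
    const int size = 256, checker = 32;
    std::vector<float> level(size * size);
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            level[y * size + x] = (((x / checker) + (y / checker)) & 1) ? 1.0f : 0.0f;

    // Build each next mip level by averaging 2x2 blocks (a simple box filter)
    for (int s = size; s > 1; s /= 2) {
        int half = s / 2;
        std::vector<float> next(half * half);
        for (int y = 0; y < half; y++)
            for (int x = 0; x < half; x++)
                next[y * half + x] = 0.25f * (level[(2*y) * s + (2*x)] + level[(2*y) * s + (2*x+1)] +
                                              level[(2*y+1) * s + (2*x)] + level[(2*y+1) * s + (2*x+1)]);
        level.swap(next);
    }

    // The 1x1 top level is 0.5: sampling it can only ever return gray
    printf("1x1 mip level value: %f\n", level[0]);
    return 0;
}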

Referring to reference implementations is missing the point. As I said, there isn’t so much one ‘correct’ way to realtime AF. There are various ways to implement and tweak an anisotropic filter. One filter, set up a particular way, does not make other filters ‘incorrect’.

As this tutorial also points out:

The OpenGL specification is usually very particular about most things. It explains the details of which mipmap is selected as well as how closeness is defined for linear interpolation between mipmaps. But for anisotropic filtering, the specification is very loose as to exactly how it works.

The same goes for Direct3D (they both pretty much share the same specs when it comes to rasterizing, texturing, filtering and shading. After all, they both run on the same hardware). There is a ‘gray area’ (pun intended) that AF implementations can work in.

AMD themselves admitted the implementation was flawed and changed it (mentioning it in one of the 6000 series slides), but he’s still fighting the good fight on his oh-so-authoritative blog.

Sources for this? I see none. Yes, AMD has improved filtering since the 5000 series. However, that does not imply that the 5000 series was somehow ‘flawed’ or ‘incorrect’, so I doubt AMD would use such terms (in fact, I doubt they’d use those terms even if it were true). Well, he surely proved that he doesn’t have enough comprehension skills to even understand what I wrote. And once again he makes no technical argument whatsoever. The mark of a fanboy: to him, apparently the implementation of his favourite brand is the only ‘correct’ one, and everything else must be ‘incorrect’. Sadly for him, the D3D/OGL specs don’t agree with that. So I pose D3D/OGL specs, technical explanations and logical arguments, and he counters with personal insults and other fallacies… And then even claims he’s winning the argument, because *I* would have a lack of understanding? What an idiot. Pure Dunning-Kruger again. He would probably feel quite dumb if he ever read any actual API specs on the topic, but I think we can safely assume he’s never actually going to bother and try to educate himself on the matter in the first place.


Richard Huddy back at AMD, talks more Mantle…

Richard Huddy did an interview with Tech Radar. One of the things he discussed there was the current state of Mantle, and its future.

One interesting passage in the interview is this:

DirectX is a generic API. It covers Intel hardware, it covers Nvidia hardware and it covers ours. Being generic means that it will never be perfectly optimized for a particular piece of hardware, where with Mantle we think we can do a better job. The difference will dwindle as DX 12 arrives. I’m sure they’ll do a very good job of getting the CPU out of the way, but we’ll still have at least corner cases where we can deliver better performance, measurably better performance.

He basically concedes here that Mantle is NOT a generic API, and is cutting a few corners here and there because it only has to support GCN-based hardware. After all, if both DX12 and Mantle were designed to be equally generic (as the original claims about Mantle went: it would run on Intel and nVidia hardware), then there would be no corners to cut, and no extra (measurable, note that word) CPU overhead to avoid. The only thing they are avoiding here is the abstraction overhead in DX12, which allows it to support GPU architectures from multiple vendors and generations.

And if we were to just apply some basic logic here: AMD is not *capable* of designing a generic API on their own. DirectX is designed by a committee with all IHVs involved, so as soon as someone proposes some kind of feature or API construct that will not work on some IHV’s hardware, that IHV will jump in. So in the end, everything that is in the API will work on all hardware, and any incompatible features have been dropped.

Even if we were to assume that AMD would be fair and impartial to other IHVs in their design, they simply don’t have full knowledge of their competitors’ inner workings and limitations. So the thought of AMD (or any other IHV) designing a cutting-edge graphics API that is generic enough to be compatible with other IHVs’ hardware is quite ridiculous anyway.

So that leaves virtually none of the original claims about Mantle standing… We already saw earlier that Mantle would not be a console API, and now it is not going to be a generic API either; it will remain specific to AMD.

Huddy still claims that Mantle is what inspired DX12 though… At the same time he admits that some of the DX12 features are not supported on Mantle and AMD hardware yet:

They are pixel synchronization, which let you do some cool transparency effects and lighting transparent substances which is very, very hard on the current API. There’s something called bindless resources which is a major efficiency improvement again in how the GPU is running, making sure it’s not stalling waiting for the CPU to tell it about some of the changes that are needed.

The point about pixel synchronization… I believe that is actually a reference to order-independent transparency, which comes from Intel and is known as PixelSync.

As for bindless resources… As I already said earlier, nVidia has been doing OpenGL extensions for bindless resources since 2009.

So these are some DX12 features that clearly did not originate from AMD, but from its competitors.


Just keeping it real… bugfixing like it’s 1991

As you may have noticed, the 1991 donut intro did not have any music. I did not cover it in the previous blog, but there was some music planned for this small intro. I chose to use EdLib, because AdLib was one of the few sound cards available for the PC back in 1991, and the EdLib replayer is relatively light on the CPU and contains only 16-bit code, so it would work on a 286 like the rest of my code.

However, at the time I was having problems getting the EdLib code to work together with the rest of the intro. I could get the EdLib code to work in a standalone program, but the whole system would crash when I called the same EdLib routines from the intro. I tried to debug it at the party place, but I could not pinpoint the cause at the time, or find a good workaround.

Over the weekend, I tried to give it another look. I had already arrived at the point where I suspected that malloc()’s heap was getting corrupted. And it seemed unlikely that the EdLib code was causing this, since there is nothing suspicious going on in the EdLib code. It doesn’t make heavy use of the stack, and it does not call any kind of DOS or BIOS interrupts either, certainly nothing to do with allocating memory. Besides, it would often crash on the first call into the player code, so it looked like the player code was getting corrupted by something else.

I had a discussion about the Second Reality code with some other demo coders recently. One of the things we discussed was that they constructed a sort of ‘loader’, which provided various functions to other programs, including the music, through an interrupt handler. So I figured I could apply that idea here: if I write a loader in asm, I know with 100% certainty that it is not doing anything weird to the memory. If that loader then loads my C program, the C program should stay within its own memory as well, and the two would not corrupt each other.

So I tried that… but I introduced a different bug there, which I overlooked. Initially I thought it was the same bug, and I was just blind to it, as I’d been looking at this code for far too long. So I asked Andrew Jenner and Trixter if they could have a look, because I was about to give up. Andrew found the bug: I forgot to call the function that set up the int handler. Apparently that line got deleted while I was experimenting and fixing other things, and I overlooked it. Once I put the line back in there, things started working as expected: the intro code just calls into the music player once per frame, and the two processes live happily side by side. Finally I had a way to play music for my intro!
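
The concept is simple on the C side, by the way: once the loader has installed its handler, the intro just triggers that interrupt once per frame. A rough Turbo C-style sketch (the vector number 0x66 is purely an example here, not necessarily what the loader actually uses):

#include <dos.h>

#define PLAYER_INT 0x66         /* example vector installed by the asm loader */

void PlayMusicFrame(void)
{
    /* The loader's handler saves and restores any registers it touches,
       so the C code only has to generate the software interrupt */
    geninterrupt(PLAYER_INT);
}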

However, we still had not found the actual bug; we merely had a workaround at this point. So we were not satisfied yet; there was still a challenge to overcome. I decided to set up a minimal program in C, which only loads the music from disk and plays it, to see if we could figure out exactly what goes wrong, and why. I then sent the program off to Andrew and Trixter with my initial analysis:

As far as I could trace it, it seems to be a bug in the linker.
I printed out the address of Player, and the address that malloc() returned.
I also printed out the bytes for the entry point of Player (not Player itself, but offset 0x62e that it jumps to).
This is what happens:
Player: 05110000
Player bytes: 1E060E1F
pSongData: 03D91704
Player bytes: 000C8019

So apparently it mallocs memory somewhat below the player… And after the fread(), the player got overwritten.
So we have 0x5110 for our player, and 0x5494 for our allocated memory. Which is slap-bang in the middle of the player code, right?
So obviously things die when you try to load your song there (or a torus for that matter).

So the question is: why is malloc() returning a block of memory that is part of the mplayer.obj in memory? The song I’m loading is only 4kb, and that is the only data there is. We have separate data and code segments, since it’s a small model, so in theory I should be able to allocate close to 64k before I need to worry about stack trouble.
So to me it looks like there is just something broken in the generated MZ header or something, causing malloc() to place the heap in the wrong place. It is probably placed after the code segment generated by the C compiler, but it does not seem to pay attention to the segment in other objs.
In which case I guess there are 3 possible locations for the bug:
1) The .obj file has an incorrect header, causing the linker to generate incorrect information -> assembling with the version of tasm included with TC++ 3.1 may solve that
2) The .obj file is correct, but the linker generates incorrect information anyway -> linking with a different linker (Microsoft?) may solve that
3) The headers are correct, but there is a bug in the libc causing malloc to interpret the headers wrongly -> roll your own malloc()?

I then tried to rebuild the code as stated, but that did not solve it. I also tried to use the Microsoft linker on the code, but although I managed to create a binary, it did not run. It would probably be quite a chore to figure out how exactly a Turbo C++ binary is set up. Then Andrew responded with his analysis:

The problem is as follows:
* The malloc() implementation looks at a variable called __brklvl to decide where to start allocating memory.
* The startup code (c0.asm) initializes the stack and __brklvl using the value of a symbol called edata@ which is in the segment _BSSEND.
* mplayer.obj doesn’t use the normal _TEXT and _DATA segments but instead has a single segment called MUSICPLAYER for both its code and its data. MUSICPLAYER has no segment class.
* tlink places segments without a segment class after the normal _TEXT, _DATA, _BSS, _BSSEND and _STACK segments – i.e. in the very place the startup code assumes is empty.

So the fix to the problem is a one-liner – in mplayer.asm just change the line that says:
musicplayer     segment public
to:
musicplayer     segment public 'far_data'

Then MUSICPLAYER will be placed by the linker after _TEXT and before _DATA, so it won’t collide with anything. I suggest using ‘far_data’ instead of ‘code’ or ‘data’ so that MUSICPLAYER doesn’t take up any space in your normal code and data segments (which are limited to 64kB).

And there we have it! Our answer at last! After changing the segment class of the EdLib player, the code finally works properly, and I can build a single-file intro with music. It seems I was on the right track with my initial analysis, but there is not really a conclusive answer as to what the bug actually is. You could look at it from various angles:

  1. There was indeed wrong information in the .obj, because it was confusing the linker as to where the _BSSEND should be placed. Rebuilding it (after adding ‘far_data’) fixed it.
  2. The linker was in error, because if it had linked the segments in a different order, _BSSEND would have ended up in the right place after all.
  3. Libc is in error, since you cannot reliably assume that _BSSEND is the last bit of used memory in the binary. Perhaps it should have tried to parse the MZ header fields instead, to work out where the memory is.
  4. One should not link additional segments to a small model program, because that is not in line with the small model definition.

I personally don’t really agree with #4. In my interpretation, the model only applies to the code that the compiler generates. Since you have far pointers and far function definitions, and even farmalloc() and farfree(), there should be no reason why you can’t interface with code and data outside your own code and data segments.

I am leaning mostly towards #2 myself. Namely, #1 would not be a conclusive solution. If you are writing the code yourself, you have control over the segments. But in this case, MPLAYER.OBJ was supplied by a third party, and normally, changing the segment class would not be an option.
And if #2 were fixed, then the _BSSEND assumption would always be correct. The linker already seems to have some kind of predefined order for segments with known classes. If it just placed class-less segments at the front of that order, rather than at the end, the problem would be solved.
#3 would also be a robust solution, but it would make libc larger and more complex, so #2 would be preferred, especially in that era.

These are the most annoying, but at the same time most interesting, bugs: bugs you just don’t see, because they aren’t in your code. You seem to be doing everything right, but it just does not work. Anyway, I may release a final version of the 1991 donut intro, with music, and perhaps a few other small tweaks. But at any rate, the next release WILL have music!
