More hate fail…

Someone posted a link to my blog on the AnandTech forums… It's funny how none of the responses discuss any of the blog's contents (a pity, since the post was mainly meant as a discussion piece: I present a number of facts, but do not draw any conclusions either way; I leave that up for discussion). They are quick to discuss me personally though. About how I was banned there… Well, the reason I was banned there was quite simple: I was openly criticizing John Fruehe's statements there. Apparently some of the moderators bought into Fruehe's story at the time, so they just saw me as a 'troublemaker' and thought they had to ban me. And, as the internet goes, of course the ban was never undone (let alone my reputation restored) once the ugly truth about Fruehe became known.

Another guy seems to remember another discussion, about ATi vs nVidia anisotropic filtering. Funny how he still insists that I don't know what I'm talking about. The reality, of course, is that his argument was flawed because of a lack of understanding on his part. I never claimed ATi's AF is 'correct', by the way. In fact, my argument was about how arbitrary the whole concept of AF is in general, so 'correctness' does not really come into play at all. Apparently he never understood that point. I merely pointed out that the arguments he tried to use to support his case were flawed (such as claiming that you can never have gray areas when using mipmapped checkerboard textures), and that the filtering can be classified as 'perfectly angle-independent', which does not equate to 'correct': angle-dependency is just one aspect of filtering. The argument he wanted to start was about how the mipmaps may be filtered and/or biased, resulting in undersampling and/or oversampling. Which, as I said in that blog, may or may not result in better perceived quality, even with more angle-dependency. In his case 'quality' seemed to equate to 'fewer gray areas', but as I said, from a theoretical standpoint gray areas can be considered 'correct' when you are at the limit of your mipmap stack.

Well, my blog is still up, and I still stand by what I said at the time about ATi's filtering (and by what I didn't say: it is not 'correct'… nor would slightly different implementations be 'incorrect'). I still say he is wrong, and lacks understanding. And if you disagree, you can still comment.

But well, apparently people don't want any kind of technical discussion; they don't want to understand technology. They just want to attack people. Quite ironic, by the way, that in the same thread I am attacked both for being anti-AMD and for defending AMD/ATi for having better angle-independence than nVidia at the time.

Update: BFG10K thought he had to respond and display his cluelessness again:

Quote:

Scali says: May 29, 2010 at 1:09 pm Ofcourse the gray area is correct for the Radeon 5770.

I am talking about the gray area, not about AF as a whole. It is 'correct' for the Radeon 5770 in the sense that, given the way the filtering is implemented, that is what it should yield. Other cards also yield gray areas near the middle, as you can see in this review for example. They just have slightly different transitions.

It isn’t correct, never was, and never will be. To state otherwise reveals a shocking lack of understanding, especially when reference versions are readily available to compare.

Ah, no technical argument whatsoever, only an appeal to authority (as before). I however *did* give technical arguments: filter down a checkerboard texture, and your smallest mipmap will be gray. That's just what you get when you filter down a texture that is 50% black and 50% white. So it is correct that when you sample the smallest mipmap, you will sample only gray pixels. The only thing that is left up for debate is when and where in your image these gray pixels will become dominant. Which depends on things such as LOD biasing and what kind of approach you are taking with your mipmap sampling (e.g. do you only take the two nearest mipmaps for all samples, or do you select the mipmap for each sample individually? Somewhat arbitrary, yes, but 'flawed', no). With a checkerboard pattern, NOT seeing gray areas would actually indicate a sampling problem in certain areas (you are sampling from mipmaps that have more detail than is warranted, given the texel:pixel mapping). And as I said, the 'moire pattern' that would be painted by the texture noise may be *perceived* as better quality (it gives the impression that the actual checkerboard texture can be seen even at great distances), while from a technical point of view it is not.
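
If you want to see this for yourself, here is a minimal sketch (plain C++ with a simple 2x2 box filter; the texture size and checker size are just picked for the example, and no actual driver necessarily filters exactly like this) that builds the mipmap chain of a checkerboard and prints the minimum and maximum value of each level:

#include <cstdio>
#include <vector>
#include <algorithm>

// Build the next mip level from the previous one with a plain 2x2 box filter.
// 'size' is the width/height of the (square, power-of-two) source level.
static std::vector<float> downsample(const std::vector<float>& src, int size)
{
    int half = size / 2;
    std::vector<float> dst(half * half);
    for (int y = 0; y < half; y++)
        for (int x = 0; x < half; x++)
            dst[y * half + x] = 0.25f * (src[(2 * y)     * size + (2 * x)]     +
                                         src[(2 * y)     * size + (2 * x + 1)] +
                                         src[(2 * y + 1) * size + (2 * x)]     +
                                         src[(2 * y + 1) * size + (2 * x + 1)]);
    return dst;
}

int main()
{
    const int size = 256, checker = 8;          // 8x8-pixel checker squares
    std::vector<float> level(size * size);
    for (int y = 0; y < size; y++)
        for (int x = 0; x < size; x++)
            level[y * size + x] = ((x / checker + y / checker) & 1) ? 1.0f : 0.0f;

    // Filter the chain all the way down and show where the black/white detail disappears.
    for (int s = size; s > 1; s /= 2)
    {
        level = downsample(level, s);
        float mn = *std::min_element(level.begin(), level.end());
        float mx = *std::max_element(level.begin(), level.end());
        std::printf("%3dx%-3d  min %.2f  max %.2f\n", s / 2, s / 2, mn, mx);
    }
    return 0;
}

As soon as a level's texels become larger than the checker squares, min and max both print 0.50: the level is solid gray. No LOD bias or filter kernel can change that; it can only shift where in the image those gray levels start to dominate.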

Referring to reference implementations is missing the point. As I said, there isn't so much one 'correct' way to do realtime AF. There are various ways to implement and tweak an anisotropic filter. One filter, set up a particular way, does not make other filters 'incorrect'.

As this tutorial also points out:

The OpenGL specification is usually very particular about most things. It explains the details of which mipmap is selected as well as how closeness is defined for linear interpolation between mipmaps. But for anisotropic filtering, the specification is very loose as to exactly how it works.

The same goes for Direct3D (the two pretty much share the same specs when it comes to rasterizing, texturing, filtering and shading; after all, they both run on the same hardware). There is a 'gray area' (pun intended) that AF implementations can work in.
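
In fact, the application barely gets a say in the matter at all. In OpenGL you merely request a maximum degree of anisotropy through the EXT_texture_filter_anisotropic extension; the sample pattern, the LOD selection and any shortcuts are left entirely to the implementation. A minimal sketch (assuming the extension tokens from glext.h and a bound 2D texture):

#include <GL/gl.h>
#include <GL/glext.h>   // EXT_texture_filter_anisotropic tokens

// Enable anisotropic filtering on the currently bound 2D texture.
// The extension only promises 'up to' this degree of anisotropy;
// how the samples are actually taken is implementation-defined.
void enableAnisotropicFiltering()
{
    GLfloat maxAniso = 1.0f;
    glGetFloatv(GL_MAX_TEXTURE_MAX_ANISOTROPY_EXT, &maxAniso);

    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MIN_FILTER, GL_LINEAR_MIPMAP_LINEAR);
    glTexParameteri(GL_TEXTURE_2D, GL_TEXTURE_MAG_FILTER, GL_LINEAR);
    glTexParameterf(GL_TEXTURE_2D, GL_TEXTURE_MAX_ANISOTROPY_EXT, maxAniso);
}

Direct3D is no different: you pick D3D11_FILTER_ANISOTROPIC and a MaxAnisotropy value in the sampler description, and that is the full extent of your control.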

AMD themselves admitted the implementation was flawed and changed it (mentioning it in one of the 6000 series slides), but he’s still fighting the good fight on his oh-so-authoritative blog.

Sources for this? I see none. Yes, AMD has improved filtering since the 5000 series. However, that does not imply that the 5000 series was somehow 'flawed' or 'incorrect', so I doubt AMD would use such terms (in fact, I doubt they'd use those terms even if it were true). Well, he surely proved that he doesn't have the comprehension skills to even understand what I wrote. And once again he makes no technical argument whatsoever. The mark of a fanboy: to him, apparently, the implementation of his favourite brand is the only 'correct' one, and everything else must be 'incorrect'. Sadly for him, the D3D/OGL specs don't agree with that. So I present D3D/OGL specs, technical explanations and logical arguments, and he counters with personal insults and other fallacies… And then he even claims he's winning the argument, because *I* would have a lack of understanding? What an idiot. Pure Dunning-Kruger again. He would probably feel quite dumb if he ever read any actual API specs on the topic, but I think we can safely assume he's never going to bother educating himself on the matter in the first place.

Posted in Software development, Direct3D, OpenGL | Leave a comment

Richard Huddy back at AMD, talks more Mantle…

Richard Huddy did an interview with Tech Radar. One of the things he discussed there was the current state of Mantle, and its future.

One interesting passage in the interview is this:

DirectX is a generic API. It covers Intel hardware, it covers Nvidia hardware and it covers ours. Being generic means that it will never be perfectly optimized for a particular piece of hardware, where with Mantle we think we can do a better job. The difference will dwindle as DX 12 arrives. I’m sure they’ll do a very good job of getting the CPU out of the way, but we’ll still have at least corner cases where we can deliver better performance, measurably better performance.

He basically concedes here that Mantle is NOT a generic API, and is cutting a few corners here and there because it only has to support GCN-based hardware. After all, if both DX12 and Mantle were designed to be equally generic (as the original claims about Mantle went: it would run on Intel and nVidia hardware as well), then there would be no corners to cut, and no extra (measurable, note that word) CPU overhead to avoid. The only thing they are avoiding here is the abstraction overhead that is in DX12, which allows it to support GPU architectures from multiple vendors and generations.

And if we just apply some basic logic here: AMD is not *capable* of designing a generic API on their own. DirectX is designed by a committee with all IHVs involved, so as soon as someone proposes a feature or API construct that will not work on some IHV's hardware, that IHV will jump in. So in the end, everything that is in the API will work on all hardware, and any incompatible features have been dropped.

Even if we were to assume that AMD would be fair and impartial to other IHVs in their design, they simply don't have full knowledge of their competitors' inner workings and limitations. So the thought of AMD (or any other IHV) designing a cutting-edge graphics API that is generic enough to be compatible with other IHVs is quite ridiculous anyway.

So, that leaves virtually none of the original claims about Mantle… We’ve already seen earlier that Mantle would not be a console API, and now it is not going to be a generic API either, but it will remain specific to AMD.

Huddy still claims that Mantle is what inspired DX12 though… At the same time he admits that some of the DX12 features are not supported on Mantle and AMD hardware yet:

They are pixel synchronization, which let you do some cool transparency effects and lighting transparent substances which is very, very hard on the current API. There’s something called bindless resources which is a major efficiency improvement again in how the GPU is running, making sure it’s not stalling waiting for the CPU to tell it about some of the changes that are needed.

The point about pixel synchronization… I believe that is actually a reference to order-independent transparency, which comes from Intel and is known as PixelSync.

As for bindless resources… As I already said earlier, nVidia has been doing OpenGL extensions for bindless resources since 2009.
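
To give an idea of what those extensions (NV_shader_buffer_load and NV_vertex_buffer_unified_memory, both from 2009) look like from the application side, here is a rough sketch; the vertex struct and the GLEW-based extension loading are just assumptions for the example, not a complete renderer:

#include <GL/glew.h>    // assumes the NV extension entry points have been loaded

struct Vertex { float x, y, z; };

// Bindless vertex pulling: query the buffer's GPU address once, then feed
// that address to the hardware directly instead of re-binding the buffer
// object (and re-validating it in the driver) on every draw call.
void drawBindless(const Vertex* vertexData, GLsizei vertexCount)
{
    GLuint vbo = 0;
    GLuint64EXT gpuAddr = 0;
    GLsizeiptr vertexBytes = vertexCount * (GLsizeiptr)sizeof(Vertex);

    glGenBuffers(1, &vbo);
    glBindBuffer(GL_ARRAY_BUFFER, vbo);
    glBufferData(GL_ARRAY_BUFFER, vertexBytes, vertexData, GL_STATIC_DRAW);

    // Make the buffer resident and query its GPU virtual address once...
    glMakeBufferResidentNV(GL_ARRAY_BUFFER, GL_READ_ONLY);
    glGetBufferParameterui64vNV(GL_ARRAY_BUFFER, GL_BUFFER_GPU_ADDRESS_NV, &gpuAddr);

    // ...then point the vertex puller at that address directly.
    glEnableVertexAttribArray(0);
    glEnableClientState(GL_VERTEX_ATTRIB_ARRAY_UNIFIED_NV);
    glVertexAttribFormatNV(0, 3, GL_FLOAT, GL_FALSE, sizeof(Vertex));
    glBufferAddressRangeNV(GL_VERTEX_ATTRIB_ARRAY_ADDRESS_NV, 0, gpuAddr, vertexBytes);

    glDrawArrays(GL_TRIANGLES, 0, vertexCount);
}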

So these are some DX12-features that have clearly not originated from AMD, but from its competitors.

Posted in Software development, Direct3D, Software news | Leave a comment

Just keeping it real… bugfixing like it’s 1991

As you may have noticed, the 1991 donut intro did not have any music. I did not cover it in the previous blog, but there was some music planned for this small intro. I chose to use EdLib, because AdLib was one of the few sound cards available for PC back in 1991, and the EdLib replayer is relatively light on CPU, and contains only 16-bit code, so it would work on a 286 like the rest of my code.

However, at the time I was having problems getting the EdLib code to work together with the rest of the intro. I could get the EdLib code to work in a standalone program, but the whole system would crash when I called the same EdLib routines from the intro. I tried to debug it at the party place, but I could not pinpoint the cause at the time, or find a good workaround.

Over the weekend, I tried to give it another look. I had already arrived at the point where I suspected that malloc()’s heap was getting corrupted. And it seemed unlikely that the EdLib code was causing this, since there is nothing suspect going on in the EdLib code. It doesn’t make heavy use of the stack, and it does not call any kind of DOS or BIOS interrupts either, certainly nothing to do with allocating memory. Besides, it would often crash on the first call into the player code, so it looked like the player code was getting corrupted by something else.

I have had a discussion about the Second Reality code with some other demo coders recently. One of the things we discussed was that they constructed a sort of 'loader', which provided various functions to other programs, including the music, through an interrupt handler. So I figured I could apply that idea here: if I write a loader in asm, I know 100% sure that it is not doing anything weird to the memory. If that loader then loads my C program, the C program should stay within its own memory as well, and the two would not corrupt each other.
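
The idea, roughly, sketched here in Turbo C rather than in the actual asm loader (vector 0x66 is picked arbitrarily for the example):

/* Loader sketch: hook an unused interrupt vector, spawn the intro,
   and let the intro call the music player through that vector.
   The player then stays inside the loader's own code and data. */
#include <dos.h>
#include <process.h>

#define PLAYER_INT 0x66

void interrupt playerHandler(void)
{
    /* ... tick the EdLib player here ... */
}

int main(void)
{
    void interrupt (*oldVector)(void);

    oldVector = getvect(PLAYER_INT);       /* remember the old vector        */
    setvect(PLAYER_INT, playerHandler);    /* expose the player to the intro */

    spawnl(P_WAIT, "INTRO.EXE", "INTRO.EXE", NULL);

    setvect(PLAYER_INT, oldVector);        /* restore on exit */
    return 0;
}

The intro itself then only needs a geninterrupt(PLAYER_INT); once per frame, and never has to touch the player's memory directly.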

So I tried that… but I introduced a different bug there, which I overlooked. Initially I thought it was the same bug, and I was just blind to it, as I'd been looking at this code for far too long. So I asked Andrew Jenner and Trixter if they could have a look, because I was about to give up. Andrew found the bug: I had forgotten to call the function that set up the int handler. Apparently that call got deleted while I was experimenting and fixing other things, and I overlooked it. Once I put the line back in, things started working as expected: the intro code would just call into the music player once a frame, and the two processes would live happily side by side. Finally I had a way to play music for my intro!

However, we still had not found the actual bug; we merely had a workaround at this point. So we were not satisfied yet; there was still a challenge to overcome. I decided to set up a minimal program in C, which only loads the music from disk and plays it, to see if we could figure out exactly what goes wrong, and why. I then sent the program off to Andrew and Trixter with my initial analysis:

As far as I could trace it, it seems to be a bug in the linker.
I printed out the address of Player, and the address that malloc() returned.
I also printed out the bytes for the entry point of Player (not Player itself, but offset 0x62e that it jumps to).
This is what happens:
Player: 05110000
Player bytes: 1E060E1F
pSongData: 03D91704
Player bytes: 000C8019

So apparently it mallocs memory somewhat below the player… And after the fread(), the player got overwritten.
So we have 0x5110 for our player, and 0x5494 for our allocated memory. Which is slap-bang in the middle of the player code, right?
So obviously things die when you try to load your song there (or a torus for that matter).

So the question is: why is malloc() returning a block of memory that is part of the mplayer.obj in memory? The song I’m loading is only 4kb, and that is the only data there is. We have separate data and code segments, since it’s a small model, so in theory I should be able to allocate close to 64k before I need to worry about stack trouble.
So to me it looks like there is just something broken in the generated MZ header or something, causing malloc() to place the heap in the wrong place. It is probably placed after the code segment generated by the C compiler, but it does not seem to pay attention to the segment in other objs.
In which case I guess there are 3 possible locations for the bug:
1) The .obj file has an incorrect header, causing the linker to generate incorrect information -> assembling with the version of tasm included with TC++ 3.1 may solve that
2) The .obj file is correct, but the linker generates incorrect information anyway -> linking with a different linker (Microsoft?) may solve that
3) The headers are correct, but there is a bug in the libc causing malloc to interpret the headers wrongly -> roll your own malloc()?
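
For reference, the addresses quoted above came from a quick diagnostic along these lines (just a sketch; the file name and the exact print format are illustrative):

/* Print the far address of the player's entry point and of the malloc()'d
   song buffer, and dump the first player bytes before and after the fread(),
   to see whether loading the song tramples the player. */
#include <stdio.h>
#include <stdlib.h>
#include <dos.h>

extern void far Player(void);                 /* entry symbol of mplayer.obj */

void dumpPlayer(const char* when)
{
    unsigned char far* p = (unsigned char far*)Player + 0x62e;  /* real entry */
    printf("%s: Player %04X:%04X bytes %02X%02X%02X%02X\n",
           when, FP_SEG(p), FP_OFF(p), p[0], p[1], p[2], p[3]);
}

int main(void)
{
    char* pSongData;
    FILE* f;

    dumpPlayer("before");

    pSongData = (char*)malloc(4096);
    printf("pSongData %04X:%04X\n",
           FP_SEG((void far*)pSongData), FP_OFF((void far*)pSongData));

    f = fopen("SONG.EDL", "rb");
    fread(pSongData, 1, 4096, f);
    fclose(f);

    dumpPlayer("after");   /* bytes change if the heap overlaps the player */
    return 0;
}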

I then tried to rebuild the code as stated, but that did not solve it. I also tried to use the Microsoft linker on the code, but although I managed to create a binary, it did not run. It would probably be quite a chore to figure out how exactly a Turbo C++ binary is set up. Then Andrew responded with his analysis:

The problem is as follows:
* The malloc() implementation looks at a variable called __brklvl to decide where to start allocating memory.
* The startup code (c0.asm) initializes the stack and __brklvl using the value of a symbol called edata@ which is in the segment _BSSEND.
* mplayer.obj doesn’t use the normal _TEXT and _DATA segments but instead has a single segment called MUSICPLAYER for both its code and its data. MUSICPLAYER has no segment class.
* tlink places segments without a segment class after the normal _TEXT, _DATA, _BSS, _BSSEND and _STACK segments – i.e. in the very place the startup code assumes is empty.

So the fix to the problem is a one-liner – in mplayer.asm just change the line that says:
musicplayer     segment public
to:
musicplayer     segment public 'far_data'

Then MUSICPLAYER will be placed by the linker after _TEXT and before _DATA, so it won’t collide with anything. I suggest using ‘far_data’ instead of ‘code’ or ‘data’ so that MUSICPLAYER doesn’t take up any space in your normal code and data segments (which are limited to 64kB).

And there we have it! Our answer at last! After changing the segment class of the EdLib player, the code finally worked properly, and I could build a single-file intro with music. It seems I was on the right track with my initial analysis, but there is not really a conclusive answer as to what the bug actually is. You could look at it from various angles:

  1. There was indeed wrong information in the .obj, because it was confusing the linker as to where the _BSSEND should be placed. Rebuilding it (after adding ‘far_data’) fixed it.
  2. The linker was in error, because if it had linked the segments in a different order, _BSSEND would have ended up in the right place after all.
  3. Libc is in error, since you cannot reliably assume that _BSSEND is the last bit of used memory in the binary. Perhaps it should have tried to parse the MZ header fields instead, to work out where the memory is.
  4. One should not link additional segments to a small model program, because that is not in line with the small model definition.

I personally don't really agree with #4. In my interpretation, the memory model only applies to the code that the compiler generates. Since you have far pointers and far function definitions, and even farmalloc() and farfree(), there should be no reason why you can't interface with code and data outside your own code and data segments.
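
A small-model program can perfectly well reach outside its own 64kB segments; for example (the names here are purely illustrative):

/* Small-model code touching code and data outside its own segments:
   a far entry point in another code segment, and a far heap block. */
#include <alloc.h>      /* farmalloc()/farfree() in Turbo C */

extern void far PlayerTick(void);     /* lives in its own segment, e.g. MUSICPLAYER */

void example(void)
{
    void far* buffer = farmalloc(100000L);   /* well beyond the 64kB data segment */
    if (buffer != NULL)
    {
        PlayerTick();                         /* far call out of _TEXT */
        farfree(buffer);
    }
}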

I am leaning mostly towards #2 myself. After all, #1 would not be a conclusive solution. If you are writing the code yourself, you have control over the segments, but in this case MPLAYER.OBJ was supplied by a third party, and normally changing the segment class would not be an option.
And if #2 were fixed, the _BSSEND assumption would always be correct. The linker already seems to have some kind of predefined order for segments with known classes. If it just placed class-less segments at the front of that order, rather than at the end, the problem would be solved.
#3 would also be a robust solution, but it would make libc larger and more complex, so #2 would be preferred, especially in that era.

These are the most annoying, yet at the same time the most interesting, bugs: bugs you just don't see, because they aren't in your own code. You seem to be doing everything right, but it just does not work. Anyway, I may release a final version of the 1991 donut intro, with music, and perhaps a few other small tweaks. But at any rate, the next release WILL have music!

Posted in Oldskool/retro programming | 1 Comment

AMD fanboys posing as developers

The other day I found that this blog was linking back to one of my blog posts. Since I did not agree with the point he was trying to make by referring to my blog, I decided to comment. The discussion that ensued was rather sad.

This guy pretends to be knowledgeable about GPUs and APIs, and pretends to give a neutral, objective view of APIs… But when push comes to shove, it’s all GCN-this, GCN-that. He makes completely outrageous claims, without any kind of facts or technical arguments to back them up.

I mean, firstly, he tries to use the C64 as an example of why being able to hack hardware at a low level is good… But it isn't a good example. Even though we know a lot more about the hardware now than we did at its release in 1982, the hardware is still very limited, and no match for even the Amiga hardware from 1985. Hacking only gets you so far.

He also tries to claim that GCN is relevant, and that the consoles were a huge win for AMD. But they weren't. On the PC, nVidia still out-sells AMD by about 2:1. Only about 12-17% of the installed base of DX11-capable GPUs is GCN.

Also he makes claims of “orders of magnitude” gains of performance by knowing the architecture. Absolute nonsense! Yes, some intricate knowledge of the performance characteristics of a given architecture can gain you some performance… But orders of magnitude? Not in this day and age.

As I said in the comments there: It doesn’t make sense to design a GPU that suddenly has completely different performance characteristics from everything that went before. That would also mean that all legacy applications would be unoptimized for this architecture. A PC is not a console, and should not be treated as such. A PC is a device that speaks x86 and D3D/OGL, and CPUs and GPUs should be designed to handle x86 and D3D/OGL code as efficiently as possible.

Because this is how things work in practice, ‘vanilla’ code will generally run very well out-of-the-box on all GPUs. You could win something here and there by tweaking, but generally that’d be in the order of 0-20%, certainly not ‘orders of magnitude’. In most cases, PC software just has a single x86 codepath and just a single D3D/OGL path (or at least, it has a number of detail levels, but only one implementation of each, rather than per-GPU optimized variations). Per-GPU optimizations are generally left to the IHVs, who can apply application-specific optimizations (such as shader replacements) in their drivers, at a level that the application programmer has no control over anyway.

It’s just sad that so many developers spread nonsense about AMD/GCN/Mantle these days. Don’t let them fool you: GCN/Mantle are niche products. And Mantle doesn’t even gain all that much on most systems, as discussed earlier. So if a fully Mantle-optimized game is only a few percent faster than a regular D3D one (where we know AMD’s D3D drivers aren’t as good as nVidia’s D3D drivers anyway), then certainly ‘orders of magnitude’ of gains by doing GCN-specific optimizations is a bit of a stretch. Especially since virtually all of the gains with Mantle come from the CPU-side. The gains do not come from the GPU running more efficient shaders or anything.

Posted in Direct3D, Hardware news, OpenGL, Software development | 13 Comments

Fifty Years of BASIC, the Programming Language That Made Computers Personal

Scali:

50 years of BASIC. I grew up with BASIC-powered home computers myself, so I recognize a lot in this great article. And indeed, these computers invited you to program, that’s how I got started. Things just aren’t quite the same anymore.

Originally posted on TIME:

Knowing how to program a computer is good for you, and it’s a shame more people don’t learn to do it.

For years now, that’s been a hugely popular stance. It’s led to educational initiatives as effortless sounding as the Hour of Code (offered by Code.org) and as obviously ambitious as Code Year (spearheaded by Codecademy).

Even President Obama has chimed in. Last December, he issued a YouTube video in which he urged young people to take up programming, declaring that “learning these skills isn’t just important for your future, it’s important for our country’s future.”

I find the “everybody should learn to code” movement laudable. And yet it also leaves me wistful, even melancholy. Once upon a time, knowing how to use a computer was virtually synonymous with knowing how to program one. And the thing that made it possible was a programming language called BASIC.


View original 8,109 more words

Posted in Uncategorized | Leave a comment

Who was first, DirectX 12 or Mantle? nVidia or AMD?

There has been quite a bit of speculation on which API and/or which vendor was first… I will just list a number of facts, and then everyone can decide for themselves.

  • Microsoft’s first demonstrations of working DX12 software (3DMark and a Forza demo, Forza being a port from the AMD-powered Xbox One), were running on an nVidia GeForce Titan card, not AMD (despite the Xbox One connection and the low-level API work done there).
  • For these two applications to be ported to DX12, the API and drivers had to have been reasonably stable for a few months before the demonstration. Turn 10, developers of Forza, claimed that the port to DX12 was done in about 4 man-months.
  • nVidia has been working on lowering CPU-overhead with things like bindless resources in OpenGL since 2009 at least.
  • AMD has yet to reveal the Mantle API to the general public. Currently only insiders know exactly what the API looks like. So far AMD has only given a rough global overview in some presentations, which were released only a few months ago, and actual beta drivers have only been around since January 30th. This means Microsoft/nVidia could only have copied its design through corporate espionage and/or reverse engineering, and in an unrealistically short timeframe at that.
  • AMD was a part of all DX12 development, and was intimately familiar with the API details and requirements.
  • DX12 will be supported on all DX11 hardware from nVidia, from Fermi and up. DX12 will only be supported on GCN-based hardware from AMD.
  • The GCN architecture marked a remarkable change of direction for AMD, moving their designs much closer to nVidia's Fermi.

Update: This article at Tech Report also gives some background on DirectX 12 and Mantle development: http://techreport.com/review/26239/a-closer-look-at-directx-12

Posted in Direct3D, Hardware news, OpenGL, Software development, Software news | 25 Comments

DirectX 12: A first quick look

Today, Microsoft presented the first information on DirectX 12 at the Game Developers Conference, and also published a blog on DirectX 12. nVidia responded with a blog on DirectX 12 as well.

In Direct3D 12, the idea of deferred contexts for preparing workloads on other threads/cores is refined a lot further than what Direct3D 11 offered. This should improve efficiency and reduce the CPU load. Another change is that the state is distributed across even fewer state objects than before, which should make state calculation for the native GPU even more efficient in the driver.

Other changes include a more lightweight way of binding resources (probably similar to the ‘bindless resources’ that nVidia introduced in OpenGL extensions last year), and dynamic indexing of resources inside shaders. That sounds quite interesting, as it should make new rendering algorithms possible.

And to prove that this isn't just marketing talk, they ported 3DMark from D3D11 to D3D12 in order to demonstrate the improved CPU scaling and utilization. The CPU time was roughly cut in half. This demonstration also seems to imply that porting code from D3D11 to D3D12 will not be all that much work.

But most importantly: the API will work on existing hardware! (and apparently it already does, since they demonstrated 3DMark and a demo of Forza Motorsport 5 running under D3D12).

NVIDIA will support the DX12 API on all the DX11-class GPUs it has shipped; these belong to the Fermi, Kepler and Maxwell architectural families.

As an aside, nVidia also paints a more realistic picture about API developments than AMD does with Mantle:

Developers have been asking for a thinner, more efficient API that allows them to control hardware resources more directly. Despite significant efficiency improvements delivered by continuous advancement of existing API implementations, next-generation applications want to extract all possible performance from multi-core systems.

Indeed, it’s not like Direct3D and OpenGL have completely ignored the CPU-overhead problems. But you have to be realistic: Direct3D 11 dates back to 2009, so it was designed for entirely different hardware. You can’t expect an API from 2009 to take full advantage of the CPUs and GPUs in 2014. Microsoft did however introduce multiple contexts in D3D11, allowing for multi-core optimizations.
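
For those who have never used them, this is roughly what the D3D11 approach looks like (a minimal sketch with all setup and error handling omitted): a worker thread records commands on a deferred context, and the main thread submits the resulting command list on the immediate context.

#include <d3d11.h>

// Worker thread: record commands without touching the GPU.
void recordOnWorkerThread(ID3D11Device* device, ID3D11CommandList** outList)
{
    ID3D11DeviceContext* deferred = NULL;
    device->CreateDeferredContext(0, &deferred);

    // ... set state and issue draw calls on 'deferred'; they are only recorded ...

    deferred->FinishCommandList(FALSE, outList);   // bake them into a command list
    deferred->Release();
}

// Main thread: actual submission to the driver/GPU.
void submitOnMainThread(ID3D11DeviceContext* immediate, ID3D11CommandList* list)
{
    immediate->ExecuteCommandList(list, FALSE);
    list->Release();
}

D3D12 builds on the same idea, but aims to make the recording path much cheaper and truly parallel, instead of having the driver serialize much of the work behind the scenes.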

As nVidia also says:

Our work with Microsoft on DirectX 12 began more than four years ago with discussions about reducing resource overhead.

Which is actually BEFORE we heard anything about Mantle. The strange thing is that AMD actually claimed, less than a year ago, that there would not be a DirectX 12.

For the past year, NVIDIA has been working closely with the DirectX team to deliver a working design and implementation of DX12 at GDC.

That’s right, they have a working implementation already. Which would be impossible to pull off if they had just copied Mantle (which itself is not even out of beta yet). There simply would not have been enough time, especially since AMD has not even publicly released any documentation, let alone an SDK.

Sadly, we still have to wait quite a while though:

We are targeting Holiday 2015 games

Although… a preview of DirectX 12 is 'due out later this year'.

Update: AMD also wrote a press release on DirectX 12. Their compatibility list only includes GCN-based cards:

AMD revealed that it will support DirectX® 12 on all AMD Radeon™ GPUs that feature the Graphics Core Next (GCN) architecture.

Posted in Direct3D, Software development, Software news | 6 Comments