Just keeping it real, part 6

It has been a while, but you know I’ve been keeping it real in the meantime. First, I’d like to mention a demo I collaborated on, which is loosely related to this series. The demo is called The Chalcogens, by Quebarium, DESiRE and TRSI, and it won first place at the Recursion 2012 demo competition. There is a Windows port included as well, and a YouTube capture, but I prefer the video I’ve recorded on a real GP2X, which was generously donated to me by Matja:

My contribution to this demo is the 3D part. Since the GP2X does not have any 3D acceleration, a software renderer was required. In that sense the GP2X is still quite oldskool, even though the device dates from 2005. The GP2X does not have a floating point unit either, much like the 286, so it requires a fixed point approach, just like the 286 and 486 (the 486 does have an FPU, but it is quite slow, so I used fixed point there as well for better performance). I built this software renderer out of the oldskool 286 and 486 polygon routines that I’ve discussed on this blog earlier.

The Power of VR

A while ago, I already mentioned my old VideoLogic Apocalypse 3Dx card, equipped with a PowerVR PCX2 3d acceleration chip. VideoLogic was the old brand name of the company that develops the PowerVR line of 3d accelerators; these days they go by the name of Imagination Technologies, but the PowerVR name has stuck. Currently they only design the core logic and license the designs to other companies, which integrate them into their own chips. They are very popular in combination with ARM CPU cores in System-on-a-Chip designs, for smartphones, tablets and various other embedded applications. You can find their designs in iPhones, iPads and various Android smartphones, and they also provide the graphics in Intel’s Atom line of CPUs.

So, that is quite a success story. But in other areas of the market, they have not been so successful. There have been only a handful of graphics cards for the PC with a PowerVR chip. It started with an OEM part for Compaq Presario machines, known only by its codename ‘Midas 3’. Then came the PCX1, codenamed ‘Midas 4’, which was more or less a single-chip version of Midas 3, and could be found on VideoLogic’s own add-on 3d accelerator card. This chip was then updated to the PCX2, codenamed ‘Midas 5’, which added features such as bilinear texture filtering. Aside from VideoLogic’s 3Dx card, the PCX2 was also sold as the Matrox m3D (the only time Matrox ever sold a card that was not based around one of their own chips, which shows just how tough the competition was in those early days of 3d acceleration). These three chips form the ‘Series 1’ generation. After that came the very obscure Neon 250 card (‘Series 2’), and then the moderately successful Kyro and Kyro II (‘Series 3’), before PowerVR dropped out of the PC market for good.

There has also been one console powered by a Series 2 PowerVR chip: the Sega DreamCast (the PowerVR chip was actually chosen over a 3dfx VooDoo2 chip). Sega also used PowerVR chips in some of their arcade machines at the time. But neither the PC market nor Sega were doing very well, and PowerVR focused on mobile and embedded devices from then on. A wise choice, as it would seem: they are arguably the largest player in the phone/tablet market now. But why did they fail in the PC market, and why did they succeed in the mobile/embedded market? It has a lot to do with their unique approach to 3d acceleration.

Tile-based deferred rendering

The designers of the PowerVR line of products have always believed in working smarter, not harder. 3dfx went for a brute-force approach: process every pixel of every triangle thrown at the chip, and reject occluded pixels only at the last moment, by storing the depth values of drawn pixels in a z-buffer and comparing against it. Everyone else followed their example, but not PowerVR. They figured that texturing and shading was an expensive, bandwidth-intensive process, so if texturing and shading could be avoided for (parts of) triangles that were obscured by other triangles, they could save a lot of bandwidth and processing power. This is known as ‘deferred rendering’: texturing and shading are deferred until the problem of visibility has been solved.

These days, ‘deferred rendering’ is mostly known as a software approach, applied to conventional z-buffered accelerators: in a first pass, the geometry is rendered to the z-buffer only. In a second pass, the geometry is rendered again, this time with shaders and textures enabled, and a pixel is written only when its z-value is equal to the value stored in the z-buffer. Since modern accelerators have at least some z-buffer optimizations (making them somewhat of a hybrid between brute force and PowerVR’s full TBDR), this approach is generally faster than a single-pass approach, because the shaders and textures are not applied to pixels that are not visible.
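To make that concrete, here is a minimal sketch of such a two-pass approach on a conventional API (plain OpenGL here; drawScene is a hypothetical callback that submits the geometry):

```cpp
#include <GL/gl.h>

// A minimal depth pre-pass sketch, assuming a GL context is already set up;
// drawScene is a hypothetical callback that issues all the geometry.
void renderWithDepthPrepass(void (*drawScene)(void))
{
    // Pass 1: depth only. Colour writes are off, so no shading/texturing cost.
    glEnable(GL_DEPTH_TEST);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_TRUE);
    glDepthFunc(GL_LESS);
    drawScene();

    // Pass 2: full shading/texturing, but a fragment only passes where its
    // depth equals the stored value, so each visible pixel is shaded once.
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_FALSE);      // the depth buffer is already final
    glDepthFunc(GL_EQUAL);
    drawScene();
}
```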

PowerVR, however, tackled this problem in hardware, and eliminated the need for a z-buffer in video memory altogether. This is where the tiles come in. The screen is divided into a grid of tiles, which the chip processes one at a time (these tiles were 32×16 pixels, at least in the early generations of PowerVR chips). The chip contains an on-chip buffer of high-speed memory, just large enough to store the buffers for a single tile (think of it as a sort of L1 cache). So a very high-speed z-buffer is available on-chip. The hardware can perform something similar to the two-pass deferred rendering mentioned above: first only the z-values are determined, and then shading and texturing are applied only to the visible pixels, with the hardware keeping track of which material to apply to each pixel. After a tile has finished rendering, it is copied back to a framebuffer in video memory (which also means the device can render correctly without a backbuffer, with the tiles ‘racing the beam’). The z-values can simply be discarded, as they are no longer required. So no z-buffer is needed in video memory at all.
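My mental model of this, written out as a much simplified software sketch. To be clear, this is not the actual hardware algorithm; real hardware bins the triangles per tile up front, instead of testing all of them per pixel as this sketch does:

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// One screen-space triangle: three 2d edge functions for coverage, a z-plane
// for depth interpolation, and a material id for the deferred shading pass.
struct Tri {
    float ex[3], ey[3], ec[3]; // edge i covers (x,y) when ex*x + ey*y + ec >= 0
    float za, zb, zc;          // z(x,y) = za*x + zb*y + zc
    int   material;
};

constexpr int TILE_W = 32, TILE_H = 16;   // tile size of the early chips

static bool covers(const Tri& t, float x, float y)
{
    for (int i = 0; i < 3; ++i)
        if (t.ex[i]*x + t.ey[i]*y + t.ec[i] < 0.0f)
            return false;
    return true;
}

// Hypothetical stand-in for the deferred texturing/shading stage.
static uint32_t shade(int material) { return 0xFF000000u | uint32_t(material); }

// Assumes w and h are multiples of the tile size, for brevity.
void renderFrame(int w, int h, const std::vector<Tri>& tris, uint32_t* fb)
{
    for (int ty = 0; ty < h; ty += TILE_H)
    for (int tx = 0; tx < w; tx += TILE_W) {
        float tileZ[TILE_H][TILE_W]; // 'on-chip' z: never touches video memory
        int   tileM[TILE_H][TILE_W]; // which material won each pixel

        // 1. Hidden surface removal: depth only, no texturing or shading yet.
        for (int y = 0; y < TILE_H; ++y)
        for (int x = 0; x < TILE_W; ++x) {
            tileZ[y][x] = std::numeric_limits<float>::max();
            tileM[y][x] = -1;
            const float px = float(tx + x), py = float(ty + y);
            for (const Tri& t : tris)
                if (covers(t, px, py)) {
                    const float z = t.za*px + t.zb*py + t.zc;
                    if (z < tileZ[y][x]) {
                        tileZ[y][x] = z;
                        tileM[y][x] = t.material;
                    }
                }
        }

        // 2. Deferred shading: each visible pixel is shaded exactly once.
        // 3. The finished tile is written to the framebuffer in video memory;
        //    the tile's z-values are simply discarded.
        for (int y = 0; y < TILE_H; ++y)
        for (int x = 0; x < TILE_W; ++x)
            if (tileM[y][x] >= 0)
                fb[(ty + y)*w + (tx + x)] = shade(tileM[y][x]);
    }
}
```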

This TBDR, then, is both PowerVR’s strong point and its weak point. The strong point is obvious: no z-buffer is required in video memory at all, saving both memory and bandwidth, and more memory bandwidth is saved because textures are only fetched when required. The weak point is that the tile-based approach is not an ‘immediate mode’ rendering approach. APIs such as Direct3D and OpenGL are built on the assumption that each triangle is rendered immediately, but because of the tile-based approach, rendering of a tile cannot start until all triangles have been batched up.

The OpenGL API does not have explicit markers to indicate where a frame starts and ends, so determining when all triangles for a frame have been received can be a bit troublesome. Direct3D does have them: the BeginScene() and EndScene() markers. However, since these appear not to do anything on most hardware, developers are often sloppy with their usage, and liberally make use of the z-buffer outside of these boundaries, where the API does not guarantee that the z-buffer has valid contents. The PowerVR hardware can actually store its depth information in a z-buffer in video memory if required, but the code has to be written correctly for it to do so.
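For reference, this is how the API intends a frame to be bracketed (Direct3D 7 interfaces here, since that is the highest version the PCX2 supports). On most immediate-mode hardware these calls are effectively no-ops, but a TBDR relies on them:

```cpp
#include <d3d.h>

// All rendering that relies on a valid z-buffer belongs strictly between
// BeginScene and EndScene. A TBDR also uses this bracket to know when the
// triangle batch for the frame is complete.
HRESULT drawFrame(IDirect3DDevice7* dev)
{
    HRESULT hr = dev->BeginScene();
    if (FAILED(hr))
        return hr;

    // ... DrawPrimitive()/DrawIndexedPrimitive() calls go here; only inside
    // this bracket does the API guarantee valid z-buffer contents ...

    return dev->EndScene();   // the TBDR can now sort per tile and rasterize
}
```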

This z-buffer can also be used as a workaround for another problem with TBDR: the hardware can only batch up so many triangles per frame. Since a regular immediate mode renderer has no such limit, OpenGL and Direct3D don’t pose any limits on the number of triangles you can emit per frame; each triangle is rendered immediately anyway. But a TBDR needs to batch up all triangles first, before it can start sorting them per tile and render. So if the triangle buffer overflows, the hardware has to render the partial buffer first, spill the result to a z-buffer in memory, and then batch up more triangles, re-using the saved z-information from the previous pass(es).

So the result was that OpenGL and Direct3D applications were quite buggy on PowerVR cards; rendering errors and poor performance were quite common. Then again, only the Kyro had true OpenGL support anyway; the earlier generations were still in the MiniGL era, where drivers were more or less custom-made for a single game. This poor compatibility played a big role in the failure of PowerVR on the PC platform. Like most early 3D accelerators, the Series 1 and Series 2 chips also had their own API: PowerSGL. In the handful of games that were ported to PowerSGL, the cards performed quite well, and image quality was generally quite good for the time. Here is a PCX2 card running GLQuake with the custom-made MiniGL driver (which runs on top of PowerSGL):

The Series 2 chip was also quite successful in the Sega DreamCast. Since all games targeted the hardware directly, there were no issues with rendering bugs or performance. And when the hardware is used to its full potential, the results are quite nice. Another nice consequence of the on-chip tile was that PowerVR could offer 32-bit rendering and alphablending early on: the blending only had to be done inside the tile anyway, so once again there was no need for high-bandwidth video memory. The Kyro generation also added 8 texture stages, at a time when the competition only supported 2 or 3: it is simple to roll multiple texture passes into one when you are already doing deferred rendering anyway. Another trick the Kyro could do was on-the-fly mipmap generation for trilinear filtering: it would only fetch texels from the largest mipmap, and generate the smaller mipmap level directly from those, rather than having to fetch from two separate mipmaps in different memory locations. So, full of clever tricks then.
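The mipmap trick boils down to a box filter over the parent level. A sketch of the idea, generating a level 1 texel from level 0 (deeper levels would need a correspondingly larger footprint):

```cpp
#include <cstdint>

struct Texture { const uint32_t* texels; int w, h; };  // ARGB8888, level 0

// Average four ARGB8888 texels channel by channel, with rounding.
static uint32_t avg4(uint32_t a, uint32_t b, uint32_t c, uint32_t d)
{
    uint32_t result = 0;
    for (int shift = 0; shift < 32; shift += 8) {
        const uint32_t sum = ((a >> shift) & 0xFF) + ((b >> shift) & 0xFF)
                           + ((c >> shift) & 0xFF) + ((d >> shift) & 0xFF);
        result |= ((sum + 2) / 4) << shift;
    }
    return result;
}

// Texel (x, y) of mip level 1, derived on the fly from its 2x2 footprint
// in level 0, so only the largest mipmap is ever fetched from memory.
uint32_t mip1Texel(const Texture& level0, int x, int y)
{
    const int x0 = x * 2, y0 = y * 2;
    const uint32_t* t = level0.texels;
    return avg4(t[ y0      * level0.w + x0], t[ y0      * level0.w + x0 + 1],
                t[(y0 + 1) * level0.w + x0], t[(y0 + 1) * level0.w + x0 + 1]);
}
```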

Sadly, one trick that PowerVR had trouble with was hardware T&L. Because the triangles had to be batched up before rendering, it was quite difficult to integrate this with hardware T&L, let alone programmable shaders. So the Kyro II was still a software T&L card, which had to go up against the GeForce2s and Radeons of this world.

For low-power mobile devices, an efficient TBDR that does not require a lot of fast, power-hungry video memory is a very valuable feature. So PowerVR GPUs became very successful in mobile devices. Over time, the TBDR technology was refined further, and the problems with hardware T&L, programmable shaders and limited triangle batching were overcome. The current PowerVR GPUs are fully capable of OpenGL ES 2.0, OpenGL 3.0 and Direct3D 10.1. PowerSGL has long been abandoned as well; Series 2 was the last to support it.

PowerSGL in practice

Well, after this short history lesson, let’s get back to the PowerVR card that I have here: my VideoLogic Apocalypse 3Dx, with the Series 1 PCX2 chip on it. This was the first 3d accelerator I ever had. I did not have the PowerVR SDK at the time, though, so I did not have access to the PowerSGL API; my only option was Direct3D. I wrote my first Direct3D code on this card, rendering a simple cube. I kept my old code around, and when I recently started playing around with that old box again, I decided that I wanted to try to render the cube with all versions of Direct3D up to 7, as that was the highest version supported by the PCX2. Then I upgraded the cube to a torus, which shows off the proper hidden surface removal performed by the card, as well as smooth specular highlights. This led to the video that I showed earlier:

This time, however, I had access to the PowerVR SDK for this card, so I wanted to familiarize myself with PowerSGL. I also wanted to see if it could render faster than the ~47 fps that I was getting from Direct3D (which I had already optimized to use quads instead of triangles, plus other minor tweaks I found in the SDK documentation; the object consists of 1400 quads, by the way). After skimming through the SDK documentation, I figured it was easiest to start with the PowerSGL Direct API. This is a simple low-level API, where you feed triangles in screen space directly to the driver. I could just build a simple transform pipeline myself and pass the triangles along, without having to study the API in depth.
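The naive pipeline is simple enough to sketch out. This is roughly the shape of it (no near-plane clipping shown, and the actual PowerSGL Direct submission call is left out, since I am not reproducing the SDK headers here):

```cpp
// Per-vertex work of the naive pipeline: object space -> camera space ->
// perspective divide -> viewport mapping. The resulting screen-space
// triangles are what gets handed to PowerSGL Direct (call not shown).
struct Vec3      { float x, y, z; };
struct ScreenVtx { float x, y, z; };   // z is kept for the hardware's HSR

ScreenVtx transformVertex(const Vec3& v, const float m[12],
                          float halfW, float halfH, float focal)
{
    // m is a 3x4 object-to-camera matrix (rotation + translation), row-major.
    const float cx = m[0]*v.x + m[1]*v.y + m[2]*v.z  + m[3];
    const float cy = m[4]*v.x + m[5]*v.y + m[6]*v.z  + m[7];
    const float cz = m[8]*v.x + m[9]*v.y + m[10]*v.z + m[11];

    // Perspective divide and mapping to pixel coordinates (y flipped).
    const float invZ = 1.0f / cz;
    return { halfW + focal*cx*invZ, halfH - focal*cy*invZ, cz };
}
```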

The result was slightly disappointing… Using PowerSGL Direct did not make it faster than Direct3D:

At this point I was wondering if perhaps my naive transform pipeline was suboptimal, as I just processed all vertices and left the backface culling to PowerSGL. In theory it might be faster to cull backfaces in object space, and only send the frontfaces to PowerSGL. But I figured I would try the high-level PowerSGL API first: being a high-level interface, where transform and lighting are handled by the API itself, I expected it would also perform optimizations for backfaces and such.
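Object-space backface culling is a cheap test, for what it’s worth; a sketch (reusing Vec3 from the pipeline sketch above):

```cpp
// A face is front-facing when the object-space vector from one of its
// vertices towards the camera points to the same side as the face normal.
// cameraInObjectSpace is the camera position transformed back into object
// space (one inverse transform per object, instead of one per vertex).
struct Face { Vec3 normal, v0; };

bool isFrontFacing(const Face& f, const Vec3& cameraInObjectSpace)
{
    const Vec3 toCam = { cameraInObjectSpace.x - f.v0.x,
                         cameraInObjectSpace.y - f.v0.y,
                         cameraInObjectSpace.z - f.v0.z };
    return f.normal.x*toCam.x + f.normal.y*toCam.y + f.normal.z*toCam.z > 0.0f;
}
```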

It took a while to get the hang of the API. Everything is list-based, and you can recursively build lists from sub-lists. Using ‘named items’ (basically just unique integers, much like object names in OpenGL), you can access objects inside these lists, such as lights, transforms, cameras and polygons.
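In pseudo-code, building a scene looks roughly like this. To be clear: the identifiers below are hypothetical stand-ins that only illustrate the structure, not the actual PowerSGL entry points:

```cpp
// Illustrative only: hypothetical stand-ins for the PowerSGL calls.
// The point is the structure: everything lives in (nested) lists, and
// 'named items' are just integer handles, much like names in OpenGL.
typedef int named_item;

named_item sgl_create_list_hyp();                       // hypothetical
named_item sgl_add_transform_hyp(named_item list);      // hypothetical
named_item sgl_add_light_hyp(named_item list);          // hypothetical
named_item sgl_add_mesh_hyp(named_item list,
                            const float* verts, const int* quads, int nQuads);
void       sgl_set_transform_hyp(named_item xform, const float matrix[12]);
void       sgl_render_hyp(named_item scene, named_item camera);

void buildAndRender(const float* verts, const int* quads, int nQuads,
                    named_item camera, const float rotation[12])
{
    named_item scene = sgl_create_list_hyp();        // top-level display list
    named_item xform = sgl_add_transform_hyp(scene); // a 'named item'
    sgl_add_light_hyp(scene);
    sgl_add_mesh_hyp(scene, verts, quads, nQuads);

    // Per frame: update the named transform; the API performs T&L itself.
    sgl_set_transform_hyp(xform, rotation);
    sgl_render_hyp(scene, camera);
}
```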

But, after some reading and a bit of trial-and-error, I managed to define the donut as a PowerSGL mesh object, built from quads:

And alas, again the dreaded ~47 fps. At this point I was reasonably convinced that I was entirely limited by the speed of the rasterizer. I still optimized the PowerSGL Direct pipeline anyway, just for fun. But, as I suspected, discarding the backfaces before passing the geometry to PowerSGL did very little for performance. I did see ~48 fps, but that is within the margin of error.
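For reference, the quad mesh that both the Direct3D and the PowerSGL paths render boils down to a simple parametric loop. A sketch of the vertex generation (the radii are placeholder values, and the PowerSGL mesh-creation calls are omitted):

```cpp
#include <cmath>
#include <vector>

struct Vtx { float x, y, z; };

// 40 sections around the ring, 35 sides around the tube: 1400 quads.
// Quad (i, j) uses vertices (i, j), (i+1, j), (i+1, j+1), (i, j+1),
// with both indices wrapping around.
std::vector<Vtx> torusVertices(int sections = 40, int sides = 35,
                               float R = 2.0f, float r = 0.75f)
{
    const float PI2 = 6.2831853f;
    std::vector<Vtx> v;
    v.reserve(sections * sides);
    for (int i = 0; i < sections; ++i) {           // around the ring
        const float a = PI2 * i / sections;
        const float ca = cosf(a), sa = sinf(a);
        for (int j = 0; j < sides; ++j) {          // around the tube
            const float b = PI2 * j / sides;
            const float cb = cosf(b), sb = sinf(b);
            v.push_back({ (R + r*cb) * ca, (R + r*cb) * sa, r * sb });
        }
    }
    return v;
}
```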

There is one other way to render geometry with PowerSGL however…

Infinite Planes

Yes, infinite planes. This term has been circulating in PowerVR circles for quite some time, but if you look at old reviews, the reviewers generally seem to have a lot of trouble explaining what it entails exactly.

Well, it is not that difficult really. As you might know, mathematically speaking, a plane in 3d space is infinitely large, and divides the space into two half-spaces. And if you are somewhat familiar with constructive solid geometry, you may already know that you can define objects as combinations of various objects/spaces. Infinite planes are exactly that: you define a convex polyhedron by specifying a number of planes. Each plane has a ‘positive’ and a ‘negative’ side/half-space, and the intersection of all positive half-spaces defines your convex polyhedron. Or, as the PowerVR documentation says: think of it as a large block of stone, where each plane slices off a piece. A picture is worth more than a thousand words here:

Infinite Planes
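In code, the ‘block of stone’ idea is just a series of half-space tests; a minimal sketch:

```cpp
// A plane n.p + d = 0 splits space in two; a point is inside the convex
// polyhedron when it is on the 'positive' side of every defining plane.
struct Plane { float nx, ny, nz, d; };   // keep the half-space n.p + d >= 0

bool insideConvex(const Plane* planes, int count,
                  float px, float py, float pz)
{
    for (int i = 0; i < count; ++i)
        if (planes[i].nx*px + planes[i].ny*py + planes[i].nz*pz
            + planes[i].d < 0.0f)
            return false;   // sliced off by this plane
    return true;            // survived every cut
}
```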

At the time of Series 1, the PowerVR people thought this was an interesting idea. And indeed, for low-poly convex polyhedra it is a very elegant way to define geometry: a cube needs only 6 planes, for example. Early accelerated 3d games were very plane/cube-oriented anyway; think of Tomb Raider or Quake.

The obvious problem is that the modeling tools need to be able to work with infinite planes. Since this wasn’t the case, the use of infinite planes remained very limited. Besides, as geometry became more detailed, infinite planes became more cumbersome to use, and standard triangle meshes are simply a better solution.

Is a donut a good case for infinite planes? Not at all! A donut is obviously not a convex shape (which is why it’s such an interesting object to test with, as I mentioned earlier regarding the sorting problems on my 286). Can it be rendered with infinite planes at all? Well, yes: you can construct a donut from ‘tube sections’, where each tube section itself is a convex polyhedron, and as such can be constructed from infinite planes.
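As a sketch, this is how one tube section could be sliced out of half-spaces in its own local frame (reusing the Plane struct from above). Note that in the real donut each section is slightly bent, so the caps are not exactly perpendicular to the axis; I am ignoring that here:

```cpp
#include <cmath>

// 35 side planes tangent to the tube's cross-section, plus two cap planes
// that cut this section out of the ring: 37 planes per tube section.
void tubeSectionPlanes(Plane out[37], float radius, float halfLength)
{
    const float PI2 = 6.2831853f;
    for (int i = 0; i < 35; ++i) {
        const float a = PI2 * i / 35;
        // Inward-facing normal: keeps cos(a)*x + sin(a)*y <= radius.
        out[i] = { -cosf(a), -sinf(a), 0.0f, radius };
    }
    out[35] = { 0.0f, 0.0f,  1.0f, halfLength };   // keeps z >= -halfLength
    out[36] = { 0.0f, 0.0f, -1.0f, halfLength };   // keeps z <=  halfLength
}
```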

Another problem presents itself here: a convex polyhedron is limited to 100 planes in the PowerSGL API. My donut is constructed from 40 tube sections of 35 faces each. So a single tube section already needs 35 planes to construct the cylinder, plus another two planes for the top and bottom caps. That means I can only define the donut as a list of convex objects, not as a single object. So, just how good are these infinite planes then? Well, not very good:

We have pretty much arrived at single-digit framerates. The exact implementation of the hardware has always remained somewhat vague. Some people seem to think that infinite planes are actually the ‘native’ format for the rasterizer, and that triangles and quads are constructed from 4 or 5 infinite planes respectively (one plane for the polygon itself, and then 3 or 4 planes that act as the edges). But I don’t think this is the case. My tube-section definition should be more efficient than defining each quad separately with 5 planes (which would lead to no less than 7000 planes, whereas my solution has 1480 planes, vs 1400 quads). Yet the quad-based donut renders a lot faster.

So I think the rasterizer is quite conventional. It may make use of half-spaces to define polygon edges in 2d, but that is very common in hardware implementations anyway. Perhaps the hardware is capable of handling more than 3 or 4 edges at a time, allowing it to render convex n-gons efficiently, but that is about as far as I think it goes. The infinite planes may make it slightly more efficient to find these edges in 2d, at least for simple convex polyhedra. But the donut experiment seems to indicate that infinite planes certainly don’t always win over conventional polygon meshes. Therefore it is easy to conclude that the hardware does not have some kind of magic rasterizer that handles these infinite planes natively.
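For reference, this is the common 2d half-space technique I mean: every edge of a convex polygon yields a linear ‘edge function’, and a pixel is covered when all edge functions are non-negative. This extends from triangles to convex n-gons for free, which would fit what the hardware offers:

```cpp
// Edge function for the directed edge (x0,y0) -> (x1,y1): positive on the
// left of the edge, so a counter-clockwise polygon keeps all its pixels
// on the positive side of every edge.
struct Edge { float a, b, c; };   // inside when a*x + b*y + c >= 0

Edge makeEdge(float x0, float y0, float x1, float y1)
{
    return { y0 - y1, x1 - x0, x0*y1 - x1*y0 };
}

bool pixelCovered(const Edge* edges, int n, float x, float y)
{
    for (int i = 0; i < n; ++i)
        if (edges[i].a*x + edges[i].b*y + edges[i].c < 0.0f)
            return false;
    return true;
}
```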

Real-time shadows in 1996

One special feature of these convex polyhedra, however, is that they can also cast light or shadow volumes. That was unique, and way ahead of its time: shadow volumes weren’t commonly used until years later, when Doom 3 popularized them (2004). Doom 3, however, uses a conventional approach with a stencil buffer (which wasn’t available yet on other 3d hardware around 1996, when the PCX1 was introduced). Just as the PowerVR has no conventional z-buffer, it has no stencil buffer either. Instead, because the hardware can already defer texturing and shading until after the depth-sorting has been performed, it is not very difficult to extend this deferred shading with special cases for pixels that are in light or in shadow.

If you use the low-level PowerSGL Direct API however, you can specify light or shadow volumes directly as polygon meshes, so you are not restricted to using infinite planes (another indication that the hardware does not use infinite planes natively, but more conventional polygon rasterization).

I also tried to get some OpenGL code running on the GLQuake MiniGL driver… but that experiment did not last very long, as the driver seems extremely limited and hardwired to the Quake engine. I could get polygons on screen in the right place, but that’s about it. There appeared to be lots of culling/sorting issues, making polygons randomly disappear, and shading and texturing did not work as expected at all. So I quickly gave up on that.

Quality and precision

PowerVR used three levels of certification to indicate the compatibility of games: Accelerated, Enhanced and Extreme.

As they themselves say about these certifications:

The PowerVR Extreme™ Certification

The ultimate in the PowerVR certification supported titles will be reserved for games of the highest quality, not only in technical capabilities but also in game play and market appeal. Games in this category must achieve a minimum frame rate and resolution of 30fps at 800×600. All of this must be done with a minimum of 16.7 million colors and should take advantage of all advanced features of PowerVR. Examples of this type of game include Virtua On™ by Sega Entertainment, Rave Racer™, Tekken 2™ and Air Combat 22™ by NAMCO and Ultim@te Race™ by Kalisto Entertainment.

The PowerVR Enhanced™ Certification

Titles more advanced than basic Direct3D but not in the Extreme category due to lower screen resolution, frame rate and color depth will be found in this category. These titles include features and capabilities not found in other 3D hardware.

Minimum requirements for a title in this category include a frame rate of 30fps, resolution of 640×480 and a minimum color depth of 16 bits. A number of titles fall into this category and will make up the bulk that take advantage of the PowerVR technology. This may mean taking advantage of either high-resolution, translucencies, shadows, light volumes, fogging, filtering, among others. Examples of games in this category include Ubi Soft Entertainment’s POD™, MechWarrior™ 2 by Activision and Wipeout XL™ from Psygnosis.

The PowerVR Accelerated™ Certification

All other PowerVR architecture compatible Direct3D titles not in the Extreme or Enhanced categories will be found here. The minimum level of 3D support required is 20fps and 640×480 resolution with a minimum color depth of 16 bits.

For 1996/1997 these were quite lofty standards: VooDoo cards were not even capable of rendering in 16.7 million colours at all back then, and getting 30 fps at a resolution of 800×600 was no small feat either. It goes even further than that. Various early 3D games were originally written with software renderers, without subpixel correction (Quake being one of the exceptions, where the software renderer is highly accurate as well), and were then patched to include accelerated support. Even though most early 3D accelerators already had subpixel accuracy, the games were patched in a simple way, and the geometry processing was not rewritten for subpixel accuracy, so the accelerated version had the same shaky, unstable rasterization as the software version.

Tomb Raider is a well-known offender as far as poor accuracy goes. The VooDoo Glide patch does not seem to have subpixel correction either. The PowerVR version does, however, transforming the game into a perfectly smooth and stable experience. The same seems to go for MechWarrior 2, where the PowerVR version seems to be the only one to have subpixel correction.

“Framerate is life” was a common phrase in those early days, but PowerVR clearly cared about more than just framerate: high-accuracy rasterization, true-colour rendering, and various effects such as fog, shadows and light volumes were also given plenty of attention.


11 Responses to Just keeping it real, part 6

  1. Numb_Thumb says:

    Hi Scali,
    Thanks for this informative blog; I have been learning a lot from your blog posts. I am just wondering what you think about Nvidia’s Maximus Technology in general? Sorry, I couldn’t find a contact page, so I asked here.

    • Scali says:

      They’re just workstations with an nVidia Quadro and Tesla card…

      • Numb_Thumb says:

        This was very general 😀 but yeah, thank you for sharing your thoughts.

      • Scali says:

        Well, what else is there to say about it? Clearly it’s a rather specific type of workstation, which may or may not improve your productivity. It all depends on how you use your computer.
        I’m not in the target demographic, that’s all I can say.

  2. wfw311 says:

    I’ve been thinking about the performance of the implementation using infinite planes.
    AFAIK the PCX2 can check up to 32 polygons/planes/whatever per pixel for visibility. (I can’t find a source for this right now, but it should be safe to assume that there is a limit.) Since it can render one pixel per clock, hidden surface removal is basically free as long as there are no more than 32 planes to check.
    Now, the PowerSGL API for convex polyhedra (thanks for the hint about the SDK, got it) doesn’t allow you to specify a bounding box, and I suppose it would be quite expensive to calculate one, as you would have to solve a lot of equations. (The SGLDirect API for shadow volumes does allow you to specify a bounding box.)
    So if I assume that all 35*40=1400 planes are checked globally, the chip would need ceiling(1400/32)=44 clocks per pixel to check visibility. If I take the fill rate of about 66MP/s and divide it by 44, I get a remaining fill rate of 1.5MP/s, which might roughly match the speed you get. (Hard to say, since you didn’t specify the speed exactly and I would have to guess at the screen resolution and the size of the window.)

    So, if you were to render a single convex polyhedron using fewer than 32 infinite planes (it shouldn’t matter if it fills the whole window), it should run at close to maximum performance.

    • Scali says:

      Hum, the 32 planes-per-pixel figure doesn’t quite make sense.
      Firstly, I am not sure what kind of hardware implementation would allow you to do 32 plane checks in a single cycle. That would be some highly advanced circuitry, and I doubt the PCX2 is anywhere near that complex.
      Secondly, my window is 640×480, and my donut fills nearly the entire screen.
      Now, let’s assume that tiling is done before hidden surface removal. After all, tiling is part of the divide-and-conquer approach they use for HSR. If they could solve HSR for the whole screen at once, there would be no need for tiling, as they’d only have to perform the deferred shading after that.
      Since each tile is only 32×16 pixels large, you generally have only one or two tube-section objects inside a tile, each tube-section having 37 planes. Now, in most cases, most of these planes will not be in this tile. So I think it’s a safe assumption that you generally have fewer than 32 planes per tile.

      I could experiment by lowering the polycount, so that you have less than 16 planes per tube-section, and as such, virtually always less than 32 planes per tile. But I doubt that this is the problem. After all: how would it handle the quad-based version so efficiently, if it translated the quads to an infinite plane format as well (as some people claim it does)?

      The bounding volume for convex polyhedra is implicit, so why would you specify a bounding box at all?
      The reason why you can limit shadow volumes, is because they extend up to ‘infinity’. So you can use some extra planes to limit the shadows only to inside a room or such, and save fillrate (nVidia later implemented a Z-scissor extension, used in Doom 3, as a similar solution).

      Another thing I noticed is that invisible planes are more expensive than visible ones. That again wouldn’t make sense if it was able to do plane visibility checks implicitly.
      It does make sense however, if you think of it as a more conventional z-buffer-ish renderer: When you apply an invisible plane, this effectively disables backface culling, so now backfaces are also participating, and you get more overdraw.
      The ‘invisible’ plane is also participating in the rasterizing, because it is being used for some per-pixel stencil-like trickery to perform the intersection properly.
      Which would explain why invisible planes are more expensive than visible ones.
      In the case of the donut, I could use either one, since the ‘visible’ planes would be inside the donut anyway, and wouldn’t change the appearance of the object. Invisible planes gave me ~8 fps, where visible planes gave me ~11 fps.

      Anyway, that’s my theory of what is happening: the convex polyhedra are processed to a mesh of convex n-gons first (on the CPU?), and then rendered the same as directly feeding triangles or quads to the device.
      Another small thing was that I could see slight speckling on screen. The infinite plane donut wasn’t always pixel-perfect. It was most noticeable when I didn’t specify any texture for my cap planes. I could see white pixels everywhere in the donut. Which shouldn’t really happen, as the documentation specifies that it always renders the last pixel with a given z-value. And I specified the cap planes first in my object.
      With invisible planes, these pixels became ‘transparent’, and you could see texels from the inside of the donut shining through.
      So I opted for visible, textured cap planes. That way, at least the problematic pixels were not very noticeable.
      But it reeks of conventional rasterizing.
      I have also seen some very conventional-looking z-fighting in some instances.
