Pixomatic Rendering Technology
Pixomatic is our advanced software rasterizer for x86 PCs. Pixomatic was acquired by Intel Corporation at the end of 2005, but RAD still handles licensing, support and updates. We have two versions of Pixomatic available for licensing: Pixomatic 2 and Pixomatic 3.
Pixomatic 3 Features - Software DX9
Pixomatic 3 is a drop-in DX9 class software rasterizer with all the bells and whistles, from full support for both Shader Model 2.0 and 3.0 pixel and vertex shading to anisotropic texture filtering to stencil buffers to per-pixel mipmapping, in both 32- and 64-bit versions. Porting to Pixo 3 is painless, consisting mostly of implementing simple optimizations, such as reduced LOD and smaller textures, that improve software rendering performance. Pixomatic 3 is hundreds of times faster than the reference rasterizer from Microsoft.
The only difference between Pixomatic 3 and a DX9-class video card is speed - DX9 is fully supported.
Pixomatic 2 Features - Software DX7
Pixomatic 2 is directly comparable to a high-end DX7-class graphics accelerator in terms of features, and performs high-quality rendering on any x86-compatible Windows or Linux machine that supports MMX. It does two-texture multitexture, Gouraud, specular, fog, alpha testing, alpha blending, 16- and 24-bit z, bilinear filtering, texture transforms, and projected textures. It handles transformation, clipping, and projection of trilists, tristrips, trifans, quadlists, polygons, pointsprites, linelists, and linestrips, drawn through begin/end primitives or indexed or non-indexed streams. It performs perspective-correct rasterization, with per-triangle mipmapping, subpixel and subtexel accuracy, and 32-bit color depth. It clears, fills, copies, stretches, antialiases, and dithers. And while it's nowhere near as fast as the latest consoles and 3D cards, it's fast enough to run Quake II at 28 fps on a PIII/733 MHz, 67 fps on a P4/2.2 GHz, and 108 fps on a P4/3.3 GHz.
Pixomatic 2 does not support cubemaps, programmable pixel shaders, per-pixel mipmapping, or trilinear filtering, because those features don't lend themselves well to efficient implementation in software. Pixomatic 2 also does not provide APIs for lighting or programmable vertex shaders, as these can be done just as efficiently - and considerably more flexibly - by the game itself rather than by Pixomatic. (Pixomatic 2 provides a per-vertex user callback to make it easy for the game to implement both features.)
You can find a more complete list on the Pixomatic features page, but the above should be enough to give you an idea of what Pixomatic can and can't do.
Software Rendering - What?
Given that 3D hardware is commonplace nowadays, what value does Pixomatic aim to deliver? Pixomatic can deliver value in either of two main ways. The first is by providing reliable and consistent technology for games that need or could use 3D, but don't require complex pixel shaders, huge numbers of triangles, or massive fill rate, and want to avoid the complications and costs of dealing with PC hardware. While that may sound modest, in fact the benefits in terms of time and money saved, resources freed, and potential increase in market size can be substantial, as discussed in detail on the "Why Pixomatic?" page. The second way in which Pixomatic can add value is by serving as a fallback renderer for games that require hardware to look their best, but would still like to be able to run on every machine regardless of hardware acceleration, in order to broaden their market and reduce technical support costs and returns.
Let's take a closer look at how Pixomatic works and why we designed it that way.
We knew from the start that it had to be easy to write games on top of Pixomatic, or to port games to it; thus, Pixomatic's feature set maps closely to that of hardware that's in the same performance ballpark. For efficiency, Pixomatic uses a custom API, but one that is designed for easy porting from industry standards. Pixomatic has none of the quirks you might expect from a software rasterizer, and places no unusual demands on your game design; there's no retaining of world information, no BSP trees or span lists or edge lists or any of the many clever tricks software rasterizers have relied on in the past. Pixomatic is a very straightforward implementation of the classic 3D pipeline: polygons in any of several forms go in one end, and, after z and stencil testing, the rasterized, pixel-shaded result comes out the other end and is written to the frame buffer. All the rasterization features work together in almost any combination (the only exception is that no frame buffer or z buffer drawing can occur when the stencil buffer is being written to). If you're familiar with either OpenGL or DX, you'll have no problems with learning to use Pixomatic.
Our second requirement was to produce the highest-quality results possible, within the constraints of ease of use and the performance limitations of software rasterization. We did this by implementing our core feature set with a minimum of 8 bits for each color component throughout the pipeline, doing subpixel rasterization and homogeneous clipping, and doing perspective-correct, subtexel-accurate texture mapping. Then we added all the rendering features we could get to run fast enough, still with the same color depth and accuracy: dot3 per-pixel lighting; antialiasing; stencil shadows; optional 24-bit z; and bilinear filtering, plus two faster filters that work well for light maps.
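To make the idea of bilinear filtering at 8 bits per color component concrete, here is a minimal sketch in portable C. It is not Pixomatic's code or API: the function names, the packed 32-bit texel layout, the 8-bit fractional weights, and the clamp-at-edge behavior are all assumptions chosen for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Blend two packed 8-bit-per-channel texels. f is an 8-bit fraction:
   0 returns a, and 255 is almost entirely b. (Hypothetical helper.) */
static uint32_t lerp_8888(uint32_t a, uint32_t b, unsigned f)
{
    uint32_t out = 0;
    assert(f < 256u);
    for (int shift = 0; shift < 32; shift += 8) {
        unsigned ca = (a >> shift) & 0xFFu;
        unsigned cb = (b >> shift) & 0xFFu;
        unsigned c = (ca * (256u - f) + cb * f) >> 8;  /* stays within 0..255 */
        out |= (uint32_t)c << shift;
    }
    return out;
}

/* Sample a w-by-h texture at texel (x, y) plus fractions (fx, fy),
   clamping at the right and bottom edges (an assumption of this sketch). */
static uint32_t bilinear_sample(const uint32_t *tex, int w, int h,
                                int x, int y, unsigned fx, unsigned fy)
{
    assert(tex != 0 && x >= 0 && x < w && y >= 0 && y < h);
    int x1 = (x + 1 < w) ? x + 1 : x;
    int y1 = (y + 1 < h) ? y + 1 : y;
    uint32_t top = lerp_8888(tex[y * w + x],  tex[y * w + x1],  fx);
    uint32_t bot = lerp_8888(tex[y1 * w + x], tex[y1 * w + x1], fx);
    return lerp_8888(top, bot, fy);
}
```

Sampling halfway between a texel of value 0x00 and one of value 0x80 (fx = 128, fy = 0) yields 0x40. Each channel keeps its full 8 bits through both blends, which is the property the paragraph above describes.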
Our final objective was, of course, performance. It's not all that hard to write a software rasterizer that produces high-quality output; the hard part is writing one that does it fast. We knew from the start that performance would be our biggest challenge with Pixomatic, so we designed Pixomatic 2 from scratch for the best possible performance on PIII- and P4-class machines, using every optimization technique we could think of that wouldn't degrade rendering quality. For Pixomatic 3, our top priority was creating a DX9-compliant renderer, but performance was a close second; the key new performance factor in Pixomatic 3 is support for parallel processing across up to 16 cores.
Pixomatic 2 Performance
At the heart of Pixomatic 2's performance and quality is what we call the welder, the software that compiles the pixel pipeline on the fly whenever the rasterization state changes, producing code equivalent to hand-tuned assembly language. (In fact, it effectively is hand-tuned assembly code; the compilation involves intelligent stitching together and fixing up of hand-optimized code fragments.) The welded pixel pipeline uses all 8 MMX registers and all 8 general-purpose registers to keep dynamic variables in registers at almost all times (there are a few save/restore cases when Gouraud, specular, and both textures are in use together with bilinear filtering). The texel lookup itself requires a mere 5 instructions, thanks to careful use of MMX. Only one branch - the loop branch - is performed per pixel, apart from the z, stencil, and alpha tests, if they're enabled. The pipeline early-outs on z or stencil failure.
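The welder's real output is machine code stitched together from hand-optimized fragments, which cannot be shown portably. As a loose analogy only, the sketch below rebuilds a list of C function pointers whenever the render state changes, so the per-pixel loop never tests state flags. Every name here (Pixel, Pipeline, weld, the STATE_ flags) is hypothetical, and the indirect calls it uses are exactly the overhead that true code stitching avoids.

```c
#include <assert.h>

/* One single-channel pixel in flight, with values in 0..255. */
typedef struct { unsigned color, texel, alpha, dest; } Pixel;
typedef void (*Fragment)(Pixel *);

/* Fragments: each does one job and assumes nothing about its neighbors. */
static void frag_modulate(Pixel *p) { p->color = p->color * p->texel / 255u; }
static void frag_blend(Pixel *p)
{
    p->color = (p->color * p->alpha + p->dest * (255u - p->alpha)) / 255u;
}

enum { STATE_TEXTURE = 1, STATE_BLEND = 2 };  /* hypothetical state bits */
#define MAX_FRAGMENTS 4

typedef struct { Fragment steps[MAX_FRAGMENTS]; int count; } Pipeline;

/* Called only when the state changes: pick the fragments that are needed. */
static void weld(unsigned state, Pipeline *out)
{
    assert(out != 0);
    out->count = 0;
    if (state & STATE_TEXTURE) out->steps[out->count++] = frag_modulate;
    if (state & STATE_BLEND)   out->steps[out->count++] = frag_blend;
}

/* Called per pixel: no state tests, just the selected fragments in order. */
static unsigned run_pipeline(const Pipeline *pl, Pixel p)
{
    for (int i = 0; i < pl->count; ++i)
        pl->steps[i](&p);
    return p.color;
}
```

The design point the analogy shares with the text is that the cost of deciding what the pipeline does is paid once per state change rather than once per pixel.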
Moving up a level, the triangle pipeline, which drives the pixel pipeline, works by generating a list of spans to be drawn for each triangle; each span is no longer than 16 pixels, with floating-point perspective-correct texture calculations performed at each end. The pixel pipeline then draws each span in turn, interpolating linearly from one end to the other, producing results indistinguishable from performing the perspective divide for every pixel, but with much better performance. The span generator automatically uses SSE or 3DNow if either is present, and the SSE version is written entirely in hand-tuned assembly language, with 7 general-purpose registers, 8 MMX registers, and 6 XMM registers in use simultaneously. Z prefetching is used to reduce effective memory latency when prefetch instructions are available.
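The span scheme described above can be sketched in a few lines of C. This is an illustration under stated assumptions, not Pixomatic's implementation: the Gradients struct, the function name, and the use of a single texture coordinate u are hypothetical, and the real span generator is hand-tuned assembly. The idea shown is that u/w and 1/w vary linearly across a scanline, so the divide is done only at span ends and u is interpolated linearly in between.

```c
#include <assert.h>

#define SPAN_LEN 16  /* maximum span length quoted in the text */

/* Screen-space linear quantities along one scanline (hypothetical layout). */
typedef struct {
    float uw, duw;  /* u/w at pixel 0, and its per-pixel step */
    float iw, diw;  /* 1/w at pixel 0, and its per-pixel step */
} Gradients;

/* Write u for pixels 0..count-1, dividing once per span end, not per pixel. */
static void draw_u_spans(const Gradients *g, int count, float *out)
{
    assert(g != 0 && out != 0 && count > 0);
    float u_left = g->uw / g->iw;                 /* exact u at pixel 0 */
    for (int x0 = 0; x0 < count; x0 += SPAN_LEN) {
        int len = (count - x0 < SPAN_LEN) ? count - x0 : SPAN_LEN;
        float x1 = (float)(x0 + len);
        /* Exact, perspective-correct u at the far end of this span. */
        float u_right = (g->uw + g->duw * x1) / (g->iw + g->diw * x1);
        float du = (u_right - u_left) / (float)len;
        for (int i = 0; i < len; ++i)             /* cheap linear inner loop */
            out[x0 + i] = u_left + du * (float)i;
        u_left = u_right;                         /* reuse as next span's start */
    }
}
```

In this sketch a 40-pixel scanline costs four divides (one at pixel 0 and one at each of three span ends) instead of 40. The values at span boundaries are exact, and the error inside a span depends on how quickly 1/w changes across those 16 pixels.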
In short, the rasterization pipeline is designed to provide the best performance we know how to wring out of an x86 processor with MMX and optionally SSE or 3DNow, while still maintaining 32-bit, perspective-correct, subpixel- and subtexel-precise rendering quality.
The geometry pipeline is similarly constructed for maximum performance. A combination of C, hand-tuned assembly code, and compiled-on-the-fly code is used to handle the many combinations of interface types, primitive types, and stream configurations as efficiently as possible. We even developed a custom preprocessor so we could reuse tuned code across the full range of configurations. Again, SSE and 3DNow are used if available.
A variety of MMX-optimized clears and fills are provided, along with blts to 32-, 24-, and 16-bit targets, the latter with dithering. These too are assembly code in some places and compiled-on-the-fly in others, as appropriate. Prefetching is used to accelerate the clears, fills, and blts whenever it's available.
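For readers unfamiliar with dithered blts, the following C sketch converts 32-bit pixels to a 16-bit target using ordered dithering. The text does not say which dithering method or 16-bit layout Pixomatic uses, so the 4x4 Bayer matrix, the 5-6-5 format, and all function names are assumptions, and the scalar loop stands in for what would really be MMX-optimized code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 4x4 Bayer threshold matrix, values 0..15 (assumed dither pattern). */
static const uint8_t bayer4[4][4] = {
    { 0,  8,  2, 10},
    {12,  4, 14,  6},
    { 3, 11,  1,  9},
    {15,  7, 13,  5},
};

/* Reduce an 8-bit channel to `bits` bits, adding a position-dependent
   bias first so the discarded low bits survive as a spatial pattern. */
static unsigned dither_channel(unsigned v, unsigned threshold, unsigned bits)
{
    unsigned drop = 8u - bits;                 /* number of low bits discarded */
    unsigned bias = (threshold << drop) >> 4;  /* scaled to 0 .. 2^drop - 1 */
    unsigned sum = v + bias;
    if (sum > 255u) sum = 255u;
    return sum >> drop;
}

/* Copy a width-by-height block of 0x00RRGGBB pixels to 5-6-5 with dithering. */
static void blt_8888_to_565_dither(const uint32_t *src, uint16_t *dst,
                                   int width, int height)
{
    assert(src != 0 && dst != 0 && width > 0 && height > 0);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            size_t i = (size_t)y * (size_t)width + (size_t)x;
            uint32_t p = src[i];
            unsigned t = bayer4[y & 3][x & 3];
            unsigned r = dither_channel((p >> 16) & 0xFFu, t, 5u);
            unsigned g = dither_channel((p >> 8) & 0xFFu, t, 6u);
            unsigned b = dither_channel(p & 0xFFu, t, 5u);
            dst[i] = (uint16_t)((r << 11) | (g << 5) | b);
        }
    }
}
```

With an input red level of 4, which sits halfway between two 5-bit steps, half the pixels in each 4x4 tile round up and half round down, so the average intensity is preserved instead of being truncated to zero.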
Finally, Pixomatic offers buffer management and screen-update functions, and, if you'd like, can automatically figure out the fastest way to update the front buffer, choosing between GDI blts and its own MMX-optimized blts.
Pixomatic 3 Performance
Pixomatic 3 is about 4 times slower than Pixomatic 2, due to its full DX9 compatibility. However, Pixo 3 was also designed with multi-CPU support built right in! With near-perfect linear scaling, Pixomatic 3 takes full advantage of multi-core and multi-CPU machines.
At the heart of Pixomatic 3 is an Intel-optimized SSE shader compiler. This is the same compiler that drives the vertex shading on some Intel hardware GPUs; we have optimized it further and added pixel shading features. Your DX9 games will run, right out of the box, just a bit slower!
How did our drive for performance work out?
We're pleased with the results, which we feel are reasonably close to the best that can be achieved on current x86 CPUs with a general-purpose 3D API. Pixomatic's performance is more than adequate for Quake games, and would have been fine for many recent bestselling games.
In short, Pixomatic implements a standard, core set of 3D functionality that's well suited to implementation on a CPU, with a familiar, easy-to-use API and the best performance we could produce without sacrificing rendering quality or ease of use. It doesn't try to do things that can't be done well given the limitations of CPU performance, but what it does, it does efficiently, accurately, and reliably.