
gpu 2: Software rasterizer, part 3

2020-11-27

(Warning: The animations on this page make your CPU do a lot of work and may drain your phone's battery.)

In our journey to the hypothetical universe where JS was invented but GPUs were not, we have made a software rasterizer which can draw a teapot on a 192x192 canvas at 25 fps (on my laptop). In this post we will try to make it faster.

Non-rasterizer stuff

A lot of code besides the rasterizer runs to draw the teapot. Namely, we have the vertex processing step before the rasterizer and the canvas painting step after it. When they are slow, they block us from seeing results from rasterizer optimizations, so I wanted to optimize them and get them out of the way first.

After a first profiling run, I realized my put_pixel() function was still a call to the 2D context's fillRect(), which kept Firefox busy compositing a thousand one-pixel rectangles. Probably better to set pixels directly on the canvas image data.
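The fix looks roughly like this. This is only a sketch: img as the name of the reused ImageData and the [r, g, b] color format are my assumptions, not necessarily the post's exact code.

    /* write straight into the ImageData pixel buffer;
     * assumes img is the reused ImageData, color is [r, g, b]
     * and L is a 0..1 lighting factor */
    function put_pixel(x, y, L, color){
        var i = (y * W + x) * 4;        /* 4 bytes per RGBA pixel */
        img.data[i]     = color[0] * L;
        img.data[i + 1] = color[1] * L;
        img.data[i + 2] = color[2] * L;
        img.data[i + 3] = 255;          /* fully opaque */
    }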

I ended up with a problem after fixing that: Even my laptop could draw a single 192x192 rotating teapot at 60 fps. Plus, both Firefox and Chromium profilers showed that most of the time was being spent in vertex processing. I want to showcase rasterizer speedups, so I would like to increase time spent in the rasterizer. How to do that?

My first try was adding more teapots (-> triangles) and pixels. However, adding teapots increases vertex processing time and adding more pixels increases canvas drawing time. After some experimentation, I settled on drawing four teapots to a 640x480 off-screen canvas, which is then scaled down to 320x240 (so it can be visible on phones).

To speed up vertex processing, I mostly chased down new allocations: reuse vertex and triangle arrays and stop returning new arrays from vector functions.
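As an illustration, here is the kind of change I mean; vec_sub is a made-up example, not a function from the post.

    /* allocating version: returns a fresh array on every call */
    function vec_sub_alloc(a, b){
        return [a[0] - b[0], a[1] - b[1], a[2] - b[2]];
    }

    /* allocation-free version: the caller provides the output array */
    function vec_sub(out, a, b){
        out[0] = a[0] - b[0];
        out[1] = a[1] - b[1];
        out[2] = a[2] - b[2];
        return out;
    }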

One tricky thing was the Z buffer. With a simple nested array, Z buffer checks and writes took more than 50% of the time spent in draw_trig. I changed it to use a flat Float32Array. (A flat "untyped" array also works but it takes too much time to zero out every frame.)
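In code, that is roughly the following; clear_z is an illustrative name for the per-frame reset. The buffer is then indexed as Z_buffer[y*W + x], as seen in draw_trig below.

    /* one flat typed array instead of nested per-row arrays */
    var Z_buffer = new Float32Array(W * H);

    function clear_z(){
        Z_buffer.fill(0);   /* fast to zero, unlike a plain Array */
    }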

The last thing to optimize was the canvas interface. My trick was to get a single ImageData object from the offscreen canvas via createImageData and keep using it. Its data can be transferred to the smaller canvas by first blitting to the offscreen canvas via putImageData and then calling drawImage to get the browser to scale it down to the smaller one.
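A sketch of the idea, with off_canvas, off_ctx and screen_ctx as illustrative names for the 640x480 off-screen canvas and the two 2D contexts:

    /* one-time setup: get a single ImageData and keep reusing it */
    var off_ctx = off_canvas.getContext("2d");
    var img = off_ctx.createImageData(W, H);    /* W=640, H=480 */

    /* per frame: put_pixel writes into img.data, then... */
    function present(){
        off_ctx.putImageData(img, 0, 0);                  /* blit at full size */
        screen_ctx.drawImage(off_canvas, 0, 0, 320, 240); /* browser scales down */
    }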

As a final result, the rasterizer (draw_trig()) was taking 50% of the frame drawing time. Not the ideal value, but it's the best I could do while drawing an actual 3D model. Here is the program in action:

(Interactive demo: a JavaScript canvas.)

Stepping

Now that we are done with the JS bureaucracy, we can get to the main event.

The first optimization is exploiting the linearity of the edge functions, as explained in ryg's "Optimizing the basic rasterizer". Each edge equation can be written in this form:

    F(x, y) = A*x + B*y + C

Since F(x+1, y) = F(x, y) + A and F(x, y+1) = F(x, y) + B, we can evaluate each edge function once at (minX, minY) and then traverse the triangle up to maxX and maxY by adding A to step in x and B to step in y. That's not very hard to add to the rasterizer.

Let's go through the code:

function draw_trig(v0, v1, v2){
    /* doubles emulating x.8 fixed point */
    var x0 = Math.round(v0[0] * 256);
    var x1 = Math.round(v1[0] * 256);
    var x2 = Math.round(v2[0] * 256);
    var y0 = Math.round(v0[1] * 256);
    var y1 = Math.round(v1[1] * 256);
    var y2 = Math.round(v2[1] * 256);

    /* triangle bounding box */
    var minX = Math.min(x0, x1, x2);
    var minY = Math.min(y0, y1, y2);
    var maxX = Math.max(x0, x1, x2);
    var maxY = Math.max(y0, y1, y2);

    /* clip to screen coords */
    minX = Math.max(minX, 0);
    minY = Math.max(minY, 0);
    maxX = Math.min(maxX, (W-1)<<8);
    maxY = Math.min(maxY, (H-1)<<8);

    /* replace fractional part with 128 to sample from center */
    minX = (minX & ~255) + 128;
    minY = (minY & ~255) + 128;
    maxX = (maxX & ~255) + 128;
    maxY = (maxY & ~255) + 128;

This part stays the same: we convert the coordinates and the triangle bounds to fixed-point values.

    /* F = Ax + By + C */
    var A01 = y0 - y1;
    var A12 = y1 - y2;
    var A20 = y2 - y0;
    var B01 = x1 - x0;
    var B12 = x2 - x1;
    var B20 = x0 - x2;
    var C01 = x0*y1 - y0*x1;
    var C12 = x1*y2 - y1*x2;
    var C20 = x2*y0 - y2*x0;

    /* fill rules */
    var bias0 = (A12 > 0 || (A12 == 0 && B12 < 0)) ? 0 : -1;
    var bias1 = (A20 > 0 || (A20 == 0 && B20 < 0)) ? 0 : -1;
    var bias2 = (A01 > 0 || (A01 == 0 && B01 < 0)) ? 0 : -1;

Compute the edge function constants and use them to apply the top-left fill rule: top and left edges keep a bias of 0, while the others get -1 so that pixels lying exactly on a shared edge are drawn only once.

    /* get initial edge function values
     * we will advance in only 1 pixel steps after
     * here, so drop the fractional part */
    var w0_row = (C12 + A12*minX + B12*minY + bias0) >> 8;
    var w1_row = (C20 + A20*minX + B20*minY + bias1) >> 8;
    var w2_row = (C01 + A01*minX + B01*minY + bias2) >> 8;

    /* convert min-max values to integral pixels */
    minX = minX >> 8;
    minY = minY >> 8;
    maxX = maxX >> 8;
    maxY = maxY >> 8;

Evaluate the initial edge function values at (minX, minY). Note that multiplying two x.8 fixed-point numbers yields a 2x.16 result, so we have to discard the bottom 8 bits to be able to keep stepping with the A and B values, which are still x.8. Also see ryg's comment about this.
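As a concrete example: 1.5 is 384 in x.8, and 384 * 384 = 147456, which is 2.25 in x.16; 147456 >> 8 = 576, which is 2.25 in x.8 again.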

    /* traverse triangle */
    for(var y=minY; y<=maxY; y++){
        var w0 = w0_row;
        var w1 = w1_row;
        var w2 = w2_row;
        for(var x=minX; x<=maxX; x++){
            if((w0 | w1 | w2) >= 0){
                var k = 1 / (w0 + w1 + w2);
                var wr = v0[3]*w0*k + v1[3]*w1*k + v2[3]*w2*k;
                if(!Z_buffer[y*W+x] || Z_buffer[y*W+x] < wr){
                    put_pixel(x, y, L, color);
                    Z_buffer[y*W+x] = wr;
                }
            }
            w0 += A12;
            w1 += A20;
            w2 += A01;
        }
        w0_row += B12;
        w1_row += B20;
        w2_row += B01;
    }

Step through the pixels. Note that we changed the w0 >= 0 && w1 >= 0 && w2 >= 0 check to (w0 | w1 | w2) >= 0. As ryg explains, this check is equivalent to "not (is any of the sign bits 1?)" and can be done faster by ORing the numbers and checking the final sign bit.
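A quick sanity check with made-up numbers:

    (-5 | 3 | 7) >= 0   /* false: -5 has its sign bit set, so the OR is negative */
    ( 5 | 3 | 7) >= 0   /* true: all three sign bits are clear */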

I replaced the draw_trig function with this code and profiled both versions in the Chromium JS profiler. My procedure roughly went like this:

  1. Open the page in a new tab
  2. Click play on the animation
  3. Wait for 5 seconds
  4. Record for ~10 seconds
  5. Repeat 5 times

When looking at the profiling results, I noted the percentage of time spent in the functions draw_trig and draw. draw_trig is the rasterizer, while draw is the main function called by window.requestAnimationFrame. Here are the results:

                        run 1   run 2   run 3   run 4   run 5
Start     draw_trig     50.29%  50.77%  51.20%  50.86%  49.59%
          draw          94.84%  94.91%  94.99%  95.21%  95.69%
Stepping  draw_trig     46.48%  45.00%  47.92%  47.89%  45.94%
          draw          93.05%  93.38%  93.35%  93.55%  92.95%

There is some improvement, but it's not very dramatic. The FPS counter agrees with that: I was seeing 44 fps on average before, and now it's up to 50 fps.

(Interactive demo: a JavaScript canvas.)

Culling small triangles

After this point, ryg's post goes into parallelization via SIMD, but we don't have access to SIMD in the browser as of 2020. I also found and tried to implement block-based traversal, which ryg demonstrated in this 2012 pouet thread, but it didn't result in any improvement here. The scene model and the lack of any parallelism make it hard to benefit from a block-based approach.

Luckily, there are other possible improvements for our rasterizer. For instance, the teapot model used here is made of 3752 triangles, many of which occupy little to no space on the screen. It would be really good if we could reject them as quickly as possible.

Here is how they did it in "High-Performance Software Rasterization on GPUs":

Multiple culling tests are performed for each triangle. If the triangle is degenerate, i.e., has zero area, it is culled, as well as if the area is negative and backface culling is enabled. If the AABB of the triangle falls between the sample positions, we also cull the triangle. This test is very effective in culling thin horizontal and vertical triangles that often result when viewing distant axis-aligned geometry in perspective. Finally, if the AABB is small enough to contain only one or two samples, we calculate their coverage, and if no samples are covered, we cull the triangle. This ensures that for densely tessellated surfaces, we output at most one triangle per sample, which can be much fewer than the number of input triangles.

It's easy to cull back-facing triangles. Recall that the edge function values are the barycentric coordinates scaled by 2A, so their sum yields twice the signed triangle area. We can just check the sign of the sum of the initial wX_row values.

It's also a good idea to store the 1/A value (actually 1/2A) for use in depth interpolation.

    ...
    var A = w0_row + w1_row + w2_row;
    if(A <= 0)
        return;
    var k = 1/A;
    ...

To find whether the bounding box of the triangle falls between sample positions, we can snap its corners (minX, minY) and (maxX, maxY) to the nearest sample positions. If the snapped minX == maxX or minY == maxY, the box does not enclose any sample points and the triangle can be ignored. (See slide 53 of this presentation.)

    /* check if bounding box includes a sample */
    var minX_corner = (minX + 127) & ~255;
    var minY_corner = (minY + 127) & ~255;
    var maxX_corner = (maxX + 128) & ~255;
    var maxY_corner = (maxY + 128) & ~255;
    if(minX_corner == maxX_corner || minY_corner == maxY_corner)
        return;

Here, min values are biased differently because I would like to avoid culling when a triangle's left edge lies on a pixel center, where it should generate a pixel due to the top-left rule. (Not sure if this is the right way to do it, though.)

I think the third cull test in the GPU paper is better suited for a rasterizer with block traversal. We already traverse triangles point by point, so testing for one sample point coverage is more or less the same thing as rasterizing the triangle.

As a final optimization, I moved X and Y rounding out of the rasterizer and into the vertex processing step. A lot of vertices are shared in the mesh and there is no need to round the same values over and over.
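A sketch of what that looks like; verts holding already-projected screen coordinates is my assumption about the surrounding code.

    /* round to x.8 fixed point once per vertex after projection */
    for(var i = 0; i < verts.length; i++){
        var v = verts[i];
        v[0] = Math.round(v[0] * 256);   /* done once here instead of */
        v[1] = Math.round(v[1] * 256);   /* per triangle in draw_trig */
    }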

(Interactive demo: a JavaScript canvas.)

I can see a peak of 58 fps in my Firefox tab, whereas Chrome now runs this at a solid 60 fps. I find that surprising, because Firefox was apparently faster when the code was not optimized yet. Does that mean Firefox is better adapted to the real world? :)

For completeness, here are the Chrome profiler numbers, collected with the same method as above:

                        run 1   run 2   run 3   run 4   run 5
Stepping  draw_trig     46.48%  45.00%  47.92%  47.89%  45.94%
          draw          93.05%  93.38%  93.35%  93.55%  92.95%
Culling   draw_trig     33.67%  32.98%  33.80%  33.74%  33.23%
          draw          91.13%  90.69%  91.42%  91.34%  91.30%

The draw_trig speedup is hard to judge from these numbers, since we moved a bunch of work from draw_trig into draw.

Conclusion

We finally have a SW rasterizer which can draw four teapots (15K triangles) at 60 fps to a 640x480 screen. I don't know how to convert this to a performance value which can be compared with graphics hardware or other rasterizers, but it probably isn't very high.

This post concludes the software rasterizer series. In the next post I am planning to collect ideas from the internet, decide on the shader instruction set, and maybe simulate the instructions or build up some logic. I think the single shader unit is a good starting point for a DIY (design-it-yourself?) GPU, since it will probably go through a lot of revisions depending on how the rest of the system rolls out, and it's better to start early.

See you then!