Graphics Card Performance
CS 481 Guest Lecture,
Dr. Lawlor, 2006/04/20
Most OpenGL programs don't take advantage of the performance inherent in the graphics card.
For example, take a typical simple OpenGL program:
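for (int iy=0;iy<n;iy++)
  for (int ix=0;ix<n;ix++) {
    glBegin(GL_TRIANGLES); // one glBegin/glEnd pair per triangle
    double x=ix*sz-0.5, y=iy*sz-0.5; // corner of this grid cell
    glVertex3fv(makeVertex(x,y)); // makeVertex is recomputed every frame
    glVertex3fv(makeVertex(x,y+sz));
    glVertex3fv(makeVertex(x+sz,y));
    glEnd();
  }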
If "makeVertex" does a few flops (say, two cosine calls), this might
take 700ns per triangle. But if we just precalculate and save the
vertex locations (avoiding the floating point calculations in
"makeVertex"), and draw like this:
for (unsigned int i=0;i<vertices.size();)
{
  glBegin(GL_TRIANGLES);
  glVertex3fv(vertices[i++].loc); // i advances three vertices per triangle
  glVertex3fv(vertices[i++].loc);
  glVertex3fv(vertices[i++].loc);
  glEnd();
}
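The precalculation step itself is not shown above. A minimal sketch of it might look like the following, assuming a "myVertex" struct that holds only a float "loc[3]" and a hypothetical "makeVertexOnce" standing in for the lecture's "makeVertex" (the real function is not given here, so the two cosine calls are only illustrative):

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: the notes only show that a vertex has a "loc" member.
struct myVertex {
    float loc[3];
};

// Hypothetical stand-in for makeVertex: two cosine calls per vertex.
myVertex makeVertexOnce(double x, double y) {
    myVertex v;
    v.loc[0] = static_cast<float>(x);
    v.loc[1] = static_cast<float>(y);
    v.loc[2] = static_cast<float>(0.1 * std::cos(10.0 * x) * std::cos(10.0 * y));
    return v;
}

// Build the vertex list once, in the order the drawing loop consumes it:
// three vertices (one triangle) per grid cell.
std::vector<myVertex> buildVertices(int n, double sz) {
    assert(n > 0);
    std::vector<myVertex> vertices;
    vertices.reserve(static_cast<std::size_t>(n) * n * 3);
    for (int iy = 0; iy < n; iy++) {
        for (int ix = 0; ix < n; ix++) {
            double x = ix * sz - 0.5, y = iy * sz - 0.5;
            vertices.push_back(makeVertexOnce(x, y));
            vertices.push_back(makeVertexOnce(x, y + sz));
            vertices.push_back(makeVertexOnce(x + sz, y));
        }
    }
    return vertices;
}
```

Calling buildVertices once at startup and keeping the result around is what lets the per-frame loop skip the floating point work.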
Drawing from the precalculated array, we're suddenly twice as fast--just 300ns per triangle. But doing
a glBegin/glEnd for every triangle is pretty slow, because the driver
has to do extra work on each glBegin (checking the texture, flushing
the pipe to the graphics card, etc.). So if we just do:
glBegin(GL_TRIANGLES); // a single glBegin/glEnd pair around all the triangles
for (unsigned int i=0;i<vertices.size();)
{
  glVertex3fv(vertices[i++].loc);
  glVertex3fv(vertices[i++].loc);
  glVertex3fv(vertices[i++].loc);
}
glEnd();
Now we're down to 160ns per triangle. Next we can try "vertex arrays",
which just point OpenGL at our array of vertices, like this:
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3,GL_FLOAT,sizeof(myVertex),vertices[0].loc); // 3 floats per vertex, one myVertex apart
glDrawArrays(GL_TRIANGLES, 0, vertices.size());
glDisableClientState(GL_VERTEX_ARRAY);
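The third argument to glVertexPointer is the stride: the number of bytes from the start of one vertex to the start of the next. The sketch below mimics that address arithmetic on the CPU. It assumes a "myVertex" with a hypothetical extra "normal" field (the notes only show "loc"), and "locationOf" is an illustrative helper, not an OpenGL call:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Assumed layout; only "loc" appears in the notes.
struct myVertex {
    float loc[3];     // position, the part glVertexPointer is pointed at
    float normal[3];  // hypothetical extra data, stepped over by the stride
};

// Where the position of vertex i lives, given the pointer and stride
// that would be handed to glVertexPointer.
const float *locationOf(const void *base, std::size_t stride, std::size_t i) {
    assert(base != nullptr);
    const unsigned char *bytes = static_cast<const unsigned char *>(base);
    return reinterpret_cast<const float *>(bytes + i * stride);
}
```

With base = vertices[0].loc and stride = sizeof(myVertex), locationOf(base, stride, i) is the same address as vertices[i].loc, which is why the extra per-vertex fields do not get in the way.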
Vertex arrays don't actually speed things up. At all. But they're at
least easy to use. We can then switch to triangle strips by just
generating our vertices in the proper order, and then do:
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3,GL_FLOAT,sizeof(myVertex),vertices[0].loc);
glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size()); // same call, strip mode
glDisableClientState(GL_VERTEX_ARRAY);
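The "proper order" for a strip alternates between the two edges of a row of grid cells. The sketch below generates one such row; it is an assumption about how the vertices could be ordered, not the lecture's actual generator, and "flatVertex" is a hypothetical stand-in for "makeVertex":

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct myVertex {
    float loc[3];  // assumed layout
};

// Hypothetical stand-in for makeVertex: a flat sheet at z=0.
myVertex flatVertex(double x, double y) {
    myVertex v = {{static_cast<float>(x), static_cast<float>(y), 0.0f}};
    return v;
}

// Vertices for row iy of an n-cell-wide grid, in triangle strip order:
// bottom edge, top edge, bottom edge, top edge, ...
std::vector<myVertex> buildRowStrip(int n, double sz, int iy) {
    assert(n > 0);
    std::vector<myVertex> strip;
    strip.reserve(2 * (static_cast<std::size_t>(n) + 1));
    double y = iy * sz - 0.5;
    for (int ix = 0; ix <= n; ix++) {
        double x = ix * sz - 0.5;
        strip.push_back(flatVertex(x, y));
        strip.push_back(flatVertex(x, y + sz));
    }
    return strip;
}

// Every vertex after the first two completes one more triangle.
std::size_t trianglesInStrip(std::size_t vertexCount) {
    return vertexCount < 3 ? 0 : vertexCount - 2;
}
```

A row of n cells needs 2(n+1) strip vertices for 2n triangles, about one vertex per triangle instead of three. Drawing the whole grid with a single glDrawArrays call would also need the rows joined together (for example with repeated vertices that form zero-area triangles); that part is not sketched here.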
Drawing triangle strips takes just 54ns per triangle--a 3x speedup! It's faster because triangle strips share
vertices between adjacent triangles, which means less work per
triangle. We can decrease the time per vertex even further by
using a very simple vertex shader like:
void main(void)
{
  gl_Position = ftransform(); // just the standard fixed-function transform
}
(For a simple or fixed camera, even this can be improved by
precalculation!) Because the vertex shader doesn't pass anything else
along (no colors, normals, or texture coordinates), we can only do really silly work in the fragment shader, like this:
void main(void)
{
  gl_FragColor=vec4(0,1,0,1); // solid green
}
With these vertex and fragment shaders, the same code takes just 32ns
per triangle. But we can do even better! We can build a
"vertex buffer object", as described in the ARB_vertex_buffer_object extension spec, like this (do this once at startup):
vertex_buffer=0; // buffer handle, kept around for rendering
glGenBuffersARB(1, &vertex_buffer);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
glBufferDataARB(GL_ARRAY_BUFFER_ARB,
  sizeof(myVertex)*vertices.size(), // size in bytes
  &vertices[0], GL_STATIC_DRAW_ARB); // uploaded once, drawn many times
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
This actually copies the vertex locations (the "vertices" array) into
graphics card memory, which the card can read much faster than CPU memory (which it can
only reach over the AGP or PCI bus). You can now render the
vertices in a fashion similar to vertex arrays:
glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
glEnableClientState(GL_VERTEX_ARRAY);
glVertexPointer(3,GL_FLOAT,sizeof(myVertex),0); // with a buffer bound, the "pointer" is a byte offset into it
glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size());
glDisableClientState(GL_VERTEX_ARRAY);
glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
Vertex buffer objects take us down to 16ns per
triangle--another factor of two improvement. So we can now draw 2
million triangles per frame at 27fps! Of course, they're
flat-shaded, untextured polygons due to the very simple vertex shader,
but now we can add back in just the features we need...
So overall we got:
- A factor of two (720ns -> 300ns) by saving the CPU's work (precalculating the vertices) instead of redoing it each frame.
- A factor of two (300ns -> 160ns) by moving glBegin up and glEnd down.
- Nothing by switching to display lists or vertex arrays (after the previous optimizations, that is!)
- A factor of three (160ns -> 54ns) by switching to triangle strips from isolated triangles.
- A factor of two (54ns -> 32ns) by simplifying the vertex shader.
- A factor of two (32ns -> 16ns) by using vertex buffer objects.
For a combined speedup of roughly 45x (720ns -> 16ns)!
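The arithmetic behind the list can be checked with a few lines of code. This is only a sketch using the per-triangle timings quoted in the list above; "stepSpeedups" and "overallSpeedup" are illustrative names:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Speedup of each step: the timing before it divided by the timing after it.
std::vector<double> stepSpeedups(const std::vector<double> &ns) {
    assert(ns.size() >= 2);
    std::vector<double> out;
    for (std::size_t i = 0; i + 1 < ns.size(); i++)
        out.push_back(ns[i] / ns[i + 1]);
    return out;
}

// Overall speedup: first timing divided by last timing, which equals
// the product of all the per-step speedups.
double overallSpeedup(const std::vector<double> &ns) {
    assert(!ns.empty());
    return ns.front() / ns.back();
}
```

Feeding in the timings 720, 300, 160, 54, 32 and 16 (ns per triangle) gives per-step factors between about 1.7 and 3, and an overall ratio of 720/16 = 45.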
See the example code (Directory, Zip, Tar-gzip)
Further speedup
Any graphics program will be limited by one of three components:
- The CPU
  - Doing complicated calculations in the CPU per vertex can kill performance. If the Windows Task Manager (or UNIX "top") shows your program pegged at 100% CPU utilization, you're CPU limited.
  - Speed up CPU-limited code by compiling with optimization (or using a faster compiler), using SSE instructions, replacing expensive operations (like cosine) with cheap operations (like arithmetic), precalculating stuff into tables, and doing more work on the graphics card in the vertex and fragment shader.
- The Vertex Shader
  - Vertex processing is highly dependent on the exact routine you use to submit your geometry--see above for an example where there's a roughly 45-fold difference between plain old glVertex calls and triangle-stripped Vertex Buffer Objects.
  - The next best way to avoid vertex work is to draw fewer triangles--simplify geometry, improve your early culling, etc.
  - The last thing you should try is shortening your vertex shader. Sometimes this means moving functionality to the CPU or to the fragment shader.
  - Any card using any interface should push at least 1 million triangles per second. glVertex won't get much more than that on any card. A newer card using Vertex Buffer Objects should get dozens to hundreds of millions of triangles per second. Note that this is a much lower rate than fragment processing, so moving work to the fragment shader may be a win.
- The Fragment Shader (or Pixel Shader in DirectX terminology)
  - If you're fragment-shader limited, changing the window size (changing the number of pixels drawn) dramatically impacts your framerate.
  - Drawing fewer pixels will always help a fragment-limited program. This means getting rid of overdraw (drawing the same part of the screen over and over).
  - Turning on Z-test and drawing front-to-back will help, because with "Early Z Culling" the card won't run the fragment shader on pixels that are behind other stuff (where the Z test fails).
  - Making your fragment shader simpler will always help a fragment-limited program.
  - Doing less texturing, or texturing at lower resolution or color fidelity, can help a fragment shader that's limited by texture memory bandwidth.
  - Any modern graphics card will be able to do a few billion fragments per second. Really high-end cards might get tens of billions of fragments per second. Unlike with the vertex shader, the fragment shader's performance isn't influenced as much by the exact routine you call to render stuff.
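The two quick tests in the list (is the CPU pegged, and does the framerate change with window size) can be combined into a rough first guess at the bottleneck. The sketch below is only one way to encode that reasoning: the "FrameStats" fields, the 99% CPU cutoff, and the 1.5x framerate ratio are all assumptions chosen for illustration, not measured thresholds:

```cpp
#include <cassert>

// Hypothetical measurements taken by hand for one scene.
struct FrameStats {
    double cpuUtilization;  // 0..1, as read from Task Manager or top
    double fpsFullWindow;   // framerate at the normal window size
    double fpsSmallWindow;  // framerate with the window shrunk a lot
};

enum Bottleneck { CPU_LIMITED, VERTEX_LIMITED, FRAGMENT_LIMITED };

// Rough guess only: the thresholds are arbitrary assumptions.
Bottleneck classify(const FrameStats &s) {
    assert(s.fpsFullWindow > 0.0 && s.fpsSmallWindow > 0.0);
    if (s.cpuUtilization >= 0.99)
        return CPU_LIMITED;       // program pegged at 100% CPU
    if (s.fpsSmallWindow > 1.5 * s.fpsFullWindow)
        return FRAGMENT_LIMITED;  // fewer pixels made it much faster
    return VERTEX_LIMITED;        // neither test fired
}
```

Whatever the guess, the fixes to try first are the ones listed under that component above.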
Generally speaking, "drawing less crap" will help any program.
"Crap" is stuff that's offscreen, too small to see, or uninteresting.
- Distance-dependent Level of Detail (LOD) is a common trick--use simpler models for faraway stuff. A mesh simplification tool like QSlim can help generate simpler models from complicated ones. This usually won't improve a fragment-shader limited program, but helps a lot with CPU and vertex use.
- Portals, BSP trees, and other view-culling data structures help you only render stuff that might be visible. This can decrease the use of all rendering resources.
- "Occlusion Culling" is when you render the bounding box of an object you're thinking about drawing. If none of the pixels of the bounding box pass the Z test, the object is behind other stuff and can be skipped. The occlusion query extension (ARB_occlusion_query) is the one to use here.
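Distance-dependent LOD usually comes down to a small lookup per object per frame. A minimal sketch, assuming each object has several models ordered from most to least detailed and a hand-tuned list of switch-over distances ("chooseLod" and "maxDistance" are illustrative names):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// maxDistance[i] is the farthest distance at which model i is still used;
// the list must be sorted in increasing order. The return value is the
// index of the model to draw: 0 is the most detailed, and
// maxDistance.size() means "beyond every threshold", i.e. the coarsest
// model (or nothing at all).
std::size_t chooseLod(double distance, const std::vector<double> &maxDistance) {
    assert(distance >= 0.0);
    std::size_t level = 0;
    while (level < maxDistance.size() && distance > maxDistance[level])
        level++;
    return level;
}
```

For example, with thresholds of 10, 50 and 200 units, an object 30 units from the camera gets model 1, and one 1000 units away gets index 3.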
You can sometimes simplify your programs by checking out the assembly
language generated for your GLSL shaders. The only way to get
close to this right now is using the Cg compiler:
cgc -oglsl -profile arbfp1 myShader.glsl -o myShader.afp
(You can download Cg for any platform and card from nVidia.) Generally, longer programs take more time, since most assembly instructions take one clock cycle--see my per-instruction timings here.