Graphics Card Performance

CS 481 Guest Lecture, Dr. Lawlor, 2006/04/20

Most OpenGL programs don't take advantage of the performance inherent in the graphics card.

For example, take a typical simple OpenGL program:

	for (int iy=0;iy<n;iy++)
	for (int ix=0;ix<n;ix++) {
		glBegin(GL_TRIANGLES);
		double x=ix*sz-0.5, y=iy*sz-0.5;
		glVertex3fv(makeVertex(x,y));
		glVertex3fv(makeVertex(x,y+sz));
		glVertex3fv(makeVertex(x+sz,y));
		glEnd();
	}

If "makeVertex" does a few flops (say, two cosine calls), this might take about 700ns per triangle. But if we just precalculate and save the vertex locations (avoiding the floating point calculations in "makeVertex"), and draw like this:

	for (unsigned int i=0;i<vertices.size();)
	{
		glBegin(GL_TRIANGLES);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glEnd();
	}
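The notes don't show how "vertices" gets filled. A minimal sketch, assuming a myVertex struct that holds just a "loc" array of three floats, and a hypothetical makeVertex that spends two cosine calls per vertex (the lecture's real makeVertex isn't shown), might build the array once at startup like this:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed layout: the notes only show that myVertex has a "loc" member
// usable with glVertex3fv, so three floats is a guess.
struct myVertex {
    float loc[3];
};

// Hypothetical stand-in for the lecture's makeVertex: two cosine calls
// give the height of a bumpy surface at (x, y).
myVertex makeVertex(double x, double y) {
    myVertex v;
    v.loc[0] = (float)x;
    v.loc[1] = (float)y;
    v.loc[2] = (float)(0.1 * std::cos(10.0 * x) * std::cos(10.0 * y));
    return v;
}

// Build the same n*n isolated triangles as the first loop, but only once,
// so the per-frame draw loop does no floating point work at all.
std::vector<myVertex> precalculateVertices(int n) {
    assert(n > 0);
    double sz = 1.0 / n;  // assumed: the grid spans [-0.5, 0.5]
    std::vector<myVertex> vertices;
    vertices.reserve(3 * (std::size_t)n * n);
    for (int iy = 0; iy < n; iy++)
        for (int ix = 0; ix < n; ix++) {
            double x = ix * sz - 0.5, y = iy * sz - 0.5;
            vertices.push_back(makeVertex(x, y));
            vertices.push_back(makeVertex(x, y + sz));
            vertices.push_back(makeVertex(x + sz, y));
        }
    return vertices;
}
```

The draw loop above then only reads vertices[i].loc each frame.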

Suddenly we're twice as fast--just 300ns per triangle. But doing a glBegin/glEnd for every triangle is pretty slow, because the driver has to do extra work on each glBegin (checking the texture, flushing the pipe to the graphics card, etc.). So if we just do:

	glBegin(GL_TRIANGLES);
	for (unsigned int i=0;i<vertices.size();)
	{
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
	}
	glEnd();

Now we're down to 160ns per triangle. If we use "vertex arrays", which just point OpenGL at our array of vertices like this:

	glEnableClientState(GL_VERTEX_ARRAY);	
	glVertexPointer(3,GL_FLOAT,
		sizeof(myVertex),vertices[0].loc);
	glDrawArrays(GL_TRIANGLES, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);

This doesn't actually speed things up. At all. But it's at least easy to do. We can then switch to triangle strips by just generating our vertices in the proper order, and then do:

	glEnableClientState(GL_VERTEX_ARRAY);	
	glVertexPointer(3,GL_FLOAT,
		sizeof(myVertex),vertices[0].loc);
	glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);
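The "proper order" isn't spelled out in these notes. One possible ordering, sketched here under the assumption of a flat n-by-n grid with two triangles per cell, walks each row as a zigzag and joins rows with repeated vertices (which produce zero-area triangles the card throws away). The names makeGridStrip, gridPoint, and countVisibleTriangles are made up for this illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct myVertex { float loc[3]; };  // assumed layout, as in the notes' usage

// Flat grid point at integer grid coordinates (ix, iy); a real program
// would compute a height here instead of z = 0.
static myVertex gridPoint(int ix, int iy, float sz) {
    myVertex v = {{ ix * sz - 0.5f, iy * sz - 0.5f, 0.0f }};
    return v;
}

// Emit one long triangle strip covering an n-by-n grid of cells.
std::vector<myVertex> makeGridStrip(int n) {
    assert(n > 0);
    float sz = 1.0f / n;
    std::vector<myVertex> strip;
    for (int iy = 0; iy < n; iy++) {
        if (iy > 0) {
            // Join rows: repeat the previous row's last vertex and this
            // row's first vertex, giving four zero-area triangles.
            myVertex last = strip.back();
            strip.push_back(last);
            strip.push_back(gridPoint(0, iy, sz));
        }
        for (int ix = 0; ix <= n; ix++) {
            strip.push_back(gridPoint(ix, iy, sz));      // bottom of column
            strip.push_back(gridPoint(ix, iy + 1, sz));  // top of column
        }
    }
    return strip;
}

// Count the strip's triangles that have nonzero area (in the xy plane).
std::size_t countVisibleTriangles(const std::vector<myVertex>& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i + 2 < s.size(); i++) {
        const float* a = s[i].loc;
        const float* b = s[i + 1].loc;
        const float* c = s[i + 2].loc;
        float cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0]);
        if (cross != 0.0f) count++;
    }
    return count;
}
```

An array built this way can be handed unchanged to the glDrawArrays(GL_TRIANGLE_STRIP, ...) call above.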

This takes just 54ns per triangle--a 3x speedup! It's faster because triangle strips share vertices between adjacent triangles: after the first two vertices, each new vertex completes another triangle, so the card processes about one vertex per triangle instead of three. We can decrease the time per vertex even further by using a very simple vertex shader like:

void main(void) 
{
	gl_Position = ftransform();
}

(For a simple or fixed camera, even this can be improved by precalculation!) Because the vertex shader doesn't pass anything along (no colors, normals, or texture coordinates), we can only do really silly work in the fragment shader like this:

void main(void)
{
	gl_FragColor=vec4(0,1,0,1);
}

With these vertex and fragment shaders, the same code takes just 32ns per triangle. But we can do even better yet! If we build a "vertex buffer object", as described in the ARB_vertex_buffer_object extension spec, like this (do this once at startup):

	vertex_buffer=0;
	glGenBuffersARB(1, &vertex_buffer);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
	glBufferDataARB(GL_ARRAY_BUFFER_ARB,
		 sizeof(myVertex)*vertices.size(), 
		 &vertices[0], GL_STATIC_DRAW_ARB);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);

This actually copies the vertex locations (the "vertices" array) into graphics card memory, which the card can read much faster than CPU memory (which it can only reach over the AGP or PCI bus). You can now render the vertices in a fashion similar to vertex arrays:

	glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
	glEnableClientState(GL_VERTEX_ARRAY);	
	glVertexPointer(3,GL_FLOAT,
		sizeof(myVertex),0);
	glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
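One detail worth spelling out: while a buffer object is bound, the last argument of glVertexPointer is treated as a byte offset into that buffer rather than a CPU address, which is why the call above passes 0. Here's a sketch of computing such offsets for an interleaved vertex. The myVertexN struct (location plus normal) and the bufferOffset helper are hypothetical, since the notes never show the full myVertex:

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical interleaved vertex: location followed by a normal.
struct myVertexN {
    float loc[3];
    float normal[3];
};

// Stride between consecutive vertices, and where each attribute starts.
const std::size_t kStride = sizeof(myVertexN);
const std::size_t kLocOffset = offsetof(myVertexN, loc);
const std::size_t kNormalOffset = offsetof(myVertexN, normal);

// Package a byte offset as the pointer-typed argument that
// glVertexPointer / glNormalPointer take when a buffer is bound.
inline const void* bufferOffset(std::size_t bytes) {
    return reinterpret_cast<const void*>(bytes);
}

// Sanity check that the struct is tightly packed, as the stride assumes.
inline void checkLayout() {
    assert(kStride == 6 * sizeof(float));
    assert(kNormalOffset == 3 * sizeof(float));
}

// With the buffer bound, the calls would then look like:
//   glVertexPointer(3, GL_FLOAT, kStride, bufferOffset(kLocOffset));
//   glNormalPointer(GL_FLOAT, kStride, bufferOffset(kNormalOffset));
```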

Vertex buffer objects take our time down to 16ns per triangle--another factor of two improvement. So we can now draw 2 million triangles per frame at 27fps! Of course, they're flat-shaded, untextured polygons due to the very simple vertex shader, but now we can add back in just the features we need...

So overall we got:

A factor of two (720ns->300ns) by saving work on the CPU instead of regenerating it each frame.
A factor of two (300ns->160ns) by moving glBegin up and glEnd down.
Nothing by switching to display lists or vertex arrays (after the previous optimizations, that is!)
A factor of three (160ns -> 54ns) by switching to triangle strips from isolated triangles.
A factor of two (54ns -> 32ns) by simplifying the vertex shader.
A factor of two (32ns -> 16ns) by using vertex buffer objects.

For a combined speedup of about 45x (720ns -> 16ns)!
See the example code (Directory, Zip, Tar-gzip)
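To turn the per-triangle timings above into throughput and frame rates, the arithmetic is simple enough to sketch. The function names here are made up for illustration, and the frame-rate figure is only a ceiling that assumes triangle drawing is the sole per-frame cost:

```cpp
#include <cassert>

// Triangles drawn per second at a given cost per triangle.
double trianglesPerSecond(double nsPerTriangle) {
    assert(nsPerTriangle > 0);
    return 1e9 / nsPerTriangle;
}

// Upper bound on frame rate if drawing triangles were the only
// per-frame cost (it never quite is).
double maxFramesPerSecond(double nsPerTriangle, double trianglesPerFrame) {
    assert(trianglesPerFrame > 0);
    return trianglesPerSecond(nsPerTriangle) / trianglesPerFrame;
}

// Ratio between two per-triangle timings.
double speedup(double beforeNs, double afterNs) {
    assert(afterNs > 0);
    return beforeNs / afterNs;
}
```

With the numbers above, trianglesPerSecond(16) is 62.5 million, speedup(720, 16) is 45, and maxFramesPerSecond(16, 2e6) is 31.25. The measured 27fps sits a little under that ceiling, presumably because of per-frame overhead outside the triangle loop, though the notes don't say.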

Further speedup

Any graphics program will be limited by one of three components:

The CPU

Doing complicated calculations in the CPU per vertex can kill performance. If the Windows Task Manager (or UNIX "top") shows your program pegged at 100% CPU utilization, you're CPU limited.
Speed up CPU-limited code by compiling with optimization (or using a faster compiler), using SSE instructions, replacing expensive operations (like cosine) with cheap operations (like arithmetic), precalculating stuff into tables, and doing more work on the graphics card in the vertex and fragment shader.
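As one illustration of "precalculating stuff into tables", here is a minimal cosine lookup table. The CosTable class is hypothetical, and it trades accuracy for speed: with 4096 entries and nearest-entry lookup, the error stays under about 0.001. Whether it actually beats your library's cosine depends on the machine, so time it before relying on it.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

const double kTwoPi = 6.283185307179586;

// Hypothetical table of cosine values sampled evenly over one full turn.
class CosTable {
public:
    explicit CosTable(std::size_t entries) : table_(entries) {
        assert(entries > 0);
        for (std::size_t i = 0; i < entries; i++)
            table_[i] = (float)std::cos(kTwoPi * i / entries);
    }

    // Nearest-entry lookup; works for any angle, positive or negative.
    float lookup(float radians) const {
        float turns = radians / (float)kTwoPi;
        turns -= std::floor(turns);  // wrap into [0, 1]
        std::size_t i =
            (std::size_t)(turns * table_.size() + 0.5f) % table_.size();
        return table_[i];
    }

private:
    std::vector<float> table_;
};
```

Build the table once at startup (for example, CosTable cosTable(4096);) and call cosTable.lookup(angle) in the per-vertex loop in place of the cosine call.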

The Vertex Shader

Vertex processing is highly dependent on the exact routine you use to create your geometry--see above for an example where there's a roughly 45-fold difference between plain old glVertex calls and triangle-stripped Vertex Buffer Objects.
The next best way to avoid vertex work is to draw fewer triangles--simplify geometry, improve your early culling, etc.
The last thing you should try is shortening your vertex shader. Sometimes this means moving functionality to the CPU or to the fragment shader.
Any card using any interface should push at least 1 million triangles per second. glVertex won't get much more than that on any card. A newer card using Vertex Buffer Objects should get dozens to hundreds of millions of triangles per second. Note that this is much slower than fragment processing, so moving work to the fragment shader may be a win.

The Fragment Shader (or Pixel Shader in DirectX terminology)

If you're fragment-shader limited, changing the window size (changing the number of pixels drawn) dramatically impacts your framerate.
Drawing fewer pixels will always help a fragment-limited program. This means getting rid of overdraw (drawing the same part of the screen over and over).
Turning on Z-test and drawing front-to-back will help, because with "Early Z Culling" the card won't run the fragment shader on pixels that are behind other stuff (where the Z test fails).
Making your fragment shader simpler will always help a fragment-limited program.
Doing less texturing, or texturing at lower resolution or color fidelity, can help a fragment shader that's limited by texture memory bandwidth.
Any modern graphics card will be able to do a few billion fragments per second. Really high-end cards might get a few dozen billion fragments per second. Unlike with the vertex shader, the fragment shader's performance isn't influenced as much by the exact routine you call to render stuff.
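The front-to-back drawing order mentioned above can be set up on the CPU before issuing draw calls. This is only a sketch: the Drawable struct and the idea of sorting by each object's center are assumptions for illustration, and sorting whole objects only approximates a true per-pixel front-to-back order.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Vec3 { float x, y, z; };

// Hypothetical drawable: an id plus the center of its bounding volume.
struct Drawable {
    int id;
    Vec3 center;
};

// Distance of point p from the eye, measured along the view direction.
static float viewDepth(const Vec3& p, const Vec3& eye, const Vec3& viewDir) {
    return (p.x - eye.x) * viewDir.x +
           (p.y - eye.y) * viewDir.y +
           (p.z - eye.z) * viewDir.z;
}

// Sort so the nearest objects are drawn first; with the Z test on, later
// (farther) objects then fail the test wherever they're hidden.
void sortFrontToBack(std::vector<Drawable>& objects,
                     const Vec3& eye, const Vec3& viewDir) {
    assert(viewDir.x != 0 || viewDir.y != 0 || viewDir.z != 0);
    std::sort(objects.begin(), objects.end(),
              [&](const Drawable& a, const Drawable& b) {
                  return viewDepth(a.center, eye, viewDir) <
                         viewDepth(b.center, eye, viewDir);
              });
}
```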

Generally speaking, "drawing less crap" will help any program. "Crap" is stuff that's offscreen, too small to see, or uninteresting.

Distance-dependent Level of Detail (LOD) is a common trick--use simpler models for faraway stuff. A mesh simplification tool like QSlim can help generate simpler models from complicated ones. This usually won't improve a fragment-shader limited program, but helps a lot with CPU and vertex use.
Portals, BSP trees, and other view-culling data structures help you only render stuff that might be visible. This can decrease the use of all rendering resources.
"Occlusion Culling" is when you render the bounding box of an object you're thinking about drawing. If none of the pixels of the bounding box passed the Z test, the object is behind other stuff. Occlusion Query is the extension to use here.
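For the distance-dependent LOD trick in particular, the selection logic can be tiny. Here's a sketch, assuming a hypothetical list of switch-over distances (the actual distances would have to be tuned by eye for each model):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Pick which model to draw: 0 is the most detailed, and each threshold
// crossed moves to the next simpler model. Thresholds must be ascending.
std::size_t selectLod(float distance, const std::vector<float>& thresholds) {
    assert(std::is_sorted(thresholds.begin(), thresholds.end()));
    std::size_t level = 0;
    while (level < thresholds.size() && distance >= thresholds[level])
        level++;
    return level;
}
```

With thresholds of 10, 50, and 200 units, an object 75 units away gets model 2, and anything 200 or more units away gets model 3, the simplest one (which could also mean "don't draw it at all").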

You can sometimes simplify your programs by checking out the assembly language generated for your GLSL shaders. The only way to get close to this right now is using the Cg compiler:

cgc -oglsl -profile arbfp1 myShader.glsl -o myShader.afp

(You can download Cg for any platform and card from nVidia.) Generally, longer programs take more time, since most assembly instructions take one clock cycle--see my per-instruction timings here.