Graphics Card Performance

CS 481 Guest Lecture, Dr. Lawlor, 2006/04/20

Most OpenGL programs don't take advantage of the performance inherent in the graphics card.

For example, take a typical simple OpenGL program:
	for (int iy=0;iy<n;iy++)
	for (int ix=0;ix<n;ix++) {
		glBegin(GL_TRIANGLES);
		double x=ix*sz-0.5, y=iy*sz-0.5;
		glVertex3fv(makeVertex(x,y));
		glVertex3fv(makeVertex(x,y+sz));
		glVertex3fv(makeVertex(x+sz,y));
		glEnd();
	}
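Here "makeVertex" computes each vertex location on the fly.  It isn't shown in the notes; a hypothetical version (with the two cosine calls mentioned below) might look like:
	/* Hypothetical makeVertex: computes a vertex of a wavy height field.
	   Returns a pointer to a static float[3], which glVertex3fv copies
	   immediately. */
	const float *makeVertex(double x,double y) {
		static float v[3];
		v[0]=(float)x;
		v[1]=(float)y;
		v[2]=(float)(0.1*cos(10*x)*cos(10*y));
		return v;
	}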
If "makeVertex" does a few flops (say, two cosine calls), this might take 700ns per triangle.  But if we just precalculate and save the vertex locations (avoiding the floating point calculations in "makeVertex"), and draw like this:
	for (unsigned int i=0;i<vertices.size();) {
		glBegin(GL_TRIANGLES);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glEnd();
	}
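(A sketch of the setup this assumes: a "myVertex" struct wrapping a float[3] named "loc"--so the sizeof(myVertex) and .loc references below work--filled in once at startup.)
	#include <vector>
	#include <cstring>

	struct myVertex { float loc[3]; };
	std::vector<myVertex> vertices;

	/* Hypothetical setup: run makeVertex once per vertex, outside the
	   draw loop, and save the results. */
	void precalculateVertices(int n,double sz) {
		for (int iy=0;iy<n;iy++)
		for (int ix=0;ix<n;ix++) {
			double x=ix*sz-0.5, y=iy*sz-0.5;
			myVertex v;
			memcpy(v.loc,makeVertex(x,y),sizeof(v.loc));    vertices.push_back(v);
			memcpy(v.loc,makeVertex(x,y+sz),sizeof(v.loc)); vertices.push_back(v);
			memcpy(v.loc,makeVertex(x+sz,y),sizeof(v.loc)); vertices.push_back(v);
		}
	}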
Suddenly we're more than twice as fast--just 300ns per triangle.  But doing a glBegin/glEnd for every triangle is still slow, because the driver has to do extra work on each glBegin (checking the current texture, flushing the pipe to the graphics card, etc.).  So if we instead do:
	glBegin(GL_TRIANGLES);
	for (unsigned int i=0;i<vertices.size();) {
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
		glVertex3fv(vertices[i++].loc);
	}
	glEnd();
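(An aside on where per-triangle figures like these come from: one rough way to measure is to draw the whole mesh, call glFinish to wait until the card has actually finished, and divide the wall-clock time by the triangle count.  A sketch, assuming a hypothetical drawAllTriangles and a POSIX timer:)
	#include <GL/gl.h>
	#include <sys/time.h>

	double timeInSeconds(void) { /* wall-clock time, POSIX */
		struct timeval tv; gettimeofday(&tv,0);
		return tv.tv_sec+1.0e-6*tv.tv_usec;
	}

	double nsPerTriangle(int nTriangles) {
		double start=timeInSeconds();
		drawAllTriangles(); /* whichever drawing method we're testing */
		glFinish();         /* wait until the card is actually done */
		return 1.0e9*(timeInSeconds()-start)/nTriangles;
	}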
With a single glBegin/glEnd around the whole mesh, we're down to 160ns per triangle.  If we use "vertex arrays", which just point OpenGL at our array of vertices like this:
	glEnableClientState(GL_VERTEX_ARRAY);
	glVertexPointer(3,GL_FLOAT,sizeof(myVertex),vertices[0].loc);
	glDrawArrays(GL_TRIANGLES, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);
This doesn't actually speed things up.  At all.  But it's at least easy to do.  We can then switch to triangle strips just by generating our vertices in the proper order (one possible ordering is sketched after the next code block), and then do:
	glEnableClientState(GL_VERTEX_ARRAY);
	glVertexPointer(3,GL_FLOAT,sizeof(myVertex),vertices[0].loc);
	glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);
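(One way to generate strip order, as a sketch: walk across one grid row, alternating between its bottom and top edges.  makeStripVertex is a hypothetical helper that builds a myVertex; to cover several rows with a single glDrawArrays call you'd stitch the rows together with repeated, degenerate vertices.)
	/* Sketch: one triangle strip for the grid row starting at height y.
	   2*(n+1) vertices make 2*n triangles. */
	for (int ix=0;ix<=n;ix++) {
		double x=ix*sz-0.5;
		vertices.push_back(makeStripVertex(x,y));    /* bottom edge of row */
		vertices.push_back(makeStripVertex(x,y+sz)); /* top edge of row */
	}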
Drawing strips takes just 54ns per triangle--a 3x speedup!  It's faster because triangle strips share vertices between adjacent triangles: after the first triangle, each new triangle adds only one new vertex instead of three, so the card does less transform work per triangle.  We can decrease the time per vertex even further by using a very simple vertex shader like:
	void main(void)
	{
		gl_Position = ftransform();
	}
(For a simple or fixed camera, even this can be improved by precalculation!)  Because the vertex shader doesn't set up any interpolated values (colors, normals, texture coordinates), the fragment shader can only do really silly work, like this:
	void main(void)
	{
		gl_FragColor=vec4(0,1,0,1);
	}
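(For reference, a minimal sketch of compiling and linking such shaders via the OpenGL 2.0 entry points, with all error checking omitted; 2006-era drivers may expose the equivalent ARB_shader_objects calls instead:)
	/* Build a GLSL program object from vertex and fragment shader source. */
	GLuint makeProgram(const char *vSrc,const char *fSrc) {
		GLuint vs=glCreateShader(GL_VERTEX_SHADER);
		glShaderSource(vs,1,&vSrc,NULL); glCompileShader(vs);
		GLuint fs=glCreateShader(GL_FRAGMENT_SHADER);
		glShaderSource(fs,1,&fSrc,NULL); glCompileShader(fs);
		GLuint prog=glCreateProgram();
		glAttachShader(prog,vs); glAttachShader(prog,fs);
		glLinkProgram(prog);
		return prog;
	}
	/* ...then call glUseProgram(prog) before drawing. */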
With these vertex and fragment shaders, the same code takes just 32ns per triangle.  But we can do better yet!  If we build a "vertex buffer object", as described in the ARB_vertex_buffer_object extension spec, like this (do this once at startup):
	vertex_buffer=0;
	glGenBuffersARB(1, &vertex_buffer);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
	glBufferDataARB(GL_ARRAY_BUFFER_ARB,
		sizeof(myVertex)*vertices.size(),
		&vertices[0], GL_STATIC_DRAW_ARB);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
This actually copies the vertex locations (the "vertices" array) into graphics card memory, which the card can read much faster than CPU memory (which it can only reach over the AGP or PCI bus).  The GL_STATIC_DRAW_ARB hint tells the driver we'll fill the buffer once and draw from it many times, so it's safe to park in video memory.  You can now render the vertices in a fashion similar to vertex arrays:
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, vertex_buffer);
	glEnableClientState(GL_VERTEX_ARRAY);
	glVertexPointer(3,GL_FLOAT,sizeof(myVertex),0); /* 0 = byte offset into the bound buffer */
	glDrawArrays(GL_TRIANGLE_STRIP, 0, vertices.size());
	glDisableClientState(GL_VERTEX_ARRAY);
	glBindBufferARB(GL_ARRAY_BUFFER_ARB, 0);
Vertex buffer objects take us down to 16ns per triangle--another factor-of-two improvement.  So we can now draw 2 million triangles per frame at 27fps!  Of course, they're flat-shaded, untextured polygons due to the very simple shaders, but now we can add back in just the features we need...

So overall we got:
	700ns/triangle	glBegin/glEnd per triangle, vertices computed on the fly
	300ns/triangle	precalculated vertices
	160ns/triangle	one glBegin/glEnd around the whole mesh
	160ns/triangle	vertex arrays (no change)
	54ns/triangle	triangle strips
	32ns/triangle	trivial vertex and fragment shaders
	16ns/triangle	vertex buffer objects
For a combined speedup of over 40x!
See the example code (Directory, Zip, Tar-gzip)

Further speedup

Any graphics program will be limited by one of three components: the CPU (and the bus it uses to feed the card), the vertex processor, or the fragment/pixel processor.
Generally speaking, "drawing less crap" will help a program no matter which of these is the bottleneck.  "Crap" is stuff that's offscreen, too small to see, or uninteresting.  For example, offscreen geometry can be skipped with a view-frustum test:
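(A minimal sketch of frustum culling, assuming the six frustum planes have been extracted with inward-pointing normals: skip any chunk of geometry whose bounding sphere lies entirely outside one plane.)
	struct Plane { float nx,ny,nz,d; }; /* inside when nx*x+ny*y+nz*z+d >= 0 */

	/* Is the sphere (cx,cy,cz,r) at least partly inside the frustum? */
	bool sphereVisible(const Plane planes[6],
			float cx,float cy,float cz,float r) {
		for (int p=0;p<6;p++) {
			float dist=planes[p].nx*cx+planes[p].ny*cy+planes[p].nz*cz+planes[p].d;
			if (dist<-r) return false; /* completely outside this plane: cull it */
		}
		return true; /* possibly visible: draw it */
	}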
You can also sometimes simplify your shaders by examining the assembly language generated from your GLSL code.  Right now the only way to get at this is to run the Cg compiler:

    cgc -oglsl -profile arbfp1 myShader.glsl -o myShader.afp

(You can download Cg for any platform and card from nVidia.)   Generally, longer programs take more time, since most assembly instructions take one clock cycle--see my per-instruction timings here.
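For example, the trivial fragment shader above compiles down to something like this one-instruction ARB fragment program (the exact output varies by compiler version):
	!!ARBfp1.0
	MOV result.color, {0, 1, 0, 1};
	END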