cublas vs. gotoblas
Here are my results with sgemm (single) and dgemm (double) on a GTX 280 and an Intel Xeon E5430 (2.66 GHz) from the gris pools. The GFlops are calculated as (2*width^3 - width^2) / time / 10^9, and I used some big square matrices, where cublas performs best. The time is the pure calculation time (no copying/allocation).
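For reference, the flop count and rate can be written as a small Python helper (just a sketch of the formula above; `width` is the matrix dimension and `seconds` the measured kernel time):

```python
def gemm_gflops(width, seconds):
    """GFlops for a width x width GEMM: 2*width^3 - width^2 flops total."""
    flops = 2 * width**3 - width**2
    return flops / seconds / 1e9

# Example: a 4096x4096 sgemm finishing in 0.37 s gives roughly 371 GFlops
print(gemm_gflops(4096, 0.37))
```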
Single precision (sgemm): GTX ~372 GFlops vs. quad-core ~79 GFlops
Double precision (dgemm): ~75 GFlops vs. ~40 GFlops
The double-precision result is not that sensational, but the single-precision speedup is impressive.
One has to mention that gotoblas is individually optimized in assembler for every processor type, while cublas was written by an NVIDIA Forums user (is that right?), so there may be great improvement in the future. Also, the GTX 280/260 were/are the first cards with double-precision support.
I'd be happy for some comments on this to make it a fair comparison.