cublas vs. gotoblas

Moderator: Programming Massively Parallel Processors

everyon
Newbie
Posts: 4
Registered: 22 Apr 2009 09:19

cublas vs. gotoblas

Post by everyon » 30 Apr 2009 10:54

Since the naive matrix multiplication (with three nested loops) ran terribly slowly on my CPU compared to the GPU, I tested one of the best implementations available for the CPU against the one that ships with CUDA in CUBLAS. Fortunately, GotoBLAS is already parallelized.
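
For reference, the naive version I mean looks roughly like this (a minimal sketch, plain C, row-major storage):

Code:

/* C = A * B for n x n matrices; the slow three-nested-loop baseline */
void naive_sgemm(int n, const float *A, const float *B, float *C)
{
    for (int i = 0; i < n; ++i)
        for (int j = 0; j < n; ++j) {
            float sum = 0.0f;
            for (int k = 0; k < n; ++k)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
}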

Here are my results with sgemm (single) and dgemm (double) on a GTX 280 and an Intel Xeon E5430 (2.66 GHz) from the GRIS pools. The GFlops are calculated as (2*width^3 - width^2) / time / 10^9, and I used some big square matrices, where CUBLAS performs best. The time is the pure calculation time (no copying/allocation).
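
If anyone wants to reproduce the GPU side, here is a minimal sketch of such a measurement (legacy CUBLAS API from the CUDA 2.x toolkit; the matrix size and dummy data are just examples, and only the cublasSgemm call itself is timed, no copies):

Code:

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>
#include <cublas.h>

int main(void)
{
    const int n = 4096;                  /* example size: big square matrix */
    size_t bytes = (size_t)n * n * sizeof(float);

    float *h_A = (float*)malloc(bytes);
    for (int i = 0; i < n * n; ++i) h_A[i] = 1.0f;   /* dummy data */

    cublasInit();
    float *d_A, *d_B, *d_C;
    cublasAlloc(n * n, sizeof(float), (void**)&d_A);
    cublasAlloc(n * n, sizeof(float), (void**)&d_B);
    cublasAlloc(n * n, sizeof(float), (void**)&d_C);
    cublasSetMatrix(n, n, sizeof(float), h_A, n, d_A, n);
    cublasSetMatrix(n, n, sizeof(float), h_A, n, d_B, n);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    /* time only the SGEMM call: C = 1.0 * A * B + 0.0 * C */
    cudaEventRecord(start, 0);
    cublasSgemm('N', 'N', n, n, n, 1.0f, d_A, n, d_B, n, 0.0f, d_C, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms;
    cudaEventElapsedTime(&ms, start, stop);
    double secs  = ms / 1000.0;
    double flops = 2.0 * n * (double)n * n - (double)n * n;
    printf("n=%d: %.3f s, %.1f GFlops\n", n, secs, flops / secs / 1e9);

    cublasFree(d_A); cublasFree(d_B); cublasFree(d_C);
    cublasShutdown();
    free(h_A);
    return 0;
}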

Single Precision:
GTX: ~372 GFlops vs. Quad-core: ~79 GFlops

Double Precision:
GTX: ~75 GFlops vs. Quad-core: ~40 GFlops

In double precision it's not that sensational, but the single-precision result is impressive.

One has to mention that GotoBLAS is individually optimized in assembly for every processor type, while CUBLAS was written by an NVIDIA forums user (right?). So there may be great improvements in the future. Also, the GTX 280 (and 260) were the first double-precision cards.

I'd be happy about some comments on this, so we get a fair comparison.
