## Exercise 1

franzel
Freshman
Posts: 21
Joined: Apr 6, 2009 20:22

### Exercise 1

Exercise 1 is now available. The submission deadline is May 12th, 2pm.

There will be one bonus point if and only if you solve all items of the exercise. Make sure to include a text file that presents your results from item 3).

Sylvester
Newbie
Posts: 6
Joined: Apr 22, 2009 20:51

### Re: Exercise 1

I have a question about time measurement: I tested my implementation and CUBLAS with the cutTimer.
No matter how large I choose the matrix dimension, say 3200x3200, I get (kernel) times like 0.07 ms. I'm sure CUDA can be fast, but that seems improbable. I use the timer like this:

```
cutStartTimer(timer);
// kernel or cublasSgemm
cutStopTimer(timer);
```

Is this not the proper way to use the timer?

franzel
Freshman
Posts: 21
Joined: Apr 6, 2009 20:22

### Re: Exercise 1

Hello Sylvester,

Your code looks right. You've stumbled upon a known problem in CUDA: your program exited prematurely without any error message. This can easily be verified by looking at the result of your computation.

I came across the same problem after adjusting my matrix multiplication dimensions. When I chose matrix dimensions over a certain size, the computation ended faster than expected. Under CUDA 1.x the "debug" compiler flag didn't raise any error message; maybe this is one of the things that got fixed in CUDA 2.x.

Sylvester
Newbie
Posts: 6
Joined: Apr 22, 2009 20:51

### Re: Exercise 1

I forgot to mention: after the computation on the graphics card I always compare the result with the CPU computation, and (I'm going to check those routines again) the differences were moderate (maximum difference in a single entry: 1e-4). So either my comparison function is incorrect or the results are reasonable.

franzel
Freshman
Posts: 21
Joined: Apr 6, 2009 20:22

### Re: Exercise 1

Did you reduce your matrix size? If that helps, you definitely know where this is coming from.

I can recall a similar situation. It is possible that you get the same interval of device memory allocated every time you run your program. Now, if you don't set a different seed for your pseudo-random number generator on each run, you will get the same numbers for your input matrices every time. These two circumstances combined will give you the result of the multiplication (from an earlier run) before you even start the computation.

everyon
Newbie
Posts: 4
Joined: Apr 22, 2009 09:19

### Re: Exercise 1

Did you copy your results back to host memory, or did you set a syncthreads?
If not, after calling the kernel the host gets control back immediately and doesn't wait for the device to finish. With syncthreads you tell it to wait, and the copy instruction does this implicitly.

Sylvester
Newbie
Posts: 6
Joined: Apr 22, 2009 20:51

### Re: Exercise 1

Well, my intention was to compare only the kernels of my implementation and CUBLAS. As far as I know, a syncthreads cannot be called from the host. I guess I'll just measure the time including the copying...

baecher
Newbie
Posts: 8
Joined: Apr 6, 2009 14:42

### Re: Exercise 1

Note that there is __syncthreads() (which is used within the kernel) and cudaThreadSynchronize() which is used on the host side since the kernel executes asynchronously. Additionally, after cudaThreadSynchronize(), you might want to call cudaGetLastError() and test the return value for cudaSuccess. See section 3.2.9 in the programming guide for further reading on this.

Hope this helps.
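A sketch of that timing pattern (kernel name, launch configuration, and timer setup are placeholders; the cutTimer functions come from the SDK's cutil library):

```cuda
cutStartTimer(timer);
matMulKernel<<<grid, block>>>(dA, dB, dC, n);  // launch returns immediately
cudaThreadSynchronize();                       // block until the kernel is done
cutStopTimer(timer);

// Check whether the kernel actually ran to completion.
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("kernel failed: %s\n", cudaGetErrorString(err));
```

Without the synchronization call, the timer only measures the (nearly instant) kernel launch, which would explain the 0.07 ms readings.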

everyon
Newbie
Posts: 4
Joined: Apr 22, 2009 09:19

### Re: Exercise 1

Yeah, that's what I meant... mixed up the names...

franzel
Freshman
Posts: 21
Joined: Apr 6, 2009 20:22

### Re: Exercise 1

In fact, you do want to compare the benefit of CUDA over a CPU implementation, and this includes the necessary transfer and setup operations.

It is true that you cannot synchronize your host code with the syncthreads statement. The appropriate function is cudaThreadSynchronize(), which can be found in the CUDA Reference Manual or in the CUDA Programming Guide (4.5.1.5).
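A measurement including the transfers might look like this (a sketch; buffer names, sizes, and the kernel are placeholders):

```cuda
cutStartTimer(timer);
cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);  // upload inputs
cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);
matMulKernel<<<grid, block>>>(dA, dB, dC, n);
cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);  // blocks until the kernel finishes
cutStopTimer(timer);
```

No explicit cudaThreadSynchronize() is needed here, since the device-to-host cudaMemcpy is synchronous and implicitly waits for the kernel.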

Sylvester
Newbie
Posts: 6
Joined: Apr 22, 2009 20:51

### Re: Exercise 1

OK. Anyhow, I wanted to compare just the kernels... I managed to overlook cudaThreadSynchronize while searching the Programming Guide for "sync"; I'll try that. And then, finally, get some times with copying/setup.
EDIT: OK, now everything makes sense. The GPU is reasonably fast, and my implementation is not faster than CUBLAS.

Here are some results (with copy times):
Error shows the absolute summed difference to the CPU result / the maximum absolute difference in a single entry of the GPU result matrix compared to the CPU result. The input matrices had entries in [-1,1].

| dim  | CPU (ms) | GPU (ms) | CUBLAS (ms) | GPU error     | CUBLAS error  |
|------|----------|----------|-------------|---------------|---------------|
| 16   | 0.004    | 0.31     | 0.18        | 2.2e-5/1e-6   | 2.2e-5/1e-6   |
| 160  | 5.83     | 0.643    | 0.821       | 2e-2/1e-5     | 1.2e-2/5e-6   |
| 800  | 1823.37  | 23.86    | 16.04       | 2.28/5.1e-5   | 1.1/1.7e-5    |
| 1024 | 4577.5   | 35.56    | 17.71       | 4.75/9.0e-5   | 2.2/2.3e-5    |
| 1600 | 21619.8  | 160.21   | 57.49       | 17.94/1.3e-4  | 7.41/3.4e-5   |
| 3200 | 254570.9 | 1310.23  | 368.37      | 141.4/2.5e-4  | 49.7/6.1e-5   |
| 4000 | 450939.7 | 2620.3   | 1065.0      | 271.92/3.7e-4 | 90.54/7.0e-5  |

Calculations were done on a GRIS pool computer; I think it was one with a 9800. Results from others would be interesting.