To debug a kernel in CUDA 4.2, you can call printf() directly inside the kernel, just as in C, instead of using cuPrintf(). However, I noticed that there is a limit on how much trace output reaches stdout, around 4096 records, even though you may have N threads (e.g. 50K) running on the device.
Pre-allocating a data structure on the device to hold this information, and copying it back to the host afterwards, is a safer and better solution.
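A minimal sketch of the pre-allocated buffer idea, in case it helps. The TraceRecord struct, the traced_kernel name, and the doubling computation are all made up for illustration; the only point is that each thread writes into its own slot of a device array, which the host copies back and prints after the kernel finishes.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical per-thread trace record; the fields are illustrative only.
struct TraceRecord {
    int   thread_id;
    float value;
};

// Hypothetical kernel: each thread stores its debug info in its own slot,
// so no atomics or synchronization are needed for the trace buffer.
__global__ void traced_kernel(const float* in, TraceRecord* trace, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float v = in[i] * 2.0f;      // stand-in for the real computation
        trace[i].thread_id = i;
        trace[i].value     = v;
    }
}

int main()
{
    const int n = 50000;             // one record per thread
    std::vector<float> h_in(n);
    for (int i = 0; i < n; ++i) h_in[i] = static_cast<float>(i);

    float*       d_in    = NULL;
    TraceRecord* d_trace = NULL;
    cudaMalloc((void**)&d_in,    n * sizeof(float));
    cudaMalloc((void**)&d_trace, n * sizeof(TraceRecord));   // pre-allocated trace buffer
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    const int threads = 256;
    const int blocks  = (n + threads - 1) / threads;
    traced_kernel<<<blocks, threads>>>(d_in, d_trace, n);

    // The blocking copy waits for the kernel, then brings every record back.
    std::vector<TraceRecord> h_trace(n);
    cudaError_t err = cudaMemcpy(h_trace.data(), d_trace,
                                 n * sizeof(TraceRecord), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Print from the host, where stdout has no device-side buffer limit.
    for (int i = n - 3; i < n; ++i)
        printf("thread %d -> %f\n", h_trace[i].thread_id, h_trace[i].value);

    cudaFree(d_in);
    cudaFree(d_trace);
    return 0;
}
```

Since the records end up in host memory, you can filter, sort, or dump them to a file instead of scrolling through interleaved printf() output.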
How? When I call printf() inside a kernel, I get the error: calling a __host__ function("printf") from a __global__ function.