Wednesday, April 15, 2015

MM_Prefetching

http://www.seas.upenn.edu/~cis565/Lectures2011S/Lecture12.pdf

http://simulationcorner.net/index.php?page=fastmatrixvector



Sunday, February 22, 2015

cuda ptx

1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD

The binaries are in ELF format.
nvcc embeds cubin files into the host executable file;
they can be generated separately by using the "-cubin" option.

cuobjdump: works on both cubin files and host binaries
nvdisasm: works on cubin files only, but supports control flow analysis and richer output

For example, I have a matrix multiplication app called matrixMul.

% cuda elf sections
cuobjdump -elf matrixMul

% cuda assembly
cuobjdump -sass matrixMul

% extract ptx and sass from elf
cuobjdump matrixMul -ptx -sass

% list the different cubin files for different architectures
cuobjdump a.out -lelf

% extract all the cubins from the binary
cuobjdump matrixMul -xelf all


Assume I want to analyze the architecture with CUDA compute capability 3.0.
The cubin extracted in the previous step is matrixMul.sm_30.cubin.

% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin

% to render the CFG as a PNG via the DOT graph description language
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng


% to show the register liveness range information
nvdisasm -plr matrixMul.sm_30.cubin

Monday, February 16, 2015

Instruction Level Parallelism and Thread Level Parallelism

A good tutorial from Prof. John Owens from UC Davis.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf


TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs

Interesting matrix multiplication in CUDA 7.0 SDK

In the CUDA 7.0 SDK, for the matrix multiplication benchmark, the input A is 320 x 320 and input B is 640 x 320, yet it calculates output C as A x B!

(320 x 320)    x    (640 x 320)
      A                   B

It doesn't make sense!

Sunday, February 15, 2015

Exercises using gpuocelot

http://www.ieap.uni-kiel.de/et/people/kruse/tutorials/cuda/tutorial01o/web01o/tutorial01o.html


Thursday, January 29, 2015

embed assembly inside cuda kernel

If you know the specific asm, you could potentially just write the kernel assembly on your own.
Ha, too much work!

Here are two lines I found in the SDK.


    unsigned lane_mask_lt;
    asm("mov.u32 %0, %%lanemask_lt;" : "=r"(lane_mask_lt));


error while loading shared libraries: libcudart.so.5.5: cannot open shared object file: No such file or directory

It happens when the system can't find the dynamic linker bindings, even though you probably set every environment variable correctly and compiled the program successfully.

Here is the solution.

32-bit: sudo ldconfig /usr/local/cuda/lib

64-bit: sudo ldconfig /usr/local/cuda/lib64

Thanks.(http://stackoverflow.com/questions/10808958/why-cant-libcudart-so-4-be-found-when-compiling-the-cuda-samples-under-ubuntu)