Sunday, February 22, 2015

cuda ptx

1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD

binary in elf-format
nvcc embeds cubin files into the host executable file
they can be generated separately by using "-cubin" option

cuobjdump: cubin and host binaries
nvdisasm: cubinfiles , which can support control flow analysis and output

For example, I have matrix multiplication app called matrixMul.

% cuda elf sections
cuobjdump -elf matrixMul

%cuda assembly
cuobjdump -sass matrixMul

%extract ptx from elf
cuobjdump matrixMul -ptx -sass

% list different cubin files for different architecture
cuobjdump a.out -lelf

% extract all the cubins from the binary
cuobjdump matrixMul -xelf all


Assume, I want to analysis the architecture with cuda capability 3.0.
The previous cubin is matrixMul.sm_30.cubin

% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin

% to generate DOT graph description language
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng


% to shwo the register liveness range information
nvdisasm -plr matrixMul.sm_30.cubin

Monday, February 16, 2015

Instruction Level Parallelism and Thread Level Parallelism

A good tutorial from Prof. John Owen from UCD.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf


TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs

Interesting matrix multiplication in CUDA 7.0 SDK

In the CUDA 7.0 SDK, for the matrix multiplication benchmark, the input A is 320 x 320 and input B is 640 x 320. It calculates output C using A x B!

(320 x 320)    x    (640 x 320)
        A                          B

It doesn't make sense!

Sunday, February 15, 2015

Exercises using gpuocelot

http://www.ieap.uni-kiel.de/et/people/kruse/tutorials/cuda/tutorial01o/web01o/tutorial01o.html