CUDA: February 2015

Sunday, February 22, 2015

cuda ptx

1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD

CUDA binaries (cubin files) are in ELF format.
nvcc embeds the cubin files into the host executable by default.
They can also be generated separately with the "-cubin" option.

cuobjdump: works on both cubin files and host binaries
nvdisasm: works on cubin files only, but supports control flow analysis and richer output

For example, I have a matrix multiplication app called matrixMul.

% dump the cuda elf sections
cuobjdump -elf matrixMul

% dump the cuda assembly (sass)
cuobjdump -sass matrixMul

% dump both the ptx and the sass embedded in the executable
cuobjdump matrixMul -ptx -sass

% list the cubin files embedded for each architecture
cuobjdump matrixMul -lelf

% extract all the cubins from the binary
cuobjdump matrixMul -xelf all

Assume I want to analyze the code built for CUDA compute capability 3.0.
The cubin extracted in the previous step is matrixMul.sm_30.cubin

% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin

% the cfg is printed in the DOT graph description language; render it as a png with graphviz
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng

% to show the register live range information
nvdisasm -plr matrixMul.sm_30.cubin
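
The extraction and control flow graph steps above can be chained in a small script. This is only a sketch: the function names are hypothetical, and it assumes the extracted cubin is always named <executable>.<arch>.cubin, as matrixMul.sm_30.cubin was here. It uses only the flags already shown in these notes.

```python
import shutil
import subprocess


def cfg_commands(executable, arch="sm_30"):
    """Build the command lines from these notes as argument lists."""
    # Assumption: cuobjdump names the extracted file <executable>.<arch>.cubin
    cubin = f"{executable}.{arch}.cubin"
    extract = ["cuobjdump", executable, "-xelf", "all"]
    disasm = ["nvdisasm", "-cfg", cubin]
    render = ["dot", "-Tpng", "-o", "cfg.png"]
    return extract, disasm, render


def render_cfg(executable, arch="sm_30"):
    """Extract the cubins, then pipe the DOT output of nvdisasm into dot."""
    extract, disasm, render = cfg_commands(executable, arch)
    for tool in ("cuobjdump", "nvdisasm", "dot"):
        if shutil.which(tool) is None:
            raise RuntimeError(f"{tool} not found on PATH")
    subprocess.run(extract, check=True)
    dot_text = subprocess.run(disasm, check=True, capture_output=True).stdout
    subprocess.run(render, input=dot_text, check=True)
```

Calling render_cfg("matrixMul") in the directory that holds the executable should leave cfg.png next to it, provided the CUDA toolkit and graphviz are installed.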

Monday, February 16, 2015

Instruction Level Parallelism and Thread Level Parallelism

A good tutorial from Prof. John Owens at UC Davis.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf

TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs
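
One way to put numbers on the analogy is a toy model based on Little's law: the work that must be in flight to hide latency is latency times issue rate, and it can come from more threads (restaurants), from more independent instructions per thread (chefs), or from both. The function names and all numbers below are made up for illustration and are not taken from the lecture.

```python
def ops_in_flight(threads, ilp):
    """Independent operations the hardware can pick from at once."""
    return threads * ilp


def ops_needed(latency_cycles, issue_per_cycle):
    """Little's law: operations that must be in flight to keep the pipeline busy."""
    return latency_cycles * issue_per_cycle


# Hypothetical pipeline: 20-cycle latency, 1 operation issued per cycle.
needed = ops_needed(latency_cycles=20, issue_per_cycle=1)

tlp_only = ops_in_flight(threads=20, ilp=1)  # many restaurants, one chef each
mixed = ops_in_flight(threads=5, ilp=4)      # fewer restaurants, four chefs each

print(needed, tlp_only, mixed)  # 20 20 20
```

In this model both configurations cover the same latency, so a kernel with fewer threads can still keep the pipeline busy if each thread has enough independent instructions.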

Interesting matrix multiplication in CUDA 7.0 SDK

In the CUDA 7.0 SDK, for the matrix multiplication benchmark, the input A is 320 x 320 and input B is 640 x 320. It calculates output C using A x B!

A (320 x 320) x B (640 x 320)

Read as rows x columns, it doesn't make sense: A has 320 columns but B has 640 rows, so the inner dimensions don't match!
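
One possible explanation, stated here as an assumption and not checked in these notes: the sample keeps each size in a dim3 and reports it as (x, y), that is (width, height), not (rows, columns). Under that reading B is 640 wide and 320 high, so it has 320 rows and the product is defined. The helper below is a hypothetical sketch of that dimension check, not code from the SDK.

```python
def product_shape(a_wh, b_wh):
    """Shape of A x B when each shape is given as (width, height)."""
    a_width, a_height = a_wh
    b_width, b_height = b_wh
    # A x B is defined when the columns of A (its width)
    # equal the rows of B (its height).
    if a_width != b_height:
        raise ValueError(f"inner dimensions differ: {a_width} != {b_height}")
    # C has as many columns as B and as many rows as A.
    return (b_width, a_height)


print(product_shape((320, 320), (640, 320)))  # (640, 320)
```

With the sizes from the benchmark, C comes out 640 wide and 320 high, so the multiplication is consistent once the pairs are read as (width, height).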
