You can dynamically allocate a multi-dimensional array on the CPU and free it before exiting the program.
However, this is not the case on the GPU.
Allocating the top-level array of pointers with cudaMalloc works fine. But when you then try to cudaMalloc a space for each row through that pointer, it triggers a segmentation fault, because the pointer array lives in device memory and cannot be dereferenced on the host.
One solution is to create one contiguous chunk of memory and compute a flattened index for each dimension.
Another solution for the 2D case is to use pitched memory (cudaMallocPitch).
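A minimal sketch of the contiguous-chunk approach (ROWS, COLS, and the IDX macro are illustrative names of mine, not a fixed API):

#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 64
#define COLS 128
#define IDX(r, c) ((r) * COLS + (c))   // flattened index standing in for a[r][c]

__global__ void fill(float *a)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < ROWS && c < COLS)
        a[IDX(r, c)] = r + 0.01f * c;  // index the chunk as if it were 2D
}

int main()
{
    float *d_a;
    cudaMalloc(&d_a, ROWS * COLS * sizeof(float));  // one contiguous chunk, no per-row cudaMalloc

    dim3 block(16, 16);
    dim3 grid((COLS + block.x - 1) / block.x, (ROWS + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_a);

    static float h_a[ROWS * COLS];
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);  // implicit sync
    printf("a[3][5] = %f\n", h_a[IDX(3, 5)]);                   // expect 3.05
    cudaFree(d_a);
    return 0;
}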
Saturday, September 19, 2015
printf inside kernel
To debug a kernel, you can directly use the C-style printf() function inside a CUDA kernel, instead of calling cuPrintf() as in CUDA 4.2. However, I noticed that there is a limit on the number of trace records printed to stdout, around 4096, even though you may have N, e.g. 50K, threads running on the device. (The device-side printf FIFO size can be enlarged with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes).)
Pre-allocating a data structure to record this information is a safer and better solution, as sketched below.
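A minimal sketch of that approach (traced_kernel, MAX_RECORDS, and the recorded payload are hypothetical names of mine): each thread claims a slot in a pre-allocated device buffer with atomicAdd, and the host copies the buffer back after the kernel finishes.

#include <cstdio>
#include <cuda_runtime.h>

#define MAX_RECORDS 4096

// Each thread appends one record to a pre-allocated device buffer
// instead of calling printf() in the kernel.
__global__ void traced_kernel(int *trace_buf, unsigned int *count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int slot = atomicAdd(count, 1u);   // grab a unique slot
    if (slot < MAX_RECORDS)
        trace_buf[slot] = tid;                  // record whatever you want to trace
}

int main()
{
    int *d_buf;
    unsigned int *d_count;
    static int h_buf[MAX_RECORDS];
    cudaMalloc(&d_buf, MAX_RECORDS * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    traced_kernel<<<196, 256>>>(d_buf, d_count);   // ~50K threads
    cudaDeviceSynchronize();

    cudaMemcpy(h_buf, d_buf, MAX_RECORDS * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i)
        printf("record %d: tid %d\n", i, h_buf[i]); // inspect on the host
    cudaFree(d_buf);
    cudaFree(d_count);
    return 0;
}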
Thursday, August 20, 2015
full recipe to install cuda 7 on ubuntu 14.04
step 1
install the prerequisite packages
$sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx-lts-trusty libglu1-mesa libglu1-mesa-dev -y
step 2
blacklist the conflicting modules (so that they don't interfere with the driver installation)
$sudo vim /etc/modprobe.d/blacklist.conf
append the following lines at the end:
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
step 3
get rid of nvidia residuals
$sudo apt-get remove --purge nvidia-*
step 4
reboot
gksudo gedit /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
sudo update-grub
step 5
reboot
sudo service lightdm stop
chmod +x cuda_*.run
sudo ./cuda_7.0.28_linux.run
step 6
in your ~/.bashrc, set up the env
export PATH=/usr/local/cuda-7.0/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64
sudo reboot
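After the reboot, a quick sanity check (the samples path below assumes you let the installer copy the SDK samples to your home directory):
$nvcc --version
$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery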
HOOOOOOOOORAY~~~~!!!
If you updated your Linux kernel and can no longer log in even after typing the right password, you may need to consider the following approach: removing the new Linux image.
To check the list of installed Linux images:
dpkg --list | grep linux-image
remove the targeted kernel image (http://askubuntu.com/questions/375844/kernel-downgrade-after-update-broke-my-system). Here is an example.
sudo apt-get remove --purge linux-image-3.8.0-33-generic
Sunday, May 24, 2015
Unified Memory
On the discrete GPU, the allocation resides on the device.
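For reference, here is a minimal managed-memory sketch in the spirit of the ./um program profiled below; the kernel body and the 1024-float (4096 B) size are my assumptions chosen to match the trace, not the original source:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;                              // device touch: data migrates to the GPU
}

int main()
{
    const int n = 1024;                           // 1024 floats = 4096 B, the granularity seen below
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, valid on host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host touch

    Kernel<<<n / 256, 256>>>(data);
    cudaDeviceSynchronize();                      // must sync before the host touches it again

    printf("data[0] = %f\n", data[0]);            // host touch: data migrates back
    cudaFree(data);
    return 0;
}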
To profile, use the following command.
nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./application_name
For example,
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
Start Duration Unified Memory Name
170.14ms 7.4300us 4096 B [Unified Memory Memcpy DtoH]
170.18ms 4.1800us 4096 B [Unified Memory Memcpy HtoD]
170.23ms 875.94us - Kernel(float*) [96]
171.11ms 4.7740us 4096 B [Unified Memory Memcpy DtoH]
The data transfer is at the granularity of 4KB.
What about on the integrated GPU?
Is it in the pinned buffer of system memory?
Unfortunately, Unified Memory Profiling is not supported on GK20A (Jetson TK1). Therefore, we need to find a workaround.
Methodology:
1) Benchmark the bandwidth with and without pinned memory,
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match. (A timing sketch for steps 1 and 2 follows this list.)
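Here is a minimal sketch of the event-based timing used for steps 1 and 2, modeled on the bandwidthTest sample from the CUDA SDK (the function name, transfer size, and warm-up copy are my own choices, not the original code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy with CUDA events and return MB/s.
static float h2d_bandwidth(void *d_dst, const void *h_src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 4 << 20;                           // 4 MB per transfer
    void *d_dst, *h_pinned, *h_pageable = malloc(bytes);
    cudaMalloc(&d_dst, bytes);
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);  // pinned host buffer

    h2d_bandwidth(d_dst, h_pageable, bytes);                // warm-up, excludes one-time init cost
    printf("pageable: %.1f MB/s\n", h2d_bandwidth(d_dst, h_pageable, bytes));
    printf("pinned:   %.1f MB/s\n", h2d_bandwidth(d_dst, h_pinned, bytes));

    cudaFreeHost(h_pinned);
    cudaFree(d_dst);
    free(h_pageable);
    return 0;
}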
Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource: after around 6 MB, pageable memory outperforms it.
Transfer Size (Bytes) | Pinned Bandwidth (MB/s) | Pageable Bandwidth (MB/s) |
1024000 | 891.8 | 440.7 |
1126400 | 981.6 | 439.6 |
2174976 | 944.2 | 629.7 |
3223552 | 960.6 | 630.9 |
4272128 | 958.6 | 914.9 |
5320704 | 966 | 741.9 |
6369280 | 994.4 | 627.2 |
7417856 | 994.7 | 1077.8 |
8466432 | 994.9 | 1271.6 |
9515008 | 995.3 | 1367.7 |
10563584 | 985.4 | 1371.6 |
11612160 | 986.3 | 1428.5 |
12660736 | 983.5 | 1402.5 |
13709312 | 984.6 | 1410 |
14757888 | 995.7 | 1408.8 |
15806464 | 995.8 | 1414.7 |
Measuring UM is non-trivial, since it transfers the data without explicit API calls.
For example, when the host initializes the memory:
- first, the data is copied to the host memory,
- second, the host updates the memory region,
- third, the data is copied back to the device memory.
Transfer Size (Bytes) | UM Bandwidth (MB/s, CPU computation removed) | Pinned Bandwidth (MB/s, H2D+D2H) | Pageable Bandwidth (MB/s, H2D+D2H) |
1024 | 161.3 | 104.6 | 7.4 |
2048 | 259.7 | 96.7 | 16.7 |
3072 | 212.8 | 76.1 | 24.8 |
4096 | 292 | 113.8 | 33.2 |
5120 | 294.1 | 143.2 | 40.5 |
6144 | 301.5 | 175.3 | 49.2 |
7168 | 305.7 | 207 | 57.3 |
8192 | 308.9 | 236.7 | 65.3 |
9216 | 307.2 | 270.2 | 63.7 |
10240 | 310.6 | 295.3 | 71.1 |
11264 | 342.7 | 329.7 | 78.3 |
12288 | 291.3 | 491.2 | 83.3 |
13312 | 290.8 | 518.3 | 89 |
14336 | 315.3 | 577.3 | 92.5 |
15360 | 300.6 | 630.8 | 104.5 |
16384 | 299.6 | 577.1 | 100.1 |
17408 | 312.5 | 327.8 | 108.1 |
18432 | 301 | 980.8 | 88.4 |
19456 | 310.5 | 600.6 | 109.5 |
20480 | 378.8 | 791.9 | 116.8 |
22528 | 304.3 | 811.5 | 122.3 |
According to the results, UM performance is close to pinned memory performance.
The data should be located in the pinned memory buffer.
There are still some interesting observations from the analysis; I will keep this post updated.
The source code is based on the CUDA SDK.
Wednesday, April 15, 2015
MM_Prefetching
http://www.seas.upenn.edu/~cis565/Lectures2011S/Lecture12.pdf
http://simulationcorner.net/index.php?page=fastmatrixvector
Sunday, February 22, 2015
cuda ptx
1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD
binaries are in ELF format
nvcc embeds cubin files into the host executable file
they can be generated separately by using the "-cubin" option (see the example after these notes)
cuobjdump: handles both cubin files and host binaries
nvdisasm: handles cubin files only, with support for control flow analysis and richer output
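For instance, to generate a cubin for a given architecture directly from source (assuming the app below is built from matrixMul.cu):
$nvcc -arch=sm_30 -cubin matrixMul.cu -o matrixMul.sm_30.cubin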
For example, I have a matrix multiplication app called matrixMul.
% cuda elf sections
cuobjdump -elf matrixMul
% cuda assembly
cuobjdump -sass matrixMul
% extract ptx and sass from the binary
cuobjdump matrixMul -ptx -sass
% list the cubin files for the different architectures
cuobjdump matrixMul -lelf
% extract all the cubins from the binary
cuobjdump matrixMul -xelf all
Assume I want to analyze the architecture with compute capability 3.0.
The extracted cubin is matrixMul.sm_30.cubin.
% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin
% to render the control flow graph as PNG via the DOT graph description language
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng
% to show the register liveness range information
nvdisasm -plr matrixMul.sm_30.cubin
Monday, February 16, 2015
Instruction Level Parallelism and Thread Level Parallelism
A good tutorial from Prof. John Owens at UC Davis.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf
TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs
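To make the analogy concrete, here is a minimal sketch of my own (not from the lecture) that trades TLP for ILP: a quarter as many threads, each with four independent multiplies whose latencies the scheduler can overlap:

#include <cuda_runtime.h>

#define N (1 << 20)
#define ILP 4   // independent operations per thread

// Each thread handles ILP independent elements; the four multiplies
// have no dependences on each other, so their latencies can overlap.
__global__ void scale_ilp(float *a)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    float x0 = a[base + 0], x1 = a[base + 1];
    float x2 = a[base + 2], x3 = a[base + 3];
    a[base + 0] = x0 * 2.0f;   // independent instructions:
    a[base + 1] = x1 * 2.0f;   // instruction-level parallelism
    a[base + 2] = x2 * 2.0f;
    a[base + 3] = x3 * 2.0f;
}

int main()
{
    float *d_a;
    cudaMalloc(&d_a, N * sizeof(float));
    // N/ILP threads instead of N: fewer threads (less TLP),
    // but more independent work per thread (more ILP).
    scale_ilp<<<(N / ILP) / 256, 256>>>(d_a);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}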