Saturday, September 19, 2015

2d array on gpu

You can dynamically allocate a multi-dimensional array on the CPU, use it, and free it before exiting the program.
However, this does not carry over directly to the GPU.
Allocating a device array of row pointers works fine, but when you then try to cudaMalloc a space for each row pointer, dereferencing those device pointers on the host triggers a segmentation fault.

One solution is to create one contiguous chunk of memory and compute a flattened index for each dimension.
Another solution for 2D is to use pitched memory (cudaMallocPitch), as in the sketch below.
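
Here is a minimal sketch of both approaches; the kernel names, sizes, and the scale operation are illustrative, not from the original post.

#include <cuda_runtime.h>

// Solution 1: one contiguous chunk, indexed as row * width + col.
__global__ void scale_flat(float *a, int width, int height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width)
        a[row * width + col] *= s;            // flattened 2D index
}

// Solution 2: pitched allocation; rows are padded for aligned access,
// so the start of each row must be computed in bytes.
__global__ void scale_pitched(float *a, size_t pitch, int width, int height, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < height && col < width) {
        float *rowp = (float *)((char *)a + row * pitch);
        rowp[col] *= s;
    }
}

int main()
{
    const int width = 64, height = 64;

    float *d_flat;                            // solution 1
    cudaMalloc((void **)&d_flat, width * height * sizeof(float));

    float *d_pitched;                         // solution 2
    size_t pitch;
    cudaMallocPitch((void **)&d_pitched, &pitch, width * sizeof(float), height);

    dim3 block(16, 16);
    dim3 grid((width + 15) / 16, (height + 15) / 16);
    scale_flat<<<grid, block>>>(d_flat, width, height, 2.0f);
    scale_pitched<<<grid, block>>>(d_pitched, pitch, width, height, 2.0f);
    cudaDeviceSynchronize();

    cudaFree(d_flat);
    cudaFree(d_pitched);
    return 0;
}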

printf inside kernel

To debug a kernel, you can call printf() directly inside a CUDA kernel just as in C, instead of the cuPrintf() workaround used in older CUDA releases (e.g., CUDA 4.2). However, I noticed that there is a limit on the number of trace records printed to stdout, around 4096 in my runs, even though you may have N, e.g. 50K, threads running on the device. (Device printf output goes through a fixed-size FIFO; its size can be raised with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, ...), but it remains bounded.)

Pre-allocating a device buffer to record this information, then copying it back to the host, is a safer and better solution.
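
A minimal sketch of that approach, with illustrative names and sizes:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Each thread records its result in a pre-allocated device buffer
// instead of relying on the bounded printf FIFO.
__global__ void traced_kernel(int *trace, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n) {
        if (tid < 4)
            printf("tid %d\n", tid);  // fine for a few threads; the FIFO drops records at scale
        trace[tid] = tid * tid;       // record whatever needs inspecting
    }
}

int main()
{
    const int n = 50 * 1024;          // e.g. 50K threads
    int *h_trace = (int *)malloc(n * sizeof(int));
    int *d_trace;
    cudaMalloc((void **)&d_trace, n * sizeof(int));

    traced_kernel<<<(n + 255) / 256, 256>>>(d_trace, n);
    cudaMemcpy(h_trace, d_trace, n * sizeof(int), cudaMemcpyDeviceToHost);

    // every record survives the copy back; nothing is dropped
    printf("trace[0] = %d, trace[%d] = %d\n", h_trace[0], n - 1, h_trace[n - 1]);

    cudaFree(d_trace);
    free(h_trace);
    return 0;
}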

Thursday, August 20, 2015

full recipe to install cuda 7 on ubuntu 14.04

The CUDA driver and toolkit can be downloaded from here.

step 1
$sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx-lts-trusty libglu1-mesa libglu1-mesa-dev -y

step 2
blacklist the conflicting modules (so that they don’t interfere with the driver installation)
$sudo vim /etc/modprobe.d/blacklist.conf
append the following lines at the end:

blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv


step 3
get rid of any NVIDIA driver residuals
$sudo apt-get remove --purge nvidia-*


step 4
edit the grub defaults to boot with nomodeset, then update grub:
$gksudo gedit /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
$sudo update-grub


step 5
reboot, then stop the display manager and run the installer:
$sudo service lightdm stop
$chmod +x cuda_*.run
$sudo ./cuda_7.0.28_linux.run


step 6
in your ~/.bashrc, set up the environment:
export PATH=/usr/local/cuda-7.0/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64

sudo reboot
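
After the reboot, a quick sanity check (not part of the original recipe) to confirm the toolkit and driver are visible:

$nvcc --version
$nvidia-smi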

HOOOOOOOOORAY~~~~!!!

If you updated your Linux kernel and can no longer log in even when typing the right password, you may need to consider the following approach: removing the new Linux image.

to check the list of installed Linux images
dpkg --list | grep linux-image

remove the targeted kernel image (http://askubuntu.com/questions/375844/kernel-downgrade-after-update-broke-my-system). Here is an example.
sudo apt-get remove --purge linux-image-3.8.0-33-generic

Sunday, May 24, 2015

Unified Memory

On a discrete GPU, the unified-memory allocation resides in device memory.
To profile, use the following command.
nvprof --print-gpu-trace  --unified-memory-profiling per-process-device ./application_name

For example,
$ nvprof --print-gpu-trace  --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
   Start  Duration  Unified Memory  Name
170.14ms  7.4300us          4096 B  [Unified Memory Memcpy DtoH]
170.18ms  4.1800us          4096 B  [Unified Memory Memcpy HtoD]
170.23ms  875.94us               -  Kernel(float*) [96]
171.11ms  4.7740us          4096 B  [Unified Memory Memcpy DtoH]

The data transfer is at the granularity of 4KB.
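
The ./um source is not shown; a minimal managed-memory sketch that would produce a similar trace (the kernel name and the 4096-byte size are chosen to match the output above, everything else is illustrative):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(float *data)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    data[tid] += 1.0f;
}

int main()
{
    const int n = 1024;                   // 4096 bytes of floats
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; i++)           // host write: pages live on the host
        data[i] = 0.0f;

    Kernel<<<n / 256, 256>>>(data);       // pages migrate HtoD before the kernel
    cudaDeviceSynchronize();

    printf("%f\n", data[0]);              // host read: pages migrate back DtoH

    cudaFree(data);
    return 0;
}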


What about on an integrated GPU?
Is the data in a pinned buffer in system memory?
Unfortunately, Unified Memory profiling is not supported on the GK20A (Jetson TK1), so we need to find a workaround.

Methodology:
1) Benchmark the bandwidth with pinned and with pageable memory,
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match (a benchmark sketch follows).
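
Steps 1) and 2) can be scripted with CUDA events; here is a hedged sketch of the pinned-vs-pageable half (buffer sizes and names are illustrative; error checking omitted):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy and return the bandwidth in MB/s.
static float h2d_bandwidth(void *h_src, void *d_dst, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 4 << 20;         // one 4 MB transfer
    void *d_buf, *h_pinned, *h_pageable = malloc(bytes);

    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);     // pinned (page-locked) host buffer

    printf("pinned:   %.1f MB/s\n", h2d_bandwidth(h_pinned, d_buf, bytes));
    printf("pageable: %.1f MB/s\n", h2d_bandwidth(h_pageable, d_buf, bytes));

    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    free(h_pageable);
    return 0;
}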

Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource: after around 6 MB, pageable transfers outperform pinned ones.


Transfer Size (Bytes)  Pinned Bandwidth (MB/s)  Pageable Bandwidth (MB/s)
1024000 891.8 440.7
1126400 981.6 439.6
2174976 944.2 629.7
3223552 960.6 630.9
4272128 958.6 914.9
5320704 966 741.9
6369280 994.4 627.2
7417856 994.7 1077.8
8466432 994.9 1271.6
9515008 995.3 1367.7
10563584 985.4 1371.6
11612160 986.3 1428.5
12660736 983.5 1402.5
13709312 984.6 1410
14757888 995.7 1408.8
15806464 995.8 1414.7

Measuring UM is non-trivial, since the data is transferred without explicit API calls; a timing sketch follows this list.
For example, when the host initializes managed memory:

  1. first, the pages are copied to host memory,
  2. second, the host updates the memory region,
  3. third, the pages are copied back to device memory at the next kernel launch.
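
Since there is no explicit memcpy to time, one workaround (a sketch, under the assumption that the first kernel launch after a host write triggers the migration) is to time a launch on freshly host-written managed memory, then time the same launch on already-resident data and take the difference as the migration cost:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void touch(float *data, int n)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < n)
        data[tid] += 1.0f;
}

// Time one kernel launch (plus any implicit UM migration) in ms.
static float time_launch(float *data, int n)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    touch<<<(n + 255) / 256, 256>>>(data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main()
{
    const int n = 1 << 20;                // 4 MB of floats
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    for (int i = 0; i < n; i++)           // host write: pages move to the host
        data[i] = 1.0f;

    // first launch: kernel time plus implicit HtoD migration
    float with_migration = time_launch(data, n);
    // second launch: data is already resident, so roughly pure kernel time
    float resident = time_launch(data, n);

    printf("with migration %.3f ms, resident %.3f ms, migration ~%.3f ms\n",
           with_migration, resident, with_migration - resident);

    cudaFree(data);
    return 0;
}

The table below lists the measured effective bandwidths (MB/s).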

Transfer Size (Bytes)  UM (MB/s, CPU computation removed)  Pinned (MB/s, H2D+D2H)  Pageable (MB/s, H2D+D2H)
1024 161.3 104.6 7.4
2048 259.7 96.7 16.7
3072 212.8 76.1 24.8
4096 292 113.8 33.2
5120 294.1 143.2 40.5
6144 301.5 175.3 49.2
7168 305.7 207 57.3
8192 308.9 236.7 65.3
9216 307.2 270.2 63.7
10240 310.6 295.3 71.1
11264 342.7 329.7 78.3
12288 291.3 491.2 83.3
13312 290.8 518.3 89
14336 315.3 577.3 92.5
15360 300.6 630.8 104.5
16384 299.6 577.1 100.1
17408 312.5 327.8 108.1
18432 301 980.8 88.4
19456 310.5 600.6 109.5
20480 378.8 791.9 116.8
22528 304.3 811.5 122.3

According to the results, UM performance is close to pinned-memory performance.
The data should therefore be located in a pinned system-memory buffer.
There are still some interesting observations in the analysis; I will keep this post updated.

The source code is based on the CUDA SDK.

Wednesday, April 15, 2015

MM_Prefetching

http://www.seas.upenn.edu/~cis565/Lectures2011S/Lecture12.pdf

http://simulationcorner.net/index.php?page=fastmatrixvector



Sunday, February 22, 2015

cuda ptx

1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD

the binaries are in ELF format
nvcc embeds cubin files into the host executable file
they can also be generated separately by using the "-cubin" option

cuobjdump: handles both cubin files and host binaries
nvdisasm: handles cubin files only, and additionally supports control-flow analysis and richer output options
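
For example, to produce a standalone cubin (assuming the source file is matrixMul.cu and a compute capability 3.0 target):

% generate a standalone cubin for sm_30
nvcc -arch=sm_30 -cubin -o matrixMul.sm_30.cubin matrixMul.cu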

For example, I have a matrix multiplication app called matrixMul.

% cuda elf sections
cuobjdump -elf matrixMul

% cuda assembly
cuobjdump -sass matrixMul

% extract the ptx and assembly from the binary
cuobjdump matrixMul -ptx -sass

% list the different cubin files for different architectures
cuobjdump a.out -lelf

% extract all the cubins from the binary
cuobjdump matrixMul -xelf all


Assume I want to analyze the architecture with compute capability 3.0.
The cubin extracted above is matrixMul.sm_30.cubin.

% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin

% to render the control flow graph via the DOT graph description language
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng


% to show the register liveness range information
nvdisasm -plr matrixMul.sm_30.cubin

Monday, February 16, 2015

Instruction Level Parallelism and Thread Level Parallelism

A good tutorial from Prof. John Owens from UC Davis.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf


TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs
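
To make the contrast concrete, here is a hedged CUDA sketch (kernels and sizes are illustrative, not from the lecture): the TLP version fills the machine with many threads, each issuing one operation, while the ILP version gives each thread four independent operations so fewer threads can still keep the pipeline busy.

#include <cuda_runtime.h>

// TLP: one output per thread; latency is hidden by running many threads
// (many restaurants, one chef each).
__global__ void saxpy_tlp(float *y, const float *x, float a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

// ILP: four independent outputs per thread; the four updates have no
// dependences on one another, so a single thread keeps several
// operations in flight (one restaurant, many chefs).
__global__ void saxpy_ilp4(float *y, const float *x, float a, int n)
{
    int i = 4 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 3 < n) {
        y[i + 0] = a * x[i + 0] + y[i + 0];
        y[i + 1] = a * x[i + 1] + y[i + 1];
        y[i + 2] = a * x[i + 2] + y[i + 2];
        y[i + 3] = a * x[i + 3] + y[i + 3];
    }
}

int main()
{
    const int n = 1 << 20;
    float *x, *y;
    cudaMalloc((void **)&x, n * sizeof(float));
    cudaMalloc((void **)&y, n * sizeof(float));
    cudaMemset(x, 0, n * sizeof(float));
    cudaMemset(y, 0, n * sizeof(float));

    saxpy_tlp<<<n / 256, 256>>>(y, x, 2.0f, n);       // n threads in flight
    saxpy_ilp4<<<n / 4 / 256, 256>>>(y, x, 2.0f, n);  // n/4 threads, 4-way ILP
    cudaDeviceSynchronize();

    cudaFree(x);
    cudaFree(y);
    return 0;
}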