You can dynamically allocate a multi-dimensional array on the CPU and free it before exiting the program.
However, this is not the case on the GPU.
Allocating the top-level array of pointers with cudaMalloc works fine. But when you then try to cudaMalloc a space for each row through that pointer, it triggers a segmentation fault, because the pointer array lives in device memory and cannot be dereferenced on the host.
One solution is to create one contiguous chunk of memory and compute a flattened index for each dimension.
Another solution for the 2D case is to use pitched memory (cudaMallocPitch).
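A minimal sketch of the contiguous-chunk approach (ROWS, COLS, and the IDX macro are illustrative names of mine, not a fixed API):

#include <cstdio>
#include <cuda_runtime.h>

#define ROWS 64
#define COLS 128
#define IDX(r, c) ((r) * COLS + (c))   // flattened index standing in for a[r][c]

__global__ void fill(float *a)
{
    int c = blockIdx.x * blockDim.x + threadIdx.x;
    int r = blockIdx.y * blockDim.y + threadIdx.y;
    if (r < ROWS && c < COLS)
        a[IDX(r, c)] = r + 0.01f * c;  // index the chunk as if it were 2D
}

int main()
{
    float *d_a;
    cudaMalloc(&d_a, ROWS * COLS * sizeof(float));  // one contiguous chunk, no per-row cudaMalloc

    dim3 block(16, 16);
    dim3 grid((COLS + block.x - 1) / block.x, (ROWS + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_a);

    static float h_a[ROWS * COLS];
    cudaMemcpy(h_a, d_a, sizeof(h_a), cudaMemcpyDeviceToHost);  // implicit sync
    printf("a[3][5] = %f\n", h_a[IDX(3, 5)]);                   // expect 3.05
    cudaFree(d_a);
    return 0;
}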
Saturday, September 19, 2015
printf inside kernel
To debug a kernel, you can directly use the C-style printf() function inside a CUDA kernel, instead of calling cuPrintf() as in CUDA 4.2. However, I noticed that there is a limit on the number of trace records printed to stdout, around 4096, even though you may have N, e.g. 50K, threads running on the device. (The device-side printf FIFO size can be enlarged with cudaDeviceSetLimit(cudaLimitPrintfFifoSize, bytes).)
Pre-allocating a data structure to record this information is a safer and better solution, as sketched below.
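A minimal sketch of that approach (traced_kernel, MAX_RECORDS, and the recorded payload are hypothetical names of mine): each thread claims a slot in a pre-allocated device buffer with atomicAdd, and the host copies the buffer back after the kernel finishes.

#include <cstdio>
#include <cuda_runtime.h>

#define MAX_RECORDS 4096

// Each thread appends one record to a pre-allocated device buffer
// instead of calling printf() in the kernel.
__global__ void traced_kernel(int *trace_buf, unsigned int *count)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int slot = atomicAdd(count, 1u);   // grab a unique slot
    if (slot < MAX_RECORDS)
        trace_buf[slot] = tid;                  // record whatever you want to trace
}

int main()
{
    int *d_buf;
    unsigned int *d_count;
    static int h_buf[MAX_RECORDS];
    cudaMalloc(&d_buf, MAX_RECORDS * sizeof(int));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemset(d_count, 0, sizeof(unsigned int));

    traced_kernel<<<196, 256>>>(d_buf, d_count);   // ~50K threads
    cudaDeviceSynchronize();

    cudaMemcpy(h_buf, d_buf, MAX_RECORDS * sizeof(int), cudaMemcpyDeviceToHost);
    for (int i = 0; i < 10; ++i)
        printf("record %d: tid %d\n", i, h_buf[i]); // inspect on the host
    cudaFree(d_buf);
    cudaFree(d_count);
    return 0;
}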
Thursday, August 20, 2015
full recipe to install cuda 7 on ubuntu 14.04
step 1
install the prerequisite packages
$sudo apt-get install freeglut3-dev build-essential libx11-dev libxmu-dev libxi-dev libgl1-mesa-glx-lts-trusty libglu1-mesa libglu1-mesa-dev -y
step 2
blacklist the conflicting modules (so that they don't interfere with the driver installation)
$sudo vim /etc/modprobe.d/blacklist.conf
append the following lines at the end:
blacklist amd76x_edac
blacklist vga16fb
blacklist nouveau
blacklist rivafb
blacklist nvidiafb
blacklist rivatv
step 3
get rid of nvidia residuals
$sudo apt-get remove --purge nvidia-*
step 4
reboot
gksudo gedit /etc/default/grub
GRUB_CMDLINE_LINUX_DEFAULT="quiet splash nomodeset"
sudo update-grub
step 5
reboot
sudo service lightdm stop
chmod +x cuda_*.run
sudo ./cuda_7.0.28_linux.run
step 6
in your ~/.bashrc, set up the env
export PATH=/usr/local/cuda-7.0/bin:$PATH
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-7.0/lib64
sudo reboot
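After the reboot, a quick sanity check (the samples path below assumes you let the installer copy the SDK samples to your home directory):
$nvcc --version
$cd ~/NVIDIA_CUDA-7.0_Samples/1_Utilities/deviceQuery
$make
$./deviceQuery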
HOOOOOOOOORAY~~~~!!!
If you updated your Linux kernel and can no longer log in even after typing the right password, you may need to consider the following approach: removing the new Linux image.
To check the list of installed Linux images:
dpkg --list | grep linux-image
remove the targeted kernel image (http://askubuntu.com/questions/375844/kernel-downgrade-after-update-broke-my-system). Here is an example.
sudo apt-get remove --purge linux-image-3.8.0-33-generic
Sunday, May 24, 2015
Unified Memory
On the discrete GPU, the allocation resides on the device.
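For reference, here is a minimal managed-memory sketch in the spirit of the ./um program profiled below; the kernel body and the 1024-float (4096 B) size are my assumptions chosen to match the trace, not the original source:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;                              // device touch: data migrates to the GPU
}

int main()
{
    const int n = 1024;                           // 1024 floats = 4096 B, the granularity seen below
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));  // one pointer, valid on host and device

    for (int i = 0; i < n; ++i) data[i] = 1.0f;   // host touch

    Kernel<<<n / 256, 256>>>(data);
    cudaDeviceSynchronize();                      // must sync before the host touches it again

    printf("data[0] = %f\n", data[0]);            // host touch: data migrates back
    cudaFree(data);
    return 0;
}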
To profile, use the following command.
nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./application_name
For example,
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
Start Duration Unified Memory Name
170.14ms 7.4300us 4096 B [Unified Memory Memcpy DtoH]
170.18ms 4.1800us 4096 B [Unified Memory Memcpy HtoD]
170.23ms 875.94us - Kernel(float*) [96]
171.11ms 4.7740us 4096 B [Unified Memory Memcpy DtoH]
The data transfer is at the granularity of 4KB.
What about on the integrated GPU?
Is it in the pinned buffer of system memory?
Unfortunately, Unified Memory Profiling is not supported on GK20A (Jetson TK1). Therefore, we need to find a workaround.
Methodology:
1) Benchmark the bandwidth with and without pinned memory,
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match. (A timing sketch for steps 1 and 2 follows this list.)
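Here is a minimal sketch of the event-based timing used for steps 1 and 2, modeled on the bandwidthTest sample from the CUDA SDK (the function name, transfer size, and warm-up copy are my own choices, not the original code):

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time one host-to-device copy with CUDA events and return MB/s.
static float h2d_bandwidth(void *d_dst, const void *h_src, size_t bytes)
{
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 4 << 20;                           // 4 MB per transfer
    void *d_dst, *h_pinned, *h_pageable = malloc(bytes);
    cudaMalloc(&d_dst, bytes);
    cudaHostAlloc(&h_pinned, bytes, cudaHostAllocDefault);  // pinned host buffer

    h2d_bandwidth(d_dst, h_pageable, bytes);                // warm-up, excludes one-time init cost
    printf("pageable: %.1f MB/s\n", h2d_bandwidth(d_dst, h_pageable, bytes));
    printf("pinned:   %.1f MB/s\n", h2d_bandwidth(d_dst, h_pinned, bytes));

    cudaFreeHost(h_pinned);
    cudaFree(d_dst);
    free(h_pageable);
    return 0;
}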
Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource: after around 6 MB, pageable memory outperforms it.
Transfer Size (Bytes) | Pinned Bandwidth (MB/s) | Pageable Bandwidth (MB/s) |
1024000 | 891.8 | 440.7 |
1126400 | 981.6 | 439.6 |
2174976 | 944.2 | 629.7 |
3223552 | 960.6 | 630.9 |
4272128 | 958.6 | 914.9 |
5320704 | 966 | 741.9 |
6369280 | 994.4 | 627.2 |
7417856 | 994.7 | 1077.8 |
8466432 | 994.9 | 1271.6 |
9515008 | 995.3 | 1367.7 |
10563584 | 985.4 | 1371.6 |
11612160 | 986.3 | 1428.5 |
12660736 | 983.5 | 1402.5 |
13709312 | 984.6 | 1410 |
14757888 | 995.7 | 1408.8 |
15806464 | 995.8 | 1414.7 |
Measuring UM is non-trivial, since it transfers the data without explicit API calls.
For example, when the host initializes the memory:
- first, the data is copied to the host memory,
- second, the host updates the memory region,
- third, the data is copied back to the device memory.
Transfer Size (Bytes) | UM Bandwidth (MB/s, CPU computation removed) | Pinned Bandwidth (MB/s, H2D+D2H) | Pageable Bandwidth (MB/s, H2D+D2H) |
1024 | 161.3 | 104.6 | 7.4 |
2048 | 259.7 | 96.7 | 16.7 |
3072 | 212.8 | 76.1 | 24.8 |
4096 | 292 | 113.8 | 33.2 |
5120 | 294.1 | 143.2 | 40.5 |
6144 | 301.5 | 175.3 | 49.2 |
7168 | 305.7 | 207 | 57.3 |
8192 | 308.9 | 236.7 | 65.3 |
9216 | 307.2 | 270.2 | 63.7 |
10240 | 310.6 | 295.3 | 71.1 |
11264 | 342.7 | 329.7 | 78.3 |
12288 | 291.3 | 491.2 | 83.3 |
13312 | 290.8 | 518.3 | 89 |
14336 | 315.3 | 577.3 | 92.5 |
15360 | 300.6 | 630.8 | 104.5 |
16384 | 299.6 | 577.1 | 100.1 |
17408 | 312.5 | 327.8 | 108.1 |
18432 | 301 | 980.8 | 88.4 |
19456 | 310.5 | 600.6 | 109.5 |
20480 | 378.8 | 791.9 | 116.8 |
22528 | 304.3 | 811.5 | 122.3 |
According to the results, UM performance is close to pinned memory performance.
The data should be located in the pinned memory buffer.
There are still some interesting observations from the analysis; I will keep this post updated.
The source code is based on the CUDA SDK.
Wednesday, April 15, 2015
MM_Prefetching
http://www.seas.upenn.edu/~cis565/Lectures2011S/Lecture12.pdf
http://simulationcorner.net/index.php?page=fastmatrixvector
Sunday, February 22, 2015
cuda ptx
1) binary utilities
http://docs.nvidia.com/cuda/cuda-binary-utilities/index.html#axzz3SUgZUVcD
binaries are in ELF format
nvcc embeds cubin files into the host executable file
they can be generated separately by using the "-cubin" option (see the example after these notes)
cuobjdump: handles both cubin files and host binaries
nvdisasm: handles cubin files only, with support for control flow analysis and richer output
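For instance, to generate a cubin for a given architecture directly from source (assuming the app below is built from matrixMul.cu):
$nvcc -arch=sm_30 -cubin matrixMul.cu -o matrixMul.sm_30.cubin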
For example, I have a matrix multiplication app called matrixMul.
% cuda elf sections
cuobjdump -elf matrixMul
% cuda assembly
cuobjdump -sass matrixMul
% extract ptx and sass from the binary
cuobjdump matrixMul -ptx -sass
% list the cubin files for the different architectures
cuobjdump matrixMul -lelf
% extract all the cubins from the binary
cuobjdump matrixMul -xelf all
Assume I want to analyze the architecture with compute capability 3.0.
The extracted cubin is matrixMul.sm_30.cubin.
% extract the control flow graph of a kernel
nvdisasm -cfg matrixMul.sm_30.cubin
% to render the control flow graph as PNG via the DOT graph description language
sudo apt-get install graphviz
nvdisasm -cfg matrixMul.sm_30.cubin | dot -o cfg.png -Tpng
% to show the register liveness range information
nvdisasm -plr matrixMul.sm_30.cubin
Monday, February 16, 2015
Instruction Level Parallelism and Thread Level Parallelism
A good tutorial from Prof. John Owens at UC Davis.
http://www.nvidia.com/content/cudazone/cudau/courses/ucdavis/lectures/tlp1.pdf
TLP: many restaurants with one boss and one chef
ILP: one restaurant with one boss and many chefs
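To make the analogy concrete, here is a minimal sketch of my own (not from the lecture) that trades TLP for ILP: a quarter as many threads, each with four independent multiplies whose latencies the scheduler can overlap:

#include <cuda_runtime.h>

#define N (1 << 20)
#define ILP 4   // independent operations per thread

// Each thread handles ILP independent elements; the four multiplies
// have no dependences on each other, so their latencies can overlap.
__global__ void scale_ilp(float *a)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ILP;
    float x0 = a[base + 0], x1 = a[base + 1];
    float x2 = a[base + 2], x3 = a[base + 3];
    a[base + 0] = x0 * 2.0f;   // independent instructions:
    a[base + 1] = x1 * 2.0f;   // instruction-level parallelism
    a[base + 2] = x2 * 2.0f;
    a[base + 3] = x3 * 2.0f;
}

int main()
{
    float *d_a;
    cudaMalloc(&d_a, N * sizeof(float));
    // N/ILP threads instead of N: fewer threads (less TLP),
    // but more independent work per thread (more ILP).
    scale_ilp<<<(N / ILP) / 256, 256>>>(d_a);
    cudaDeviceSynchronize();
    cudaFree(d_a);
    return 0;
}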