To profile Unified Memory transfers, use the following command.
nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./application_name
For example,
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
Start Duration Unified Memory Name
170.14ms 7.4300us 4096 B [Unified Memory Memcpy DtoH]
170.18ms 4.1800us 4096 B [Unified Memory Memcpy HtoD]
170.23ms 875.94us - Kernel(float*) [96]
171.11ms 4.7740us 4096 B [Unified Memory Memcpy DtoH]
The data transfers occur at a granularity of 4 KB.
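The source of the ./um application is not shown; for context, here is a minimal sketch of the kind of managed-memory program that produces a trace like the one above. The kernel body and the 4 KB (1024-float) size are assumptions, not the original code.

```cuda
// um.cu -- hypothetical sketch of a managed-memory test producing a trace
// like the one above; kernel body and buffer size are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    float *data;
    cudaMallocManaged(&data, 1024 * sizeof(float));   // one 4 KB managed allocation

    for (int i = 0; i < 1024; ++i) data[i] = 1.0f;    // host write: pages migrate to the host

    Kernel<<<4, 256>>>(data);                         // kernel launch: pages migrate to the device
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                // host read: pages migrate back to the host
    cudaFree(data);
    return 0;
}
```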
What about on the integrated GPU?
Is the data kept in a pinned buffer in system memory?
Unfortunately, Unified Memory profiling is not supported on the GK20A (Jetson TK1). Therefore, we need to find a workaround.
Methodology:
1) Benchmark the bandwidth with pinned and with pageable memory (see the sketch after this list),
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match.
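A sketch of how step 1 can be measured, loosely in the spirit of the CUDA SDK bandwidthTest sample. The buffer size, iteration count, and helper name are illustrative choices, not the exact code behind the table below.

```cuda
// Step 1 sketch: host-to-device bandwidth with pinned vs. pageable memory.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time `iters` H2D copies of `bytes` and return the bandwidth in MB/s.
static float MeasureH2D(void *h_src, void *d_dst, size_t bytes, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) * iters / (ms / 1000.0f);
}

int main() {
    const size_t bytes = 6 * 1024 * 1024;   // around the ~6 MB crossover point
    const int iters = 10;

    void *d_buf, *h_pinned, *h_pageable;
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);       // pinned (page-locked) host memory
    h_pageable = malloc(bytes);             // ordinary pageable host memory

    printf("pinned   : %.1f MB/s\n", MeasureH2D(h_pinned,   d_buf, bytes, iters));
    printf("pageable : %.1f MB/s\n", MeasureH2D(h_pageable, d_buf, bytes, iters));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```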
Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource; after around 6 MB, pageable memory outperforms it.
| Transfer Size (Bytes) | Pinned Bandwidth (MB/s) | Pageable Bandwidth (MB/s) |
| --- | --- | --- |
| 1024000 | 891.8 | 440.7 |
| 1126400 | 981.6 | 439.6 |
| 2174976 | 944.2 | 629.7 |
| 3223552 | 960.6 | 630.9 |
| 4272128 | 958.6 | 914.9 |
| 5320704 | 966 | 741.9 |
| 6369280 | 994.4 | 627.2 |
| 7417856 | 994.7 | 1077.8 |
| 8466432 | 994.9 | 1271.6 |
| 9515008 | 995.3 | 1367.7 |
| 10563584 | 985.4 | 1371.6 |
| 11612160 | 986.3 | 1428.5 |
| 12660736 | 983.5 | 1402.5 |
| 13709312 | 984.6 | 1410 |
| 14757888 | 995.7 | 1408.8 |
| 15806464 | 995.8 | 1414.7 |
Measuring UM is non-trivial, since the data is transferred without explicit API calls.
For example, when the host initializes the memory (see the timing sketch after this list):
- first, the data is copied to the host memory,
- second, the memory region is updated,
- third, the data is copied back to the device memory.
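A sketch of how step 2 might be timed: because the migrations are implicit, the timed region has to cover the host accesses and the kernel launch together. The kernel, buffer size, and the H2D + D2H accounting are assumptions chosen to be comparable with the pinned/pageable columns, not the exact benchmark behind the table below.

```cuda
// Step 2 sketch: effective bandwidth with Unified Memory.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Touch(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;               // device access pulls the pages to the GPU
}

int main() {
    const size_t n = 4096 / sizeof(float);    // one of the transfer sizes from the table
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // host write triggers migration to the host
    Touch<<<(n + 255) / 256, 256>>>(data, n);        // kernel launch triggers migration to the device
    cudaDeviceSynchronize();
    volatile float check = data[0];                  // host read triggers migration back to the host
    (void)check;

    auto t1 = std::chrono::high_resolution_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();

    // Count one H2D plus one D2H transfer of the buffer, so the number is
    // comparable with the pinned/pageable (h2d, d2h) columns.
    printf("UM effective bandwidth: %.1f MB/s\n",
           2.0 * n * sizeof(float) / (1024.0 * 1024.0) / s);

    cudaFree(data);
    return 0;
}
```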
| Transfer Size (Bytes) | UM Bandwidth (MB/s, CPU computation removed) | Pinned Bandwidth (MB/s, H2D + D2H) | Pageable Bandwidth (MB/s, H2D + D2H) |
| --- | --- | --- | --- |
| 1024 | 161.3 | 104.6 | 7.4 |
| 2048 | 259.7 | 96.7 | 16.7 |
| 3072 | 212.8 | 76.1 | 24.8 |
| 4096 | 292 | 113.8 | 33.2 |
| 5120 | 294.1 | 143.2 | 40.5 |
| 6144 | 301.5 | 175.3 | 49.2 |
| 7168 | 305.7 | 207 | 57.3 |
| 8192 | 308.9 | 236.7 | 65.3 |
| 9216 | 307.2 | 270.2 | 63.7 |
| 10240 | 310.6 | 295.3 | 71.1 |
| 11264 | 342.7 | 329.7 | 78.3 |
| 12288 | 291.3 | 491.2 | 83.3 |
| 13312 | 290.8 | 518.3 | 89 |
| 14336 | 315.3 | 577.3 | 92.5 |
| 15360 | 300.6 | 630.8 | 104.5 |
| 16384 | 299.6 | 577.1 | 100.1 |
| 17408 | 312.5 | 327.8 | 108.1 |
| 18432 | 301 | 980.8 | 88.4 |
| 19456 | 310.5 | 600.6 | 109.5 |
| 20480 | 378.8 | 791.9 | 116.8 |
| 22528 | 304.3 | 811.5 | 122.3 |
According to the results, UM performance is close to pinned memory performance.
The data should be located in the pinned memory buffer.
There are still some interesting observations from the analysis; I will keep this post updated.
The source code is based on the CUDA SDK.