Sunday, May 24, 2015

Unified Memory

On a discrete GPU, the managed allocation resides in device memory, and the data migrates between host and device on demand.
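A minimal sketch of the kind of program profiled here (the actual ./um source is not shown, so the kernel body and the 4096-byte size are assumptions read off the trace below):

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel matching the "Kernel(float*)" entry in the trace.
__global__ void Kernel(float *data)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main()
{
    const int n = 1024;                              // 1024 floats = 4096 bytes
    float *data = NULL;
    cudaMallocManaged(&data, n * sizeof(float));     // managed allocation, no explicit memcpy

    for (int i = 0; i < n; ++i)                      // host write: pages migrate to the host (DtoH)
        data[i] = 1.0f;

    Kernel<<<n / 256, 256>>>(data);                  // kernel launch: pages migrate back (HtoD)
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);               // host read after the kernel: another DtoH
    cudaFree(data);
    return 0;
}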
To profile, use the following command.
nvprof --print-gpu-trace  --unified-memory-profiling per-process-device ./application_name

For example,
$ nvprof --print-gpu-trace  --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
   Start  Duration           Unified Memory           Name
170.14ms  7.4300us      4096 B                          [Unified Memory Memcpy DtoH]
170.18ms  4.1800us      4096 B                          [Unified Memory Memcpy HtoD]
170.23ms  875.94us          -                                Kernel(float*) [96]
171.11ms  4.7740us      4096 B                          [Unified Memory Memcpy DtoH]

Each transfer in the trace is 4096 bytes, so the data migration granularity is 4 KB.


What about on the integrated GPU?
Does the data live in a pinned buffer in system memory?
Unfortunately, Unified Memory profiling is not supported on the GK20A (Jetson TK1), so we need a workaround.

Methodology:
1) Benchmark the bandwidth with and without pinned memory (see the sketch after this list),
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match.
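For step 1, something along the lines of the SDK bandwidthTest sample works; the sketch below is a simplified version (the transfer size and iteration count are placeholders), timing cudaMemcpy from a pageable (malloc) buffer versus a pinned (cudaMallocHost) buffer:

#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time repeated host-to-device copies from a given host buffer and return MB/s.
static float H2DBandwidthMBps(void *hostBuf, size_t bytes, int iters)
{
    void *devBuf = NULL;
    cudaMalloc(&devBuf, bytes);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(devBuf, hostBuf, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(devBuf);
    return iters * (bytes / (1024.0f * 1024.0f)) / (ms / 1000.0f);
}

int main()
{
    const size_t bytes = 1024000;                // first transfer size in the table below
    const int iters = 10;

    void *pageable = malloc(bytes);              // pageable host memory
    void *pinned = NULL;
    cudaMallocHost(&pinned, bytes);              // pinned (page-locked) host memory

    printf("pageable: %.1f MB/s\n", H2DBandwidthMBps(pageable, bytes, iters));
    printf("pinned:   %.1f MB/s\n", H2DBandwidthMBps(pinned, bytes, iters));

    free(pageable);
    cudaFreeHost(pinned);
    return 0;
}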

Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource: after around 6 MB, pageable transfers overtake it.


Transfer Size (Bytes)   Pinned Bandwidth (MB/s)   Pageable Bandwidth (MB/s)
1024000 891.8 440.7
1126400 981.6 439.6
2174976 944.2 629.7
3223552 960.6 630.9
4272128 958.6 914.9
5320704 966 741.9
6369280 994.4 627.2
7417856 994.7 1077.8
8466432 994.9 1271.6
9515008 995.3 1367.7
10563584 985.4 1371.6
11612160 986.3 1428.5
12660736 983.5 1402.5
13709312 984.6 1410
14757888 995.7 1408.8
15806464 995.8 1414.7

Measuring UM is non-trivial, since the data transfers happen without explicit API calls (a timing sketch follows the steps below).
For example, when the host initializes a managed buffer:

  1. the data is first migrated to host memory,
  2. the host then updates the memory region,
  3. the data is migrated back to device memory when the kernel runs.
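A rough sketch of a timing harness for the UM case (a simplified version, not the exact code behind the table; it times the full implicit round trip with a host timer and does not subtract the CPU update time):

#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main()
{
    const int n = 4096 / sizeof(float);          // a 4096-byte managed buffer, one row of the table
    float *data = NULL;
    cudaMallocManaged(&data, n * sizeof(float));

    // Warm-up pass so first-touch overhead is not charged to the measurement.
    for (int i = 0; i < n; ++i) data[i] = 0.0f;
    Scale<<<(n + 255) / 256, 256>>>(data, n);
    cudaDeviceSynchronize();

    auto t0 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; ++i) data[i] = 1.0f;  // host write: implicit migration to the host
    Scale<<<(n + 255) / 256, 256>>>(data, n);    // kernel launch: implicit migration to the device
    cudaDeviceSynchronize();
    float checksum = data[0];                    // host read: implicit migration back to the host
    auto t1 = std::chrono::high_resolution_clock::now();

    double sec = std::chrono::duration<double>(t1 - t0).count();
    double mb  = 2.0 * n * sizeof(float) / (1024.0 * 1024.0);   // count one H2D plus one D2H
    printf("UM effective bandwidth: %.1f MB/s (checksum %f)\n", mb / sec, checksum);

    cudaFree(data);
    return 0;
}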

Transfer Size (Bytes)   UM, CPU computation removed (MB/s)   Pinned, H2D+D2H (MB/s)   Pageable, H2D+D2H (MB/s)
1024 161.3 104.6 7.4
2048 259.7 96.7 16.7
3072 212.8 76.1 24.8
4096 292 113.8 33.2
5120 294.1 143.2 40.5
6144 301.5 175.3 49.2
7168 305.7 207 57.3
8192 308.9 236.7 65.3
9216 307.2 270.2 63.7
10240 310.6 295.3 71.1
11264 342.7 329.7 78.3
12288 291.3 491.2 83.3
13312 290.8 518.3 89
14336 315.3 577.3 92.5
15360 300.6 630.8 104.5
16384 299.6 577.1 100.1
17408 312.5 327.8 108.1
18432 301 980.8 88.4
19456 310.5 600.6 109.5
20480 378.8 791.9 116.8
22528 304.3 811.5 122.3

According to the results, UM performance is close to pinned-memory performance.
The data should therefore be located in a pinned buffer in system memory.
There are still some interesting observations in the analysis; I will post updates.

The source code is based on the CUDA SDK.
