To profile Unified Memory transfers, use the following command.
nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./application_name
For example,
$ nvprof --print-gpu-trace --unified-memory-profiling per-process-device ./um
==28981== Profiling application: ./um
==28981== Profiling result:
Start Duration Unified Memory Name
170.14ms 7.4300us 4096 B [Unified Memory Memcpy DtoH]
170.18ms 4.1800us 4096 B [Unified Memory Memcpy HtoD]
170.23ms 875.94us - Kernel(float*) [96]
171.11ms 4.7740us 4096 B [Unified Memory Memcpy DtoH]
The data transfers occur at a granularity of 4 KB.
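The source of the ./um application is not shown; for context, here is a minimal sketch of the kind of managed-memory program that produces a trace like the one above. The kernel body and the 4 KB (1024-float) size are assumptions, not the original code.

```cuda
// um.cu -- hypothetical sketch of a managed-memory test producing a trace
// like the one above; kernel body and buffer size are assumptions.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Kernel(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    float *data;
    cudaMallocManaged(&data, 1024 * sizeof(float));   // one 4 KB managed allocation

    for (int i = 0; i < 1024; ++i) data[i] = 1.0f;    // host write: pages migrate to the host

    Kernel<<<4, 256>>>(data);                         // kernel launch: pages migrate to the device
    cudaDeviceSynchronize();

    printf("data[0] = %f\n", data[0]);                // host read: pages migrate back to the host
    cudaFree(data);
    return 0;
}
```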
What about on the integrated GPU?
Is the data kept in a pinned buffer in system memory?
Unfortunately, Unified Memory profiling is not supported on the GK20A (Jetson TK1). Therefore, we need to find a workaround.
Methodology:
1) Benchmark the bandwidth with pinned and with pageable memory (see the sketch after this list),
2) Benchmark the bandwidth with UM,
3) Compare them and find the best match.
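A sketch of how step 1 can be measured, loosely in the spirit of the CUDA SDK bandwidthTest sample. The buffer size, iteration count, and helper name are illustrative choices, not the exact code behind the table below.

```cuda
// Step 1 sketch: host-to-device bandwidth with pinned vs. pageable memory.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Time `iters` H2D copies of `bytes` and return the bandwidth in MB/s.
static float MeasureH2D(void *h_src, void *d_dst, size_t bytes, int iters) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    for (int i = 0; i < iters; ++i)
        cudaMemcpy(d_dst, h_src, bytes, cudaMemcpyHostToDevice);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return (bytes / (1024.0f * 1024.0f)) * iters / (ms / 1000.0f);
}

int main() {
    const size_t bytes = 6 * 1024 * 1024;   // around the ~6 MB crossover point
    const int iters = 10;

    void *d_buf, *h_pinned, *h_pageable;
    cudaMalloc(&d_buf, bytes);
    cudaMallocHost(&h_pinned, bytes);       // pinned (page-locked) host memory
    h_pageable = malloc(bytes);             // ordinary pageable host memory

    printf("pinned   : %.1f MB/s\n", MeasureH2D(h_pinned,   d_buf, bytes, iters));
    printf("pageable : %.1f MB/s\n", MeasureH2D(h_pageable, d_buf, bytes, iters));

    free(h_pageable);
    cudaFreeHost(h_pinned);
    cudaFree(d_buf);
    return 0;
}
```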
Here is a snapshot of part of the host-to-device bandwidth results.
Pinned memory is a limited resource; after around 6 MB, pageable memory outperforms it.
| Transfer Size (Bytes) | Pinned Bandwidth (MB/s) | Pageable Bandwidth (MB/s) |
| --- | --- | --- |
| 1024000 | 891.8 | 440.7 |
| 1126400 | 981.6 | 439.6 |
| 2174976 | 944.2 | 629.7 |
| 3223552 | 960.6 | 630.9 |
| 4272128 | 958.6 | 914.9 |
| 5320704 | 966 | 741.9 |
| 6369280 | 994.4 | 627.2 |
| 7417856 | 994.7 | 1077.8 |
| 8466432 | 994.9 | 1271.6 |
| 9515008 | 995.3 | 1367.7 |
| 10563584 | 985.4 | 1371.6 |
| 11612160 | 986.3 | 1428.5 |
| 12660736 | 983.5 | 1402.5 |
| 13709312 | 984.6 | 1410 |
| 14757888 | 995.7 | 1408.8 |
| 15806464 | 995.8 | 1414.7 |
Measuring UM is non-trivial, since the data is transferred without explicit API calls.
For example, when the host initializes the memory (see the timing sketch after this list):
- first, the data is copied to the host memory,
- second, the memory region is updated,
- third, the data is copied back to the device memory.
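A sketch of how step 2 might be timed: because the migrations are implicit, the timed region has to cover the host accesses and the kernel launch together. The kernel, buffer size, and the H2D + D2H accounting are assumptions chosen to be comparable with the pinned/pageable columns, not the exact benchmark behind the table below.

```cuda
// Step 2 sketch: effective bandwidth with Unified Memory.
#include <chrono>
#include <cstdio>
#include <cuda_runtime.h>

__global__ void Touch(float *data, size_t n) {
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;               // device access pulls the pages to the GPU
}

int main() {
    const size_t n = 4096 / sizeof(float);    // one of the transfer sizes from the table
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));

    auto t0 = std::chrono::high_resolution_clock::now();

    for (size_t i = 0; i < n; ++i) data[i] = 1.0f;   // host write triggers migration to the host
    Touch<<<(n + 255) / 256, 256>>>(data, n);        // kernel launch triggers migration to the device
    cudaDeviceSynchronize();
    volatile float check = data[0];                  // host read triggers migration back to the host
    (void)check;

    auto t1 = std::chrono::high_resolution_clock::now();
    double s = std::chrono::duration<double>(t1 - t0).count();

    // Count one H2D plus one D2H transfer of the buffer, so the number is
    // comparable with the pinned/pageable (h2d, d2h) columns.
    printf("UM effective bandwidth: %.1f MB/s\n",
           2.0 * n * sizeof(float) / (1024.0 * 1024.0) / s);

    cudaFree(data);
    return 0;
}
```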
| Transfer Size (Bytes) | UM Bandwidth (MB/s, CPU computation removed) | Pinned Bandwidth (MB/s, H2D + D2H) | Pageable Bandwidth (MB/s, H2D + D2H) |
| --- | --- | --- | --- |
| 1024 | 161.3 | 104.6 | 7.4 |
| 2048 | 259.7 | 96.7 | 16.7 |
| 3072 | 212.8 | 76.1 | 24.8 |
| 4096 | 292 | 113.8 | 33.2 |
| 5120 | 294.1 | 143.2 | 40.5 |
| 6144 | 301.5 | 175.3 | 49.2 |
| 7168 | 305.7 | 207 | 57.3 |
| 8192 | 308.9 | 236.7 | 65.3 |
| 9216 | 307.2 | 270.2 | 63.7 |
| 10240 | 310.6 | 295.3 | 71.1 |
| 11264 | 342.7 | 329.7 | 78.3 |
| 12288 | 291.3 | 491.2 | 83.3 |
| 13312 | 290.8 | 518.3 | 89 |
| 14336 | 315.3 | 577.3 | 92.5 |
| 15360 | 300.6 | 630.8 | 104.5 |
| 16384 | 299.6 | 577.1 | 100.1 |
| 17408 | 312.5 | 327.8 | 108.1 |
| 18432 | 301 | 980.8 | 88.4 |
| 19456 | 310.5 | 600.6 | 109.5 |
| 20480 | 378.8 | 791.9 | 116.8 |
| 22528 | 304.3 | 811.5 | 122.3 |
According to the results, UM performance is close to pinned memory performance.
The data should be located in the pinned memory buffer.
There are still some interesting observations from the analysis; I will keep this post updated.
The source code is based on the CUDA SDK.