If you have multiple threads or applications demanding high memory I/O bandwidth on a server with a NUMA architecture, you can achieve real benefits by using NUMA-aware memory allocation and binding processing to the CPU cores of the same NUMA node.
I performed these tests on a Supermicro 7048GR-TR with 2 Intel Xeon E5-2623 CPUs and 128 GB DDR4 RAM in 8 x 16 GB DIMMs. The absolute performance in these tests is arbitrary – I’m mostly interested in the relative performance. Here I am using PSRDada – a library developed for Pulsar Radio Astronomy that is used for data acquisition and transport around servers. Specifically I’m using the shared memory ring buffer implementation this library provides.
In each test I have a writer and a reader that operate on shared memory. Standard memcpy is used to fill the ring buffer with data (in the writer) and to read from the ring buffer (in the reader). This results in two transactions in and out of application memory for each byte of data – analogous to many realistic scenarios.
The total amount of data in each test is 50 GB (50 × 10^9 bytes). The results are shown below: performance roughly doubles when the memory, writer, and reader are all co-located on the same NUMA node.
Test | Memory (node) | Writer (node) | Reader (node) | Rate (MB/s) |
---|---|---|---|---|
Single Stream | 0 | 0 | 0 | 4850 |
Single Stream | 0 | 0 | 1 | 3600 |
Single Stream | 0 | 1 | 0 | 2050 |
Single Stream | 1 | 1 | 1 | 1300 |
The next test ran two parallel streams to see the effect of NUMA alignment when the data in each stream does not need to cross the NUMA boundary.
Test | A Config | B Config | A (MB/s) | B (MB/s) | Sum (MB/s) |
---|---|---|---|---|---|
Isolated 0 and 1 | RAM=0 Write=0 Read=0 | RAM=1 Write=1 Read=1 | 4800 | 4900 | 9700 |
Adjacent on 0 | RAM=0 Write=0 Read=0 | RAM=0 Write=0 Read=0 | 3050 | 3050 | 6100 |
Adjacent on 1 | RAM=1 Write=1 Read=1 | RAM=1 Write=1 Read=1 | 3050 | 3100 | 6150 |
Crossover | RAM=1 Write=0 Read=0 | RAM=0 Write=1 Read=1 | 1700 | 1700 | 3400 |
The third test was to change the reader from reading into application memory to reading into GPU memory. Since a GPU is attached to a PCIe bus that is local to one NUMA node, this is also an interesting and slightly different test.
Test | A (MB/s) | B (MB/s) | C (MB/s) | D (MB/s) | Sum (MB/s) |
---|---|---|---|---|---|
Aligned | 4000 | 4000 | 3950 | 4050 | 16000 |
Opposed | 850 | 850 | 850 | 850 | 3400 |
Random 1 | 950 | 950 | 1550 | 4850 | 8300 |
Random 2 | 1400 | 2450 | 1150 | 1000 | 6000 |
Random 3 | 1450 | 2850 | 700 | 1150 | 6150 |
Aligned is an optimal NUMA arrangement where computation and memory transfer never leave the NUMA node. Opposed is a worst-case scenario where every byte written and read crosses the NUMA boundary. The Random rows use arbitrary placements of memory and processes, and land between the two extremes.