NUMA with shared memory

If you have multiple threads or applications that demand high-bandwidth memory I/O on a server with a NUMA architecture, you can achieve real benefits from using NUMA-aware memory allocation and binding processing to CPU cores on the same NUMA node as that memory.
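As a rough illustration of what binding processing to the cores of one NUMA node can look like, here is a minimal Python sketch for Linux. It is not how PSRDada does it; the function names are my own, and it only pins the CPU side. It relies on the default Linux policy of allocating pages on the node of the CPU that first touches them, so explicit memory binding (for example through libnuma or numactl) is out of scope here.

```python
import os

def parse_cpulist(text):
    """Expand a Linux cpulist string such as '0-3,8-11' into a sorted list of CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # a node without CPUs yields an empty string
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)

def cpus_on_node(node, sysfs_root="/sys/devices/system/node"):
    """Read the CPUs that belong to one NUMA node from sysfs (Linux only)."""
    with open(os.path.join(sysfs_root, f"node{node}", "cpulist")) as f:
        return parse_cpulist(f.read())

def bind_to_node(node):
    """Restrict the calling process to the CPUs of the given NUMA node."""
    cpus = cpus_on_node(node)
    os.sched_setaffinity(0, cpus)  # pid 0 means the calling process
    return cpus
```

Calling `bind_to_node(0)` early in a process, before it allocates and touches its buffers, should keep both the computation and (under the default policy) its memory on node 0.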

I performed these tests on a Supermicro 7048GR-TR with 2 Intel Xeon E5-2623 CPUs and 128 GB DDR4 RAM in 8 x 16 GB DIMMs. The absolute performance in these tests is arbitrary – I’m mostly interested in the relative performance. Here I am using PSRDada – a library developed for Pulsar Radio Astronomy that is used for data acquisition and transport around servers. Specifically I’m using the shared memory ring buffer implementation this library provides.

In each test I have a writer and a reader that operate on shared memory. Standard memcpy is used to fill the ring buffer with data (in the writer) and then to read from the ring buffer (in the reader). This results in 2 transactions in and out of application memory for each byte of data – analogous to many realistic scenarios.
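To make the access pattern concrete, the following is a simplified Python sketch of the two copies each block goes through: application memory into shared memory, then shared memory into application memory. It is an assumption-laden stand-in, not the PSRDada ring buffer: it uses a single shared block, runs writer and reader in one process, and the function name and block size are my own choices.

```python
import time
from multiprocessing import shared_memory

def copy_through_shared_memory(total_bytes, block_bytes=1 << 20):
    """Copy total_bytes into and back out of a shared memory block.

    Returns (bytes_copied, throughput_in_MB_per_s).
    """
    shm = shared_memory.SharedMemory(create=True, size=block_bytes)
    src = bytearray(block_bytes)  # stands in for the writer's application memory
    dst = bytearray(block_bytes)  # stands in for the reader's application memory
    try:
        start = time.perf_counter()
        copied = 0
        while copied < total_bytes:
            shm.buf[:block_bytes] = src     # writer: application -> shared memory
            dst[:] = shm.buf[:block_bytes]  # reader: shared memory -> application
            copied += block_bytes
        elapsed = time.perf_counter() - start
    finally:
        shm.close()
        shm.unlink()
    return copied, copied / elapsed / 1e6

if __name__ == "__main__":
    copied, rate = copy_through_shared_memory(256 << 20)
    print(f"copied {copied} bytes at {rate:.0f} MB/s")
```

The absolute number this prints is not comparable with the tables in this post; the point is only the shape of the data path.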

The total amount of data in each test is 50 GB (50 x 10^9 bytes). The results are shown below. As we can see, the performance is about double when the memory, reader and writer processes are all on the same NUMA node.

Test            Memory  Writer  Reader  Performance (MB/s)
Single Stream   0       0       0       4850
Single Stream   0       0       1       3600
Single Stream   0       1       0       2050
Single Stream   1       1       1       1300
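To put numbers on the relative performance, this small Python snippet takes the four rows exactly as printed in the table and expresses each one as a slowdown relative to the fastest configuration. The dictionary layout and variable names are mine.

```python
# (memory node, writer node, reader node) -> throughput in MB/s, as in the table
results = {
    (0, 0, 0): 4850,
    (0, 0, 1): 3600,
    (0, 1, 0): 2050,
    (1, 1, 1): 1300,
}

best = max(results.values())

# Slowdown factor of each configuration relative to the fastest one
slowdown = {cfg: round(best / rate, 2) for cfg, rate in results.items()}

for cfg, factor in slowdown.items():
    print(f"memory={cfg[0]} writer={cfg[1]} reader={cfg[2]}: {factor}x slower than best")
```

With these figures the slowdown factors come out as 1.0, 1.35, 2.37 and 3.73 respectively.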

The next test involved 2 parallel streams to see the effects of NUMA alignment on streams where data does not need to cross the NUMA boundary.

Test              A Config                 B Config                 A     B     Sum
Isolated 0 and 1  RAM=0 Write=0 Read=0     RAM=1 Write=1 Read=1     4800  4900  9700
Adjacent on 0     RAM=0 Write=0 Read=0     RAM=0 Write=0 Read=0     3050  3050  7100
Adjacent on 1     RAM=1 Write=1 Read=1     RAM=1 Write=1 Read=1     3050  3100  7150
Crossover         RAM=1 Write=0 Read=0     RAM=0 Write=1 Read=1     1700  1700  3400

The 3rd test changed the reader from reading into application memory to reading into GPU memory. Since a GPU is attached to a PCI bus that is closer to one NUMA node than the other, this is also an interesting and slightly different test.

Test      A     B     C     D     Sum
Aligned   4000  4000  3950  4050  16000
Opposed   850   850   850   850   3400
Random 1  950   950   1550  4850  8300
Random 2  1400  2450  1150  1000  6000
Random 3  1450  2850  700   1150  6150

Aligned is an optimal NUMA arrangement where computation and memory transfer never leave the NUMA node. Opposed is a worst-case scenario where every byte written and read crosses the NUMA boundary.
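Setting up an aligned arrangement means knowing which NUMA node each GPU hangs off. On Linux the kernel exposes this per PCI device in sysfs, and a minimal Python sketch for reading it might look like the following. The function name and the example PCI address are hypothetical; substitute the address of your own device (as reported by a tool such as lspci).

```python
import os

def pci_numa_node(pci_address, sysfs_root="/sys/bus/pci/devices"):
    """Return the NUMA node a PCI device is attached to.

    Reads the device's numa_node attribute from sysfs (Linux only).
    The kernel reports -1 when it has no NUMA information for the device.
    """
    path = os.path.join(sysfs_root, pci_address, "numa_node")
    with open(path) as f:
        return int(f.read().strip())

if __name__ == "__main__":
    # "0000:03:00.0" is a placeholder address, not taken from the test machine
    print(pci_numa_node("0000:03:00.0"))
```

The returned node number can then be used to choose where the reader and its shared memory should live for that GPU.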