If you have multiple threads or applications demanding high memory I/O bandwidth on a server with a NUMA architecture, you can achieve real benefits by using NUMA-aware memory allocation and binding processing to the CPU cores of the same NUMA node.
I performed these tests on a Supermicro 7048GR-TR with 2 Intel Xeon E5-2623 CPUs and 128 GB DDR4 RAM in 8 x 16 GB DIMMs. The absolute performance in these tests is arbitrary – I’m mostly interested in the relative performance. Here I am using PSRDada – a library developed for Pulsar Radio Astronomy that is used for data acquisition and transport around servers. Specifically I’m using the shared memory ring buffer implementation this library provides.
In each test I have a writer and a reader that operate on shared memory. Standard memcpy is used to fill the ring buffer with data (in the writer) and to read from the ring buffer (in the reader). This results in two transactions in and out of application memory for each byte of data – analogous to many realistic scenarios.
The total amount of data in each test is 50 GB (50 × 10^9 bytes). The results are shown below: performance roughly doubles when the memory, writer, and reader are all co-located on the same NUMA node.
Test | Memory (node) | Writer (node) | Reader (node) | Rate (MB/s) |
---|---|---|---|---|
Single Stream | 0 | 0 | 0 | 4850 |
Single Stream | 0 | 0 | 1 | 3600 |
Single Stream | 0 | 1 | 0 | 2050 |
Single Stream | 1 | 1 | 1 | 1300 |
The next test ran two parallel streams to see the effect of NUMA alignment when the data in each stream does not need to cross the NUMA boundary.
Test | A Config | B Config | A (MB/s) | B (MB/s) | Sum (MB/s) |
---|---|---|---|---|---|
Isolated 0 and 1 | RAM=0 Write=0 Read=0 | RAM=1 Write=1 Read=1 | 4800 | 4900 | 9700 |
Adjacent on 0 | RAM=0 Write=0 Read=0 | RAM=0 Write=0 Read=0 | 3050 | 3050 | 6100 |
Adjacent on 1 | RAM=1 Write=1 Read=1 | RAM=1 Write=1 Read=1 | 3050 | 3100 | 6150 |
Crossover | RAM=1 Write=0 Read=0 | RAM=0 Write=1 Read=1 | 1700 | 1700 | 3400 |
The third test was to change the reader from reading into application memory to reading into GPU memory. Since a GPU is attached to a PCIe bus that is local to one NUMA node, this is also an interesting and slightly different test.
Test | A (MB/s) | B (MB/s) | C (MB/s) | D (MB/s) | Sum (MB/s) |
---|---|---|---|---|---|
Aligned | 4000 | 4000 | 3950 | 4050 | 16000 |
Opposed | 850 | 850 | 850 | 850 | 3400 |
Random 1 | 950 | 950 | 1550 | 4850 | 8300 |
Random 2 | 1400 | 2450 | 1150 | 1000 | 6000 |
Random 3 | 1450 | 2850 | 700 | 1150 | 6150 |
Aligned is an optimal NUMA arrangement where computation and memory transfer never leave the NUMA node. Opposed is a worst-case scenario where every byte written and read crosses the NUMA boundary. The Random rows use arbitrary placements of memory and processes, and land between the two extremes.