ROACH + Myricom 10G-PCIE-8A

10GbE and IB Networking

Here are some results of network bandwidth testing done at Swinburne. The aim of the testing was to see how fast we could capture UDP packets on a 10GbE PCIe card. We also performed some tests to see how fast data could be streamed between two servers and written to a GPU on the receiving server.

ROACH to Server using UDP + 10GbE

A simple ROACH design was created that generates 8 KB UDP packets at a configurable data rate. Each UDP packet is packed with 1024 x 64-bit unsigned integers corresponding to the packet sequence number. The first 10GbE port on the ROACH was directly cabled to a 10GbE PCIe card (Myricom 10G-PCIE-8A). A simple UDP capture program, written using the PSRDADA library, was used to read packets from the socket and write them to a shared memory ring buffer. The machine used for the test was an Intel server with 4 x X7560 CPUs. The interrupt coalescing for the RX queue (rx-usecs) was left at the default of 75 usecs. The capture program was bound to a core on the second CPU, as the first CPU appeared to be busy servicing interrupts. The highest sustained rate with no packet loss (over 120 s) was ~1008 MB/s (7.88 Gb/s), with the capture process consuming ~92% of one core.
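A minimal sketch of such a capture loop is shown below, using plain POSIX sockets in place of the PSRDADA ring buffer writes. The port number and core number are hypothetical and should be matched to your NIC's interrupt topology:

  #define _GNU_SOURCE
  #include <arpa/inet.h>
  #include <sched.h>
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>
  #include <sys/socket.h>

  #define PKT_BYTES 8192              /* 1024 x 64-bit words per packet */

  int main(void)
  {
      /* Pin the process to a core away from the CPU servicing NIC
         interrupts (core 8 is a hypothetical choice). */
      cpu_set_t mask;
      CPU_ZERO(&mask);
      CPU_SET(8, &mask);
      sched_setaffinity(0, sizeof(mask), &mask);

      int fd = socket(AF_INET, SOCK_DGRAM, 0);

      /* A large kernel receive buffer absorbs scheduling jitter. */
      int rcvbuf = 64 * 1024 * 1024;
      setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf));

      struct sockaddr_in addr;
      memset(&addr, 0, sizeof(addr));
      addr.sin_family = AF_INET;
      addr.sin_addr.s_addr = htonl(INADDR_ANY);
      addr.sin_port = htons(4001);    /* hypothetical port */
      if (bind(fd, (struct sockaddr *) &addr, sizeof(addr)) < 0) {
          perror("bind");
          return 1;
      }

      uint64_t buf[PKT_BYTES / 8];
      uint64_t expected = 0, dropped = 0, npkts = 0;

      for (;;) {
          ssize_t got = recv(fd, buf, sizeof(buf), 0);
          if (got != PKT_BYTES)
              continue;

          /* Every word carries the sequence number, so the first word
             is enough to detect gaps. */
          uint64_t seq = buf[0];
          if (npkts > 0 && seq != expected)
              dropped += seq - expected;
          expected = seq + 1;

          /* The real capture program wrote the packet payload into a
             PSRDADA shared memory ring buffer here. */

          if ((++npkts % 1000000) == 0)
              printf("%llu packets, %llu dropped\n",
                     (unsigned long long) npkts,
                     (unsigned long long) dropped);
      }
  }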

Server -> IB -> Server -> GPU

In this benchmark, we tested InfiniBand and GPU bandwidth in a streaming scenario. Each of the two servers had the following specifications:

  • BOARD: Tylersburg IOH with ICH10
  • NIC: QLogic QLE7340 QDR InfiniBand (IB)
  • GPU: NVIDIA C2070 (Fermi)
  • CPU: dual Intel Xeon X5650s (2.66 GHz, 6 cores, 12 MB L3)
  • RAM: 48 GB DDR3 (1333 MHz)

The two servers were connected through an IB switch over 40 Gb/s QDR InfiniBand. The details of the tests were:

Server A:

  • Process A: generating a data stream and writing it to a shared memory ring buffer
  • Process B: reading from the shared memory ring buffer and writing it to Server B via RDMA RC (Reliable Connection) IB Verbs (a sketch of this write path follows the Server B list)

Server B:

  • Process A: receiving the RDMA RC data stream from Server A and writing the data into a shared memory ring buffer
  • Process B: reading data from the shared memory ring buffer and transferring it to the GPU with cudaMemcpyAsync (a sketch of this transfer path appears at the end of the section)
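As a rough illustration of the sender side (Process B on Server A), the sketch below posts a single signalled RDMA write and waits for its completion. It assumes the queue pair has already been created, connected, and transitioned to RTS (e.g. via librdmacm), and that the remote buffer address and rkey were exchanged out of band; all names are illustrative and connection setup is omitted:

  #include <infiniband/verbs.h>
  #include <stdint.h>
  #include <stdio.h>

  /* Post one RDMA write of 'len' bytes from a registered local buffer
     to the remote buffer, then wait for its completion. */
  int rdma_write_block(struct ibv_qp *qp, struct ibv_cq *cq,
                       struct ibv_mr *mr, void *local_buf, uint32_t len,
                       uint64_t remote_addr, uint32_t rkey)
  {
      struct ibv_sge sge = {
          .addr   = (uintptr_t) local_buf,
          .length = len,
          .lkey   = mr->lkey,
      };
      struct ibv_send_wr wr = {
          .sg_list    = &sge,
          .num_sge    = 1,
          .opcode     = IBV_WR_RDMA_WRITE,
          .send_flags = IBV_SEND_SIGNALED,
      };
      wr.wr.rdma.remote_addr = remote_addr;
      wr.wr.rdma.rkey        = rkey;

      struct ibv_send_wr *bad_wr;
      if (ibv_post_send(qp, &wr, &bad_wr))
          return -1;

      /* Spin on the completion queue; a real pipeline would keep
         several writes in flight instead of one. */
      struct ibv_wc wc;
      int n;
      while ((n = ibv_poll_cq(cq, 1, &wc)) == 0)
          ;
      if (n < 0 || wc.status != IBV_WC_SUCCESS) {
          fprintf(stderr, "RDMA write failed: %d\n", wc.status);
          return -1;
      }
      return 0;
  }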

The sustained rate for this configuration was 3215 MB/s (25.1 Gb/s).
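For the final hop (Process B on Server B), the key detail is that cudaMemcpyAsync overlaps with other work only when the host buffer is page-locked. The sketch below times host-to-device copies from a pinned buffer; the chunk size and repeat count are hypothetical:

  #include <cuda_runtime.h>
  #include <stdio.h>

  #define CHUNK (64UL * 1024 * 1024)  /* hypothetical transfer size */

  int main(void)
  {
      /* Pinned (page-locked) host memory is required for truly
         asynchronous host-to-device copies. */
      void *host_buf, *dev_buf;
      cudaHostAlloc(&host_buf, CHUNK, cudaHostAllocDefault);
      cudaMalloc(&dev_buf, CHUNK);

      cudaStream_t stream;
      cudaStreamCreate(&stream);

      cudaEvent_t start, stop;
      cudaEventCreate(&start);
      cudaEventCreate(&stop);
      cudaEventRecord(start, stream);

      /* In the real test the source was the shared memory ring buffer
         fed by the IB receiver; here we just repeat the copy. */
      const int reps = 100;
      for (int i = 0; i < reps; i++)
          cudaMemcpyAsync(dev_buf, host_buf, CHUNK,
                          cudaMemcpyHostToDevice, stream);

      cudaEventRecord(stop, stream);
      cudaEventSynchronize(stop);

      float ms;
      cudaEventElapsedTime(&ms, start, stop);
      printf("host->device: %.1f MB/s\n",
             (reps * CHUNK / 1e6) / (ms / 1e3));

      cudaStreamDestroy(stream);
      cudaFreeHost(host_buf);
      cudaFree(dev_buf);
      return 0;
  }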