NUMA nodes in HFT
Some machines have more than one CPU socket. Each socket, together with the RAM attached to it, typically forms one NUMA node.
HFT is all about predictable speed of execution.
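One source of unpredictability (jitter) is the operating system's scheduler migrating a thread to a different core. If the thread lands on a different CPU socket, the memory it already allocated stays on the original node, so its memory access time changes.
NUMA (Non-Uniform Memory Access) means that RAM attached to a thread's own socket is faster to reach than RAM attached to a remote socket: roughly 3–4x in the figures used here (~80ns vs ~300ns). Exact numbers vary with hardware generation and load.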
The hardware picture:
Motherboard
├── Socket 0 (NUMA Node 0)
│ ├── Cores 0–15
│ ├── L3 Cache (shared within socket)
│ └── Local DDR5 RAM ← ~80ns access
│
├── Socket 1 (NUMA Node 1)
│ ├── Cores 16–31
│ ├── L3 Cache
│ └── Local DDR5 RAM
│
└── QPI/UPI Interconnect (Socket 0 ↔ Socket 1) ← ~300ns cross-socket
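To see how this picture maps onto a real machine, Linux exposes the cores of each NUMA node in sysfs under /sys/devices/system/node/nodeN/cpulist, in a format like "0-15,32-47". The sketch below reads and parses that file. The helper names parse_cpulist and cpus_on_node are hypothetical, chosen for this illustration, and error handling is kept minimal.

```cpp
#include <fstream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: parse a Linux cpulist string such as "0-15,32-47"
// into the individual CPU ids it describes.
std::vector<int> parse_cpulist(const std::string& text) {
    std::vector<int> cpus;
    std::stringstream ss(text);
    std::string part;
    while (std::getline(ss, part, ',')) {
        // Skip chunks without digits (empty input, stray whitespace).
        if (part.find_first_of("0123456789") == std::string::npos) continue;
        const auto dash = part.find('-');
        const int first = std::stoi(part.substr(0, dash));
        const int last = (dash == std::string::npos)
                             ? first
                             : std::stoi(part.substr(dash + 1));
        for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// Hypothetical helper: CPUs that belong to NUMA node `node`.
// Returns an empty vector if the sysfs file does not exist or is empty.
std::vector<int> cpus_on_node(int node) {
    std::ifstream in("/sys/devices/system/node/node" + std::to_string(node) +
                     "/cpulist");
    std::string line;
    if (!in || !std::getline(in, line)) return {};
    return parse_cpulist(line);
}
```

Calling cpus_on_node(0) and cpus_on_node(1) on a two-socket box like the one in the diagram would be expected to give the two core ranges, though the actual numbering depends on the platform and on whether hyper-threading is enabled.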
How to control this?
1. Pin threads to cores
Prevent the OS from migrating your thread across NUMA boundaries:
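// Linux-specific (glibc); `thread` is an already-running std::thread
#include <pthread.h>
#include <sched.h>
cpu_set_t cpuset;
CPU_ZERO(&cpuset);
CPU_SET(3, &cpuset);  // allow core 3 only
int rc = pthread_setaffinity_np(thread.native_handle(), sizeof(cpuset), &cpuset);
// rc != 0 is an error number and means the thread was NOT pinned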
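A slightly fuller sketch of the same idea: the worker pins itself as its first action, before it allocates or touches anything, and then reads the affinity mask back to confirm the pin took effect. The function names here are hypothetical, and the core number is a parameter because core 3 may not be available to the process (for example inside a container or cpuset).

```cpp
#include <pthread.h>
#include <sched.h>
#include <thread>

// Pin the calling thread to `core`.
// Returns 0 on success, an error number otherwise.
int pin_current_thread(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

// True if the calling thread's affinity mask is exactly {core}.
bool is_pinned_to(int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    if (pthread_getaffinity_np(pthread_self(), sizeof(set), &set) != 0) {
        return false;
    }
    return CPU_COUNT(&set) == 1 && CPU_ISSET(core, &set);
}

// Start a worker that pins itself before doing any latency-sensitive work.
// Returns whether the pin succeeded and was confirmed by the read-back.
bool run_pinned_worker(int core) {
    bool ok = false;
    std::thread worker([&ok, core] {
        ok = pin_current_thread(core) == 0 && is_pinned_to(core);
        // ... the hot loop would run here, after the pin ...
    });
    worker.join();
    return ok;
}
```

Pinning from inside the thread avoids the short window in which a freshly started thread runs unpinned and may allocate on the wrong node.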
2. Allocate memory on the local NUMA node
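new and malloc are NUMA-unaware: they take no node argument, and under Linux's default policy a page is placed on the node of whichever thread touches it first. Pre-allocate at startup, from the thread that will use the memory, while it is pinned: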
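// Linux with libnuma installed; link with -lnuma
#include <numa.h>
// Check numa_available() != -1 before calling any other libnuma function
void* buf = numa_alloc_local(size);  // returns NULL on failure
// ... use buf ...
numa_free(buf, size);  // release with numa_free, not free() or delete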
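Where libnuma is not available, a similar effect can often be had by relying on Linux's default first-touch placement: a page is physically allocated on the node of the CPU that first writes to it. The sketch below is an alternative to numa_alloc_local, not a replacement for it. It assumes the calling thread is already pinned and that the default memory policy is in effect; alloc_and_touch is a hypothetical name.

```cpp
#include <cstddef>
#include <cstdlib>
#include <unistd.h>

// Allocate `size` bytes and write one byte per page so that the kernel backs
// every page now, from the calling thread. Under the default first-touch
// policy this places the pages on the caller's NUMA node.
// Returns nullptr on failure; release the buffer with std::free.
char* alloc_and_touch(std::size_t size) {
    const long page_sz = sysconf(_SC_PAGESIZE);
    if (size == 0 || page_sz <= 0) return nullptr;

    char* buf = static_cast<char*>(std::malloc(size));
    if (buf == nullptr) return nullptr;

    const std::size_t page = static_cast<std::size_t>(page_sz);
    for (std::size_t off = 0; off < size; off += page) {
        buf[off] = 0;  // first write triggers the page fault
    }
    buf[size - 1] = 0;  // covers the last page if buf is not page-aligned
    return buf;
}
```

This works best for large buffers allocated once at startup. Small blocks may be served from heap memory the allocator has already touched elsewhere, in which case placement is not guaranteed.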
Some latency figures to have in mind
| Level | Latency |
|---|---|
| L1 cache | ~1ns |
| L2 cache | ~4ns |
| L3 cache | ~13ns |
| Local DRAM | ~80ns |
| Remote DRAM | ~300ns ❌ |
A single cross-NUMA cache miss adds ~220ns (300ns remote minus 80ns local), which alone consumes 11–22% of a 1–2μs round-trip target.
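The budget arithmetic can be made explicit. This small sketch uses the approximate figures from the table above; the constant and function names are illustrative, and real latencies should be measured on the target hardware.

```cpp
// Approximate figures from the latency table, in nanoseconds.
constexpr double kLocalDramNs = 80.0;
constexpr double kRemoteDramNs = 300.0;

// Extra latency of one remote DRAM access compared with a local one.
constexpr double kRemotePenaltyNs = kRemoteDramNs - kLocalDramNs;
static_assert(kRemotePenaltyNs == 220.0, "300ns - 80ns");

// Fraction of a round-trip budget consumed by `misses` cross-NUMA misses.
constexpr double budget_fraction(double budget_ns, int misses) {
    return misses * kRemotePenaltyNs / budget_ns;
}
```

With a 1μs budget, budget_fraction(1000.0, 1) is about 0.22; with a 2μs budget it is about 0.11. Five such misses on the critical path would exceed a 1μs budget on their own.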