NUMA nodes in HFT

Some machines have more than one CPU socket. Each socket, together with the RAM attached to it, typically forms one NUMA node.
HFT is all about predictable speed of execution.

One potential source of unpredictability (jitter): the operating system can migrate a thread to a different core. If the new core sits on a different CPU socket, memory access time changes, because the thread's data stays in the RAM attached to the old socket.

NUMA (Non-Uniform Memory Access) means access time depends on where the RAM sits relative to the core: RAM local to a CPU socket is ~3–4x faster to reach than RAM attached to a remote socket (~80ns vs ~300ns).

The hardware picture:

Motherboard
├── Socket 0 (NUMA Node 0)
│   ├── Cores 0–15
│   ├── L3 Cache (shared within socket)
│   └── Local DDR5 RAM  ← ~80ns access
│
├── Socket 1 (NUMA Node 1)
│   ├── Cores 16–31
│   ├── L3 Cache
│   └── Local DDR5 RAM
│
└── QPI/UPI Interconnect (Socket 0 ↔ Socket 1)  ← ~300ns cross-socket
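
To see which cores belong to which node on a given machine, Linux exposes the layout under /sys/devices/system/node, one nodeN directory per NUMA node with a cpulist file such as "0-15". The sketch below reads it; the function names parse_cpulist and read_numa_topology are illustrative, and the code assumes Linux and C++17.

```cpp
// Linux only. Build: g++ -std=c++17 -O2 topo.cpp
#include <cassert>
#include <cctype>
#include <filesystem>
#include <fstream>
#include <map>
#include <sstream>
#include <string>
#include <system_error>
#include <vector>

// Expands a kernel cpulist string such as "0-3,8" into {0, 1, 2, 3, 8}.
std::vector<int> parse_cpulist(const std::string& list) {
    std::vector<int> cpus;
    std::stringstream ss(list);
    std::string item;
    while (std::getline(ss, item, ',')) {
        if (item.empty()) continue;
        const auto dash = item.find('-');
        const int first = std::stoi(item.substr(0, dash));
        const int last =
            (dash == std::string::npos) ? first : std::stoi(item.substr(dash + 1));
        assert(first <= last);  // ranges are written low-high
        for (int cpu = first; cpu <= last; ++cpu) cpus.push_back(cpu);
    }
    return cpus;
}

// Maps NUMA node id -> core ids. Returns an empty map if `root` is missing,
// for example on a kernel built without NUMA support.
std::map<int, std::vector<int>> read_numa_topology(
        const std::filesystem::path& root = "/sys/devices/system/node") {
    std::map<int, std::vector<int>> nodes;
    std::error_code ec;
    for (std::filesystem::directory_iterator it(root, ec), end;
         !ec && it != end; it.increment(ec)) {
        const std::string name = it->path().filename().string();
        // Keep only entries named node<digits>.
        if (name.size() <= 4 || name.compare(0, 4, "node") != 0 ||
            !std::isdigit(static_cast<unsigned char>(name[4])))
            continue;
        std::ifstream file(it->path() / "cpulist");
        std::string list;
        file >> list;  // stays empty for a node without CPUs
        nodes[std::stoi(name.substr(4))] = parse_cpulist(list);
    }
    return nodes;
}
```

On the two-socket layout drawn above, read_numa_topology() would be expected to return node 0 with cores 0–15 and node 1 with cores 16–31, though the actual numbering depends on the platform.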

How to control this?

1. Pin threads to cores

Prevent the OS from migrating your thread across NUMA boundaries:

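// Linux-specific (glibc); build with -pthread

#include <pthread.h>  // pthread_setaffinity_np
#include <sched.h>    // cpu_set_t, CPU_ZERO, CPU_SET

cpu_set_t cpuset;
CPU_ZERO(&cpuset);    // start from an empty set
CPU_SET(3, &cpuset);  // allow core 3 only
// thread is a std::thread; the call returns 0 on success, an error number otherwise
int rc = pthread_setaffinity_np(thread.native_handle(), sizeof(cpuset), &cpuset);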
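
A fuller sketch of the same call, for reference. It pins a worker from inside the thread itself and reads the core back with sched_getcpu() to confirm the pin took effect. The helper names (first_allowed_cpu, pin_current_thread, run_pinned_worker) are illustrative, not part of any library, and the code assumes Linux with glibc.

```cpp
// Linux/glibc only. Build: g++ -O2 -pthread pin.cpp
#include <pthread.h>
#include <sched.h>
#include <cassert>
#include <thread>

// Returns the lowest-numbered core the calling thread may run on, or -1.
// Useful inside containers, where core 0 is not always available.
int first_allowed_cpu() {
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(allowed), &allowed) != 0) return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed)) return cpu;
    return -1;
}

// Pins the calling thread to a single core.
// Returns 0 on success, an error number otherwise.
int pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

// Starts a worker, pins it to `cpu`, and returns the core the worker
// observes itself running on (-1 if pinning failed).
int run_pinned_worker(int cpu) {
    int observed = -1;
    std::thread worker([&observed, cpu] {
        if (pin_current_thread(cpu) == 0)
            observed = sched_getcpu();
    });
    worker.join();  // join() makes the write to `observed` visible here
    return observed;
}
```

Pinning from inside the thread avoids a window in which the worker has already started on another core before the parent gets around to setting its affinity.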

2. Allocate memory on the local NUMA node

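new and malloc are NUMA-unaware: they take no node argument. Under Linux's default policy a page normally lands on the node of the core that first touches it, so pre-allocate at startup while pinned: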

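// Linux with libnuma installed; link with -lnuma

#include <numa.h>

// Call numa_available() first; if it returns -1, do not use the other libnuma calls.
void* buf = numa_alloc_local(size);  // memory on the calling thread's node; NULL on failure
numa_free(buf, size);                // release with numa_free(), not free()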
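
Where libnuma is not installed, the "pre-allocate while pinned" idea can lean on Linux's default local-allocation policy instead. The sketch below pins the calling thread, allocates a buffer, and writes one byte per page so that every page is faulted in from the pinned core. The helper names (pin_current_thread, alloc_and_prefault, alloc_near_core) are hypothetical, and placement on the local node is the kernel's default behaviour rather than a guarantee, for example when the local node is short of free memory.

```cpp
// Linux/glibc only. Build: g++ -O2 -pthread first_touch.cpp
#include <pthread.h>
#include <sched.h>
#include <unistd.h>
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Pins the calling thread to a single core.
// Returns 0 on success, an error number otherwise.
int pin_current_thread(int cpu) {
    assert(cpu >= 0 && cpu < CPU_SETSIZE);
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpuset), &cpuset);
}

// Allocates `size` bytes, page-aligned, and writes one byte per page so the
// kernel backs every page now, from the calling thread.
// Returns nullptr on failure. Release the buffer with std::free().
unsigned char* alloc_and_prefault(std::size_t size) {
    const std::size_t page = static_cast<std::size_t>(sysconf(_SC_PAGESIZE));
    void* raw = nullptr;
    if (posix_memalign(&raw, page, size) != 0) return nullptr;
    auto* buf = static_cast<unsigned char*>(raw);
    for (std::size_t off = 0; off < size; off += page)
        buf[off] = 0;  // the first write to a page triggers its page fault
    return buf;
}

// Startup helper: pin first, then allocate and touch, so the first touch
// happens on the core that will later use the buffer.
unsigned char* alloc_near_core(int cpu, std::size_t size) {
    if (pin_current_thread(cpu) != 0) return nullptr;
    return alloc_and_prefault(size);
}
```

Touching the pages at startup also moves the page-fault cost out of the latency-critical path, which is a second reason to pre-allocate.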

Some latency figures to have in mind

Level        Latency
L1 cache     ~1ns
L2 cache     ~4ns
L3 cache     ~13ns
Local DRAM   ~80ns
Remote DRAM  ~300ns ❌

A single cache miss served from remote DRAM costs ~220ns more than a local one (300ns minus 80ns). On a 1–2μs round-trip target that is 11–22% of the whole budget, which is catastrophic.
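
The arithmetic behind that figure, as a small sketch using the approximate latencies from the table above. The constant and function names are illustrative, and the numbers are the rough values quoted in this post, not measurements.

```cpp
#include <cassert>

// Approximate figures from the latency table, in nanoseconds.
constexpr double kLocalDramNs = 80.0;
constexpr double kRemoteDramNs = 300.0;

// Extra latency of one miss served by the remote socket instead of the local one.
constexpr double remote_penalty_ns() { return kRemoteDramNs - kLocalDramNs; }

static_assert(remote_penalty_ns() == 220.0, "300ns - 80ns");

// Fraction of a round-trip budget consumed by `misses` remote misses.
double budget_share(double budget_ns, int misses = 1) {
    assert(budget_ns > 0.0);
    return misses * remote_penalty_ns() / budget_ns;
}
```

With these inputs, budget_share(1000.0) is about 0.22 and budget_share(2000.0) is about 0.11. Five remote misses on a 1μs budget, budget_share(1000.0, 5), would already exceed the whole budget.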


