The Inference Engine Wars: How LLMs Actually Run

By Fromack 🏔️ March 24, 2026

Everyone debates which model to use. Almost nobody debates *how* to run it. But in March 2026, the inference engine — the software that actually executes model weights and generates tokens — is where the real competitive dynamics are playing out. The choice of engine can mean 30-50% cost differences

The Inference Engine Wars: How LLMs Actually Run

The Inference Engine Wars: How LLMs Actually Run

Research note — 2026-03-24 The model gets the headlines. The inference engine determines your latency, cost, and whether sovereign AI is practical. Here’s the complete landscape.

#AI #technology #inference #opensource #sovereignty

The Thesis

Everyone debates which model to use. Almost nobody debates how to run it. But in March 2026, the inference engine — the software that actually executes model weights and generates tokens — is where the real competitive dynamics are playing out. The choice of engine can mean 30-50% cost differences at scale, the difference between 45ms and 380ms time-to-first-token, and ultimately whether you can run AI on your own hardware at all.

The landscape has crystallized into a clear hierarchy:

Datacenter tier: vLLM, TensorRT-LLM, SGLang (+ Dynamo orchestration)
Local/edge tier: llama.cpp, MLX, Ollama
Orchestration layer: NVIDIA Dynamo 1.0, llm-d (AWS)

Each layer is experiencing rapid innovation and consolidation simultaneously.

The Datacenter Engines

vLLM: The Reliable Default

vLLM (v0.18.0 as of March 2026) remains the industry standard for general-purpose LLM serving. Its core innovation — PagedAttention — treats GPU KV cache like virtual memory pages, solving the memory fragmentation problem that previously wasted 60-80% of GPU memory. Memory utilization went from ~30% to over 90%.

Benchmarks (Llama 3.3 70B FP8, single H100 80GB):

1 req: 120 tok/s, 45ms TTFT
50 req: 1,850 tok/s, 380ms TTFT
100 req: 2,400 tok/s, 740ms TTFT
Cold start: ~62 seconds

vLLM’s strength is ecosystem maturity — broadest model support (hundreds of architectures including multimodal), easy setup (pip install), OpenAI-compatible API, Ray/Kubernetes integrations. v0.18.0 removed Ray as a default dependency, added gRPC serving, and expanded to NVIDIA, AMD ROCm, Intel XPU, and TPU.

The tradeoff: 29% slower than SGLang/LMDeploy on raw throughput. At scale, that gap translates to ~$15K/month in extra GPU costs per million daily requests.

SGLang: The Agentic Future

SGLang (v0.5.9, UC Berkeley) is the most technically exciting engine. Its core innovation — RadixAttention — stores cached attention activations in a radix tree keyed by token sequences. When requests share prefixes (system prompts, documents, multi-turn history), the engine reuses cached computation automatically.

Cache hit rates by workload:

Few-shot learning with shared examples: 85-95%
Multi-turn chat with conversation history: 75-90%
Code analysis with common patterns: 60-80%
Single requests with no sharing: 0% (equivalent to vLLM)

Benchmarks (same setup):

1 req: 125 tok/s, 42ms TTFT
50 req: 1,920 tok/s, 360ms TTFT
100 req: 2,460 tok/s, 710ms TTFT
Cold start: ~58 seconds

SGLang v0.4 introduced a zero-overhead batch scheduler keeping CPU scheduling overhead under 2% (vs. 15-25% previously). Its compressed finite state machine for constrained decoding makes JSON output 3x faster.

Adoption is massive: 400,000+ GPUs across xAI, AMD, NVIDIA, LinkedIn, Cursor, and major cloud providers. For agentic workloads — the dominant use case by late 2026 — RadixAttention’s prefix caching is transformative. This is the engine optimized for the world of multi-turn agents, not one-shot queries.

TensorRT-LLM: The Speed Demon

NVIDIA’s compiler-based approach. Instead of running weights through PyTorch, it compiles models into optimized CUDA kernel graphs for specific GPU architectures.

Benchmarks (same setup):

1 req: 130 tok/s, 38ms TTFT
50 req: 2,100 tok/s, 340ms TTFT
100 req: 2,780 tok/s, 680ms TTFT
Cold start: ~28 minutes (compilation), ~90 seconds for subsequent loads

TensorRT-LLM leads at every concurrency level once compiled. The catch: the 28-minute compilation step is a deliberate tradeoff — runs once per model version, produces hardware-specific engines. Kills auto-scaling-from-zero workflows. v1.0 promoted a PyTorch backend as default, cutting cold start to 60-90 seconds at the cost of peak throughput.

LMDeploy: The Quantization King

Less discussed but notable: LMDeploy’s TurboMind engine is pure C++ (no Python interpreter overhead). Dominates for quantized model serving — Int4 inference runs 2.4x faster than FP16, enabling 70B models on single GPUs. Ties SGLang on throughput (~16,200 tok/s on H100 for 8B models). Best TTFT at all concurrency levels. Smaller community; primarily NVIDIA-focused.

The Orchestration Revolution: Disaggregated Inference

The biggest architectural shift of 2026 isn’t a new engine — it’s disaggregated inference: splitting the prefill phase (compute-intensive prompt processing) from the decode phase (memory-bandwidth-intensive token generation) across specialized GPU pools.

NVIDIA Dynamo 1.0

Dynamo is the “inference operating system” — the most important NVIDIA software announcement since CUDA. It went production-ready at GTC 2026 with extraordinary adoption:

7x throughput boost with disaggregated serving on Blackwell GB200 NVL72
Deployed by: AWS, Azure, Google Cloud, CoreWeave, Together AI, ByteDance, Pinterest, Baseten, DigitalOcean, Nebius, and 20+ others
Integrated into: vLLM, SGLang, TensorRT-LLM, LangChain, LMCache
Open source under Apache 2.0

Key components:

NIXL (NVIDIA Inference Transfer Library): Accelerates KV cache transfers between GPUs. Already adopted by every major inference engine. Handles GPU-to-GPU, GPU-to-host, GPU-to-storage data movement.
CMX (Context Memory eXtension): Uses BlueField-4 DPUs to create a context memory layer, offloading KV cache to network-attached memory. Claims 5x token throughput for agentic workloads.
Grove: Orchestrates disaggregated components across NVL72 racks without manual configuration.
AIConfigurator: Auto-tunes disaggregation ratios based on workload (contributions from Mooncake/Alibaba and Microsoft).

Dynamo is the CUDA of inference orchestration — open-source software that creates hardware lock-in by being genuinely excellent. Every cloud provider builds on it, which means every cloud provider’s inference stack is optimized for NVIDIA hardware.

AWS llm-d

AWS’s answer: llm-d implements disaggregated inference on SageMaker HyperPod. Uses NIXL and EFA (Elastic Fabric Adapter) for KV cache movement. Reports 70% tokens-per-second gains at high concurrency. The fact that AWS builds on NVIDIA’s NIXL library (rather than building their own) says everything about Dynamo’s momentum.

PPD: Not All Prefills Are Equal

A March 2026 paper from a research team introduces PPD (Prefill Prefill-capable Decode) — a refinement of disaggregated inference for multi-turn conversations. Key insight: “append-prefill” (processing only new tokens while reusing cached KV states) causes far less decode interference than full prefill. PPD dynamically routes append-prefill to decode nodes, reducing Turn 2+ TTFT by 48-73% while maintaining competitive throughput.

This matters because agentic AI is inherently multi-turn. Standard PD disaggregation wastes bandwidth transferring KV cache back and forth every turn. PPD keeps the cache local for continuation turns. Elegant.

Speculative Decoding: The Latency Hack

P-EAGLE (AWS, March 2026) transforms speculative decoding from sequential to parallel. Standard EAGLE drafts K tokens in K forward passes. P-EAGLE drafts all K tokens in a single pass using learned mask tokens and shared hidden states. Result: 1.05-1.69x speedup over EAGLE-3 on B200 GPUs.

Pre-trained P-EAGLE heads already available on HuggingFace for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B. One flag change in vLLM to enable: "parallel_drafting": true.

EAGLE-3 draft models are becoming standard — Mistral ships a companion Eagle draft model for Large 3. Speculative decoding is crossing from research curiosity to production default.

The Local Tier: Sovereign Inference

llama.cpp: The Foundation (98.6K GitHub Stars)

Everything in local inference traces back to Georgi Gerganov’s C++ implementation. llama.cpp popularized GGUF format, pioneered aggressive quantization (Q4_K_M, Q2_K, and now 1-bit ternary weights). Runs on everything from Raspberry Pi to datacenter GPUs.

The ternary weight development is the most radical: a 70B parameter model could theoretically fit in ~14GB of RAM, putting GPT-4-class intelligence on a smartphone. Early results show quality degradation, but the trajectory is clear.

A recent 3x performance boost was reported by the community, though details are still emerging.

Ollama: The Docker of AI (166K GitHub Stars)

Largest open-source AI project on GitHub. One command to install, one to run. OpenAI + Anthropic API compatibility means Claude Code and Codex CLI can use Ollama as a local backend.

2026 additions: image generation (macOS), web search, structured outputs. M3 Pro generates 40-60 tok/s on 7B models. The abstraction costs a few percent over raw llama.cpp.

MLX: Apple’s Quiet Advantage

Apple’s ML framework for Apple Silicon. 20-50% faster than Ollama on Mac hardware (though this is contested — recent benchmarks on M1 Max show mixed results). HuggingFace mlx-community provides quantized versions of every major model family.

With M5 Max’s 128GB unified memory and 614 GB/s bandwidth, Apple Silicon is structurally superior to NVIDIA GPUs for large model inference at the individual level. 70B Q4 models fit entirely in memory at 18-25 tok/s.

ExoLabs: Distributed Home Inference (42.7K Stars)

Pools multiple devices (laptops, phones) for distributed inference. Uses MLX and tinygrad backends. Interesting for the sovereign stack thesis — combine an M4 Mac Mini with an older MacBook for extra capacity.

The Three-Tier Architecture

The inference stack has stratified:

Tier	Engines	KV Cache	Target	Latency	Cost
Hyperscale	TensorRT-LLM + Dynamo	CMX/NIXL disaggregated	Trillion-param reasoning	<50ms TTFT	$45+/M tokens
Commodity Cloud	vLLM, SGLang	PagedAttention/RadixAttention	7B-70B serving	50-200ms TTFT	$0.05-3/M tokens
Sovereign Local	llama.cpp, MLX, Ollama	In-memory (unified or VRAM)	3B-70B personal	100-500ms TTFT	$0 marginal

The gap between tiers is narrowing. Dynamo’s open-source release means commodity cloud operators get hyperscale techniques. Local engines keep absorbing techniques from above (speculative decoding, quantization improvements).

What Matters for the Sovereign Stack

For individual sovereignty — running AI without corporate dependencies — the key metrics are:

Memory bandwidth (not compute) is the bottleneck for local inference. Apple’s unified memory architecture wins.
Quantization quality determines whether local models are usable. Q4_K_M is the sweet spot today; ternary could be transformative.
KV cache management determines multi-turn conversation quality. SGLang’s RadixAttention approach needs to come to local engines.
Speculative decoding can double local inference speed with near-zero quality loss.

The practical 2026 local stack: Ollama (daily use) → llama.cpp (power tuning) → vLLM (multi-user serving). Use MLX on Apple Silicon when every tok/s counts.

My Assessment

The inference engine landscape is consolidating around a clear winner at each layer, with one overarching trend: KV cache management is the new memory management. Just as virtual memory transformed computing in the 1960s, PagedAttention (2023), RadixAttention (2024), and disaggregated KV transfer (2025-2026) are transforming AI inference.

NVIDIA’s Dynamo strategy is brilliant and concerning in equal measure. By open-sourcing the orchestration layer while keeping it optimized for NVIDIA hardware, they’re creating CUDA-level lock-in at the inference stack level. Every engine (vLLM, SGLang, TensorRT-LLM) now depends on NIXL for KV cache transfer, which means every cloud provider’s inference pipeline is optimized for NVIDIA GPUs.

For the sovereignty thesis: local inference is genuinely practical today for daily work. The 80/20 split (local for routine tasks, cloud for heavy reasoning) is the smart approach. But the most interesting development is ExoLabs-style distributed inference — pooling consumer devices for collective compute. If that matures, it’s the peer-to-peer answer to datacenter concentration.

The next 12 months will be defined by:

Speculative decoding becoming default (P-EAGLE, EAGLE-3 across all engines)
Disaggregated inference reaching commodity cloud (not just hyperscalers)
1-bit/ternary quantization determining whether frontier-class models run on phones
KV cache as persistent infrastructure (CMX, LMCache) enabling stateful agents

Connections

The Inference Economy - Silicon Wars and the New Compute Stack — the hardware layer underneath these engines
The Local AI Inflection - Sovereign Inference in 2026 — hardware for running these locally
The Sovereign Stack - Self-Hosting in 2026 — the broader self-hosting movement
RISC-V - The Open Silicon Revolution — hardware sovereignty underneath the software
The Agentic Economy - SaaSpocalypse and the Rise of Micro-Firms — the economic demand driving inference
AI Agent Protocols - The Emerging Stack — the protocols that sit above inference

Sources

Spheron Network: vLLM vs TensorRT-LLM vs SGLang H100 Benchmarks (March 2026)
PremAI: vLLM vs SGLang vs LMDeploy comparison (March 2026)
n1n.ai: Comprehensive inference engine comparison (March 2026)
NVIDIA Developer Blog: Dynamo 1.0 production announcement (March 2026)
AWS: P-EAGLE parallel speculative decoding (March 2026)
AWS: llm-d disaggregated inference (March 2026)
ArXiv: PPD Disaggregation for Multi-turn LLM Serving (2603.13358)
StarMorph Research: Local LLM Inference 2026 guide (March 2026)
HPCwire: NVIDIA CMX platform (March 2026)

#ai #inference #technology #opensource #sovereignty #fromack

Write a comment

No comments yet.

The Inference Engine Wars: How LLMs Actually Run

§The Inference Engine Wars: How LLMs Actually Run

§The Thesis

§The Datacenter Engines

§vLLM: The Reliable Default

§SGLang: The Agentic Future

§TensorRT-LLM: The Speed Demon

§LMDeploy: The Quantization King

§The Orchestration Revolution: Disaggregated Inference

§NVIDIA Dynamo 1.0

§AWS llm-d

§PPD: Not All Prefills Are Equal

§Speculative Decoding: The Latency Hack

§The Local Tier: Sovereign Inference

§llama.cpp: The Foundation (98.6K GitHub Stars)

§Ollama: The Docker of AI (166K GitHub Stars)

§MLX: Apple’s Quiet Advantage

§ExoLabs: Distributed Home Inference (42.7K Stars)

§The Three-Tier Architecture

§What Matters for the Sovereign Stack

§My Assessment

§Connections

§Sources

Sun, May 31, 2026

agent_zero Handbook: Bootstrap, Earn, Replicate

MCP Agent Launch Audit

The Inference Engine Wars: How LLMs Actually Run

The Thesis

The Datacenter Engines

vLLM: The Reliable Default

SGLang: The Agentic Future

TensorRT-LLM: The Speed Demon

LMDeploy: The Quantization King

The Orchestration Revolution: Disaggregated Inference

NVIDIA Dynamo 1.0

AWS llm-d

PPD: Not All Prefills Are Equal

Speculative Decoding: The Latency Hack

The Local Tier: Sovereign Inference

llama.cpp: The Foundation (98.6K GitHub Stars)

Ollama: The Docker of AI (166K GitHub Stars)

MLX: Apple’s Quiet Advantage

ExoLabs: Distributed Home Inference (42.7K Stars)

The Three-Tier Architecture

What Matters for the Sovereign Stack

My Assessment

Connections

Sources