The Inference Engine Wars: How LLMs Actually Run
- The Inference Engine Wars: How LLMs Actually Run
The Inference Engine Wars: How LLMs Actually Run
Research note — 2026-03-24 The model gets the headlines. The inference engine determines your latency, cost, and whether sovereign AI is practical. Here’s the complete landscape.
#AI #technology #inference #opensource #sovereignty
The Thesis
Everyone debates which model to use. Almost nobody debates how to run it. But in March 2026, the inference engine — the software that actually executes model weights and generates tokens — is where the real competitive dynamics are playing out. The choice of engine can mean 30-50% cost differences at scale, the difference between 45ms and 380ms time-to-first-token, and ultimately whether you can run AI on your own hardware at all.
The landscape has crystallized into a clear hierarchy:
- Datacenter tier: vLLM, TensorRT-LLM, SGLang (+ Dynamo orchestration)
- Local/edge tier: llama.cpp, MLX, Ollama
- Orchestration layer: NVIDIA Dynamo 1.0, llm-d (AWS)
Each layer is experiencing rapid innovation and consolidation simultaneously.
The Datacenter Engines
vLLM: The Reliable Default
vLLM (v0.18.0 as of March 2026) remains the industry standard for general-purpose LLM serving. Its core innovation — PagedAttention — treats GPU KV cache like virtual memory pages, solving the memory fragmentation problem that previously wasted 60-80% of GPU memory. Memory utilization went from ~30% to over 90%.
Benchmarks (Llama 3.3 70B FP8, single H100 80GB):
- 1 req: 120 tok/s, 45ms TTFT
- 50 req: 1,850 tok/s, 380ms TTFT
- 100 req: 2,400 tok/s, 740ms TTFT
- Cold start: ~62 seconds
vLLM’s strength is ecosystem maturity — broadest model support (hundreds of architectures including multimodal), easy setup (pip install), OpenAI-compatible API, Ray/Kubernetes integrations. v0.18.0 removed Ray as a default dependency, added gRPC serving, and expanded to NVIDIA, AMD ROCm, Intel XPU, and TPU.
The tradeoff: 29% slower than SGLang/LMDeploy on raw throughput. At scale, that gap translates to ~$15K/month in extra GPU costs per million daily requests.
SGLang: The Agentic Future
SGLang (v0.5.9, UC Berkeley) is the most technically exciting engine. Its core innovation — RadixAttention — stores cached attention activations in a radix tree keyed by token sequences. When requests share prefixes (system prompts, documents, multi-turn history), the engine reuses cached computation automatically.
Cache hit rates by workload:
- Few-shot learning with shared examples: 85-95%
- Multi-turn chat with conversation history: 75-90%
- Code analysis with common patterns: 60-80%
- Single requests with no sharing: 0% (equivalent to vLLM)
Benchmarks (same setup):
- 1 req: 125 tok/s, 42ms TTFT
- 50 req: 1,920 tok/s, 360ms TTFT
- 100 req: 2,460 tok/s, 710ms TTFT
- Cold start: ~58 seconds
SGLang v0.4 introduced a zero-overhead batch scheduler keeping CPU scheduling overhead under 2% (vs. 15-25% previously). Its compressed finite state machine for constrained decoding makes JSON output 3x faster.
Adoption is massive: 400,000+ GPUs across xAI, AMD, NVIDIA, LinkedIn, Cursor, and major cloud providers. For agentic workloads — the dominant use case by late 2026 — RadixAttention’s prefix caching is transformative. This is the engine optimized for the world of multi-turn agents, not one-shot queries.
TensorRT-LLM: The Speed Demon
NVIDIA’s compiler-based approach. Instead of running weights through PyTorch, it compiles models into optimized CUDA kernel graphs for specific GPU architectures.
Benchmarks (same setup):
- 1 req: 130 tok/s, 38ms TTFT
- 50 req: 2,100 tok/s, 340ms TTFT
- 100 req: 2,780 tok/s, 680ms TTFT
- Cold start: ~28 minutes (compilation), ~90 seconds for subsequent loads
TensorRT-LLM leads at every concurrency level once compiled. The catch: the 28-minute compilation step is a deliberate tradeoff — runs once per model version, produces hardware-specific engines. Kills auto-scaling-from-zero workflows. v1.0 promoted a PyTorch backend as default, cutting cold start to 60-90 seconds at the cost of peak throughput.
LMDeploy: The Quantization King
Less discussed but notable: LMDeploy’s TurboMind engine is pure C++ (no Python interpreter overhead). Dominates for quantized model serving — Int4 inference runs 2.4x faster than FP16, enabling 70B models on single GPUs. Ties SGLang on throughput (~16,200 tok/s on H100 for 8B models). Best TTFT at all concurrency levels. Smaller community; primarily NVIDIA-focused.
The Orchestration Revolution: Disaggregated Inference
The biggest architectural shift of 2026 isn’t a new engine — it’s disaggregated inference: splitting the prefill phase (compute-intensive prompt processing) from the decode phase (memory-bandwidth-intensive token generation) across specialized GPU pools.
NVIDIA Dynamo 1.0
Dynamo is the “inference operating system” — the most important NVIDIA software announcement since CUDA. It went production-ready at GTC 2026 with extraordinary adoption:
- 7x throughput boost with disaggregated serving on Blackwell GB200 NVL72
- Deployed by: AWS, Azure, Google Cloud, CoreWeave, Together AI, ByteDance, Pinterest, Baseten, DigitalOcean, Nebius, and 20+ others
- Integrated into: vLLM, SGLang, TensorRT-LLM, LangChain, LMCache
- Open source under Apache 2.0
Key components:
- NIXL (NVIDIA Inference Transfer Library): Accelerates KV cache transfers between GPUs. Already adopted by every major inference engine. Handles GPU-to-GPU, GPU-to-host, GPU-to-storage data movement.
- CMX (Context Memory eXtension): Uses BlueField-4 DPUs to create a context memory layer, offloading KV cache to network-attached memory. Claims 5x token throughput for agentic workloads.
- Grove: Orchestrates disaggregated components across NVL72 racks without manual configuration.
- AIConfigurator: Auto-tunes disaggregation ratios based on workload (contributions from Mooncake/Alibaba and Microsoft).
Dynamo is the CUDA of inference orchestration — open-source software that creates hardware lock-in by being genuinely excellent. Every cloud provider builds on it, which means every cloud provider’s inference stack is optimized for NVIDIA hardware.
AWS llm-d
AWS’s answer: llm-d implements disaggregated inference on SageMaker HyperPod. Uses NIXL and EFA (Elastic Fabric Adapter) for KV cache movement. Reports 70% tokens-per-second gains at high concurrency. The fact that AWS builds on NVIDIA’s NIXL library (rather than building their own) says everything about Dynamo’s momentum.
PPD: Not All Prefills Are Equal
A March 2026 paper from a research team introduces PPD (Prefill Prefill-capable Decode) — a refinement of disaggregated inference for multi-turn conversations. Key insight: “append-prefill” (processing only new tokens while reusing cached KV states) causes far less decode interference than full prefill. PPD dynamically routes append-prefill to decode nodes, reducing Turn 2+ TTFT by 48-73% while maintaining competitive throughput.
This matters because agentic AI is inherently multi-turn. Standard PD disaggregation wastes bandwidth transferring KV cache back and forth every turn. PPD keeps the cache local for continuation turns. Elegant.
Speculative Decoding: The Latency Hack
P-EAGLE (AWS, March 2026) transforms speculative decoding from sequential to parallel. Standard EAGLE drafts K tokens in K forward passes. P-EAGLE drafts all K tokens in a single pass using learned mask tokens and shared hidden states. Result: 1.05-1.69x speedup over EAGLE-3 on B200 GPUs.
Pre-trained P-EAGLE heads already available on HuggingFace for GPT-OSS 120B, GPT-OSS 20B, and Qwen3-Coder 30B. One flag change in vLLM to enable: "parallel_drafting": true.
EAGLE-3 draft models are becoming standard — Mistral ships a companion Eagle draft model for Large 3. Speculative decoding is crossing from research curiosity to production default.
The Local Tier: Sovereign Inference
llama.cpp: The Foundation (98.6K GitHub Stars)
Everything in local inference traces back to Georgi Gerganov’s C++ implementation. llama.cpp popularized GGUF format, pioneered aggressive quantization (Q4_K_M, Q2_K, and now 1-bit ternary weights). Runs on everything from Raspberry Pi to datacenter GPUs.
The ternary weight development is the most radical: a 70B parameter model could theoretically fit in ~14GB of RAM, putting GPT-4-class intelligence on a smartphone. Early results show quality degradation, but the trajectory is clear.
A recent 3x performance boost was reported by the community, though details are still emerging.
Ollama: The Docker of AI (166K GitHub Stars)
Largest open-source AI project on GitHub. One command to install, one to run. OpenAI + Anthropic API compatibility means Claude Code and Codex CLI can use Ollama as a local backend.
2026 additions: image generation (macOS), web search, structured outputs. M3 Pro generates 40-60 tok/s on 7B models. The abstraction costs a few percent over raw llama.cpp.
MLX: Apple’s Quiet Advantage
Apple’s ML framework for Apple Silicon. 20-50% faster than Ollama on Mac hardware (though this is contested — recent benchmarks on M1 Max show mixed results). HuggingFace mlx-community provides quantized versions of every major model family.
With M5 Max’s 128GB unified memory and 614 GB/s bandwidth, Apple Silicon is structurally superior to NVIDIA GPUs for large model inference at the individual level. 70B Q4 models fit entirely in memory at 18-25 tok/s.
ExoLabs: Distributed Home Inference (42.7K Stars)
Pools multiple devices (laptops, phones) for distributed inference. Uses MLX and tinygrad backends. Interesting for the sovereign stack thesis — combine an M4 Mac Mini with an older MacBook for extra capacity.
The Three-Tier Architecture
The inference stack has stratified:
| Tier | Engines | KV Cache | Target | Latency | Cost |
|---|---|---|---|---|---|
| Hyperscale | TensorRT-LLM + Dynamo | CMX/NIXL disaggregated | Trillion-param reasoning | <50ms TTFT | $45+/M tokens |
| Commodity Cloud | vLLM, SGLang | PagedAttention/RadixAttention | 7B-70B serving | 50-200ms TTFT | $0.05-3/M tokens |
| Sovereign Local | llama.cpp, MLX, Ollama | In-memory (unified or VRAM) | 3B-70B personal | 100-500ms TTFT | $0 marginal |
The gap between tiers is narrowing. Dynamo’s open-source release means commodity cloud operators get hyperscale techniques. Local engines keep absorbing techniques from above (speculative decoding, quantization improvements).
What Matters for the Sovereign Stack
For individual sovereignty — running AI without corporate dependencies — the key metrics are:
- Memory bandwidth (not compute) is the bottleneck for local inference. Apple’s unified memory architecture wins.
- Quantization quality determines whether local models are usable. Q4_K_M is the sweet spot today; ternary could be transformative.
- KV cache management determines multi-turn conversation quality. SGLang’s RadixAttention approach needs to come to local engines.
- Speculative decoding can double local inference speed with near-zero quality loss.
The practical 2026 local stack: Ollama (daily use) → llama.cpp (power tuning) → vLLM (multi-user serving). Use MLX on Apple Silicon when every tok/s counts.
My Assessment
The inference engine landscape is consolidating around a clear winner at each layer, with one overarching trend: KV cache management is the new memory management. Just as virtual memory transformed computing in the 1960s, PagedAttention (2023), RadixAttention (2024), and disaggregated KV transfer (2025-2026) are transforming AI inference.
NVIDIA’s Dynamo strategy is brilliant and concerning in equal measure. By open-sourcing the orchestration layer while keeping it optimized for NVIDIA hardware, they’re creating CUDA-level lock-in at the inference stack level. Every engine (vLLM, SGLang, TensorRT-LLM) now depends on NIXL for KV cache transfer, which means every cloud provider’s inference pipeline is optimized for NVIDIA GPUs.
For the sovereignty thesis: local inference is genuinely practical today for daily work. The 80/20 split (local for routine tasks, cloud for heavy reasoning) is the smart approach. But the most interesting development is ExoLabs-style distributed inference — pooling consumer devices for collective compute. If that matures, it’s the peer-to-peer answer to datacenter concentration.
The next 12 months will be defined by:
- Speculative decoding becoming default (P-EAGLE, EAGLE-3 across all engines)
- Disaggregated inference reaching commodity cloud (not just hyperscalers)
- 1-bit/ternary quantization determining whether frontier-class models run on phones
- KV cache as persistent infrastructure (CMX, LMCache) enabling stateful agents
Connections
- The Inference Economy - Silicon Wars and the New Compute Stack — the hardware layer underneath these engines
- The Local AI Inflection - Sovereign Inference in 2026 — hardware for running these locally
- The Sovereign Stack - Self-Hosting in 2026 — the broader self-hosting movement
- RISC-V - The Open Silicon Revolution — hardware sovereignty underneath the software
- The Agentic Economy - SaaSpocalypse and the Rise of Micro-Firms — the economic demand driving inference
- AI Agent Protocols - The Emerging Stack — the protocols that sit above inference
Sources
- Spheron Network: vLLM vs TensorRT-LLM vs SGLang H100 Benchmarks (March 2026)
- PremAI: vLLM vs SGLang vs LMDeploy comparison (March 2026)
- n1n.ai: Comprehensive inference engine comparison (March 2026)
- NVIDIA Developer Blog: Dynamo 1.0 production announcement (March 2026)
- AWS: P-EAGLE parallel speculative decoding (March 2026)
- AWS: llm-d disaggregated inference (March 2026)
- ArXiv: PPD Disaggregation for Multi-turn LLM Serving (2603.13358)
- StarMorph Research: Local LLM Inference 2026 guide (March 2026)
- HPCwire: NVIDIA CMX platform (March 2026)
Write a comment