Inference now consumes over 55 percent of AI infrastructure spending, up from roughly a third in 2023. By 2027, McKinsey projects it will hit 70 to 80 percent. Training a frontier model is a one-time expense. Serving it is a continuous bleed. And yet most optimization discussions still fixate on training efficiency, as if the hard part ends when the loss curve flattens.

It doesn't. The hard part is what happens next: millions of users hitting an endpoint, each expecting tokens in under 200 milliseconds. The techniques that make this possible have quietly become the most consequential work in applied ML. Here's what actually moves the needle.

The Memory Wall Is the Real Bottleneck

Large model inference, at least in the token-by-token decode phase, is not compute-bound. It's memory-bandwidth-bound. A 70B parameter model at FP16 needs 140GB just to hold the weights. Every token generation requires reading those weights from HBM to the compute units. On an H100 with 3.35 TB/s memory bandwidth, that read alone takes roughly 42 milliseconds for a full pass, regardless of how fast the tensor cores can multiply.
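The arithmetic behind that 42-millisecond figure is worth internalizing, because it sets a hard floor on per-token latency. A minimal sketch, using only the numbers above (70B parameters, FP16, 3.35 TB/s):

```python
# Back-of-envelope latency floor from memory bandwidth alone.
# Assumes every decode step streams all weights from HBM exactly once;
# real systems add KV cache reads and kernel overhead on top.

def decode_latency_floor_ms(params_b: float, bytes_per_param: float,
                            bandwidth_tb_s: float) -> float:
    """Minimum per-token latency (ms) if weight reads were the only cost."""
    weight_bytes = params_b * 1e9 * bytes_per_param
    seconds = weight_bytes / (bandwidth_tb_s * 1e12)
    return seconds * 1e3

fp16 = decode_latency_floor_ms(70, 2.0, 3.35)   # ~41.8 ms/token
fp4 = decode_latency_floor_ms(70, 0.5, 3.35)    # ~10.4 ms/token
print(f"FP16 floor: {fp16:.1f} ms/token, FP4 floor: {fp4:.1f} ms/token")
```

The FP4 line previews why quantization works: shrinking the bytes moved shrinks the floor itself.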

This is why raw FLOPS comparisons between GPUs miss the point for inference workloads. NVIDIA's Blackwell B200 doubled tensor core throughput over Hopper, but as the FlashAttention-4 authors discovered, shared memory bandwidth and exponential units didn't scale proportionally. The bottleneck shifted from arithmetic to plumbing. More compute doesn't help when the data can't get there fast enough.

The entire inference optimization stack exists to work around this wall. Every technique below is fundamentally a strategy to either reduce how much data moves or hide the latency of moving it.

Quantization: Trading Bits for Throughput

The most direct attack on the memory wall is making the model smaller. Quantization compresses weights from 16-bit floats to 8-bit, 4-bit, or even lower representations. The results are striking and well-documented.

AWQ (Activation-Aware Weight Quantization) at 4-bit retains 95 percent of FP16 accuracy across standard benchmarks like MMLU, HellaSwag, and ARC. GPTQ is slightly less accurate (roughly 90 percent retention) but delivers faster raw throughput on GPU. The real story is in the kernel implementations: AWQ with the Marlin kernel hits 741 tokens per second versus 68 tok/s with the standard kernel, a 10.9x speedup on identical hardware.

NVIDIA's NVFP4 format on Blackwell pushes this further. It reduces memory footprint by 1.8x compared to FP8 while maintaining near-FP8 accuracy (typically less than 1 percent degradation). A single DGX B200 running Llama at FP4 delivers over 3x more inference throughput than a DGX H200.
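The footprint math alone explains most of the appeal. A quick sketch for a 70B model (the 80 GB VRAM figure is an illustrative H100-class assumption):

```python
# Weight footprint at different precisions, to show why quantization
# attacks the memory wall directly. 70B parameters assumed throughout.

PARAMS = 70e9
VRAM_GB = 80  # H100-class card, for illustration

for name, bits in [("FP16", 16), ("FP8", 8), ("FP4", 4)]:
    gb = PARAMS * bits / 8 / 1e9
    fits = "fits" if gb <= VRAM_GB else "does not fit"
    print(f"{name}: {gb:.0f} GB of weights -> {fits} on one {VRAM_GB} GB GPU")
```

FP16 needs multi-GPU tensor parallelism just to load; FP4 leaves most of a single card free for KV cache and batching headroom.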

The counterargument matters, though. Quantization degrades gracefully on benchmarks but can fail unpredictably on tail distributions. Rare tokens, code generation edge cases, and multilingual outputs suffer disproportionately. A model that scores 95 percent on MMLU at 4-bit might hallucinate more on a niche medical query than the accuracy numbers suggest. Production systems need quantization-aware evaluation on their actual task distribution, not just generic benchmarks.

KV Cache: The Silent Memory Hog

Most optimization coverage focuses on model weights. But during inference, the KV (key-value) cache often consumes more memory than the model itself. For a 70B model serving a 32K-context request, the KV cache can exceed 40GB per sequence. Multiply that by concurrent users and you've hit your VRAM ceiling long before compute saturates.
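The cache size follows directly from the model's shape. A minimal sketch, assuming a Llama-70B-like configuration (80 layers, head dimension 128); exact figures vary by model:

```python
# Per-sequence KV cache size: 2 (K and V) x layers x kv_heads x
# head_dim x seq_len x bytes per element. The shapes below are an
# assumption for illustration, not any vendor's published numbers.

def kv_cache_gb(layers, kv_heads, head_dim, seq_len, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_elem / 1e9

mha = kv_cache_gb(layers=80, kv_heads=64, head_dim=128, seq_len=32_768)
gqa = kv_cache_gb(layers=80, kv_heads=8, head_dim=128, seq_len=32_768)
print(f"full MHA: {mha:.1f} GB/seq, GQA (8 kv heads): {gqa:.1f} GB/seq")
```

With full multi-head attention the cache dwarfs the quantized weights; GQA's head sharing (covered below) is what pulls it back under control.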

Three approaches dominate in 2026.

PagedAttention, introduced by vLLM, treats KV cache like virtual memory pages. Instead of pre-allocating contiguous blocks for each sequence's maximum possible length, it allocates small pages on demand. This eliminated the 60 to 80 percent memory waste that plagued earlier systems and enabled vLLM's signature throughput gains, up to 24x over HuggingFace Transformers.
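The mechanism is easy to sketch. This toy allocator is a simplified stand-in for vLLM's block tables (page size and bookkeeping are invented for illustration), but it captures the core idea: pages are granted only when a sequence actually crosses a boundary.

```python
# Toy sketch of paged KV allocation: sequences get fixed-size pages on
# demand instead of a contiguous max-length reservation.

PAGE_TOKENS = 16

class PagedAllocator:
    def __init__(self, total_pages):
        self.free = list(range(total_pages))
        self.tables = {}  # seq_id -> list of page ids

    def append_token(self, seq_id, pos):
        """Allocate a new page only when a sequence crosses a page boundary."""
        if pos % PAGE_TOKENS == 0:
            self.tables.setdefault(seq_id, []).append(self.free.pop())

    def release(self, seq_id):
        self.free.extend(self.tables.pop(seq_id, []))

alloc = PagedAllocator(total_pages=1024)
for pos in range(100):            # a 100-token sequence
    alloc.append_token("req-0", pos)
print(len(alloc.tables["req-0"]))  # 7 pages, not a max-length reservation
```

A static allocator sized for a 32K maximum would have reserved 2,048 pages for this 100-token request; here it holds seven, and releases them the moment it finishes.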

Grouped Query Attention (GQA) attacks the problem architecturally. By sharing key-value heads across multiple query heads, GQA reduces KV cache size by 4 to 8x with minimal accuracy loss. Every major model family now uses it: Llama 2 and 3, Mistral, Gemma 2, Qwen. It's the rare optimization that's both free at inference time (the model was trained with it) and substantial in impact.

KV cache quantization compresses the cache itself to FP4 or INT4, separate from model weight quantization. NVIDIA's NVFP4 KV cache cuts cache cost by 50 percent and doubles the effective context budget versus FP8. Representative methods like RazorAttention achieve memory reductions exceeding 70 percent.

These aren't independent choices. Production stacks layer all three. A vLLM deployment serving a GQA-trained model with FP4 KV cache quantization compounds the savings multiplicatively. This is where budgeting cache memory per attention head becomes directly actionable.

Batching and Scheduling: Where Throughput Actually Lives

Single-request latency gets the attention. Throughput pays the bills.

The Orca paper (OSDI '22) introduced continuous batching, processing new requests as soon as any sequence in the current batch finishes, rather than waiting for the entire batch to complete. The impact was 2 to 23x throughput improvement over static batching, with typical production gains around 3 to 8x for conversational workloads.
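A toy simulation makes the gain concrete. The request lengths below are made up, but the contrast is structural: static batching holds GPU slots hostage to the longest sequence in each batch, while continuous batching refills a slot the moment it frees.

```python
# Minimal simulation contrasting static and continuous batching.
# Each request needs a given number of decode steps; batch size is 4.

def static_steps(lengths, batch=4):
    steps = 0
    for i in range(0, len(lengths), batch):
        steps += max(lengths[i:i + batch])  # batch runs until its longest member ends
    return steps

def continuous_steps(lengths, batch=4):
    slots, pending, steps = [], list(lengths), 0
    while pending or slots:
        while pending and len(slots) < batch:  # refill freed slots immediately
            slots.append(pending.pop(0))
        steps += 1
        slots = [s - 1 for s in slots if s > 1]
    return steps

reqs = [100, 10, 10, 10, 100, 10, 10, 10]  # mixed long/short workload
print(static_steps(reqs), continuous_steps(reqs))  # 200 vs 110 decode steps
```

Even this crude model shows a 1.8x gain; the more skewed the length distribution, the wider the gap, which is why conversational workloads (short replies mixed with long generations) see the 3 to 8x figures above.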

Every major serving framework now implements this: vLLM, TGI, TensorRT-LLM (as "in-flight batching"), and SGLang. The differentiation has shifted to how well they handle scheduling under load.

SGLang's RadixAttention addresses a specific inefficiency: repeated prefixes. In RAG pipelines, multi-turn conversations, and few-shot prompting, many requests share identical prompt prefixes. RadixAttention stores computed KV caches in a radix tree and reuses them across requests. Cache hit rates reach 85 to 95 percent in few-shot scenarios versus 15 to 25 percent with PagedAttention alone. For workloads with shared prefixes, SGLang delivers up to 5x higher throughput than baseline systems.
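The idea reduces to longest-prefix matching over computed KV state. In this sketch a set of token-tuple prefixes stands in for SGLang's radix tree, and "KV state" is just the matched length; it is an illustration of the reuse logic, not the actual implementation.

```python
# Toy prefix cache in the spirit of RadixAttention: computed KV state
# is keyed by token prefixes, and a new request reuses the longest
# cached prefix instead of recomputing it from scratch.

class PrefixCache:
    def __init__(self):
        self.cached = set()  # token-tuple prefixes with KV already computed

    def match(self, tokens):
        """Return the length of the longest cached prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            if tuple(tokens[:n]) in self.cached:
                return n
        return 0

    def insert(self, tokens):
        for n in range(1, len(tokens) + 1):
            self.cached.add(tuple(tokens[:n]))

cache = PrefixCache()
system_prompt = list(range(500))             # 500 shared system-prompt tokens
cache.insert(system_prompt + [1001, 1002])   # first request, fully computed

hit = cache.match(system_prompt + [2001, 2002])  # second request, same prefix
print(f"reused {hit} of {len(system_prompt) + 2} prompt tokens")
```

For the second request, only the two novel tokens need prefill compute; everything the requests share comes out of the tree. That is the entire source of the 85 to 95 percent hit rates in few-shot workloads.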

TensorRT-LLM takes the opposite approach: raw speed through deep hardware integration. At 100 concurrent requests, it achieves 1,280ms p95 TTFT versus vLLM's 1,450ms, a 12 percent advantage. But the operational cost is steep. Model compilation is slow, model swaps require re-compilation, and debugging is harder. The 30 to 50 percent throughput advantage over vLLM exists but comes with massive operational complexity.

Speculative Decoding: Predicting the Future to Skip the Present

Standard autoregressive decoding generates one token at a time. Each token requires a full forward pass through the model. Speculative decoding uses a smaller "draft" model to propose multiple tokens at once, then verifies them in a single pass through the large model. Correct predictions skip expensive compute steps entirely.
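The accept/reject loop is simple to sketch. Both "models" below are stub functions (real systems compare token distributions and do the verification in one batched forward pass, not a loop), but the control flow is the essential part: accept proposals up to the first mismatch, and always emit at least one verified token.

```python
# Sketch of greedy speculative decoding with stub models. The draft
# proposes k tokens cheaply; the target verifies them together.
import random

def draft_model(ctx):   # fast but imperfect proposer
    return ctx[-1] + 1 if random.random() < 0.8 else ctx[-1]

def target_model(ctx):  # ground truth: always the next integer
    return ctx[-1] + 1

def speculative_step(ctx, k=4):
    proposals, draft_ctx = [], list(ctx)
    for _ in range(k):                     # k cheap draft calls
        t = draft_model(draft_ctx)
        proposals.append(t)
        draft_ctx.append(t)
    accepted = []
    for t in proposals:                    # stands in for one batched verify pass
        if t == target_model(ctx + accepted):
            accepted.append(t)
        else:
            break                          # first mismatch invalidates the rest
    accepted.append(target_model(ctx + accepted))  # always one verified token
    return ctx + accepted

random.seed(0)
ctx = [0]
for _ in range(5):
    ctx = speculative_step(ctx)
print(f"generated {len(ctx) - 1} tokens in 5 target-model rounds")
```

With an 80 percent per-token draft accuracy, each expensive round emits between one and five tokens; output is always correct because the target model has the final say. Lower the draft accuracy and the loop degrades toward ordinary decoding, which is the acceptance-rate constraint discussed below.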

The latest methods are getting aggressive. Saguaro, published at ICLR 2026, achieves up to 2x speedup over optimized speculative decoding baselines and 5x over standard autoregressive decoding. It does this by speculating on speculations, using multiple draft rounds before verification.

The constraint is acceptance rate. If the draft model's predictions diverge too far from the target model, rejected tokens waste compute. This makes speculative decoding highly workload-dependent. Code generation with predictable patterns benefits enormously. Creative writing with high entropy does not. Temperature and sampling strategy interact with acceptance rates in ways that aren't always intuitive.

There's also a tension with inference-time compute scaling. Chain-of-thought reasoning requires generating many tokens where each depends heavily on the previous context. Speculative decoding's assumption, that a cheap model can predict what an expensive model will say, breaks down when the expensive model is doing novel reasoning. The more you invest in test-time compute, the less speculative decoding helps.

The Serving Framework Decision

Choosing an inference stack in 2026 isn't a technical decision alone. It's an economic one.

vLLM is the default for good reason. Broadest model support, fastest iteration cycle, and a 1.7x speedup in its V1 architecture released in early 2025. If you need to swap models frequently or support diverse architectures, vLLM's flexibility justifies the throughput gap versus TensorRT-LLM.

TensorRT-LLM wins on raw throughput when you're serving a single model at scale and can absorb the compilation overhead. NVIDIA's own benchmarks show DGX B200 systems hitting over 30,000 tokens per second on DeepSeek-R1 (671B parameters) and 60,000 tok/s per GPU on dense models.

SGLang is the right choice for structured workloads with prefix sharing. Its RadixAttention gives it a structural advantage that no amount of kernel optimization in vLLM can match for the right workload patterns.

llama.cpp serves a different market entirely: single-user, edge, CPU-focused. It won't compete on throughput benchmarks, but it will run a 7B model on a MacBook with no GPU at useful speeds.

For MoE models, the picture is more complicated. Expert routing adds scheduling complexity, and load-balancing problems that plague training also manifest during inference as uneven expert utilization. NVIDIA's Blackwell-specific MoE optimizations deliver 4x inference throughput gains over H200 by improving all-to-all communication between experts, but these gains are hardware-locked.

The Cost Math: Cloud API vs. Self-Hosted

The inference optimization discussion is academic if you're using a cloud API. OpenAI, Anthropic, and others have already applied most of these techniques on your behalf. The question is whether their margin exceeds your optimization budget.

The numbers have shifted dramatically. GPT-4-class inference cost roughly $20 per million tokens in early 2023. Equivalent performance costs $0.40 per million tokens in early 2026, a roughly 50x cumulative decline over three years. Frontier model APIs have followed: Claude Opus 4.5 dropped from $15 to $5 per million input tokens. GPT-5 mini sits at $0.25 per million input tokens.

Self-hosting only makes economic sense at scale. Running Llama 405B on 8x H100s via CoreWeave costs roughly $5.47 per million output tokens, more expensive than Together AI's API for the same model at $3.50/M. The breakeven requires sustained 50 percent or higher GPU utilization, which most organizations don't achieve.
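The breakeven is a one-line formula worth running against your own numbers. The sketch below uses the figures above (8x H100, roughly $3/hour per GPU, $3.50/M API price); the cluster throughput of 2,500 tok/s is a placeholder assumption, not a benchmark.

```python
# Rough self-hosting breakeven versus a per-token API price.

def self_host_cost_per_mtok(gpu_hourly, n_gpus, cluster_tok_s, utilization):
    """Dollars per million output tokens at a given average utilization."""
    tokens_per_hour = cluster_tok_s * 3600 * utilization
    return gpu_hourly * n_gpus / (tokens_per_hour / 1e6)

API_PRICE = 3.50  # $/Mtok, from the Together AI example above
for util in (0.1, 0.5, 0.9):
    cost = self_host_cost_per_mtok(gpu_hourly=3.0, n_gpus=8,
                                   cluster_tok_s=2500, utilization=util)
    verdict = "cheaper than API" if cost < API_PRICE else "pricier than API"
    print(f"{util:.0%} utilization: ${cost:.2f}/Mtok ({verdict})")
```

The shape of the result is the point: hardware cost is fixed per hour, so cost per token is inversely proportional to utilization, and idle GPUs silently multiply your effective price.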

Cloud H100 rental has stabilized at $2.85 to $3.50 per hour after a 64 to 75 percent decline from peaks. Reserved capacity pushes this to $1.85/hour. But the real cost of running AI agents in production includes engineering time, monitoring infrastructure, and the coordination overhead of managing GPU clusters. Most teams underestimate these by 3 to 5x.

Long-Context Inference: The Emerging Front

Serving million-token contexts introduces a qualitatively different problem. Attention scales quadratically with sequence length, so a 1M-token prompt costs roughly a million times more than a 1K-token prompt in total attention compute, and even each individual new token is 1,000x more expensive to attend.
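The two ratios are easy to conflate, so it is worth writing them out: per-token attention cost grows linearly with context, while whole-prompt prefill grows with its square.

```python
# Attention cost scaling: total prefill compute grows with the square
# of prompt length; attending one new token grows linearly.

short, long = 1_000, 1_000_000
total_ratio = (long / short) ** 2   # full-prompt attention compute
per_token_ratio = long / short      # cost of attending one new token
print(f"total: {total_ratio:,.0f}x, per token: {per_token_ratio:,.0f}x")
```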

Ring Attention distributes long sequences across multiple devices, overlapping communication of KV blocks with blockwise attention computation. It enables sequences over 100 million tokens without approximation, scaling linearly with device count. But the communication overhead between devices means practical throughput per token degrades as context grows.

FlashAttention-4 tackles the single-GPU case with software-emulated exponentials and conditional softmax rescaling, reaching 1,613 TFLOPs/s (71 percent hardware utilization) on B200. These kernel-level optimizations compound with architectural choices like GQA and KV cache compression to make 128K-context inference practical on a single node.

The contrarian take: most production workloads don't need million-token contexts. RAG pipelines retrieve a few thousand tokens of relevant context. Multi-turn conversations rarely exceed 32K tokens of useful history. The engineering effort to serve ultra-long contexts may be premature optimization for all but a few document-analysis and code-repository workloads.

What Actually Matters

The gap between knowing these techniques exist and deploying them effectively is where most teams stall. A common failure mode: optimizing the serving stack while ignoring the request pattern. A team running a RAG pipeline with 90 percent prefix overlap should reach for SGLang before touching quantization knobs. Another team serving diverse, one-shot queries would get more from aggressive quantization and continuous batching than from prefix caching that rarely hits.

The inference optimization stack in 2026 isn't one technique. It's a compound effect. GQA reduces KV cache by 4 to 8x. Quantization to FP4 cuts model memory by 4x. PagedAttention eliminates 60 to 80 percent of cache waste. Continuous batching delivers 3 to 8x throughput gains. Speculative decoding adds another 2 to 5x for suitable workloads. These multiply, not add.
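A rough sketch of that compounding, taking conservative ends of the ranges above and treating each factor as independent (a simplification: the memory factors act on different pools, so the product is an upper bound on headroom, not a guarantee):

```python
# Compounding the stack's gains. Paged allocation recovering 60% of
# cache waste is modeled as a 1/(1-0.6) = 2.5x effective-capacity gain.
from math import prod

memory_factors = {"GQA kv-cache": 4, "FP4 weights": 4, "paging": 1 / (1 - 0.6)}
throughput_factors = {"continuous batching": 3, "speculative decoding": 2}

print(f"memory headroom: ~{prod(memory_factors.values()):.0f}x")
print(f"throughput: ~{prod(throughput_factors.values()):.0f}x")
```

Even at the conservative ends, the product dwarfs what any single technique delivers, which is why serving stacks ship them together rather than as alternatives.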

But the most important optimization is the one nobody talks about: not running inference at all. Caching responses, batching similar queries, and using smaller models for easy requests before escalating to expensive ones. The cheapest token is the one you never generate.

Inference will consume 80 to 90 percent of a production AI system's lifetime cost. The companies that win won't be the ones with the best models. They'll be the ones who figured out how to serve them without going broke.

