Inference Optimization: From 10x Cost to 10x Speed
In late 2022, running a query against GPT-3-class performance cost roughly $20 per million tokens. By March 2026, multiple models exceed that same benchmark at $0.06 per million tokens or less. That's a price collapse of more than 300x in under four years, faster than PC compute dropped during the microprocessor revolution and faster than bandwidth fell during the dot-com boom. Most of the improvement didn't come from cheaper hardware. It came from inference optimization: a set of techniques that squeeze more useful work out of the same silicon.
This guide covers the six techniques that account for the majority of those gains, when each one applies, the serving engines that implement them, and how to stack them for production workloads without turning your inference pipeline into an unmaintainable mess.
Why Inference Is Where the Money Goes
Training a model is a one-time capital expense. Inference is an ongoing operational cost that scales with every user, every query, every agent loop. For most production AI systems, inference accounts for 60-90% of total compute spend. As we covered in the true cost of running AI agents in production, raw API costs are typically only 30-50% of the total, with retries, orchestration overhead, and monitoring eating the rest. Optimizing inference attacks the largest line item on both sides of that equation: fewer tokens wasted means lower API bills and fewer retries.
The optimization surface breaks into two dimensions. Latency optimization reduces the time a user waits for a response, measured in time-to-first-token (TTFT) and tokens per second. Throughput optimization maximizes the number of requests a system handles per unit of hardware, measured in requests per second per GPU. Most techniques improve both, but some force a tradeoff. Knowing which dimension matters for your workload determines which optimizations to prioritize.
Quantization: The Highest-ROI Single Change
Quantization reduces the numerical precision of model weights from 16-bit floating point (FP16 or BF16) down to 8-bit, 4-bit, or even lower representations. The payoff is simple: lower precision means less memory per parameter, which means fitting larger models on fewer GPUs, processing more concurrent requests, and moving data faster between memory and compute units. The cost is a potential loss of output quality, which has to be measured rather than assumed.
INT8 quantization halves memory requirements with under 1% quality degradation on most benchmarks. INT4 quantization using methods like AWQ or GPTQ reduces memory by 75%, though quality loss becomes measurable on reasoning-heavy tasks. FP8, supported natively on NVIDIA H100 and newer hardware, has emerged as the sweet spot for production: near-FP16 quality with roughly half the memory footprint and significantly higher throughput.
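To make the precision tradeoff concrete, here is a minimal sketch of symmetric per-tensor INT8 quantization in plain Python. It is not how AWQ, GPTQ, or FP8 calibration work internally (those choose scales far more carefully, per channel or per group); the function names and the sample weights are illustrative assumptions.

```python
from typing import List, Tuple


def quantize_int8(weights: List[float]) -> Tuple[List[int], float]:
    """Map floats to integers in [-127, 127] using one shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize_int8(quantized: List[int], scale: float) -> List[float]:
    """Recover approximate float values from the stored integers."""
    return [q * scale for q in quantized]


# Hypothetical weights; real tensors hold millions of values.
weights = [0.02, -0.51, 0.33, 1.27, -0.08]
quantized, scale = quantize_int8(weights)
restored = dequantize_int8(quantized, scale)

# Rounding to the nearest integer bounds the error by half a quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(quantized, f"scale={scale:.4f}", f"max_error={max_error:.5f}")
```

Each weight now needs one byte instead of two, and the worst-case rounding error is half the scale. A single outlier weight inflates the scale for the whole tensor, which is the problem that activation-aware and group-wise methods are designed to reduce.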
The practical impact is substantial. A Llama 3.1-70B model in FP16 requires approximately 140GB of GPU memory for the weights alone, meaning at least two A100-80GB GPUs. Quantized to INT4 with AWQ, the weights shrink to roughly 35GB, so the same model fits on a single 80GB GPU with memory to spare for KV cache and batching overhead. That's not a marginal improvement. It's the difference between a $3,600/month two-GPU setup and a $1,800/month single-GPU deployment.
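The memory figures above follow from simple arithmetic, sketched here. The helper name is hypothetical, and the calculation counts weights only: it ignores the small overhead of quantization scales as well as KV cache and activations.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


PARAMS_70B = 70e9  # nominal parameter count for a 70B model

for label, bits in [("FP16/BF16", 16), ("FP8/INT8", 8), ("INT4", 4)]:
    print(f"{label:>9}: {weight_memory_gb(PARAMS_70B, bits):6.1f} GB")
```

At 16 bits the weights come to 140 GB, at 8 bits 70 GB, and at 4 bits 35 GB, which is why the INT4 model leaves room on an 80GB card.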
NVIDIA's Model Optimizer library unifies quantization across frameworks, letting you apply AWQ, GPTQ, SmoothQuant, or FP8 calibration and export directly to TensorRT-LLM or vLLM. The key decision: use FP8 if your hardware supports it and you need minimal quality loss, INT4-AWQ if you're memory-constrained and can tolerate slight degradation on complex reasoning, and INT8 as a safe middle ground.
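The decision rule in the paragraph above can be written down as a few lines of Python. This is only a sketch of that rule of thumb: the function and parameter names are hypothetical, and it is not part of Model Optimizer or any other library.

```python
def choose_quant_format(
    hardware_supports_fp8: bool,
    memory_constrained: bool,
    needs_minimal_quality_loss: bool,
) -> str:
    """Encode the rule of thumb: FP8 where supported, INT4-AWQ when squeezed, else INT8."""
    if hardware_supports_fp8 and needs_minimal_quality_loss:
        return "FP8"
    if memory_constrained and not needs_minimal_quality_loss:
        return "INT4-AWQ"
    return "INT8"


# An H100-class deployment that cannot afford quality regressions.
print(choose_quant_format(True, False, True))
# An older GPU with tight memory and some tolerance for degradation.
print(choose_quant_format(False, True, False))
```

Whatever the rule suggests, the result still has to be validated against your own evaluation suite.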
KV Cache Optimization: The Memory Bottleneck Nobody Budgets For
Every transformer-based model stores key-value pairs from previously processed tokens so it doesn't recompute attention from scratch on each step. This KV cache grows linearly with sequence length and batch size, and in long-context scenarios it can consume up to 70% of total GPU memory during inference. For a 70B model processing a 128K-token context, the KV cache for a single sequence can exceed 40GB.
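The 40GB figure can be reproduced with a back-of-the-envelope formula. The sketch below assumes a Llama-3.1-70B-style configuration (80 layers, 8 key-value heads from grouped-query attention, head dimension 128) and a 16-bit cache; those numbers are assumptions made for illustration, not values stated in this article.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes = FP16/BF16 cache
) -> int:
    """Total KV cache size; the leading 2 covers one key and one value per position."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_value


# Assumed 70B-class configuration with a 128K-token context.
size = kv_cache_bytes(num_layers=80, num_kv_heads=8, head_dim=128, seq_len=128 * 1024)
print(f"{size / 1024**3:.0f} GiB ({size / 1e9:.1f} GB) for one sequence")
```

Under these assumptions one sequence needs 40 GiB (about 42.9 GB). Every term is a plain multiplier, so doubling the batch size or the context doubles the cache, and halving `bytes_per_value` halves it.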
Three optimization strategies target this bottleneck.
KV cache quantization compresses cached values to lower precision. NVIDIA's NVFP4 KV cache reduces cache memory by 4x compared to FP16, enabling longer contexts and larger batches on the same hardware. The quality impact is minimal because cached attention values are less sensitive to precision loss than model weights.
Selective eviction removes less important cached entries rather than keeping everything. ChunkKV treats semantic chunks rather than isolated tokens as the compression unit, preserving complete linguistic structures while reducing cache size. Entropy-guided strategies from recent research allocate larger cache budgets to higher-entropy attention layers and smaller budgets to lower-entropy ones, achieving better quality-per-byte than uniform compression.
Prefix caching stores and reuses KV cache entries for shared prefixes across requests. If 100 users send messages to the same chatbot with the same system prompt, the KV cache for that system prompt gets computed once and shared across all 100 requests. SGLang's RadixAttention does this automatically. vLLM offers the same capability as automatic prefix caching, which may need to be enabled explicitly depending on the version you run. In production chatbot and RAG workloads where system prompts and retrieved context overlap heavily, prefix caching alone can cut prefill latency by up to 10x.
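A toy model shows why shared prefixes pay off. The sketch below tracks which block-aligned prefixes have already been prefilled and reports how many prompt tokens a new request can skip. It is a deliberate simplification: the class and its method are hypothetical, no actual KV tensors are stored, and production engines use hash chains or a radix tree rather than whole-prefix tuples.

```python
from typing import List, Set, Tuple


class PrefixCache:
    """Remembers block-aligned token prefixes whose KV entries already exist."""

    def __init__(self, block_size: int = 4) -> None:
        self.block_size = block_size
        self.cached_prefixes: Set[Tuple[int, ...]] = set()

    def lookup_and_insert(self, tokens: List[int]) -> int:
        """Return how many leading tokens were reused, then cache this prompt's blocks."""
        reused = 0
        still_matching = True
        full_len = len(tokens) - len(tokens) % self.block_size
        for end in range(self.block_size, full_len + 1, self.block_size):
            key = tuple(tokens[:end])
            if still_matching and key in self.cached_prefixes:
                reused = end  # this block's KV entries can be shared
            else:
                still_matching = False  # everything from here on must be prefilled
                self.cached_prefixes.add(key)
        return reused


cache = PrefixCache(block_size=4)
system_prompt = list(range(12))          # hypothetical 12-token system prompt
first = system_prompt + [100, 101, 102, 103, 104]
second = system_prompt + [200, 201]

print(cache.lookup_and_insert(first))    # cold cache: nothing to reuse
print(cache.lookup_and_insert(second))   # the shared system prompt is reused
```

The second request reuses all 12 system-prompt tokens and only prefills its own two. Reuse is block-aligned, so a shared prefix shorter than one block yields no savings in this sketch.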
Speculative Decoding: Trading Cheap Compute for Expensive Latency
Standard autoregressive decoding generates one token per forward pass through the full model. Speculative decoding uses a small draft model (typically 1-7B parameters) to generate multiple candidate tokens cheaply, then verifies all candidates in a single parallel forward pass through the large target model. Correct predictions yield multiple accepted tokens for the compute cost of one target-model step.
The technique was introduced by Google in 2022 and has since moved from research curiosity to production standard. It is now built into vLLM, SGLang, and TensorRT-LLM, and deployed at scale in Google's AI Overviews.
Three variants dominate production use:
Draft-model speculative decoding pairs a small model with the target model. Acceptance rates of 70-90% on domain-specific tasks yield 2-3x speedup on generation-heavy workloads. The draft model can be a distilled version of the target or any fast model with a compatible vocabulary.
Medusa adds multiple decoding heads to the target model itself, eliminating the need for a separate draft model. Each head predicts a different future position, creating a tree of candidates verified in a single pass. Reported speedups range from 2.2-3.6x depending on the model and task.
EAGLE and its successor EAGLE-3 reuse intermediate features from the target model's own layers to predict future tokens, achieving higher acceptance rates than token-level prediction. EAGLE-3 introduced multi-layer fusion with a training-time test architecture, and newer methods like Variational Speculative Decoding have pushed acceptance length 9.6% beyond EAGLE-3 on various benchmarks.
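Acceptance rate translates into speedup through a simple expectation. The sketch below uses the standard simplifying assumption that each drafted token is accepted independently with the same probability; real acceptance is correlated across positions, so treat the result as a rough guide. The function name is hypothetical.

```python
def expected_tokens_per_target_pass(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens produced per target-model forward pass.

    Accepted draft tokens plus the one token the target always contributes,
    assuming each draft token is accepted independently.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if acceptance_rate == 1.0:
        return float(draft_len + 1)
    return (1 - acceptance_rate ** (draft_len + 1)) / (1 - acceptance_rate)


for rate in (0.7, 0.8, 0.9):
    tokens = expected_tokens_per_target_pass(rate, draft_len=4)
    print(f"acceptance {rate:.0%}: {tokens:.2f} tokens per target pass")
```

This counts tokens per target pass, not wall-clock speedup. The time spent running the draft model has to be subtracted, which is why measured gains land below the figure this formula gives.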
The crucial detail: speculative decoding is mathematically lossless when verification uses the standard accept-or-resample rule. That rule guarantees that the output distribution matches what the target model would have produced through standard decoding. Variants that relax the acceptance criterion to accept more tokens give up that guarantee. With exact verification you get speed without sacrificing quality, though throughput per GPU decreases slightly because the draft model consumes some compute.
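The draft-then-verify loop is easiest to see in the greedy case, where "lossless" means the output is token-for-token identical to ordinary greedy decoding. The sketch below uses two toy deterministic functions as stand-ins for the models; both are invented for illustration. With sampling, verification uses a probabilistic accept-or-resample rule instead of the equality check shown here.

```python
from typing import Callable, List, Tuple

Model = Callable[[List[int]], int]  # maps a context to its next token


def target_next(ctx: List[int]) -> int:
    """Toy stand-in for the large model's greedy next-token choice."""
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy draft model: agrees with the target except at every fifth position."""
    token = target_next(ctx)
    return (token + 1) % 50 if len(ctx) % 5 == 0 else token


def greedy_decode(model: Model, prompt: List[int], n: int) -> List[int]:
    out = list(prompt)
    for _ in range(n):
        out.append(model(out))
    return out[len(prompt):]


def speculative_decode(
    target: Model, draft: Model, prompt: List[int], n: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (generated tokens, number of target verification passes)."""
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n:
        # 1. The draft model proposes k tokens autoregressively.
        proposal, ctx = [], list(out)
        for _ in range(k):
            token = draft(ctx)
            proposal.append(token)
            ctx.append(token)
        # 2. The target checks every position. A real engine does this in one
        #    batched forward pass; here it is a loop counted as a single pass.
        target_passes += 1
        ctx = list(out)
        for token in proposal:
            expected = target(ctx)
            if token != expected:
                ctx.append(expected)  # first mismatch: keep the target's token, drop the rest
                break
            ctx.append(token)
        else:
            ctx.append(target(ctx))  # all k accepted: the pass also yields a bonus token
        out = ctx
    return out[len(prompt):][:n], target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(target_next, draft_next, prompt, n=20)
print(tokens == greedy_decode(target_next, prompt, 20), passes)
```

Every token appended is the one the target would have chosen for that context, so the output cannot diverge from plain greedy decoding. The draft only changes how many target passes are needed to produce it.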
Speculative decoding and quantization are multiplicative, not additive. AMD's MI300X benchmarks showed 3.6x total improvement when combining FP8 quantization with speculative decoding on Llama 3.1-405B.
Continuous Batching and Scheduling
Static batching collects a fixed number of requests, processes them together, and returns results when all finish. The problem: when a 20-token response and a 2,000-token response share a batch, the short request finishes early, but its slot sits idle and its result waits until the long one completes. GPU utilization drops. Latency suffers.
Continuous batching (also called iteration-level scheduling) inserts new requests into the batch as soon as a slot opens, without waiting for the entire batch to complete. vLLM popularized it by pairing it with PagedAttention, a memory manager that allocates KV cache in non-contiguous blocks like virtual memory pages, eliminating the memory fragmentation that made dynamic batching impractical before 2023.
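The virtual-memory analogy can be sketched as a block table per request. The class below is a hypothetical toy, not vLLM's implementation: it only does the bookkeeping of which fixed-size physical blocks belong to which request, with no tensors involved.

```python
from typing import Dict, List


class PagedKVAllocator:
    """Hands out fixed-size KV blocks on demand and recycles them when requests finish."""

    def __init__(self, num_blocks: int, block_size: int) -> None:
        self.block_size = block_size
        self.free_blocks: List[int] = list(range(num_blocks))
        self.block_tables: Dict[str, List[int]] = {}  # request id -> physical block ids
        self.lengths: Dict[str, int] = {}             # request id -> tokens stored

    def append_token(self, request_id: str) -> None:
        """Reserve space for one more token, allocating a new block only when needed."""
        length = self.lengths.get(request_id, 0)
        table = self.block_tables.setdefault(request_id, [])
        if length % self.block_size == 0:  # no block yet, or the last one is full
            if not self.free_blocks:
                raise MemoryError("no free KV blocks: request must wait or be preempted")
            table.append(self.free_blocks.pop())
        self.lengths[request_id] = length + 1

    def free(self, request_id: str) -> None:
        """Return a finished request's blocks to the pool for immediate reuse."""
        self.free_blocks.extend(self.block_tables.pop(request_id, []))
        self.lengths.pop(request_id, None)


allocator = PagedKVAllocator(num_blocks=4, block_size=2)
for _ in range(3):
    allocator.append_token("a")  # 3 tokens -> 2 blocks
for _ in range(2):
    allocator.append_token("b")  # 2 tokens -> 1 block
print(allocator.block_tables, "free:", len(allocator.free_blocks))
allocator.free("a")
print("free after 'a' finishes:", len(allocator.free_blocks))
```

Because a request never needs a contiguous region, the only wasted space is the unused tail of its last block, and blocks freed by a finished request can be handed straight to whichever request joins the batch next.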
The impact on throughput is dramatic. Continuous batching with PagedAttention achieves 2-4x higher throughput than static batching on the same hardware, with lower average latency for short requests. Combined with prefix caching, which lets new requests skip prefill for shared context, the effective throughput improvement reaches 5-8x for workloads with substantial context overlap.
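A small simulation shows where the gain comes from. It counts decode iterations only, one token per running request per iteration, and ignores prefill and memory limits. The function names and the workload are made up for illustration.

```python
from typing import List


def static_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """Each batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths: List[int], batch_size: int) -> int:
    """A waiting request takes over a slot the moment one frees up."""
    waiting = list(output_lengths)  # FIFO queue of tokens still to generate
    running: List[int] = []
    steps = 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop(0))
        steps += 1  # one decode iteration emits one token per running request
        running = [left - 1 for left in running if left > 1]
    return steps


# Hypothetical workload: short and long responses alternate, two slots available.
lengths = [20, 2000, 20, 2000]
static = static_batching_steps(lengths, batch_size=2)
continuous = continuous_batching_steps(lengths, batch_size=2)
print(static, continuous, f"{static / continuous:.2f}x")
```

With this workload static batching needs 4,000 iterations and continuous batching 2,040, roughly a 2x difference. The gap widens as response lengths become more uneven, and it disappears when every request generates the same number of tokens.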
Scheduling strategies add another layer. Chunked prefill splits long prompt processing into smaller chunks interleaved with decode steps, preventing a single long-context request from blocking all other requests. Priority scheduling ensures latency-sensitive requests get processed first. These aren't exotic research techniques. They're configuration options in vLLM and SGLang that most deployments leave at defaults.
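Chunked prefill boils down to a per-iteration token budget shared between decoding requests and a slice of the long prompt. The sketch below is a simplified model of that idea with hypothetical names. Real schedulers juggle several prompts at once and expose the budget as a configuration value.

```python
from typing import List, Tuple


def chunked_prefill_schedule(
    prompt_len: int, num_decoding: int, token_budget: int
) -> List[Tuple[int, int]]:
    """Return (prefill_tokens, decode_tokens) for each iteration until the prompt is done.

    Decoding requests are served first at one token each; whatever budget
    remains in the iteration goes to the next chunk of the long prompt.
    """
    if token_budget <= num_decoding:
        raise ValueError("token budget leaves no room for prefill")
    schedule = []
    remaining = prompt_len
    while remaining > 0:
        chunk = min(remaining, token_budget - num_decoding)
        schedule.append((chunk, num_decoding))
        remaining -= chunk
    return schedule


# Hypothetical case: a 10,000-token prompt arrives while 8 requests are decoding.
for prefill, decode in chunked_prefill_schedule(10_000, num_decoding=8, token_budget=2048):
    print(f"prefill {prefill:>5} tokens, decode {decode} tokens")
```

Without chunking, the eight decoding requests would stall for the whole 10,000-token prefill. With it, they emit a token on each of the five iterations, at the price of a somewhat later first token for the long request.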
Choosing an Inference Engine
Four engines cover the production spectrum. The choice depends on your model stability, hardware, and whether you're optimizing for flexibility or peak performance.
vLLM is the right default for most teams. It supports the widest range of models, requires no compilation step, and provides a fast path to production. Throughput on H100 GPUs runs approximately 12,500 tokens per second for typical workloads. Start here unless you have a specific reason not to.
SGLang matches or beats vLLM on throughput (approximately 16,200 tokens/second on H100) and provides a roughly 10-20% advantage on multi-turn workloads with shared context, thanks to its RadixAttention automatic prefix caching. If your workload involves chatbots, RAG pipelines, or multi-turn agent conversations, SGLang's architecture is purpose-built for it.
TensorRT-LLM delivers the highest raw throughput, outperforming vLLM by 30-50% in high-concurrency environments, with an even larger advantage on B200 GPUs. The cost: a compilation step for each model configuration (about 28 minutes in the cited benchmark) and significantly more operational complexity. Use it when your model is stable, you're running at scale, and you need every token per second.
llama.cpp / Ollama serve the local and edge deployment niche. They run quantized models on consumer hardware with no GPU dependency, making them suitable for development, testing, and privacy-sensitive deployments where data can't leave the device.
Stacking Optimizations Without Breaking Things
The techniques above interact. The order you apply them matters, and some combinations create unexpected problems.
Start with quantization. It's the simplest change with the largest standalone impact. Apply FP8 or INT4-AWQ to your model, benchmark quality on your specific evaluation suite (not general benchmarks), and establish your cost and latency baselines. If quality holds, you've just cut your model's memory footprint roughly in half with FP8, or by about three quarters with INT4.
Add continuous batching next. Switch from static batching to continuous batching with PagedAttention. This is a serving engine choice, not a model change. Moving from a naive deployment to vLLM or SGLang with continuous batching typically doubles throughput.
Enable prefix caching if your workload has shared context. System prompts, retrieval-augmented context, and multi-turn conversation histories all benefit. The hit rate determines the payoff: workloads with 60%+ prefix overlap see substantial latency and throughput improvements.
Add speculative decoding last. It requires the most tuning (draft model selection, tree width, acceptance thresholds) and its benefits are most visible after the other optimizations have reduced baseline latency. Since it's lossless, the risk is low, but the operational complexity of maintaining a draft model alongside your target model is real.
Combining FP8 quantization, FlashAttention (a memory-efficient attention kernel), continuous batching, and speculative decoding on H100 hardware delivers 5-8x better cost efficiency than naive FP16 inference with static batching. For teams currently running unoptimized deployments, that's the difference between a $20,000/month GPU bill and a $3,000/month one for the same workload.
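The dollar figures and the 5-8x range can be checked against each other with two lines of arithmetic. The helper names are hypothetical and the inputs are the numbers quoted above.

```python
def implied_efficiency_gain(cost_before: float, cost_after: float) -> float:
    """How many times cheaper the same workload became."""
    return cost_before / cost_after


def cost_after_gain(cost_before: float, efficiency_gain: float) -> float:
    """Monthly bill once the same workload runs with the given efficiency gain."""
    return cost_before / efficiency_gain


gain = implied_efficiency_gain(20_000, 3_000)
print(f"$20,000 -> $3,000 implies a {gain:.2f}x gain")
print(f"5x gain: ${cost_after_gain(20_000, 5):,.0f}/month")
print(f"8x gain: ${cost_after_gain(20_000, 8):,.0f}/month")
```

The $3,000 figure corresponds to a gain of about 6.7x, in the middle of the quoted range. The two ends of the range would put the bill at $4,000 and $2,500.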
When to Optimize Versus When to Switch Models
Not every optimization problem is an inference problem. Sometimes the right move is a different model.
If your bottleneck is latency on simple tasks, try a smaller model before tuning a large one. A well-optimized 8B model on a single GPU often outperforms a quantized 70B model across two GPUs on throughput, latency, and cost for tasks that don't require deep reasoning. The model selection guide covers when smaller models win.
If your bottleneck is cost at high volume, the API-versus-self-hosting calculation matters more than any serving optimization. At 2 million tokens per day, self-hosting crosses the break-even point against most API providers. At 10 million tokens per day, the savings are 5-10x. The production cost guide walks through the full calculation.
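The break-even point depends almost entirely on the API price you are comparing against, which a short calculation makes visible. In the sketch below, the $1,800/month GPU cost is the single-GPU figure from earlier in this article. The API prices are placeholders chosen for illustration, and the model ignores engineering time, redundancy, and whether one GPU can actually serve the volume.

```python
def break_even_tokens_per_day(
    gpu_monthly_cost: float,
    api_price_per_million_tokens: float,
    days_per_month: int = 30,
) -> float:
    """Daily token volume at which the API bill equals the fixed GPU cost."""
    daily_gpu_cost = gpu_monthly_cost / days_per_month
    return daily_gpu_cost / api_price_per_million_tokens * 1e6


# Placeholder API prices in dollars per million tokens.
for api_price in (30.0, 3.0, 0.30):
    volume = break_even_tokens_per_day(1_800, api_price)
    print(f"${api_price:>5.2f}/M tokens -> break-even at {volume / 1e6:,.0f}M tokens/day")
```

A tenfold drop in API price moves the break-even volume up tenfold, so redo the calculation with your provider's current pricing before committing to hardware.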
If your bottleneck is quality, inference optimization won't help. You need a better model, better prompts, or retrieval-augmented generation. Optimization makes good models cheaper and faster. It doesn't make bad models good. The work on inference-time compute scaling shows that you can trade latency for quality through reasoning, but that's a fundamentally different tradeoff than the efficiency techniques covered here.
The Measurement Discipline
Optimization without measurement is guesswork. Track four metrics from the start.
Time-to-first-token (TTFT) captures the user's perceived responsiveness. For interactive applications, TTFT under 500ms is the target. Prefill optimization, prefix caching, and chunked prefill all target this metric.
Tokens per second (TPS) measures generation speed. For streaming applications, 30-50 TPS comfortably outpaces human reading speed, so faster generation stops being perceptible to the reader. For agent loops and batch processing, higher is better without a practical ceiling.
Throughput (requests/second/GPU) measures hardware efficiency. This is the metric your finance team cares about. Continuous batching, quantization, and engine selection all target this.
Quality on your specific evaluation suite ensures optimizations haven't degraded the outputs that matter. General benchmarks are insufficient. If your agent writes SQL queries, measure SQL correctness. If it summarizes legal documents, measure summary accuracy. Quantization and lossy KV cache compression both carry quality risks that only domain-specific evaluation catches.
Run these measurements before and after every optimization change. The technique that benchmarks show delivering 3x improvement might deliver 1.2x on your specific workload, or 5x. The only way to know is to measure.
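The two latency metrics can be computed from nothing more than the arrival time of each streamed token. The sketch below assumes you already record those timestamps on the client; the function name and the sample timings are hypothetical.

```python
from typing import Dict, List


def latency_metrics(request_sent_at: float, token_times: List[float]) -> Dict[str, float]:
    """Compute TTFT and decode speed from per-token arrival timestamps (in seconds)."""
    if not token_times:
        raise ValueError("no tokens were received")
    ttft = token_times[0] - request_sent_at
    decode_time = token_times[-1] - token_times[0]
    # Tokens after the first, divided by the time spent generating them,
    # so that prefill time does not distort the decode-speed figure.
    tps = (len(token_times) - 1) / decode_time if decode_time > 0 else float("inf")
    return {"ttft_s": ttft, "tokens_per_second": tps}


# Hypothetical trace: first token after 400 ms, then one token every 25 ms.
metrics = latency_metrics(0.0, [0.400, 0.425, 0.450, 0.475, 0.500])
print(f"TTFT {metrics['ttft_s'] * 1000:.0f} ms, {metrics['tokens_per_second']:.0f} tokens/s")
```

Collect these per request and compare percentiles rather than averages before and after each change, since a handful of slow long-context requests can hide behind a healthy mean.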
Sources
Research Papers:
- AWQ: Activation-aware Weight Quantization -- Lin et al. (2023)
- GPTQ: Accurate Post-Training Quantization for Generative Pre-Trained Transformers -- Frantar et al. (2022)
- Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads -- Cai et al. (2024)
- EAGLE-3: Scaling up Inference Acceleration via Training-Time Test -- Li et al. (2025)
- Variational Speculative Decoding -- (2026)
- ChunkKV: Semantic-Preserving KV Cache Compression -- (2025)
- Entropy-Guided KV Caching for Efficient LLM Inference -- (2025)
- Multi-Tier Dynamic Storage of KV Cache -- (2025)
Industry / Benchmarks:
- LLMflation: LLM Inference Cost Is Going Down Fast -- a16z (2025)
- vLLM vs TensorRT-LLM vs SGLang: H100 Benchmarks -- Spheron (2026)
- Optimizing Inference with NVFP4 KV Cache -- NVIDIA (2026)
- NVIDIA Model Optimizer -- NVIDIA (2025)
- How to Cut LLM Inference Costs with KV Caching -- Pure Storage (2025)
- Looking Back at Speculative Decoding -- Google Research (2025)
Commentary:
- LLM Inference: Prefill, Decode, KV Cache & Cost Guide -- Morph (2026)
- Speculative Decoding: 2-3x Faster LLM Inference -- PremAI (2026)
- SGLang vs vLLM: Inference Engine Comparison -- Particula (2026)
Related Swarm Signal Coverage:
- Inference-Time Scaling: Why AI Models Now Think for Minutes Before Answering
- The True Cost of Running AI Agents in Production
- Model Selection Guide: How to Pick the Right AI Model
- MoE Models Run 405B Parameters at 13B Cost
- The Budget Problem: Why AI Agents Are Learning to Be Cheap