
In June 2017, eight researchers at Google published "Attention Is All You Need." The paper introduced the transformer, an architecture that replaced recurrence with self-attention and processed all tokens in parallel. Their base model trained in 12 hours on 8 NVIDIA P100 GPUs and set a new state of the art on machine translation. Eight years later, every frontier AI model, every chatbot, every coding assistant, every image generator runs on a descendant of that architecture. Understanding how transformers work isn't academic. It's the minimum viable knowledge for anyone building with AI.

What the Transformer Replaced

Before transformers, sequence models were recurrent. RNNs and LSTMs processed tokens one at a time, left to right. Each token had to wait for the previous token to be processed. This created two problems that capped their usefulness.

First, training couldn't be parallelized across the sequence. GPUs are designed for massive parallel computation, but recurrent models forced sequential processing. Training was slow.

Second, long-range dependencies degraded. Information from early tokens had to survive through every intermediate step to influence later tokens. The vanishing gradient problem meant that by the time a model processed the 500th token, it had largely forgotten the 10th. LSTMs partially solved this with gating mechanisms, but the fundamental bottleneck remained.

The transformer dispensed with recurrence entirely. Every token attends to every other token simultaneously. The base model, with roughly 65 million parameters across six encoder and six decoder layers, reached 27.3 BLEU on English-to-German translation; the larger "big" variant reached 28.4 BLEU, beating everything that came before, including ensemble systems.

How Self-Attention Works

Self-attention is the mechanism that lets each token look at every other token and decide which ones matter for understanding its own meaning.

Each input token's embedding gets projected through three separate learned weight matrices to produce three vectors: a Query ("what am I looking for?"), a Key ("what do I have to offer?"), and a Value ("here's my actual content"). In the original paper, each 512-dimensional embedding was projected into 64-dimensional Q, K, and V spaces.

The attention output is computed as: softmax(QK^T / sqrt(d_k)) * V. The dot product of Q with every K measures similarity. Scaling by sqrt(d_k) keeps the dot products from growing with dimension, which would otherwise push softmax into saturated regions where gradients vanish. Softmax converts the scores into probabilities that sum to 1, and those probabilities weight the V vectors.
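The formula can be sketched in a few lines of NumPy. This is a minimal single-head illustration, not the paper's reference code; the shapes and variable names are ours:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Q, K, V: (seq_len, d_k) arrays. Returns (seq_len, d_k)."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # similarity of each query to every key
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted average of value vectors

# Toy example: 3 tokens, d_k = 4
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 4)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)
print(out.shape)  # (3, 4)
```

Each output row is a mixture of value vectors, with mixing weights determined by how well that token's query matches every key.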

Multi-head attention runs this process 8 times in parallel with different learned projections. One head might capture syntactic relationships (subject-verb agreement). Another might capture semantic similarity. Another might track positional patterns. The 8 outputs are concatenated and projected through a final matrix. This gives the model multiple simultaneous perspectives on token relationships.
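Multi-head attention is the same computation repeated with separate learned projections, with the results concatenated. A sketch with random stand-in weights, using the paper's d_model = 512, h = 8, d_k = 64:

```python
import numpy as np

d_model, n_heads = 512, 8
d_k = d_model // n_heads  # 64, as in the original paper

rng = np.random.default_rng(1)
X = rng.normal(size=(10, d_model))            # 10 tokens
# One set of projections per head, plus the final output projection.
Wq = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
Wk = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
Wv = rng.normal(size=(n_heads, d_model, d_k)) * 0.02
Wo = rng.normal(size=(d_model, d_model)) * 0.02

heads = []
for h in range(n_heads):
    Q, K, V = X @ Wq[h], X @ Wk[h], X @ Wv[h]
    scores = Q @ K.T / np.sqrt(d_k)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)             # per-head softmax
    heads.append(w @ V)                       # each head: (10, 64)

out = np.concatenate(heads, axis=-1) @ Wo    # concat back to (10, 512)
print(out.shape)  # (10, 512)
```

Concatenating 8 heads of width 64 recovers the original 512 dimensions, so the layer's input and output shapes match and layers can stack.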

Positional encoding tells the model where each token sits in the sequence, since attention itself is position-agnostic. The original paper used sinusoidal functions. Modern models use Rotary Position Embeddings (RoPE), which encode relative position directly into the attention scores. RoPE requires zero extra parameters and is used by nearly every current LLM including Llama, Mistral, and DeepSeek.
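The RoPE idea can be sketched as rotating each pair of query/key dimensions by a position-dependent angle, so the dot product between two tokens depends only on their relative offset. A simplified illustration (the pairing scheme and base frequency of 10000 follow common practice, not any one model's exact layout):

```python
import numpy as np

def rope(x, positions, base=10000.0):
    """Rotate paired dims of x (seq, d) by position-dependent angles."""
    seq, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per dim pair
    angles = np.outer(positions, freqs)         # (seq, half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

q = np.ones((1, 4))
# Positions (5, 105) and (0, 100) share the same relative offset of 100,
# so the dot products between the rotated vectors are identical.
a = rope(q, [5]) @ rope(q, [105]).T
b = rope(q, [0]) @ rope(q, [100]).T
print(np.allclose(a, b))  # True
```

This relative-offset property is what lets RoPE generalize across absolute positions with zero extra parameters.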

The Three Architectures

The original transformer had both an encoder (processes the full input bidirectionally) and a decoder (generates output left-to-right). The field split this into three variants, and the split determines what each model is good at.

Encoder-only models like BERT process the entire input at once, seeing all tokens simultaneously. BERT-Base has 110 million parameters across 12 layers. It's trained by masking roughly 15% of tokens and predicting them from bidirectional context. This architecture excels at classification, search, named entity recognition, and question answering. BERT scored 80.5 on the overall GLUE benchmark versus GPT's 72.8 because bidirectional context gives it a richer understanding of meaning.

Decoder-only models like GPT process tokens left-to-right, predicting the next token at each step. This architecture won the generation race. GPT-3 scaled to 175 billion parameters and demonstrated that at sufficient size, decoder-only models can perform tasks they weren't explicitly trained on (in-context learning). The architecture is simpler, scales more predictably, and autoregressive generation is naturally suited for open-ended text. Every frontier chatbot today uses this architecture.

Encoder-decoder models like T5 and BART combine both. T5 treats every NLP task as text-to-text, scaling from 60 million to 11 billion parameters. BART uses denoising pre-training. These architectures still outperform decoder-only models on translation and abstractive summarization, where complex input-to-output mapping matters. But for general-purpose AI, decoder-only has become the default.

Scaling Laws Changed Everything

In 2020, Kaplan et al. at OpenAI discovered that language model loss follows a power law in model size, dataset size, and compute, holding across seven orders of magnitude. Their key prescription: as compute budgets grow, roughly 73% of the increase should go to model size and only 27% to data. Bigger models are more sample-efficient.

In 2022, DeepMind's Chinchilla paper corrected this. Hoffmann et al. trained over 400 models and found that model size and training tokens should scale equally: roughly 20 training tokens per parameter is optimal. Their 70 billion parameter Chinchilla, trained on 1.4 trillion tokens, outperformed Gopher (280B), GPT-3 (175B), and Megatron-Turing (530B) by significant margins, beating Gopher by 7.6 points on MMLU. The lesson: most LLMs were undertrained. Too many parameters, not enough data.
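Under the standard approximation that training compute is C ≈ 6·N·D FLOPs (N parameters, D tokens), the Chinchilla rule D ≈ 20·N pins down a compute-optimal model size for any budget. A back-of-envelope sketch, not the paper's full fitting procedure:

```python
import math

def chinchilla_optimal(compute_flops, tokens_per_param=20.0):
    """Compute-optimal (params, tokens) assuming C = 6*N*D and D = k*N.

    Substituting D = k*N gives C = 6*k*N^2, so N = sqrt(C / (6*k)).
    """
    n = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n, tokens_per_param * n

# Chinchilla itself: 70B params on 1.4T tokens -> C = 6 * 70e9 * 1.4e12
C = 6 * 70e9 * 1.4e12
n, d = chinchilla_optimal(C)
print(f"params ~{n/1e9:.0f}B, tokens ~{d/1e12:.1f}T")  # params ~70B, tokens ~1.4T
```

Run the same function on GPT-3's training budget and it says 175B parameters was far more model than the data supported — exactly the undertraining the paper diagnosed.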

DeepSeek challenged both assumptions. DeepSeek-V3 uses 671 billion total parameters but activates only 37 billion per token through Mixture of Experts. It trained on 14.8 trillion tokens for $5.576 million total. For context, that's a fraction of what US labs spend. Their Multi-Head Latent Attention compresses the KV cache by over 90%. FP8 mixed-precision training cuts memory further. The DeepSeek economics breakdown covers how software-hardware co-design enables frontier performance at a fraction of competitors' costs. The scaling laws are real, but how you allocate compute matters as much as how much you have.

Modern Innovations That Matter

Five innovations from 2022-2025 changed what transformers can do in practice.

Mixture of Experts replaces the standard feed-forward layer with multiple parallel "expert" networks plus a routing gate. The gate selects the top-k experts per token. Mixtral 8x7B has 46.7 billion total parameters but activates only 12.9 billion per token, delivering Llama 2 70B-level performance with roughly 5x fewer active parameters. The trade-off: all parameters must fit in memory even though only a fraction activate per token. MoE is now the default architecture for frontier models. For a full treatment of how MoE routing works in practice, see the dedicated coverage.
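Top-k routing fits in a dozen lines. A toy sketch where simple linear maps stand in for the expert FFN blocks (all names and shapes are hypothetical):

```python
import numpy as np

def moe_layer(x, gate_w, experts, k=2):
    """Route one token x through the top-k of several expert networks.

    x: (d,) token activation; gate_w: (d, n_experts); experts: list of
    callables (d,) -> (d,). Only k experts actually run for this token.
    """
    logits = x @ gate_w
    topk = np.argsort(logits)[-k:]                 # indices of best experts
    weights = np.exp(logits[topk] - logits[topk].max())
    weights /= weights.sum()                       # softmax over the top-k only
    return sum(w * experts[i](x) for w, i in zip(weights, topk))

rng = np.random.default_rng(2)
d, n_experts = 16, 8
gate_w = rng.normal(size=(d, n_experts))
mats = [rng.normal(size=(d, d)) * 0.1 for _ in range(n_experts)]
experts = [lambda v, M=M: v @ M for M in mats]     # stand-ins for FFN experts
y = moe_layer(rng.normal(size=d), gate_w, experts, k=2)
print(y.shape)  # (16,)
```

Note what the sketch makes concrete: all eight expert weight matrices exist in memory, but only two matrix multiplies run per token — the source of MoE's compute savings and its memory cost.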

FlashAttention solved the memory bottleneck. Standard attention requires O(n^2) memory, which caps sequence length. Dao et al. used IO-aware tiling to reduce this to O(n) while computing exact attention (no approximation). FlashAttention-1 delivered 2-4x speedups and 10-20x memory savings. FlashAttention-2 improved parallelism further. FlashAttention-3, optimized for NVIDIA H100s, reaches 740 TFLOPs/s in FP16 and close to 1.2 PFLOPs/s in FP8. This should be the default for any transformer deployment.
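The heart of the trick is the online softmax: exact attention can be computed over key/value tiles while carrying only a running max, a running sum, and an accumulator, never materializing the full n×n score matrix. A numerically faithful but unoptimized sketch (the real kernel does this in on-chip SRAM):

```python
import numpy as np

def attention_tiled(q, K, V, tile=4):
    """One query row of exact attention, processing K/V in tiles."""
    d = q.shape[-1]
    m = -np.inf           # running max of scores seen so far
    s = 0.0               # running sum of exp(score - m)
    acc = np.zeros(d)     # running weighted sum of value rows
    for i in range(0, len(K), tile):
        scores = K[i:i+tile] @ q / np.sqrt(d)
        m_new = max(m, scores.max())
        scale = np.exp(m - m_new)          # rescale old state to the new max
        p = np.exp(scores - m_new)
        s = s * scale + p.sum()
        acc = acc * scale + p @ V[i:i+tile]
        m = m_new
    return acc / s

rng = np.random.default_rng(3)
q, K, V = rng.normal(size=8), rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
# Reference: standard full-matrix attention for the same query row.
scores = K @ q / np.sqrt(8)
w = np.exp(scores - scores.max()); w /= w.sum()
ref = w @ V
print(np.allclose(attention_tiled(q, K, V), ref))  # True
```

Because each tile only touches O(tile) scores at a time, memory stays linear in sequence length while the answer matches full attention exactly.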

KV cache optimization addresses the bottleneck in autoregressive generation. The KV cache stores key and value tensors from all previous tokens, and loading it becomes the primary constraint at large batch sizes. Grouped Query Attention (GQA) shares K/V heads across multiple Q heads for 30-40% faster inference, used in Llama 2+ and Mistral. DeepSeek's Multi-Head Latent Attention achieves 90%+ compression. These techniques directly determine how many concurrent users your model can serve.
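The memory stakes are easy to quantify: the cache holds 2 tensors (K and V) × layers × kv_heads × head_dim × seq_len elements, so sharing K/V heads shrinks it linearly. A rough sizing sketch; the 70B-class config numbers below (80 layers, head_dim 128, 64 query heads sharing 8 KV heads) are illustrative assumptions:

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per=2):
    """Total KV-cache size for one sequence (fp16/bf16 by default)."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per

# Full multi-head attention: one KV head per query head (64 of them).
mha = kv_cache_bytes(layers=80, kv_heads=64, head_dim=128, seq_len=4096)
# GQA: 8 KV heads shared across the 64 query heads.
gqa = kv_cache_bytes(layers=80, kv_heads=8,  head_dim=128, seq_len=4096)
print(f"MHA: {mha/2**30:.1f} GiB, GQA: {gqa/2**30:.2f} GiB, saving {mha/gqa:.0f}x")
```

Under these assumptions, one 4K-token conversation costs about 10 GiB of cache with full MHA versus about 1.25 GiB with 8-way GQA — the difference between serving a handful of users per GPU and dozens.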

Speculative decoding uses a small, fast "draft" model to generate multiple candidate tokens, then verifies them in parallel with the large target model. Correct tokens are accepted; incorrect ones are rejected. Apple's QuantSpec achieves over 90% acceptance rates and up to 2.5x end-to-end speedup. This is free performance for any deployment running a large model.
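The accept/reject loop can be sketched with toy stand-in "models." Real systems compare draft and target token probabilities and batch the target's verification into one forward pass; this sketch simplifies to greedy agreement:

```python
def speculative_step(draft_model, target_model, prefix, k=4):
    """Draft k tokens cheaply, keep the longest prefix the target agrees with.

    Both models are callables: token list -> next token. The target is
    queried per position here, but those queries run as one batch in practice.
    """
    drafted, ctx = [], list(prefix)
    for _ in range(k):                  # cheap sequential drafting
        t = draft_model(ctx)
        drafted.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafted:                   # parallelizable verification
        t_target = target_model(ctx)
        if t_target == t:               # target agrees -> token accepted for free
            accepted.append(t)
            ctx.append(t)
        else:                           # first disagreement: take target's token
            accepted.append(t_target)
            break
    return accepted

# Toy models: draft repeats the last token; target bumps it every 3rd step.
draft = lambda ctx: ctx[-1]
target = lambda ctx: ctx[-1] + (1 if len(ctx) % 3 == 0 else 0)
print(speculative_step(draft, target, [7], k=4))  # [7, 7, 8]
```

The output is guaranteed to match what the target model alone would produce; the draft model only changes how many target forward passes are needed per token.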

Context windows expanded from 512 tokens in 2017 to 2 million (Gemini 1.5 Pro) and 10 million (Llama 4) by 2025. RoPE, FlashAttention, and sparse attention patterns made this possible. The caveat: most models degrade before their advertised limit. A 200K model typically becomes unreliable around 130K. How attention heads allocate their budget across long contexts determines real-world capability more than the headline number.

State Space Models: The Challenger

Mamba, published in December 2023, offers an alternative to attention with O(n) complexity instead of O(n^2). Mamba-3B outperformed transformers of the same size and matched transformers twice its size on language modeling, with 5x higher throughput at inference.
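At its core, an SSM layer is a linear recurrence: each step updates a fixed-size hidden state and reads out the output, which is why inference cost is constant per token regardless of context length. A scalar-state toy (Mamba additionally makes the A, B, C parameters functions of the input — the "selective" part — which this sketch omits):

```python
import numpy as np

def ssm_scan(x, A, B, C):
    """Discrete linear state-space scan: h_t = A*h_{t-1} + B*x_t, y_t = C·h_t."""
    h = np.zeros_like(B)
    ys = []
    for x_t in x:                 # O(n) sequential steps, O(1) state per token
        h = A * h + B * x_t
        ys.append(float(C @ h))
    return np.array(ys)

# |A| < 1 gives an exponentially decaying memory of past inputs.
y = ssm_scan(x=np.ones(5), A=0.5, B=np.array([1.0]), C=np.array([1.0]))
print(y)  # [1.  1.5  1.75  1.875  1.9375]
```

Contrast with attention: the transformer keeps every past token's K/V around, while the SSM compresses history into one state vector — cheaper, but lossy, which foreshadows the retrieval weaknesses below.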

Mamba-2 proved mathematically that SSMs and attention are related through decompositions of semiseparable matrices, with its core layer running 2-8x faster than Mamba-1 and faster than FlashAttention-2 on sequences of 2K+ tokens.

The limitation: 2025 research showed Mamba underperforms transformers on long-context understanding, in-context learning, and retrieval tasks. The emerging answer is hybrid architectures that interleave transformer and SSM layers, pairing attention's precise retrieval with the SSM's linear-time efficiency.

Why This Matters for What You Build

Architecture choices translate directly into cost and capability. Here's the practical cheat sheet.

For text generation and chatbots, use decoder-only models. For classification and search, use encoder-only (or embeddings from decoder-only models). For translation and summarization, encoder-decoder still wins on quality. For budget-constrained deployment, MoE architectures deliver frontier capability at lower compute per token. For low-latency requirements, use smaller dense models with speculative decoding.

DistilBERT demonstrated the principle: 40% smaller than BERT, 60% faster, retaining 97% of language understanding. Model size doesn't equal capability. Training data quality, architecture efficiency, and inference-time scaling all matter at least as much. The teams that understand these trade-offs spend less and ship faster than the teams that default to the biggest model available.
