Morning Edition · Vol. I · No. 1 · Est. MMXXVI

The A.I. Beat

Dispatches from the frontier of machine intelligence
Three Dollars
DEEP DIVE · April 15, 2026 · 7 min read

What Are Large Language Models and How Do They Actually Work?

Beyond the buzzwords: how transformer-based language models are built, trained, and deployed, with real architecture details, cost figures, and a clear-eyed look at what 'intelligence' means in this context.
“A 175-billion-parameter model doesn't 'know' things the way you do. It has compressed statistical regularities from a trillion words into a vast, differentiable lookup table.”

Large language models have become infrastructure. They draft legal briefs, triage customer support tickets, generate working code, and write marketing copy at a scale that was science fiction five years ago. OpenAI reported more than 200 million weekly active users for ChatGPT in August 2024. Anthropic’s Claude processes billions of tokens daily for enterprise clients. Google’s Gemini is woven into Search, Workspace, and Android.

But most coverage oscillates between two poles — breathless hype (“it thinks like a human!”) and dismissive reductionism (“it’s just autocomplete”). Both miss the mark. The reality is more interesting and more technical than either framing allows.

This is a serious look at what LLMs are, how they are built, and where the architecture actually creates — and fails to create — something that resembles understanding.

The Training Pipeline

Building a frontier LLM is a multi-stage industrial process. Each phase transforms the model in a distinct way, and each costs a different order of magnitude.

Stage 1: Data collection and tokenization. The model never sees raw text. A tokenizer — typically using Byte Pair Encoding (BPE) — segments text into subword units. GPT-4’s tokenizer uses roughly 100,000 token types. The training corpus for a frontier model in 2025-2026 is estimated at 10-15 trillion tokens, drawn from web crawls (Common Crawl), books, code repositories, scientific papers, and licensed data. Meta disclosed that Llama 3 was trained on over 15 trillion tokens. Quality filtering — deduplication, toxicity removal, domain balancing — is where much of the secret sauce lives.
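To make the tokenization step concrete, here is a minimal sketch of how Byte Pair Encoding learns its merges: count adjacent symbol pairs, fuse the most frequent pair into a new symbol, and repeat. It is a toy that works on whitespace-split characters; production tokenizers operate on bytes and add pre-tokenization rules, and the function names here are illustrative, not any library's API.

```python
from collections import Counter


def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None


def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of `pair` into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a whitespace-split corpus."""
    words = dict(Counter(tuple(word) for word in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges, words


merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)
print(words)
```

After two merges the frequent word "low" has become a single token, while the rarer "lower" and "lowest" are still split into "low" plus leftover characters. That is the basic trade BPE makes: common strings get short encodings, rare ones fall back to smaller pieces.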

Stage 2: Pre-training. This is the expensive part. The model learns to predict the next token given all preceding tokens, a task called causal language modeling. For GPT-3 (175 billion parameters), OpenAI spent an estimated $4.6 million in compute in 2020. By 2024, training a frontier model like GPT-4 was reported to cost north of $100 million. Llama 3 405B required 30.84 million GPU-hours on Meta’s cluster of 16,384 H100 GPUs. The raw scale is staggering, but the task itself is simple: predict the next token, trillions of times over, and nudge the weights after each batch to reduce the prediction error.
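The objective behind that description is ordinary cross-entropy on the next token. The sketch below computes it in plain Python for a toy vocabulary of four tokens; the logits are made up for illustration, and a real training loop would compute the same quantity over large batches on accelerators and backpropagate through it.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def next_token_loss(logits_per_position, token_ids):
    """Average cross-entropy of predicting token t+1 from the logits at position t."""
    losses = []
    for t in range(len(token_ids) - 1):
        probs = softmax(logits_per_position[t])
        target = token_ids[t + 1]
        losses.append(-math.log(probs[target]))
    return sum(losses) / len(losses)


token_ids = [0, 2, 1]  # a three-token "document" over a 4-token vocabulary

# An untrained model that spreads probability evenly over the vocabulary.
uniform_logits = [[0.0] * 4 for _ in token_ids]
uniform_loss = next_token_loss(uniform_logits, token_ids)

# A model that puts most of its probability on the correct next token.
confident_logits = [[0.0, 0.0, 5.0, 0.0], [0.0, 5.0, 0.0, 0.0], [0.0] * 4]
confident_loss = next_token_loss(confident_logits, token_ids)

print(uniform_loss, confident_loss)
```

The uniform model scores log(4), about 1.386, which is the loss of pure guessing over four tokens. Pre-training is the process of driving this number down across the whole corpus.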

Stage 3: Supervised fine-tuning (SFT). The pre-trained model is a powerful but raw engine. It can complete text, but it does not follow instructions well. SFT uses curated datasets — tens of thousands of human-written demonstrations of ideal behavior — to teach the model to be a helpful assistant rather than a text completion engine. This phase is comparatively cheap (days, not months, of compute) but requires expensive human labor to produce the demonstrations.

Stage 4: RLHF or DPO. Reinforcement Learning from Human Feedback trains a separate reward model on human preference rankings (which of two responses is better?), then uses that reward model to further optimize the LLM, classically via proximal policy optimization (PPO). Direct Preference Optimization (DPO), introduced in 2023, simplifies this by skipping the separate reward model and optimizing the LLM directly on the preference pairs. Labs disclose few details, but published recipes use PPO-style RLHF, DPO variants, or hybrids of the two; Llama 3, for example, was post-trained with DPO. This stage is what makes the difference between a model that can answer your question and one that answers it in the way a human would actually prefer.
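For readers who want to see what "optimizing preferences directly" means, here is a sketch of the DPO loss for a single preference pair. The inputs are summed log-probabilities of the preferred ("chosen") and dispreferred ("rejected") responses under the model being trained and under a frozen reference copy; the numbers and the beta value below are illustrative assumptions, not settings from any particular lab.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair: -log(sigmoid(beta * margin))."""
    # How much more (or less) likely each response is under the policy
    # than under the frozen reference model, in log space.
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    z = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# Policy identical to the reference: no preference learned yet.
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Policy has raised the chosen response and lowered the rejected one.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)

print(baseline, improved)
```

When the policy matches the reference the loss is log(2); it falls as the policy shifts probability toward the preferred response relative to the reference. No reward model appears anywhere, which is the simplification the method is known for.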

The Transformer Stack

Nearly every modern LLM is built on the transformer architecture, introduced in the 2017 paper “Attention Is All You Need” by Vaswani et al. at Google. The architecture has proven remarkably durable: almost nine years later, no fundamentally different design has displaced it at the frontier, though there have been significant refinements (RoPE embeddings, grouped-query attention, SwiGLU activations, RMSNorm).

Input embeddings convert each token into a high-dimensional vector (typically 4,096 to 12,288 dimensions for large models). Positional information is injected so the model knows token order — modern models use Rotary Position Embeddings (RoPE) rather than the fixed sinusoidal encodings of the original transformer.
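A short sketch shows why rotary embeddings are attractive. Each pair of dimensions in a query or key vector is rotated by an angle proportional to the token's position, so the dot product between a query and a key ends up depending only on how far apart the two tokens are. The version below pairs adjacent dimensions and uses a base of 10,000, as in the original RoPE formulation; real implementations differ in how they pair dimensions and in the base they choose, so treat those details as assumptions.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate each adjacent pair of dimensions by a position-dependent angle."""
    d = len(vec)  # must be even
    out = []
    for i in range(0, d, 2):
        # Lower-index pairs rotate quickly, higher-index pairs slowly.
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


q = [1.0, 0.5, -0.3, 2.0]
k = [0.2, -1.0, 0.7, 0.4]

# Same relative offset (2 positions apart) at two different absolute positions.
score_near_start = dot(rope(q, 5), rope(k, 3))
score_further_on = dot(rope(q, 12), rope(k, 10))

print(score_near_start, score_further_on)
```

The two scores agree up to floating-point error, because only the offset between the positions survives in the dot product. The model therefore sees relative order without a separate position vector being added to the embeddings.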

Self-attention is the core innovation. For each token, the model computes how much “attention” to pay to the other tokens in the context (in decoder-only models such as GPT, only to itself and the tokens before it). This is done via three learned projections, Query, Key, and Value, for each attention head. The math is straightforward: Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) * V. The division by sqrt(d_k) prevents the dot products from growing too large and pushing softmax into regions with vanishing gradients.
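The attention formula is compact enough to write out by hand. The sketch below implements a single head in plain Python and assumes the causal variant used by decoder-only language models, where each position may only look at itself and earlier positions. The Q, K, and V rows are made-up numbers; in a real model they come from learned projections of the token embeddings.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def causal_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for one head, restricted to past positions."""
    d_k = len(K[0])
    outputs, weights = [], []
    for i, q in enumerate(Q):
        # Position i scores itself and every earlier position, nothing later.
        scores = [dot(q, K[j]) / math.sqrt(d_k) for j in range(i + 1)]
        w = softmax(scores)
        # The output is a weighted average of the visible value vectors.
        out = [sum(w[j] * V[j][c] for j in range(i + 1)) for c in range(len(V[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = causal_attention(Q, K, V)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of weights sums to one, and row i has exactly i + 1 entries: the first token can only attend to itself, so its output is simply its own value vector.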

Feed-forward networks follow each attention layer. These small neural networks process each token independently and typically account for the majority of the parameters in each block. In Llama 3 70B, they use SwiGLU activations (a gated design with three weight matrices rather than two) and an intermediate dimension of 28,672, which is 3.5x the model dimension of 8,192.
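A SwiGLU feed-forward block can be sketched in a few lines. It uses three weight matrices: a gate projection passed through the SiLU activation, an up projection that the gate multiplies element-wise, and a down projection back to the model dimension. The tiny weights below are arbitrary illustrations. The parameter count at the end uses the figures quoted above, where an intermediate dimension of 28,672 at 3.5x implies a model dimension of 8,192.

```python
import math


def silu(x):
    """SiLU (also called swish): x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))


def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def swiglu_ffn(x, W_gate, W_up, W_down):
    """down( silu(gate(x)) * up(x) ), applied to one token vector."""
    gate = [silu(g) for g in matvec(W_gate, x)]
    up = matvec(W_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(W_down, hidden)


def ffn_params(d_model, d_ff):
    """Weights in one SwiGLU block: gate, up, and down matrices (no biases)."""
    return 3 * d_model * d_ff


# Toy shapes: model dimension 2, intermediate dimension 3.
x = [1.0, 2.0]
W_gate = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_up = [[0.5, 0.5], [1.0, -1.0], [0.0, 1.0]]
W_down = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]
y = swiglu_ffn(x, W_gate, W_up, W_down)
print(y)

params_per_layer = ffn_params(d_model=8192, d_ff=28672)
print(params_per_layer)  # 704643072
```

At those dimensions a single feed-forward block holds roughly 0.7 billion weights, which gives a sense of where a large share of a model's parameters sits.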

The full stack repeats these attention + feed-forward blocks many times. GPT-3 has 96 layers. Llama 3 405B has 126. Each layer refines the representation, building from syntactic patterns in early layers to semantic and reasoning-like patterns in later layers.

How Attention Actually Works: A Concrete Example

Consider the sentence: “The trophy didn’t fit in the suitcase because it was too big.”

What does “it” refer to? The trophy or the suitcase? You know it is the trophy (the trophy was too big to fit). But how does the model figure this out?

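In a decoder-only model, attention is causal: each position attends only to itself and to earlier tokens. When the model first processes “it,” the words “too big” have not arrived yet, so the pronoun is still ambiguous. The disambiguation happens a few positions later. Once “too big” is in context, the representations at those positions can attend back to “it,” “trophy,” and “suitcase,” and a well-trained model will tend to weight “trophy” more heavily, because the pattern “didn’t fit” + “too big” correlates with the object being inserted, not the container.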

This is not a hand-coded rule. No one programmed “resolve pronouns by checking size adjectives.” The model learned this statistical regularity from encountering millions of similar constructions during pre-training. The multi-head design means different heads can track different relationships simultaneously — one head might track syntactic dependencies, another coreference, another semantic role.

This is also why the “it’s just autocomplete” framing undersells what is happening. Yes, the model predicts the next token. But to predict accurately, it must build rich internal representations of syntax, semantics, factual relationships, and even rudimentary reasoning chains. The representations are real and measurable — researchers have found linear probes that can extract part-of-speech tags, parse trees, and factual knowledge from intermediate layers.

The Numbers That Matter

The field has moved fast. Here is where things stand as of April 2026:

GPT-3 (June 2020): 175 billion parameters. Dense transformer. 96 layers, 96 attention heads, 12,288-dimensional embeddings. Trained on 300 billion tokens. Cost: ~$4.6M. This was the model that launched the current era.
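The GPT-3 figures above can be sanity-checked with a common back-of-the-envelope rule: each transformer block holds about 12 × d_model² weights, four d_model × d_model matrices for attention (query, key, value, output) plus two matrices for a feed-forward layer that is 4x as wide as the model. The sketch assumes that 4x ratio, ignores biases and normalization parameters, and uses the GPT-2/GPT-3 vocabulary size of 50,257, which is not stated in this article.

```python
def block_params(d_model, ffn_multiplier=4):
    """Approximate weights in one transformer block (biases and norms ignored)."""
    attention = 4 * d_model * d_model                       # Q, K, V, output
    feed_forward = 2 * d_model * (ffn_multiplier * d_model)  # up and down
    return attention + feed_forward


def estimate_params(n_layers, d_model, vocab_size=0):
    """Blocks plus an optional token-embedding matrix."""
    return n_layers * block_params(d_model) + vocab_size * d_model


blocks_only = estimate_params(n_layers=96, d_model=12288)
with_embeddings = estimate_params(n_layers=96, d_model=12288, vocab_size=50257)

print(blocks_only)      # 173946175488
print(with_embeddings)  # 174563733504
```

Both estimates land within one percent of the advertised 175 billion, so the layer count and embedding width quoted for GPT-3 are consistent with its headline parameter count.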

GPT-4 (March 2023): Architecture officially undisclosed, but widely reported to be a Mixture of Experts model with approximately 1.8 trillion total parameters (8 experts, ~220B each, 2 active per token). This would make it roughly 10x GPT-3’s total parameter count but only ~2.5x the active compute per token. Training cost estimated at $100M+.
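The Mixture of Experts idea is easiest to see in miniature. A small router scores every expert for each token, only the top-scoring few are run, and their outputs are blended using the renormalized router weights. The sketch below uses top-2 routing over 8 toy experts to mirror the reported (and unconfirmed) GPT-4 configuration; the experts are stand-in scalar functions, and the arithmetic treats all parameters as living inside experts, which is a simplification.

```python
import math


def top_k_route(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax over just those scores."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    m = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - m) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def moe_layer(x, experts, router_logits, k=2):
    """Run only the selected experts and blend their outputs."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Eight toy "experts": expert s just multiplies its input by s.
experts = [lambda x, s=s: s * x for s in range(8)]
router_logits = [0.1, 2.0, -1.0, 2.0, 0.0, 0.0, 0.0, 0.0]

route = top_k_route(router_logits)
output = moe_layer(1.0, experts, router_logits)
print(route, output)

# Arithmetic for the reported configuration: 8 experts of ~220B, 2 active.
total_params = 8 * 220e9
active_params = 2 * 220e9
print(total_params / 175e9, active_params / 175e9)
```

Experts 1 and 3 tie for the highest router score, so each receives a weight of 0.5 and the other six are never evaluated. The last line reproduces the ratios in the paragraph above: roughly 10x GPT-3's total parameters, but only about 2.5x the parameters touched per token.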

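Llama 3 405B (July 2024): Meta’s largest open-weights model. Dense transformer, 405 billion parameters, 126 layers. Trained on 15.6 trillion tokens using 30.84 million GPU-hours. Competitive with GPT-4 on many benchmarks. Released under Meta’s Llama 3.1 Community License, which allows commercial use but carries restrictions and is not an open-source license in the strict sense. It was nonetheless a pivotal moment for open-weights AI.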
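The Llama 3 405B numbers lend themselves to a quick back-of-the-envelope calculation. A widely used rule of thumb puts training compute for a dense transformer at about 6 floating-point operations per parameter per token. Combined with the GPU-hour figure and the 16,384-GPU cluster mentioned earlier in this article, that gives a rough wall-clock time and an implied sustained throughput per GPU. These are idealized estimates: they assume every GPU ran for the whole job and ignore restarts and other overhead.

```python
def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute for a dense transformer: ~6 * N * D."""
    return 6 * n_params * n_tokens


def wall_clock_days(gpu_hours, n_gpus):
    """Idealized duration if all GPUs ran in parallel for the whole job."""
    return gpu_hours / n_gpus / 24


def flops_per_gpu_second(total_flops, gpu_hours):
    """Average sustained throughput implied by the compute and GPU-hour totals."""
    return total_flops / (gpu_hours * 3600)


flops = training_flops(n_params=405e9, n_tokens=15.6e12)
days = wall_clock_days(gpu_hours=30.84e6, n_gpus=16_384)
per_gpu = flops_per_gpu_second(flops, gpu_hours=30.84e6)

print(f"{flops:.2e} FLOPs")           # about 3.79e+25
print(f"{days:.1f} days")             # about 78.4
print(f"{per_gpu:.2e} FLOP/s per GPU")  # about 3.41e+14
```

Under these assumptions the run works out to roughly 3.8 × 10^25 operations over about 78 days, with each GPU sustaining on the order of 340 teraFLOP/s.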

Claude 3.5 Sonnet (June 2024) / Claude Opus 4 (May 2025): Anthropic’s flagship models. Architecture details undisclosed. Known for strong performance on long-context tasks (200K token window) and coding benchmarks.

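DeepSeek V3 (December 2024): 671 billion total parameters in a MoE configuration, 37 billion active per token. DeepSeek reported a cost of approximately $5.6 million for the final training run (about 2.79 million H800 GPU-hours at an assumed $2 per GPU-hour), a figure that excludes prior research, ablation experiments, and staff. Even so, that is roughly an order of magnitude cheaper than the reported cost of Western frontier models, and it challenged the assumption that frontier performance requires frontier budgets.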

Beyond Pattern Matching

The tired framing is that LLMs are “just pattern matchers” or “stochastic parrots.” This is technically true in the same way that the human brain is “just neurons firing” — accurate at one level of description, misleading at every other.

Recent mechanistic interpretability research has revealed that transformers develop internal structures that go well beyond surface pattern matching:

  • Induction heads (Olsson et al., 2022) are attention head circuits that implement a general copying mechanism and appear to account for much of in-context learning.
  • Linear representation of concepts: work by Neel Nanda and others has shown that models represent truthfulness, sentiment, and factual knowledge as directions in activation space that can be read and manipulated.
  • Chain-of-thought reasoning: when prompted to show their work, models demonstrably perform better on reasoning tasks, suggesting the token generation process itself serves as a form of iterative computation.
  • Grokking: models sometimes memorize training data first, then suddenly develop generalizing algorithms after further training, suggesting a phase transition from memorization to generalization.
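The behavior attributed to induction heads in the list above can be written as an explicit algorithm: look back for the most recent earlier occurrence of the current token, and predict whatever followed it. The function below is a hand-written illustration of that input-output behavior only. It is not how the circuit is implemented inside a transformer, where the same effect emerges from learned attention weights across two layers.

```python
def induction_predict(tokens):
    """Given [..., A, B, ..., A], predict B by copying what followed the last A."""
    current = tokens[-1]
    # Scan backwards over earlier positions for a previous occurrence.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None  # the current token has not been seen before


context = ["Mr", "Dursley", "was", "proud", ".", "Mr"]
print(induction_predict(context))            # Dursley
print(induction_predict(["a", "b", "c"]))    # None
```

The rule works on any token sequence, including ones never seen in training, which is why this kind of circuit is described as a general mechanism and not a memorized pattern.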

None of this means LLMs are conscious or truly “understand” language in the philosophical sense. But reducing them to lookup tables is equally wrong. They are a genuinely new kind of computational artifact — one that compresses statistical regularities into a form that can generalize, compose, and occasionally surprise its creators.

What They Cannot Do (Yet)

Intellectual honesty demands acknowledging the real limitations:

Hallucination remains unsolved. Models generate plausible-sounding but fabricated information. The rate has decreased with each generation, but it has not reached zero and may be inherent to the probabilistic generation paradigm. Retrieval-augmented generation (RAG) mitigates but does not eliminate the problem.

Formal reasoning is brittle. LLMs can solve many reasoning tasks by pattern-matching against similar training examples, but they often fail on novel problems that require genuine logical deduction beyond their training distribution. The gap between performance on benchmarks (which may be contaminated) and true reasoning ability is an active area of research.

Knowledge is frozen at training time. Without external tools (search, databases, APIs), a model’s knowledge is limited to its training cutoff. This is a fundamental architectural limitation, not a bug to be patched.

Cost scales with capability. The trend of larger models producing better results has not yet hit a ceiling, but the economic and energy costs of training are becoming a serious constraint. Training the next generation of frontier models may require billions of dollars and gigawatts of power.

The Bottom Line

Large language models are transformer neural networks trained at enormous scale to predict text, then refined through human feedback to be useful. They are not sentient, they are not infallible, and they are not “just autocomplete.” They are the most consequential software architecture of the decade, and understanding how they actually work — at a level deeper than marketing copy — is essential for anyone making decisions about this technology.

The question is no longer whether LLMs work. It is what they are actually doing internally, where the limits of this paradigm lie, and what comes next. Those are the questions worth paying attention to.
