The A.I. Beat

Dispatches from the frontier of machine intelligence
Large Language Models · April 23, 2026 · 8 min read

AI Hallucinations: Why Models Make Things Up and What's Being Done

AI hallucinations remain the biggest unsolved reliability problem in the field. Here's the technical explanation of why they happen, a taxonomy of types, real examples, and the mitigation stack that's actually working.

In February 2024, a Canadian tribunal held Air Canada liable for a refund policy that did not exist. The airline’s customer service chatbot had told a grieving customer that he could book a full-price ticket and claim the bereavement discount retroactively, within 90 days, which contradicted the airline’s actual policy. The customer relied on that answer. The tribunal ruled that Air Canada was responsible for its chatbot’s statement, even though no human at the airline had ever approved the policy the bot described.

In 2023, a New York lawyer submitted a legal brief containing six fabricated case citations — all generated by ChatGPT, all completely nonexistent. The cases had real-sounding names, plausible docket numbers, and fictional holdings that supported his argument perfectly. He was sanctioned by the court.

These are not bugs. They are the predictable consequence of how large language models work. Understanding why hallucinations happen — at a technical level, not just a hand-wave — is essential for anyone who uses AI tools for anything that matters.

Why Hallucinations Happen: The Technical Picture

The explanation starts with a fact that most coverage gets wrong. Language models do not store facts in a database and retrieve them. They predict the next token — the next word, or piece of a word — based on probability distributions learned during training.

When you ask a model “What is the capital of Australia?” the model does not look up “Australia → capital → Canberra” in a table. It computes a probability distribution over all possible next tokens. “Canberra” usually has the highest probability because it appeared most often in that context during training. But “Sydney” has a non-trivial probability too, because “Sydney” appears near “Australia” in training text far more often than “Canberra” does, and because some pages state outright (and incorrectly) that Sydney is the capital.

For well-attested facts, the correct answer usually wins the probability race. For less common facts, niche topics, or questions that require combining multiple pieces of information, the probabilities become muddier — and the model may generate a confident-sounding response that is statistically likely but factually wrong.
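The "probability race" described above can be sketched with a softmax over candidate tokens. This is a toy illustration, not how any particular model is implemented: the four candidate tokens and their scores are invented, and a real model scores tens of thousands of tokens at every step.

```python
import math

def softmax(logits, temperature=1.0):
    """Turn raw token scores into a probability distribution."""
    scaled = [score / temperature for score in logits.values()]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return {token: e / total for token, e in zip(logits, exps)}

# Hypothetical scores for the token after "The capital of Australia is".
logits = {"Canberra": 5.0, "Sydney": 3.5, "Melbourne": 2.0, "Vienna": -1.0}

probs = softmax(logits)                 # default sampling temperature
hot = softmax(logits, temperature=2.0)  # flatter distribution

for token in logits:
    print(f"{token:10s} {probs[token]:.3f}  (T=2: {hot[token]:.3f})")
```

The correct token wins here, but the wrong one keeps a real share of the probability mass, and raising the sampling temperature increases that share. When the scores are closer together, as they tend to be for obscure facts, the wrong token can win outright.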

The Five Root Causes

1. Knowledge gaps. The model was never trained on the relevant information, or the information appeared too rarely in training data to form a strong pattern. Rather than outputting “I don’t know” — which requires the model to have accurate self-knowledge about its own training data — it generates the most plausible continuation. This is how you get fabricated citations: the model knows what a legal citation looks like (format, structure, style) but does not have the specific case in its training data, so it generates one that fits the pattern.

2. Conflicting training data. The internet is full of contradictions. The same question may have different answers across hundreds of web pages. The model learns a blended distribution that may not match any single authoritative source. This is especially problematic for topics where popular belief differs from expert consensus.

3. Sycophancy pressure. Models are fine-tuned with reinforcement learning from human feedback (RLHF) to be helpful, and human raters tend to prefer confident, complete answers over hedged or uncertain ones. This training signal creates a systematic pressure to produce an answer, any answer, rather than express uncertainty. The response that says “I’m not sure” tends to get downvoted. The response that invents a plausible answer tends to get upvoted. The incentives are misaligned.

4. Compositionality failures. Many hallucinations occur when the model needs to combine two or more individually correct facts into a novel conclusion. Each fact may be accurate in isolation, but the combination is wrong. “Person A won the Nobel Prize in 2019” (true) + “Person B won it in 2020” (true) → “Person A and B shared the prize” (false). The model is interpolating in a space where interpolation does not preserve truth.

5. Temporal confusion. Models have a training data cutoff and no reliable mechanism for distinguishing “I was trained on this information” from “this is current.” A model trained on data through early 2025 may state that a company’s CEO is someone who was replaced six months ago — not because it is guessing, but because that was the correct answer when it last saw data about it.

The Taxonomy of Hallucinations

Not all hallucinations are equally dangerous, and they require different detection strategies.

The most dangerous hallucinations are fabrications and confabulated details — because they are the hardest to detect. A fabricated citation looks exactly like a real one. An invented statistic embedded in an otherwise accurate paragraph does not trigger alarm bells unless you specifically check it.

The Hallucination Rate: What the Data Shows

Benchmarking hallucination rates is methodologically difficult — it depends on the domain, the question type, and how you define “hallucination.” But several large-scale evaluations provide useful data points:

  • Vectara’s Hughes Hallucination Evaluation Model (HHEM), which tests models on summarization tasks, found hallucination rates ranging from 1.5% (best-performing models in early 2026) to 15%+ (smaller, less capable models). The rate has fallen roughly 50% per year for frontier models since 2023.
  • FActScore, an evaluation framework developed by researchers at the University of Washington with collaborators including the Allen Institute for AI, measures factual precision in long-form biographies. Top models in early 2026 score 85-92% factual precision, meaning 8-15% of generated atomic facts contain errors.
  • SimpleQA, OpenAI’s factual accuracy benchmark, showed GPT-4o answering roughly 39% of tricky factual questions correctly and Claude 3.5 Sonnet roughly 28%, with Gemini at similar levels. These are adversarially selected hard questions, not typical queries, but they demonstrate that even frontier models are unreliable on the long tail of factual knowledge.

The trend is clearly toward improvement: hallucination rates are dropping with each model generation. But “improving” and “solved” are different things. A 5% hallucination rate sounds low until you consider that users may ask hundreds of factual questions per week.
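To make the 5% figure concrete, here is a small back-of-the-envelope calculation. It assumes, purely for illustration, that a user asks 200 factual questions per week and that each answer is wrong independently with the same probability. Real errors cluster by topic, so treat the output as a rough guide rather than a measurement.

```python
def expected_errors(rate, questions):
    """Average number of wrong answers among `questions` answers."""
    return rate * questions

def prob_at_least_one_error(rate, questions):
    """Chance that at least one answer is wrong, assuming independent errors."""
    return 1 - (1 - rate) ** questions

rate = 0.05             # the 5% rate from the text
weekly_questions = 200  # assumed usage level, not a measured figure

print(expected_errors(rate, weekly_questions))
print(prob_at_least_one_error(rate, 20))
print(prob_at_least_one_error(rate, weekly_questions))
```

Under these assumptions the user sees about ten wrong answers a week, and even a batch of 20 questions is more likely than not to contain at least one error.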

The Mitigation Stack

The industry’s approach to hallucinations is not a single solution but a stack of complementary techniques, each addressing a different part of the problem.

Layer 1: Retrieval-Augmented Generation (RAG)

Instead of asking the model to recall facts from memory, retrieve relevant documents at query time and provide them in the prompt. The model’s job shifts from “remember the answer” to “find the answer in this text” — a much easier task that dramatically reduces factual hallucinations.

RAG is the single most effective hallucination mitigation technique in production systems today. Perplexity’s entire product is essentially a RAG system: it searches the web, retrieves relevant pages, and asks the model to synthesize an answer from those pages. Microsoft’s Copilot does the same with enterprise data.

Limitation: RAG only works when the relevant information exists in the retrieval corpus. If the document set does not contain the answer, the model may still hallucinate — sometimes incorporating irrelevant retrieved text in misleading ways.
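A minimal sketch of the retrieve-then-prompt pattern, including the empty-retrieval case from the limitation above. Everything here is an assumption made for illustration: the corpus is two invented policy snippets, retrieval is plain word overlap rather than the embedding search production systems typically use, and the function names are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into a set of words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query, corpus, min_overlap=2):
    """Return (doc_id, text) pairs sharing at least min_overlap words with the query."""
    query_words = tokenize(query)
    scored = []
    for doc_id, text in corpus.items():
        overlap = len(query_words & tokenize(text))
        if overlap >= min_overlap:
            scored.append((overlap, doc_id, text))
    scored.sort(reverse=True)  # highest overlap first
    return [(doc_id, text) for _, doc_id, text in scored]

def build_prompt(query, corpus):
    """Build a grounded prompt, or return None when nothing relevant was found."""
    docs = retrieve(query, corpus)
    if not docs:
        return None  # caller should decline instead of letting the model guess
    context = "\n".join(f"[{doc_id}] {text}" for doc_id, text in docs)
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n"
        f"{context}\n\nQuestion: {query}"
    )

# Invented snippets standing in for a real document store.
corpus = {
    "policy-1": "Bereavement fares must be requested before travel.",
    "policy-2": "Checked baggage fees apply to economy fares.",
}

print(build_prompt("Can I request a bereavement fare after travel?", corpus))
print(build_prompt("What is the pet policy?", corpus))
```

Returning `None` when nothing clears the relevance threshold is one way to keep the model from answering out of memory. It does not address the other failure named above, where retrieved but irrelevant text gets woven into the answer.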

Layer 2: Improved Training

Model providers are training models to express uncertainty rather than confabulate. Anthropic’s constitutional AI approach trains the model against written principles, which can include acknowledging uncertainty instead of guessing. OpenAI has described training objectives that reward models for declining to answer when they lack sufficient knowledge.

The results are measurable: Claude 3.5 Sonnet’s refusal rate on questions outside its knowledge is roughly 3x higher than Claude 2’s was — and its accuracy on questions it does answer has increased correspondingly.

Limitation: There is a fundamental tension between helpfulness and accuracy. A model that refuses too often is useless. A model that refuses too rarely hallucinates. Calibrating this tradeoff is an ongoing challenge.
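The refusal tradeoff can be shown with a toy calculation. The sketch below assumes each answer comes with a confidence score and that the system refuses anything under a threshold. The eight records are invented, and real confidence scores are much less tidy.

```python
def coverage_and_accuracy(records, threshold):
    """records: list of (confidence, is_correct) pairs.

    The system answers only when confidence >= threshold. Returns the share of
    questions answered and the accuracy on those answers (None if none answered).
    """
    answered = [correct for confidence, correct in records if confidence >= threshold]
    coverage = len(answered) / len(records)
    accuracy = sum(answered) / len(answered) if answered else None
    return coverage, accuracy

# Invented evaluation records: (model confidence, was the answer correct?)
records = [
    (0.95, True), (0.90, True), (0.80, True), (0.70, False),
    (0.60, True), (0.40, False), (0.30, False), (0.20, True),
]

for threshold in (0.0, 0.5, 0.75):
    print(threshold, coverage_and_accuracy(records, threshold))
```

Raising the threshold improves accuracy on the questions that do get answered, but the share of questions answered shrinks. That is the helpfulness versus accuracy tension in miniature.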

Layer 3: Citation and Attribution

Systems like Perplexity, Bing Copilot, and Google’s AI Overviews now attach citations to specific claims. This does two things: it makes hallucinations easier for users to detect (you can check the source), and it changes the model’s generation behavior (models that are required to cite sources tend to stick closer to those sources).

Limitation: The model can cite a real source while misrepresenting what it says. Citation is necessary but not sufficient for accuracy.
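One cheap check against misattributed citations is to confirm that a quoted passage actually appears in the cited source. The sketch below assumes claims arrive as a quote plus a source identifier, which is a hypothetical format. It catches only verbatim mismatches. Detecting a paraphrase that distorts the source needs a stronger method, such as an entailment model or a human reader.

```python
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not matter."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def check_citations(claims, sources):
    """Return (source_id, status) for each claim.

    claims:  list of {"quote": str, "source_id": str}
    sources: dict mapping source_id to the full source text
    """
    results = []
    for claim in claims:
        source_text = sources.get(claim["source_id"])
        if source_text is None:
            status = "missing source"
        elif normalize(claim["quote"]) in normalize(source_text):
            status = "supported"
        else:
            status = "not found in source"
        results.append((claim["source_id"], status))
    return results

# Invented example data.
sources = {"doc-a": "Refunds are available within 30 days of purchase."}
claims = [
    {"quote": "refunds are available within 30 days", "source_id": "doc-a"},
    {"quote": "refunds are available within 90 days", "source_id": "doc-a"},
    {"quote": "refunds are always available", "source_id": "doc-b"},
]

for source_id, status in check_citations(claims, sources):
    print(source_id, status)
```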

Layer 4: Tool Use and Grounding

Rather than asking the model to recall that 247 times 389 equals 96,083, let it use a calculator. Rather than asking it to remember the current stock price of Apple, let it use a search API. Tool use removes entire categories of hallucination by replacing recall with lookup.

Modern AI systems increasingly use tools by default: ChatGPT calls a search API for current information, uses a Python interpreter for math, and reads uploaded files for document analysis. Each correct tool call replaces a recall step where a hallucination could have occurred.

Limitation: The model must correctly decide when to use a tool and how to interpret the results. Models sometimes hallucinate tool calls (calling a function that does not exist) or misinterpret tool output.
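A sketch of a tool dispatcher that guards against the hallucinated-tool-call failure described above. The tool-call format (a dict with `name` and `input`) and the single registered tool are assumptions for illustration, not any vendor's actual API. The calculator walks the parsed expression instead of calling `eval`, so it accepts only basic arithmetic.

```python
import ast
import operator

_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}

def calculator(expression):
    """Evaluate +, -, *, / on numeric literals without using eval()."""
    def evaluate(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _OPS:
            return _OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        raise ValueError("unsupported expression")
    return evaluate(ast.parse(expression, mode="eval").body)

# Registry of tools that really exist. Anything else the model names is rejected.
TOOLS = {"calculator": calculator}

def dispatch(tool_call):
    """Run a model-requested tool call, refusing tools that are not registered."""
    name = tool_call.get("name")
    if name not in TOOLS:
        return {"ok": False, "error": f"unknown tool: {name}"}
    try:
        return {"ok": True, "result": TOOLS[name](tool_call.get("input", ""))}
    except (ValueError, SyntaxError, ZeroDivisionError) as exc:
        return {"ok": False, "error": str(exc)}

print(dispatch({"name": "calculator", "input": "247 * 389"}))
print(dispatch({"name": "stock_price", "input": "AAPL"}))
```

Returning a structured error for an unknown tool lets the caller feed the failure back to the model rather than silently executing nothing. Misreading a correct tool result is a separate problem that this check does not cover.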

Layer 5: Confidence Calibration

The frontier of hallucination research is teaching models to know what they know. A well-calibrated model would express high confidence only when it is likely to be correct and low confidence when it is guessing. Current models are poorly calibrated — they express similar confidence whether they are right or wrong.

Some approaches being explored: training models to output probability estimates alongside claims, using ensemble methods (multiple models vote on an answer, and disagreement signals uncertainty), and meta-cognitive probing (asking the model to evaluate its own confidence before committing to an answer).

Limitation: This is the least mature layer. No production system has achieved reliable confidence calibration yet.
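The ensemble idea mentioned above can be sketched as a simple vote: collect several answers to the same question and treat low agreement as a signal to abstain. The 0.6 agreement threshold and the exact-match comparison are assumptions for illustration. Real systems would need fuzzier answer matching, and agreement is only a rough proxy for correctness because models trained on similar data can share the same mistake.

```python
from collections import Counter

def vote(answers):
    """Return (most common answer, fraction of answers that agree with it).

    Expects a non-empty list of answer strings.
    """
    normalized = [answer.strip().lower() for answer in answers]
    top, count = Counter(normalized).most_common(1)[0]
    return top, count / len(normalized)

def answer_or_abstain(answers, min_agreement=0.6):
    """Return the majority answer, or None when agreement is too low."""
    top, agreement = vote(answers)
    return top if agreement >= min_agreement else None

# Answers might come from different models or from rephrasings of one question.
print(answer_or_abstain(["Canberra", "canberra ", "Sydney"]))
print(answer_or_abstain(["Canberra", "Sydney", "Melbourne"]))
```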

Practical Verification Strategies

If you are using AI tools for anything that matters, here are the specific verification practices that work:

1. Treat AI output like a junior employee’s first draft. Helpful, usually directionally correct, but requires senior review before it goes anywhere. This mental model prevents both over-trust and dismissal.

2. Verify any specific number, date, name, or citation. Hallucinations cluster disproportionately in specific claims. The prose around those claims is usually fine — it is the precise facts that go wrong. If the AI says “a 2024 study by researchers at MIT found that…” verify that the study exists, it was from MIT, and it was in 2024.

3. Be especially skeptical of the impressive. When an AI response includes a surprisingly perfect quote, an uncannily relevant statistic, or a citation that supports your argument exactly, your alarm bells should ring loudest. The model is optimized to produce satisfying responses. The most satisfying response to your question is also the most likely to be fabricated.

4. Use the “ask twice differently” test. If you suspect a claim might be hallucinated, rephrase your question and ask again, or ask a different model. If both give the same answer, it is more likely to be correct, though models trained on similar data can share the same error. If they give different answers, at least one is wrong.

5. Check the edges. Hallucinations are more frequent for: recent events (near or after the training cutoff), niche or obscure topics (less training data), specific numbers and dates (harder for probability-based prediction), and questions about the model itself (models are unreliable narrators of their own capabilities).

6. Demand sources, then check them. Asking the model “cite your source” often produces a real-looking but nonexistent URL. Instead, take the claim to Perplexity or Google and verify independently. Do not trust the model’s self-citations.
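Part of this checklist can be automated. The sketch below flags the kinds of specifics that the second practice says to verify (years, percentages, URLs, and attributions such as "study by") so a reviewer knows where to look first. The patterns are rough, assumed heuristics and the sample sentence is made up. The function does not verify anything itself.

```python
import re

# Rough heuristics for the kinds of specifics worth checking by hand.
PATTERNS = {
    "year": r"\b(?:19|20)\d{2}\b",
    "percentage": r"\b\d+(?:\.\d+)?%",
    "url": r"https?://\S+",
    "citation_phrase": r"\b(?:study|survey|report|paper) by\b",
}

def flag_checkable_claims(text):
    """Return (label, matched text) pairs in the order they appear in the text."""
    flags = []
    for label, pattern in PATTERNS.items():
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            flags.append((match.start(), label, match.group()))
    return [(label, found) for _, label, found in sorted(flags)]

# A made-up model output, used only to exercise the patterns.
sample = "A 2024 study by researchers at MIT found a 37% drop."

for label, found in flag_checkable_claims(sample):
    print(f"{label}: {found}")
```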

The Fundamental Tension

Here is the uncomfortable truth that the AI industry has not yet resolved: the same properties that make language models useful are the properties that make them hallucinate.

If a model could only output information it was certain about, it would be useless for creative tasks, brainstorming, hypothetical reasoning, and most of the things people actually use AI for. The ability to generate novel combinations of ideas — to say things that were never in the training data — is both the superpower and the failure mode.

A model that never hallucinated would be a search engine. We already have those.

The path forward is not eliminating hallucinations entirely — it is building systems and practices that manage the tradeoff. RAG for factual grounding. Tool use for verifiable claims. Confidence calibration for transparency. Human review for high-stakes decisions.

Hallucinations are getting less frequent with each model generation, but the rate will never reach zero, because reaching zero would require the model to have a perfect internal representation of all truth, and that is not what these systems are. They are pattern-completion engines that are extraordinarily good at approximating knowledge but fundamentally incapable of the kind of ground-truth verification that humans do when they check a fact.

The right response is not to stop using AI. It is to use AI the way you would use any powerful tool that sometimes fails: with verification, with appropriate skepticism, and with awareness of where the failure modes are.

Trust, but verify. And know exactly what you are verifying.
