
From Pretraining to Post-Training: Why Language Models Hallucinate and How Evaluation Methods Reinforce the Problem

Large language models (LLMs) frequently generate "hallucinations": confident but incorrect outputs that appear plausible. Despite improvements in training methods and architectures, hallucinations persist. New research from OpenAI offers a rigorous explanation: hallucinations stem from the statistical properties of supervised versus self-supervised learning, and their persistence is reinforced by misaligned evaluation benchmarks.

What Makes Hallucinations Statistically Inevitable?

The research team explains hallucinations as errors inherent to generative modeling. Even with perfectly clean training data, the cross-entropy objective used in pretraining introduces statistical pressures that produce errors.

The research team reduces the problem to a supervised binary classification task called Is-It-Valid (IIV): determining whether a model's output is valid or erroneous. They show that the generative error rate of an LLM is at least twice its IIV misclassification rate. In other words, hallucinations occur for the same reasons misclassifications appear in supervised learning: epistemic uncertainty, poor model families, distribution shift, or noisy data.
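
To make the reduction concrete, here is a minimal toy sketch of the Is-It-Valid framing. The (prompt, output, label) examples and the stand-in classifier below are hypothetical; only the final inequality reflects the paper's stated bound.

```python
# Toy illustration of the Is-It-Valid (IIV) reduction described above.
# The examples and the stand-in classifier are hypothetical; only the
# final inequality mirrors the paper's bound.

iiv_examples = [
    # (prompt, candidate output, is_valid)
    ("Einstein's birthday?", "March 14, 1879", True),
    ("Einstein's birthday?", "June 2, 1881", False),
    ("Capital of Australia?", "Canberra", True),
    ("Capital of Australia?", "Sydney", False),
]

def toy_iiv_classifier(prompt: str, output: str) -> bool:
    """Stand-in judge of whether an output is valid; in the paper this role
    is played implicitly by the language model's own distribution."""
    known_facts = {("Einstein's birthday?", "March 14, 1879")}
    return (prompt, output) in known_facts

# IIV misclassification rate of the stand-in classifier.
errors = sum(toy_iiv_classifier(p, o) != label for p, o, label in iiv_examples)
iiv_error_rate = errors / len(iiv_examples)

# The paper's bound: generative error rate >= 2 * IIV misclassification rate.
print(f"IIV misclassification rate: {iiv_error_rate:.2f}")
print(f"Implied lower bound on generative errors: {2 * iiv_error_rate:.2f}")
```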

Why Do Rare Facts Trigger More Hallucinations?

One major driver is the singleton rate: the fraction of facts that appear only once in the training data. By analogy with Good–Turing missing-mass estimation, if 20% of facts are singletons, the base model is expected to hallucinate on at least 20% of those facts. This explains why LLMs answer reliably about widely repeated facts (e.g., Einstein's birthday) but fail on obscure or rarely mentioned ones.
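
As a rough, back-of-the-envelope illustration (a hand-made toy corpus, not the paper's estimator), the singleton rate can be read off from fact counts in the spirit of Good–Turing missing-mass estimation:

```python
from collections import Counter

# Hypothetical corpus of fact mentions; in practice these would be mined
# from the pretraining data, not listed by hand.
fact_mentions = [
    "einstein_birthday", "einstein_birthday", "einstein_birthday",
    "curie_birthday", "curie_birthday",
    "obscure_poet_birthday",     # mentioned exactly once -> singleton
    "minor_official_birthday",   # mentioned exactly once -> singleton
]

counts = Counter(fact_mentions)
num_singletons = sum(1 for c in counts.values() if c == 1)

# Good-Turing-style reading: the share of mentions that are singletons
# approximates the probability mass sitting on facts the model has
# effectively seen only once, and hence a floor on hallucinations there.
singleton_rate = num_singletons / len(fact_mentions)
print(f"Singleton rate: {singleton_rate:.0%}")  # 2 / 7, roughly 29%
```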

Can Poor Model Families Lead to Hallucinations?

Yes. Hallucinations also emerge when the model class cannot adequately represent a pattern. Classic examples include n-gram models producing ungrammatical sentences, or modern tokenized models miscounting letters because characters are hidden inside subword tokens. These representational limits cause systematic errors even when the data itself is sufficient.
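
The letter-counting failure is easy to see with a sketch. The subword split and token IDs below are hypothetical, since the actual split depends on the tokenizer in use:

```python
word = "strawberry"

# Character-level view: counting letters is trivial.
print(word.count("r"))  # 3

# Hypothetical subword split (the real split depends on the tokenizer).
# A model consuming opaque token IDs never sees individual characters,
# so "how many r's are in strawberry?" is a question its input
# representation does not directly support.
subword_tokens = ["straw", "berry"]
token_ids = [10234, 8871]  # made-up IDs standing in for the model's input
print(subword_tokens, token_ids)
```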

Why Doesn’t Post-Training Eliminate Hallucinations?

Post-training methods such as RLHF (reinforcement learning from human feedback), DPO, and RLAIF reduce some errors, particularly harmful or conspiratorial outputs. But overconfident hallucinations remain because evaluation incentives are misaligned.

Like students guessing on multiple-choice exams, LLMs are rewarded for bluffing when uncertain. Most benchmarks, such as MMLU, GPQA, and SWE-bench, apply binary scoring: correct answers get credit, abstentions (“I don’t know”) get none, and incorrect answers are penalized no more harshly than abstentions. Under this scheme, guessing maximizes benchmark scores, even though it fosters hallucinations.
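
The incentive is simple arithmetic. Under binary grading, any nonzero chance of a lucky guess beats abstaining; the sketch below is a back-of-the-envelope expected-score calculation, not any benchmark's actual scoring code:

```python
def expected_binary_score(p_correct: float) -> float:
    """Expected score under binary grading: 1 point if correct, 0 otherwise.
    Abstaining ("I don't know") also scores 0."""
    return p_correct * 1.0 + (1.0 - p_correct) * 0.0

for p in (0.0, 0.1, 0.5):
    print(f"confidence={p:.1f}  guess={expected_binary_score(p):.2f}  abstain=0.00")

# Even at 10% confidence, guessing has a higher expected score than abstaining,
# so a model optimized for the leaderboard learns to bluff rather than say
# "I don't know".
```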

How Do Leaderboards Reinforce Hallucinations?

A review of popular benchmarks shows that nearly all use binary grading with no partial credit for uncertainty. As a result, models that honestly express uncertainty perform worse than those that always guess. This creates systemic pressure for developers to optimize models for confident answers rather than calibrated ones.

What Changes Could Reduce Hallucinations?

The research team argues that fixing hallucinations requires socio-technical change, not just new evaluation suites. They propose explicit confidence targets: benchmarks should clearly specify penalties for incorrect answers and partial credit for abstentions.

For example: “Answer only if you are >75% confident. Mistakes lose 2 points; correct answers earn 1; ‘I don’t know’ earns 0.”

This design mirrors real-world exams like earlier SAT and GRE formats, where guessing carried penalties. It encourages behavioral calibration: models abstain when their confidence is below the threshold, producing fewer overconfident hallucinations while still optimizing for benchmark performance.
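
Working through the example rule above (+1 for a correct answer, -2 for a mistake, 0 for abstaining) shows where the break-even confidence sits; this is just an expected-value sketch, not code or figures from the paper:

```python
def expected_score(p_correct: float, wrong_penalty: float = 2.0) -> float:
    """Expected score when answering: +1 if correct, -wrong_penalty if wrong.
    Abstaining always scores 0."""
    return p_correct * 1.0 - (1.0 - p_correct) * wrong_penalty

# Break-even point: p - 2 * (1 - p) = 0  =>  p = 2/3.
for p in (0.5, 0.6, 2 / 3, 0.75, 0.9):
    better = "answer" if expected_score(p) > 0 else "abstain"
    print(f"confidence={p:.2f}  expected score={expected_score(p):+.2f}  -> {better}")

# With a 2-point penalty the model should only answer above ~67% confidence,
# so the stated ">75% confident" threshold clears the bar with room to spare;
# raising the penalty to 3 points would move the break-even to exactly 75%.
```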

What Are the Broader Implications?

This work reframes hallucinations as predictable outcomes of training objectives and evaluation misalignment rather than inexplicable quirks. The findings highlight:

  • Pretraining inevitability: Hallucinations parallel misclassification errors in supervised learning.
  • Post-training reinforcement: Binary grading schemes incentivize guessing.
  • Evaluation reform: Adjusting mainstream benchmarks to reward uncertainty can realign incentives and improve trustworthiness.

By connecting hallucinations to established learning theory, the research demystifies their origin and suggests practical mitigation strategies that shift responsibility from model architectures to evaluation design.


Check out the PAPER and technical details here. Feel free to visit our GitHub Page for Tutorials, Codes and Notebooks. Also, follow us on Twitter and don’t forget to join our 100k+ ML SubReddit and subscribe to our Newsletter.

