10.09.2025
5 mins
Why language models hallucinate - New Research by OpenAI
Victor Journoud
Co-Founder & Partner
From Guessing to Ground Truth

TL;DR
OpenAI’s study argues that today’s training and evaluation culture rewards models for guessing instead of admitting uncertainty. Because scoreboards focus on accuracy alone, models learn to bluff, which shows up as confident, wrong answers.

The fix isn’t a single benchmark; it’s a shift in how we grade: penalize confident errors more than expressions of uncertainty and give partial credit for a calibrated “I don’t know.” Some questions are inherently unanswerable from text alone, so accuracy will never reach 100 percent. Smaller models can even be better at knowing their limits.

What the paper actually says

  • Hallucinations are fluent but wrong statements a model makes with confidence. They persist because common evals*1 prize accuracy without considering whether the model should have abstained. On accuracy-only scoreboards, guessing can outperform honesty.
  • Change the incentives. The authors propose grading that penalizes confident errors more than expressions of uncertainty and gives partial credit for appropriate abstention, so models learn to be calibrated instead of bold. This is a scoreboard change, not just “one more benchmark” (a rough scoring sketch follows this list).
  • Why this happens statistically. Pretraining is next-word prediction: it captures stable patterns (spelling, syntax) but not arbitrary, low-frequency facts. Some questions simply can’t be inferred from patterns, which explains the specific kinds of hallucinations we see. Later training helps, but doesn’t eliminate them.
  • Limits and misconceptions.
    • Accuracy will not reach 100% on real tasks, because some questions are unanswerable with available info.
    • Hallucinations are not inevitable: models can abstain.
    • Bigger ≠ always safer: smaller models may find it easier to know their limits; knowing those limits is a calibration problem, not a scale problem.
  • Current state. Newer models reduce hallucinations, especially in reasoning, but the issue remains and must be designed around.
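
To make the incentive change concrete, here is a minimal sketch of such a grading rule in Python. The penalty and partial-credit values are illustrative assumptions, not figures from the paper; the point is only that abstaining scores better than a confident wrong answer.

```python
# Illustrative scoring rule: a wrong guess costs more than an honest "I don't know".
# The specific values (1.0, 0.3, -1.0) are assumptions for this sketch,
# not numbers taken from the OpenAI paper.
from typing import Optional


def score_answer(answer: Optional[str], reference: str) -> float:
    """Score one answer; None means the model abstained."""
    if answer is None:
        return 0.3          # partial credit for calibrated abstention
    if answer == reference:
        return 1.0          # full credit for a correct answer
    return -1.0             # confident error: the costliest outcome


# Under accuracy-only grading, a wrong guess and an abstention both score 0,
# so guessing is never worse than honesty. Under this rule, it is.
print(score_answer("Paris", "Paris"))  # 1.0
print(score_answer(None, "Paris"))     # 0.3
print(score_answer("Lyon", "Paris"))   # -1.0
```

Once wrong guesses carry a real cost, the model’s best strategy shifts from bluffing to answering only when it is reasonably sure.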

What It Means — The Future of Reliable AI and Where Implementations Are Heading

TL;DR

Expect a reliability shift from “Did it get the answer right?” to “Did it answer responsibly?” The industry will move toward calibration-aware metrics, abstention-first product patterns, and grounded outputs tied to sources.

Implementations will show up across evals, training, and UX, making systems more predictable and auditable without promising perfection.

The bigger meaning

  • Reliability becomes multi-dimensional. Accuracy remains important, but being confidently wrong will be treated as a distinct, costlier failure than simple inaccuracy, and abstention will be a legitimate, even preferred, outcome when evidence is weak.
  • Leaderboards will evolve. If the main scoreboards stop rewarding lucky guesses, model development will prioritize calibration and honesty. That, more than any single “anti-hallucination test,” is what changes behavior.
  • “Right-sized” intelligence. The idea that only ever-larger models can be safe is fading. In domains where limits are clear, smaller models that abstain well may offer better risk profiles and lower cost.

What the implementations are likely to look like (at a high level)

  • Evals that price in uncertainty. Standard evaluation suites will add or elevate metrics that explicitly penalize confident errors and credit calibrated abstentions, making guessing a losing strategy.
  • Product UX with an “I don’t know” path. Interfaces will normalize abstention and show when answers are uncertain, rather than forcing a guess. Expect clearer confidence cues and explanations of what the model used to decide.
  • Grounding as default. More answers will be tied to sources of record and citations, so users can verify claims quickly and spot when information is missing altogether.
  • Training for calibration, not bravado. Optimization targets will increasingly include calibration objectives, so models learn to align confidence with correctness instead of maximizing apparent accuracy through risk-taking.
  • Model portfolios. Organizations will deploy a mix of models: some tuned for breadth, others for conservative correctness with frequent abstention in sensitive settings.
  • Transparent reporting. Model cards and system reports will surface abstention rates and confident-error rates, not just accuracy, giving stakeholders a clearer picture of real-world risk (see the sketch after this list).
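
As a rough illustration of what such reporting could compute, the sketch below summarizes an eval run into the metrics mentioned above. The record fields, the 0.8 confidence threshold, and the metric names are assumptions for this example, not an established standard.

```python
# Minimal sketch of calibration-aware reporting. Field names, the 0.8
# confidence threshold, and metric names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class EvalRecord:
    answer: Optional[str]   # None means the model abstained
    reference: str          # ground-truth answer
    confidence: float       # model's self-reported confidence, 0.0 to 1.0


def report(records: List[EvalRecord], confident: float = 0.8) -> dict:
    """Summarize accuracy, abstention rate, and confident-error rate."""
    if not records:
        return {}
    n = len(records)
    answered = [r for r in records if r.answer is not None]
    correct = [r for r in answered if r.answer == r.reference]
    confident_errors = [
        r for r in answered
        if r.answer != r.reference and r.confidence >= confident
    ]
    return {
        "accuracy_overall": len(correct) / n,
        "accuracy_when_answering": len(correct) / len(answered) if answered else 0.0,
        "abstention_rate": (n - len(answered)) / n,
        "confident_error_rate": len(confident_errors) / n,
    }
```

Reporting accuracy-when-answering alongside abstention rate makes the trade-off visible: a model that answers less often but is right when it does speak can be the safer choice in sensitive settings.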

Bottom line

The research reframes hallucinations from mysterious glitches to predictable by-products of how we grade and reward models. The future is less about a perfect, omniscient system and more about well-calibrated systems that know when to speak, when to cite, and when to pause. That shift toward baking uncertainty into evals, training, and UX should make AI more trustworthy and useful in everyday work.

*1 Evals is short for evaluations. In AI and machine learning, it means the tests or assessments used to measure how well a model performs. These can include accuracy tests, reasoning checks, robustness against errors, or whether the model responds responsibly. In short, evals are the tools and methods used to judge the quality and reliability of AI models.

Ready to get started? Let’s Chat