Probability Calibration: Predict Like a Superforecaster
Learn probability calibration like a superforecaster — Brier scores, drills, and free tools to sharpen your forecasts. Based on Philip Tetlock's research.
Most people are confidently wrong. They claim to be 90% sure of things that turn out true only half the time. Probability calibration is the discipline of fixing that gap — making your stated confidence match your actual hit rate. The good news, established by two decades of research from Philip Tetlock and the Good Judgment Project, is that calibration is a trainable skill. Anyone willing to keep score, sit with uncertainty, and revise their views can learn to forecast with the precision of a superforecaster.
This guide explains what calibration is, how Brier scores measure it, what Tetlock's research found about the traits of top forecasters, and the specific drills and free tools that will move your numbers in the right direction.
What probability calibration actually means
It is not the same as being right more often
A forecaster is well-calibrated when their stated probabilities track real-world frequencies. If you say a hundred different events are 70% likely, roughly 70 of them should happen. Not 50, not 90 — 70. Calibration is about the honesty of your confidence, not about how often you happen to be on the winning side.
Accuracy and calibration sound similar but capture different things. A weather forecaster who says "40% chance of rain" every single day in a climate where it actually rains 40% of the time is perfectly calibrated, even though they look uncertain. A pundit who screams "100% certain!" and turns out to be right 80% of the time is wildly overconfident, even though they were technically correct most of the time: by the scoring rule introduced below, their confidence sits closer to a coin flip than to the truth.
This is the lens probabilistic thinkers use to evaluate their own judgement. Once you treat your beliefs as numerical claims about the world, you can grade them. And once you can grade them, you can improve them.
The Tetlock research: how we know calibration is trainable
From Expert Political Judgment to the Good Judgment Project
In the 1980s, the political psychologist Philip Tetlock asked a simple question: are pundits any better than chance at predicting world events? He recruited 284 experts — academics, government analysts, journalists — and tracked roughly 28,000 forecasts over 20 years. The headline finding, published in 2005's Expert Political Judgment, was unflattering. The average expert was barely better than random guessing, and famous experts were often worse than less well-known ones, because their confidence outran their evidence.
Crucially, Tetlock identified two cognitive styles. Hedgehogs cling to one big idea and apply it everywhere; foxes draw from many frameworks and update freely. Foxes were systematically better forecasters. That distinction set up the next phase of his work.
From 2011 to 2015, the U.S. intelligence community ran the Aggregative Contingent Estimation tournament, pitting forecasting teams against one another on real geopolitical questions. Tetlock's Good Judgment Project didn't just win — it beat the competing university teams so decisively that they were dropped from the tournament after two years. The top 2% of forecasters, dubbed superforecasters, posted Brier scores roughly 30% better than ordinary smart participants. Most importantly, they got better over time, and so did the people they trained.
That is the paper trail that justifies treating calibration as a skill, not a personality trait.
How a Brier score works
The metric that turns vague confidence into a number you can shrink
The Brier score, proposed by meteorologist Glenn Brier in 1950, is the standard metric for grading probabilistic forecasts of binary events. The formula is simple: take the difference between your stated probability and the outcome (1 for happened, 0 for did not happen), square it, and average across many forecasts. Lower is better. A perfect Brier score is 0; a coin-flip forecast on a 50/50 event scores 0.25; predicting 0% or 100% on something that goes the other way scores the maximum of 1.
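As a concrete illustration, here is a minimal Python sketch of that formula; the function name and the three example forecasts are invented for demonstration.

```python
def brier_score(forecasts):
    """Mean squared difference between stated probabilities and outcomes.

    forecasts: (probability, outcome) pairs, where outcome is 1 if the
    event happened and 0 if it did not. Lower is better.
    """
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# Three hypothetical forecasts and how they resolved.
print(brier_score([(0.9, 1), (0.7, 0), (0.5, 1)]))  # 0.25
```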
Two properties make Brier scores useful. First, they reward calibration directly: the score penalises overconfidence and underconfidence symmetrically. Second, they decompose neatly — statisticians can split a Brier score into a calibration (reliability) term, a resolution term, and an irreducible uncertainty term — so you can see whether you are losing points by being miscalibrated or by being too timid (always sitting near 50%).
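For readers who want to see that split directly, the standard form is the Murphy decomposition, which writes the Brier score as reliability (calibration error) minus resolution plus irreducible uncertainty. A minimal sketch, binning forecasts by their exact stated probability for simplicity (the function name is ours):

```python
from collections import defaultdict

def murphy_decomposition(forecasts):
    """Split a Brier score into reliability, resolution, and uncertainty,
    so that brier = reliability - resolution + uncertainty.

    forecasts: (probability, outcome) pairs, grouped here by their
    exact stated probability.
    """
    n = len(forecasts)
    base_rate = sum(outcome for _, outcome in forecasts) / n
    bins = defaultdict(list)
    for p, outcome in forecasts:
        bins[p].append(outcome)
    reliability = sum(
        len(v) * (p - sum(v) / len(v)) ** 2 for p, v in bins.items()) / n
    resolution = sum(
        len(v) * (sum(v) / len(v) - base_rate) ** 2 for v in bins.values()) / n
    uncertainty = base_rate * (1 - base_rate)
    return reliability, resolution, uncertainty
```

A chronic hedger who sits at 50% on everything shows up here with near-zero reliability error but also near-zero resolution: exactly the "too timid" failure just described.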
What separates superforecasters from the rest
The traits Tetlock identified across thousands of forecasts
The 2015 book Superforecasting distilled the traits that consistently distinguished top performers. None of them are rare gifts. They are habits.
Start from the base rate. Before considering specifics, superforecasters look up how often this kind of event happens historically. A 5% base rate is the gravity their forecast has to escape.
Break the question into parts. A messy question ("will Russia invade by year-end?") becomes a chain of sub-questions ("are troops at the border?", "have evacuations begun?", "what's the diplomatic posture?").
Update in small increments. Rather than swing wildly between 20% and 80% on each headline, they nudge their estimate by a few percentage points as evidence accumulates — Bayesian updating in spirit if not in formal math (see the sketch below).
Commit, but stay provisional. They commit to a number, yet treat it as revisable. Being proven wrong is information, not humiliation.
Hunt for disconfirming evidence. Most people search for reasons their hypothesis is right. Superforecasters deliberately hunt for reasons it might be wrong.
Take the outside view. They step back and ask what a reasonable observer with no stake in the outcome would conclude, neutralising the gravitational pull of personal narrative.
Keep score. Every forecast is logged with a probability, a deadline, and a resolution. Without the score, there is no feedback. Without feedback, there is no learning.
Notice what is not on this list: domain expertise. The Good Judgment Project found that subject-matter experts had no systematic edge over generalist superforecasters who applied these habits rigorously. The process beat the credentials.
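To make the incremental-updating habit concrete, here is a minimal sketch of a Bayesian update in odds form; the 20% prior and the likelihood ratio of 2 are invented for illustration.

```python
def bayes_update(prior, likelihood_ratio):
    """Return the posterior probability after one piece of evidence.

    likelihood_ratio: how much more likely the evidence is if the
    event will happen than if it won't. Posterior odds equal prior
    odds times the likelihood ratio.
    """
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * likelihood_ratio
    return posterior_odds / (1 + posterior_odds)

# Hypothetical: start at 20%, see evidence twice as likely under "yes".
print(round(bayes_update(0.20, 2.0), 3))  # 0.333 -- a nudge, not a swing
```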
Calibration training drills you can do this week
Practical exercises that produce measurable improvement
Run a trivia calibration drill
Find 50 binary trivia questions you don't already know the answer to. Answer each one and assign a confidence between 50% and 100%. Grade yourself: of your 80% answers, how many were right? Of your 90% answers? Most people discover their 90% confidence is more like 70% accuracy. The gap is your overconfidence, made legible.
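If you log the drill in code rather than on paper, a few lines of Python will make the gap visible. This sketch assumes you record each answer as a (stated confidence, was it correct) pair; the function name is ours.

```python
from collections import defaultdict

def calibration_table(answers):
    """Print the actual hit rate for each stated-confidence bucket.

    answers: (confidence, correct) pairs, e.g. (0.8, True),
    with confidence between 0.5 and 1.0.
    """
    buckets = defaultdict(list)
    for confidence, correct in answers:
        buckets[round(confidence, 1)].append(correct)
    for confidence in sorted(buckets):
        results = buckets[confidence]
        hit_rate = sum(results) / len(results)
        print(f"said {confidence:.0%}: right {hit_rate:.0%} ({len(results)} answers)")
```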
Forecast 10 events with deadlines
Pick ten things that will resolve in the next 30–90 days — election results, sports outcomes, product launches, weather. Assign a probability to each, in writing, before the resolution date. After they resolve, calculate your Brier score and compare it against later batches.
Practise the pre-mortem reflex
Before committing to a forecast, ask: "if this turns out wrong, what was the most likely reason?" Generating two or three failure modes typically pulls overconfident estimates closer to the centre. The pre-mortem is itself a tool of probabilistic thinking — see our <a href="/blog/pre-mortem-decision-making/">pre-mortem guide</a> for the full method.
Argue against your own view in writing
Once a week, write a 200-word case for the opposite of something you believe. The exercise reliably surfaces hidden assumptions and forces you to engage with the strongest version of the counter-argument, not a strawman. This is the disconfirming-evidence habit in action.
The first drill is the most important. Most people have never felt the visceral surprise of seeing their 95% answers proven wrong 30% of the time. That feeling — the calibration gut-punch — is what motivates everything after.
Pair these drills with consistent record-keeping. A spreadsheet with five columns (date, claim, probability, deadline, outcome) is enough. Once a month, review the rows that have resolved and compute a rough Brier score. The trend matters more than the absolute number.
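As a sketch of that monthly review, assuming the spreadsheet is exported as a hypothetical forecasts.csv with those five column headers, and with outcome recorded as 1 or 0 and left blank until the forecast resolves:

```python
import csv

# Columns assumed: date, claim, probability, deadline, outcome.
with open("forecasts.csv", newline="") as f:
    resolved = [row for row in csv.DictReader(f) if row["outcome"].strip()]

score = sum(
    (float(row["probability"]) - int(row["outcome"])) ** 2
    for row in resolved) / len(resolved)
print(f"Brier score over {len(resolved)} resolved forecasts: {score:.3f}")
```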
Free tools for tracking your calibration
Platforms that will measure your forecasts so you don't have to
Good Judgment Open. The public-facing successor to Tetlock's tournament: you forecast on real geopolitical and economic questions and see your Brier score against a benchmark community of strong forecasters. The closest thing to formal training that exists for free.
Metaculus. A community forecasting platform with thousands of resolved questions, a robust scoring system, and detailed personal calibration plots that show your hit rate at each confidence level. Ideal for technology, science, and global affairs forecasting.
Quantified Intuitions. A lightweight web app from Sage with focused calibration drills (estimation, pastcasting, calibration-in-context). Best for short, deliberate practice sessions of 10–15 minutes.
Calibrate Your Judgment. A free trivia-style calibration trainer used inside Open Philanthropy and adjacent organisations. Excellent for the first drill above — instant feedback on hundreds of questions.
Prediction markets. Real-money and play-money markets whose prices aggregate many forecasters' calibrated views into a single probability — see our explainer on <a href="/blog/how-prediction-markets-work/">how prediction markets work</a> to understand what those numbers actually mean.
Pick one and use it weekly. The platform matters less than the consistency. A hundred forecasts logged over six months will tell you more about your judgement than ten thousand opinions you've voiced and forgotten.
Books to deepen your understanding
Tetlock's own work and adjacent classics
Affiliate disclosure: the book links below are affiliate links — if you buy through them, we may earn a small commission at no extra cost to you. The recommendations reflect the books' standing in the field, not the commission rate.
Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner is the practitioner-friendly synthesis of the Good Judgment Project. It mixes biography, methodology, and the seven habits at length, and it is the single best entry point into calibration as a discipline.
Expert Political Judgment: How Good Is It? How Can We Know? by Philip Tetlock is the academic precursor — denser, more statistical, and the foundation for everything that came after. Read this if you want to see the original 20-year dataset and the hedgehog-versus-fox distinction in its primary form.
Two complementary classics: Thinking, Fast and Slow by Daniel Kahneman explains the cognitive machinery that produces miscalibration, and Noise by Kahneman, Sibony and Sunstein extends the analysis to the variability between experts on the same case. Together they explain why calibration is hard before showing you how to fix it. Our broader decision-making book list and the probabilistic thinking reading list place these in a wider context.
Common calibration mistakes
Failure modes that quietly inflate your Brier score
Overconfidence on familiar topics. The more you know about a field, the more confident you feel — and the more your confidence outruns the evidence. Domain experts are often worse-calibrated than informed generalists for exactly this reason. The fix is to keep checking your specialist forecasts against base rates the same way you would for an unfamiliar topic. Our piece on the Dunning-Kruger effect covers the cognitive mechanism behind this trap.
Hedging everywhere by sitting near 50%. The opposite failure: refusing to commit to a number out of false humility. A forecaster who calls everything 50/50 is uninformative even if technically well-calibrated, because they have provided zero useful signal. The Brier-score decomposition catches this: low calibration error but also low resolution.
Ignoring base rates. Treating every situation as unique strips away the most useful prior you have. If 5% of startups in this category succeed, your forecast for this specific startup needs a very strong reason to depart from 5%. We cover this in base rate neglect at length.
Anchoring on the first number you saw. Whatever probability you encountered first — in a headline, a friend's guess, a market price — silently pulls your own estimate towards it. Force yourself to write down a forecast before looking at any external estimates.
Forgetting to update. Sticking with a forecast made a month ago, when three relevant news items have since landed, is a classic error. Calibrated forecasters revisit their probabilities at every meaningful new data point — see Bayesian thinking for everyday decisions for the underlying logic.
Where calibration pays off in practice
From decision-making to bet sizing
Calibration is upstream of every probabilistic decision-making technique. Expected value calculations are only as accurate as the probabilities you feed in — overconfident inputs produce systematically wrong EVs. The Kelly criterion for sizing investments or bets requires a calibrated edge estimate; at even odds, thinking you have a 60% chance when reality is 52% makes Kelly recommend a 20% stake where 4% is optimal, a level of over-betting at which your bankroll shrinks over time. Even simple risk-versus-uncertainty reasoning depends on knowing when your probabilities are real and when they are guesses dressed up as numbers.
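A small sketch of that Kelly failure mode, assuming an even-money bet; the helper names are ours and the numbers match the example above.

```python
import math

def kelly_fraction(p, b=1.0):
    """Kelly stake for win probability p on a bet paying b-to-1."""
    return p - (1 - p) / b

def growth_rate(f, p, b=1.0):
    """Expected log-growth per bet when staking fraction f of the
    bankroll with true win probability p."""
    return p * math.log(1 + b * f) + (1 - p) * math.log(1 - f)

stake = kelly_fraction(0.60)        # believed 60% edge -> stake 20%
print(growth_rate(stake, 0.60))     # ~ +0.020: grows if you were right
print(growth_rate(stake, 0.52))     # ~ -0.012: shrinks at the true 52%
print(kelly_fraction(0.52))         # 0.04: the stake reality justified
```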
The professional contexts where calibration training has caught on most quickly — intelligence analysis, pandemic preparedness, financial trading, AI safety research — are exactly the ones where the cost of confidently wrong predictions is highest. The lesson generalises. Any domain where you are paid for judgement under uncertainty is a domain where calibration is a leveraged skill.
Frequently asked questions
Quick answers to common calibration training questions
How long does it take to become well-calibrated?
There is no fixed threshold, but the drills above run on 30–90-day resolution cycles, and with weekly practice and honest scorekeeping most people see their confidence levels start tracking reality within a year.
Is calibration the same as Bayesian thinking?
No, but they are close cousins. Bayesian thinking is the process of updating probabilities as evidence arrives; calibration is the test of whether the probabilities you end up stating actually match real-world frequencies.
What's a 'good' Brier score?
It depends on the difficulty of the questions, but 0.25 is the coin-flip benchmark on 50/50 events, so anything consistently below that beats chance. Superforecasters scored roughly 30% better than other capable participants; the trend in your own score matters more than the absolute number.
Can you train calibration without keeping score?
No. The score is the feedback loop: without logged forecasts and resolutions there is no feedback, and without feedback there is no learning.
Does calibration matter if I'm not making big decisions?
Yes. Calibration is upstream of every judgement you make under uncertainty, from small bets and plans to how you read the news, so the habit pays off long before the stakes get high.
Where to start this week
A 30-minute on-ramp
Open a spreadsheet. Add five columns: date, claim, probability, deadline, outcome. Make ten forecasts about events that will resolve in the next 90 days — anything from sports results to product launches to news headlines. Sign up for a free Good Judgment Open or Metaculus account and add five forecasts there too. Set a calendar reminder for 90 days from now to grade them and compute a rough Brier score. Then read Superforecasting while you wait.
That is the entire training programme. The rest is repetition, honest scorekeeping, and the willingness to be wrong in writing. Within a year, your stated 80% confidence will start to mean what it says — and every probabilistic decision downstream of your forecasts will get measurably better.
Keep building your probability toolkit
Once your calibration is improving, put those probabilities to work. The expected value framework turns calibrated estimates into better decisions.