Arrows clustered tightly on a target — a metaphor for well-calibrated probability forecasts

Probability Calibration: Predict Like a Superforecaster


Most people are confidently wrong. They claim to be 90% sure of things that turn out true only half the time. Probability calibration is the discipline of fixing that gap — making your stated confidence match your actual hit rate. The good news, established by two decades of research from Philip Tetlock and the Good Judgment Project, is that calibration is a trainable skill. Anyone willing to keep score, sit with uncertainty, and revise their views can learn to forecast with the precision of a superforecaster.

This guide explains what calibration is, how Brier scores measure it, what Tetlock's research found about the traits of top forecasters, and the specific drills and free tools that will move your numbers in the right direction.

What probability calibration actually means

It is not the same as being right more often

A forecaster is well-calibrated when their stated probabilities track real-world frequencies. If you say a hundred different events are 70% likely, roughly 70 of them should happen. Not 50, not 90 — 70. Calibration is about the honesty of your confidence, not about how often you happen to be on the winning side.

Accuracy and calibration sound similar but capture different things. A weather forecaster who says "40% chance of rain" every single day in a climate where it actually rains 40% of the time is perfectly calibrated, even though they look uncertain. A pundit who screams "100% certain!" and turns out to be right 80% of the time is wildly overconfident, even though they were technically correct most of the time: their stated certainty overstates their track record by 20 points.

This is the lens probabilistic thinkers use to evaluate their own judgement. Once you treat your beliefs as numerical claims about the world, you can grade them. And once you can grade them, you can improve them.

The Tetlock research: how we know calibration is trainable

From Expert Political Judgment to the Good Judgment Project

In the 1980s, the political psychologist Philip Tetlock asked a simple question: are pundits any better than chance at predicting world events? He recruited 284 experts — academics, government analysts, journalists — and tracked roughly 28,000 forecasts over 20 years. The headline finding, published in 2005's Expert Political Judgment, was unflattering. The average expert was barely better than random guessing, and famous experts were often worse than less well-known ones, because their confidence outran their evidence.

Crucially, Tetlock identified two cognitive styles. Hedgehogs cling to one big idea and apply it everywhere; foxes draw from many frameworks and update freely. Foxes were systematically better forecasters. That distinction set up the next phase of his work.

From 2011 to 2015, the U.S. intelligence community ran the Aggregative Contingent Estimation tournament, pitting forecasting teams against one another on real geopolitical questions. Tetlock's Good Judgment Project didn't just win — it outperformed its university rivals by such a margin that they were dropped from the tournament. The top 2% of forecasters, dubbed superforecasters, posted Brier scores roughly 30% better than ordinary smart participants. Most importantly, they got better over time, and so did the people they trained.

That is the paper trail that justifies treating calibration as a skill, not a personality trait.

How a Brier score works

The metric that turns vague confidence into a number you can shrink

The Brier score, proposed by meteorologist Glenn Brier in 1950, is the standard metric for grading probabilistic forecasts of binary events. The formula is simple: take the difference between your stated probability and the outcome (1 for happened, 0 for did not happen), square it, and average across many forecasts. Lower is better. A perfect Brier score is 0; a coin-flip forecast on a 50/50 event scores 0.25; predicting 0% or 100% on something that goes the other way scores the maximum of 1.

Two properties make Brier scores useful. First, they reward calibration directly: the score penalises overconfidence and underconfidence symmetrically. Second, they decompose neatly — statisticians can split a Brier score into a calibration term and a resolution term, so you can see whether you are losing points by being miscalibrated or by being too timid (always sitting near 50%).
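The arithmetic is small enough to sketch in a few lines of Python. The forecast log here is invented for illustration, but the scoring rule is exactly the one described above:

```python
def brier_score(forecasts):
    """Mean squared difference between stated probabilities and outcomes."""
    return sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)

# (stated probability, outcome: 1 = happened, 0 = did not)
log = [(0.9, 1), (0.7, 1), (0.7, 0), (0.5, 1), (0.2, 0)]
print(round(brier_score(log), 3))   # 0.176

print(brier_score([(0.5, 1)]))      # coin-flip forecast on a 50/50 event: 0.25
print(brier_score([(1.0, 0)]))      # maximum-confidence miss: 1.0
```

Run over a real forecast log, the same three lines give you the number this article keeps referring to.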

What separates superforecasters from the rest

The traits Tetlock identified across thousands of forecasts

The 2015 book Superforecasting distilled the traits that consistently distinguished top performers. None of them are rare gifts. They are habits.

1. They start with the base rate

Before considering specifics, superforecasters look up how often this kind of event happens historically. A 5% base rate is the gravity their forecast has to escape.

2. They break problems into smaller pieces

A messy question ("will Russia invade by year-end?") becomes a chain of sub-questions ("are troops at the border?", "have evacuations begun?", "what's the diplomatic posture?").

3. They update incrementally

Rather than swing wildly between 20% and 80% on each headline, they nudge their estimate by a few percentage points as evidence accumulates — Bayesian updating in spirit if not in formal math.

4. They hold strong opinions weakly

They commit to a number, but treat it as provisional. Being proven wrong is information, not humiliation.

5. They look for disconfirming evidence

Most people search for reasons their hypothesis is right. Superforecasters deliberately hunt for reasons it might be wrong.

6. They take the outside view

They step back and ask what a reasonable observer with no stake in the outcome would conclude, neutralising the gravitational pull of personal narrative.

7. They keep score

Every forecast is logged with a probability, a deadline, and a resolution. Without the score, there is no feedback. Without feedback, there is no learning.

Notice what is not on this list: domain expertise. The Good Judgment Project found that subject-matter experts had no systematic edge over generalist superforecasters who applied these habits rigorously. The process beat the credentials.

Calibration training drills you can do this week

Practical exercises that produce measurable improvement

1. Run a trivia calibration drill

Find 50 binary trivia questions you don't already know the answer to. Answer each one and assign a confidence between 50% and 100%. Grade yourself: of your 80% answers, how many were right? Of your 90% answers? Most people discover their 90% confidence is more like 70% accuracy. The gap is your overconfidence, made legible.
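Grading the drill is just a grouping exercise. A minimal sketch in Python, with invented drill results standing in for your own answers:

```python
from collections import defaultdict

def calibration_table(answers):
    """Group (stated confidence, correct) pairs and report the actual
    hit rate at each confidence level."""
    buckets = defaultdict(list)
    for confidence, correct in answers:
        buckets[confidence].append(correct)
    return {c: (len(v), sum(v) / len(v)) for c, v in sorted(buckets.items())}

# hypothetical results: (confidence you stated, 1 = right, 0 = wrong)
drill = [(0.9, 1), (0.9, 0), (0.9, 1), (0.9, 0),
         (0.7, 1), (0.7, 1), (0.7, 0), (0.6, 1)]
for conf, (n, hit_rate) in calibration_table(drill).items():
    print(f"said {conf:.0%}: right {hit_rate:.0%} of the time (n={n})")
```

Any row where the hit rate sits well below the stated confidence is your overconfidence, made legible.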

2. Forecast 10 events with deadlines

Pick ten things that will resolve in the next 30–90 days — election results, sports outcomes, product launches, weather. Assign a probability to each, in writing, before the resolution date. After they resolve, calculate your Brier score and compare it against later batches.

3. Practise the pre-mortem reflex

Before committing to a forecast, ask: "if this turns out wrong, what was the most likely reason?" Generating two or three failure modes typically pulls overconfident estimates closer to the centre. The pre-mortem is itself a tool of probabilistic thinking — see our <a href="/blog/pre-mortem-decision-making/">pre-mortem guide</a> for the full method.

4. Argue against your own view in writing

Once a week, write a 200-word case for the opposite of something you believe. The exercise reliably surfaces hidden assumptions and forces you to engage with the strongest version of the counter-argument, not a strawman. This is the disconfirming-evidence habit in action.

The first drill is the most important. Most people have never felt the visceral surprise of seeing their 95% answers proven wrong 30% of the time. That feeling — the calibration gut-punch — is what motivates everything after.

Pair these drills with consistent record-keeping. A spreadsheet with five columns (date, claim, probability, deadline, outcome) is enough. Once a month, review the rows that have resolved and compute a rough Brier score. The trend matters more than the absolute number.

Free tools for tracking your calibration

Platforms that will measure your forecasts so you don't have to

Good Judgment Open

The public-facing successor to Tetlock's tournament. You forecast on real geopolitical and economic questions and see your Brier score against a benchmark community of strong forecasters. The closest thing to formal training that exists for free.

Metaculus

A community forecasting platform with thousands of resolved questions, a robust scoring system, and detailed personal calibration plots that show your hit rate at each confidence level. Ideal for technology, science, and global affairs forecasting.

Quantified Intuitions

A lightweight web app from Sage with focused calibration drills (estimation, pastcasting, calibration-in-context). Best for short, deliberate practice sessions of 10–15 minutes.

Calibrate Your Judgment (Open Philanthropy)

A free trivia-style calibration trainer used inside Open Philanthropy and adjacent organisations. Excellent for the first drill above — instant feedback on hundreds of questions.

PredictIt and Manifold

Real-money and play-money prediction markets. The market price aggregates many forecasters' calibrated views into a single probability — see our explainer on <a href="/blog/how-prediction-markets-work/">how prediction markets work</a> to understand what those numbers actually mean.

Pick one and use it weekly. The platform matters less than the consistency. A hundred forecasts logged over six months will tell you more about your judgement than ten thousand opinions you've voiced and forgotten.

Books to deepen your understanding

Tetlock's own work and adjacent classics

Affiliate disclosure: the book links below are affiliate links — if you buy through them, we may earn a small commission at no extra cost to you. The recommendations reflect the books' standing in the field, not the commission rate.

Superforecasting: The Art and Science of Prediction by Philip Tetlock and Dan Gardner is the practitioner-friendly synthesis of the Good Judgment Project. It mixes biography, methodology, and the seven habits at length, and it is the single best entry point into calibration as a discipline.

Expert Political Judgment: How Good Is It? How Can We Know? by Philip Tetlock is the academic precursor — denser, more statistical, and the foundation for everything that came after. Read this if you want to see the original 20-year dataset and the hedgehog-versus-fox distinction in its primary form.

Two complementary classics: Thinking, Fast and Slow by Daniel Kahneman explains the cognitive machinery that produces miscalibration, and Noise by Kahneman, Sibony and Sunstein extends the analysis to the variability between experts on the same case. Together they explain why calibration is hard before showing you how to fix it. Our broader decision-making book list and the probabilistic thinking reading list place these in a wider context.

Common calibration mistakes

Failure modes that quietly inflate your Brier score

Overconfidence on familiar topics. The more you know about a field, the more confident you feel — and the more your confidence outruns the evidence. Domain experts are often worse-calibrated than informed generalists for exactly this reason. The fix is to keep checking your specialist forecasts against base rates the same way you would for an unfamiliar topic. Our piece on the Dunning-Kruger effect covers the cognitive mechanism behind this trap.

Hedging everywhere by sitting near 50%. The opposite failure: refusing to commit to a number out of false humility. A forecaster who calls everything 50/50 is uninformative even if technically well-calibrated, because they have provided zero useful signal. The Brier-score decomposition catches this: low calibration error but also low resolution.
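That decomposition is the standard Murphy decomposition (Brier = reliability − resolution + uncertainty), and it is short enough to sketch. The all-50% hedger below is invented data, constructed to show the telltale signature:

```python
from collections import defaultdict

def murphy_decomposition(forecasts):
    """Split a Brier score into reliability (calibration error),
    resolution (informativeness), and uncertainty (base-rate variance):
    Brier = reliability - resolution + uncertainty."""
    n = len(forecasts)
    base_rate = sum(o for _, o in forecasts) / n
    bins = defaultdict(list)
    for p, o in forecasts:
        bins[p].append(o)
    reliability = sum(len(v) * (p - sum(v) / len(v)) ** 2
                      for p, v in bins.items()) / n
    resolution = sum(len(v) * (sum(v) / len(v) - base_rate) ** 2
                     for v in bins.values()) / n
    return reliability, resolution, base_rate * (1 - base_rate)

# a forecaster who says 50% to everything, in a world where half happens:
hedger = [(0.5, o) for o in (1, 0, 1, 0, 1, 0)]
rel, res, unc = murphy_decomposition(hedger)
print(rel, res, unc)  # 0.0 0.0 0.25 -> perfectly calibrated, zero resolution
```

Zero reliability error looks virtuous until you notice the zero resolution next to it: the hedger's Brier score is pure base-rate uncertainty, with no information added.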

Ignoring base rates. Treating every situation as unique strips away the most useful prior you have. If 5% of startups in this category succeed, your forecast for this specific startup needs a very strong reason to depart from 5%. We cover this in base rate neglect at length.

Anchoring on the first number you saw. Whatever probability you encountered first — in a headline, a friend's guess, a market price — silently pulls your own estimate towards it. Force yourself to write down a forecast before looking at any external estimates.

Forgetting to update. Sticking with a forecast made a month ago, when three relevant news items have since landed, is a classic error. Calibrated forecasters re-resolve their own probabilities at every meaningful new data point — see Bayesian thinking for everyday decisions for the underlying logic.
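The update rule itself is a one-liner. A sketch of a single Bayesian update, with hypothetical numbers: a 30% forecast meets evidence four times as likely if the hypothesis is true than if it is false:

```python
def bayes_update(prior, likelihood_if_true, likelihood_if_false):
    """Posterior probability of a hypothesis after one piece of evidence."""
    joint_true = prior * likelihood_if_true
    joint_false = (1 - prior) * likelihood_if_false
    return joint_true / (joint_true + joint_false)

posterior = bayes_update(prior=0.30,
                         likelihood_if_true=0.8,
                         likelihood_if_false=0.2)
print(round(posterior, 3))  # 0.632
```

Note the moderation: strong evidence moves 30% to roughly 63%, not to 95%. That is the incremental updating superforecasters practise, applied at each meaningful data point.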

Where calibration pays off in practice

From decision-making to bet sizing

Calibration is upstream of every probabilistic decision-making technique. Expected value calculations are only as accurate as the probabilities you feed in — overconfident inputs produce systematically wrong EVs. The Kelly criterion for sizing investments or bets requires a calibrated edge estimate; if you think you have a 60% chance when reality is 52%, Kelly will size your bets so aggressively that your expected growth turns negative. Even simple risk-versus-uncertainty reasoning depends on knowing when your probabilities are real and when they are guesses dressed up as numbers.
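The Kelly arithmetic makes that cost concrete. A sketch assuming an even-money bet, with the 60%-versus-52% figures from the paragraph above:

```python
import math

def kelly_fraction(p, b=1.0):
    """Optimal bankroll fraction for win probability p at net odds b."""
    return p - (1 - p) / b

def expected_log_growth(stake, p, b=1.0):
    """Expected log-growth per bet when staking `stake` at true win prob p."""
    return p * math.log(1 + b * stake) + (1 - p) * math.log(1 - stake)

believed, actual = 0.60, 0.52
stake = kelly_fraction(believed)        # bet sized to the believed edge
print(f"stake at believed 60% edge: {stake:.2f}")                    # 0.20
print(f"stake at actual 52% edge:  {kelly_fraction(actual):.2f}")    # 0.04
print(f"true growth at the oversized stake: "
      f"{expected_log_growth(stake, actual):+.4f}")                  # negative
```

A real 4% edge warrants a 4% stake; the miscalibrated forecaster bets five times that, and the long-run growth rate of the bankroll goes below zero despite the genuine edge.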

The professional contexts where calibration training has caught on most quickly — intelligence analysis, pandemic preparedness, financial trading, AI safety research — are exactly the ones where the cost of confidently wrong predictions is highest. The lesson generalises. Any domain where you are paid for judgement under uncertainty is a domain where calibration is a leveraged skill.

Frequently asked questions

Quick answers to common calibration training questions

How long does it take to become well-calibrated?
The Good Judgment Project found measurable improvement in Brier scores after about 100 logged forecasts, and substantial improvement after a year of consistent practice. The first 50 forecasts mostly serve to make you confront how miscalibrated you started out.
Is calibration the same as Bayesian thinking?
They are closely related but not identical. Bayesian thinking is the formal method for updating beliefs in light of evidence; calibration is the property of having those beliefs match reality. You can be a competent Bayesian and still be poorly calibrated if your priors are overconfident, and you can be well-calibrated through experience even without doing any explicit Bayesian arithmetic.
What's a 'good' Brier score?
It depends on the difficulty of the questions. On the Good Judgment Project's geopolitical questions, a Brier score below 0.25 was respectable, below 0.20 was strong, and superforecasters averaged around 0.15. On easier domains the targets are lower. The right benchmark is your own past performance, not an absolute number.
Can you train calibration without keeping score?
Almost certainly not. Without a written record of the probability, the deadline, and the resolution, every forecast disappears into hindsight bias — you remember the ones you got right and forget the ones you got wrong. Keeping score is what makes the feedback loop honest.
Does calibration matter if I'm not making big decisions?
It matters for the small ones too. Calibrated thinking improves how you weigh advice, evaluate news headlines, and notice when a confident voice is bluffing. It is also one of the few intellectual habits that compounds — every honest probability you state is training data for the next one.

Where to start this week

A 30-minute on-ramp

Open a spreadsheet. Add five columns: date, claim, probability, deadline, outcome. Make ten forecasts about events that will resolve in the next 90 days — anything from sports results to product launches to news headlines. Sign up for a free Good Judgment Open or Metaculus account and add five forecasts there too. Set a calendar reminder for 90 days from now to grade them and compute a rough Brier score. Then read Superforecasting while you wait.

That is the entire training programme. The rest is repetition, honest scorekeeping, and the willingness to be wrong in writing. Within a year, your stated 80% confidence will start to mean what it says — and every probabilistic decision downstream of your forecasts will get measurably better.

Keep building your probability toolkit

Once your calibration is improving, put those probabilities to work. The expected value framework turns calibrated estimates into better decisions.

Read the EV guide