Bayes Theorem Explained: Formula and 5 Worked Examples

Bayes theorem explained from the formula up: derivation, intuitive examples (medical tests, spam filters), and the base-rate trap that fools experts.

Updated 26 June 2026

By Rob Griffiths · 26 June 2026 · 12 min read

Bayes theorem is the mathematical rule for updating a belief when new evidence arrives. It takes a prior probability, a likelihood, and the overall probability of the evidence, and returns a posterior probability - the answer to the question "given what I just saw, how likely is the thing I care about?"

It is short, exact, and almost universally counter-intuitive. The same equation that powers spam filters and medical diagnostics also explains why most positive cancer screens are false alarms, why courtroom statistics get reversed on appeal, and why so many forecasts feel obviously wrong in hindsight. The maths is high-school algebra. The hard part is the framing.

What is the actual Bayes theorem formula?

Written formally:

P(A | B) = P(B | A) × P(A) / P(B)

Every symbol has a name:

P(A | B) - the posterior. The probability that A is true given that B has been observed. The answer the formula returns.
P(B | A) - the likelihood. The probability that B would occur if A were in fact true. This is usually what test data provides.
P(A) - the prior or base rate. The probability that A is true before any evidence is taken into account. The most-ignored term in the formula.
P(B) - the marginal. The total probability of observing B across all possible worlds. Often computed as P(B | A)P(A) + P(B | not A)P(not A).

An equivalent expanded form is often more useful because it forces you to write out the denominator:

P(A | B) = [P(B | A) × P(A)] / [P(B | A)P(A) + P(B | ¬A)P(¬A)]
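
As a minimal sketch, the expanded form translates directly into a few lines of Python. The function name `posterior` and its parameter names are illustrative choices, not part of any library, and the numbers in the usage line are made up.

```python
def posterior(prior, likelihood, likelihood_if_not):
    """Return P(A | B) from the expanded form of Bayes theorem.

    prior             -- P(A)
    likelihood        -- P(B | A)
    likelihood_if_not -- P(B | not A)
    """
    numerator = likelihood * prior
    marginal = numerator + likelihood_if_not * (1.0 - prior)  # P(B)
    if marginal == 0:
        raise ValueError("evidence has zero probability under both hypotheses")
    return numerator / marginal


# Evidence three times likelier under A than under not-A (illustrative numbers).
print(round(posterior(prior=0.5, likelihood=0.6, likelihood_if_not=0.2), 3))
```

Writing the denominator as an explicit sum makes it hard to forget the P(B | not A) term, which is the one most often left out.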

How is Bayes theorem derived in two lines?

Bayes theorem falls out of the definition of conditional probability with one substitution. Start with the standard definition:

P(A | B) = P(A and B) / P(B)

By symmetry, the joint probability P(A and B) also equals P(B and A) = P(B | A) × P(A). Substitute that into the numerator:

P(A | B) = [P(B | A) × P(A)] / P(B)

That is the entire derivation. The formula is not a new principle - it is a re-arrangement of the conditional-probability definition that puts the term we usually want (the posterior) on the left, and the terms we usually have (likelihood, prior, marginal) on the right. For a refresher on the underlying definition, see conditional probability explained.
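
A quick numerical check can make the derivation concrete. The sketch below builds a small joint distribution (the four cell values are made up for illustration) and confirms that the conditional-probability definition and the Bayes rearrangement give the same number.

```python
# Hypothetical joint distribution over (A, B); the four cells sum to 1.
joint = {
    (True, True): 0.12,    # P(A and B)
    (True, False): 0.18,   # P(A and not B)
    (False, True): 0.28,   # P(not A and B)
    (False, False): 0.42,  # P(not A and not B)
}

p_a = joint[(True, True)] + joint[(True, False)]  # P(A)
p_b = joint[(True, True)] + joint[(False, True)]  # P(B)

# Definition of conditional probability, applied in both directions.
p_a_given_b = joint[(True, True)] / p_b
p_b_given_a = joint[(True, True)] / p_a

# Bayes rearrangement: P(A | B) = P(B | A) * P(A) / P(B)
via_bayes = p_b_given_a * p_a / p_b

print(round(p_a_given_b, 6), round(via_bayes, 6))
```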

Example 1: how does Bayes theorem solve the medical-test trap?

This is the canonical Bayes example because the answer feels so wrong. A disease affects 1 in 1,000 people. A diagnostic test is 99% sensitive (correctly flags 99% of people who have the disease) and 99% specific (correctly clears 99% of people who do not). You test positive. What is the probability you actually have the disease?

Intuition says "about 99%." The actual answer is roughly 9%. Here is the calculation:

Let A = "has the disease." Prior P(A) = 0.001.
Let B = "tests positive." Likelihood P(B | A) = 0.99 (sensitivity).
The false-positive rate P(B | not A) = 0.01 (one minus specificity).

Plug into the expanded form:

P(A | B) = (0.99 × 0.001) / [(0.99 × 0.001) + (0.01 × 0.999)]

= 0.00099 / (0.00099 + 0.00999) = 0.00099 / 0.01098 ≈ 0.0902

So a positive result on a 99%-accurate test for a rare disease means about a 9% chance you actually have it. The other 91% of positives are false alarms from the much-larger pool of healthy people. This is the false positive paradox, and Bayes theorem is the cleanest way to see it.
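
The same result can be reached by counting people instead of multiplying probabilities. The sketch below applies the example's rates to a hypothetical cohort of 1,000,000 people; the cohort size is an arbitrary choice for illustration, and exact fractions are used to avoid rounding noise.

```python
from fractions import Fraction

cohort = 1_000_000               # hypothetical cohort size
prevalence = Fraction(1, 1000)   # 1 in 1,000 has the disease
sensitivity = Fraction(99, 100)  # P(positive | disease)
specificity = Fraction(99, 100)  # P(negative | no disease)

sick = cohort * prevalence                     # 1,000 people
healthy = cohort - sick                        # 999,000 people
true_positives = sick * sensitivity            # correct alarms
false_positives = healthy * (1 - specificity)  # false alarms

share_truly_sick = true_positives / (true_positives + false_positives)
print(int(true_positives), int(false_positives), round(float(share_truly_sick), 4))
# prints: 990 9990 0.0902
```

Seen as counts, the 990 true positives are outnumbered roughly ten to one by the 9,990 false positives, which is where the 9% comes from.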

Example 2: how does Bayes theorem power spam filtering?

Naive Bayes classifiers - the original spam filters - apply this formula word by word. Suppose across a corpus of emails, 30% are spam (P(spam) = 0.30). The word "lottery" appears in 8% of spam emails (P("lottery" | spam) = 0.08) and in 0.5% of ham emails (P("lottery" | ham) = 0.005). An incoming email contains the word "lottery." How spammy is it on that signal alone?

P(spam | "lottery") = (0.08 × 0.30) / [(0.08 × 0.30) + (0.005 × 0.70)]

= 0.024 / (0.024 + 0.0035) = 0.024 / 0.0275 ≈ 0.873

So one word lifts the probability of spam from a 30% prior to an 87% posterior. A real classifier multiplies these likelihoods across many words (treating them as conditionally independent given the class - the "naive" part), which is mathematically rough but works well in practice for text classification.
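
A minimal sketch of the multi-word version follows. Only the "lottery" row comes from the example; the other words and their rates are hypothetical, and the names `WORD_LIKELIHOODS` and `spam_posterior` are illustrative. Working in log space is a common way to keep products of many small numbers from underflowing.

```python
import math

P_SPAM = 0.30  # prior from the example

# (P(word | spam), P(word | ham)). Only "lottery" comes from the example;
# the other rows are hypothetical values for illustration.
WORD_LIKELIHOODS = {
    "lottery": (0.08, 0.005),
    "winner": (0.05, 0.004),   # hypothetical
    "meeting": (0.002, 0.03),  # hypothetical
}


def spam_posterior(words):
    """Naive Bayes P(spam | words), treating words as conditionally independent."""
    log_spam = math.log(P_SPAM)
    log_ham = math.log(1.0 - P_SPAM)
    for word in words:
        if word not in WORD_LIKELIHOODS:
            continue  # unknown words are ignored in this sketch
        p_if_spam, p_if_ham = WORD_LIKELIHOODS[word]
        log_spam += math.log(p_if_spam)
        log_ham += math.log(p_if_ham)
    # Normalise: exp(log_spam) / (exp(log_spam) + exp(log_ham))
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))


print(round(spam_posterior(["lottery"]), 3))             # 0.873, matching the hand calculation
print(round(spam_posterior(["lottery", "meeting"]), 3))  # a ham-leaning word pulls it back down
```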

Example 3: how does sequential Bayesian updating work?

Bayes is iterative. The posterior from one observation becomes the prior for the next. Suppose a company assesses a job candidate as having a 40% prior probability of being a strong hire. The technical interview is a noisy signal: a real strong hire passes 80% of the time; a weak hire passes 30% of the time. The candidate passes.

First update:

P(strong | pass) = (0.80 × 0.40) / [(0.80 × 0.40) + (0.30 × 0.60)] = 0.32 / 0.50 = 0.64

The candidate then passes a behavioural round with similar signal strength (80% pass rate among strong, 30% among weak). Assuming the two rounds are independent given the candidate's true quality, plug 0.64 in as the new prior:

P(strong | both pass) = (0.80 × 0.64) / [(0.80 × 0.64) + (0.30 × 0.36)] = 0.512 / 0.620 = 0.826

Two noisy positive signals pushed the belief from 40% to 83%. This is what "updating in light of evidence" looks like in numbers, and it is the practical engine behind Bayesian thinking in everyday decisions.
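
The two updates can be sketched as a loop in which each posterior is fed back in as the next prior. The function name `update` is illustrative, and the sketch assumes the rounds are independent given whether the candidate is a strong hire.

```python
def update(prior, p_pass_if_strong, p_pass_if_weak):
    """One Bayesian update on a 'pass' observation."""
    numerator = p_pass_if_strong * prior
    return numerator / (numerator + p_pass_if_weak * (1.0 - prior))


belief = 0.40  # prior probability of a strong hire
# Each round: (pass rate among strong hires, pass rate among weak hires).
rounds = [(0.80, 0.30), (0.80, 0.30)]

for p_strong, p_weak in rounds:
    belief = update(belief, p_strong, p_weak)  # posterior becomes the next prior
    print(round(belief, 3))
```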

Example 4: how do you reverse a forecast with Bayes?

A weather model forecasts rain for tomorrow. Its track record is known as hit rates rather than as calibration: it issues a rain forecast on 70% of days that turn out rainy and on 30% of days that turn out dry. The regional climatology for this date is dry - only 15% of historical days in this week have had rain. What posterior should you actually hold?

The tempting move is to read the 70% as the probability of rain. It is not: it is a likelihood, P(forecast | rain), and it has to be reconciled with the base rate:

P(rain | rain forecast) = (0.70 × 0.15) / [(0.70 × 0.15) + (0.30 × 0.85)] = 0.105 / 0.360 ≈ 0.29

A rain forecast with a 70% hit rate on a 15%-base-rate day translates to about a 29% probability of rain. The correction applies because the 70% is a likelihood. A genuinely calibrated forecast is different: if the model's stated 70% were calibrated for days like this one, rain would follow 70% of the time, because calibration already folds the base rate in. This is the same machinery as base-rate neglect, just running in reverse.
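
To see how strongly the answer depends on the base rate, the sketch below holds the two likelihoods from the calculation fixed (0.70 and 0.30) and varies the climatological rain rate. The base rates other than 0.15 are illustrative, as is the function name `rain_posterior`.

```python
def rain_posterior(base_rate, p_forecast_if_rain=0.70, p_forecast_if_dry=0.30):
    """P(rain | rain forecast) for a given climatological base rate."""
    numerator = p_forecast_if_rain * base_rate
    return numerator / (numerator + p_forecast_if_dry * (1.0 - base_rate))


# 0.15 is the base rate from the example; the others are for comparison.
for base_rate in (0.05, 0.15, 0.30, 0.50):
    print(f"base rate {base_rate:.2f} -> posterior {rain_posterior(base_rate):.2f}")
```

With these likelihoods the posterior reaches 0.70 only when the base rate is 50%; on drier dates it is lower.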

Example 5: how does Bayes solve the two-coins puzzle?

A bag contains two coins. One is fair (50% heads). One is double-headed (100% heads). You pick a coin at random and flip it once: heads. What is the probability you picked the double-headed coin?

Prior P(double-headed) = 0.5.
Likelihood P(heads | double-headed) = 1.0.
Likelihood P(heads | fair) = 0.5.

P(double | heads) = (1.0 × 0.5) / [(1.0 × 0.5) + (0.5 × 0.5)] = 0.5 / 0.75 = 0.667

One heads observation lifts the posterior from 50% to 66.7%. Two consecutive heads (re-applying Bayes with 0.667 as the new prior):

P(double | HH) = (1.0 × 0.667) / [(1.0 × 0.667) + (0.5 × 0.333)] = 0.667 / 0.833 = 0.800

Three heads in a row pushes it to about 0.889, and so on. Each observation is consistent with both hypotheses, but it is twice as consistent with one of them - and that ratio compounds.
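
The compounding can be checked exactly with fractions. A minimal sketch, with an illustrative function name:

```python
from fractions import Fraction


def p_double_headed(n_heads):
    """Posterior that the coin is double-headed after n_heads heads in a row."""
    belief = Fraction(1, 2)  # one of the two coins, picked at random
    for _ in range(n_heads):
        numerator = Fraction(1) * belief                      # P(heads | double-headed) = 1
        marginal = numerator + Fraction(1, 2) * (1 - belief)  # plus P(heads | fair) = 1/2
        belief = numerator / marginal                         # posterior becomes the next prior
    return belief


for n in range(1, 5):
    print(n, p_double_headed(n))
# prints: 1 2/3, then 2 4/5, then 3 8/9, then 4 16/17
```

Each head doubles the odds in favour of the double-headed coin, so after n heads the posterior is 2^n / (2^n + 1).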

What are the most common Bayes-theorem mistakes?

Ignoring the base rate

Treating the likelihood as if it were the posterior. The 99% test does not give a 99% diagnosis when the disease affects 0.1% of the population. The further the base rate is from 50%, the more the prior shapes the answer, and a rare enough condition can outweigh even a very accurate test.

Confusing P(A | B) with P(B | A)

These are almost never equal. The probability a defendant is guilty given a DNA match is not the same as the probability of a DNA match given guilt. Reversing the direction inflates apparent certainty enormously.

Forgetting the denominator

The marginal P(B) is the total probability of the evidence under every hypothesis. Skipping it produces numbers that look right but are not normalised - they will not sum to one across the hypothesis space.

Using point estimates for likelihoods that have real uncertainty

If the test sensitivity is "about 99%" with a wide confidence interval, the posterior also has a wide interval. Treating the likelihood as a single number propagates false precision into the answer.
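
One simple way to see the effect is to re-run the medical-test calculation across a range of plausible inputs rather than a single value. The ranges below are hypothetical, chosen only to illustrate how uncertainty in the test characteristics carries through to the posterior.

```python
def posterior(prior, sensitivity, specificity):
    """P(disease | positive test)."""
    true_pos = sensitivity * prior
    false_pos = (1.0 - specificity) * (1.0 - prior)
    return true_pos / (true_pos + false_pos)


prior = 0.001  # 1 in 1,000, as in Example 1

# Hypothetical plausible ranges for the test characteristics.
sensitivities = (0.97, 0.99, 0.999)
specificities = (0.98, 0.99, 0.995)

results = [posterior(prior, se, sp) for se in sensitivities for sp in specificities]
print(f"posterior ranges from {min(results):.3f} to {max(results):.3f}")
```

Under these assumed ranges the posterior spans roughly 5% to 17%, so quoting a single "9%" would overstate how much is known. A fuller treatment would put probability distributions on the inputs rather than a grid.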

Stopping at one update

Bayes is iterative. New evidence updates the posterior, which becomes the prior for the next round. A single shocking observation rarely settles a question; a sequence of moderate observations usually does.

How do you do Bayesian updating step by step?

Write down the hypothesis space
What are the possible answers to the question? For binary cases, A and not-A. For multiple hypotheses, list them explicitly. Bayes only works when the hypotheses are mutually exclusive and collectively exhaustive.
Assign a prior to each hypothesis
Use the base rate where one exists (population prevalence, historical frequency). When no data is available, an uninformative uniform prior is a defensible default - but flag it as such, because the posterior inherits the prior's uncertainty.
Specify the likelihood of the observation under each hypothesis
P(evidence | A) and P(evidence | not A). These are usually what the data gives you - sensitivity and specificity for a test, hit rates for a forecast, frequency tables for a classifier.
Compute the marginal probability of the evidence
Sum the likelihood times the prior across every hypothesis. This is the denominator that normalises the posterior.
Divide to get the posterior
Numerator: likelihood times prior for the hypothesis you care about. Denominator: the marginal you just computed. Out comes a probability between 0 and 1.
Repeat for the next observation
Substitute the posterior in as the new prior and re-run with the next piece of evidence. How many iterations are needed depends on how strong each piece of evidence is: a few strong signals can settle a question that many weak ones leave open.
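
The steps above can be sketched as one small function that handles any number of hypotheses. The three-coin scenario, the 0.75 bias, and the name `bayes_update` are illustrative assumptions, not part of the examples above.

```python
def bayes_update(priors, likelihoods):
    """Update a dict of priors with a dict of P(evidence | hypothesis)."""
    if abs(sum(priors.values()) - 1.0) > 1e-9:
        raise ValueError("priors must sum to 1 (exclusive and exhaustive hypotheses)")
    # Step 4: marginal probability of the evidence.
    marginal = sum(likelihoods[h] * priors[h] for h in priors)
    # Step 5: posterior for each hypothesis.
    return {h: likelihoods[h] * priors[h] / marginal for h in priors}


# Steps 1-2: hypothesis space with a uniform prior (flagged as uninformative).
beliefs = {"fair": 1 / 3, "biased_heads": 1 / 3, "double_headed": 1 / 3}
# Step 3: P(heads | hypothesis); the 0.75 bias is an illustrative assumption.
p_heads = {"fair": 0.5, "biased_heads": 0.75, "double_headed": 1.0}

# Step 6: repeat for each new observation (here, two heads in a row).
for _ in range(2):
    beliefs = bayes_update(beliefs, p_heads)

print({h: round(p, 3) for h, p in beliefs.items()})
```

Because every posterior is divided by the same marginal, the returned values always sum to one, which is the normalisation step 4 exists to guarantee.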

Where does Bayes theorem actually show up?

The formula is mathematically universal, but a handful of domains rely on it constantly:

Diagnostic medicine - interpreting test results in the light of disease prevalence. Every "positive but is it really?" question is a Bayes problem.
Spam filtering - the naive Bayes classifier is the textbook example, and modern variants still apply the same logic across millions of features.
Search and recommendation - query-likelihood models in information retrieval are Bayes formulae running over vocabulary distributions.
A/B testing - Bayesian alternatives to frequentist null-hypothesis testing report posterior distributions over effect sizes instead of p-values, which gives interpretable answers like "there is a 92% probability variant B is better than variant A."
Forecasting and decision-making - superforecasters treat every new piece of information as a likelihood and update their posteriors continuously. See probability calibration training.
Machine learning - Bayesian inference underlies probabilistic graphical models, Kalman filters, Gaussian processes, and the inference engines behind variational autoencoders.
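
As one illustration of the A/B testing entry, the sketch below estimates the probability that variant B beats variant A by sampling from Beta posteriors. The conversion counts are hypothetical, the uniform Beta(1, 1) prior is an assumption, and the approach shown is one common option rather than the only way to run a Bayesian A/B test.

```python
import random

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical experiment data: conversions and visitors per variant.
a_conversions, a_visitors = 100, 1000
b_conversions, b_visitors = 130, 1000


def sample_rate(conversions, visitors):
    """Draw one conversion rate from the Beta posterior under a uniform Beta(1, 1) prior."""
    return random.betavariate(1 + conversions, 1 + visitors - conversions)


draws = 20_000
b_wins = sum(
    sample_rate(b_conversions, b_visitors) > sample_rate(a_conversions, a_visitors)
    for _ in range(draws)
)
p_b_better = b_wins / draws
print(f"P(B better than A) is about {p_b_better:.2f}")
```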

Can you do Bayesian thinking without the maths?

The formula is the formal version of a habit: change your mind in proportion to the strength of the evidence, weighted by how plausible the claim was to begin with. Most everyday situations do not need the numbers - they need the framing.

Three working rules approximate the calculation:

Start with the base rate. If you are asked "is this email spam," begin with "what fraction of email is spam?" rather than "how spammy does this one look?"
Ask how surprising the evidence would be under each hypothesis. Evidence that is equally consistent with both possibilities tells you nothing. Evidence that is much more consistent with one tells you a lot.
Update in proportion. One noisy signal moves a strong prior a little. Several independent signals in the same direction move it a lot. A single weak signal almost never settles a question.
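
These three rules correspond to the odds form of Bayes theorem: posterior odds = prior odds × likelihood ratio, where the likelihood ratio is P(evidence | A) / P(evidence | not A). It is often the easier version to run in your head. A minimal sketch, with illustrative function names:

```python
def to_odds(probability):
    return probability / (1.0 - probability)


def to_probability(odds):
    return odds / (1.0 + odds)


def update_odds(prior, likelihood_ratio):
    """Posterior probability from a prior and P(evidence | A) / P(evidence | not A)."""
    return to_probability(to_odds(prior) * likelihood_ratio)


# Rule 2: evidence equally likely under both hypotheses (ratio 1) changes nothing.
print(round(update_odds(0.30, 1.0), 3))
# The medical test from Example 1: ratio 0.99 / 0.01 = 99 applied to a 1-in-1,000 prior.
print(round(update_odds(0.001, 0.99 / 0.01), 3))
```

The second call reproduces the roughly 9% answer from Example 1: odds of 1 to 999 multiplied by 99 are still only about 1 to 10.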

This is the everyday application covered in Bayesian thinking for everyday decisions - the mental discipline you can use without ever writing the formula down.

Frequently Asked Questions

Q01: What is Bayes theorem in one sentence?

Bayes theorem is the formula for updating the probability of a hypothesis when new evidence arrives: P(hypothesis | evidence) = P(evidence | hypothesis) × P(hypothesis) / P(evidence).

Q02: Why is Bayes theorem so important?

Because, within probability theory, it is the only consistent way to combine a prior belief with new evidence. Probabilistic forecasts, diagnostic test interpretation, and Bayesian machine-learning models all rest on the same equation.

Q03: What is the difference between Bayesian and frequentist probability?

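Frequentists define probability as the long-run frequency of an event in repeated trials. Bayesians define probability as a degree of belief that can be updated with evidence. Bayes theorem is valid under both interpretations; the disagreement is over whether a probability, and therefore a prior, can be assigned to a hypothesis or a fixed unknown parameter.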

Q04: What is a prior in Bayes theorem?

The prior is the probability assigned to a hypothesis before observing any evidence. It can come from a base rate, historical data, or a subjective judgment. The posterior depends on it, so a poorly chosen prior produces a poorly calibrated answer.

Q05: Why does Bayes theorem give counter-intuitive results for rare diseases?

Because the false-positive pool from the much-larger healthy population dwarfs the true-positive pool from the small infected population. Even a highly accurate test will produce mostly false positives when the disease is rare - the maths is correct; the intuition is wrong.

Q06: Can Bayes theorem be applied iteratively?

Yes - and that is one of its strengths. The posterior from one observation becomes the prior for the next. Sequential updating across many pieces of evidence is how Bayesian inference engines actually work in practice.

Q07: What is the prosecutor's fallacy?

Confusing the probability of evidence given guilt with the probability of guilt given evidence. They are the two sides of Bayes theorem and are almost never equal. The fallacy systematically overstates the strength of forensic evidence in court.