Conditional probability

In probability theory, conditional probability is a measure of the probability of an event (some particular situation occurring) given that another event has occurred.[1] If the event of interest is A and the event B is known or assumed to have occurred, "the conditional probability of A given B", or "the probability of A under the condition B", is usually written as P(A | B), or sometimes PB(A) or P(A / B). For example, the probability that any given person has a cough on any given day may be only 5%. But if we know or assume that the person has a cold, then they are much more likely to be coughing. The conditional probability of coughing by the unwell might be 75%, then: P(Cough) = 5%; P(Cough | Sick) = 75%

The concept of conditional probability is one of the most fundamental and one of the most important in probability theory.[2] But conditional probabilities can be quite slippery and require careful interpretation.[3] For example, there need not be a causal relationship between A and B, and they don't have to occur simultaneously.

P(A | B) may or may not be equal to P(A) (the unconditional probability of A). If P(A | B) = P(A), then events A and B are said to be "independent": in such a case, knowledge about either event does not give information on the other. P(A | B) (the conditional probability of A given B) typically differs from P(B | A). For example, if a person has dengue, they might have a 90% chance of testing positive for dengue. In this case what is being measured is that if event B ("having dengue") has occurred, the probability of A (test is positive) given that B (having dengue) occurred is 90%: that is, P(A | B) = 90%. Alternatively, if a person tests positive for dengue they may have only a 15% chance of actually having this rare disease because the false positive rate for the test may be high. In this case what is being measured is the probability of the event B (having dengue) given that the event A (test is positive) has occurred: P(B | A) = 15%. Falsely equating the two probabilities causes various errors of reasoning such as the base rate fallacy. Conditional probabilities can be reversed using Bayes' theorem.

Conditional probabilities can be displayed in a conditional probability table.


Conditional probability
Illustration of conditional probabilities with an Euler diagram. The unconditional probability P(A) = 0.30 + 0.10 + 0.12 = 0.52. However, the conditional probability P(A|B1) = 1, P(A|B2) = 0.12 ÷ (0.12 + 0.04) = 0.75, and P(A|B3) = 0.
Probability tree diagram
On a tree diagram, branch probabilities are conditional on the event associated with the parent node. (Here the overbars indicate that the event does not occur.)
Venn Pie Chart describing Bayes' law
Venn Pie Chart describing conditional probabilities

Conditioning on an event

Kolmogorov definition

Given two events A and B, from the sigma-field of a probability space, with the unconditional probability of B (that is, of the event B occurring ) being greater than zero, P(B) > 0, the conditional probability of A given B is defined as the quotient of the probability of the joint of events A and B, and the probability of B:[4]

where is the probability that both events A and B occur. This may be visualized as restricting the sample space to situations in which B occurs. The logic behind this equation is that if the possible outcomes for A and B are restricted to those in which B occurs, this set serves as the new sample space.

Note that this is a definition but not a theoretical result. We just denote the quantity as and call it the conditional probability of A given B.

As an axiom of probability

Some authors, such as de Finetti, prefer to introduce conditional probability as an axiom of probability:

Although mathematically equivalent, this may be preferred philosophically; under major probability interpretations such as the subjective theory, conditional probability is considered a primitive entity. Further, this "multiplication axiom" introduces a symmetry with the summation axiom for mutually exclusive events:[5]

As the probability of a conditional event

Conditional probability can be defined as the probability of a conditional event .[6] Assuming that the experiment underlying the events and is repeated, the Goodman–Nguyen–van Fraassen conditional event can be defined as

It can be shown that

which meets the Kolmogorov definition of conditional probability. Note that the equation is a theoretical result and not a definition. The definition via conditional events can be understood directly in terms of the Kolmogorov axioms and is particularly close to the Kolmogorov interpretation of probability in terms of experimental data. For example, conditional events can be repeated themselves leading to a generalized notion of conditional event . It can be shown[6] that the sequence is i.i.d., which yields a strong law of large numbers for conditional probability:

Measure-theoretic definition

If P(B) = 0, then according to the simple definition, P(A|B) is undefined. However, it is possible to define a conditional probability with respect to a σ-algebra of such events (such as those arising from a continuous random variable).

For example, if X and Y are non-degenerate and jointly continuous random variables with density ƒX,Y(xy) then, if B has positive measure,

The case where B has zero measure is problematic. For the case that B = {y0}, representing a single point, the conditional probability could be defined as

however this approach leads to the Borel–Kolmogorov paradox. The more general case of zero measure is even more problematic, as can be seen by noting that the limit, as all δyi approach zero, of

depends on their relationship as they approach zero. See conditional expectation for more information.

Conditioning on a random variable

Let X be a random variable; we assume for the sake of presentation that X is discrete, that is, X takes on only finitely many values x. Let A be an event. The conditional probability of A given X is defined as the random variable, written P(A|X), that takes on the value


More formally,

The conditional probability P(A|X) is a function of X: e.g., if the function g is defined as


Note that P(A|X) and X are now both random variables. From the law of total probability, the expected value of P(A|X) is equal to the unconditional probability of A.

Partial conditional probability

The partial conditional probability is about the probability of event given that each of the condition events has occurred to a degree (degree of belief, degree of experience) that might be different from 100%. Frequentistically, partial conditional probability makes sense, if the conditions are tested in experiment repetitions of appropriate length .[7] Such -bounded partial conditional probability can be defined as the conditionally expected average occurrence of event in testbeds of length that adhere to all of the probability specifications , i.e.:


Based on that, partial conditional probability can be defined as

where [7]

Jeffrey conditionalization [8] [9] is a special case of partial conditional probability in which the condition events must form a partition:


Suppose that somebody secretly rolls two fair six-sided dice, and we wish to compute the probability that the face-up value of the first one is 2, given the information that their sum is no greater than 5.

  • Let D1 be the value rolled on die 1.
  • Let D2 be the value rolled on die 2.

Probability that D1 = 2

Table 1 shows the sample space of 36 combinations of rolled values of the two dice, each of which occurs probability 1/36, with the numbers displayed in the red and dark gray cells being D1 + D2.

D1 = 2 in exactly 6 of the 36 outcomes; thus P(D1 = 2) = ​636 = ​16:

Table 1
+ D2
1 2 3 4 5 6
D1 1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

Probability that D1 + D2 ≤ 5

Table 2 shows that D1 + D2 ≤ 5 for exactly 10 of the 36 outcomes, thus P(D1 + D2 ≤ 5) = ​1036:

Table 2
+ D2
1 2 3 4 5 6
D1 1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

Probability that D1 = 2 given that D1 + D2 ≤ 5

Table 3 shows that for 3 of these 10 outcomes, D1 = 2.

Thus, the conditional probability P(D1 = 2 | D1+D2 ≤ 5) = ​310 = 0.3:

Table 3
+ D2
1 2 3 4 5 6
D1 1 2 3 4 5 6 7
2 3 4 5 6 7 8
3 4 5 6 7 8 9
4 5 6 7 8 9 10
5 6 7 8 9 10 11
6 7 8 9 10 11 12

Here, in the earlier notation for the definition of conditional probability, the conditioning event B is that D1 + D2 ≤ 5, and the event A is D1 = 2. We have as seen in the table.

Use in inference

In statistical inference, the conditional probability is an update of the probability of an event based on new information.[3] Incorporating the new information can be done as follows:[1]

  • Let A, the event of interest, be in the sample space, say (X,P).
  • The occurrence of the event A knowing that event B has or will have occurred, means the occurrence of A as it is restricted to B, i.e. .
  • Without the knowledge of the occurrence of B, the information about the occurrence of A would simply be P(A)
  • The probability of A knowing that event B has or will have occurred, will be the probability of relative to P(B), the probability that B has occurred.
  • This results in whenever P(B) > 0 and 0 otherwise.

This approach results in a probability measure that is consistent with the original probability measure and satisfies all the Kolmogorov axioms. This conditional probability measure also could have resulted by assuming that the relative magnitude of the probability of A with respect to X will be preserved with respect to B (cf. a Formal Derivation below).

The wording "evidence" or "information" is generally used in the Bayesian interpretation of probability. The conditioning event is interpreted as evidence for the conditioned event. That is, P(A) is the probability of A before accounting for evidence E, and P(A|E) is the probability of A after having accounted for evidence E or after having updated P(A). This is consistent with the frequentist interpretation, which is the first definition given above.

Statistical independence

Events A and B are defined to be statistically independent if

If P(B) is not zero, then this is equivalent to the statement that

Similarly, if P(A) is not zero, then

is also equivalent. Although the derived forms may seem more intuitive, they are not the preferred definition as the conditional probabilities may be undefined, and the preferred definition is symmetrical in A and B.

Independent events vs. mutually exclusive events

The concepts of mutually independent events and mutually exclusive events are separate and distinct. The following table contrasts results for the two cases (provided the probability of the conditioning event is not zero).

If statistically independent If mutually exclusive

In fact, mutually exclusive events cannot be statistically independent (unless they both are impossible), since knowing that one occurs gives information about the other (specifically, that it certainly does not occur).

Common fallacies

These fallacies should not be confused with Robert K. Shope's 1978 "conditional fallacy", which deals with counterfactual examples that beg the question.

Assuming conditional probability is of similar size to its inverse

Bayes theorem visualisation
A geometric visualisation of Bayes' theorem. In the table, the values 2, 3, 6 and 9 give the relative weights of each corresponding condition and case. The figures denote the cells of the table involved in each metric, the probability being the fraction of each figure that is shaded. This shows that P(A|B) P(B) = P(B|A) P(A) i.e. P(A|B) = P(B|A) P(A)/P(B) . Similar reasoning can be used to show that P(Ā|B) = P(B|Ā) P(Ā)/P(B) etc.

In general, it cannot be assumed that P(A|B) ≈ P(B|A). This can be an insidious error, even for those who are highly conversant with statistics.[10] The relationship between P(A|B) and P(B|A) is given by Bayes' theorem:

That is, P(A|B) ≈ P(B|A) only if P(B)/P(A) ≈ 1, or equivalently, P(A) ≈ P(B).

Assuming marginal and conditional probabilities are of similar size

In general, it cannot be assumed that P(A) ≈ P(A|B). These probabilities are linked through the law of total probability:

where the events form a countable partition of .

This fallacy may arise through selection bias.[11] For example, in the context of a medical claim, let SC be the event that a sequela (chronic disease) S occurs as a consequence of circumstance (acute condition) C. Let H be the event that an individual seeks medical help. Suppose that in most cases, C does not cause S so P(SC) is low. Suppose also that medical attention is only sought if S has occurred due to C. From experience of patients, a doctor may therefore erroneously conclude that P(SC) is high. The actual probability observed by the doctor is P(SC|H).

Over- or under-weighting priors

Not taking prior probability into account partially or completely is called base rate neglect. The reverse, insufficient adjustment from the prior probability is conservatism.

Formal derivation

Formally, P(A | B) is defined as the probability of A according to a new probability function on the sample space, such that outcomes not in B have probability 0 and that it is consistent with all original probability measures.[12][13]

Let Ω be a sample space with elementary events {ω}. Suppose we are told the event B ⊆ Ω has occurred. A new probability distribution (denoted by the conditional notation) is to be assigned on {ω} to reflect this. For events in B, it is reasonable to assume that the relative magnitudes of the probabilities will be preserved. For some constant scale factor α, the new distribution will therefore satisfy:

Substituting 1 and 2 into 3 to select α:

So the new probability distribution is

Now for a general event A,

See also


  1. ^ a b Gut, Allan (2013). Probability: A Graduate Course (Second ed.). New York, NY: Springer. ISBN 978-1-4614-4707-8.
  2. ^ Ross, Sheldon (2010). A First Course in Probability (8th ed.). Pearson Prentice Hall. ISBN 978-0-13-603313-4.
  3. ^ a b Casella, George; Berger, Roger L. (2002). Statistical Inference. Duxbury Press. ISBN 0-534-24312-6.
  4. ^ Kolmogorov, Andrey (1956), Foundations of the Theory of Probability, Chelsea
  5. ^ Gillies, Donald (2000); "Philosophical Theories of Probability"; Routledge; Chapter 4 "The subjective theory"
  6. ^ a b Draheim, Dirk (2017). "An Operational Semantics of Conditional Probabilities that Fully Adheres to Kolmogorov's Explication of Probability Theory". doi:10.13140/RG.2.2.10050.48323/3.
  7. ^ a b c Draheim, Dirk (2017). "Generalized Jeffrey Conditionalization (A Frequentist Semantics of Partial Conditionalization)". Springer. Retrieved December 19, 2017.
  8. ^ Jeffrey, Richard C. (1983), The Logic of Decision, 2nd edition, University of Chicago Press
  9. ^ "Bayesian Epistemology". Stanford Encyclopedia of Philosophy. 2017. Retrieved December 29, 2017.
  10. ^ Paulos, J.A. (1988) Innumeracy: Mathematical Illiteracy and its Consequences, Hill and Wang. ISBN 0-8090-7447-8 (p. 63 et seq.)
  11. ^ Thomas Bruss, F; Der Wyatt Earp Effekt; Spektrum der Wissenschaft; March 2007
  12. ^ George Casella and Roger L. Berger (1990), Statistical Inference, Duxbury Press, ISBN 0-534-11958-1 (p. 18 et seq.)
  13. ^ Grinstead and Snell's Introduction to Probability, p. 134

External links

Bayes' theorem

In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule) describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For example, if cancer is related to age, then, using Bayes' theorem, a person's age can be used to more accurately assess the probability that they have cancer, compared to the assessment of the probability of cancer made without knowledge of the person's age.

One of the many applications of Bayes' theorem is Bayesian inference, a particular approach to statistical inference. When applied, the probabilities involved in Bayes' theorem may have different probability interpretations. With the Bayesian probability interpretation the theorem expresses how a degree of belief, expressed as a probability, should rationally change to account for availability of related evidence. Bayesian inference is fundamental to Bayesian statistics.

Bayes' theorem is named after Reverend Thomas Bayes (; 1701?–1761), who first used conditional probability to provide an algorithm (his Proposition 9) that uses evidence to calculate limits on an unknown parameter, published as An Essay towards solving a Problem in the Doctrine of Chances (1763). In what he called a scholium, Bayes extended his algorithm to any unknown prior cause. Independently of Bayes, Pierre-Simon Laplace in 1774, and later in his 1812 "Théorie analytique des probabilités" used conditional probability to formulate the relation of an updated posterior probability from a prior probability, given evidence. Sir Harold Jeffreys put Bayes's algorithm and Laplace's formulation on an axiomatic basis. Jeffreys wrote that Bayes' theorem "is to the theory of probability what the Pythagorean theorem is to geometry".

Berkson's paradox

Berkson's paradox also known as Berkson's bias or Berkson's fallacy is a result in conditional probability and statistics which is often found to be counterintuitive, and hence a veridical paradox. It is a complicating factor arising in statistical tests of proportions. Specifically, it arises when there is an ascertainment bias inherent in a study design. The effect is related to the explaining away phenomenon in Bayesian networks.

The most common example of Berkson's paradox is a false observation of a negative correlation between two positive traits, i.e., that members of a population which have some positive trait tend to lack a second. Berkson's paradox occurs when this observation appears true when in reality the two properties are unrelated—or even positively correlated—because members of the population where both are absent are not equally observed. For example, a person may observe from their experience that fast food restaurants in their area which serve good hamburgers tend to serve bad fries and vice versa; but because they would likely not eat anywhere where both were bad, they fail to allow for the large number of restaurants in this category which would weaken or even flip the correlation.

It is often described in the fields of medical statistics or biostatistics, as in the original description of the problem by Joseph Berkson.


A bigram or digram is a sequence of two adjacent elements from a string of tokens, which are typically letters, syllables, or words. A bigram is an n-gram for n=2. The frequency distribution of every bigram in a string is commonly used for simple statistical analysis of text in many applications, including in computational linguistics, cryptography, speech recognition, and so on.

Gappy bigrams or skipping bigrams are word pairs which allow gaps (perhaps avoiding connecting words, or allowing some simulation of dependencies, as in a dependency grammar).

Head word bigrams are gappy bigrams with an explicit dependency relationship.

Bigrams help provide the conditional probability of a token given the preceding token, when the relation of the conditional probability is applied:

That is, the probability of a token given the preceding token is equal to the probability of their bigram, or the co-occurrence of the two tokens , divided by the probability of the preceding token.

Borel–Kolmogorov paradox

In probability theory, the Borel–Kolmogorov paradox (sometimes known as Borel's paradox) is a paradox relating to conditional probability with respect to an event of probability zero (also known as a null set). It is named after Émile Borel and Andrey Kolmogorov.

Coherence (philosophical gambling strategy)

In a thought experiment proposed by the Italian probabilist Bruno de Finetti in order to justify Bayesian probability, an array of wagers is coherent precisely if it does not expose the wagerer to certain loss regardless of the outcomes of events on which they are wagering, even if their opponent makes the most judicious choices.

Conditional probability distribution

In probability theory and statistics, given two jointly distributed random variables and , the conditional probability distribution of Y given X is the probability distribution of when is known to be a particular value; in some cases the conditional probabilities may be expressed as functions containing the unspecified value of as a parameter. When both and are categorical variables, a conditional probability table is typically used to represent the conditional probability. The conditional distribution contrasts with the marginal distribution of a random variable, which is its distribution without reference to the value of the other variable.

If the conditional distribution of given is a continuous distribution, then its probability density function is known as the conditional density function. The properties of a conditional distribution, such as the moments, are often referred to by corresponding names such as the conditional mean and conditional variance.

More generally, one can refer to the conditional distribution of a subset of a set of more than two variables; this conditional distribution is contingent on the values of all the remaining variables, and if more than one variable is included in the subset then this conditional distribution is the conditional joint distribution of the included variables.

Conditional probability table

In statistics, the conditional probability table (CPT) is defined for a set of discrete and mutually dependent random variables to display conditional probabilities of a single variable with respect to the others (i.e., the probability of each possible value of one variable if we know the values taken on by the other variables). For example, assume there are three random variables where each has states. Then, the conditional probability table of provides the conditional probability values – where the vertical bar means “given the values of” – for each of the K possible values of the variable and for each possible combination of values of This table has cells. In general, for variables with states for each variable the CPT for any one of them has the number of cells equal to the product

A conditional probability table can be put into matrix form. As an example with only two variables, the values of with k and j ranging over K values, create a K×K matrix. This matrix is a stochastic matrix since the columns sum to 1; i.e. for all j. For example, suppose that two binary variables x and y have the joint probability distribution given in this table:

Each of the four central cells shows the probability of a particular combination of x and y values. The first column sum is the probability that x =0 and y equals any of the values it can have – that is, the column sum 6/9 is the marginal probability that x=0. If we want to find the probability that y=0 given that x=0, we compute the fraction of the probabilities in the x=0 column that have the value y=0, which is 4/9 ÷ 6/9 = 4/6. Likewise, in the same column we find that the probability that y=1 given that x=0 is 2/9 ÷ 6/9 = 2/6. In the same way, we can also find the conditional probabilities for y equalling 0 or 1 given that x=1. Combining these pieces of information gives us this table of conditional probabilities for y:

With more than one conditioning variable, the table would still have one row for each potential value of the variable whose conditional probabilities are to be given, and there would be one column for each possible combination of values of the conditioning variables.

Moreover, the number of columns in the table could be substantially expanded to display the probabilities of the variable of interest conditional on specific values of only some, rather than all, of the other variables.

Conditioning (probability)

Beliefs depend on the available information. This idea is formalized in probability theory by conditioning. Conditional probabilities, conditional expectations, and conditional probability distributions are treated on three levels: discrete probabilities, probability density functions, and measure theory. Conditioning leads to a non-random result if the condition is completely specified; otherwise, if the condition is left random, the result of conditioning is also random.

Confusion of the inverse

Confusion of the inverse, also called the conditional probability fallacy or the inverse fallacy, is a logical fallacy whereupon a conditional probability is equivocated with its inverse: That is, given two events A and B, the probability of A happening given that B has happened is assumed to be about the same as the probability of B given A. More formally, P(A|B) is assumed to be approximately equal to P(B|A).

Context tree weighting

The context tree weighting method (CTW) is a lossless compression and prediction algorithm by Willems, Shtarkov & Tjalkens 1995. The CTW algorithm is among the very few such algorithms that offer both theoretical guarantees and good practical performance (see, e.g. Begleiter, El-Yaniv & Yona 2004).

The CTW algorithm is an “ensemble method,” mixing the predictions of many underlying variable order Markov models, where each such model is constructed using zero-order conditional probability estimators.

Generative model

In statistical classification, including machine learning, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

The distinction between these last two classes is not consistently made; Jebara (2004) refers to these three classes as generative learning, conditional learning, and discriminative learning, but Ng & Jordan (2002) only distinguish two classes, calling them generative classifiers (joint distribution) and discriminative classifiers (conditional distribution or no distribution), not distinguishing between the latter two classes. Analogously, a classifier based on a generative model is a generative classifier, while a classifier based on a discriminative model is a discriminative classifier, though this term also refers to classifiers that are not based on a model. Standard examples of each, all of which are linear classifiers, are: generative classifiers: naive Bayes classifier and linear discriminant analysis; discriminative model: logistic regression; non-model classifier: perceptron and support vector machine.

In application to classification, one wishes to go from an observation x to a label y (or probability distribution on labels). One can compute this directly, without using a probability distribution (distribution-free classifier); one can estimate the probability of a label given an observation, (discriminative model), and base classification on that; or one can estimate the joint distribution (generative model), from that compute the conditional probability , and then base classification on that. These are increasingly indirect, but increasingly probabilistic, allowing more domain knowledge and probability theory to be applied. In practice different approaches are used, depending on the particular problem, and hybrids can combine strengths of multiple approaches.

Kushner equation

In filtering theory the Kushner equation (after Harold Kushner) is an equation for the conditional probability density of the state of a stochastic non-linear dynamical system, given noisy measurements of the state. It therefore provides the solution of the nonlinear filtering problem in estimation theory. The equation is sometimes referred to as the Stratonovich–Kushner (or Kushner–Stratonovich) equation. However, the correct equation in terms of Itō calculus was first derived by Kushner although a more heuristic Stratonovich version of it appeared already in Stratonovich's works in late fifties. However, the derivation in terms of Itō calculus is due to Richard Bucy.

List of probability topics

This is a list of probability topics, by Wikipedia page.

It overlaps with the (alphabetical) list of statistical topics. There are also the outline of probability and catalog of articles in probability theory. For distributions, see List of probability distributions. For journals, see list of probability journals. For contributors to the field, see list of mathematical probabilists and list of statisticians.

Monty Hall problem

The Monty Hall problem is a brain teaser, in the form of a probability puzzle, loosely based on the American television game show Let's Make a Deal and named after its original host, Monty Hall. The problem was originally posed (and solved) in a letter by Steve Selvin to the American Statistician in 1975 (Selvin 1975a), (Selvin 1975b). It became famous as a question from a reader's letter quoted in Marilyn vos Savant's "Ask Marilyn" column in Parade magazine in 1990 (vos Savant 1990a):

Suppose you're on a game show, and you're given the choice of three doors: Behind one door is a car; behind the others, goats. You pick a door, say No. 1, and the host, who knows what's behind the doors, opens another door, say No. 3, which has a goat. He then says to you, "Do you want to pick door No. 2?" Is it to your advantage to switch your choice?

Vos Savant's response was that the contestant should switch to the other door (vos Savant 1990a). Under the standard assumptions, contestants who switch have a 2/3 chance of winning the car, while contestants who stick to their initial choice have only a 1/3 chance.

The given probabilities depend on specific assumptions about how the host and contestant choose their doors. A key insight is that, under these standard conditions, there is more information about doors 2 and 3 that was not available at the beginning of the game, when door 1 was chosen by the player: the host's deliberate action adds value to the door he did not choose to eliminate, but not to the one chosen by the contestant originally. Another insight is that switching doors is a different action than choosing between the two remaining doors at random, as the first action uses the previous information and the latter does not. Other possible behaviors than the one described can reveal different additional information, or none at all, and yield different probabilities.

Many readers of vos Savant's column refused to believe switching is beneficial despite her explanation. After the problem appeared in Parade, approximately 10,000 readers, including nearly 1,000 with PhDs, wrote to the magazine, most of them claiming vos Savant was wrong (Tierney 1991). Even when given explanations, simulations, and formal mathematical proofs, many people still do not accept that switching is the best strategy (vos Savant 1991a). Paul Erdős, one of the most prolific mathematicians in history, remained unconvinced until he was shown a computer simulation demonstrating the predicted result (Vazsonyi 1999).

The problem is a paradox of the veridical type, because the correct choice (that one should switch doors) is so counterintuitive it can seem absurd, but is nevertheless demonstrably true. The Monty Hall problem is mathematically closely related to the earlier Three Prisoners problem and to the much older Bertrand's box paradox.

Outline of probability

Probability is a measure of the likeliness that an event will occur. Probability is used to quantify an attitude of mind towards some proposition of whose truth we are not certain. The proposition of interest is usually of the form "A specific event will occur." The attitude of mind is of the form "How certain are we that the event will occur?" The certainty we adopt can be described in terms of a numerical measure and this number, between 0 and 1 (where 0 indicates impossibility and 1 indicates certainty), we call probability. Probability theory is used extensively in statistics, mathematics, science and philosophy to draw conclusions about the likelihood of potential events and the underlying mechanics of complex systems.

Posterior probability

In Bayesian statistics, the posterior probability of a random event or an uncertain proposition is the conditional probability that is assigned after the relevant evidence or background is taken into account. Similarly, the posterior probability distribution is the probability distribution of an unknown quantity, treated as a random variable, conditional on the evidence obtained from an experiment or survey. "Posterior", in this context, means after taking into account the relevant evidence related to the particular case being examined. For instance, there is a ("non-posterior") probability of a person finding buried treasure if they dig in a random spot, and a posterior probability of finding buried treasure if they dig in a spot where their metal detector rings.


In probability theory, to postselect is to condition a probability space upon the occurrence of a given event. In symbols, once we postselect for an event , the probability of some other event changes from to the conditional probability .

For a discrete probability space, , and thus we require that be strictly positive in order for the postselection to be well-defined.

See also PostBQP, a complexity class defined with postselection. Using postselection it seems quantum Turing machines are much more powerful: Scott Aaronson proved PostBQP is equal to PP.

Some quantum experiments use post-selection after the experiment as a replacement for communication during the experiment, by post-selecting the communicated value into a constant.


Probability is the measure of the likelihood that an event will occur. See glossary of probability and statistics. Probability quantifies as a number between 0 and 1, where, loosely speaking, 0 indicates impossibility and 1 indicates certainty. The higher the probability of an event, the more likely it is that the event will occur. A simple example is the tossing of a fair (unbiased) coin. Since the coin is fair, the two outcomes ("heads" and "tails") are both equally probable; the probability of "heads" equals the probability of "tails"; and since no other outcomes are possible, the probability of either "heads" or "tails" is 1/2 (which could also be written as 0.5 or 50%).

These concepts have been given an axiomatic mathematical formalization in probability theory, which is used widely in such areas of study as mathematics, statistics, finance, gambling, science (in particular physics), artificial intelligence/machine learning, computer science, game theory, and philosophy to, for example, draw inferences about the expected frequency of events. Probability theory is also used to describe the underlying mechanics and regularities of complex systems.

Regular conditional probability

Regular conditional probability is a concept that has developed to overcome certain difficulties in formally defining conditional probabilities for continuous probability distributions. It is defined as an alternative probability measure conditioned on a particular value of a random variable.

This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.