Joint probability distribution

Given random variables , that are defined on a probability space, the joint probability distribution for is a probability distribution that gives the probability that each of falls in any particular range or discrete set of values specified for that variable. In the case of only two random variables, this is called a bivariate distribution, but the concept generalizes to any number of random variables, giving a multivariate distribution.

The joint probability distribution can be expressed either in terms of a joint cumulative distribution function or in terms of a joint probability density function (in the case of continuous variables) or joint probability mass function (in the case of discrete variables). These in turn can be used to find two other types of distributions: the marginal distribution giving the probabilities for any one of the variables with no reference to any specific ranges of values for the other variables, and the conditional probability distribution giving the probabilities for any subset of the variables conditional on particular values of the remaining variables.

Many sample observations (black) are shown from a joint probability distribution. The marginal densities are shown as well.
Multivariate normal sample
Many sample observations (black) are shown from a joint probability distribution. The marginal densities are shown as well.
Multivariate normal sample


Draws from an urn

Suppose each of two urns contains twice as many red balls as blue balls, and no others, and suppose one ball is randomly selected from each urn, with the two draws independent of each other. Let and be discrete random variables associated with the outcomes of the draw from the first urn and second urn respectively. The probability of drawing a red ball from either of the urns is 2/3, and the probability of drawing a blue ball is 1/3. We can present the joint probability distribution as the following table:

A=Red A=Blue P(B)
B=Red (2/3)(2/3)=4/9 (1/3)(2/3)=2/9 4/9+2/9=2/3
B=Blue (2/3)(1/3)=2/9 (1/3)(1/3)=1/9 2/9+1/9=1/3
P(A) 4/9+2/9=2/3 2/9+1/9=1/3

Each of the four inner cells shows the probability of a particular combination of results from the two draws; these probabilities are the joint distribution. In any one cell the probability of a particular combination occurring is (since the draws are independent) the product of the probability of the specified result for A and the probability of the specified result for B. The probabilities in these four cells sum to 1, as it is always true for probability distributions.

Moreover, the final row and the final column give the marginal probability distribution for A and the marginal probability distribution for B respectively. For example, for A the first of these cells gives the sum of the probabilities for A being red, regardless of which possibility for B in the column above the cell occurs, as 2/3. Thus the marginal probability distribution for gives 's probabilities unconditional on , in a margin of the table.

Coin flips

Consider the flip of two fair coins; let and be discrete random variables associated with the outcomes of the first and second coin flips respectively. Each coin flip is a Bernoulli trial and has a Bernoulli distribution. If a coin displays "heads" then the associated random variable takes the value 1, and it takes the value 0 otherwise. The probability of each of these outcomes is 1/2, so the marginal (unconditional) density functions are

The joint probability density function of and defines probabilities for each pair of outcomes. All possible outcomes are

Since each outcome is equally likely the joint probability density function becomes

Since the coin flips are independent, the joint probability density function is the product of the marginals:

Roll of a die

Consider the roll of a fair die and let if the number is even (i.e. 2, 4, or 6) and otherwise. Furthermore, let if the number is prime (i.e. 2, 3, or 5) and otherwise.

1 2 3 4 5 6
A 0 1 0 1 0 1
B 0 1 1 0 1 0

Then, the joint distribution of and , expressed as a probability mass function, is

These probabilities necessarily sum to 1, since the probability of some combination of and occurring is 1.

Bivariate normal distribution

Multivariate Gaussian
Bivariate normal joint density

The multivariate normal distribution, which is a continuous distribution, is the most commonly encountered distribution in statistics. When there are specifically two random variables, this is the bivariate normal distribution, shown in the graph, with the possible values of the two variables plotted in two of the dimensions and the value of the density function for any pair of such values plotted in the third dimension. The probability that the two variables together fall in any region of their two dimensions is given by the volume under the density function above that region.

Joint cumulative distribution function

For a pair of random variables , the joint cumulative distribution function (CDF) is given by[1]:p. 89


where the right-hand side represents the probability that the random variable takes on a value less than or equal to and that takes on a value less than or equal to .

For random variables , the joint CDF is given by


Interpreting the random variables as a random vector yields a shorter notation:

Joint density function or mass function

Discrete case

The joint probability mass function of two discrete random variables is:


or written in term of conditional distributions

where is the probability of given that .

The generalization of the preceding two-variable case is the joint probability distribution of discrete random variables which is:


or equivalently


This identity is known as the chain rule of probability.

Since these are probabilities, we have in the two-variable case

which generalizes for discrete random variables to

Continuous case

The joint probability density function for two continuous random variables is defined as the derivative of the joint cumulative distribution function (see Eq.1):


This is equal to:

where and are the conditional distributions of given and of given respectively, and and are the marginal distributions for and respectively.

The definition extends naturally to more than two random variables:


Again, since these are probability distributions, one has


Mixed case

The "mixed joint density" may be defined where one or more random variables are continuous and the other random variables are discrete, or vice versa. With one variable of each type we have

One example of a situation in which one may wish to find the cumulative distribution of one random variable which is continuous and another random variable which is discrete arises when one wishes to use a logistic regression in predicting the probability of a binary outcome Y conditional on the value of a continuously distributed outcome . One must use the "mixed" joint density when finding the cumulative distribution of this binary outcome because the input variables were initially defined in such a way that one could not collectively assign it either a probability density function or a probability mass function. Formally, is the probability density function of with respect to the product measure on the respective supports of and . Either of these two decompositions can then be used to recover the joint cumulative distribution function:

The definition generalizes to a mixture of arbitrary numbers of discrete and continuous random variables.

Additional properties

Joint distribution for independent variables

In general two random variables and are independent if the joint cumulative distribution function satisfies

Two discrete random variables and are independent if the joint probability mass function satisfies

for all and .

Similarly, two absolutely continuous random variables are independent if

for all and . This means that acquiring any information about the value of one or more of the random variables leads to a conditional distribution of any other variable that is identical to its unconditional (marginal) distribution; thus no variable provides any information about any other variable.

Joint distribution for conditionally dependent variables

If a subset of the variables is conditionally dependent given another subset of these variables, then the joint distribution is equal to . Therefore, it can be efficiently represented by the lower-dimensional probability distributions and . Such conditional independence relations can be represented with a Bayesian network or copula functions.

Important named distributions

Named joint distributions that arise frequently in statistics include the multivariate normal distribution, the multivariate stable distribution, the multinomial distribution, the negative multinomial distribution, the multivariate hypergeometric distribution, and the elliptical distribution.

See also


  1. ^ Park,Kun Il (2018). Fundamentals of Probability and Stochastic Processes with Applications to Communications. Springer. ISBN 978-3-319-68074-3.

External links

Bapat–Beg theorem

In probability theory, the Bapat–Beg theorem gives the joint probability distribution of order statistics of independent but not necessarily identically distributed random variables in terms of the cumulative distribution functions of the random variables. Ravindra Bapat and Beg published the theorem in 1989, though they did not offer a proof. A simple proof was offered by Hande in 1994.Often, all elements of the sample are obtained from the same population and thus have the same probability distribution. The Bapat–Beg theorem describes the order statistics when each element of the sample is obtained from a different statistical population and therefore has its own probability distribution.

Chow–Liu tree

In probability theory and statistics Chow–Liu tree is an efficient method for constructing a second-order product approximation of a joint probability distribution, first described in a paper by Chow & Liu (1968). The goals of such a decomposition, as with such Bayesian networks in general, may be either data compression or inference.


In probability theory, comonotonicity mainly refers to the perfect positive dependence between the components of a random vector, essentially saying that they can be represented as increasing functions of a single random variable. In two dimensions it is also possible to consider perfect negative dependence, which is called countermonotonicity.

Comonotonicity is also related to the comonotonic additivity of the Choquet integral.The concept of comonotonicity has applications in financial risk management and actuarial science, see e.g. Dhaene et al. (2002a) and Dhaene et al. (2002b). In particular, the sum of the components X1 + X2 + · · · + Xn is the riskiest if the joint probability distribution of the random vector (X1, X2, . . . , Xn) is comonotonic. Furthermore, the α-quantile of the sum equals of the sum of the α-quantiles of its components, hence comonotonic random variables are quantile-additive. In practical risk management terms it means that there is minimal (or eventually no) variance reduction from diversification.

For extensions of comonotonicity, see Jouini & Napp (2004) and Puccetti & Scarsini (2010).

Conditional probability table

In statistics, the conditional probability table (CPT) is defined for a set of discrete and mutually dependent random variables to display conditional probabilities of a single variable with respect to the others (i.e., the probability of each possible value of one variable if we know the values taken on by the other variables). For example, assume there are three random variables where each has states. Then, the conditional probability table of provides the conditional probability values – where the vertical bar means “given the values of” – for each of the K possible values of the variable and for each possible combination of values of This table has cells. In general, for variables with states for each variable the CPT for any one of them has the number of cells equal to the product

A conditional probability table can be put into matrix form. As an example with only two variables, the values of with k and j ranging over K values, create a K×K matrix. This matrix is a stochastic matrix since the columns sum to 1; i.e. for all j. For example, suppose that two binary variables x and y have the joint probability distribution given in this table:

Each of the four central cells shows the probability of a particular combination of x and y values. The first column sum is the probability that x =0 and y equals any of the values it can have – that is, the column sum 6/9 is the marginal probability that x=0. If we want to find the probability that y=0 given that x=0, we compute the fraction of the probabilities in the x=0 column that have the value y=0, which is 4/9 ÷ 6/9 = 4/6. Likewise, in the same column we find that the probability that y=1 given that x=0 is 2/9 ÷ 6/9 = 2/6. In the same way, we can also find the conditional probabilities for y equalling 0 or 1 given that x=1. Combining these pieces of information gives us this table of conditional probabilities for y:

With more than one conditioning variable, the table would still have one row for each potential value of the variable whose conditional probabilities are to be given, and there would be one column for each possible combination of values of the conditioning variables.

Moreover, the number of columns in the table could be substantially expanded to display the probabilities of the variable of interest conditional on specific values of only some, rather than all, of the other variables.


In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.

A distinction must be made between (1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability distribution, and (2) the sample covariance, which in addition to serving as a descriptor of the sample, also serves as an estimated value of the population parameter.

Covariance matrix

In probability theory and statistics, a covariance matrix, also known as auto-covariance matrix, dispersion matrix, variance matrix, or variance–covariance matrix, is a matrix whose element in the i, j position is the covariance between the i-th and j-th elements of a random vector. A random vector is a random variable with multiple dimensions. Each element of the vector is a scalar random variable. Each element has either a finite number of observed empirical values or a finite or infinite number of potential values. The potential values are specified by a theoretical joint probability distribution.

Intuitively, the covariance matrix generalizes the notion of variance to multiple dimensions. As an example, the variation in a collection of random points in two-dimensional space cannot be characterized fully by a single number, nor would the variances in the and directions contain all of the necessary information; a matrix would be necessary to fully characterize the two-dimensional variation.

Because the covariance of the i-th random variable with itself is simply that random variable's variance, each element on the principal diagonal of the covariance matrix is the variance of one of the random variables. Because the covariance of the i-th random variable with the j-th one is the same thing as the covariance of the j-th random variable with the i-th random variable, every covariance matrix is symmetric. Also, every covariance matrix is positive semi-definite.

The auto-covariance matrix of a random vector is typically denoted by or .

Cross-covariance matrix

In probability theory and statistics, a cross-covariance matrix is a matrix whose element in the i, j position is the covariance between the i-th element of a random vector and j-th element of another random vector. A random vector is a random variable with multiple dimensions. Each element of the vector is a scalar random variable. Each element has either a finite number of observed empirical values or a finite or infinite number of potential values. The potential values are specified by a theoretical joint probability distribution. Intuitively, the cross-covariance matrix generalizes the notion of covariance to multiple dimensions.

The cross-covariance matrix of two random vectors and is typically denoted by or .

Empirical risk minimization

Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance.

Exchangeable random variables

In statistics, an exchangeable sequence of random variables (also sometimes interchangeable) is a sequence X1X2X3, ... (which may be finitely or infinitely long) whose joint probability distribution does not change when the positions in the sequence in which finitely many of them appear are altered. Thus, for example the sequences

both have the same joint probability distribution.

It is closely related to the use of independent and identically distributed random variables in statistical models. Exchangeable sequences of random variables arise in cases of simple random sampling.

Generalization error

In supervised learning applications in machine learning and statistical learning theory, generalization error (also known as the out-of-sample error) is a measure of how accurately an algorithm is able to predict outcome values for previously unseen data. Because learning algorithms are evaluated on finite samples, the evaluation of a learning algorithm may be sensitive to sampling error. As a result, measurements of prediction error on the current data may not provide much information about predictive ability on new data. Generalization error can be minimized by avoiding overfitting in the learning algorithm. The performance of a machine learning algorithm is measured by plots of the generalization error values through the learning process, which are called learning curves.

Generative model

In statistical classification, including machine learning, two main approaches are called the generative approach and the discriminative approach. These compute classifiers by different approaches, differing in the degree of statistical modelling. Terminology is inconsistent, but three major types can be distinguished, following Jebara (2004):

The distinction between these last two classes is not consistently made; Jebara (2004) refers to these three classes as generative learning, conditional learning, and discriminative learning, but Ng & Jordan (2002) only distinguishes two classes, calling them generative classifiers (joint distribution) and discriminative classifiers (conditional distribution or no distribution), not distinguishing between the latter two classes. Analogously, a classifier based on a generative model is a generative classifier, while a classifier based on a discriminative model is a discriminative classifier, though this term also refers to classifiers that are not based on a model. Standard examples of each, all of which are linear classifiers, are: generative classifiers: naive Bayes classifier and linear discriminant analysis; discriminative model: logistic regression; non-model classifier: perceptron and support vector machine.

In application to classification, one wishes to go from an observation x to a label y (or probability distribution on labels). One can compute this directly, without using a probability distribution (distribution-free classifier); one can estimate the probability of a label given an observation, (discriminative model), and base classification on that; or one can estimate the joint distribution (generative model), from that compute the conditional probability , and then base classification on that. These are increasingly indirect, but increasingly probabilistic, allowing more domain knowledge and probability theory to be applied. In practice different approaches are used, depending on the particular problem, and hybrids can combine strengths of multiple approaches.

Marginal distribution

In probability theory and statistics, the marginal distribution of a subset of a collection of random variables is the probability distribution of the variables contained in the subset. It gives the probabilities of various values of the variables in the subset without reference to the values of the other variables. This contrasts with a conditional distribution, which gives the probabilities contingent upon the values of the other variables.

Marginal variables are those variables in the subset of variables being retained. These concepts are "marginal" because they can be found by summing values in a table along rows or columns, and writing the sum in the margins of the table. The distribution of the marginal variables (the marginal distribution) is obtained by marginalizing – that is, focusing on the sums in the margin – over the distribution of the variables being discarded, and the discarded variables are said to have been marginalized out.

The context here is that the theoretical studies being undertaken, or the data analysis being done, involves a wider set of random variables but that attention is being limited to a reduced number of those variables. In many applications, an analysis may start with a given collection of random variables, then first extend the set by defining new ones (such as the sum of the original random variables) and finally reduce the number by placing interest in the marginal distribution of a subset (such as the sum). Several different analyses may be done, each treating a different subset of variables as the marginal variables.

Probability distribution

In probability theory and statistics, a probability distribution is a mathematical function that provides the probabilities of occurrence of different possible outcomes in an experiment. In more technical terms, the probability distribution is a description of a random phenomenon in terms of the probabilities of events. For instance, if the random variable X is used to denote the outcome of a coin toss ("the experiment"), then the probability distribution of X would take the value 0.5 for X = heads, and 0.5 for X = tails (assuming the coin is fair). Examples of random phenomena can include the results of an experiment or survey.

A probability distribution is specified in terms of an underlying sample space, which is the set of all possible outcomes of the random phenomenon being observed. The sample space may be the set of real numbers or a set of vectors, or it may be a list of non-numerical values; for example, the sample space of a coin flip would be {heads, tails} .

Probability distributions are generally divided into two classes. A discrete probability distribution (applicable to the scenarios where the set of possible outcomes is discrete, such as a coin toss or a roll of dice) can be encoded by a discrete list of the probabilities of the outcomes, known as a probability mass function. On the other hand, a continuous probability distribution (applicable to the scenarios where the set of possible outcomes can take on values in a continuous range (e.g. real numbers), such as the temperature on a given day) is typically described by probability density functions (with the probability of any individual outcome actually being 0). The normal distribution is a commonly encountered continuous probability distribution. More complex experiments, such as those involving stochastic processes defined in continuous time, may demand the use of more general probability measures.

A probability distribution whose sample space is the set of real numbers is called univariate, while a distribution whose sample space is a vector space is called multivariate. A univariate distribution gives the probabilities of a single random variable taking on various alternative values; a multivariate distribution (a joint probability distribution) gives the probabilities of a random vector – a list of two or more random variables – taking on various combinations of values. Important and commonly encountered univariate probability distributions include the binomial distribution, the hypergeometric distribution, and the normal distribution. The multivariate normal distribution is a commonly encountered multivariate distribution.


In statistical theory, a pseudolikelihood is an approximation to the joint probability distribution of a collection of random variables. The practical use of this is that it can provide an approximation to the likelihood function of a set of observed data which may either provide a computationally simpler problem for estimation, or may provide a way of obtaining explicit estimates of model parameters.

The pseudolikelihood approach was introduced by Julian Besag in the context of analysing data having spatial dependence.

Random neural network

The random neural network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks as well as to Gene Regulatory Network models. Each cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the networks. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady-state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops.

A highly energy-efficient implementation of random neural networks was demonstrated by Krishna Palem et al. using the Probabilistic CMOS or PCMOS technology and was shown to be c. 226–300 times more efficient in terms of Energy-Performance-Product.RNNs are also related to artificial neural networks, which (like the random neural network) have gradient-based learning algorithms. The learning algorithm for an n-node random neural network that includes feedback loops (it is also a recurrent neural network) is of computational complexity O(n^3) (the number of computations is proportional to the cube of n, the number of neurons). The random neural network can also be used with other learning algorithms such as reinforcement learning. The RNN has been shown to be a universal approximator for bounded and continuous functions.

Sampling distribution

In statistics, a sampling distribution or finite-sample distribution is the probability distribution of a given random-sample-based statistic. If an arbitrarily large number of samples, each involving multiple observations (data points), were separately used in order to compute one value of a statistic (such as, for example, the sample mean or sample variance) for each sample, then the sampling distribution is the probability distribution of the values that the statistic takes on. In many contexts, only one sample is observed, but the sampling distribution can be found theoretically.

Sampling distributions are important in statistics because they provide a major simplification en route to statistical inference. More specifically, they allow analytical considerations to be based on the probability distribution of a statistic, rather than on the joint probability distribution of all the individual sample values.

Stationary process

In mathematics and statistics, a stationary process (a.k.a. a strict/strictly stationary process or strong/strongly stationary process) is a stochastic process whose unconditional joint probability distribution does not change when shifted in time. Consequently, parameters such as mean and variance also do not change over time.

Since stationarity is an assumption underlying many statistical procedures used in time series analysis, non-stationary data is often transformed to become stationary. The most common cause of violation of stationarity is a trend in the mean, which can be due either to the presence of a unit root or of a deterministic trend. In the former case of a unit root, stochastic shocks have permanent effects, and the process is not mean-reverting. In the latter case of a deterministic trend, the process is called a trend stationary process, and stochastic shocks have only transitory effects after which the variable tends toward a deterministically evolving (non-constant) mean.

A trend stationary process is not strictly stationary, but can easily be transformed into a stationary process by removing the underlying trend, which is solely a function of time. Similarly, processes with one or more unit roots can be made stationary through differencing. An important type of non-stationary process that does not include a trend-like behavior is a cyclostationary process, which is a stochastic process that varies cyclically with time.

For many applications strict-sense stationarity is too restrictive. Other forms of stationarity such as wide-sense stationarity or N-th order stationarity are then employed. The definitions for different kinds of stationarity are not consistent among different authors (see Other terminology).

Stationary sequence

In probability theory – specifically in the theory of stochastic processes, a stationary sequence is a random sequence whose joint probability distribution is invariant over time. If a random sequence X j is stationary then the following holds:

where F is the joint cumulative distribution function of the random variables in the subscript.

If a sequence is stationary then it is wide-sense stationary.

If a sequence is stationary then it has a constant mean (which may not be finite):

Uncertain data

In computer science, uncertain data is data that contains noise that makes it deviate from the correct, intended or original values. In the age of big data, uncertainty or data veracity is one of the defining characteristics of data. Data is constantly growing in volume, variety, velocity and uncertainty (1/veracity). Uncertain data is found in abundance today on the web, in sensor networks, within enterprises both in their structured and unstructured sources. For example, there may be uncertainty regarding the address of a customer in an enterprise dataset, or the temperature readings captured by a sensor due to aging of the sensor. In 2012 IBM called out managing uncertain data at scale in its global technology outlook report that presents a comprehensive analysis looking three to ten years into the future seeking to identify significant, disruptive technologies that will change the world. In order to make confident business decisions based on real-world data, analyses must necessarily account for many different kinds of uncertainty present in very large amounts of data. Analyses based on uncertain data will have an effect on the quality of subsequent decisions, so the degree and types of inaccuracies in this uncertain data cannot be ignored.

Uncertain data is found in the area of sensor networks; text where noisy text is found in abundance on social media, web and within enterprises where the structured and unstructured data may be old, outdated, or plain incorrect; in modeling where the mathematical model may only be an approximation of the actual process. When representing such data in a database, some indication of the probability of the correctness of the various values also needs to be estimated.

There are three main models of uncertain data in databases. In attribute uncertainty, each uncertain attribute in a tuple is subject to its own independent probability distribution. For example, if readings are taken of temperature and wind speed, each would be described by its own probability distribution, as knowing the reading for one measurement would not provide any information about the other.

In correlated uncertainty, multiple attributes may be described by a joint probability distribution. For example, if readings are taken of the position of an object, and the x- and y-coordinates stored, the probability of different values may depend on the distance from the recorded coordinates. As distance depends on both coordinates, it may be appropriate to use a joint distribution for these coordinates, as they are not independent.

In tuple uncertainty, all the attributes of a tuple are subject to a joint probability distribution. This covers the case of correlated uncertainty, but also includes the case where there is a probability of a tuple not belonging in the relevant relation, which is indicated by all the probabilities not summing to one. For example, assume we have the following tuple from a probabilistic database:

Then, the tuple has 10% chance of not existing in the database.

This page is based on a Wikipedia article written by authors (here).
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.