Jeffreys prior

In Bayesian probability, the Jeffreys prior, named after Sir Harold Jeffreys, is a non-informative (objective) prior distribution for a parameter space; it is proportional to the square root of the determinant of the Fisher information matrix:

${\displaystyle p\left({\vec {\theta }}\right)\propto {\sqrt {\det {\mathcal {I}}\left({\vec {\theta }}\right)}}.\,}$

It has the key feature that its functional dependence on the likelihood ${\displaystyle L}$ is invariant under reparameterization of the parameter vector ${\displaystyle {\vec {\theta }}}$ (the functional form of the prior density function itself is not invariant under reparameterization, of course: only the measure that is identically zero has that property; see below). This makes it of special interest for use with scale parameters.[1]

Reparameterization

One-parameter case

For an alternative parameterization ${\displaystyle \varphi }$ we can derive

${\displaystyle p(\varphi )\propto {\sqrt {I(\varphi )}}\,}$

from

${\displaystyle p(\theta )\propto {\sqrt {I(\theta )}}\,}$

using the change of variables theorem for transformations and the definition of Fisher information:

{\displaystyle {\begin{aligned}p(\varphi )&=p(\theta )\left|{\frac {d\theta }{d\varphi }}\right|\\&\propto {\sqrt {I(\theta )\left({\frac {d\theta }{d\varphi }}\right)^{2}}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d\ln L}{d\theta }}\right)^{2}\right]\left({\frac {d\theta }{d\varphi }}\right)^{2}}}\\&={\sqrt {\operatorname {E} \!\left[\left({\frac {d\ln L}{d\theta }}{\frac {d\theta }{d\varphi }}\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d\ln L}{d\varphi }}\right)^{2}\right]}}\\&={\sqrt {I(\varphi )}}.\end{aligned}}}

That is, the functional form of the prior ${\displaystyle p(\cdot )}$ can be derived from that of the likelihood ${\displaystyle L(\cdot )}$ using the same procedure for both parametrizations.

Note, however, that the form of the prior is different for the two parametrizations. For example, if ${\displaystyle p(\theta )=1/\theta }$ (as in the case of the normal distribution, see below), and ${\displaystyle \varphi =\ln(\theta )}$, then ${\displaystyle p(\varphi )=1}$, which is obviously different from ${\displaystyle 1/\varphi }$.
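
The change-of-variables identity behind this example can be checked numerically. The sketch below (plain Python; the sample points are arbitrary choices) transports ${\displaystyle p(\theta )=1/\theta }$ through ${\displaystyle \varphi =\ln(\theta )}$ and confirms the resulting density in ${\displaystyle \varphi }$ is constant:

```python
import math

# Sketch: verify p(phi) = p(theta) |d theta / d phi| for the example
# in the text, where p(theta) = 1/theta and phi = ln(theta).

def p_theta(theta):
    return 1.0 / theta          # Jeffreys prior for a scale parameter

def p_phi(phi):
    theta = math.exp(phi)       # invert phi = ln(theta)
    return p_theta(theta) * abs(theta)  # |d theta / d phi| = e^phi = theta

for phi in [-2.0, 0.0, 1.5, 3.0]:
    print(round(p_phi(phi), 12))   # constant 1.0, i.e. uniform in phi
```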

Multiple-parameter case

For an alternative parameterization ${\displaystyle {\vec {\varphi }}}$ we can derive

${\displaystyle p({\vec {\varphi }})\propto {\sqrt {\det I({\vec {\varphi }})}}\,}$

from

${\displaystyle p({\vec {\theta }})\propto {\sqrt {\det I({\vec {\theta }})}}\,}$

using the change of variables theorem for transformations, the definition of Fisher information, and that the product of determinants is the determinant of the matrix product:

{\displaystyle {\begin{aligned}p({\vec {\varphi }})&=p({\vec {\theta }})\left|\det {\frac {\partial \theta _{i}}{\partial \varphi _{j}}}\right|\\&\propto {\sqrt {\det I({\vec {\theta }})\,{\det }^{2}{\frac {\partial \theta _{i}}{\partial \varphi _{j}}}}}\\&={\sqrt {\det {\frac {\partial \theta _{k}}{\partial \varphi _{i}}}\,\det \operatorname {E} \!\left[{\frac {\partial \ln L}{\partial \theta _{k}}}{\frac {\partial \ln L}{\partial \theta _{l}}}\right]\,\det {\frac {\partial \theta _{l}}{\partial \varphi _{j}}}}}\\&={\sqrt {\det \operatorname {E} \!\left[\sum _{k,l}{\frac {\partial \theta _{k}}{\partial \varphi _{i}}}{\frac {\partial \ln L}{\partial \theta _{k}}}{\frac {\partial \ln L}{\partial \theta _{l}}}{\frac {\partial \theta _{l}}{\partial \varphi _{j}}}\right]}}\\&={\sqrt {\det \operatorname {E} \!\left[{\frac {\partial \ln L}{\partial \varphi _{i}}}{\frac {\partial \ln L}{\partial \varphi _{j}}}\right]}}={\sqrt {\det I({\vec {\varphi }})}}.\end{aligned}}}

Attributes

From a practical and mathematical standpoint, a valid reason to use this non-informative prior instead of others, like the ones obtained through a limit in conjugate families of distributions, is that its construction from the likelihood does not depend on the set of parameter variables that is chosen to describe parameter space. It is not the only prior with this property, however. As is clear from the derivation above, instead of ${\displaystyle \ln(L)}$ we could use any other smooth function ${\displaystyle f(L)}$, and the resulting prior would still have the same kind of invariance property.

Sometimes the Jeffreys prior cannot be normalized, and is thus an improper prior. For example, the Jeffreys prior for the distribution mean is uniform over the entire real line in the case of a Gaussian distribution of known variance.

Use of the Jeffreys prior violates the strong version of the likelihood principle, which is accepted by many, but by no means all, statisticians. When using the Jeffreys prior, inferences about ${\displaystyle {\vec {\theta }}}$ depend not just on the probability of the observed data as a function of ${\displaystyle {\vec {\theta }}}$, but also on the universe of all possible experimental outcomes, as determined by the experimental design, because the Fisher information is computed from an expectation over the chosen universe. Accordingly, the Jeffreys prior, and hence the inferences made using it, may be different for two experiments involving the same ${\displaystyle {\vec {\theta }}}$ parameter even when the likelihood functions for the two experiments are the same—a violation of the strong likelihood principle.

Minimum description length

In the minimum description length approach to statistics, the goal is to describe data as compactly as possible, where the length of a description is measured in bits of the code used. For a parametric family of distributions one compares a code with the best code based on one of the distributions in the parameterized family. The main result is that in exponential families, asymptotically for large sample size, the code based on the distribution that is a mixture of the elements in the exponential family with the Jeffreys prior is optimal. This result holds if one restricts the parameter set to a compact subset in the interior of the full parameter space. If the full parameter space is used, a modified version of the result applies.

Examples

The Jeffreys prior for a parameter (or a set of parameters) depends upon the statistical model.

Gaussian distribution with mean parameter

For the Gaussian distribution of the real value ${\displaystyle x}$

${\displaystyle f(x\mid \mu )={\frac {e^{-(x-\mu )^{2}/2\sigma ^{2}}}{\sqrt {2\pi \sigma ^{2}}}}}$

with ${\displaystyle \sigma }$ fixed, the Jeffreys prior for the mean ${\displaystyle \mu }$ is

{\displaystyle {\begin{aligned}p(\mu )&\propto {\sqrt {I(\mu )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\mu }}\log f(x\mid \mu )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {x-\mu }{\sigma ^{2}}}\right)^{2}\right]}}\\&={\sqrt {\int _{-\infty }^{+\infty }f(x\mid \mu )\left({\frac {x-\mu }{\sigma ^{2}}}\right)^{2}dx}}={\sqrt {1/\sigma ^{2}}}\propto 1.\end{aligned}}}

That is, the Jeffreys prior for ${\displaystyle \mu }$ does not depend upon ${\displaystyle \mu }$; it is the unnormalized uniform distribution on the real line — the distribution that is 1 (or some other fixed constant) for all points. This is an improper prior, and is, up to the choice of constant, the unique translation-invariant distribution on the reals (the Haar measure with respect to addition of reals), corresponding to the mean being a measure of location and translation-invariance corresponding to no information about location.
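
The expectation in this derivation can be approximated directly. A minimal sketch (midpoint integration in pure Python; the grid width and step count are arbitrary choices) shows the Fisher information is ${\displaystyle 1/\sigma ^{2}}$ regardless of ${\displaystyle \mu }$:

```python
import math

# Sketch: approximate I(mu) = E[((x - mu)/sigma^2)^2] for a Gaussian
# with known sigma by midpoint integration. The result should be
# 1/sigma^2 for every mu, so the Jeffreys prior for mu is flat.

def fisher_info_mu(mu, sigma, half_width=12.0, steps=20000):
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = mu - half_width + (k + 0.5) * h
        f = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        total += f * ((x - mu) / sigma ** 2) ** 2 * h
    return total

for mu in [-3.0, 0.0, 7.0]:
    print(round(fisher_info_mu(mu, sigma=2.0), 6))   # ~ 1/sigma^2 = 0.25
```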

Gaussian distribution with standard deviation parameter

For the Gaussian distribution of the real value ${\displaystyle x}$

${\displaystyle f(x\mid \sigma )={\frac {e^{-(x-\mu )^{2}/2\sigma ^{2}}}{\sqrt {2\pi \sigma ^{2}}}},}$

with ${\displaystyle \mu }$ fixed, the Jeffreys prior for the standard deviation ${\displaystyle \sigma >0}$ is

{\displaystyle {\begin{aligned}p(\sigma )&\propto {\sqrt {I(\sigma )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\sigma }}\log f(x\mid \sigma )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {(x-\mu )^{2}-\sigma ^{2}}{\sigma ^{3}}}\right)^{2}\right]}}\\&={\sqrt {\int _{-\infty }^{+\infty }f(x\mid \sigma )\left({\frac {(x-\mu )^{2}-\sigma ^{2}}{\sigma ^{3}}}\right)^{2}dx}}={\sqrt {\frac {2}{\sigma ^{2}}}}\propto {\frac {1}{\sigma }}.\end{aligned}}}

Equivalently, the Jeffreys prior for ${\displaystyle \log \sigma =\int d\sigma /\sigma }$ is the unnormalized uniform distribution on the real line, and thus this distribution is also known as the logarithmic prior. Similarly, the Jeffreys prior for ${\displaystyle \log \sigma ^{2}=2\log \sigma }$ is also uniform. It is the unique (up to a multiple) prior (on the positive reals) that is scale-invariant (the Haar measure with respect to multiplication of positive reals), corresponding to the standard deviation being a measure of scale and scale-invariance corresponding to no information about scale. As with the uniform distribution on the reals, it is an improper prior.
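
The same numerical check works for the scale parameter. This sketch (arbitrary grid choices again) integrates the expression from the derivation above and recovers ${\displaystyle I(\sigma )=2/\sigma ^{2}}$, hence a prior proportional to ${\displaystyle 1/\sigma }$:

```python
import math

# Sketch: approximate I(sigma) = E[(((x-mu)^2 - sigma^2)/sigma^3)^2]
# by midpoint integration; the result should be 2/sigma^2.

def fisher_info_sigma(sigma, mu=0.0, half_width=None, steps=40000):
    half_width = half_width or 10 * sigma
    h = 2 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = mu - half_width + (k + 0.5) * h
        f = math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / math.sqrt(2 * math.pi * sigma ** 2)
        total += f * (((x - mu) ** 2 - sigma ** 2) / sigma ** 3) ** 2 * h
    return total

for sigma in [0.5, 1.0, 3.0]:
    print(round(fisher_info_sigma(sigma), 6))   # ~ 2/sigma^2
```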

Poisson distribution with rate parameter

For the Poisson distribution of the non-negative integer ${\displaystyle n}$,

${\displaystyle f(n\mid \lambda )=e^{-\lambda }{\frac {\lambda ^{n}}{n!}},}$

the Jeffreys prior for the rate parameter ${\displaystyle \lambda >0}$ is

{\displaystyle {\begin{aligned}p(\lambda )&\propto {\sqrt {I(\lambda )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\lambda }}\log f(n\mid \lambda )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {n-\lambda }{\lambda }}\right)^{2}\right]}}\\&={\sqrt {\sum _{n=0}^{+\infty }f(n\mid \lambda )\left({\frac {n-\lambda }{\lambda }}\right)^{2}}}={\sqrt {\frac {1}{\lambda }}}.\end{aligned}}}

Equivalently, the Jeffreys prior for ${\displaystyle {\sqrt {\lambda }}=\int d\lambda /{\sqrt {\lambda }}}$ is the unnormalized uniform distribution on the non-negative real line.
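
The Poisson expectation is a series rather than an integral, so it can be evaluated by truncation. A sketch (the truncation point is an arbitrary choice; the pmf is built recursively to avoid overflowing factorials):

```python
import math

# Sketch: evaluate I(lambda) = sum_n f(n|lambda) ((n - lambda)/lambda)^2
# by truncating the series; the result should be 1/lambda, giving the
# Jeffreys prior p(lambda) ∝ 1/sqrt(lambda).

def fisher_info_poisson(lam, n_max=200):
    f = math.exp(-lam)                  # P(N = 0)
    total = f * ((0 - lam) / lam) ** 2
    for n in range(1, n_max + 1):
        f *= lam / n                    # P(N = n) from P(N = n - 1)
        total += f * ((n - lam) / lam) ** 2
    return total

for lam in [0.5, 2.0, 10.0]:
    print(round(fisher_info_poisson(lam), 6))   # ~ 1/lambda
```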

Bernoulli trial

For a coin that is "heads" with probability ${\displaystyle \gamma \in [0,1]}$ and is "tails" with probability ${\displaystyle 1-\gamma }$, for a given ${\displaystyle (H,T)\in \{(0,1),(1,0)\}}$ the probability is ${\displaystyle \gamma ^{H}(1-\gamma )^{T}}$. The Jeffreys prior for the parameter ${\displaystyle \gamma }$ is

{\displaystyle {\begin{aligned}p(\gamma )&\propto {\sqrt {I(\gamma )}}={\sqrt {\operatorname {E} \!\left[\left({\frac {d}{d\gamma }}\log f(x\mid \gamma )\right)^{2}\right]}}={\sqrt {\operatorname {E} \!\left[\left({\frac {H}{\gamma }}-{\frac {T}{1-\gamma }}\right)^{2}\right]}}\\&={\sqrt {\gamma \left({\frac {1}{\gamma }}-{\frac {0}{1-\gamma }}\right)^{2}+(1-\gamma )\left({\frac {0}{\gamma }}-{\frac {1}{1-\gamma }}\right)^{2}}}={\frac {1}{\sqrt {\gamma (1-\gamma )}}}\,.\end{aligned}}}

This is the arcsine distribution and is a beta distribution with ${\displaystyle \alpha =\beta =1/2}$. Furthermore, if ${\displaystyle \gamma =\sin ^{2}(\theta )}$, then the Jeffreys prior for ${\displaystyle \theta }$ is uniform in the interval ${\displaystyle [0,\pi /2]}$. Equivalently, ${\displaystyle \theta }$ is uniform on the whole circle ${\displaystyle [0,2\pi ]}$.
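
The two-point expectation above, and the link to the Beta(1/2, 1/2) normalizing constant ${\displaystyle B(1/2,1/2)=\Gamma (1/2)^{2}/\Gamma (1)=\pi }$, can be sketched as follows (the evaluation point 0.25 is arbitrary):

```python
import math

# Sketch: the Bernoulli Fisher information as the two-outcome
# expectation from the text, and the arcsine-prior normalizer.

def fisher_info_bernoulli(gamma):
    # E[(H/gamma - T/(1-gamma))^2] over (H,T) = (1,0) and (0,1)
    return (gamma * (1 / gamma) ** 2
            + (1 - gamma) * (1 / (1 - gamma)) ** 2)

print(round(fisher_info_bernoulli(0.25), 6))   # 1/(0.25 * 0.75) ≈ 5.333333

beta_half_half = math.gamma(0.5) ** 2 / math.gamma(1.0)
print(round(beta_half_half, 6))                # pi ≈ 3.141593
```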

N-sided die with biased probabilities

Similarly, for a throw of an ${\displaystyle N}$-sided die with outcome probabilities ${\displaystyle {\vec {\gamma }}=(\gamma _{1},\ldots ,\gamma _{N})}$, each non-negative and satisfying ${\displaystyle \sum _{i=1}^{N}\gamma _{i}=1}$, the Jeffreys prior for ${\displaystyle {\vec {\gamma }}}$ is the Dirichlet distribution with all (alpha) parameters set to one half. This amounts to using a pseudocount of one half for each possible outcome.

Alternatively, if we write ${\displaystyle \gamma _{i}={\phi _{i}}^{2}}$ for each ${\displaystyle i}$, then the Jeffreys prior for ${\displaystyle {\vec {\phi }}}$ is uniform on the (N−1)-dimensional unit sphere (i.e., it is uniform on the surface of an N-dimensional unit ball).
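
The half-pseudocount reading of the Dirichlet(1/2, …, 1/2) prior can be sketched with invented counts (the rolls below are purely illustrative):

```python
# Sketch: with a Dirichlet(1/2, ..., 1/2) Jeffreys prior, the posterior
# mean after observing counts x_i is (x_i + 1/2) / (N + d/2), i.e. a
# pseudocount of one half per outcome. Counts are invented.

counts = [3, 0, 1, 6]                  # hypothetical rolls of a 4-sided die
N, d = sum(counts), len(counts)
posterior_mean = [(x + 0.5) / (N + 0.5 * d) for x in counts]
print([round(p, 4) for p in posterior_mean])
print(round(sum(posterior_mean), 12))  # probabilities sum to 1
```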

References

1. Jaynes, E. T. (1968). "Prior Probabilities". IEEE Transactions on Systems Science and Cybernetics. SSC-4: 227.

• Jeffreys, H. (1946). "An Invariant Form for the Prior Probability in Estimation Problems". Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences. 186 (1007): 453–461. doi:10.1098/rspa.1946.0056. JSTOR 97883.
• Jeffreys, H. (1939). Theory of Probability. Oxford University Press.

In statistics, additive smoothing, also called Laplace smoothing (not to be confused with Laplacian smoothing as used in image processing), or Lidstone smoothing, is a technique used to smooth categorical data. Given an observation ${\textstyle \scriptstyle {\mathbf {x} \ =\ \left\langle x_{1},\,x_{2},\,\ldots ,\,x_{d}\right\rangle }}$ from a multinomial distribution with ${\textstyle \scriptstyle {N}}$ trials, a "smoothed" version of the data gives the estimator:

${\displaystyle {\hat {\theta }}_{i}={\frac {x_{i}+\alpha }{N+\alpha d}}\qquad (i=1,\ldots ,d),}$

where the "pseudocount" α > 0 is a smoothing parameter, with α = 0 corresponding to no smoothing. Additive smoothing is a type of shrinkage estimator, as the resulting estimate will be between the empirical probability (relative frequency) ${\textstyle \scriptstyle {\frac {x_{i}}{N}}}$, and the uniform probability ${\textstyle \scriptstyle {\frac {1}{d}}}$. Invoking Laplace's rule of succession, some authors have argued that α should be 1 (in which case the term add-one smoothing is also used), though in practice a smaller value is typically chosen.

From a Bayesian point of view, this corresponds to the expected value of the posterior distribution, using a symmetric Dirichlet distribution with parameter α as a prior distribution. In the special case where the number of categories is 2, this is equivalent to using a beta distribution as the conjugate prior for the parameter of the binomial distribution.
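
The shrinkage behavior is easy to see numerically. In this sketch (category counts invented for illustration), α = 0 reproduces the relative frequencies and a very large α pulls the estimate toward the uniform 1/d:

```python
# Sketch: additive smoothing with pseudocount alpha; the smoothed
# estimate interpolates between the empirical frequency x_i/N and
# the uniform probability 1/d.

def smoothed(counts, alpha):
    N, d = sum(counts), len(counts)
    return [(x + alpha) / (N + alpha * d) for x in counts]

counts = [8, 1, 1]                       # hypothetical category counts
for alpha in [0.0, 1.0, 100.0]:
    print([round(p, 4) for p in smoothed(counts, alpha)])
# alpha = 0 gives relative frequencies; large alpha approaches uniform 1/3
```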

Arcsine distribution

In probability theory, the arcsine distribution is the probability distribution whose cumulative distribution function is

${\displaystyle F(x)={\frac {2}{\pi }}\arcsin \left({\sqrt {x}}\right)={\frac {\arcsin(2x-1)}{\pi }}+{\frac {1}{2}}}$

for 0 ≤ x ≤ 1, and whose probability density function is

${\displaystyle f(x)={\frac {1}{\pi {\sqrt {x(1-x)}}}}}$

on (0, 1). The standard arcsine distribution is a special case of the beta distribution with α = β = 1/2. That is, if ${\displaystyle X}$ is the standard arcsine distribution then ${\displaystyle X\sim {\rm {Beta}}{\bigl (}{\tfrac {1}{2}},{\tfrac {1}{2}}{\bigr )}}$.
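
The two closed forms of the CDF given above agree, and differentiating either one recovers the density; a quick numerical sketch (evaluation points arbitrary):

```python
import math

# Sketch: check that the two arcsine-CDF expressions agree on (0, 1),
# and that a central difference of F recovers 1/(pi*sqrt(x(1-x))).

def F1(x): return (2 / math.pi) * math.asin(math.sqrt(x))
def F2(x): return math.asin(2 * x - 1) / math.pi + 0.5
def pdf(x): return 1.0 / (math.pi * math.sqrt(x * (1 - x)))

for x in [0.1, 0.5, 0.9]:
    assert abs(F1(x) - F2(x)) < 1e-12
    h = 1e-6
    assert abs((F1(x + h) - F1(x - h)) / (2 * h) - pdf(x)) < 1e-5
print("both forms agree")
```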


Bayesian inference

Bayesian inference is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available. Bayesian inference is an important technique in statistics, and especially in mathematical statistics. Bayesian updating is particularly important in the dynamic analysis of a sequence of data. Bayesian inference has found application in a wide range of activities, including science, engineering, philosophy, medicine, sport, and law. In the philosophy of decision theory, Bayesian inference is closely related to subjective probability, often called "Bayesian probability".

Bayesian network

A Bayesian network, Bayes network, belief network, decision network, Bayes(ian) model or probabilistic directed acyclic graphical model is a probabilistic graphical model (a type of statistical model) that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bayesian networks are ideal for taking an event that occurred and predicting the likelihood that any one of several possible known causes was the contributing factor. For example, a Bayesian network could represent the probabilistic relationships between diseases and symptoms. Given symptoms, the network can be used to compute the probabilities of the presence of various diseases.

Efficient algorithms can perform inference and learning in Bayesian networks. Bayesian networks that model sequences of variables (e.g. speech signals or protein sequences) are called dynamic Bayesian networks. Generalizations of Bayesian networks that can represent and solve decision problems under uncertainty are called influence diagrams.
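
The disease–symptom query described above can be sketched with the smallest possible network: one disease node and one symptom node, with all probabilities invented for illustration, and the posterior computed by direct enumeration:

```python
# Sketch: exact inference in a two-node Bayesian network
# Disease -> Symptom; all probabilities are invented.

p_disease = 0.01                             # P(D = true)
p_symptom_given = {True: 0.9, False: 0.05}   # P(S = true | D)

# enumeration: P(D | S=true) = P(S|D) P(D) / sum_d P(S|d) P(d)
num = p_symptom_given[True] * p_disease
den = num + p_symptom_given[False] * (1 - p_disease)
posterior = num / den
print(round(posterior, 4))   # 0.009 / 0.0585 ≈ 0.1538
```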

Beta distribution

In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval [0, 1] parametrized by two positive shape parameters, denoted by α and β, that appear as exponents of the random variable and control the shape of the distribution. It is a special case of the Dirichlet distribution.

The beta distribution has been applied to model the behavior of random variables limited to intervals of finite length in a wide variety of disciplines.

In Bayesian inference, the beta distribution is the conjugate prior probability distribution for the Bernoulli, binomial, negative binomial and geometric distributions. For example, the beta distribution can be used in Bayesian analysis to describe initial knowledge concerning probability of success such as the probability that a space vehicle will successfully complete a specified mission. The beta distribution is a suitable model for the random behavior of percentages and proportions.

The usual formulation of the beta distribution is also known as the beta distribution of the first kind, whereas beta distribution of the second kind is an alternative name for the beta prime distribution.

Bias of an estimator

In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Bias is related to consistency in that consistent estimators are convergent and asymptotically unbiased (hence converge to the correct value as the number of data points grows arbitrarily large), though individual estimators in a consistent sequence may be biased (so long as the bias converges to zero); see bias versus consistency.

All else being equal, an unbiased estimator is preferable to a biased estimator, but in practice all else is not equal, and biased estimators are frequently used, generally with small bias. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population or is difficult to compute (as in unbiased estimation of standard deviation); because an estimator is median-unbiased but not mean-unbiased (or the reverse); because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful. Further, mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is; for example, the sample variance is an unbiased estimator for the population variance, but its square root, the sample standard deviation, is a biased estimator for the population standard deviation.
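
The transformation point can be illustrated by simulation. This sketch (sample size, replication count, and seed are arbitrary choices) shows the Bessel-corrected sample variance averaging close to the true σ² = 1, while its square root averages noticeably below σ = 1:

```python
import math
import random

# Sketch: sample variance is unbiased for sigma^2 = 1, but the sample
# standard deviation (its square root) is biased low for sigma = 1.

random.seed(0)
n, reps = 5, 100000
var_sum = std_sum = 0.0
for _ in range(reps):
    xs = [random.gauss(0.0, 1.0) for _ in range(n)]
    m = sum(xs) / n
    s2 = sum((x - m) ** 2 for x in xs) / (n - 1)   # Bessel's correction
    var_sum += s2
    std_sum += math.sqrt(s2)

print(round(var_sum / reps, 3))   # close to 1.0 (unbiased)
print(round(std_sum / reps, 3))   # noticeably below 1.0 (biased)
```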

Binomial proportion confidence interval

In statistics, a binomial proportion confidence interval is a confidence interval for the probability of success calculated from the outcome of a series of success–failure experiments (Bernoulli trials). In other words, a binomial proportion confidence interval is an interval estimate of a success probability p when only the number of experiments n and the number of successes n_S are known.

There are several formulas for a binomial confidence interval, but all of them rely on the assumption of a binomial distribution. In general, a binomial distribution applies when an experiment is repeated a fixed number of times, each trial of the experiment has two possible outcomes (success and failure), the probability of success is the same for each trial, and the trials are statistically independent. Because the binomial distribution is a discrete probability distribution (i.e., not continuous) and difficult to calculate for large numbers of trials, a variety of approximations are used to calculate this confidence interval, all with their own tradeoffs in accuracy and computational intensity.

A simple example of a binomial distribution is the set of various possible outcomes, and their probabilities, for the number of heads observed when a coin is flipped ten times. The observed binomial proportion is the fraction of the flips that turn out to be heads. Given this observed proportion, the confidence interval for the true probability of the coin landing on heads is a range of possible proportions, which may or may not contain the true proportion. A 95% confidence interval for the proportion, for instance, will contain the true proportion 95% of the times that the procedure for constructing the confidence interval is employed.
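
Two of the standard approximations mentioned above can be sketched directly: the normal-approximation (Wald) interval and the Wilson score interval, here for an invented outcome of 7 heads in 10 flips with z = 1.96 for roughly 95% coverage:

```python
import math

# Sketch: approximate ~95% intervals for a binomial proportion.

def wald(n_s, n, z=1.96):
    p = n_s / n
    half = z * math.sqrt(p * (1 - p) / n)
    return (p - half, p + half)

def wilson(n_s, n, z=1.96):
    p = n_s / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (center - half, center + half)

print([round(v, 4) for v in wald(7, 10)])     # wide, can exceed [0, 1]
print([round(v, 4) for v in wilson(7, 10)])   # stays inside [0, 1]
```

The Wilson interval behaves better near 0 and 1 than the Wald interval, which is one of the accuracy tradeoffs the text alludes to.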

Bures metric

In mathematics, in the area of quantum information geometry, the Bures metric (named after Donald Bures) or Helstrom metric (named after Carl W. Helstrom) defines an infinitesimal distance between density matrix operators defining quantum states. It is a quantum generalization of the Fisher information metric, and is identical to the Fubini–Study metric when restricted to the pure states alone.

Credible interval

In Bayesian statistics, a credible interval is an interval within which an unobserved parameter value falls with a particular subjective probability. It is an interval in the domain of a posterior probability distribution or a predictive distribution. The generalisation to multivariate problems is the credible region. Credible intervals are analogous to confidence intervals in frequentist statistics, although they differ on a philosophical basis; Bayesian intervals treat their bounds as fixed and the estimated parameter as a random variable, whereas frequentist confidence intervals treat their bounds as random variables and the parameter as a fixed value. Also, Bayesian credible intervals use (and indeed, require) knowledge of the situation-specific prior distribution, while the frequentist confidence intervals do not.

For example, in an experiment that determines the distribution of possible values of the parameter ${\displaystyle \mu }$, if the subjective probability that ${\displaystyle \mu }$ lies between 35 and 45 is 0.95, then ${\displaystyle 35\leq \mu \leq 45}$ is a 95% credible interval.

Exponential distribution

In probability theory and statistics, the exponential distribution (also known as the negative exponential distribution) is the probability distribution that describes the time between events in a Poisson point process, i.e., a process in which events occur continuously and independently at a constant average rate. It is a particular case of the gamma distribution. It is the continuous analogue of the geometric distribution, and it has the key property of being memoryless. In addition to being used for the analysis of Poisson point processes it is found in various other contexts.

The exponential distribution is not the same as the class of exponential families of distributions, which is a large class of probability distributions that includes the exponential distribution as one of its members, but also includes the normal distribution, binomial distribution, gamma distribution, Poisson, and many others.

F-distribution

In probability theory and statistics, the F-distribution, also known as Snedecor's F distribution or the Fisher–Snedecor distribution (after Ronald Fisher and George W. Snedecor) is a continuous probability distribution that arises frequently as the null distribution of a test statistic, most notably in the analysis of variance (ANOVA), e.g., F-test.

Fisher information

In mathematical statistics, the Fisher information (sometimes simply called information) is a way of measuring the amount of information that an observable random variable X carries about an unknown parameter θ of a distribution that models X. Formally, it is the variance of the score, or the expected value of the observed information. In Bayesian statistics, the asymptotic distribution of the posterior mode depends on the Fisher information and not on the prior (according to the Bernstein–von Mises theorem, which was anticipated by Laplace for exponential families). The role of the Fisher information in the asymptotic theory of maximum-likelihood estimation was emphasized by the statistician Ronald Fisher (following some initial results by Francis Ysidro Edgeworth). The Fisher information is also used in the calculation of the Jeffreys prior, which is used in Bayesian statistics.

The Fisher information matrix is used to calculate the covariance matrices associated with maximum-likelihood estimates. It can also be used in the formulation of test statistics, such as the Wald test.

Statistical systems of a scientific nature (physical, biological, etc.) whose likelihood functions obey shift invariance have been shown to obey maximum Fisher information. The level of the maximum depends upon the nature of the system constraints.

Haar measure

In mathematical analysis, the Haar measure assigns an "invariant volume" to subsets of locally compact topological groups, consequently defining an integral for functions on those groups.

This measure was introduced by Alfréd Haar in 1933, though its special case for Lie groups had been introduced by Adolf Hurwitz in 1897 under the name "invariant integral". Haar measures are used in many parts of analysis, number theory, group theory, representation theory, statistics, probability theory, and ergodic theory.

Harold Jeffreys

Sir Harold Jeffreys, FRS (22 April 1891 – 18 March 1989) was a British mathematician, statistician, geophysicist, and astronomer. The book that he and Bertha Swirles wrote, Theory of Probability, which first appeared in 1939, played an important role in the revival of the Bayesian view of probability.

Minimum description length

The minimum description length (MDL) principle is a formalization of Occam's razor in which the best hypothesis (a model and its parameters) for a given set of data is the one that leads to the best compression of the data. MDL was introduced by Jorma Rissanen in 1978. It is an important concept in information theory and computational learning theory.

Principle of indifference

The principle of indifference (also called principle of insufficient reason) is a rule for assigning epistemic probabilities. Suppose that there are n > 1 mutually exclusive and collectively exhaustive possibilities. The principle of indifference states that if the n possibilities are indistinguishable except for their names, then each possibility should be assigned a probability equal to 1/n.

In Bayesian probability, this is the simplest non-informative prior. The principle of indifference is meaningless under the frequency interpretation of probability, in which probabilities are relative frequencies rather than degrees of belief in uncertain propositions, conditional upon state information.

Principle of transformation groups

The principle of transformation groups is a rule for assigning epistemic probabilities in a statistical inference problem. It was first suggested by Edwin T. Jaynes and can be seen as a generalisation of the principle of indifference.

This can be seen as a method to create objective ignorance probabilities in the sense that two people who apply the principle and are confronted with the same information will assign the same probabilities.

Prior probability

In Bayesian statistical inference, a prior probability distribution, often simply called the prior, of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account. For example, the prior could be the probability distribution representing the relative proportions of voters who will vote for a particular politician in a future election. The unknown quantity may be a parameter of the model or a latent variable rather than an observable variable.

Bayes' theorem calculates the renormalized pointwise product of the prior and the likelihood function, to produce the posterior probability distribution, which is the conditional distribution of the uncertain quantity given the data.
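
The "renormalized pointwise product" can be sketched on a grid. The prior and data below are invented for illustration: a Jeffreys-style Beta(1/2, 1/2) prior for a coin's heads probability, updated with 7 heads in 10 flips (posterior Beta(7.5, 3.5)):

```python
import math

# Sketch: grid-based Bayesian update — multiply prior by likelihood
# pointwise, then renormalize so the posterior sums to one.

grid = [(k + 0.5) / 1000 for k in range(1000)]         # theta in (0, 1)
prior = [1 / math.sqrt(t * (1 - t)) for t in grid]     # unnormalized
likelihood = [t ** 7 * (1 - t) ** 3 for t in grid]     # binomial kernel

posterior = [p * l for p, l in zip(prior, likelihood)]
norm = sum(posterior)
posterior = [p / norm for p in posterior]              # renormalize

mean = sum(t * p for t, p in zip(grid, posterior))
print(round(mean, 4))   # close to (7 + 0.5) / (10 + 1) ≈ 0.6818
```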

Similarly, the prior probability of a random event or an uncertain proposition is the unconditional probability that is assigned before any relevant evidence is taken into account.

Priors can be created using a number of methods. A prior can be determined from past information, such as previous experiments. A prior can be elicited from the purely subjective assessment of an experienced expert. An uninformative prior can be created to reflect a balance among outcomes when no information is available. Priors can also be chosen according to some principle, such as symmetry or maximizing entropy given constraints; examples are the Jeffreys prior or Bernardo's reference prior. When a family of conjugate priors exists, choosing a prior from that family simplifies calculation of the posterior distribution.

Parameters of prior distributions are a kind of hyperparameter. For example, if one uses a beta distribution to model the distribution of the parameter p of a Bernoulli distribution, then:

• p is a parameter of the underlying system (Bernoulli distribution), and
• α and β are parameters of the prior distribution (beta distribution); hence hyperparameters.

Hyperparameters themselves may have hyperprior distributions expressing beliefs about their values. A Bayesian model with more than one level of prior like this is called a hierarchical Bayes model.

This page is based on a Wikipedia article written by its authors.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.