In statistics, dependence or association is any statistical relationship, whether causal or not, between two random variables or bivariate data. In the broadest sense correlation is any statistical association, though it commonly refers to the degree to which a pair of variables are linearly related. Familiar examples of dependent phenomena include the correlation between the physical statures of parents and their offspring, and the correlation between the demand for a limited supply product and its price.
Correlations are useful because they can indicate a predictive relationship that can be exploited in practice. For example, an electrical utility may produce less power on a mild day based on the correlation between electricity demand and weather. In this example, there is a causal relationship, because extreme weather causes people to use more electricity for heating or cooling. However, in general, the presence of a correlation is not sufficient to infer the presence of a causal relationship (i.e., correlation does not imply causation).
Formally, random variables are dependent if they do not satisfy a mathematical property of probabilistic independence. In informal parlance, correlation is synonymous with dependence. However, when used in a technical sense, correlation refers to any of several specific types of relationship between mean values. There are several correlation coefficients, often denoted or , measuring the degree of correlation. The most common of these is the Pearson correlation coefficient, which is sensitive only to a linear relationship between two variables (which may be present even when one variable is a nonlinear function of the other). Other correlation coefficients have been developed to be more robust than the Pearson correlation – that is, more sensitive to nonlinear relationships. Mutual information can also be applied to measure dependence between two variables.
The most familiar measure of dependence between two quantities is the Pearson product-moment correlation coefficient, or "Pearson's correlation coefficient", commonly called simply "the correlation coefficient". It is obtained by dividing the covariance of the two variables by the product of their standard deviations. Karl Pearson developed the coefficient from a similar but slightly different idea by Francis Galton.
where is the expected value operator, means covariance, and is a widely used alternative notation for the correlation coefficient. The Pearson correlation is defined only if both standard deviations are finite and positive.
The correlation coefficient is symmetric: .
It is a corollary of the Cauchy–Schwarz inequality that the absolute value of the Pearson correlation coefficient is not bigger than 1. The correlation coefficient is +1 in the case of a perfect direct (increasing) linear relationship (correlation), −1 in the case of a perfect decreasing (inverse) linear relationship (anticorrelation), and some value in the open interval in all other cases, indicating the degree of linear dependence between the variables. As it approaches zero there is less of a relationship (closer to uncorrelated). The closer the coefficient is to either −1 or 1, the stronger the correlation between the variables.
If the variables are independent, Pearson's correlation coefficient is 0, but the converse is not true because the correlation coefficient detects only linear dependencies between two variables.
For example, suppose the random variable is symmetrically distributed about zero, and . Then is completely determined by , so that and are perfectly dependent, but their correlation is zero; they are uncorrelated. However, in the special case when and are jointly normal, uncorrelatedness is equivalent to independence.
Given a series of measurements of the pair indexed by , the sample correlation coefficient can be used to estimate the population Pearson correlation between and . The sample correlation coefficient is defined as
Equivalent expressions for are
where and are the uncorrected sample standard deviations of and .
If and are results of measurements that contain measurement error, the realistic limits on the correlation coefficient are not −1 to +1 but a smaller range. For the case of a linear model with a single independent variable, the coefficient of determination (R squared) is the square of , Pearson's product-moment coefficient.
Consider the joint probability distribution of and given in the table below.
For this joint distribution, the marginal distributions are:
This yields the following expecations and variances:
Rank correlation coefficients, such as Spearman's rank correlation coefficient and Kendall's rank correlation coefficient (τ) measure the extent to which, as one variable increases, the other variable tends to increase, without requiring that increase to be represented by a linear relationship. If, as the one variable increases, the other decreases, the rank correlation coefficients will be negative. It is common to regard these rank correlation coefficients as alternatives to Pearson's coefficient, used either to reduce the amount of calculation or to make the coefficient less sensitive to non-normality in distributions. However, this view has little mathematical basis, as rank correlation coefficients measure a different type of relationship than the Pearson product-moment correlation coefficient, and are best seen as measures of a different type of association, rather than as alternative measure of the population correlation coefficient.
To illustrate the nature of rank correlation, and its difference from linear correlation, consider the following four pairs of numbers :
As we go from each pair to the next pair increases, and so does . This relationship is perfect, in the sense that an increase in is always accompanied by an increase in . This means that we have a perfect rank correlation, and both Spearman's and Kendall's correlation coefficients are 1, whereas in this example Pearson product-moment correlation coefficient is 0.7544, indicating that the points are far from lying on a straight line. In the same way if always decreases when increases, the rank correlation coefficients will be −1, while the Pearson product-moment correlation coefficient may or may not be close to −1, depending on how close the points are to a straight line. Although in the extreme cases of perfect rank correlation the two coefficients are both equal (being both +1 or both −1), this is not generally the case, and so values of the two coefficients cannot meaningfully be compared. For example, for the three pairs (1, 1) (2, 3) (3, 2) Spearman's coefficient is 1/2, while Kendall's coefficient is 1/3.
The information given by a correlation coefficient is not enough to define the dependence structure between random variables. The correlation coefficient completely defines the dependence structure only in very particular cases, for example when the distribution is a multivariate normal distribution. (See diagram above.) In the case of elliptical distributions it characterizes the (hyper-)ellipses of equal density; however, it does not completely characterize the dependence structure (for example, a multivariate t-distribution's degrees of freedom determine the level of tail dependence).
The Randomized Dependence Coefficient is a computationally efficient, copula-based measure of dependence between multivariate random variables. RDC is invariant with respect to non-linear scalings of random variables, is capable of discovering a wide range of functional association patterns and takes value zero at independence.
For two binary variables, the odds ratio measures their dependence, and takes range non-negative numbers, possibly infinity: . Related statistics such as Yule's Y and Yule's Q normalize this to the correlation-like range . The odds ratio is generalized by the logistic model to model cases where the dependent variables are discrete and there may be one or more independent variables.
The correlation ratio, entropy-based mutual information, total correlation, dual total correlation and polychoric correlation are all also capable of detecting more general dependencies, as is consideration of the copula between them, while the coefficient of determination generalizes the correlation coefficient to multiple regression.
The degree of dependence between variables and does not depend on the scale on which the variables are expressed. That is, if we are analyzing the relationship between and , most correlation measures are unaffected by transforming to a + bX and to c + dY, where a, b, c, and d are constants (b and d being positive). This is true of some correlation statistics as well as their population analogues. Some correlation statistics, such as the rank correlation coefficient, are also invariant to monotone transformations of the marginal distributions of and/or .
Most correlation measures are sensitive to the manner in which and are sampled. Dependencies tend to be stronger if viewed over a wider range of values. Thus, if we consider the correlation coefficient between the heights of fathers and their sons over all adult males, and compare it to the same correlation coefficient calculated when the fathers are selected to be between 165 cm and 170 cm in height, the correlation will be weaker in the latter case. Several techniques have been developed that attempt to correct for range restriction in one or both variables, and are commonly used in meta-analysis; the most common are Thorndike's case II and case III equations.
Various correlation measures in use may be undefined for certain joint distributions of X and Y. For example, the Pearson correlation coefficient is defined in terms of moments, and hence will be undefined if the moments are undefined. Measures of dependence based on quantiles are always defined. Sample-based statistics intended to estimate population measures of dependence may or may not have desirable statistical properties such as being unbiased, or asymptotically consistent, based on the spatial structure of the population from which the data were sampled.
Sensitivity to the data distribution can be used to an advantage. For example, scaled correlation is designed to use the sensitivity to the range in order to pick out correlations between fast components of time series. By reducing the range of values in a controlled manner, the correlations on long time scale are filtered out and only the correlations on short time scales are revealed.
The correlation matrix of random variables is the matrix whose entry is . If the measures of correlation used are product-moment coefficients, the correlation matrix is the same as the covariance matrix of the standardized random variables for . This applies both to the matrix of population correlations (in which case is the population standard deviation), and to the matrix of sample correlations (in which case denotes the sample standard deviation). Consequently, each is necessarily a positive-semidefinite matrix. Moreover, the correlation matrix is strictly positive definite if no variable can have all its values exactly generated as a linear function of the values of the others.
The correlation matrix is symmetric because the correlation between and is the same as the correlation between and .
In statistical modelling, correlation matrices representing the relationships between variables are categorized into different correlation structures, which are distinguished by factors such as the number of parameters required to estimate them. For example, in an exchangeable correlation matrix, all pairs of variables are modelled as having the same correlation, so all non-diagonal elements of the matrix are equal to each other. On the other hand, an autoregressive matrix is often used when variables represent a time series, since correlations are likely to be greater when measurements are closer in time. Other examples include independent, unstructured, M-dependent, and Toeplitz.
Similarly for two stochastic processes and : If they are independent, then they are uncorrelated.:p. 151
The conventional dictum that "correlation does not imply causation" means that correlation cannot be used to infer a causal relationship between the variables. This dictum should not be taken to mean that correlations cannot indicate the potential existence of causal relations. However, the causes underlying the correlation, if any, may be indirect and unknown, and high correlations also overlap with identity relations (tautologies), where no causal process exists. Consequently, a correlation between two variables is not a sufficient condition to establish a causal relationship (in either direction).
A correlation between age and height in children is fairly causally transparent, but a correlation between mood and health in people is less so. Does improved mood lead to improved health, or does good health lead to good mood, or both? Or does some other factor underlie both? In other words, a correlation can be taken as evidence for a possible causal relationship, but cannot indicate what the causal relationship, if any, might be.
The Pearson correlation coefficient indicates the strength of a linear relationship between two variables, but its value generally does not completely characterize their relationship. In particular, if the conditional mean of given , denoted , is not linear in , the correlation coefficient will not fully determine the form of .
The adjacent image shows scatter plots of Anscombe's quartet, a set of four different pairs of variables created by Francis Anscombe. The four variables have the same mean (7.5), variance (4.12), correlation (0.816) and regression line (y = 3 + 0.5x). However, as can be seen on the plots, the distribution of the variables is very different. The first one (top left) seems to be distributed normally, and corresponds to what one would expect when considering two variables correlated and following the assumption of normality. The second one (top right) is not distributed normally; while an obvious relationship between the two variables can be observed, it is not linear. In this case the Pearson correlation coefficient does not indicate that there is an exact functional relationship: only the extent to which that relationship can be approximated by a linear relationship. In the third case (bottom left), the linear relationship is perfect, except for one outlier which exerts enough influence to lower the correlation coefficient from 1 to 0.816. Finally, the fourth example (bottom right) shows another example when one outlier is enough to produce a high correlation coefficient, even though the relationship between the two variables is not linear.
These examples indicate that the correlation coefficient, as a summary statistic, cannot replace visual examination of the data. Note that the examples are sometimes said to demonstrate that the Pearson correlation assumes that the data follow a normal distribution, but this is not correct.
If a pair of random variables follows a bivariate normal distribution, the conditional mean is a linear function of , and the conditional mean is a linear function of . The correlation coefficient between and , along with the marginal means and variances of and , determines this linear relationship:
where and are the expected values of and , respectively, and and are the standard deviations of and , respectively.
Autocorrelation, also known as serial correlation, is the correlation of a signal with a delayed copy of itself as a function of delay. Informally, it is the similarity between observations as a function of the time lag between them. The analysis of autocorrelation is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.
Different fields of study define autocorrelation differently, and not all of these definitions are equivalent. In some fields, the term is used interchangeably with autocovariance.
Unit root processes, trend stationary processes, autoregressive processes, and moving average processes are specific forms of processes with autocorrelation.Cluster analysis
Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of exploratory data mining, and a common technique for statistical data analysis, used in many fields, including machine learning, pattern recognition, image analysis, information retrieval, bioinformatics, data compression, and computer graphics.
Cluster analysis itself is not one specific algorithm, but the general task to be solved. It can be achieved by various algorithms that differ significantly in their understanding of what constitutes a cluster and how to efficiently find them. Popular notions of clusters include groups with small distances between cluster members, dense areas of the data space, intervals or particular statistical distributions. Clustering can therefore be formulated as a multi-objective optimization problem. The appropriate clustering algorithm and parameter settings (including parameters such as the distance function to use, a density threshold or the number of expected clusters) depend on the individual data set and intended use of the results. Cluster analysis as such is not an automatic task, but an iterative process of knowledge discovery or interactive multi-objective optimization that involves trial and failure. It is often necessary to modify data preprocessing and model parameters until the result achieves the desired properties.
Besides the term clustering, there are a number of terms with similar meanings, including automatic classification, numerical taxonomy, botryology (from Greek βότρυς "grape"), typological analysis, and community detection. The subtle differences are often in the use of the results: while in data mining, the resulting groups are the matter of interest, in automatic classification the resulting discriminative power is of interest.
Cluster analysis was originated in anthropology by Driver and Kroeber in 1932 and introduced to psychology by Joseph Zubin in 1938 and Robert Tryon in 1939 and famously used by Cattell beginning in 1943 for trait theory classification in personality psychology.Correlation diagram
Terms such as correlation diagram(s), diagram(s) of correlation, and the like may refer to:
Data visualization, the general process of presenting information visually
Statistical graphics, images depicting statistical informationIn chemistry, there are several types of correlation diagrams:
Orgel diagrams, images depicting energies of electronic terms in transition metal complexes
Tanabe–Sugano diagrams, images depicting energies of spectroscopic states
Walsh diagrams, images depicting orbital energies as a function of bond angle
Woodward–Hoffmann rules#Correlation diagrams, images correlating reactant orbitals to product orbitalsCovariance
In probability theory and statistics, covariance is a measure of the joint variability of two random variables. If the greater values of one variable mainly correspond with the greater values of the other variable, and the same holds for the lesser values, (i.e., the variables tend to show similar behavior), the covariance is positive. In the opposite case, when the greater values of one variable mainly correspond to the lesser values of the other, (i.e., the variables tend to show opposite behavior), the covariance is negative. The sign of the covariance therefore shows the tendency in the linear relationship between the variables. The magnitude of the covariance is not easy to interpret because it is not normalized and hence depends on the magnitudes of the variables. The normalized version of the covariance, the correlation coefficient, however, shows by its magnitude the strength of the linear relation.
A distinction must be made between (1) the covariance of two random variables, which is a population parameter that can be seen as a property of the joint probability distribution, and (2) the sample covariance, which in addition to serving as a descriptor of the sample, also serves as an estimated value of the population parameter.Degree of coherence
In quantum optics, correlation functions are used to characterize the statistical and coherence properties of an electromagnetic field. The degree of coherence is the normalized correlation of electric fields. In its simplest form, termed , it is useful for quantifying the coherence between two electric fields, as measured in a Michelson or other linear optical interferometer. The correlation between pairs of fields, , typically is used to find the statistical character of intensity fluctuations. First order correlation is actually the amplitude-amplitude correlation and the second order correlation is the intensity-intensity correlation. It is also used to differentiate between states of light that require a quantum mechanical description and those for which classical fields are sufficient. Analogous considerations apply to any Bose field in subatomic physics, in particular to mesons (cf. Bose–Einstein correlations).Functional correlation
In statistics, functional correlation is a dimensionality reduction technique used to quantify the correlation and dependence between two variables when the data is functional. Several approaches have been developed to quantify the relation between two functional variables.Lift (data mining)
In data mining and association rule learning, lift is a measure of the performance of a targeting model (association rule) at predicting or classifying cases as having an enhanced response (with respect to the population as a whole), measured against a random choice targeting model. A targeting model is doing a good job if the response within the target is much better than the average for the population as a whole. Lift is simply the ratio of these values: target response divided by average response.
For example, suppose a population has an average response rate of 5%, but a certain model (or rule) has identified a segment with a response rate of 20%. Then that segment would have a lift of 4.0 (20%/5%).
Typically, the modeller seeks to divide the population into quantiles, and rank the quantiles by lift. Organizations can then consider each quantile, and by weighing the predicted response rate (and associated financial benefit) against the cost, they can decide whether to market to that quantile or not.
Lift is analogous to information retrieval's average precision metric, if one treats the precision (fraction of the positives that are true positives) as the target response probability.
The lift curve can also be considered a variation on the receiver operating characteristic (ROC) curve, and is also known in econometrics as the Lorenz or power curve.
Normally distributed and uncorrelated does not imply independent
In probability theory, although simple examples illustrate that linear uncorrelatedness of two random variables does not in general imply their independence, it is sometimes mistakenly thought that it does imply that when the two random variables are normally distributed. This article demonstrates that assumption of normal distributions does not have that consequence, although the multivariate normal distribution, including the bivariate normal distribution, does.
To say that the pair of random variables has a bivariate normal distribution means that every linear combination of and for constant (i.e. not random) coefficients and has a univariate normal distribution. In that case, if and are uncorrelated then they are independent. However, it is possible for two random variables and to be so distributed jointly that each one alone is marginally normally distributed, and they are uncorrelated, but they are not independent; examples are given below.Pearson correlation coefficient
In statistics, the Pearson correlation coefficient (PCC, pronounced ), also referred to as Pearson's r, the Pearson product-moment correlation coefficient (PPMCC) or the bivariate correlation, is a measure of the linear correlation between two variables X and Y. According to the Cauchy–Schwarz inequality it has a value between +1 and −1, where 1 is total positive linear correlation, 0 is no linear correlation, and −1 is total negative linear correlation. It is widely used in the sciences. It was developed by Karl Pearson from a related idea introduced by Francis Galton in the 1880s and for which the mathematical formula was derived and published by Auguste Bravais in 1844.. The naming of the coefficient is thus an example of Stigler's Law.Random variable
In probability and statistics, a random variable, random quantity, aleatory variable, or stochastic variable is a variable whose possible values are outcomes of a random phenomenon. More specifically, a random variable is defined as a function that maps the outcomes of an unpredictable process to numerical quantities, typically real numbers. It is a variable (specifically a dependent variable), in the sense that it depends on the outcome of an underlying process providing the input to this function, and it is random in the sense that the underlying process is assumed to be random.
A random variable's possible values might represent the possible outcomes of a yet-to-be-performed experiment, or the possible outcomes of a past experiment whose already-existing value is uncertain (for example, because of imprecise measurements or quantum uncertainty). They may also conceptually represent either the results of an "objectively" random process (such as rolling a die) or the "subjective" randomness that results from incomplete knowledge of a quantity. The meaning of the probabilities assigned to the potential values of a random variable is not part of probability theory itself but is instead related to philosophical arguments over the interpretation of probability. The mathematics works the same regardless of the particular interpretation in use.
As a function, a random variable is required to be measurable, which allows for probabilities to be assigned to sets of its potential values. It is common that the outcomes depend on some physical variables that are not predictable. For example, when tossing a fair coin, the final outcome of heads or tails depends on the uncertain physical conditions. Which outcome will be observed is not certain. The coin could get caught in a crack in the floor, but such a possibility is excluded from consideration.
The domain of a random variable is the set of possible outcomes. In the case of the coin, there are only two possible outcomes, namely heads or tails. Since one of these outcomes must occur, either the event that the coin lands heads or the event that the coin lands tails must have non-zero probability.
A random variable has a probability distribution, which specifies the probability of its values. Random variables can be discrete, that is, taking any of a specified finite or countable list of values, endowed with a probability mass function characteristic of the random variable's probability distribution; or continuous, taking any numerical value in an interval or collection of intervals, via a probability density function that is characteristic of the random variable's probability distribution; or a mixture of both types.
Two random variables with the same probability distribution can still differ in terms of their associations with, or independence from, other random variables. The realizations of a random variable, that is, the results of randomly choosing values according to the variable's probability distribution function, are called random variates.
The formal mathematical treatment of random variables is a topic in probability theory. In that context, a random variable is understood as a function defined on a sample space whose outcomes are numerical values.Research design
A research design is the set of methods and procedures used in collecting and analyzing measures of the variables specified in the problem research. The design of a study defines the study type (descriptive, correlation, semi-experimental, experimental, review, meta-analytic) and sub-type (e.g., descriptive-longitudinal case study), research problem, hypotheses, independent and dependent variables, experimental design, and, if applicable, data collection methods and a statistical analysis plan. A research design is a framework that has been created to find answers to research questions.Teenage suicide in the United States
Teenage suicide in the United States remains comparatively high in the 15 to 24 age group with 5,079 suicides in this age range in 2014, making it the second leading cause of death for those aged 15 to 24. By comparison, suicide is the 11th leading cause of death for all those age 10 and over, with 33,289 suicides for all US citizens in 2006.In the United States, for the year 2005, the suicide rate for both males and females age 24 and below was lower than the rate for ages 25 and up.According to the Center for Disease Control and Prevention (CDC), suicide is considered the second leading cause of death among college students, the second leading cause of death for people ages 25–34, and the fourth leading cause of death for adults between the ages of 18 and 65. In 2015, the CDC also stated that an estimated 9.3 million adults, which is roughly 4% of the United States population, had suicidal thoughts in one year alone. 1.3 million adults 18 and older attempted suicide in one year, with 1.1 million actually making plans to commit suicide. Looking at younger teenagers, suicide is the third leading cause of death of individuals aged from 10 to 14. Males and females are known to have different suicidal tendencies. For example, males take their lives almost four times the rate females do. Males also commit approximately 77.9% of all suicides, however, the female population are more likely to have thoughts of suicide than males. Males more commonly use a firearm to commit suicide, while females commonly use a form of poison. College students aged 18–22 are less likely to attempt suicide than teenagers. The most common the suicide method among the female aged 15 to 24 is suffocation according to Suicide Prevention Resource Center.A recent study by the CDC with the help of Johns Hopkins University, Harvard, and Boston Children's Hospital has revealed that suicide rates dropping in certain states has been linked to the legalization of same sex marriage in those same states. Suicide rates as a whole fell about 7% but the rates among specifically gay, lesbian, and bisexual teenagers fell at a rate of 14%. In 2013, an estimated 494,169 people were treated in emergency departments for self-inflicted, non fatal injuries, which left an estimated $10.4 billion in combined medical and work loss costs.Suicide differs through race and ethnic backgrounds. The Center for Disease Control and Prevention ranked suicide as the 8th leading cause for American Indians/Alaska Natives. Hispanic students in grades 9–12 have the following percentages: attempting suicide (18.9%), having made a plan about how they would attempt suicide (15.7%), having attempted suicide (11.3%), and having made a suicide attempt that resulted in an injury, poisoning, or overdose that required medical attention (4.1%). These percentages are consistently higher than white and black students.Potential signs include threatening the well-being of oneself and others through physical violence. Other potentially serious threats could include a shared willingness to run away from home, as well the damaging of property. Individuals may also give away most to all personal belongings, reference suicide or suicidal thought on social media, or various other online platforms, increase their use of drugs or alcohol, sleep too little or too much, or may display extreme mood swings. Parents witnessing such threats are recommended to immediately speak with their child and seek immediate mental health evaluation if further threats are made.Uncorrelatedness (probability theory)
In probability theory and statistics, two real-valued random variables, , , are said to be uncorrelated if their covariance, , is zero. If two variables are uncorrelated, there is no linear relationship between them.
Uncorrelated random variables have a Pearson correlation coefficient of zero, except in the trivial case when either variable has zero variance (is a constant). In this case the correlation is undefined.
In general, uncorrelatedness is not the same as orthogonality, except in the special case where at least one of the two random variables has an expected value of 0. In this case, the covariance is the expectation of the product, and and are uncorrelated if and only if .
If and are independent, with finite second moments, then they are uncorrelated. However, not all uncorrelated variables are independent.