Survival function

The survival function gives the probability that a patient, device, or other object of interest will survive beyond any specified time.[1]

The survival function is also known as the survivor function[2] or reliability function.[3]

The term reliability function is common in engineering while the term survival function is used in a broader range of applications, including human mortality. Another name for the survival function is the complementary cumulative distribution function.

Definition

Let T be a continuous random variable with cumulative distribution function F(t) on the interval [0,∞). Its survival function or reliability function is:

$$S(t) = P(T > t) = 1 - F(t).$$

Examples of survival functions

The graphs below show examples of hypothetical survival functions. The x-axis is time. The y-axis is the proportion of subjects surviving. The graphs show the probability that a subject will survive beyond time t.

Figure: Four survival functions

For example, for survival function 1, the probability of surviving longer than t = 2 months is 0.37. That is, 37% of subjects survive more than 2 months.

Figure: Survival function 1

For survival function 2, the probability of surviving longer than t = 2 months is 0.97. That is, 97% of subjects survive more than 2 months.

Figure: Survival function 2

Median survival may be determined from the survival function. For example, for survival function 2, 50% of the subjects survive longer than 3.72 months. Median survival is thus 3.72 months.

Figure: Survival function median survival
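Reading the median off a curve amounts to solving S(t) = 0.5. The following is a minimal Python sketch of that idea, not a reconstruction of the curves in the figures: it assumes an exponential survival curve with a hypothetical rate of 0.5 per month, chosen only because it gives S(2) ≈ 0.37, and the function names are illustrative. It also returns no median when more than half of the subjects survive the whole observation period.

```python
import math


def exponential_survival(t, rate):
    """S(t) = exp(-rate * t) for t >= 0."""
    return math.exp(-rate * t)


def median_survival(survival, upper, tol=1e-9):
    """Find t in [0, upper] with survival(t) = 0.5 by bisection.

    Returns None when survival(upper) > 0.5, i.e. when the median lies
    beyond the observation period and cannot be read from the curve.
    """
    if survival(upper) > 0.5:
        return None
    lo, hi = 0.0, upper
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if survival(mid) > 0.5:
            lo = mid  # still above 50% survival: median is later
        else:
            hi = mid  # at or below 50% survival: median is earlier
    return (lo + hi) / 2.0


RATE = 0.5  # hypothetical rate per month (an assumption for illustration)


def curve(t):
    return exponential_survival(t, RATE)


def slow_curve(t):
    # Hypothetical curve that stays above 0.5 over a 10-month study.
    return exponential_survival(t, 0.05)


print(round(curve(2.0), 2))                          # 0.37
print(round(median_survival(curve, upper=10.0), 2))  # 1.39
print(median_survival(slow_curve, upper=10.0))       # None
```

Bisection works here because a survival function never increases, so there is a single crossing region for the 0.5 level.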

In some cases, median survival cannot be determined from the graph. For example, for survival function 4, more than 50% of the subjects survive longer than the observation period of 10 months.

Figure: Median survival greater than 10 months

The survival function is one of several ways to describe and display survival data. Another useful way to display data is a graph showing the distribution of survival times of subjects. Olkin,[4] page 426, gives the following example of survival data. The number of hours between successive failures of an air-conditioning system was recorded. The times between successive failures are 1, 3, 5, 7, 11, 11, 11, 12, 14, 14, 14, 16, 16, 20, 21, 23, 42, 47, 52, 62, 71, 71, 87, 90, 95, 120, 120, 225, 246, and 261 hours. The mean time between failures is 59.6 hours. This mean value will be used shortly to fit a theoretical curve to the data. The figure below shows the distribution of the time between failures. The blue tick marks beneath the graph are the actual hours between successive failures.

Figure: Distribution of AC failure times

The distribution of failure times is overlaid with a curve representing an exponential distribution. For this example, the exponential distribution approximates the distribution of failure times. The exponential curve is a theoretical distribution fitted to the actual failure times. This particular exponential curve is specified by the parameter lambda, λ = 1/(mean time between failures) = 1/59.6 = 0.0168 per hour. The distribution of failure times is called the probability density function (pdf) if time can take any positive value. In equations, the pdf is written f(t). If time can only take discrete values (such as 1 day, 2 days, and so on), the distribution of failure times is called the probability mass function (pmf). Most survival analysis methods assume that time can take any positive value, and f(t) is the pdf. If the time between observed air conditioner failures is approximated using the exponential function, then the exponential curve gives the probability density function, f(t), for air conditioner failure times.
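The fit described above can be reproduced in a few lines. This is a minimal Python sketch rather than the method used to draw the figure: the variable and function names are illustrative, and it assumes all 30 recorded gaps are complete (uncensored) observations, in which case 1/mean is also the maximum-likelihood estimate of λ.

```python
import math

# Hours between successive air-conditioning failures (from the text above).
failure_gaps_hours = [
    1, 3, 5, 7, 11, 11, 11, 12, 14, 14, 14, 16, 16, 20, 21,
    23, 42, 47, 52, 62, 71, 71, 87, 90, 95, 120, 120, 225, 246, 261,
]

mean_gap = sum(failure_gaps_hours) / len(failure_gaps_hours)
rate = 1.0 / mean_gap  # lambda, in failures per hour


def exponential_pdf(t, lam):
    """f(t) = lam * exp(-lam * t) for t >= 0, and 0 for negative t."""
    return lam * math.exp(-lam * t) if t >= 0 else 0.0


print(round(mean_gap, 1))  # 59.6
print(round(rate, 4))      # 0.0168
```

The density is highest at t = 0, where it equals λ, and decays from there, which matches the shape of the fitted curve described in the text.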

Another useful way to display the survival data is a graph showing the cumulative failures up to each time point. These data may be displayed as either the cumulative number or the cumulative proportion of failures up to each time. The graph below shows the cumulative probability (or proportion) of failures at each time for the air conditioning system. The stairstep line in black shows the cumulative proportion of failures. For each step there is a blue tick at the bottom of the graph indicating an observed failure time. The smooth red line represents the exponential curve fitted to the observed data.

Figure: CDF for AC failures

A graph of the cumulative probability of failures up to each time point is called the cumulative distribution function, or CDF. In survival analysis, the cumulative distribution function gives the probability that the survival time is less than or equal to a specific time, t.

Let T be survival time, which is any positive number. A particular time is designated by the lower case letter t. The cumulative distribution function of T is the function

$$F(t) = P(T \le t),$$

where the right-hand side represents the probability that the random variable T is less than or equal to t. If time can take on any positive value, then the cumulative distribution function F(t) is the integral of the probability density function f from 0 to t.

For the air conditioning example, the graph of the CDF below illustrates that the probability that the time to failure is less than or equal to 100 hours is 0.81, as estimated using the exponential curve fit to the data.

Figure: AC time to failure less than or equal to 100 hours
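The value 0.81 can be checked directly from the fitted exponential curve, and compared with the stairstep (empirical) proportion of observed failures. The sketch below is illustrative Python with assumed function names; the empirical proportion is simply the fraction of the 30 recorded gaps that are at most 100 hours.

```python
import math

failure_gaps_hours = [
    1, 3, 5, 7, 11, 11, 11, 12, 14, 14, 14, 16, 16, 20, 21,
    23, 42, 47, 52, 62, 71, 71, 87, 90, 95, 120, 120, 225, 246, 261,
]
rate = len(failure_gaps_hours) / sum(failure_gaps_hours)  # 1 / mean


def exponential_cdf(t, lam):
    """F(t) = 1 - exp(-lam * t) for t >= 0, and 0 for negative t."""
    return 1.0 - math.exp(-lam * t) if t >= 0 else 0.0


def empirical_cdf(t, sample):
    """Fraction of observations less than or equal to t."""
    return sum(1 for x in sample if x <= t) / len(sample)


print(round(exponential_cdf(100, rate), 2))             # 0.81
print(round(empirical_cdf(100, failure_gaps_hours), 2)) # 0.83
```

The two values are close but not identical, which is expected when a smooth theoretical curve is fitted to a small sample.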

An alternative to graphing the probability that the failure time is less than or equal to 100 hours is to graph the probability that the failure time is greater than 100 hours. The probability that the failure time is greater than 100 hours must be 1 minus the probability that the failure time is less than or equal to 100 hours, because total probability must sum to 1.

This gives

P(failure time > 100 hours) = 1 - P(failure time ≤ 100 hours) = 1 - 0.81 = 0.19.

This relationship generalizes to all failure times:

P(T > t) = 1 - P(T ≤ t) = 1 - cumulative distribution function.

This relationship is shown on the graphs below. The graph on the left is the cumulative distribution function, which is P(T ≤ t). The graph on the right is P(T > t) = 1 - P(T ≤ t). The graph on the right is the survival function, S(t). The fact that S(t) = 1 - CDF is the reason that another name for the survival function is the complementary cumulative distribution function.

Figure: Survival function is 1 - CDF
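The complement relationship can be tabulated for the fitted air-conditioner curve. This is an illustrative Python sketch that assumes the exponential fit with mean 59.6 hours from the example; the function names are not from the source.

```python
import math

MEAN_GAP_HOURS = 59.6          # mean time between failures from the example
RATE = 1.0 / MEAN_GAP_HOURS    # fitted exponential rate per hour


def cdf(t):
    """P(T <= t) under the fitted exponential curve."""
    return 1.0 - math.exp(-RATE * t)


def survival(t):
    """P(T > t), computed as the complement of the CDF."""
    return 1.0 - cdf(t)


# Each row shows t, F(t), S(t); the last two always sum to 1.
for t in (0, 50, 100, 200):
    print(t, round(cdf(t), 2), round(survival(t), 2))
```

For the exponential case the complement simplifies to S(t) = exp(-λt), so the row for t = 100 reproduces the 0.81 and 0.19 quoted in the text.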

Parametric survival functions

In some cases, such as the air conditioner example, the distribution of survival times may be approximated well by a function such as the exponential distribution. Several distributions are commonly used in survival analysis, including the exponential, Weibull, gamma, normal, log-normal, and log-logistic.[3][5] These distributions are defined by parameters. The normal (Gaussian) distribution, for example, is defined by the two parameters mean and standard deviation. Survival functions that are defined by parameters are said to be parametric.

In the four survival function graphs shown above, the shape of the survival function is defined by a particular probability distribution: survival function 1 is defined by an exponential distribution, 2 is defined by a Weibull distribution, 3 is defined by a log-logistic distribution, and 4 is defined by another Weibull distribution.

Exponential survival function

For an exponential survival distribution, the probability of failure is the same in every time interval, no matter the age of the individual or device. This fact leads to the "memoryless" property of the exponential survival distribution: the age of a subject has no effect on the probability of failure in the next time interval. The exponential may be a good model for the lifetime of a system where parts are replaced as they fail.[6] It may also be useful for modeling survival of living organisms over short intervals. It is not likely to be a good model of the complete lifespan of a living organism.[7] As Efron and Hastie [8] (p. 134) note, "If human lifetimes were exponential there wouldn't be old or young people, just lucky or unlucky ones".
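The memoryless property can be checked numerically: the probability of surviving a further 100 hours, given survival to some age, is S(age + 100) / S(age), and for an exponential curve this does not depend on the age. The Python sketch below is illustrative only; it reuses the rate fitted in the air-conditioner example, and the function names are assumptions.

```python
import math

RATE = 1.0 / 59.6  # per hour, taken from the air-conditioner example


def survival(t, rate):
    """Exponential survival function S(t) = exp(-rate * t)."""
    return math.exp(-rate * t)


def conditional_survival(extra, age, rate):
    """P(T > age + extra | T > age) = S(age + extra) / S(age)."""
    return survival(age + extra, rate) / survival(age, rate)


# The same value is printed for every age: the past does not matter.
for age in (0, 50, 500):
    print(age, round(conditional_survival(100, age, RATE), 4))
```

Each line prints 0.1868, the same as the unconditional probability S(100).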

Weibull survival function

A key assumption of the exponential survival function is that the hazard rate is constant. For instance, if a constant 10% of the men still alive in a cohort died each year, the hazard rate would be constant. The assumption of constant hazard may not be appropriate. For example, among most living organisms, the risk of death is greater in old age than in middle age; that is, the hazard rate increases with time. For some diseases, such as breast cancer, the risk of recurrence is lower after 5 years; that is, the hazard rate decreases with time. The Weibull distribution extends the exponential distribution to allow constant, increasing, or decreasing hazard rates.
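The three hazard shapes can be illustrated with a short Python sketch. It assumes the common shape and scale parametrization of the Weibull distribution, S(t) = exp(-(t/scale)^shape) with hazard (shape/scale)(t/scale)^(shape-1); the text above does not state a parametrization, and the scale of 10 and the shape values are hypothetical.

```python
import math


def weibull_survival(t, shape, scale):
    """S(t) = exp(-(t / scale) ** shape) for t > 0."""
    return math.exp(-((t / scale) ** shape))


def weibull_hazard(t, shape, scale):
    """h(t) = (shape / scale) * (t / scale) ** (shape - 1) for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


SCALE = 10.0              # hypothetical scale parameter
TIMES = (1.0, 5.0, 20.0)  # hypothetical evaluation times

# shape < 1: decreasing hazard; shape = 1: constant; shape > 1: increasing.
for shape in (0.5, 1.0, 2.0):
    hazards = [weibull_hazard(t, shape, SCALE) for t in TIMES]
    print(shape, [round(h, 4) for h in hazards])
```

With shape equal to 1 the hazard is the constant 1/scale and the survival function reduces to the exponential case, which is why the Weibull is described as an extension of the exponential.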

Other parametric survival functions

There are several other parametric survival functions that may provide a better fit to a particular data set, including normal, lognormal, log-logistic, and gamma. The choice of parametric distribution for a particular application can be made using graphical methods or using formal tests of fit. These distributions and tests are described in textbooks on survival analysis.[1][3] Lawless [9] has extensive coverage of parametric models.

Parametric survival functions are commonly used in manufacturing applications, in part because they enable estimation of the survival function beyond the observation period. However, appropriate use of parametric functions requires that data are well modeled by the chosen distribution. If an appropriate distribution is not available, or cannot be specified before a clinical trial or experiment, then non-parametric survival functions offer a useful alternative.

Non-parametric survival functions

A parametric model of survival may not be possible or desirable. In these situations, the most common method to model the survival function is the non-parametric Kaplan–Meier estimator.

Properties

Every survival function S(t) is monotonically decreasing (non-increasing), i.e. S(u) ≤ S(t) for all u > t.

It is a property of a random variable that maps a set of events, usually associated with mortality or failure of some system, onto time.

The time, t = 0, represents some origin, typically the beginning of a study or the start of operation of some system. S(0) is commonly unity but can be less to represent the probability that the system fails immediately upon operation.

Since the CDF is a right-continuous function, the survival function is also right-continuous.

See also

References

  1. Kleinbaum, David G.; Klein, Mitchel (2012), Survival analysis: A Self-learning text (Third ed.), Springer, ISBN 978-1441966452
  2. Tableman, Mara; Kim, Jong Sung (2003), Survival Analysis Using S (First ed.), Chapman and Hall/CRC, ISBN 978-1584884088
  3. Ebeling, Charles (2010), An Introduction to Reliability and Maintainability Engineering (Second ed.), Waveland Press, ISBN 978-1577666257
  4. Olkin, Ingram; Gleser, Leon; Derman, Cyrus (1994), Probability Models and Applications (Second ed.), Macmillan, ISBN 0-02-389220-X
  5. Klein, John; Moeschberger, Melvin (2005), Survival Analysis: Techniques for Censored and Truncated Data (Second ed.), Springer, ISBN 978-0387953991
  6. Mendenhall, William; Sincich, Terry (2007), Statistics for Engineering and the Sciences (Fifth ed.), Pearson / Prentice Hall, ISBN 978-0131877061
  7. Broström, Göran (2012), Event History Analysis with R (First ed.), Chapman & Hall/CRC, ISBN 978-1439831649
  8. Efron, Bradley; Hastie, Trevor (2016), Computer Age Statistical Inference: Algorithms, Evidence, and Data Science (First ed.), Cambridge University Press, ISBN 978-1107149892
  9. Lawless, Jerald (2002), Statistical Models and Methods for Lifetime Data (Second ed.), Wiley, ISBN 978-0471372158

Accelerated failure time model

In the statistical area of survival analysis, an accelerated failure time model (AFT model) is a parametric model that provides an alternative to the commonly used proportional hazards models. Whereas a proportional hazards model assumes that the effect of a covariate is to multiply the hazard by some constant, an AFT model assumes that the effect of a covariate is to accelerate or decelerate the life course of a disease by some constant. This is especially appealing in a technical context where the 'disease' is a result of some mechanical process with a known sequence of intermediary stages.

Data collection

Data collection is the process of gathering and measuring information on targeted variables in an established system, which then enables one to answer relevant questions and evaluate outcomes. Data collection is a component of research in all fields of study including physical and social sciences, humanities, and business. While methods vary by discipline, the emphasis on ensuring accurate and honest collection remains the same. The goal for all data collection is to capture quality evidence that allows analysis to lead to the formulation of convincing and credible answers to the questions that have been posed.

De Moivre's law

De Moivre's Law is a survival model applied in actuarial science, named for Abraham de Moivre. It is a simple law of mortality based on a linear survival function.
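A linear survival function is easy to write down. The Python sketch below assumes the usual statement of the law, S(x) = 1 - x/ω for ages between 0 and a limiting age ω; the formula and the limiting age of 100 used in the example are assumptions for illustration and are not given in the summary above.

```python
def de_moivre_survival(age, omega):
    """Linear survival function: S(x) = 1 - x / omega on [0, omega]."""
    if omega <= 0:
        raise ValueError("omega must be positive")
    if age <= 0:
        return 1.0
    if age >= omega:
        return 0.0
    return 1.0 - age / omega


OMEGA = 100.0  # hypothetical limiting age in years

print(de_moivre_survival(25.0, OMEGA))  # 0.75
print(de_moivre_survival(50.0, OMEGA))  # 0.5
```

Under this assumption the median age at death is ω/2, since the line crosses 0.5 halfway to the limiting age.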

Empirical distribution function

In statistics, an empirical distribution function is the distribution function associated with the empirical measure of a sample. This cumulative distribution function is a step function that jumps up by 1/n at each of the n data points. Its value at any specified value of the measured variable is the fraction of observations of the measured variable that are less than or equal to the specified value.

The empirical distribution function is an estimate of the cumulative distribution function that generated the points in the sample. It converges with probability 1 to that underlying distribution, according to the Glivenko–Cantelli theorem. A number of results exist to quantify the rate of convergence of the empirical distribution function to the underlying cumulative distribution function.

Fan chart (statistics)

A fan chart is made of a group of dispersion fan diagrams, which may be positioned according to two categorising dimensions. A dispersion fan diagram is a circular diagram which reports the same information about a dispersion as a box plot: namely median, quartiles, and two extreme values.

Force of mortality

In actuarial science, force of mortality represents the instantaneous rate of mortality at a certain age measured on an annualized basis. It is identical in concept to failure rate, also called hazard function, in reliability theory.

Frequentist inference

Frequentist inference is a type of statistical inference that draws conclusions from sample data by emphasizing the frequency or proportion of the data. An alternative name is frequentist statistics. This is the inference framework on which the well-established methodologies of statistical hypothesis testing and confidence intervals are based. Other than frequentist inference, the main alternative approach to statistical inference is Bayesian inference, while another is fiducial inference.

While "Bayesian inference" is sometimes held to include the approach to inference leading to optimal decisions, a more restricted view is taken here for simplicity.

Kaplan–Meier estimator

The Kaplan–Meier estimator, also known as the product limit estimator, is a non-parametric statistic used to estimate the survival function from lifetime data. In medical research, it is often used to measure the fraction of patients living for a certain amount of time after treatment. In other fields, Kaplan–Meier estimators may be used to measure the length of time people remain unemployed after a job loss, the time-to-failure of machine parts, or how long fleshy fruits remain on plants before they are removed by frugivores. The estimator is named after Edward L. Kaplan and Paul Meier, who each submitted similar manuscripts to the Journal of the American Statistical Association. The journal editor, John Tukey, convinced them to combine their work into one paper, which has been cited about 50,000 times since its publication.

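The estimator is given by:

$$\hat{S}(t) = \prod_{i:\, t_i \le t} \left(1 - \frac{d_i}{n_i}\right),$$

with $t_i$ a time when at least one event happened, $d_i$ the number of events (i.e., deaths) that happened at time $t_i$, and $n_i$ the number of individuals known to have survived (have not yet had an event or been censored) up to time $t_i$.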
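The product form of the estimator translates almost directly into code. The following is a minimal Python sketch, not a substitute for a tested statistics library: the function name and the six-subject data set are hypothetical, and it uses the common convention that a subject censored at time t still counts as at risk at t.

```python
from collections import Counter


def kaplan_meier(durations, event_observed):
    """Return [(t_i, S_hat(t_i))] for each distinct event time t_i.

    durations: follow-up time of each subject.
    event_observed: True if the event occurred at that time,
        False if the subject was censored then.
    """
    deaths = Counter(t for t, e in zip(durations, event_observed) if e)
    estimate = 1.0
    curve = []
    for t in sorted(deaths):
        # n_i: subjects whose follow-up reaches at least time t.
        at_risk = sum(1 for d in durations if d >= t)
        estimate *= 1.0 - deaths[t] / at_risk
        curve.append((t, estimate))
    return curve


# Hypothetical data: times in months, False marks a censored subject.
durations = [2, 3, 3, 5, 7, 8]
observed = [True, True, False, True, False, True]

for t, s in kaplan_meier(durations, observed):
    print(t, round(s, 4))
```

For this data the estimate steps down to 5/6 at month 2, 2/3 at month 3, 4/9 at month 5, and 0 at month 8. Censored subjects never cause a step themselves; they only shrink the at-risk count for later event times.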

Log-Cauchy distribution

In probability theory, a log-Cauchy distribution is a probability distribution of a random variable whose logarithm is distributed in accordance with a Cauchy distribution. If X is a random variable with a Cauchy distribution, then Y = exp(X) has a log-Cauchy distribution; likewise, if Y has a log-Cauchy distribution, then X = log(Y) has a Cauchy distribution.

Lévy flight

A Lévy flight, named for French mathematician Paul Lévy, is a random walk in which the step-lengths have a probability distribution that is heavy-tailed. When defined as a walk in a space of dimension greater than one, the steps made are in isotropic random directions.

The term "Lévy flight" was coined by Benoît Mandelbrot, who used this for one specific definition of the distribution of step sizes. He used the term Cauchy flight for the case where the distribution of step sizes is a Cauchy distribution, and Rayleigh flight for when the distribution is a normal distribution (which is not an example of a heavy-tailed probability distribution).

Later researchers have extended the use of the term "Lévy flight" to include cases where the random walk takes place on a discrete grid rather than on a continuous space.

The particular case for which Mandelbrot used the term "Lévy flight" is defined by the survivor function (commonly known as the survival function) of the distribution of step-sizes, U, being

$$\Pr(U > u) = \begin{cases} 1 & u < 1, \\ u^{-D} & u \ge 1. \end{cases}$$

Here D is a parameter related to the fractal dimension and the distribution is a particular case of the Pareto distribution. Later researchers allow the distribution of step sizes to be any distribution for which the survival function has a power-like tail

$$\Pr(U > u) = O(u^{-k}),$$

for some k satisfying 1 < k < 3. (Here the notation O is the Big O notation.) Such distributions have an infinite variance. Typical examples are the symmetric stable distributions.
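A step-size distribution with a power-law survival function Pr(U > u) = u^(-D) for u ≥ 1 can be sampled by inverse transform: if V is uniform on (0, 1], then U = V^(-1/D) has that survival function. The Python sketch below is illustrative; the value D = 1.5, the seed, and the function names are assumptions.

```python
import random


def sample_step_length(d, rng):
    """Draw U with Pr(U > u) = u ** (-d) for u >= 1 (inverse transform)."""
    v = 1.0 - rng.random()  # uniform on (0, 1], so v is never zero
    return v ** (-1.0 / d)


def empirical_survival(u, sample):
    """Fraction of the sample strictly greater than u."""
    return sum(1 for x in sample if x > u) / len(sample)


D = 1.5                 # hypothetical tail parameter
rng = random.Random(0)  # fixed seed so the run is repeatable
steps = [sample_step_length(D, rng) for _ in range(100_000)]

# The empirical proportion above u = 2 should be close to 2 ** (-D).
print(empirical_survival(2.0, steps), 2.0 ** (-D))
```

Because the tail is heavy, a small fraction of the sampled steps is very large, which is the feature that distinguishes such a walk from one with normally distributed steps.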

Mills ratio

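In probability theory, the Mills ratio (or Mills's ratio) of a continuous random variable X is the function

$$m(x) := \frac{\bar{F}(x)}{f(x)},$$

where f(x) is the probability density function, and

$$\bar{F}(x) := \Pr[X > x] = \int_x^{+\infty} f(u)\,du$$

is the complementary cumulative distribution function (also called survival function). The concept is named after John P. Mills. The Mills ratio is related to the hazard rate h(x), which is defined as

$$h(x) := \lim_{\delta \to 0} \frac{1}{\delta} \Pr[x < X \le x + \delta \mid X > x]$$

by

$$m(x) = \frac{1}{h(x)}.$$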
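The reciprocal relationship between the Mills ratio and the hazard rate can be checked numerically. The Python sketch below uses the standard normal distribution purely as an illustrative choice; the function names are assumptions, and the survival function is computed with the complementary error function from the standard library.

```python
import math


def normal_pdf(x):
    """Density of the standard normal distribution."""
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)


def normal_survival(x):
    """Pr[X > x] for a standard normal X, via the complementary erf."""
    return 0.5 * math.erfc(x / math.sqrt(2.0))


def mills_ratio(x):
    """m(x) = survival(x) / pdf(x)."""
    return normal_survival(x) / normal_pdf(x)


def hazard_rate(x):
    """h(x) = pdf(x) / survival(x)."""
    return normal_pdf(x) / normal_survival(x)


for x in (0.0, 1.0, 2.0):
    print(x, round(mills_ratio(x), 4), round(mills_ratio(x) * hazard_rate(x), 4))
```

At x = 0 the ratio equals the square root of π/2, about 1.2533, and the product of the Mills ratio and the hazard rate is 1 at every point.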

Outline of statistics

Statistics is a field of inquiry that studies the collection, analysis, interpretation, and presentation of data. It is applicable to a wide variety of academic disciplines, from the physical and social sciences to the humanities; it is also used and misused for making informed decisions in all areas of business and government.

Shape parameter

In probability theory and statistics, a shape parameter is a kind of numerical parameter of a parametric family of probability distributions. Specifically, a shape parameter is any parameter of a probability distribution that is neither a location parameter nor a scale parameter (nor a function of either or both of these only, such as a rate parameter). Such a parameter must affect the shape of a distribution rather than simply shifting it (as a location parameter does) or stretching/shrinking it (as a scale parameter does).

Statistical graphics

Statistical graphics, also known as graphical techniques, are graphics in the field of statistics used to visualize quantitative data.

Statistical population

In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all possible hands in a game of poker). A common aim of statistical analysis is to produce information about some chosen population. In statistical inference, a subset of the population (a statistical sample) is chosen to represent the population in a statistical analysis. The ratio of the size of this statistical sample to the size of the population is called a sampling fraction. It is then possible to estimate the population parameters using the appropriate sample statistics.

Statistician

A statistician is a person who works with theoretical or applied statistics. The profession exists in both the private and public sectors. It is common to combine statistical knowledge with expertise in other subjects, and statisticians may work as employees or as statistical consultants.

Survival analysis

Survival analysis is a branch of statistics for analyzing the expected duration of time until one or more events happen, such as death in biological organisms and failure in mechanical systems. This topic is called reliability theory or reliability analysis in engineering, duration analysis or duration modelling in economics, and event history analysis in sociology. Survival analysis attempts to answer questions such as: what is the proportion of a population which will survive past a certain time? Of those that survive, at what rate will they die or fail? Can multiple causes of death or failure be taken into account? How do particular circumstances or characteristics increase or decrease the probability of survival?

To answer such questions, it is necessary to define "lifetime". In the case of biological survival, death is unambiguous, but for mechanical reliability, failure may not be well-defined, for there may well be mechanical systems in which failure is partial, a matter of degree, or not otherwise localized in time. Even in biological problems, some events (for example, heart attack or other organ failure) may have the same ambiguity. The theory outlined below assumes well-defined events at specific times; other cases may be better treated by models which explicitly account for ambiguous events.

More generally, survival analysis involves the modelling of time to event data; in this context, death or failure is considered an "event" in the survival analysis literature – traditionally only a single event occurs for each subject, after which the organism or mechanism is dead or broken. Recurring event or repeated event models relax that assumption. The study of recurring events is relevant in systems reliability, and in many areas of social sciences and medical research.

Survivorship curve

A survivorship curve is a graph showing the number or proportion of individuals surviving to each age for a given species or group (e.g. males or females). Survivorship curves can be constructed for a given cohort (a group of individuals of roughly the same age) based on a life table.

There are three generalized types of survivorship curves:

Type I or convex curves are characterized by high age-specific survival probability in early and middle life, followed by a rapid decline in survival in later life. They are typical of species that produce few offspring but care for them well, including humans and many other large mammals.

Type II or diagonal curves are an intermediate between Types I and III, where roughly constant mortality rate/survival probability is experienced regardless of age. Some birds and some lizards follow this pattern.

Type III or concave curves have the greatest mortality (lowest age-specific survival) early in life, with relatively low rates of death (high probability of survival) for those surviving this bottleneck. This type of curve is characteristic of species that produce a large number of offspring (see r/K selection theory). This includes most marine invertebrates. For example, oysters produce millions of eggs, but most larvae die from predation or other causes; those that survive long enough to produce a hard shell live relatively long.

The number or proportion of organisms surviving to any age is plotted on the y-axis (generally with a logarithmic scale starting with 1000 individuals), while their age (often as a proportion of maximum life span) is plotted on the x-axis.

In mathematical statistics, the survival function is one specific form of survivorship curve and plays a basic part in survival analysis.

There are various reasons that a species exhibits its particular survivorship curve, but one contributor can be environmental factors that decrease survival. For example, an outside element that is nondiscriminatory in the ages that it affects (of a particular species) is likely to yield a Type II survivorship curve, in which the young and old are equally likely to be affected. On the other hand, an outside element that preferentially reduces the survival of young individuals is likely to yield a Type III curve. Finally, if an outside element only reduces the survival of organisms later in life, this is likely to yield a Type I curve.

Time domain

Time domain is the analysis of mathematical functions, physical signals or time series of economic or environmental data, with respect to time. In the time domain, the signal or function's value is known for all real numbers, for the case of continuous time, or at various separate instants in the case of discrete time. An oscilloscope is a tool commonly used to visualize real-world signals in the time domain. A time-domain graph shows how a signal changes with time, whereas a frequency-domain graph shows how much of the signal lies within each given frequency band over a range of frequencies.

This page is based on a Wikipedia article.
Text is available under the CC BY-SA 3.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.