Statistical population

In statistics, a population is a set of similar items or events which is of interest for some question or experiment.[1] A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothetical and potentially infinite group of objects conceived as a generalization from experience (e.g. the set of all possible hands in a game of poker).[2] A common aim of statistical analysis is to produce information about some chosen population.[3]

In statistical inference, a subset of the population (a statistical sample) is chosen to represent the population in a statistical analysis.[4] The ratio of the size of this statistical sample to the size of the population is called a sampling fraction. It is then possible to estimate the population parameters using the appropriate sample statistics.
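
A minimal sketch of these definitions in Python; the population here is simulated, and all sizes and distribution parameters are illustrative assumptions:

```python
import random

random.seed(42)
# A simulated population, e.g. heights in centimetres.
population = [random.gauss(170, 10) for _ in range(10_000)]

# A simple random sample of n = 100 elements.
sample = random.sample(population, 100)
sampling_fraction = len(sample) / len(population)  # n / N = 0.01

# The sample mean is the sample statistic used to estimate the population mean.
estimate = sum(sample) / len(sample)
true_mean = sum(population) / len(population)
print(f"sampling fraction: {sampling_fraction:.3f}")
print(f"sample mean: {estimate:.2f}, population mean: {true_mean:.2f}")
```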

Subpopulation

A subset of a population that shares one or more additional properties is called a subpopulation. For example, if the population is all Egyptian people, a subpopulation is all Egyptian males; if the population is all pharmacies in the world, a subpopulation is all pharmacies in Egypt. By contrast, a sample is a subset of a population that is not chosen to share any additional property.

Descriptive statistics may yield different results for different subpopulations. For instance, a particular medicine may have different effects on different subpopulations, and these effects may be obscured or dismissed if such special subpopulations are not identified and examined in isolation.

Similarly, one can often estimate parameters more accurately if one separates out subpopulations: the distribution of heights among people is better modeled by considering men and women as separate subpopulations, for instance.

Populations consisting of subpopulations can be modeled by mixture models, which combine the distributions within subpopulations into an overall population distribution. Even if subpopulations are well-modeled by given simple models, the overall population may be poorly fit by a given simple model – poor fit may be evidence for the existence of subpopulations. For example, given two equal subpopulations, both normally distributed, if they have the same standard deviation but different means, the overall distribution will exhibit low kurtosis relative to a single normal distribution – the means of the subpopulations fall on the shoulders of the overall distribution. If sufficiently separated, these form a bimodal distribution; otherwise, it simply has a wide peak. Further, it will exhibit overdispersion relative to a single normal distribution with the given variation. Alternatively, given two subpopulations with the same mean but different standard deviations, the overall population will exhibit high kurtosis, with a sharper peak and heavier tails (and correspondingly shallower shoulders) than a single distribution.
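
The kurtosis behaviour described above can be checked numerically. A hedged sketch using NumPy and SciPy; the means, standard deviations, and sample sizes are illustrative:

```python
import numpy as np
from scipy.stats import kurtosis

rng = np.random.default_rng(0)

# Two equal subpopulations, same standard deviation, different means:
# the pooled distribution is flat-topped or bimodal (low kurtosis).
a = np.concatenate([rng.normal(-2, 1, 50_000), rng.normal(2, 1, 50_000)])

# Two equal subpopulations, same mean, different standard deviations:
# the pooled distribution has a sharp peak and heavy tails (high kurtosis).
b = np.concatenate([rng.normal(0, 1, 50_000), rng.normal(0, 3, 50_000)])

# scipy's kurtosis() reports excess kurtosis: 0 for a single normal.
print(f"different means:   excess kurtosis = {kurtosis(a):.2f}")  # negative
print(f"different spreads: excess kurtosis = {kurtosis(b):.2f}")  # positive
```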

References

  1. ^ "Glossary of statistical terms: Population". Statistics.com. Retrieved 22 February 2016.
  2. ^ Weisstein, Eric W. "Statistical population". MathWorld.
  3. ^ Yates, Daniel S.; Moore, David S; Starnes, Daren S. (2003). The Practice of Statistics (2nd ed.). New York: Freeman. ISBN 978-0-7167-4773-4. Archived from the original on 2005-02-09.
  4. ^ "Glossary of statistical terms: Sample". Statistics.com. Retrieved 22 February 2016.

Anthropic principle

The anthropic principle is a philosophical consideration that observations of the universe must be compatible with the conscious and sapient life that observes it. Some proponents of the anthropic principle reason that it explains why this universe has the age and the fundamental physical constants necessary to accommodate conscious life. As a result, they believe it is unremarkable that this universe has fundamental constants that happen to fall within the narrow range thought to be compatible with life.

The strong anthropic principle (SAP), as explained by John D. Barrow and Frank Tipler, states that this is all the case because the universe is in some sense compelled to eventually have conscious and sapient life emerge within it. Some critics of the SAP argue in favor of a weak anthropic principle (WAP) similar to the one defined by Brandon Carter, which states that the universe's ostensible fine tuning is the result of selection bias (specifically survivor bias): i.e., only in a universe capable of eventually supporting life will there be living beings capable of observing and reflecting on the matter. Most often such arguments draw upon some notion of the multiverse for there to be a statistical population of universes to select from and from which selection bias (our observance of only this universe, compatible with our life) could occur.

Bapat–Beg theorem

In probability theory, the Bapat–Beg theorem gives the joint probability distribution of order statistics of independent but not necessarily identically distributed random variables in terms of the cumulative distribution functions of the random variables. Ravindra Bapat and Beg published the theorem in 1989, though they did not offer a proof. A simple proof was offered by Hande in 1994.

Often, all elements of the sample are obtained from the same population and thus have the same probability distribution. The Bapat–Beg theorem describes the order statistics when each element of the sample is obtained from a different statistical population and therefore has its own probability distribution.
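
The full theorem expresses every order statistic via permanents of a matrix of CDF values; as a small illustration, the sketch below computes only the simplest special cases, the distributions of the minimum and maximum of independent, non-identically distributed variables. The distributions and helper names are chosen for the example:

```python
# For independent X_1..X_n with CDFs F_1..F_n (not necessarily identical):
#   P(max <= x) = prod F_i(x)
#   P(min <= x) = 1 - prod (1 - F_i(x))
from scipy.stats import norm, expon

dists = [norm(0, 1), norm(2, 0.5), expon(scale=1.0)]  # illustrative choices

def cdf_max(x):
    p = 1.0
    for d in dists:
        p *= d.cdf(x)  # all n variables must fall at or below x
    return p

def cdf_min(x):
    p = 1.0
    for d in dists:
        p *= 1 - d.cdf(x)  # all n variables must exceed x
    return 1 - p

print(f"P(max <= 2) = {cdf_max(2.0):.4f}")
print(f"P(min <= 0) = {cdf_min(0.0):.4f}")
```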

Bluefield, Virginia

Bluefield is a town in Tazewell County, Virginia, United States, located along the Bluestone River. The population was 5,444 at the 2010 census. It is part of the Bluefield, WV-VA micropolitan area, which has a population of 107,342 and is the 350th-largest statistical population area in the United States.

Box plot

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram. Outliers may be plotted as individual points.

Box plots are non-parametric: they display variation in samples of a statistical population without making any assumptions of the underlying statistical distribution (though Tukey's boxplot assumes symmetry for the whiskers and normality for their length). The spacings between the different parts of the box indicate the degree of dispersion (spread) and skewness in the data, and show outliers. In addition to the points themselves, they allow one to visually estimate various L-estimators, notably the interquartile range, midhinge, range, mid-range, and trimean. Box plots can be drawn either horizontally or vertically. Box plots received their name from the box in the middle.
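
A minimal sketch of drawing box plots with matplotlib; the three data groups are simulated and purely illustrative:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
groups = [rng.normal(0, 1, 200), rng.normal(1, 2, 200), rng.exponential(1, 200)]

fig, ax = plt.subplots()
# The box spans the quartiles Q1..Q3; by default the whiskers extend to the
# most extreme points within 1.5 * IQR of the box (Tukey's convention), and
# points beyond the whiskers are drawn individually as outliers.
ax.boxplot(groups, labels=["normal", "wide normal", "exponential"])
ax.set_ylabel("value")
plt.show()
```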

Cluster sampling

Cluster sampling is a sampling plan used when mutually homogeneous yet internally heterogeneous groupings are evident in a statistical population. It is often used in marketing research. In this sampling plan, the total population is divided into these groups (known as clusters) and a simple random sample of the groups is selected. The elements in each cluster are then sampled. If all elements in each sampled cluster are sampled, then this is referred to as a "one-stage" cluster sampling plan. If a simple random subsample of elements is selected within each of these groups, this is referred to as a "two-stage" cluster sampling plan. A common motivation for cluster sampling is to reduce the total number of interviews and costs given the desired accuracy. For a fixed sample size, the expected random error is smaller when most of the variation in the population is present internally within the groups, and not between the groups.
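
A hedged sketch of one-stage and two-stage cluster sampling in Python; the clusters (e.g. households grouped by city block), their sizes, and their values are illustrative assumptions:

```python
import random

random.seed(7)
# Hypothetical population grouped into 100 clusters of 20-40 elements each.
clusters = {block: [random.gauss(50, 10) for _ in range(random.randint(20, 40))]
            for block in range(100)}

# Stage 1: a simple random sample of the clusters themselves.
chosen_blocks = random.sample(list(clusters), k=10)

# One-stage plan: every element in each chosen cluster is sampled.
one_stage = [x for b in chosen_blocks for x in clusters[b]]

# Two-stage plan: a simple random subsample within each chosen cluster.
two_stage = [x for b in chosen_blocks for x in random.sample(clusters[b], k=5)]

print(f"one-stage sample size: {len(one_stage)}")
print(f"two-stage sample size: {len(two_stage)}")
```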

Counterfactual definiteness

In quantum mechanics, counterfactual definiteness (CFD) is the ability to speak "meaningfully" of the definiteness of the results of measurements that have not been performed (i.e., the ability to assume the existence of objects, and properties of objects, even when they have not been measured). The term "counterfactual definiteness" is used in discussions of physics calculations, especially those related to the phenomenon called quantum entanglement and those related to the Bell inequalities. In such discussions "meaningfully" means the ability to treat these unmeasured results on an equal footing with measured results in statistical calculations. It is this (sometimes assumed but unstated) aspect of counterfactual definiteness that is of direct relevance to physics and mathematical models of physical systems and not philosophical concerns regarding the meaning of unmeasured results.

"Counterfactual" may appear in physics discussions as a noun. What is meant in this context is "a value that could have been measured but, for one reason or another, was not."

Data set

A data set (or dataset) is a collection of data. Most commonly a data set corresponds to the contents of a single database table, or a single statistical data matrix, where every column of the table represents a particular variable, and each row corresponds to a given member of the data set in question. The data set lists values for each of the variables, such as height and weight of an object, for each member of the data set. Each value is known as a datum. The data set may comprise data for one or more members, corresponding to the number of rows.
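
A small illustration of this table structure using pandas; the variables and values are invented for the example:

```python
import pandas as pd

# Each row is one member of the data set; each column is one variable.
data_set = pd.DataFrame({
    "height_cm": [172.0, 165.5, 180.2],
    "weight_kg": [68.0, 59.3, 81.5],
})
print(data_set)
print(data_set.loc[0, "height_cm"])  # a single value is one datum
```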

The term data set may also be used more loosely, to refer to the data in a collection of closely related tables, corresponding to a particular experiment or event. Less commonly used names for this kind of data set are data corpus and data stock. An example of this type is the data sets collected by space agencies performing experiments with instruments aboard space probes. Data sets that are so large that traditional data processing applications are inadequate to deal with them are known as big data.

In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The European Open Data portal aggregates more than half a million data sets. In this field other definitions have been proposed, but currently there is no official one. Other issues (real-time data sources, non-relational data sets, etc.) increase the difficulty of reaching a consensus about it.

Location test

A location test is a statistical hypothesis test that compares the location parameter of a statistical population to a given constant, or that compares the location parameters of two statistical populations to each other. Most commonly, the location parameter (or parameters) of interest are expected values, but location tests based on medians or other measures of location are also used.
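
A brief sketch of common location tests using scipy.stats; the simulated samples and their parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
a = rng.normal(5.2, 1.0, 40)
b = rng.normal(5.0, 1.0, 40)

# One-sample location test: is the population mean equal to 5.0?
t1, p1 = stats.ttest_1samp(a, popmean=5.0)

# Two-sample location test: do the two populations share the same mean?
t2, p2 = stats.ttest_ind(a, b)

# A rank-based alternative when means are a poor summary of location.
u, p3 = stats.mannwhitneyu(a, b)

print(f"one-sample p = {p1:.3f}, two-sample p = {p2:.3f}, rank-based p = {p3:.3f}")
```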

Mean

There are several kinds of mean in various branches of mathematics (especially statistics).

For a data set, the arithmetic mean, also called the mathematical expectation or average, is the central value of a discrete set of numbers: specifically, the sum of the values divided by the number of values. The arithmetic mean of a set of numbers x1, x2, ..., xn is typically denoted by x̄, pronounced "x bar". If the data set were based on a series of observations obtained by sampling from a statistical population, the arithmetic mean is the sample mean (denoted x̄) to distinguish it from the mean of the underlying distribution, the population mean (denoted μ or μₓ).

In probability and statistics, the population mean, or expected value, is a measure of the central tendency either of a probability distribution or of the random variable characterized by that distribution. In the case of a discrete probability distribution of a random variable X, the mean is equal to the sum over every possible value weighted by the probability of that value; that is, it is computed by taking the product of each possible value x of X and its probability p(x), and then adding all these products together, giving μ = Σ x p(x). An analogous formula applies to the case of a continuous probability distribution. Not every probability distribution has a defined mean; see the Cauchy distribution for an example. Moreover, for some distributions the mean is infinite.

For a finite population, the population mean of a property is equal to the arithmetic mean of the given property while considering every member of the population. For example, the population mean height is equal to the sum of the heights of every individual divided by the total number of individuals. The sample mean may differ from the population mean, especially for small samples. The law of large numbers dictates that the larger the size of the sample, the more likely it is that the sample mean will be close to the population mean.
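
A small simulation illustrating the law of large numbers; the population and all parameters are illustrative:

```python
import random

random.seed(3)
population = [random.gauss(170, 10) for _ in range(100_000)]
population_mean = sum(population) / len(population)

# Larger samples tend to produce sample means closer to the population mean.
for n in (10, 100, 10_000):
    sample = random.sample(population, n)
    sample_mean = sum(sample) / n
    print(f"n = {n:>6}: sample mean = {sample_mean:.2f} "
          f"(population mean = {population_mean:.2f})")
```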

Outside probability and statistics, a wide range of other notions of "mean" are often used in geometry and analysis; examples are given below.

Mixture distribution

In probability and statistics, a mixture distribution is the probability distribution of a random variable that is derived from a collection of other random variables as follows: first, a random variable is selected by chance from the collection according to given probabilities of selection, and then the value of the selected random variable is realized. The underlying random variables may be random real numbers, or they may be random vectors (each having the same dimension), in which case the mixture distribution is a multivariate distribution.

In cases where each of the underlying random variables is continuous, the outcome variable will also be continuous and its probability density function is sometimes referred to as a mixture density. The cumulative distribution function (and the probability density function if it exists) can be expressed as a convex combination (i.e. a weighted sum, with non-negative weights that sum to 1) of other distribution functions and density functions. The individual distributions that are combined to form the mixture distribution are called the mixture components, and the probabilities (or weights) associated with each component are called the mixture weights. The number of components in a mixture distribution is often restricted to being finite, although in some cases the components may be countably infinite. More general cases (i.e. an uncountable set of component distributions), as well as the countable case, are treated under the title of compound distributions.

A distinction needs to be made between a random variable whose distribution function or density is the sum of a set of components (i.e. a mixture distribution) and a random variable whose value is the sum of the values of two or more underlying random variables, in which case the distribution is given by the convolution operator. As an example, the sum of two jointly normally distributed random variables, each with different means, will still have a normal distribution. On the other hand, a mixture density created as a mixture of two normal distributions with different means will have two peaks provided that the two means are far enough apart, showing that this distribution is radically different from a normal distribution.
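
This distinction is easy to see numerically. A hedged sketch comparing the sum (a convolution) with the mixture of the same two normal distributions; the means, standard deviations, and sample size are illustrative:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100_000

# Sum of two independent normals (a convolution): still normal, one peak
# centred at the sum of the means, here 0.
total = rng.normal(-3, 1, n) + rng.normal(3, 1, n)

# Mixture of the same two normals: each draw comes from one component,
# so the density has two peaks, near -3 and near +3.
pick = rng.random(n) < 0.5
mixture = np.where(pick, rng.normal(-3, 1, n), rng.normal(3, 1, n))

print(f"sum:     mean = {total.mean():.2f}, std = {total.std():.2f}")    # ~0, ~1.41
print(f"mixture: mean = {mixture.mean():.2f}, std = {mixture.std():.2f}")  # ~0, ~3.16
```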

Mixture distributions arise in many contexts in the literature and arise naturally where a statistical population contains two or more subpopulations. They are also sometimes used as a means of representing non-normal distributions. Data analysis concerning statistical models involving mixture distributions is discussed under the title of mixture models, while the present article concentrates on simple probabilistic and statistical properties of mixture distributions and how these relate to properties of the underlying distributions.

Sample (statistics)

In statistics and quantitative research methodology, a data sample is a set of data collected and/or selected from a statistical population by a defined procedure. The elements of a sample are known as sample points, sampling units or observations.

Typically, the population is very large, making a census or a complete enumeration of all the values in the population either impractical or impossible. The sample usually represents a subset of manageable size. Samples are collected and statistics are calculated from the samples, so that one can make inferences or extrapolations from the sample to the population.

The data sample may be drawn from a population without replacement (i.e. no element can be selected more than once in the same sample), in which case it is a subset of a population; or with replacement (i.e. an element may appear multiple times in the one sample), in which case it is a multisubset.
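
A minimal sketch of both schemes using Python's standard library:

```python
import random

random.seed(5)
population = list(range(10))

without = random.sample(population, k=5)     # without replacement: no repeats
with_repl = random.choices(population, k=5)  # with replacement: repeats allowed

print("without replacement:", without)
print("with replacement:   ", with_repl)
```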

Sampling

Sampling may refer to:

Sampling (signal processing), converting a continuous signal into a discrete signal

Sampling (graphics), converting continuous colors into discrete color components

Sampling (music), re-using portions of sound recordings in a piece

Sampler (musical instrument), an electronic music instrument that plays back sound recordings on command

Sampling (statistics), selection of observations to acquire some knowledge of a statistical population

Sampling (case studies), selection of cases for single or multiple case studies

Sampling (audit), application of audit procedures to less than 100% of population to be audited

Sampling (medicine), gathering of matter from the body to aid in the process of a medical diagnosis and/or evaluation of an indication for treatment, further medical tests or other procedures.

Sampling (occupational hygiene), detection of hazardous materials in the workplace

Sampling (for testing or analysis), taking a representative portion of a material or product to test (e.g. by physical measurements, chemical analysis, microbiological examination), typically for the purposes of identification, quality control, or regulatory assessment. See Sample (material).

Specific types of sampling include:

Chorionic villus sampling, a method of detecting fetal abnormalities

Food sampling, the process of taking a representative portion of a food for analysis, usually to test for quality, safety or compositional compliance. (Not to be confused with free food samples, a method of promoting food items to consumers.)

Oil sampling, the process of collecting samples of oil from machinery for analysis

Theoretical sampling, the process of selecting comparison cases or sites in qualitative research

Water sampling, the process of taking a portion of water for analysis or other testing, e.g. drinking water to check that it complies with relevant water quality standards, or river water to check for pollutants, or bathing water to check that it is safe for bathing, or intrusive water in a building to identify its source.

Work sampling, a method of estimating the standard time for manufacturing operations.

Sampling (statistics)

In statistics, quality assurance, and survey methodology, sampling is the selection of a subset (a statistical sample) of individuals from within a statistical population to estimate characteristics of the whole population. Statisticians attempt to collect samples that are representative of the population in question. Two advantages of sampling are lower cost and faster data collection than measuring the entire population.

Each observation measures one or more properties (such as weight, location, colour) of observable bodies distinguished as independent objects or individuals. In survey sampling, weights can be applied to the data to adjust for the sample design, particularly stratified sampling. Results from probability theory and statistical theory are employed to guide the practice. In business and medical research, sampling is widely used for gathering information about a population. Acceptance sampling is used to determine if a production lot of material meets the governing specifications.
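
A hedged sketch of a stratified design in which each stratum's sample mean is weighted by that stratum's share of the population; the strata, sizes, and values are illustrative assumptions:

```python
import random

random.seed(6)
# Hypothetical population with two strata of unequal size.
strata = {
    "urban": [random.gauss(60, 5) for _ in range(8_000)],
    "rural": [random.gauss(40, 5) for _ in range(2_000)],
}
N = sum(len(members) for members in strata.values())

# Draw the same number of units from each stratum, then weight each
# stratum's sample mean by the stratum's population share.
estimate = 0.0
for name, members in strata.items():
    sample = random.sample(members, 50)
    stratum_mean = sum(sample) / len(sample)
    estimate += (len(members) / N) * stratum_mean

print(f"weighted estimate of the population mean: {estimate:.2f}")  # ~56
```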

Standard deviation

In statistics, the standard deviation (SD, also represented by the lower case Greek letter sigma σ or the Latin letter s) is a measure that is used to quantify the amount of variation or dispersion of a set of data values. A low standard deviation indicates that the data points tend to be close to the mean (also called the expected value) of the set, while a high standard deviation indicates that the data points are spread out over a wider range of values.

The standard deviation of a random variable, statistical population, data set, or probability distribution is the square root of its variance. It is algebraically simpler, though in practice less robust, than the average absolute deviation.

A useful property of the standard deviation is that, unlike the variance, it is expressed in the same units as the data.

In addition to expressing the variability of a population, the standard deviation is commonly used to measure confidence in statistical conclusions. For example, the margin of error in polling data is determined by calculating the expected standard deviation in the results if the same poll were to be conducted multiple times. This derivation of a standard deviation is often called the "standard error" of the estimate or "standard error of the mean" when referring to a mean. It is computed as the standard deviation of all the means that would be computed from that population if an infinite number of samples were drawn and a mean for each sample were computed.

The standard deviation of a population and the standard error of a statistic derived from that population (such as the mean) are quite different, but related: the standard error of the mean equals the population standard deviation divided by the square root of the number of observations. The reported margin of error of a poll is computed from the standard error of the mean (or, equivalently, from the product of the population standard deviation and the inverse of the square root of the sample size) and is typically about twice the standard error, the half-width of a 95 percent confidence interval.

In science, many researchers report the standard deviation of experimental data, and only effects that fall much farther than two standard deviations away from what would have been expected are considered statistically significant—normal random error or variation in the measurements is in this way distinguished from likely genuine effects or associations. The standard deviation is also important in finance, where the standard deviation on the rate of return on an investment is a measure of the volatility of the investment.

When only a sample of data from a population is available, the term standard deviation of the sample or sample standard deviation can refer to either the above-mentioned quantity as applied to those data or to a modified quantity that is an unbiased estimate of the population standard deviation (the standard deviation of the entire population).
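
A short sketch distinguishing these quantities with NumPy; the data set is a classic illustrative example whose population standard deviation is exactly 2:

```python
import numpy as np

data = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

pop_sd = data.std(ddof=0)     # divide by n: population standard deviation
sample_sd = data.std(ddof=1)  # divide by n - 1: corrected sample estimate
sem = sample_sd / np.sqrt(len(data))  # standard error of the mean

print(f"population SD: {pop_sd:.3f}")       # 2.000
print(f"sample SD:     {sample_sd:.3f}")    # 2.138
print(f"standard error of the mean: {sem:.3f}")
```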

Statistic

A statistic (singular) or sample statistic is a single measure of some attribute of a sample (e.g. its arithmetic mean value). It is calculated by applying a function (statistical algorithm) to the values of the items of the sample, which are known together as a set of data.

More formally, statistical theory defines a statistic as a function of a sample where the function itself is independent of the sample's distribution; that is, the function can be stated before realization of the data. The term statistic is used both for the function and for the value of the function on a given sample.

A statistic is distinct from a statistical parameter, which is not computable in cases where the population is infinite, since it is then impossible to examine and measure every item in the population. However, a statistic, when used to estimate a population parameter, is called an estimator. For instance, the sample mean is a statistic that estimates the population mean, which is a parameter.

When a statistic (a function) is being used for a specific purpose, it may be referred to by a name indicating its purpose: in descriptive statistics, a descriptive statistic is used to describe the data; in estimation theory, an estimator is used to estimate a parameter of the distribution (population); in statistical hypothesis testing, a test statistic is used to test a hypothesis. However, a single statistic can be used for multiple purposes – for example the sample mean can be used to describe a data set, to estimate the population mean, or to test a hypothesis.
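
A small sketch showing one statistic, the sample mean, used for all three purposes; the simulated data and the null value are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(8)
sample = rng.normal(10.4, 2.0, 30)

# 1. Descriptive statistic: summarize the observed data.
mean = sample.mean()

# 2. Estimator: the same value serves as an estimate of the population mean.
estimate_of_mu = mean

# 3. Test statistic: test the hypothesis H0: mu = 10 via a one-sample t-test.
t, p = stats.ttest_1samp(sample, popmean=10.0)

print(f"sample mean = {mean:.2f}, t = {t:.2f}, p = {p:.3f}")
```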

Statistical unit

A unit in a statistical analysis is one member of a set of entities being studied. It is the material source for the mathematical abstraction of a "random variable". Common examples of a unit would be a single person, animal, plant, manufactured item, or country that belongs to a larger collection of such entities being studied.

Units are often referred to as being either experimental units, sampling units or units of observation:

An "experimental unit" is typically thought of as one member of a set of objects that are initially equivalent, with each object then subjected to one of several experimental treatments. Put simply, it is the smallest entity to which a treatment is applied.

A "sampling unit" is typically thought of as an object that has been sampled from a statistical population. This term is commonly used in opinion polling and survey sampling.For example, in an experiment on educational methods, methods may be applied to classrooms of students. This would indicate the classroom as the experimental unit. Measurements of progress may be obtained on individual students, as observational units. But the treatment (teaching method) being applied to the class would not be applied independently to the individual students. Hence the student could not be regarded as the experimental unit. The class, or the teacher by method combination if the teacher had multiple classes, would be the appropriate experimental unit.

In most statistical studies, the goal is to generalize from the observed units to a larger set consisting of all comparable units that exist but are not directly observed. For example, if we randomly sample 100 people and ask them which candidate they intend to vote for in an election, our main interest is in the voting behavior of all eligible voters, not exclusively on the 100 observed units.

In some cases, the observed units may not form a sample from any meaningful population, but rather constitute a convenience sample, or may represent the entire population of interest. In this situation, we may study the units descriptively, or we may study their dynamics over time. But it typically does not make sense to talk about generalizing to a larger population of such units. Studies involving countries or business firms are often of this type. Clinical trials also typically use convenience samples; however, the aim is often to make inferences about the efficacy of treatments in other patients, and given the inclusion and exclusion criteria for some clinical trials, the sample may not be representative of the majority of patients with the condition or disease.

In simple data sets, the units are in one-to-one correspondence with the data values. In more complex data sets, multiple measurements are made for each unit. For example, if blood pressure measurements are made daily for a week on each subject in a study, there would be seven data values for each statistical unit. Multiple measurements taken on an individual are not independent (they will be more alike compared to measurements taken on different individuals). Ignoring these dependencies during the analysis can lead to an inflated sample size or pseudoreplication.

While a unit is often the lowest level at which observations are made, in some cases, a unit can be further decomposed as a statistical assembly.

Many statistical analyses use quantitative data that have units of measurement. This is a distinct and non-overlapping use of the term "unit."

Statistical units are divided into two types:

A: units of collection

B: units of analysis and interpretation

Units of collection are those units in which figures relating to a particular problem are either enumerated or estimated. Units of collection may be simple or composite. A simple unit represents a single condition without any qualification; a composite unit is formed by adding a qualifying word or phrase to a simple unit. Examples: labour-hours and passenger-kilometres.

Units of analysis and interpretation are those units in terms of which statistical data are analysed and interpreted. Examples: ratios, percentages, coefficients, etc.

Statistics

Statistics is a branch of mathematics dealing with data collection, organization, analysis, interpretation and presentation. In applying statistics to, for example, a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied. Populations can be diverse groups such as "all people living in a country" or "every atom composing a crystal". Statistics deals with all aspects of data, including the planning of data collection in terms of the design of surveys and experiments.

See glossary of probability and statistics.

When census data cannot be collected, statisticians collect data by developing specific experiment designs and survey samples. Representative sampling assures that inferences and conclusions can reasonably extend from the sample to the population as a whole. An experimental study involves taking measurements of the system under study, manipulating the system, and then taking additional measurements using the same procedure to determine if the manipulation has modified the values of the measurements. In contrast, an observational study does not involve experimental manipulation.

Two main statistical methods are used in data analysis: descriptive statistics, which summarize data from a sample using indexes such as the mean or standard deviation, and inferential statistics, which draw conclusions from data that are subject to random variation (e.g., observational errors, sampling variation). Descriptive statistics are most often concerned with two sets of properties of a distribution (sample or population): central tendency (or location) seeks to characterize the distribution's central or typical value, while dispersion (or variability) characterizes the extent to which members of the distribution depart from its center and each other. Inferences on mathematical statistics are made under the framework of probability theory, which deals with the analysis of random phenomena.
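
A brief sketch contrasting the two kinds of methods with scipy.stats; the sample is simulated and the parameters are illustrative:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(9)
sample = rng.normal(100, 15, 50)

# Descriptive statistics: central tendency and dispersion of the sample.
print(f"mean = {sample.mean():.1f}, std = {sample.std(ddof=1):.1f}")

# Inferential statistics: a 95% confidence interval for the population mean.
sem = stats.sem(sample)
lo, hi = stats.t.interval(0.95, df=len(sample) - 1,
                          loc=sample.mean(), scale=sem)
print(f"95% CI for the population mean: ({lo:.1f}, {hi:.1f})")
```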

A standard statistical procedure involves the test of the relationship between two statistical data sets, or a data set and synthetic data drawn from an idealized model. A hypothesis is proposed for the statistical relationship between the two data sets, and this is compared as an alternative to an idealized null hypothesis of no relationship between two data sets. Rejecting or disproving the null hypothesis is done using statistical tests that quantify the sense in which the null can be proven false, given the data that are used in the test. Working from a null hypothesis, two basic forms of error are recognized: Type I errors (null hypothesis is falsely rejected giving a "false positive") and Type II errors (null hypothesis fails to be rejected and an actual difference between populations is missed giving a "false negative"). Multiple problems have come to be associated with this framework, ranging from obtaining a sufficient sample size to specifying an adequate null hypothesis.

Measurement processes that generate statistical data are also subject to error. Many of these errors are classified as random (noise) or systematic (bias), but other types of errors (e.g., blunder, such as when an analyst reports incorrect units) can also be important. The presence of missing data or censoring may result in biased estimates, and specific techniques have been developed to address these problems.

Statistics can be said to have begun in ancient civilization, going back at least to the 5th century BC, but it was not until the 18th century that it started to draw more heavily from calculus and probability theory. In more recent years, statistics has relied more heavily on statistical software to carry out analyses such as descriptive statistics and statistical tests.
