Accuracy and precision

Precision is a description of random errors, a measure of statistical variability.

Accuracy has two definitions:

  1. More commonly, it is a description of systematic errors, a measure of statistical bias; as these cause a difference between a result and a "true" value, ISO calls this trueness.
  2. Alternatively, ISO defines accuracy as describing a combination of both types of observational error above (random and systematic), so high accuracy requires both high precision and high trueness.

In simplest terms, given a set of data points from repeated measurements of the same quantity, the set can be said to be precise if the values are close to each other, while the set can be said to be accurate if their average is close to the true value of the quantity being measured. In the first, more common definition above, the two concepts are independent of each other, so a particular set of data can be said to be either accurate, or precise, or both, or neither.

Common technical definition

Accuracy is the proximity of measurement results to the true value; precision is the repeatability or reproducibility of the measurement.

In the fields of science and engineering, the accuracy of a measurement system is the degree of closeness of measurements of a quantity to that quantity's true value.[1] The precision of a measurement system, related to reproducibility and repeatability, is the degree to which repeated measurements under unchanged conditions show the same results.[1][2] Although the two words precision and accuracy can be synonymous in colloquial use, they are deliberately contrasted in the context of the scientific method.

The field of statistics, where the interpretation of measurements plays a central role, prefers to use the terms bias and variability instead of accuracy and precision: bias is the amount of inaccuracy and variability is the amount of imprecision.

A measurement system can be accurate but not precise, precise but not accurate, neither, or both. For example, if an experiment contains a systematic error, then increasing the sample size generally increases precision but does not improve accuracy. The result would be a consistent yet inaccurate string of results from the flawed experiment. Eliminating the systematic error improves accuracy but does not change precision.
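
To make the distinction concrete, here is a small Python sketch (not from the source article; the true value, bias and spread are invented) that simulates a measurement process with a fixed systematic error. As described above, enlarging the sample improves the precision of the average but leaves the bias, and hence the accuracy, unchanged.

    # Simulation sketch: systematic error limits accuracy even as precision improves.
    import random
    import statistics

    random.seed(42)

    TRUE_VALUE = 10.0       # quantity being measured (assumed for illustration)
    SYSTEMATIC_ERROR = 0.5  # constant bias added to every reading
    RANDOM_SPREAD = 0.2     # standard deviation of the random error

    def measure(n):
        """Simulate n readings affected by both random and systematic error."""
        return [TRUE_VALUE + SYSTEMATIC_ERROR + random.gauss(0, RANDOM_SPREAD)
                for _ in range(n)]

    for n in (10, 100, 10000):
        readings = measure(n)
        mean = statistics.mean(readings)
        sem = statistics.stdev(readings) / n ** 0.5  # precision of the average
        print(f"n={n:6d}  mean={mean:.3f}  standard error={sem:.4f}  "
              f"bias={mean - TRUE_VALUE:+.3f}")

    # The standard error shrinks as n grows (better precision), but the bias
    # stays near +0.5, so accuracy does not improve.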

A measurement system is considered valid if it is both accurate and precise. Related terms include bias (non-random or directed effects caused by a factor or factors unrelated to the independent variable) and error (random variability).

The terminology is also applied to indirect measurements—that is, values obtained by a computational procedure from observed data.

In addition to accuracy and precision, measurements may also have a measurement resolution, which is the smallest change in the underlying physical quantity that produces a response in the measurement.

In numerical analysis, accuracy is also the nearness of a calculation to the true value, while precision is the resolution of the representation, typically defined by the number of decimal or binary digits.

In military terms, accuracy refers primarily to the accuracy of fire (or "justesse de tir"), the precision of fire expressed by the closeness of a grouping of shots at and around the centre of the target.[3]

Quantification

In industrial instrumentation, accuracy is the measurement tolerance, or transmission of the instrument, and defines the limits of the errors made when the instrument is used in normal operating conditions.[4]

Ideally a measurement device is both accurate and precise, with measurements all close to and tightly clustered around the true value. The accuracy and precision of a measurement process is usually established by repeatedly measuring some traceable reference standard. Such standards are defined in the International System of Units (abbreviated SI from French: Système international d'unités) and maintained by national standards organizations such as the National Institute of Standards and Technology in the United States.

This also applies when measurements are repeated and averaged. In that case, the term standard error is properly applied: the precision of the average is equal to the known standard deviation of the process divided by the square root of the number of measurements averaged. Further, the central limit theorem shows that the probability distribution of the averaged measurements will be closer to a normal distribution than that of individual measurements.
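
As a quick check of the rule just stated, the following sketch (values chosen arbitrarily) draws many batches of measurements from a process with known standard deviation and compares the observed spread of the batch averages with the predicted standard error.

    # Standard error of the mean: sigma / sqrt(n), verified by simulation.
    import random
    import statistics

    random.seed(0)

    SIGMA = 2.0    # known standard deviation of the measurement process (assumed)
    N = 25         # measurements averaged per batch
    BATCHES = 5000

    batch_means = [statistics.mean(random.gauss(100.0, SIGMA) for _ in range(N))
                   for _ in range(BATCHES)]

    predicted = SIGMA / N ** 0.5              # standard error from the formula
    observed = statistics.stdev(batch_means)  # spread actually seen in the averages
    print(f"predicted standard error: {predicted:.3f}")
    print(f"observed spread of means: {observed:.3f}")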

With regard to accuracy we can distinguish:

  • the difference between the mean of the measurements and the reference value, the bias. Establishing and correcting for bias is necessary for calibration.
  • the combined effect of that and precision.

A common convention in science and engineering is to express accuracy and/or precision implicitly by means of significant figures. Here, when not explicitly stated, the margin of error is understood to be one-half the value of the last significant place. For instance, a recording of 843.6 m, or 843.0 m, or 800.0 m would imply a margin of 0.05 m (the last significant place is the tenths place), while a recording of 8436 m would imply a margin of error of 0.5 m (the last significant digit is in the units place).

A reading of 8,000 m, with trailing zeroes and no decimal point, is ambiguous; the trailing zeroes may or may not be intended as significant figures. To avoid this ambiguity, the number could be represented in scientific notation: 8.0 × 10³ m indicates that the first zero is significant (hence a margin of 50 m) while 8.000 × 10³ m indicates that all three zeroes are significant, giving a margin of 0.5 m. Similarly, it is possible to use a multiple of the basic measurement unit: 8.0 km is equivalent to 8.0 × 10³ m. In fact, it indicates a margin of 0.05 km (50 m). However, reliance on this convention can lead to false precision errors when accepting data from sources that do not obey it. For example, a source reporting a number like 153,753 with precision +/- 5,000 looks like it has precision +/- 0.5. Under the convention it would have been rounded to 154,000.
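
The significant-figures convention above can be written out mechanically. The helper below is our own illustration (the function name is invented): it returns half the value of the last significant place implied by a recorded value, for plain decimal strings and simple scientific notation.

    def implied_margin(recorded: str) -> float:
        """Half the value of the last significant place implied by a recording.

        Plain integers with trailing zeros (e.g. "8000") are ambiguous under the
        convention; this sketch simply treats their last digit as significant."""
        mantissa, _, exp = recorded.lower().partition("e")
        exponent = int(exp) if exp else 0
        if "." in mantissa:
            place = -len(mantissa.split(".")[1])  # e.g. one decimal -> tenths place
        else:
            place = 0                             # last digit is in the units place
        return 0.5 * 10 ** (place + exponent)

    # Examples from the text above:
    print(implied_margin("843.6"))    # 0.05
    print(implied_margin("8436"))     # 0.5
    print(implied_margin("8.0e3"))    # 50.0  (8.0 × 10³ m)
    print(implied_margin("8.000e3"))  # 0.5   (8.000 × 10³ m)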

Precision includes:

  • repeatability — the variation arising when all efforts are made to keep conditions constant by using the same instrument and operator, and repeating during a short time period; and
  • reproducibility — the variation arising using the same measurement process among different instruments and operators, and over longer time periods.

ISO definition (ISO 5725)

According to ISO 5725-1, accuracy consists of trueness (proximity of measurement results to the true value) and precision (repeatability or reproducibility of the measurement).

A shift in the meaning of these terms appeared with the publication of the ISO 5725 series of standards in 1994, which is also reflected in the 2008 issue of the "BIPM International Vocabulary of Metrology" (VIM), items 2.13 and 2.14.[1]

According to ISO 5725-1,[5] the general term "accuracy" is used to describe the closeness of a measurement to the true value. When the term is applied to sets of measurements of the same measurand, it involves a component of random error and a component of systematic error. In this case trueness is the closeness of the mean of a set of measurement results to the actual (true) value and precision is the closeness of agreement among a set of results.

ISO 5725-1 and VIM also avoid the use of the term "bias", previously specified in BS 5497-1,[6] because it has different connotations outside the fields of science and engineering, as in medicine and law.

Illustration: measurements with good trueness but poor precision give low accuracy due to poor precision, while measurements with high precision but poor trueness give low accuracy due to poor trueness.

In binary classification

Accuracy is also used as a statistical measure of how well a binary classification test correctly identifies or excludes a condition. That is, the accuracy is the proportion of true results (both true positives and true negatives) among the total number of cases examined.[7] To make the context clear, it is often referred to as the "Rand accuracy" or "Rand index".[8][9][10] It is a parameter of the test. The formula for quantifying binary accuracy is:

Accuracy = (TP+TN)/(TP+TN+FP+FN)

where: TP = True positive; FP = False positive; TN = True negative; FN = False negative
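
Written directly in Python, the formula looks as follows; the confusion-matrix counts used in the example are made up.

    def binary_accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
        """Proportion of true results among all cases examined."""
        return (tp + tn) / (tp + tn + fp + fn)

    # 40 true positives, 45 true negatives, 5 false positives, 10 false negatives:
    print(binary_accuracy(tp=40, tn=45, fp=5, fn=10))  # 0.85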

In psychometrics and psychophysics

In psychometrics and psychophysics, the term accuracy is used interchangeably with validity and constant error. Precision is a synonym for reliability and variable error. The validity of a measurement instrument or psychological test is established through experiment or correlation with behavior. Reliability is established with a variety of statistical techniques, classically through an internal consistency test like Cronbach's alpha to ensure sets of related questions have related responses, and then comparison of those related questions between reference and target populations.

In logic simulation

In logic simulation, a common mistake in evaluation of accurate models is to compare a logic simulation model to a transistor circuit simulation model. This is a comparison of differences in precision, not accuracy. Precision is measured with respect to detail and accuracy is measured with respect to reality.[11][12]

In information systems

Information retrieval systems, such as databases and web search engines, are evaluated by many different metrics, some of which are derived from the confusion matrix, which divides results into true positives (documents correctly retrieved), true negatives (documents correctly not retrieved), false positives (documents incorrectly retrieved), and false negatives (documents incorrectly not retrieved). Commonly used metrics include the notions of precision and recall. In this context, precision is defined as the fraction of retrieved documents which are relevant to the query (true positives divided by true+false positives), using a set of ground truth relevant results selected by humans. Recall is defined as the fraction of relevant documents retrieved compared to the total number of relevant documents (true positives divided by true positives+false negatives). Less commonly, the metric of accuracy is used; it is defined as the total number of correct classifications (true positives plus true negatives) divided by the total number of documents.

None of these metrics take into account the ranking of results. Ranking is very important for web search engines because readers seldom go past the first page of results, and there are too many documents on the web to manually classify all of them as to whether they should be included or excluded from a given search. Adding a cutoff at a particular number of results takes ranking into account to some degree. The measure precision at k, for example, is a measure of precision looking only at the top k search results (the top ten when k = 10). More sophisticated metrics, such as discounted cumulative gain, take into account each individual ranking, and are more commonly used where this is important.
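
The retrieval metrics described in this section can be sketched in a few lines of Python; the document identifiers and relevance judgements below are invented for the example.

    # Precision, recall and precision at k over sets of document identifiers.
    def precision(retrieved, relevant):
        return len(set(retrieved) & set(relevant)) / len(retrieved)

    def recall(retrieved, relevant):
        return len(set(retrieved) & set(relevant)) / len(relevant)

    def precision_at_k(ranked, relevant, k=10):
        """Precision computed over only the top k ranked results."""
        return precision(ranked[:k], relevant)

    ranked_results = ["d3", "d7", "d1", "d9", "d4", "d8"]  # ranked retrieval output
    ground_truth = {"d1", "d3", "d5", "d9"}                # judged relevant by humans

    print(precision(ranked_results, ground_truth))            # 3/6 = 0.5
    print(recall(ranked_results, ground_truth))               # 3/4 = 0.75
    print(precision_at_k(ranked_results, ground_truth, k=3))  # 2/3 ≈ 0.667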

References

  1. JCGM 200:2008. International vocabulary of metrology — Basic and general concepts and associated terms (VIM).
  2. Taylor, John Robert (1999). An Introduction to Error Analysis: The Study of Uncertainties in Physical Measurements. University Science Books. pp. 128–129. ISBN 0-935702-75-X.
  3. North Atlantic Treaty Organization, NATO Standardization Agency. AAP-6 - Glossary of terms and definitions, p. 43.
  4. Creus, Antonio. Instrumentación Industrial.
  5. BS ISO 5725-1: "Accuracy (trueness and precision) of measurement methods and results - Part 1: General principles and definitions", p. 1 (1994).
  6. BS 5497-1: "Precision of test methods. Guide for the determination of repeatability and reproducibility for a standard test method" (1979).
  7. Metz, CE (October 1978). "Basic principles of ROC analysis" (PDF). Semin Nucl Med. 8 (4): 283–98. PMID 112681.
  8. "Archived copy" (PDF). Archived from the original on 2015-03-11. Retrieved 2015-08-09.
  9. Powers, David M. W. (2015). "What the F-measure doesn't measure". arXiv:1503.06410 [cs.IR].
  10. Powers, David M. W. "The Problem with Kappa" (PDF). Anthology.aclweb.org. Retrieved 11 December 2017.
  11. Acken, John M. (1997). Encyclopedia of Computer Science and Technology. 36: 281–306.
  12. Glasser, Mark; Mathews, Rob; Acken, John M. (June 1990). "1990 Workshop on Logic-Level Modelling for ASICS". SIGDA Newsletter. 20 (1).

Berkson error model

The Berkson error model is a description of random error (or misclassification) in measurement. Unlike classical error, Berkson error causes little or no bias in the measurement. It was proposed by Joseph Berkson in an article entitled "Are there two regressions?", published in 1950.

An example of Berkson error arises in exposure assessment in epidemiological studies. Berkson error may predominate over classical error in cases where exposure data are highly aggregated. While this kind of error reduces the power of a study, the risk estimates themselves are not attenuated (as would be the case where random error predominates).
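
The contrast with classical error can be seen in a short simulation. The sketch below (model and values are our own illustration, not taken from Berkson's paper) regresses an outcome on an error-laden exposure: with classical error the estimated slope is attenuated toward zero, while with Berkson error it is not.

    import random
    random.seed(1)

    N, SLOPE, NOISE = 20000, 2.0, 1.0

    def fit_slope(xs, ys):
        """Ordinary least-squares slope of ys on xs."""
        mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
        return (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
                / sum((x - mx) ** 2 for x in xs))

    # Classical error: the observed exposure is w = x + error; regress y on w.
    x = [random.gauss(0, 1) for _ in range(N)]
    y = [SLOPE * xi + random.gauss(0, 0.5) for xi in x]
    w = [xi + random.gauss(0, NOISE) for xi in x]
    print("classical error slope:", round(fit_slope(w, y), 2))  # attenuated, about 1.0

    # Berkson error: the true exposure is x = z + error, but only the assigned
    # value z is recorded; regress y on z.
    z = [random.gauss(0, 1) for _ in range(N)]
    x_true = [zi + random.gauss(0, NOISE) for zi in z]
    y2 = [SLOPE * xi + random.gauss(0, 0.5) for xi in x_true]
    print("Berkson error slope:  ", round(fit_slope(z, y2), 2))  # close to 2.0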

Bias (statistics)

Statistical bias is a feature of a statistical technique or of its results whereby the expected value of the results differs from the true underlying quantitative parameter being estimated.

Bias of an estimator

In statistics, the bias (or bias function) of an estimator is the difference between this estimator's expected value and the true value of the parameter being estimated. An estimator or decision rule with zero bias is called unbiased. Otherwise the estimator is said to be biased. In statistics, "bias" is an objective property of an estimator, and while not a desired property, it is not pejorative, unlike the ordinary English use of the term "bias".

Bias can also be measured with respect to the median, rather than the mean (expected value), in which case one distinguishes median-unbiased from the usual mean-unbiasedness property. Bias is related to consistency in that consistent estimators are convergent and asymptotically unbiased (hence converge to the correct value as the number of data points grows arbitrarily large), though individual estimators in a consistent sequence may be biased (so long as the bias converges to zero); see bias versus consistency.

All else being equal, an unbiased estimator is preferable to a biased estimator, but in practice all else is not equal, and biased estimators are frequently used, generally with small bias. When a biased estimator is used, bounds of the bias are calculated. A biased estimator may be used for various reasons: because an unbiased estimator does not exist without further assumptions about a population or is difficult to compute (as in unbiased estimation of standard deviation); because an estimator is median-unbiased but not mean-unbiased (or the reverse); because a biased estimator gives a lower value of some loss function (particularly mean squared error) compared with unbiased estimators (notably in shrinkage estimators); or because in some cases being unbiased is too strong a condition, and the only unbiased estimators are not useful. Further, mean-unbiasedness is not preserved under non-linear transformations, though median-unbiasedness is; for example, the sample variance is an unbiased estimator for the population variance, but its square root, the sample standard deviation, is a biased estimator for the population standard deviation.
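
The last point, that the sample standard deviation is biased even though the sample variance is not, is easy to demonstrate numerically. The following sketch uses Gaussian data with an arbitrary true standard deviation; the sample size and number of trials are likewise arbitrary.

    import random
    import statistics

    random.seed(7)
    TRUE_SD = 3.0
    N, TRIALS = 5, 100000

    variances, std_devs = [], []
    for _ in range(TRIALS):
        sample = [random.gauss(0, TRUE_SD) for _ in range(N)]
        variances.append(statistics.variance(sample))  # n-1 divisor (unbiased)
        std_devs.append(statistics.stdev(sample))      # its square root (biased)

    print("true variance:", TRUE_SD ** 2,
          " average sample variance:", round(statistics.mean(variances), 2))
    print("true std dev: ", TRUE_SD,
          " average sample std dev: ", round(statistics.mean(std_devs), 2))

    # The averaged sample variance comes out near 9.0, while the averaged
    # sample standard deviation falls noticeably short of 3.0.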

Calibration

Calibration in measurement technology and metrology is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy. Such a standard could be another measurement device of known accuracy, a device generating the quantity to be measured (such as a voltage or a sound tone), or a physical artefact, such as a metre ruler.

The outcome of the comparison can result in no significant error being noted on the device under test, a significant error being noted but no adjustment made, or an adjustment made to correct the error to an acceptable level. Strictly speaking, the term calibration means just the act of comparison, and does not include any subsequent adjustment.

The calibration standard is normally traceable to a national standard held by a national metrology institute.

Circular error probable

In the military science of ballistics, circular error probable (CEP) (also circular error probability or circle of equal probability) is a measure of a weapon system's precision. It is defined as the radius of a circle, centered on the mean, whose boundary is expected to include the landing points of 50% of the rounds; in other words, it is the median error radius. That is, if a given bomb design has a CEP of 100 m, then when 100 such bombs are targeted at the same point, 50 are expected to fall within a circle of radius 100 m around their average impact point. (The distance between the target point and the average impact point is referred to as bias.)

There are associated concepts, such as the DRMS (distance root mean square), which is the square root of the average squared distance error, and R95, which is the radius of the circle within which 95% of the values are expected to fall.
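
These radii can be estimated directly from impact data. The sketch below (dispersion value and sample size invented) simulates impact points with circular Gaussian scatter and reads off CEP, DRMS and R95 according to the definitions above.

    import random
    import statistics

    random.seed(3)
    SIGMA = 40.0   # per-axis dispersion of the impact points, in metres (assumed)
    N = 100000

    # Radial miss distances from the mean point of impact (bias removed).
    radii = sorted((random.gauss(0, SIGMA) ** 2 + random.gauss(0, SIGMA) ** 2) ** 0.5
                   for _ in range(N))

    cep = statistics.median(radii)                 # 50% of rounds land inside
    drms = (sum(r * r for r in radii) / N) ** 0.5  # root mean square distance
    r95 = radii[int(0.95 * N)]                     # 95% of rounds land inside

    print(f"CEP  ≈ {cep:.1f} m")   # about 1.18 × sigma ≈ 47 m
    print(f"DRMS ≈ {drms:.1f} m")  # about 1.41 × sigma ≈ 57 m
    print(f"R95  ≈ {r95:.1f} m")   # about 2.45 × sigma ≈ 98 m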

The concept of CEP also plays a role when measuring the accuracy of a position obtained by a navigation system, such as GPS or older systems such as LORAN and Loran-C.

Explanatory power

This article deals with explanatory power in the context of the philosophy of science. For a statistical measure of explanatory power, see coefficient of determination or mean squared prediction error.

Explanatory power is the ability of a hypothesis or theory to effectively explain the subject matter it pertains to. The opposite of explanatory power is explanatory impotence.

In the past, various criteria or measures for explanatory power have been proposed. In particular, one hypothesis, theory, or explanation can be said to have more explanatory power than another about the same subject matter:

  • if more facts or observations are accounted for;
  • if it changes more "surprising facts" into "a matter of course" (following Peirce);
  • if more details of causal relations are provided, leading to a high accuracy and precision of the description;
  • if it offers greater predictive power, i.e., if it offers more details about what we should expect to see, and what we should not;
  • if it depends less on authorities and more on observations;
  • if it makes fewer assumptions;
  • if it is more falsifiable, i.e., more testable by observation or experiment (following Popper).

Recently, David Deutsch proposed that theorists should seek explanations that are hard to vary.

By this expression he intends to state that a hard to vary explanation provides specific details which fit together so tightly that it is impossible to change any one detail without affecting the whole theory.

Frequency standard

A frequency standard is a stable oscillator used for frequency calibration or reference. A frequency standard generates a fundamental frequency with a high degree of accuracy and precision. Harmonics of this fundamental frequency are used to provide reference points.

Since time is the reciprocal of frequency, it is relatively easy to derive a time standard from a frequency standard. A standard clock comprises a frequency standard, a device to count off the cycles of the oscillation emitted by the frequency standard, and a means of displaying or outputting the result.

Frequency standards in a network or facility are sometimes administratively designated as primary or secondary. The terms primary and secondary, as used in this context, should not be confused with the respective technical meanings of these words in the discipline of precise time and frequency.

Horology

Horology ("the study of time", related to Latin horologium from Greek ὡρολόγιον, "instrument for telling the hour", from ὥρα hṓra "hour; time" and -o- interfix and suffix -logy) is the study of the measurement of time. Clocks, watches, clockwork, sundials, hourglasses, clepsydras, timers, time recorders, marine chronometers and atomic clocks are all examples of instruments used to measure time. In current usage, horology refers mainly to the study of mechanical time-keeping devices, while chronometry more broadly includes electronic devices that have largely supplanted mechanical clocks for the best accuracy and precision in time-keeping.

People interested in horology are called horologists. That term is used both by people who deal professionally with timekeeping apparatus (watchmakers, clockmakers) and by aficionados and scholars of horology. Horology and horologists have numerous organizations, both professional associations and more scholarly societies. The largest horological membership organisation globally is the NAWCC, the National Association of Watch and Clock Collectors, which is based in the USA but also has local chapters elsewhere.

Measurement

Measurement is the assignment of a number to a characteristic of an object or event, which can be compared with other objects or events. The scope and application of measurement are dependent on the context and discipline. In the natural sciences and engineering, measurements do not apply to nominal properties of objects or events, which is consistent with the guidelines of the International vocabulary of metrology published by the International Bureau of Weights and Measures. However, in other fields such as statistics as well as the social and behavioral sciences, measurements can have multiple levels, which would include nominal, ordinal, interval and ratio scales.

Measurement is a cornerstone of trade, science, technology, and quantitative research in many disciplines. Historically, many measurement systems existed for the varied fields of human existence to facilitate comparisons in these fields. Often these were achieved by local agreements between trading partners or collaborators. Since the 18th century, developments progressed towards unifying, widely accepted standards that resulted in the modern International System of Units (SI). This system reduces all physical measurements to a mathematical combination of seven base units. The science of measurement is pursued in the field of metrology.

Medical test

A medical test is a medical procedure performed to detect, diagnose, or monitor diseases, disease processes, susceptibility, or to determine a course of treatment. Medical tests relate to clinical chemistry and molecular diagnostics, and are typically performed in a medical laboratory.

Nanoindenter

A nanoindenter is the main component for indentation hardness tests used in nanoindentation. Since the mid-1970s nanoindentation has become the primary method for measuring and testing the mechanical properties of very small volumes of material. Nanoindentation, also called depth-sensing indentation or instrumented indentation, gained popularity with the development of machines that could record small loads and displacements with high accuracy and precision. The load-displacement data can be used to determine modulus of elasticity, hardness, yield strength, fracture toughness, scratch hardness and wear properties.

Observational error

Observational error (or measurement error) is the difference between a measured value of a quantity and its true value. In statistics, an error is not a "mistake". Variability is an inherent part of the results of measurements and of the measurement process.

Measurement errors can be divided into two components: random error and systematic error.

Random errors are errors in measurement that lead to measurable values being inconsistent when repeated measurements of a constant attribute or quantity are taken. Systematic errors are errors that are not determined by chance but are introduced by an inaccuracy (involving either the observation or measurement process) inherent to the system. Systematic error may also refer to an error with a non-zero mean, the effect of which is not reduced when observations are averaged.

Pedant

A pedant is a person who is excessively concerned with formalism, accuracy, and precision, or one who makes an ostentatious and arrogant show of learning.

Pipette

A pipette (sometimes spelled pipet) is a laboratory tool commonly used in chemistry, biology and medicine to transport a measured volume of liquid, often as a media dispenser. Pipettes come in several designs for various purposes with differing levels of accuracy and precision, from single piece glass pipettes to more complex adjustable or electronic pipettes. Many pipette types work by creating a partial vacuum above the liquid-holding chamber and selectively releasing this vacuum to draw up and dispense liquid. Measurement accuracy varies greatly depending on the style.

Precision bias

Precision bias is a form of cognitive bias in which an evaluator of information commits a logical fallacy as the result of confusing accuracy and precision. More particularly, in assessing the merits of an argument, a measurement, or a report, an observer or assessor falls prey to precision bias when he or she believes that greater precision implies greater accuracy (i.e., that simply because a statement is precise, it is also true); the observer or assessor is then said to provide false precision.

Precision bias, whether called by that phrase or another, is addressed in fields such as economics, in which there is a significant danger that a seemingly impressive quantity of statistics may be collected even though these statistics may be of little value for demonstrating any particular truth.

It is also called the numeracy bias, or the range estimate aversion.

The clustering illusion and the Texas sharpshooter fallacy may both be treated as relatives of precision bias. In these fallacies, precision is mistakenly considered evidence of causation, when in fact the clustered information may actually be the result of randomness.

Sensitivity and specificity

Sensitivity and specificity are statistical measures of the performance of a binary classification test (also known in statistics as a classification function) that are widely used in medicine:

  • Sensitivity (also called the true positive rate, the recall, or probability of detection in some fields) measures the proportion of actual positives that are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition).
  • Specificity (also called the true negative rate) measures the proportion of actual negatives that are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition).

Equivalently, in medical tests sensitivity is the extent to which actual positives are not overlooked (so false negatives are few), and specificity is the extent to which actual negatives are classified as such (so false positives are few). Thus a highly sensitive test rarely overlooks an actual positive (for example, showing "nothing bad" despite something bad existing); a highly specific test rarely registers a positive classification for anything that is not the target of testing (for example, finding one bacterial species and mistaking it for another closely related one that is the true target); and a test that is highly sensitive and highly specific does both, so it "rarely overlooks a thing that it is looking for" and it "rarely mistakes anything else for that thing." Because most medical tests do not have sensitivity and specificity values above 99%, "rarely" does not equate to certainty. But for practical reasons, tests with sensitivity and specificity values above 90% have high credibility, albeit usually no certainty, in differential diagnosis.

Sensitivity therefore quantifies the avoiding of false negatives and specificity does the same for false positives. For any test, there is usually a trade-off between the measures – for instance, in airport security, since testing of passengers is for potential threats to safety, scanners may be set to trigger alarms on low-risk items like belt buckles and keys (low specificity) in order to increase the probability of identifying dangerous objects and minimize the risk of missing objects that do pose a threat (high sensitivity). This trade-off can be represented graphically using a receiver operating characteristic curve. A perfect predictor would be described as 100% sensitive, meaning all sick individuals are correctly identified as sick, and 100% specific, meaning no healthy individuals are incorrectly identified as sick. In reality, however, any non-deterministic predictor will possess a minimum error bound known as the Bayes error rate. The values of sensitivity and specificity are agnostic to the percent of positive cases in the population of interest (as opposed to, for example, precision).
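
Both measures follow directly from the confusion-matrix counts; the sketch below writes them out, with the counts invented for the example.

    def sensitivity(tp: int, fn: int) -> float:
        """Proportion of actual positives correctly identified (true positive rate)."""
        return tp / (tp + fn)

    def specificity(tn: int, fp: int) -> float:
        """Proportion of actual negatives correctly identified (true negative rate)."""
        return tn / (tn + fp)

    # e.g. a test that flags 90 of 100 sick people and clears 950 of 1000 healthy people:
    print(sensitivity(tp=90, fn=10))   # 0.90
    print(specificity(tn=950, fp=50))  # 0.95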

The terms "sensitivity" and "specificity" were introduced by the American biostatistician Jacob Yerushalmy in 1947 .

Significant figures

The significant figures (also known as the significant digits) of a number are digits that carry meaning contributing to its measurement resolution. This includes all digits except:

  • all leading zeros;
  • trailing zeros when they are merely placeholders to indicate the scale of the number (exact rules are explained at identifying significant figures); and
  • spurious digits introduced, for example, by calculations carried out to greater precision than that of the original data, or measurements reported to a greater precision than the equipment supports.

Significance arithmetic comprises approximate rules for roughly maintaining significance throughout a computation. The more sophisticated scientific rules are known as propagation of uncertainty.

Numbers are often rounded to avoid reporting insignificant figures. For example, it would create false precision to express a measurement as 12.34500 kg (which has seven significant figures) if the scales only measured to the nearest gram and gave a reading of 12.345 kg (which has five significant figures). Numbers can also be rounded merely for simplicity rather than to indicate a given precision of measurement, for example, to make them faster to pronounce in news broadcasts.
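
Rounding to a chosen number of significant figures, as discussed above, can be done with standard string formatting; the helper name below is our own.

    def round_to_sig_figs(value: float, figures: int) -> float:
        """Round value to the given number of significant figures."""
        if value == 0:
            return 0.0
        return float(f"{value:.{figures}g}")

    print(round_to_sig_figs(12.345, 5))  # 12.345
    print(round_to_sig_figs(12.345, 3))  # 12.3
    print(round_to_sig_figs(153753, 3))  # 154000.0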

Statistical dispersion

In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard deviation, and interquartile range.

Dispersion is contrasted with location or central tendency, and together they are the most used properties of distributions.
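
For a small made-up sample, the three measures of dispersion named above can be computed with the Python standard library.

    import statistics

    data = [2, 4, 4, 4, 5, 5, 7, 9]

    variance = statistics.pvariance(data)        # population variance
    std_dev = statistics.pstdev(data)            # population standard deviation
    q1, _, q3 = statistics.quantiles(data, n=4)  # quartiles
    iqr = q3 - q1                                # interquartile range

    print(variance, std_dev, iqr)  # 4.0 2.0 2.5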

USS Gleaves (DD-423)

USS Gleaves (DD-423) was the lead ship of the Gleaves class of destroyers. She is the only ship of the United States Navy to be named for Admiral Albert Gleaves, who is credited with improving the accuracy and precision of torpedoes and other naval arms.

Gleaves was launched by the Bath Iron Works, Bath, Maine, 9 December 1939, sponsored jointly by Miss Evelina Gleaves Van Metre and Miss Clotilda Florence Cohen, granddaughters of Admiral Gleaves; and commissioned 14 June 1940, at Boston Navy Yard, Lieutenant Commander E. H. Pierce in command.

