Journal of Applied Physiology

J Appl Physiol 85: 775-786, 1998
8750-7587/98 $5.00
Vol. 85, Issue 3, 775-786, September 1998

INVITED REVIEW
Fundamental concepts in statistics: elucidation and illustration

Douglas Curran-Everett, Sue Taylor, and Karen Kafadar

Departments of Pediatrics and of Preventive Medicine and Biometrics, School of Medicine, University of Colorado Health Sciences Center, Denver, 80262; and Department of Mathematics, University of Colorado at Denver, Denver, Colorado 80217-3364

    ABSTRACT

Fundamental concepts in statistics form the cornerstone of scientific inquiry. If we fail to understand fully these fundamental concepts, then the scientific conclusions we reach are more likely to be wrong. This is more than supposition: for 60 years, statisticians have warned that the scientific literature harbors misunderstandings about basic statistical concepts. Original articles published in 1996 by the American Physiological Society's journals fared no better in their handling of basic statistical concepts. In this review, we summarize the two main scientific uses of statistics: hypothesis testing and estimation. Most scientists use statistics solely for hypothesis testing; often, however, estimation is more useful. We also illustrate the concepts of variability and uncertainty, and we demonstrate the essential distinction between statistical significance and scientific importance. An understanding of concepts such as variability, uncertainty, and significance is necessary, but it is not sufficient; we show also that the numerical results of statistical analyses have limitations.

confidence interval; estimation; tolerance interval; uncertainty; variability

    INTRODUCTION

There are very few things which we know, which are not capable of being reduc'd to a Mathematical Reasoning, ... and where a Mathematical Reasoning can be had, it's as great folly to make use of any other, as to grope for a thing in the dark when you have a Candle standing by you.
John Arbuthnot (1692)

STATISTICS IS ONE KIND of mathematical reasoning. Its concepts and principles are ubiquitous in science: as researchers, we use them to design experiments, analyze data, report results, and interpret the published findings of others. Indeed, it is from this foundation of statistical concepts and principles that scientific knowledge is accumulated. If we fail to understand fully these fundamental statistical concepts and principles (that is, if our statistical reasoning is faulty), then we are more likely to reach wrong scientific conclusions. Wrong conclusions based on faulty reasoning are shoddy science; they are also unethical (1, 21, 30).

Regrettably, faulty reasoning in statistics rears its head in the practice of science: for 60 years, statisticians have documented statistical errors in the scientific literature (3, 4, 17, 33, 50). In part, these errors exist because many introductory textbooks of statistics paradoxically hinder literacy in statistics: they emphasize methods rather than concepts, they contain glaring errors, or they perpetuate misconceptions (4, 11, 12).

In his editorial prelude to a series of statistical papers, Yates (51) wrote that the papers were designed to raise statistical consciousness and thereby reduce statistical errors in journals published by the American Physiological Society. Rather than reinforce concepts, these papers reviewed methods: analysis of variance (20), linear regression (37, 46), mathematical modeling (22, 29, 40), risk assessment (36), and statistical packages (34). The proper use of any statistical technique, however, requires an understanding of the fundamental statistical concepts behind the technique.

How well do physiologists understand fundamental concepts in statistics? One way to answer this question is to examine the empirical incidence of basic statistical quantities such as standard deviations, standard errors, and confidence intervals. These quantities characterize different statistical features: standard deviations characterize variability in the population, whereas standard errors and confidence intervals characterize uncertainty about the estimated values of population parameters, e.g., means. Of the original articles published in 1996 by the American Physiological Society, the overwhelming majority (69-93%, range) report standard errors, apparently not as estimates of uncertainty but as estimates of variability (Table 1). Virtually no articles (0-2%, range) report confidence intervals, recommended by statisticians (2, 5, 9, 10, 28, 39) as interval estimates of uncertainty about the values of population parameters. Moreover, few articles (4-15%, range) report precise P values, which precludes personal assessment of statistical significance.

                              
Table 1.   Manuscripts for the American Physiological Society's journals in 1996: use of statistics and statisticians

In this review, we summarize the primary scientific uses of statistics. Then, we illustrate several fundamental concepts: variability, uncertainty, and significance. Last, we illustrate that although an understanding of concepts such as variability, uncertainty, and significance is necessary, it is not sufficient: it is essential to realize also that the numerical results of statistical analyses have limitations.

Glossary

$\alpha$ Critical significance level
$\mathrm{Ave}\{q\}$ Average of the quantity $q$
$\mu$ Population mean
$\nu$ Degrees of freedom
$n$ Number of observations
$N(\mu, \sigma^2)$ Normal (Gaussian) distribution with mean $\mu$ and variance $\sigma^2$
$P$ Achieved significance level
$\Pr\{A\}$ Probability of event $A$
$\sigma$ Population standard deviation
$\sigma_{\bar{y}}$ Standard deviation of the sampling distribution of the sample mean
$s$ Sample standard deviation
$\sigma^2$ Population variance
$s^2$ Sample variance
$\mathrm{SE}\{q\}$ Standard error of the quantity $q$
$\mathrm{Var}\{q\}$ Variance of the quantity $q$
$Y$ Random variable $Y$
$y_i$ Sample observation $i$, where $i = 1, 2, \ldots, n$
$\bar{y}$ Sample mean

    SCIENTIFIC USES OF STATISTICS

In science, there are two main uses of statistics: hypothesis testing and estimation. Most researchers use statistics solely for hypothesis testing. In many situations, statisticians play down hypothesis testing and prefer estimation instead.

Hypothesis testing. To test a scientific hypothesis, a researcher must formulate the hypothesis before any data are collected, then design and execute an experiment that is relevant to it. Because the hypothesis is most often one of no difference, the hypothesis is called, by tradition, the null hypothesis.¹ Using data from the experiment, the researcher must next compute the observed value $T$ of a test statistic. Finally, the researcher must compare the observed value $T$ with some critical value $T^*$, chosen from the distribution of the test statistic that is based on the null hypothesis. If $T$ is more extreme than $T^*$, then that is a surprising result if the null hypothesis is true, and the researcher is entitled, on statistical grounds, to become skeptical about the scientific validity of the null hypothesis.

The statistical test of a null hypothesis is useful because it assesses the strength of the evidence: it helps guard against an unwarranted conclusion, or it helps argue for a real experimental effect (19, 48). Nevertheless, a null hypothesis is often an artificial construct: before any data are recorded, the investigator knows (or at least suspects) that the null hypothesis is not exactly true. Moreover, the only question a hypothesis test can answer is a trivial one: is there anything other than random variation here?²

Statisticians have emphasized repeatedly the limited value of hypothesis testing (2, 4, 9, 18, 24, 28, 31, 38, 50). In fact, the P values that result from hypothesis tests have been described as "absurdly academic"³ (25) and as having a "strictly limited role" (19) in data analysis. Within the scientific community, unwarranted focus on hypothesis testing has blurred the distinction between statistical significance and scientific importance (3, 13, 19). Most investigators appear to reach scientific conclusions that are based not on their knowledge of science but solely on the probabilities of test statistics (16); this is an untenable approach to scientific discovery.

The limited utility of hypothesis testing can be demonstrated with an example. Suppose a clinician wants to assess the impact of a placebo and the β-blockers bisoprolol and metoprolol on heart rate variability in patients with left heart failure. Suppose also that the clinician constructs the null and alternative hypotheses, $H_0$ and $H_1$, as
$H_0$: treatments have identical effects on heart rate variability
$H_1$: treatments have different effects on heart rate variability
The result of this hypothesis test will fail to convey any information about the direction or magnitude of the treatment effects on heart rate variability. Direction and magnitude are important: in patients with left heart failure, decreases in heart rate variability are associated with increases in the risk of sudden cardiac catastrophe (49). Direction and magnitude of an effect reflect scientific importance; they are obtained by estimation.

Estimation. Regardless of the statistical result of a hypothesis test, the crucial question concerns the scientific result: is the experimental effect big enough to be relevant? A point estimate of a population parameter⁴ and an interval estimate of the uncertainty about the value of that parameter help answer this question. For example, one point estimate of a population mean is the sample mean; one interval estimate of the uncertainty about the value of the population mean is a confidence interval. Interval estimates circumvent the drawbacks inherent to hypothesis testing, yet they provide the same statistical information as a hypothesis test (15, 18, 28, 38). More important, point and interval estimates convey information about scientific importance.

Practical considerations. Estimation focuses attention on the magnitude and uncertainty of the experimental results. We must emphasize that hypothesis testing can have value beyond assessing the strength of the experimental evidence: for example, hypothesis testing is useful if an investigator wants to evaluate the importance of between-subject variability in an experiment. In practice, estimation should be done whenever it is relevant and feasible; the precise P value from the associated hypothesis test should be reported with the point and interval estimates. When more than one hypothesis is tested in an experiment, the problem of multiple comparisons becomes relevant. Nevertheless, a discussion of the issues involved in multiple-comparison procedures is beyond the scope of this review; Refs. 2, 9, 42, and 48 summarize these issues.

For the rest of this review, we focus our attention on several aspects of estimation.

    USING SAMPLES TO LEARN ABOUT POPULATIONS

As researchers, we use samples to make inferences about populations. A sample interests us not because of its own merits but because it helps us estimate selected characteristics of the underlying population: for example, the sample mean $\bar{y}$ estimates the population mean $\mu$.⁵

As an illustration, suppose the random variable $Y$ represents the change in systolic blood pressure after some intervention. Suppose also that the distribution of $Y$ conforms to a normal distribution. A normal distribution is specified completely by two parameters: the mean and variance. The population mean $\mu$ conveys the location of the center of the distribution; the population standard deviation $\sigma$, the square root of the population variance $\sigma^2$, conveys the spread of the distribution. The distribution of possible outcomes of the random variable $Y$ is described by the normal probability density function $f$, which incorporates $\mu$ and $\sigma^2$
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \cdot \exp\left\{-\frac{(y - \mu)^2}{2\sigma^2}\right\}, \quad \text{for } -\infty < y < +\infty \tag{1}$$
In Fig. 1, the distributions for three different populations are theoretical: each depicts the distribution of population values as if we had observed the entire population.⁶
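
As a rough illustration of Eq. 1, the density can be evaluated directly. The sketch below assumes the parameters quoted elsewhere in this review for population 1 (mean −15, standard deviation 20); the function name `normal_pdf` is ours, not the authors'.

```python
import math
from statistics import NormalDist

def normal_pdf(y, mu, sigma):
    """Evaluate the normal probability density function (Eq. 1) at y."""
    coeff = 1.0 / (sigma * math.sqrt(2.0 * math.pi))
    return coeff * math.exp(-((y - mu) ** 2) / (2.0 * sigma ** 2))

# Population 1 as described in the text: mean -15, standard deviation 20.
MU_1, SIGMA_1 = -15.0, 20.0

peak = normal_pdf(MU_1, MU_1, SIGMA_1)               # density at the center
one_sd = normal_pdf(MU_1 + SIGMA_1, MU_1, SIGMA_1)   # density one SD away

# Cross-check against the standard library's implementation.
reference = NormalDist(MU_1, SIGMA_1).pdf(MU_1)
print(peak, one_sd, reference)
```

One standard deviation from the center, the density falls to exp(−1/2), or about 61%, of its peak value, whatever the values of the mean and standard deviation.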


Fig. 1. Using samples to learn about populations: 3 normal distributions. These distributions differ in location, reflected in the mean $\mu$, or spread, reflected in the standard deviation $\sigma$. A normal probability density function (Eq. 1) describes the distribution of each population.

Suppose we want to estimate $\mu_1 = -15$, the mean of population 1, in Fig. 1. To do this, we would measure the change in systolic blood pressure in a sample of $n$ independent observations, $y_1, y_2, \ldots, y_n$, from the population. For simplicity, assume we limit the sample to 10 observations. One random sample is
−33, −15, −6, 0, 18, −3, 8, −22, −22, −7
The average of these sample observations is the sample mean $\bar{y}$
$$\bar{y} = \frac{1}{10} \cdot \sum_{i=1}^{10} y_i = -8.2 \tag{2}$$
Because of intrinsic variability in the population, the sample mean $\bar{y}$ differs from the population mean $\mu_1$; only because this is a contrived example do we know the true magnitude of the discrepancy.⁷ Next, we review measures that estimate variability in the population.

    ESTIMATING VARIABILITY IN THE POPULATION

The preceding sample observations, −33, −15, ..., −7, differ because the population from which they were drawn is distributed over a range of possible values. This intrinsic variability is more than a distraction: it is an integral part of statistics, and the careful study of variability may reveal something about underlying scientific processes (25). The most common measure of the variability among sample observations is the sample standard deviation $s$, the square root of the sample variance $s^2$
$$s^2 = \frac{1}{n - 1} \cdot \sum_{i=1}^{n} (y_i - \bar{y})^2$$
(See also Refs. 2, 9, 42, and 48.) The sample standard deviation characterizes the typical distance of an observation from the distribution center; in other words, it reflects the dispersion of individual sample observations about the sample mean. The sample standard deviation $s$ also estimates the population standard deviation $\sigma$: the standard deviation of the sample observations −33, −15, ..., −7 is $s = 15.2$, which estimates $\sigma = 20$.
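
The sample mean and sample standard deviation quoted above can be reproduced from the 10 observations. This is a minimal sketch that follows the two formulas directly; the variable names are illustrative.

```python
import statistics

# The 10 sample observations listed in the text (change in systolic
# blood pressure, mmHg).
sample = [-33, -15, -6, 0, 18, -3, 8, -22, -22, -7]

n = len(sample)
y_bar = sum(sample) / n                                   # sample mean (Eq. 2)
s2 = sum((y - y_bar) ** 2 for y in sample) / (n - 1)      # sample variance
s = s2 ** 0.5                                             # sample standard deviation

print(round(y_bar, 1), round(s, 1))   # -8.2 15.2

# The standard library uses the same n - 1 divisor.
assert abs(s - statistics.stdev(sample)) < 1e-9
```

The divisor is $n - 1$ rather than $n$, which matches the definition of $s^2$ above and the convention used by `statistics.stdev`.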

Most journals would publish the preceding sample mean and standard deviation as
−8.2 mmHg ± 15.2
The ± symbol, however, is superfluous: the standard deviation is a single positive number. A standard deviation can be reported clearly with notation of this form
−8.2 mmHg (SD 15.2)
In a table, the symbol SD can be omitted without loss of clarity as long as the table legend identifies the parenthetical value as a standard deviation.

The standard deviation is often a useful index of variability, but in many experimental situations it may be a deceptive one: even subtle departures from a normal distribution can render useless the standard deviation as an index of variability (43); often, the distribution of a biological variable differs grossly from a normal distribution. As one example, the distribution of values for plasma creatinine (26) resembles the skewed distribution depicted in Fig. 2. When the tails of a distribution are elongated, as is the right tail of this skewed distribution, the sample standard deviation will be an inflated measure of variability in the population (43, 48). There are two remedies to this misrepresentation of variability by the standard deviation: use another measure of variability, or transform the data.


Fig. 2. Estimating variability in the population: a skewed distribution. The lognormal probability density function (Eq. A1) describes this skewed distribution in which $\Pr\{Y \le 6.1\} = 0.50$ and $\Pr\{2.1 \le Y \le 16.4\} = 0.68$ (gray area). For a normal distribution with the same mean and variance (inset), $\Pr\{Y \le 10.0\} = 0.50$ and $\Pr\{-3.1 \le Y \le 23.1\} = 0.68$ (gray area). See APPENDIX for further explanation.

Alternative measures of variability. Two measures of variability that are useful with a variety of distributions are the mean absolute deviation and the interquartile range. The mean absolute deviation ($\mathrm{Ave}\{|\mathrm{dev}|\}$) is the average distance of the sample observations from the sample mean
$$\mathrm{Ave}\{|\mathrm{dev}|\} = \frac{1}{n} \cdot \sum_{i=1}^{n} |y_i - \bar{y}|$$

The interquartile range (often designated as IQR) encompasses the middle 50% of a distribution and is the difference between the 75th and 25th percentiles. For $0 < \phi < 1$, the $100\phi$th percentile is the value below which $100\phi\%$ of the distribution is found.
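
Both alternative measures can be computed for the 10 observations used earlier. The sketch below is illustrative only: sample percentiles can be defined in several ways, and the `"inclusive"` interpolation rule chosen here is our assumption, not a convention stated in this review.

```python
import statistics

sample = [-33, -15, -6, 0, 18, -3, 8, -22, -22, -7]

# Mean absolute deviation: average distance from the sample mean.
y_bar = statistics.mean(sample)
mean_abs_dev = sum(abs(y - y_bar) for y in sample) / len(sample)

# Interquartile range: 75th percentile minus 25th percentile.
# method="inclusive" is one of several percentile conventions (an assumption).
q1, _median, q3 = statistics.quantiles(sample, n=4, method="inclusive")
iqr = q3 - q1

print(mean_abs_dev, q1, q3, iqr)
```

With a different percentile convention the quartiles, and therefore the IQR, would change somewhat for a sample this small; the mean absolute deviation does not depend on such a choice.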

Data transformation. When the sample observations happen to be drawn from a population that has a skewed distribution (e.g., a constituent of blood or the growth rate of a tumor), a transformation may change the shape of their distribution so that the distribution of the transformed observations is more symmetric (14, 23, 26, 32, 48). Common transformations include the logarithmic, inverse, square root, and arc sine transformations. The APPENDIX reviews a useful family of data transformations.
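
The effect of a logarithmic transformation on a skewed sample can be sketched with simulated data. The lognormal parameters below (1.8 and 1.0 on the log scale) are hypothetical values chosen only to give a right-skewed distribution; they are not taken from the authors' APPENDIX, and the `skewness` helper is ours.

```python
import math
import random

def skewness(values):
    """Moment-based sample skewness: third central moment / SD cubed."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Hypothetical right-skewed population (lognormal; parameters are assumptions).
raw = [rng.lognormvariate(1.8, 1.0) for _ in range(5000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(skew_raw, skew_log)
```

The raw values have a long right tail and a clearly positive skewness, whereas the log-transformed values are close to symmetric, which is the sense in which the transformation changes the shape of the distribution.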

In the next section, we revisit the unknown discrepancy between the sample estimate of a population parameter and the population parameter itself.

    ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER

In the sampling exercise from USING SAMPLES TO LEARN ABOUT POPULATIONS, the sample mean $\bar{y} = -8.2$ (Eq. 2) estimated the population mean $\mu_1 = -15$. If we had calculated this sample mean from experimental observations, then we would be uncertain about the magnitude of the discrepancy between the sample estimate $\bar{y}$ and the population parameter $\mu_1$. The ability to estimate the level of uncertainty about the value of a population parameter by using the sample estimate of that parameter is a powerful aspect of statistics (47).

Suppose we measure the same response variable, the change in systolic blood pressure, in a second sample of 10 independent observations drawn from the same population. We know beforehand that because of random sampling the mean of the second sample, $\bar{y}_2$, will differ from the mean of the first sample, $\bar{y}_1 = -8.2$. If we measure the change in systolic blood pressure in 100 samples of 10 independent observations, then we expect 100 different estimates of the population mean $\mu_1$; for example
$$\bar{y}_1 = -8.2, \quad \bar{y}_2 = -8.1, \quad \ldots, \quad \bar{y}_{100} = -22.5$$
If we treat these 100 observed sample means as 100 observations, then we can calculate their mean and standard deviation, designated as $\mathrm{Ave}\{\bar{y}\}$ and $\mathrm{SD}\{\bar{y}\}$
$$\mathrm{Ave}\{\bar{y}\} = -14.5 \quad \text{and} \quad \mathrm{SD}\{\bar{y}\} = 6.07$$

We can generalize from this empirical distribution of sample means to a theoretical distribution of the sample mean for a sample of size $n$. Consider a random variable $Y$ that is distributed normally with mean $\mu$ and variance $\sigma^2$, which are known; the notation for this normal distribution is $Y \sim N(\mu, \sigma^2)$. If an infinite number of samples, each with $n$ independent observations, is drawn from this normal distribution, then the sample means $\bar{y}_1, \bar{y}_2, \ldots, \bar{y}_\infty$ will also be distributed normally.⁸ The average of the sample means, $\mathrm{Ave}\{\bar{y}\}$, is the population mean $\mu$, but the variance of the sample means ($\mathrm{Var}\{\bar{y}\}$) is smaller than the population variance $\sigma^2$ by a factor of $1/n$
$$\mathrm{Ave}\{\bar{y}\} = \mu \quad \text{and} \quad \mathrm{Var}\{\bar{y}\} = \sigma^2_{\bar{y}} = \sigma^2/n$$
(The APPENDIX derives these expressions. Figure 3 develops these expressions using empirical examples.) Therefore, the standard deviation of the theoretical distribution of the sample mean, $\sigma_{\bar{y}}$, is
$$\sigma_{\bar{y}} = \sigma/\sqrt{n}$$
If the sample size $n$ increases, then the standard deviation $\sigma_{\bar{y}}$ will decrease: that is, the more sample observations we have, the more certain we will be that the point estimate $\bar{y}$ is near the actual population mean $\mu$.
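
The relationship between sample size and the variance of the sample mean can be checked by simulation. The sketch below draws 1,000 samples of each size from a normal population with mean −15 and variance 400; the seed and helper names are our own, so the numbers will not match the authors' figure exactly.

```python
import random
import statistics

MU, SIGMA = -15.0, 20.0      # population 1: mean -15, variance 400
rng = random.Random(1)       # fixed seed so the sketch is repeatable

def simulate_sample_means(n, n_samples=1000):
    """Draw n_samples samples of size n and return their sample means."""
    return [
        statistics.fmean([rng.gauss(MU, SIGMA) for _ in range(n)])
        for _ in range(n_samples)
    ]

results = {}
for n in (5, 10, 20, 40):
    means = simulate_sample_means(n)
    results[n] = (statistics.fmean(means), statistics.variance(means))
    # Compare the empirical variance of the means with sigma^2 / n.
    print(n, results[n], SIGMA ** 2 / n)
```

For each sample size, the average of the simulated sample means should sit near −15, and the empirical variance of those means should be close to $\sigma^2/n$, roughly halving each time the sample size doubles.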


Fig. 3. Estimating uncertainty about a population parameter: empirical distributions of sample means. These distributions are based on 1,000 samples of 5 (A), 10 (B), 20 (C), or 40 (D) observations drawn at random from population 1, for which the mean $\mu = -15$ and the variance $\sigma^2 = 400$. For each empirical distribution, the average of the sample means, $\mathrm{Ave}\{\bar{y}\}$, happens to be −15.1. As sample size increases, however, the sample means become concentrated more closely about $\mathrm{Ave}\{\bar{y}\}$. When sample size doubles, the variance of the sample means, $\mathrm{Var}\{\bar{y}\}$, is approximately halved.

The standard deviation of the theoretical distribution of the sample mean is known also as the standard error of the sample mean, $\mathrm{SE}\{\bar{y}\}$; that is
$$\mathrm{SE}\{\bar{y}\} = \sigma/\sqrt{n} \tag{3}$$
In estimation, the standard error of the mean has no particular value; instead, it is useful because of its role in the calculation of a confidence interval for the population mean $\mu$.⁹

Confidence intervals. When we construct a confidence interval for the population mean, we assign numerical bounds to the expected discrepancy between the sample mean $\bar{y}$ and the population mean $\mu$. In essence, a confidence interval is a range that we expect, with some level of confidence, to include the actual value of the population mean. Below, we use the theoretical distribution of the sample mean to derive the confidence interval for the population mean $\mu$.¹⁰

In the theoretical distribution of the sample mean, $100(1 - \alpha)\%$ of the possible sample means are included in the interval
$$[\mu - a, \mu + a] \tag{4}$$
where the allowance $a$ is
$$a = z_{\alpha/2} \cdot \mathrm{SE}\{\bar{y}\} \tag{5}$$
In Eq. 5, $z_{\alpha/2}$ is the $100[1 - (\alpha/2)]$th percentile from the standard normal distribution, i.e., a normal distribution with mean 0 and variance 1, and $\mathrm{SE}\{\bar{y}\}$ is defined by Eq. 3. Therefore, when the population standard deviation $\sigma$ is known, 95% of the possible sample means are within $1.96 \cdot \mathrm{SE}\{\bar{y}\}$ of the population mean $\mu$.
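
As a numerical illustration of Eqs. 4 and 5, the sketch below computes the percentile $z_{\alpha/2}$ and the resulting interval for population 1 (mean −15, standard deviation 20) with samples of 10 observations. The choice of $n = 10$ and the variable names are ours.

```python
from statistics import NormalDist

MU, SIGMA, N = -15.0, 20.0, 10   # population 1 and an assumed sample size
alpha = 0.05

# 100[1 - (alpha/2)]th percentile of the standard normal distribution.
z_crit = NormalDist().inv_cdf(1 - alpha / 2)

se = SIGMA / N ** 0.5            # standard error of the mean, sigma known
allowance = z_crit * se          # Eq. 5
interval = (MU - allowance, MU + allowance)   # Eq. 4

print(round(z_crit, 2), round(allowance, 1), interval)
```

Under these assumptions, about 95% of sample means from samples of 10 observations would be expected to fall within roughly 12.4 mmHg of the population mean.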

The interval in Eq. 4 can be written as the probability expression
$$\Pr\{\mu - a \le \bar{y} \le \mu + a\} = 1 - \alpha$$
which declares that the probability is $1 - \alpha$ that a sample mean lies within the interval $[\mu - a, \mu + a]$. After algebraic rearrangement, this expression can be written
$$\Pr\{\bar{y} - a \le \mu \le \bar{y} + a\} = 1 - \alpha$$
but note that the randomness resides in the parameter estimate $\bar{y}$, not in the actual parameter $\mu$. In this form, the interval
$$[\bar{y} - a, \bar{y} + a] \tag{6}$$
is called the $100(1 - \alpha)\%$ confidence interval for the population mean $\mu$.
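
The algebraic rearrangement mentioned above can be spelled out in three steps: subtract the population mean from each part of the inequality, multiply through by −1 (which reverses the inequalities but leaves the symmetric bounds unchanged), and then add the sample mean to each part. Because the event inside the braces is the same at every step, its probability is unchanged.

```latex
\begin{align*}
\mu - a \le \bar{y} \le \mu + a
  &\iff -a \le \bar{y} - \mu \le a \\
  &\iff -a \le \mu - \bar{y} \le a \\
  &\iff \bar{y} - a \le \mu \le \bar{y} + a
\end{align*}
```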

In practice, the sample standard deviation $s$ estimates the population standard deviation $\sigma$, which means that $s/\sqrt{n}$ estimates the standard error of the mean (Eq. 3). In calculating a $100(1 - \alpha)\%$ confidence interval for the mean $\mu$, this uncertainty about the actual value of $\sigma$ is handled by replacing $z_{\alpha/2}$ in Eq. 5 with $t_{\alpha/2,\nu}$, the $100[1 - (\alpha/2)]$th percentile from a Student t distribution with $\nu = n - 1$ degrees of freedom. Therefore, the allowance applied to the sample mean to obtain the $100(1 - \alpha)\%$ confidence interval for the population mean (Eq. 6) is
$$a = t_{\alpha/2,\nu} \cdot \mathrm{SE}\{\bar{y}\}$$
where $\mathrm{SE}\{\bar{y}\} = s/\sqrt{n}$. Note that this allowance exceeds the allowance in Eq. 5: there is greater uncertainty about the value of the population mean $\mu$. This happens because if $\nu < \infty$, then $t_{\alpha/2,\nu} > z_{\alpha/2}$ for all values of $\alpha$.

Suppose we want to calculate a confidence interval for the population mean $\mu_1 = -15$ by using the observations −33, −15, ..., −7 of the first sample. The mean and standard deviation of these 10 observations are $\bar{y} = -8.2$ and $s = 15.2$. Therefore, the estimated standard error of the mean is
$$\mathrm{SE}\{\bar{y}\} = s/\sqrt{n} = 15.2/\sqrt{10} = 4.81$$
Because $n = 10$, there are $\nu = n - 1 = 9$ degrees of freedom. If we want a 95% confidence interval, then $\alpha = 0.05$, $t_{\alpha/2,\nu} = 2.26$, and the allowance $a = 2.26 \times 4.81 = 10.9$. Therefore, the 95% confidence interval is
$$[-19.1, +2.7]$$
In other words, we can declare, with 95% confidence, that the population mean is included in the interval [−19.1, +2.7].
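
The worked example above can be reproduced in a few lines. This sketch takes the critical value 2.26 directly from the text rather than computing it, because the Python standard library has no Student t quantile function; the variable names are illustrative.

```python
import statistics

sample = [-33, -15, -6, 0, 18, -3, 8, -22, -22, -7]
T_CRIT = 2.26   # t percentile for alpha = 0.05 and 9 degrees of freedom, as quoted in the text

n = len(sample)
y_bar = statistics.fmean(sample)
se = statistics.stdev(sample) / n ** 0.5   # estimated standard error of the mean
allowance = T_CRIT * se
ci = (y_bar - allowance, y_bar + allowance)

print(f"[{ci[0]:.1f}, {ci[1]:+.1f}]")   # [-19.1, +2.7]
```

Computed from the unrounded standard deviation, the standard error is about 4.82 rather than the 4.81 obtained in the text from the rounded value 15.2; the allowance and the confidence limits agree to one decimal place either way.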

Bear in mind that a single confidence interval either does or does not include the value of the population parameter; in experimental situations, we are uncertain about which of these outcomes has occurred. Instead, the level of confidence in a confidence interval is based on the concept of drawing a large number of samples, each with $n$ observations, from the population. When we measured the change in systolic blood pressure in 100 random samples, we obtained 100 different sample means and 100 different sample standard deviations. As a consequence, we will calculate 100 different $100(1 - \alpha)\%$ confidence intervals; we expect ~$100(1 - \alpha)\%$ of these observed confidence intervals to include the actual value of the population mean (see Fig. 4).
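
The repeated-sampling interpretation of confidence can be illustrated by simulation. The sketch below draws many samples of 10 observations from population 1 (mean −15, standard deviation 20), builds a 95% confidence interval from each, and counts how often the interval covers the population mean. The number of simulated samples (2,000 rather than 100) and the seed are our choices, made so that the observed proportion is reasonably stable.

```python
import random
import statistics

MU, SIGMA, N = -15.0, 20.0, 10
T_CRIT = 2.26            # t percentile for 9 degrees of freedom, as quoted in the text
rng = random.Random(2)   # fixed seed so the sketch is repeatable

def ci_covers_mu():
    """Draw one sample, build its 95% CI, and report whether it covers MU."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    y_bar = statistics.fmean(sample)
    allowance = T_CRIT * statistics.stdev(sample) / N ** 0.5
    return y_bar - allowance <= MU <= y_bar + allowance

n_sims = 2000
coverage = sum(ci_covers_mu() for _ in range(n_sims)) / n_sims
print(coverage)
```

The observed proportion should be close to 0.95 but will rarely equal it exactly, which mirrors the point that the level of confidence describes the procedure over many samples and not any single interval.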


Fig. 4. Estimating uncertainty about a population parameter: 95% confidence intervals for a population mean. These confidence intervals are for 100 samples of 10 observations drawn at random from population 1 in Fig. 1. It is because of the random sampling that the position and length of the confidence interval vary from sample to sample. About 95 of these intervals (the actual number will vary) are expected to cover the population mean of −15 mmHg. In this example, 98 of the confidence intervals cover the population mean $\mu$; the 2 exceptions are highlighted (heavy black lines numbered 1 and 2).

A confidence interval characterizes the uncertainty about the estimated value of a population parameter. Sometimes, an investigator may be interested less in the value of the population parameter and more in the distribution of individual observations. A tolerance interval characterizes the uncertainty about the estimated distribution of those individual observations (see APPENDIX).

Next, we illustrate the distinction between statistical significance and scientific importance. Last, we show that the numerical results of statistical analyses have limitations.

    STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER

Hypothesis testing, as the primary scientific use of statistics, has a drawback: the result of a hypothesis test conveys mere statistical significance. In contrast, estimation conveys scientific significance.¹¹ This distinction is obvious if we use the results of a recent clinical trial. In this trial, the Systolic Hypertension in the Elderly Program (SHEP) Cooperative Research Group (45) evaluated the impact of antihypertensive drugs on the incidence of stroke in persons with isolated systolic hypertension. When compared with placebo, these drugs reduced by 36% (P = 0.0003) the incidence of stroke. Associated with this reduced incidence of stroke was a greater decrease in systolic blood pressure.

To appreciate the distinction between statistical significance and scientific importance, consider two populations that represent the theoretical distributions of the decreases in systolic blood pressure for the two groups. Let the decrease in systolic blood pressure of the placebo group be designated $Y_1$ and that of the drug treatment group be designated $Y_2$. Assume that $Y_1$ and $Y_2$ are distributed normally
$$Y_1 \sim N(\mu_1, \sigma_1^2) \quad \text{and} \quad Y_2 \sim N(\mu_2, \sigma_2^2)$$
The normal probability density function (Eq. 1), in which approximate values for the observed sample means and variances from the SHEP trial, $\bar{y}_i$ and $s_i^2$, are substituted for the population means and variances, generates the population distributions depicted in Fig. 5
$$\bar{y}_1 = -15 \Rightarrow \mu_1, \quad s_1^2 = 400 \Rightarrow \sigma_1^2$$
$$\text{and} \quad \bar{y}_2 = -25 \Rightarrow \mu_2, \quad s_2^2 = 400 \Rightarrow \sigma_2^2$$


Fig. 5. Statistical and scientific significance differ: placebo (black) and drug-treatment (gray) populations. The populations represent theoretical distributions of changes in systolic blood pressure during year 5 of the Systolic Hypertension in the Elderly Program clinical trial (see Ref. 45). The distributions are described by the normal probability density function (Eq. 1) in which the sample means and variances, $\bar{y}_i$ and $s_i^2$, are substituted for the population means and variances. To generate samples of size $n$ from each population, observations (Obs) were drawn at random from the placebo population; corresponding observations from the drug-treatment population were obtained by subtracting 10 from each placebo observation. The sampling procedure is illustrated for $n = 2$.

Suppose our objective is to estimate the difference between population means
$$\mu_2 - \mu_1 = -25 - (-15) = -10 \text{ mmHg}$$
The SHEP group established convincingly that the difference $\mu_2 - \mu_1$, which represents the greater decrease in systolic blood pressure after drug therapy, was important. To estimate $\mu_2 - \mu_1$, we would sample at random from each population: the difference between sample means, $\bar{y}_2 - \bar{y}_1$, estimates the difference between population means, $\mu_2 - \mu_1$.

By drawing samples of 2 to 128 observations from each population (Table 2) and by forcing $\bar{y}_2 - \bar{y}_1 = -10$ (see Fig. 5), the distinction between statistical significance and scientific importance becomes clear. As sample size $n$ grows, the statistical significance increases, from P = 0.71 for $n = 2$ to P < 0.001 for $n = 128$. Regardless of sample size, one aspect of scientific importance, that reflected by the difference $\bar{y}_2 - \bar{y}_1$, remains constant. As sample size increases, uncertainty about the actual difference $\mu_2 - \mu_1$, another aspect of scientific importance characterized by the numerical bounds of the confidence interval, decreases.

                              
Table 2.   Statistical and scientific significance differ: statistical results
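
The pattern described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it treats the population standard deviation as known (20 mmHg, from the variance of 400 in Fig. 5) and uses normal quantiles, whereas the review's exercise uses the observed sample standard deviations. The P values it prints will therefore not match Table 2 exactly. The function name `z_test_summary` and the constants are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

SIGMA = 20.0        # assumed known population SD (variance 400, as in Fig. 5)
DIFFERENCE = -10.0  # forced difference between sample means, mmHg


def z_test_summary(n, diff=DIFFERENCE, sigma=SIGMA, confidence=0.95):
    """Return (two-sided P value, CI lower bound, CI upper bound)
    for two independent samples of equal size n."""
    std_normal = NormalDist()
    se = sigma * sqrt(2.0 / n)               # standard error of the difference
    z = diff / se                            # test statistic under H0: no difference
    p_value = 2.0 * std_normal.cdf(-abs(z))  # two-sided P value
    z_crit = std_normal.inv_cdf(1.0 - (1.0 - confidence) / 2.0)
    return p_value, diff - z_crit * se, diff + z_crit * se


# Sample sizes mentioned in the text; the estimated difference never changes.
for n in (2, 15, 128):
    p, lo, hi = z_test_summary(n)
    print(f"n = {n:3d}  P = {p:.4f}  95% CI = [{lo:6.1f}, {hi:6.1f}] mmHg")
```

The estimated difference stays at -10 mmHg throughout; only the P value and the width of the confidence interval respond to sample size.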

Practical considerations. In experimental situations, the distinction between statistical significance and scientific importance can be maintained by routinely addressing two questions: how likely is it that the experimental effect is real, and is the experimental effect large enough to be relevant? The first question can be answered simply: compare the P value, obtained in the hypothesis test, with the critical significance level $\alpha$, chosen before any data are collected; if P < $\alpha$, then the experimental effect is likely to be real. The second question can be answered in two steps: calculate a confidence interval for the population parameter, and then assess the numerical bounds of that confidence interval for scientific importance; if either bound of the confidence interval is important from a scientific perspective, then the experimental effect may be large enough to be relevant.

Consider the results when 15 sample observations were drawn from the placebo and drug treatment populations: when compared with placebo, the greater decrease in systolic blood pressure after drug therapy was unconvincing from a statistical perspective (P = 0.18). Because the 95% confidence interval was [-25, +5], uncertainty about the actual impact of drug treatment on systolic blood pressure is relatively large. Note, however, that the additional decrease in systolic blood pressure gained by drug treatment may have been as pronounced as 25 mmHg. From a scientific perspective, further studies, designed with greater statistical power, are warranted.

To illustrate that a significant statistical result may have little scientific importance, imagine that systolic blood pressure had been measured in mmH2O rather than in mmHg. Consider the results when 128 sample observations were drawn from the two populations: the greater decrease in systolic blood pressure after drug therapy was compelling from a statistical perspective (P < 0.001). If the confidence interval [-15, -5] is expressed in mmHg (by dividing each bound by 13.6), then the investigator can declare, with 95% confidence, that the magnitude of the greater decrease in systolic blood pressure was 0.4-1.1 mmHg. In this example, the investigator can be quite certain of a trivial experimental effect.
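
The unit conversion in this example is simple arithmetic, sketched here with a hypothetical helper `to_mmhg`; the factor 13.6 is the one quoted in the text.

```python
MMH2O_PER_MMHG = 13.6  # conversion factor quoted in the text


def to_mmhg(bounds_mmh2o):
    """Convert confidence-interval bounds from mmH2O to mmHg."""
    return tuple(bound / MMH2O_PER_MMHG for bound in bounds_mmh2o)


lower, upper = to_mmhg((-15.0, -5.0))
print(f"[{lower:.1f}, {upper:.1f}] mmHg")
```

The interval is just as narrow, and the P value just as small, as before the conversion; what changes is the scientific meaning of the bounds.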

Whatever the statistical result of a hypothesis test, assessment of the corresponding confidence interval incorporates the scientific importance of the experimental result.

    LIMITATIONS OF STATISTICS

Although the process of scientific discovery requires an understanding of fundamental concepts in statistics, the use of statistics does have limitations. For example, not many of us would accept, solely on the basis of a close temporal relationship, that solar radiation governs stock market prices (Fig. 6). The limitations of statistics are more subtle if an association is plausible.


Fig. 6.   Limitations of statistics: solar radiation and New York stock market prices during 1929 (after Ref. 27). In general, increases in stock prices were associated with decreases in solar radiation. This nonsensical association illustrates the phenomenon of spurious correlation.

Imagine this scenario: a neurological syndrome results from impaired production of some neurotransmitter. Drugs A and B, derivatives of the same parent compound, both stimulate production of this neurotransmitter. Just one of the drugs, however, continues to increase neurotransmitter production over its entire therapeutic range. At higher doses, the second drug becomes less effective at boosting neurotransmitter production and causes neurotoxicity. For each drug, Table 3 lists administered drug concentrations and measured increases in neurotransmitter production. If you rely on only the regression statistics in Table 3, which drug is which? If you are unfortunate and happen to have this hypothetical syndrome, then your choice assumes added importance.

                              
Table 3.   Limitations of statistics: raw data and regression statistics

From the regression statistics alone, it is impossible to differentiate the drugs. Their identities are plain, however, when the data are plotted (Fig. 7): drug A increases neurotransmitter production over the entire range of drug concentrations; the increase in neurotransmitter production begins to fall at higher concentrations of drug B. 


Fig. 7. Limitations of statistics: scatterplots of drug concentration $x$ and increase in neurotransmitter production $y$. For each drug, the fitted first-order model $y = 3 + 0.5x$ and corresponding regression statistics are identical (see Table 3). For only drug A, however, is this first-order relationship plausible. For drug B, a second-order model of the form $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \epsilon$ is required.
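
The point can be checked numerically. The values in Table 3 are not reproduced in this section, so the sketch below substitutes two of Anscombe's well-known published data sets (an assumption, chosen because they also yield the fitted line y = 3 + 0.5x). The names `fit_line` and `residuals_by_x` are hypothetical helpers.

```python
from math import sqrt

# Stand-in data (assumption): Anscombe's data sets I and II, not Table 3 itself.
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_line(xs, ys):
    """Least-squares intercept, slope, and correlation coefficient."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((xi - mean_x) ** 2 for xi in xs)
    syy = sum((yi - mean_y) ** 2 for yi in ys)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope, sxy / sqrt(sxx * syy)


def residuals_by_x(xs, ys):
    """Residuals from the first-order fit, ordered by increasing x."""
    b0, b1, _ = fit_line(xs, ys)
    return [(xi, yi - (b0 + b1 * xi)) for xi, yi in sorted(zip(xs, ys))]


for label, ys in (("linear", y_linear), ("curved", y_curved)):
    b0, b1, r = fit_line(x, ys)
    print(f"{label}: y = {b0:.2f} + {b1:.2f}x, r = {r:.3f}")
    print("  residuals:", [f"{e:+.2f}" for _, e in residuals_by_x(x, ys)])
```

The summary statistics agree to two decimal places, but the residuals of the curved data set are negative at both ends of the range and positive in the middle, the signature of a missing second-order term. A plot shows the same thing at a glance.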

Practical considerations. Data graphics are essential also if the requisite assumptions behind a particular statistical technique are to be verified. For examples in regression, see chapt. 3 in Ref. 23.

    SUMMARY

It is depressing to find how much good biological work is in danger of being wasted through incompetent and misleading analysis  ...
Frank Yates and Michael J. R. Healy (1964)

This scathing remark, written almost 35 years ago (50) but relevant even now (4), reflects the frustrations felt by statisticians over the statistical misconceptions held by scientists. These misconceptions exist in large part because of shortcomings in the cursory statistics education we received in graduate or medical school (4, 11, 12). The major defect in most introductory courses in statistics is that fundamental concepts in statistics, the cornerstone of scientific inquiry (47), are neglected rather than emphasized (4, 7, 17, 44, 50). Statisticians share responsibility with other faculty for ensuring that introductory courses in statistics are relevant and sound (7, 44, 50).

In this review, we have reiterated the primary role of statistics within science to be one of estimation: estimation of a population parameter or estimation of the uncertainty about the value of that parameter. Moreover, we have demonstrated the essential distinction between statistical significance and scientific importance; of the two, scientific importance merits more consideration. We have shown also that without data graphics, data analysis is a game of chance. And last, that this review was written by a physiologist and two statisticians embodies one of the most basic notions in all science: collaboration.

    APPENDIX

This APPENDIX reviews the lognormal distribution (a distribution that reveals limitations of the standard deviation as an estimate of variability), a versatile family of data transformations, the theoretical distribution of the sample mean, tolerance intervals, the statistical equations required to perform the significance sampling exercise, and the confidence interval for the difference between two population means.

Lognormal distribution. The lognormal distribution is a common probability distribution model for skewed data. The random variable $Y$ is distributed lognormally if the logarithm of $Y$ is distributed normally with mean $\tau$ and variance $\xi^2$, or $\ln Y \sim N(\tau, \xi^2)$. Formally, the lognormal probability density function $g$ is
$$g(y) = \frac{1}{y\,\xi\sqrt{2\pi}} \cdot \exp\left\{-\frac{\ln^2 (y/e^{\tau})}{2 \cdot \xi^2}\right\}, \quad \text{for } y > 0 \tag{A1}$$
The mean $\mu_g$ and variance $\sigma_g^2$ of the lognormal distribution specified by Eq. A1 are
$$\mu_g = e^{\tau + (\xi^2/2)} \quad \text{and} \quad \sigma_g^2 = e^{2\tau + \xi^2} \cdot (e^{\xi^2} - 1)$$
For the distribution in Fig. 2, $\tau = 1.803$ and $\xi^2 = 1$; therefore, $\mu_g = 10$ and $\sigma_g^2 = 172$.
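
As a quick check of these two formulas, the values quoted for Fig. 2 can be reproduced directly; `lognormal_moments` is a hypothetical helper name.

```python
from math import exp


def lognormal_moments(tau, xi_sq):
    """Mean and variance of Y when ln Y ~ N(tau, xi_sq)."""
    mean = exp(tau + xi_sq / 2.0)
    variance = exp(2.0 * tau + xi_sq) * (exp(xi_sq) - 1.0)
    return mean, variance


mean_g, var_g = lognormal_moments(tau=1.803, xi_sq=1.0)
print(f"mean = {mean_g:.1f}, variance = {var_g:.0f}")
```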

A family of data transformations. Box and Cox (14) have described a family of power transformations in which an observed variable $y$ is transformed into the variable $w$ by using the parameter $\lambda$
$$w = \begin{cases} (y^{\lambda} - 1)/\lambda & \text{for } \lambda \neq 0, \text{ and} \\ \ln y & \text{for } \lambda = 0 \end{cases}$$
The inverse ($\lambda = -1$) and square root transformations ($\lambda = 0.5$) are members of this family. Draper and Smith (Ref. 23, p. 225-226) summarize the steps required to estimate the parameter $\lambda$ so that the distribution of $w$ is as normal (Gaussian) as possible.
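
A minimal sketch of the transformation itself follows; it does not estimate $\lambda$, which requires the procedure summarized by Draper and Smith. The function name `box_cox` is hypothetical, and the check for positive $y$ is an added assumption (the logarithm and fractional powers require it).

```python
from math import log


def box_cox(y, lam):
    """Box-Cox power transformation of a single positive observation y."""
    if y <= 0:
        raise ValueError("the transformation is defined here only for y > 0")
    if lam == 0:
        return log(y)              # limiting case as lam approaches 0
    return (y ** lam - 1.0) / lam


for lam in (-1, 0, 0.5, 1):
    print(lam, [round(box_cox(y, lam), 3) for y in (1.0, 4.0, 9.0)])
```

With $\lambda = -1$ the result is $1 - 1/y$ and with $\lambda = 0.5$ it is $2(\sqrt{y} - 1)$: the inverse and square root transformations, each shifted and rescaled.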

Theoretical distribution of the sample mean. Suppose some random variable $X$ is distributed normally with mean $\mu$ and variance $\sigma^2$: that is, $X \sim N(\mu, \sigma^2)$. When a sample of $n$ independent observations, $x_1, x_2, \ldots, x_n$, is drawn repeatedly from this distribution, the observed sample means can be treated as observations. These sample means will be distributed normally with mean $\mu$ and variance $\sigma^2/n$, or
$$\text{Ave}\{\bar{x}\} = \mu \quad \text{and} \quad \text{Var}\{\bar{x}\} = \sigma_{\bar{x}}^2 = \sigma^2/n$$
As you might expect, there is a mathematical foundation to these relationships.

Consider the linear function $L$
$$L = k_1 X_1 + k_2 X_2 + \ldots + k_m X_m$$
For $i = 1, 2, \ldots, m$, each $k_i$ is a real constant, and each $X_i \sim N(\mu_i, \sigma_i^2)$. The mean of $L$, $\text{Ave}\{L\}$, is
$$\text{Ave}\{L\} = k_1\mu_1 + k_2\mu_2 + \ldots + k_m\mu_m = \sum_{i=1}^{m} k_i \mu_i$$
If $X_1, X_2, \ldots, X_m$ are mutually independent, then the variance of $L$, $\text{Var}\{L\}$, is
$$\text{Var}\{L\} = k_1^2\sigma_1^2 + k_2^2\sigma_2^2 + \ldots + k_m^2\sigma_m^2 = \sum_{i=1}^{m} k_i^2 \sigma_i^2$$

If the function $L$ is $\bar{x}$, the mean of the $n$ sample observations $x_1, x_2, \ldots, x_n$, then $m = n$, and furthermore, for $i = 1, 2, \ldots, n$
$$k_i = 1/n \quad \text{and} \quad X_i \sim N(\mu, \sigma^2)$$
Therefore
$$\text{Ave}\{L\} = \sum_{i=1}^{n} k_i \mu_i = \sum_{i=1}^{n} \mu/n = n \cdot (\mu/n) = \mu = \text{Ave}\{\bar{x}\}$$
and
$$\text{Var}\{L\} = \sum_{i=1}^{n} k_i^2 \sigma_i^2 = \sum_{i=1}^{n} \sigma^2/n^2 = n \cdot (\sigma^2/n^2) = \sigma^2/n = \text{Var}\{\bar{x}\}$$
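
These two results can also be checked empirically. The simulation below is a sketch under assumed settings: $\mu = -15$ and $\sigma = 20$ (the blood pressure population used elsewhere in this review), $n = 10$, 20,000 replicate samples, and a fixed seed; all of these choices, and the name `simulate_sample_means`, are illustrative.

```python
import random
from statistics import fmean, pvariance


def simulate_sample_means(mu, sigma, n, replicates, seed=1):
    """Draw `replicates` samples of size n from N(mu, sigma^2); return their means."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [
        fmean([rng.gauss(mu, sigma) for _ in range(n)])
        for _ in range(replicates)
    ]


means = simulate_sample_means(mu=-15.0, sigma=20.0, n=10, replicates=20000)
print(f"average of sample means:  {fmean(means):.2f} (theory: -15)")
print(f"variance of sample means: {pvariance(means):.1f} (theory: 400/10 = 40)")
```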

Tolerance intervals. A tolerance interval identifies the bounds that are expected to contain some percentage of a population, not just a single population parameter such as the mean (41). If a normal distribution has mean $\mu$ and variance $\sigma^2$, which are known, then the $100\phi\%$ tolerance interval is
$$[\mu - \{z_{(1-\phi)/2} \cdot \sigma\},\ \mu + \{z_{(1-\phi)/2} \cdot \sigma\}]$$
where $z_{(1-\phi)/2}$ is the $100[1 - \{(1 - \phi)/2\}]$th percentile from the standard normal distribution, i.e., $N(0, 1)$. This tolerance interval covers exactly $100\phi\%$ of the distribution. If $\phi = 0.95$, then $z_{(1-\phi)/2} = 1.96$. For the population that represented the change in systolic blood pressure after some intervention (see USING SAMPLES TO LEARN ABOUT POPULATIONS), $\mu = -15$ and $\sigma = 20$; therefore, the exact 95% tolerance interval is
$$[-54, +24]$$
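
When $\mu$ and $\sigma$ are known, the interval follows directly from a standard normal percentile. A short sketch, with the hypothetical helper `exact_tolerance_interval`:

```python
from statistics import NormalDist


def exact_tolerance_interval(mu, sigma, phi):
    """Interval covering exactly 100*phi% of N(mu, sigma^2), parameters known."""
    z = NormalDist().inv_cdf(1.0 - (1.0 - phi) / 2.0)  # about 1.96 when phi = 0.95
    return mu - z * sigma, mu + z * sigma


low, high = exact_tolerance_interval(mu=-15.0, sigma=20.0, phi=0.95)
print(f"[{low:.0f}, {high:+.0f}] mmHg")
```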

In practice, the sample statistics $\bar{y}$ and $s$ are used to estimate the population parameters $\mu$ and $\sigma$. This element of uncertainty about the values of $\mu$ and $\sigma$ is handled by replacing $z_{(1-\phi)/2}$ with the confidence coefficient $k$, where $k$ depends on $\phi$ as well as the sample size $n$. Therefore, the estimated $100\phi\%$ tolerance interval is
$$[\bar{y} - ks,\ \bar{y} + ks]$$
[If $\phi = 0.95$ and $n = \infty$, then $k = z_{(1-\phi)/2} = 1.96$ as above, when $\mu$ and $\sigma$ were known.] The coefficient $k$ is chosen to enable the declaration, with $100(1 - \alpha)\%$ confidence, that the estimated tolerance interval covers $100\phi\%$ of the distribution (see Table XIV in Ref. 41).

For the observations listed in USING SAMPLES TO LEARN ABOUT POPULATIONS, $\bar{y} = -8.2$ and $s = 15.2$. Suppose we want to estimate with 95% confidence a 90% tolerance interval based on these results. When we use these percentages and the sample size of 10, the coefficient $k = 2.839$. Therefore, the tolerance interval is
$$[-51, +35]$$
In other words, we can declare, with 95% confidence, that 90% of persons will have a change in systolic blood pressure of between -51 and +35 mmHg after the intervention. Note that this statement differs markedly from our previous assertion, made also with 95% confidence, that the population mean µ was included in the interval [-19.1, +2.7].
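
The contrast between the two intervals can be made concrete with the numbers quoted above. In this sketch the tolerance coefficient $k = 2.839$ is taken from the text; the $t$ quantile 2.262 (9 degrees of freedom, two-sided 95%) is a standard tabulated value supplied here as an assumption about how the earlier confidence interval was obtained.

```python
from math import sqrt

Y_BAR, S, N = -8.2, 15.2, 10  # sample mean, sample SD, and sample size from the text
K_TOLERANCE = 2.839           # 95% confidence, 90% coverage, n = 10 (from the text)
T_CRITICAL = 2.262            # t quantile for 9 df, two-sided 95% (tabulated; assumed)

# Tolerance interval: bounds expected to contain 90% of individual changes.
tol_low, tol_high = Y_BAR - K_TOLERANCE * S, Y_BAR + K_TOLERANCE * S

# Confidence interval: bounds expected to contain the population mean.
half_width = T_CRITICAL * S / sqrt(N)
ci_low, ci_high = Y_BAR - half_width, Y_BAR + half_width

print(f"90% tolerance interval:  [{tol_low:.0f}, {tol_high:+.0f}] mmHg")
print(f"95% confidence interval: [{ci_low:.1f}, {ci_high:+.1f}] mmHg")
```

The tolerance interval is much wider because it describes where individual observations fall, whereas the confidence interval describes only the uncertainty about the mean.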

The tolerance intervals outlined above are appropriate only if the distribution of the underlying population is normal; other formulas exist to construct tolerance intervals when the population is distributed nonnormally.

Equations for the significance sampling exercise. For two samples of equal size n, the standard error of the difference between sample means,