Departments of Pediatrics and of Preventive Medicine and Biometrics,
School of Medicine, University of Colorado Health Sciences Center,
Denver, 80262; and Department of Mathematics, University of Colorado at
Denver, Denver, Colorado 80217-3364
Fundamental
concepts in statistics form the cornerstone of scientific inquiry. If
we fail to understand fully these fundamental concepts, then the
scientific conclusions we reach are more likely to be wrong. This is
more than supposition: for 60 years, statisticians have warned that the
scientific literature harbors misunderstandings about basic statistical
concepts. Original articles published in 1996 by the American
Physiological Society's journals fared no better in their handling of
basic statistical concepts. In this review, we summarize the two main
scientific uses of statistics: hypothesis testing and estimation. Most
scientists use statistics solely for hypothesis testing; often,
however, estimation is more useful. We also illustrate the concepts of
variability and uncertainty, and we demonstrate the essential
distinction between statistical significance and scientific importance.
An understanding of concepts such as variability, uncertainty, and
significance is necessary, but it is not sufficient; we show also that
the numerical results of statistical analyses have limitations.
INTRODUCTION
There are very few things which we know, which are not capable
of being reduc'd to a Mathematical Reasoning, ... and where a Mathematical Reasoning can be had, it's as great folly to make use of
any other, as to grope for a thing in the dark when you have a Candle
standing by you.
John Arbuthnot (1692)
STATISTICS IS ONE KIND of
mathematical reasoning. Its concepts and principles are ubiquitous in
science: as researchers, we use them to design experiments, analyze
data, report results, and interpret the published findings of others.
Indeed, it is from this foundation of statistical concepts and
principles that scientific knowledge is accumulated. If we fail to
understand fully these fundamental statistical concepts and
principles, that is, if our statistical reasoning is faulty, then we are more
likely to reach wrong scientific conclusions. Reaching wrong conclusions
through faulty reasoning is shoddy science; it is also unethical (1, 21,
30).
Regrettably, faulty reasoning in statistics rears its head in the
practice of science: for 60 years, statisticians have documented statistical errors in the scientific literature (3, 4, 17, 33, 50). In
part, these errors exist because many introductory textbooks of
statistics paradoxically hinder literacy in statistics: they emphasize
methods rather than concepts, they contain glaring errors, or they
perpetuate misconceptions (4, 11, 12).
In his editorial prelude to a series of statistical papers, Yates (51)
wrote that the papers were designed to raise statistical consciousness
and thereby reduce statistical errors in journals published by the
American Physiological Society. Rather than reinforce concepts, these
papers reviewed methods: analysis of variance (20), linear regression
(37, 46), mathematical modeling (22, 29, 40), risk assessment (36), and
statistical packages (34). The proper use of any statistical technique,
however, requires an understanding of the fundamental statistical
concepts behind the technique.
How well do physiologists understand fundamental concepts in
statistics? One way to answer this question is to examine the empirical
incidence of basic statistical quantities such as standard deviations,
standard errors, and confidence intervals. These quantities characterize different statistical features: standard deviations characterize variability in the population, whereas standard errors and
confidence intervals characterize uncertainty about the estimated values of population parameters, e.g., means. Of the original articles
published in 1996 by the American Physiological Society, the
overwhelming majority (69-93%, range) report standard errors, apparently not as estimates of uncertainty but as estimates of variability (Table 1). Virtually no
articles (0-2%, range) report confidence intervals, recommended
by statisticians (2, 5, 9, 10, 28, 39) as interval estimates of
uncertainty about the values of population parameters. Moreover, few
articles (4-15%, range) report precise P values, which
precludes personal assessment of statistical significance.
Table 1.
Manuscripts for the American Physiological Society's journals in 1996:
use of statistics and statisticians
In this review, we summarize the primary scientific uses of statistics.
Then, we illustrate several fundamental concepts: variability,
uncertainty, and significance. Last, we illustrate that although an
understanding of concepts such as variability, uncertainty, and
significance is necessary, it is not sufficient: it is essential to
realize also that the numerical results of statistical analyses have
limitations.
Glossary
α: Critical significance level
Ave {q}: Average of the quantity q
µ: Population mean
ν: Degrees of freedom
n: Number of observations
N (µ, σ²): Normal (Gaussian) distribution with mean µ and variance σ²
P: Achieved significance level
Pr {A}: Probability of event A
σ: Population standard deviation
SD {ȳ}: Standard deviation of the sampling distribution of the sample mean
s: Sample standard deviation
σ²: Population variance
s²: Sample variance
SE {q}: Standard error of the quantity q
Var {q}: Variance of the quantity q
Y: Random variable Y
y_i: Sample observation i, where i = 1, 2, ... , n
ȳ: Sample mean
SCIENTIFIC USES OF STATISTICS
In science, there are two main uses of statistics: hypothesis testing
and estimation. Most researchers use statistics solely for hypothesis
testing. In many situations, statisticians play down hypothesis testing
and prefer estimation instead.
Hypothesis testing.
To test a scientific hypothesis, a researcher must formulate the
hypothesis before any data are collected, then design and execute an
experiment that is relevant to it. Because the hypothesis is most often
one of no difference, the hypothesis is called, by tradition, the null
hypothesis.1 Using data from the
experiment, the researcher must next compute the observed value
T of a test statistic. Finally, the researcher must compare the
observed value T with some critical value T*, chosen from the distribution of the test statistic that is based on the
null hypothesis. If T is more extreme than T*, then
that is a surprising result if the null hypothesis is true, and the researcher is entitled, on statistical grounds, to become skeptical about the scientific validity of the null hypothesis.
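The comparison of an observed test statistic with a critical value can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' analysis: the null value `mu0 = 0`, the function name `t_statistic`, and the choice of a one-sample t statistic are assumptions; the summary values (mean −8.2, standard deviation 15.2, n = 10) and the critical value 2.26 are borrowed from the worked sample that appears later in this review.

```python
import math


def t_statistic(sample_mean, sample_sd, n, mu0=0.0):
    """Observed value T of a one-sample t statistic for the null hypothesis mu = mu0."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - mu0) / standard_error


# Summary values borrowed from the worked sample in this review (an assumption here).
T = t_statistic(sample_mean=-8.2, sample_sd=15.2, n=10)

# Two-sided 5% critical value for 9 degrees of freedom, as quoted in the review.
T_CRIT = 2.26

# "More extreme" means farther from zero than the critical value.
more_extreme = abs(T) > T_CRIT
print(f"T = {T:.2f}, critical value = ±{T_CRIT}, more extreme: {more_extreme}")
```

With these numbers the observed statistic is not more extreme than the critical value, so the data give no statistical grounds for skepticism about this particular null hypothesis.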
The statistical test of a null hypothesis is useful because it assesses
the strength of the evidence: it helps guard against an unwarranted
conclusion, or it helps argue for a real experimental effect (19, 48).
Nevertheless, a null hypothesis is often an artificial construct:
before any data are recorded, the investigator knows, or at least
suspects, that the null hypothesis is not exactly true. Moreover, the
only question a hypothesis test can answer is a trivial one: is there
anything other than random variation here?2
Statisticians have emphasized repeatedly the limited value of
hypothesis testing (2, 4, 9, 18, 24, 28, 31, 38, 50). In fact, the
P values that result from hypothesis tests have been described
as "absurdly academic"3
(25) and as having a "strictly limited role" (19) in data analysis. Within the scientific community, unwarranted focus on hypothesis testing has blurred the distinction between statistical significance and scientific importance (3, 13, 19). Most investigators
appear to reach scientific conclusions that are based not on their
knowledge of science but solely on the probabilities of test statistics
(16); this is an untenable approach to scientific discovery.
The limited utility of hypothesis testing can be demonstrated with an
example. Suppose a clinician wants to assess the impact of a placebo
and the β-blockers bisoprolol and metoprolol on heart rate
variability in patients with left heart failure. Suppose also that the
clinician constructs the null and alternative hypotheses, H0 and H1, as
H0: The samples come from the same population.
H1: The samples come from different populations.
The result of this hypothesis test will fail to convey any
information about the direction or magnitude of the treatment effects
on heart rate variability. Direction and magnitude are important: in
patients with left heart failure, decreases in heart rate variability
are associated with increases in the risk of sudden cardiac catastrophe
(49). Direction and magnitude of an effect reflect scientific
importance; they are obtained by estimation.
Estimation.
Regardless of the statistical result of a hypothesis test, the crucial
question concerns the scientific result: is the experimental effect big
enough to be relevant? A point estimate of a population parameter4 and an interval
estimate of the uncertainty about the value of that parameter help
answer this question. For example, one point estimate of a population
mean is the sample mean; one interval estimate of the uncertainty about
the value of the population mean is a confidence interval. Interval
estimates circumvent the drawbacks inherent to hypothesis testing, yet
they provide the same statistical information as a hypothesis test (15,
18, 28, 38). More important, point and interval estimates convey information about scientific importance.
Practical considerations.
Estimation focuses attention on the magnitude and uncertainty of the
experimental results. We must emphasize that hypothesis testing can
have value beyond assessing the strength of the experimental evidence:
for example, hypothesis testing is useful if an investigator wants to
evaluate the importance of between-subject variability in an
experiment. In practice, estimation should be done whenever it is
relevant and feasible; the precise P value from the associated hypothesis test should be reported with the point and interval estimates. When more than one hypothesis is tested in an experiment, the problem of multiple comparisons becomes relevant. Nevertheless, a
discussion of the issues involved in multiple-comparison procedures is
beyond the scope of this review; Refs. 2, 9, 42, and 48 summarize these
issues.
For the rest of this review, we focus our attention on several aspects
of estimation.
USING SAMPLES TO LEARN ABOUT POPULATIONS
As researchers, we use samples to make inferences about populations. A
sample interests us not because of its own merits but because it helps
us estimate selected characteristics of the underlying population: for
example, the sample mean ȳ estimates the population mean
µ.5
As an illustration, suppose the random variable Y represents
the change in systolic blood pressure after some intervention. Suppose
also that the distribution of Y conforms to a normal
distribution. A normal distribution is specified completely by two
parameters: the mean and variance. The population mean µ conveys the
location of the center of the distribution; the population standard
deviation σ, the square root of the population variance σ², conveys the spread of the distribution. The
distribution of possible outcomes of the random variable Y is
described by the normal probability density function ( f ),
which incorporates µ and σ²
\[ f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{ -\frac{(y - \mu)^2}{2\sigma^2} \right\} \tag{1} \]
In Fig. 1, the distributions for
three different populations are theoretical: each depicts the
distribution of population values as if we had observed the entire
population.6
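The normal probability density function of Eq. 1 can be evaluated directly. The sketch below is illustrative only: the function name `normal_pdf` and the values mean −15 and standard deviation 20 are assumptions chosen for the example, and Python's standard-library `NormalDist` is used merely as a cross-check.

```python
import math
from statistics import NormalDist


def normal_pdf(y, mu, sigma):
    """Eq. 1: density of a normal distribution with mean mu and variance sigma**2."""
    z = (y - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# Illustrative values assumed for this example.
MU, SIGMA = -15.0, 20.0

# The density is highest at the mean, where it equals 1 / (sigma * sqrt(2 * pi)).
peak = normal_pdf(MU, MU, SIGMA)
library_peak = NormalDist(MU, SIGMA).pdf(MU)

# Probability that Y falls within one standard deviation of the mean.
dist = NormalDist(MU, SIGMA)
within_one_sd = dist.cdf(MU + SIGMA) - dist.cdf(MU - SIGMA)

print(f"peak density = {peak:.5f}, Pr within 1 SD = {within_one_sd:.3f}")
```

The location parameter shifts the curve along the axis without changing its shape, whereas the spread parameter changes both the width and the peak height.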

Fig. 1.
Using samples to learn about populations: 3 normal distributions. These
distributions differ in location, reflected in the mean µ, or spread,
reflected in the standard deviation σ. A normal probability density
function (Eq. 1) describes the distribution of each
population.
Suppose we want to estimate µ1 = −15, the mean of
population 1, in Fig. 1. To do this, we would measure the
change in systolic blood pressure in a sample of n independent
observations, y1, y2, ... , yn, from the
population. For simplicity, assume we limit the sample to 10 observations. One random sample is
−33, −15, ... , −7
The average of these sample observations is the sample mean
\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i = -8.2 \tag{2} \]
Because of intrinsic variability in the population, the
sample mean ȳ differs from
the population mean µ1; only because this is a contrived
example do we know the true magnitude of the
discrepancy.7 Next, we review
measures that estimate variability in the population.
ESTIMATING VARIABILITY IN THE POPULATION
The preceding sample observations, −33, −15, ... , −7,
differ because the population from which they were drawn is distributed over a range of possible values. This intrinsic variability is more
than a distraction: it is an integral part of statistics, and the
careful study of variability may reveal something about underlying
scientific processes (25). The most common measure of the variability
among sample observations is the sample standard deviation s,
the square root of the sample variance s²
\[ s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]
(See also Refs. 2, 9, 42, and 48.) The sample standard
deviation characterizes the typical distance of an observation from the
distribution center; in other words, it reflects the dispersion of
individual sample observations about the sample mean. The sample
standard deviation s also estimates the population standard
deviation σ: the standard deviation of the sample observations
−33, −15, ... , −7 is s = 15.2, which estimates σ = 20.
Most journals would publish the preceding sample mean and standard
deviation as
−8.2 ± 15.2
The ± symbol, however, is superfluous: the standard
deviation is a single positive number. A standard deviation can be
reported clearly with notation of this form
−8.2 (SD 15.2)
In a table, the symbol SD can be omitted without loss of
clarity as long as the table legend identifies the parenthetical value
as a standard deviation.
The standard deviation is often a useful index of variability, but in
many experimental situations it may be a deceptive one: even subtle
departures from a normal distribution can render useless the standard
deviation as an index of variability (43); often, the distribution of a
biological variable differs grossly from a normal distribution. As one
example, the distribution of values for plasma creatinine (26)
resembles the skewed distribution depicted in Fig.
2. When the tails of a distribution are
elongated, as is the right tail of this skewed distribution, the sample
standard deviation will be an inflated measure of variability in the
population (43, 48). There are two remedies to this misrepresentation of variability by the standard deviation: use another measure of
variability, or transform the data.

Fig. 2.
Estimating variability in the population: a skewed distribution. The
lognormal probability density function (Eq. A1) describes
this skewed distribution in which the Pr {Y ≤ 6.1} = 0.50 and the Pr {2.1 ≤ Y ≤ 16.4} = 0.68 (gray area).
For a normal distribution with the same mean and variance
(inset), the Pr {Y ≤ 10.0} = 0.50, and the
Pr {−3.1 ≤ Y ≤ 23.1} = 0.68 (gray area). See
APPENDIX for further explanation.
Alternative measures of variability.
Two measures of variability that are useful with a variety of
distributions are the mean absolute deviation and the interquartile range. The mean absolute deviation
(Ave {|dev|}) is the average distance of the sample observations from the sample mean
\[ \mathrm{Ave}\{|\mathrm{dev}|\} = \frac{1}{n} \sum_{i=1}^{n} |y_i - \bar{y}| \]
The interquartile range (often designated as IQR) encompasses the
middle 50% of a distribution and is the difference between the 75th
and 25th percentiles. For 0 < γ < 1, the 100γth percentile is
the value below which 100γ% of the distribution is found.
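The three measures of variability can be compared on a small skewed sample. The sketch below is a hedged illustration: the ten data values are invented for the example, the helper names are hypothetical, and the "inclusive" quantile method is only one of several percentile conventions.

```python
import statistics


def mean_absolute_deviation(values):
    """Average distance of the observations from their sample mean."""
    center = statistics.fmean(values)
    return statistics.fmean(abs(v - center) for v in values)


def interquartile_range(values):
    """Difference between the 75th and 25th percentiles (inclusive method)."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return q3 - q1


# Hypothetical right-skewed sample, invented for illustration only.
sample = [2.1, 3.0, 3.8, 4.5, 5.2, 6.1, 7.4, 9.0, 12.5, 41.0]

sd = statistics.stdev(sample)           # sample standard deviation (n - 1 divisor)
mad = mean_absolute_deviation(sample)
iqr = interquartile_range(sample)

print(f"SD = {sd:.2f}, Ave|dev| = {mad:.2f}, IQR = {iqr:.2f}")
```

In this invented sample the single large observation in the right tail inflates the standard deviation far more than it affects the mean absolute deviation or the interquartile range.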
Data transformation.
When the sample observations happen to be drawn from a population that
has a skewed distribution (e.g., a constituent of blood or the growth
rate of a tumor), a transformation may change the shape of their
distribution so that the distribution of the transformed observations
is more symmetric (14, 23, 26, 32, 48). Common transformations include
the logarithmic, inverse, square root, and arc sine transformations.
The APPENDIX reviews a useful family of data
transformations.
In the next section, we revisit the unknown discrepancy between the
sample estimate of a population parameter and the population parameter
itself.
ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER
In the sampling exercise from USING SAMPLES TO LEARN ABOUT
POPULATIONS, the sample mean ȳ = −8.2 (Eq. 2)
estimated the population mean µ1 = −15. If we had
calculated this sample mean from experimental observations, then we
would be uncertain about the magnitude of the discrepancy between the
sample estimate ȳ and the
population parameter µ1. The ability to estimate the
level of uncertainty about the value of a population parameter by using
the sample estimate of that parameter is a powerful aspect of
statistics (47).
Suppose we measure the same response variable, the change in systolic
blood pressure, in a second sample of 10 independent observations drawn
from the same population. We know beforehand that because of random
sampling the mean of the second sample, ȳ2, will differ from
the mean of the first sample, ȳ1 = −8.2. If we
measure the change in systolic blood pressure in 100 samples of 10 independent observations, then we expect 100 different estimates of the
population mean µ1: ȳ1, ȳ2, ... , ȳ100.
If we treat these 100 observed sample means as 100 observations, then we can calculate their mean and standard deviation, designated as
Ave {ȳ} and SD {ȳ}.
We can generalize from this empirical distribution of sample means to a
theoretical distribution of the sample mean for a sample of size
n. Consider a random variable Y that is distributed normally with mean µ and variance σ², which are known;
the notation for this normal distribution is Y ~ N(µ, σ²). If an infinite number of
samples, each with n independent observations, is drawn from
this normal distribution, then the sample means ȳ
will also be distributed
normally.8 The average of the
sample means, Ave {ȳ}, is the
population mean µ, but the variance of the sample means, Var {ȳ}, is
smaller than the population variance σ² by a factor of 1/n
\[ \mathrm{Ave}\{\bar{y}\} = \mu \quad \text{and} \quad \mathrm{Var}\{\bar{y}\} = \frac{\sigma^2}{n} \]
(The APPENDIX derives these expressions. Figure 3 develops these expressions using
empirical examples.) Therefore, the standard deviation of the
theoretical distribution of the sample mean, SD {ȳ}, is
\[ \mathrm{SD}\{\bar{y}\} = \frac{\sigma}{\sqrt{n}} \]
If the sample size n increases, then the standard
deviation SD {ȳ} will
decrease: that is, the more sample observations we have, the more
certain we will be that the point estimate ȳ is near the actual
population mean µ.

Fig. 3.
Estimating uncertainty about a population parameter: empirical
distributions of sample means. These distributions are based on 1,000 samples of 5 (A), 10 (B), 20 (C), or 40 (D) observations drawn at random from population 1, for which the mean µ = −15 and the variance σ² = 400. For each empirical distribution, the average of the sample means, Ave {ȳ}, happens
to be −15.1. As sample size increases, however, the sample means
become concentrated more closely about Ave {ȳ}. When
sample size doubles, the variance of the sample means, Var {ȳ}, is
approximately halved.
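The behavior of the sample mean under repeated sampling can be explored by simulation. The sketch below is an assumption-laden illustration rather than a reproduction of the figure: the seed is arbitrary, the helper name `sample_means` is hypothetical, and the population is taken to be normal with mean −15 and variance 400.

```python
import random
import statistics

# Population assumed for the simulation: normal, mean -15, variance 400.
MU, SIGMA = -15.0, 20.0
N_SAMPLES = 1000
rng = random.Random(1)  # arbitrary seed so the run is repeatable


def sample_means(n, n_samples=N_SAMPLES):
    """Means of n_samples random samples, each with n independent observations."""
    return [
        statistics.fmean(rng.gauss(MU, SIGMA) for _ in range(n))
        for _ in range(n_samples)
    ]


# For each sample size: (average of means, variance of means, theoretical variance).
results = {}
for n in (5, 10, 20, 40):
    means = sample_means(n)
    results[n] = (
        statistics.fmean(means),
        statistics.variance(means),
        SIGMA ** 2 / n,
    )
    avg, var, theory = results[n]
    print(f"n = {n:2d}: Ave = {avg:6.2f}, Var = {var:6.2f}, sigma^2/n = {theory:6.2f}")
```

Each doubling of the sample size should roughly halve the empirical variance of the sample means, while their average stays near the population mean.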
The standard deviation of the theoretical distribution of the sample
mean is known also as the standard error of the sample mean, SE {ȳ};
that is
\[ \mathrm{SE}\{\bar{y}\} = \mathrm{SD}\{\bar{y}\} = \frac{\sigma}{\sqrt{n}} \tag{3} \]
In estimation, the standard error of the mean has no
particular value; instead, it is useful because of its role in the
calculation of a confidence interval for the population mean
µ.9
Confidence intervals.
When we construct a confidence interval for the population mean, we
assign numerical bounds to the expected discrepancy between the sample
mean ȳ and the population
mean µ. In essence, a confidence interval is a range that we expect,
with some level of confidence, to include the actual value of the
population mean. Below, we use the theoretical distribution of the
sample mean to derive the confidence interval for the population mean
µ.10
In the theoretical distribution of the sample mean, 100(1 − α)%
of the possible sample means is included in the interval
\[ [\mu - a,\ \mu + a] \tag{4} \]
where the allowance a is
\[ a = z_{\alpha/2} \cdot \mathrm{SD}\{\bar{y}\} \tag{5} \]
In Eq. 5, z_{α/2} is the 100[1 − (α/2)]th percentile from the standard normal
distribution, i.e., a normal distribution with mean 0 and variance 1, and SD {ȳ} is
defined by Eq. 3. Therefore, when the population standard
deviation σ is known, 95% of the possible sample means are within
1.96 SD {ȳ} of the population mean µ.
The interval in Eq. 4 can be written as the probability
expression
\[ \Pr\{\mu - a \le \bar{y} \le \mu + a\} = 1 - \alpha \]
which declares that the probability is 1 − α that a
sample mean lies within the interval [µ − a, µ + a]. After algebraic rearrangement, this expression can
be written
\[ \Pr\{\bar{y} - a \le \mu \le \bar{y} + a\} = 1 - \alpha \]
but note that the randomness resides in the parameter
estimate ȳ, not in the actual
parameter µ. In this form, the interval
\[ [\bar{y} - a,\ \bar{y} + a] \tag{6} \]
is called the 100(1 − α)% confidence interval for the
population mean µ.
In practice, the sample standard deviation s estimates the
population standard deviation σ, which means that s/√n
estimates the standard error of the mean (Eq. 3). In
calculating a 100(1 − α)% confidence interval for the mean µ,
this uncertainty about the actual value of σ is handled by replacing
z_{α/2} in Eq. 5 with t_{α/2,ν}, the 100[1 − (α/2)]th percentile
from a Student t distribution with ν = n − 1 degrees of freedom. Therefore, the allowance applied to the sample mean ȳ
to obtain the 100(1 − α)% confidence interval for the population
mean (Eq. 6) is
\[ a = t_{\alpha/2,\nu} \cdot \mathrm{SE}\{\bar{y}\} \]
where
\[ \mathrm{SE}\{\bar{y}\} = \frac{s}{\sqrt{n}} \]
Note that this allowance exceeds the allowance in Eq. 5: there
is greater uncertainty about the value of the population mean µ. This
happens because if ν < ∞, then t_{α/2,ν} > z_{α/2} for all values of α.
Suppose we want to calculate a confidence interval for the population
mean µ1 = −15 by using the observations −33, −15, ... , −7 of the first sample. The mean and standard
deviation of these 10 observations are ȳ = −8.2 and s = 15.2. Therefore, the estimated standard error of the mean is
\[ \mathrm{SE}\{\bar{y}\} = \frac{s}{\sqrt{n}} = \frac{15.2}{\sqrt{10}} = 4.81 \]
Because n = 10, there are ν = n − 1 = 9 degrees of freedom. If we want a 95% confidence interval, then
α = 0.05, t_{α/2,ν} = 2.26, and the allowance
a = 2.26 × 4.81 = 10.9. Therefore, the 95% confidence
interval is
\[ [-8.2 - 10.9,\ -8.2 + 10.9] = [-19.1,\ +2.7] \]
In other words, we can declare, with 95% confidence, that
the population mean is included in the interval [−19.1, +2.7].
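The arithmetic of this worked example can be checked with a short function. This is a minimal sketch: the function name `t_confidence_interval` is hypothetical, and the critical value is passed in as the tabled number 2.26 quoted in the text rather than computed.

```python
import math


def t_confidence_interval(sample_mean, sample_sd, n, t_crit):
    """Confidence interval for a population mean from summary statistics.

    t_crit is the Student t percentile for n - 1 degrees of freedom,
    supplied by the caller (for example, from a printed table).
    """
    standard_error = sample_sd / math.sqrt(n)
    allowance = t_crit * standard_error
    return sample_mean - allowance, sample_mean + allowance


# Values from the worked example: mean -8.2, SD 15.2, n = 10, t = 2.26.
lower, upper = t_confidence_interval(-8.2, 15.2, 10, t_crit=2.26)
print(f"95% CI: [{lower:.1f}, {upper:.1f}]")  # prints "95% CI: [-19.1, 2.7]"
```

Passing the critical value as an argument keeps the sketch free of third-party dependencies; a statistics package would normally supply it.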
Bear in mind that a single confidence interval either does or does not
include the value of the population parameter; in experimental situations, we are uncertain about which of these outcomes has occurred. Instead, the level of confidence in a confidence interval is
based on the concept of drawing a large number of samples, each with
n observations, from the population. When we measured the
change in systolic blood pressure in 100 random samples, we obtained
100 different sample means and 100 different sample standard deviations. As a consequence, we will calculate 100 different 100(1 − α)% confidence intervals; we expect
~100(1 − α)% of these observed confidence intervals to include
the actual value of the population mean (see Fig. 4).

Fig. 4.
Estimating uncertainty about a population parameter: 95% confidence
intervals for a population mean. These confidence intervals are for 100 samples of 10 observations drawn at random from population 1 in
Fig. 1. It is because of the random sampling that the position and
length of the confidence interval vary from sample to sample. About 95 of these intervals (the actual number will vary) are expected to cover
the population mean of −15 mmHg. In this example, 98 of the
confidence intervals cover the population mean µ; the 2 exceptions
are highlighted (heavy black lines numbered 1 and
2).
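The long-run meaning of the confidence level can be illustrated by simulation. The sketch below rests on assumptions: the seed and the number of intervals (2,000 rather than 100) are arbitrary, the helper name is hypothetical, the population is taken to be normal with mean −15 and standard deviation 20, and the critical value 2.262 is the tabled Student t percentile for 9 degrees of freedom.

```python
import math
import random
import statistics

# Assumed population and sampling plan for the simulation.
MU, SIGMA, N = -15.0, 20.0, 10
T_CRIT = 2.262          # tabled t percentile: 9 degrees of freedom, 95% confidence
rng = random.Random(2)  # arbitrary seed so the run is repeatable


def interval_covers_mu():
    """Draw one sample, build its 95% confidence interval, report coverage."""
    sample = [rng.gauss(MU, SIGMA) for _ in range(N)]
    mean = statistics.fmean(sample)
    allowance = T_CRIT * statistics.stdev(sample) / math.sqrt(N)
    return mean - allowance <= MU <= mean + allowance


n_intervals = 2000
covered = sum(interval_covers_mu() for _ in range(n_intervals))
coverage = covered / n_intervals
print(f"{covered} of {n_intervals} intervals cover the population mean ({coverage:.1%})")
```

Any single interval either covers the population mean or does not; only the proportion across many repetitions is expected to approach 95%.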
A confidence interval characterizes the uncertainty about the estimated
value of a population parameter. Sometimes, an investigator may be
interested less in the value of the population parameter and more in
the distribution of individual observations. A tolerance interval
characterizes the uncertainty about the estimated distribution of those
individual observations (see APPENDIX).
Next, we illustrate the distinction between statistical significance
and scientific importance. Last, we show that the numerical results of
statistical analyses have limitations.
STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER
Hypothesis testing, as the primary scientific use of statistics, has a
drawback: the result of a hypothesis test conveys mere statistical
significance. In contrast, estimation conveys scientific significance.11 This
distinction is obvious if we use the results of a recent clinical
trial. In this trial, the Systolic Hypertension in the Elderly Program
(SHEP) Cooperative Research Group (45) evaluated the impact of
antihypertensive drugs on the incidence of stroke in persons with
isolated systolic hypertension. When compared with placebo, these drugs
reduced by 36% (P = 0.0003) the incidence of stroke.
Associated with this reduced incidence of stroke was a greater decrease
in systolic blood pressure.
To appreciate the distinction between statistical significance and
scientific importance, consider two populations that represent the
theoretical distributions of the decreases in systolic blood pressure
for the two groups. Let the decrease in systolic blood pressure of the
placebo group be designated Y1 and that of the drug
treatment group be designated Y2. Assume that
Y1 and Y2 are distributed
normally
\[ Y_i \sim N(\mu_i, \sigma_i^2), \quad i = 1, 2 \]
The normal probability density function (Eq. 1),
in which approximate values for the observed sample means and variances from the SHEP trial, ȳ_i and
s_i², are substituted for the
population means and variances, generates the population distributions
depicted in Fig. 5.

Fig. 5.
Statistical and scientific significance differ: placebo (black) and
drug-treatment (gray) populations. The populations represent
theoretical distributions of changes in systolic blood pressure during
year 5 of the Systolic Hypertension in the Elderly Program
clinical trial (see Ref. 45). The distributions are described by the
normal probability density function (Eq. 1) in which the
sample means and variances, ȳ_i and
s_i², are substituted for the
population means and variances. To generate samples of size n
from each population, observations (Obs) were drawn at random from the
placebo population; corresponding observations from the drug-treatment
population were obtained by subtracting 10 from each placebo
observation. The sampling procedure is illustrated for n = 2.
Suppose our objective is to estimate the difference between population
means, µ2 − µ1.
The SHEP group established convincingly that the difference
µ2 − µ1, which represents the greater
decrease in systolic blood pressure after drug therapy, was important.
To estimate µ2 − µ1, we would sample at
random from each population: the difference between sample means, ȳ2 − ȳ1,
estimates the difference between population means, µ2 − µ1.
By drawing samples of 2-128 observations from each population
(Table 2) and by forcing ȳ2 − ȳ1 = −10 (see Fig. 5), the distinction between statistical significance and scientific importance becomes clear. As sample size n
grows, the statistical significance increases, from P = 0.71 for n = 2 to P < 0.001 for n = 128. Regardless of sample size, one aspect of scientific importance, that
reflected by the difference ȳ2 − ȳ1,
remains constant. As sample size increases, uncertainty about the
actual difference µ2 − µ1, another aspect
of scientific importance characterized by the numerical bounds of the
confidence interval, decreases.
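The interplay of sample size, P value, and confidence interval can be sketched numerically. The sketch below is not the calculation behind Table 2: the common standard deviation of 20 is a hypothetical value, the function name `summarize` is an assumption, and a normal approximation replaces the Student t distribution, so the P values for small samples differ from those quoted in the text.

```python
import math
from statistics import NormalDist

DIFF = -10.0       # forced difference between sample means, as in the text
ASSUMED_SD = 20.0  # hypothetical common standard deviation of both groups
Z95 = NormalDist().inv_cdf(0.975)  # about 1.96


def summarize(n):
    """Two-sided P value and 95% interval for DIFF with n observations per group.

    Uses a normal approximation, which understates uncertainty when n is small.
    """
    se = ASSUMED_SD * math.sqrt(2.0 / n)
    z = DIFF / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return p_value, (DIFF - Z95 * se, DIFF + Z95 * se)


rows = {n: summarize(n) for n in (2, 4, 8, 16, 32, 64, 128)}
for n, (p_value, (lo, hi)) in rows.items():
    print(f"n = {n:3d}: P = {p_value:.4f}, 95% CI = [{lo:6.1f}, {hi:6.1f}]")
```

The point estimate is identical in every row; only the P value and the width of the interval change as the sample size grows.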
Practical considerations.
In experimental situations, the distinction between statistical
significance and scientific importance can be maintained by routinely
addressing two questions: how likely is it that the experimental effect
is real, and is the experimental effect large enough to be relevant?
The first question can be answered simply: compare the P value,
obtained in the hypothesis test, with the critical significance level α,
chosen before any data are collected; if P < α, then
the experimental effect is likely to be real. The second question can
be answered in two steps: calculate a confidence interval for the
population parameter, and then assess the numerical bounds of that
confidence interval for scientific importance; if either bound of the
confidence interval is important from a scientific perspective, then
the experimental effect may be large enough to be relevant.
Consider the results when 15 sample observations were drawn from the
placebo and drug treatment populations: when compared with placebo, the
greater decrease in systolic blood pressure after drug therapy was
unconvincing from a statistical perspective (P = 0.18).
Because the 95% confidence interval was [−25, +5], uncertainty
about the actual impact of drug treatment on systolic blood pressure is
relatively large. Note, however, that the additional decrease in
systolic blood pressure gained by drug treatment may have been as
pronounced as 25 mmHg. From a scientific perspective, further studies,
designed with greater statistical power, are warranted.
To illustrate that a significant statistical result may have little
scientific importance, imagine that systolic blood pressure had been
measured in mmH2O rather than in mmHg. Consider the results when 128 sample observations were drawn from the two populations: the
greater decrease in systolic blood pressure after drug therapy was
compelling from a statistical perspective (P < 0.001). If the confidence interval [−15, −5] is expressed in mmHg (by
dividing each bound by 13.6), then the investigator can declare, with
95% confidence, that the magnitude of the greater decrease in systolic blood pressure was 0.4-1.1 mmHg. In this example, the investigator can be quite certain of a trivial experimental effect.
Whatever the statistical result of a hypothesis test, assessment of the
corresponding confidence interval incorporates the scientific
importance of the experimental result.
LIMITATIONS OF STATISTICS
Although the process of scientific discovery requires an understanding
of fundamental concepts in statistics, the use of statistics does have
limitations. For example, not many of us would accept, solely on the
basis of a close temporal relationship, that solar radiation governs
stock market prices (Fig. 6). The
limitations of statistics are more subtle if an association is
plausible.

Fig. 6.
Limitations of statistics: solar radiation and New York stock market
prices during 1929 (after Ref. 27). In general, increases in stock
prices were associated with decreases in solar radiation. This
nonsensical association illustrates the phenomenon of spurious
correlation.
Imagine this scenario: a neurological syndrome results from impaired
production of some neurotransmitter. Drugs A and B,
derivatives of the same parent compound, both stimulate production of
this neurotransmitter. Just one of the drugs, however, continues to increase neurotransmitter production over its entire therapeutic range.
At higher doses, the second drug becomes less effective at boosting
neurotransmitter production and causes neurotoxicity. For each drug,
Table 3 lists administered drug
concentrations and measured increases in neurotransmitter production.
If you rely on only the regression statistics in Table 3, which drug is
which? If you are unfortunate and happen to have this hypothetical syndrome, then your choice assumes added importance.
From the regression statistics alone, it is impossible to differentiate
the drugs. Their identities are plain, however, when the data are
plotted (Fig. 7): drug A increases
neurotransmitter production over the entire range of drug
concentrations; the increase in neurotransmitter production begins to
fall at higher concentrations of drug B.

Fig. 7.
Limitations of statistics: scatterplots of drug concentration x
and increase in neurotransmitter production y. For each drug,
the fitted first-order model ŷ = 3 + 0.5x and
corresponding regression statistics are identical (see Table 3). For
only drug A, however, is this first-order relationship
plausible. For drug B, a second-order model of the form
Y = β0 + β1X + β2X² + ε is required.
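Identical regression statistics from differently shaped data can be demonstrated in code. The data in the sketch below are an assumption, not the values of Table 3: they are sets I and II of Anscombe's quartet, used as a stand-in because they yield the same fitted line, 3 + 0.5x. The helper names are hypothetical.

```python
import math
import statistics

# Stand-in data (assumption): Anscombe's quartet, sets I and II.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y_A = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y_B = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def fit_line(x, y):
    """Least-squares intercept, slope, and correlation coefficient."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope, sxy / math.sqrt(sxx * syy)


def residual_sign_changes(x, y):
    """Count sign changes of the residuals when ordered by x.

    Few changes suggest a systematic pattern, such as curvature, that the
    straight line misses; many changes suggest scatter about the line.
    """
    intercept, slope, _ = fit_line(x, y)
    residuals = [yi - (intercept + slope * xi) for xi, yi in sorted(zip(x, y))]
    signs = [res > 0 for res in residuals]
    return sum(a != b for a, b in zip(signs, signs[1:]))


fit_a, fit_b = fit_line(X, Y_A), fit_line(X, Y_B)
changes_a, changes_b = residual_sign_changes(X, Y_A), residual_sign_changes(X, Y_B)
print("A: intercept %.2f, slope %.2f, r %.3f, sign changes %d" % (*fit_a, changes_a))
print("B: intercept %.2f, slope %.2f, r %.3f, sign changes %d" % (*fit_b, changes_b))
```

The summary statistics agree to two decimal places, yet the ordered residuals of the second set form one long run of positive values flanked by negative ones, which is the numerical trace of the curvature that a scatterplot shows at a glance.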
Practical considerations.
Data graphics are essential also if the requisite assumptions behind a
particular statistical technique are to be verified. For examples in
regression, see chapt. 3 in Ref. 23.
SUMMARY
It is depressing to find how much good biological work is in
danger of being wasted through incompetent and misleading analysis ...
Frank Yates and Michael J. R. Healy (1964)
This scathing remark, written almost 35 years ago (50) but relevant
even now (4), reflects the frustrations felt by statisticians over the
statistical misconceptions held by scientists. These misconceptions
exist in large part because of shortcomings in the cursory statistics
education we received in graduate or medical school (4, 11, 12). The
major defect in most introductory courses in statistics is that
fundamental concepts in statistics, the cornerstone of scientific
inquiry (47), are neglected rather than emphasized (4, 7, 17, 44, 50).
Statisticians share responsibility with other faculty for ensuring that
introductory courses in statistics are relevant and sound (7, 44, 50).
In this review, we have reiterated the primary role of statistics
within science to be one of estimation: estimation of a population
parameter or estimation of the uncertainty about the value of that
parameter. Moreover, we have demonstrated the essential distinction
between statistical significance and scientific importance; of the two,
scientific importance merits more consideration. We have shown also
that without data graphics, data analysis is a game of chance. And
last, that this review was written by a physiologist and two
statisticians embodies one of the most basic notions in all science:
collaboration.