Departments of Pediatrics and of Preventive Medicine and Biometrics, School of Medicine, University of Colorado Health Sciences Center, Denver, 80262; and Department of Mathematics, University of Colorado at Denver, Denver, Colorado 80217-3364
| |
ABSTRACT |
|---|
Fundamental concepts in statistics form the cornerstone of scientific inquiry. If we fail to understand fully these fundamental concepts, then the scientific conclusions we reach are more likely to be wrong. This is more than supposition: for 60 years, statisticians have warned that the scientific literature harbors misunderstandings about basic statistical concepts. Original articles published in 1996 by the American Physiological Society's journals fared no better in their handling of basic statistical concepts. In this review, we summarize the two main scientific uses of statistics: hypothesis testing and estimation. Most scientists use statistics solely for hypothesis testing; often, however, estimation is more useful. We also illustrate the concepts of variability and uncertainty, and we demonstrate the essential distinction between statistical significance and scientific importance. An understanding of concepts such as variability, uncertainty, and significance is necessary, but it is not sufficient; we show also that the numerical results of statistical analyses have limitations.
confidence interval; estimation; tolerance interval; uncertainty; variability
| |
INTRODUCTION |
|---|
There are very few things which we know, which are not capable of being reduc'd to a Mathematical Reasoning, ... and where a Mathematical Reasoning can be had, it's as great folly to make use of any other, as to grope for a thing in the dark when you have a Candle standing by you.
John Arbuthnot (1692)
STATISTICS IS ONE KIND of mathematical reasoning. Its concepts and principles are ubiquitous in science: as researchers, we use them to design experiments, analyze data, report results, and interpret the published findings of others. Indeed, it is from this foundation of statistical concepts and principles that scientific knowledge is accumulated. If we fail to understand fully these fundamental statistical concepts and principles, that is, if our statistical reasoning is faulty, then we are more likely to reach wrong scientific conclusions. Reaching wrong conclusions through faulty reasoning is shoddy science; it is also unethical (1, 21, 30).
Regrettably, faulty reasoning in statistics rears its head in the practice of science: for 60 years, statisticians have documented statistical errors in the scientific literature (3, 4, 17, 33, 50). In part, these errors exist because many introductory textbooks of statistics paradoxically hinder literacy in statistics: they emphasize methods rather than concepts, they contain glaring errors, or they perpetuate misconceptions (4, 11, 12).
In his editorial prelude to a series of statistical papers, Yates (51) wrote that the papers were designed to raise statistical consciousness and thereby reduce statistical errors in journals published by the American Physiological Society. Rather than reinforce concepts, these papers reviewed methods: analysis of variance (20), linear regression (37, 46), mathematical modeling (22, 29, 40), risk assessment (36), and statistical packages (34). The proper use of any statistical technique, however, requires an understanding of the fundamental statistical concepts behind the technique.
How well do physiologists understand fundamental concepts in statistics? One way to answer this question is to examine the empirical incidence of basic statistical quantities such as standard deviations, standard errors, and confidence intervals. These quantities characterize different statistical features: standard deviations characterize variability in the population, whereas standard errors and confidence intervals characterize uncertainty about the estimated values of population parameters, e.g., means. Of the original articles published in 1996 by the American Physiological Society, the overwhelming majority (69-93%, range) report standard errors, apparently not as estimates of uncertainty but as estimates of variability (Table 1). Virtually no articles (0-2%, range) report confidence intervals, recommended by statisticians (2, 5, 9, 10, 28, 39) as interval estimates of uncertainty about the values of population parameters. Moreover, few articles (4-15%, range) report precise P values, which precludes personal assessment of statistical significance.
|
In this review, we summarize the primary scientific uses of statistics. Then, we illustrate several fundamental concepts: variability, uncertainty, and significance. Last, we illustrate that although an understanding of concepts such as variability, uncertainty, and significance is necessary, it is not sufficient: it is essential to realize also that the numerical results of statistical analyses have limitations.
Glossary
| α | Critical significance level |
| Ave {q} | Average of the quantity q |
| µ | Population mean |
| ν | Degrees of freedom |
| n | Number of observations |
| N(µ, σ²) | Normal (Gaussian) distribution with mean µ and variance σ² |
| P | Achieved significance level |
| Pr {A} | Probability of event A |
| σ | Population standard deviation |
| $\sigma_{\bar{y}}$ | Standard deviation of the sampling distribution of the sample mean |
| s | Sample standard deviation |
| σ² | Population variance |
| s² | Sample variance |
| SE {q} | Standard error of the quantity q |
| Var {q} | Variance of the quantity q |
| Y | Random variable Y |
| y_i | Sample observation i, where i = 1, 2, ... , n |
| $\bar{y}$ | Sample mean |
| |
SCIENTIFIC USES OF STATISTICS |
|---|
In science, there are two main uses of statistics: hypothesis testing and estimation. Most researchers use statistics solely for hypothesis testing. In many situations, statisticians play down hypothesis testing and prefer estimation instead.
Hypothesis testing. To test a scientific hypothesis, a researcher must formulate the hypothesis before any data are collected, then design and execute an experiment that is relevant to it. Because the hypothesis is most often one of no difference, the hypothesis is called, by tradition, the null hypothesis.1 Using data from the experiment, the researcher must next compute the observed value T of a test statistic. Finally, the researcher must compare the observed value T with some critical value T *, chosen from the distribution of the test statistic that is based on the null hypothesis. If T is more extreme than T *, then that is a surprising result if the null hypothesis is true, and the researcher is entitled, on statistical grounds, to become skeptical about the scientific validity of the null hypothesis.
The statistical test of a null hypothesis is useful because it assesses the strength of the evidence: it helps guard against an unwarranted conclusion, or it helps argue for a real experimental effect (19, 48). Nevertheless, a null hypothesis is often an artificial construct: before any data are recorded, the investigator knows, or at least suspects, that the null hypothesis is not exactly true. Moreover, the only question a hypothesis test can answer is a trivial one: is there anything other than random variation here?2
Statisticians have emphasized repeatedly the limited value of hypothesis testing (2, 4, 9, 18, 24, 28, 31, 38, 50). In fact, the P values that result from hypothesis tests have been described as "absurdly academic"3 (25) and as having a "strictly limited role" (19) in data analysis. Within the scientific community, unwarranted focus on hypothesis testing has blurred the distinction between statistical significance and scientific importance (3, 13, 19). Most investigators appear to reach scientific conclusions that are based not on their knowledge of science but solely on the probabilities of test statistics (16); this is an untenable approach to scientific discovery.
The limited utility of hypothesis testing can be demonstrated with an example. Suppose a clinician wants to assess the impact of a placebo and the β-blockers bisoprolol and metoprolol on heart rate variability in patients with left heart failure. Suppose also that the clinician constructs the null and alternative hypotheses, H0 and H1, as
|
|
|
|
Estimation. Regardless of the statistical result of a hypothesis test, the crucial question concerns the scientific result: is the experimental effect big enough to be relevant? A point estimate of a population parameter4 and an interval estimate of the uncertainty about the value of that parameter help answer this question. For example, one point estimate of a population mean is the sample mean; one interval estimate of the uncertainty about the value of the population mean is a confidence interval. Interval estimates circumvent the drawbacks inherent to hypothesis testing, yet they provide the same statistical information as a hypothesis test (15, 18, 28, 38). More important, point and interval estimates convey information about scientific importance.
Practical considerations. Estimation focuses attention on the magnitude and uncertainty of the experimental results. We must emphasize that hypothesis testing can have value beyond assessing the strength of the experimental evidence: for example, hypothesis testing is useful if an investigator wants to evaluate the importance of between-subject variability in an experiment. In practice, estimation should be done whenever it is relevant and feasible; the precise P value from the associated hypothesis test should be reported with the point and interval estimates. When more than one hypothesis is tested in an experiment, the problem of multiple comparisons becomes relevant. Nevertheless, a discussion of the issues involved in multiple-comparison procedures is beyond the scope of this review; Refs. 2, 9, 42, and 48 summarize these issues.
For the rest of this review, we focus our attention on several aspects of estimation.
| |
USING SAMPLES TO LEARN ABOUT POPULATIONS |
|---|
As researchers, we use samples to make inferences about populations. A sample interests us not because of its own merits but because it helps us estimate selected characteristics of the underlying population: for example, the sample mean $\bar{y}$ estimates the population mean µ.5
As an illustration, suppose the random variable Y represents the change in systolic blood pressure after some intervention. Suppose also that the distribution of Y conforms to a normal distribution. A normal distribution is specified completely by two parameters: the mean and variance. The population mean µ conveys the location of the center of the distribution; the population standard deviation σ, the square root of the population variance σ², conveys the spread of the distribution. The distribution of possible outcomes of the random variable Y is described by the normal probability density function f, which incorporates µ and σ²
$$f(y) = \frac{1}{\sigma\sqrt{2\pi}} \exp\left\{-\frac{(y-\mu)^2}{2\sigma^2}\right\} \tag{1}$$
|
|
Suppose we want to estimate µ1 = −15, the mean of population 1, in Fig. 1. To do this, we would measure the change in systolic blood pressure in a sample of n independent observations, y1, y2, ... , yn, from the population. For simplicity, assume we limit the sample to 10 observations. One random sample is
$$-33,\ -15,\ \ldots,\ -7$$
The sample mean
$$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i = -8.2 \tag{2}$$
differs from the population mean µ1; only because this is a contrived example do we know the true magnitude of the discrepancy.7 Next, we review measures that estimate variability in the population.
| |
ESTIMATING VARIABILITY IN THE POPULATION |
|---|
The preceding sample observations, −33, −15, ... , −7, differ because the population from which they were drawn is distributed over a range of possible values. This intrinsic variability is more than a distraction: it is an integral part of statistics, and the careful study of variability may reveal something about underlying scientific processes (25). The most common measure of the variability among sample observations is the sample standard deviation s, the square root of the sample variance s²
$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2$$
The sample standard deviation s estimates the population standard deviation σ: the standard deviation of the sample observations −33, −15, ... , −7 is s = 15.2, which estimates σ = 20.
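As a minimal sketch of how a sample yields a point estimate and a measure of variability, the following Python code draws one sample of 10 observations from a normal population with µ = −15 and σ = 20 and computes its mean and standard deviation. The function name, the seed, and the use of Python's standard library are assumptions made for illustration; the simulated observations are not the observations of the sample described above.

```python
import random
import statistics


def draw_sample(mu, sigma, n, seed):
    """Draw n independent observations from a normal population N(mu, sigma^2)."""
    rng = random.Random(seed)
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Population parameters taken from the text; the seed is arbitrary (assumption).
sample = draw_sample(mu=-15.0, sigma=20.0, n=10, seed=1)

y_bar = statistics.mean(sample)  # sample mean: point estimate of mu
s = statistics.stdev(sample)     # sample SD (divisor n - 1): estimate of sigma

print(f"sample mean = {y_bar:.1f}, sample SD = {s:.1f}")
```

A different seed gives a different sample, and therefore a different sample mean and standard deviation; neither will equal µ or σ exactly.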
Most journals would publish the preceding sample mean and standard deviation as −8.2 ± 15.2.
The standard deviation is often a useful index of variability, but in many experimental situations it may be a deceptive one: even subtle departures from a normal distribution can render useless the standard deviation as an index of variability (43); often, the distribution of a biological variable differs grossly from a normal distribution. As one example, the distribution of values for plasma creatinine (26) resembles the skewed distribution depicted in Fig. 2. When the tails of a distribution are elongated, as is the right tail of this skewed distribution, the sample standard deviation will be an inflated measure of variability in the population (43, 48). There are two remedies to this misrepresentation of variability by the standard deviation: use another measure of variability, or transform the data.
|
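Alternative measures of variability. Two measures of variability that are useful with a variety of distributions are the mean absolute deviation and the interquartile range. The mean absolute deviation (Ave {|dev|}) is the average distance of the sample observations from the sample mean
$$\mathrm{Ave}\{|\mathrm{dev}|\} = \frac{1}{n}\sum_{i=1}^{n}|y_i-\bar{y}|$$
The interquartile range is the distance between the 25th and 75th percentiles of the distribution. In general, for 0 < γ < 1, the 100γth percentile is the value below which 100γ% of the distribution is found.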
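The two alternative measures can be sketched in a few lines of Python. The data below are hypothetical values chosen to have one elongated right tail, and the helper names are ours; the quartiles come from `statistics.quantiles`, whose default interpolation rule is only one of several percentile conventions.

```python
import statistics


def mean_absolute_deviation(values):
    """Average distance of the observations from their mean."""
    center = statistics.mean(values)
    return sum(abs(v - center) for v in values) / len(values)


def interquartile_range(values):
    """Distance between the 25th and 75th percentiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


# Hypothetical right-skewed sample (not data from this review).
data = [1, 2, 2, 3, 3, 4, 5, 6, 8, 40]

sd = statistics.stdev(data)
mad = mean_absolute_deviation(data)
iqr = interquartile_range(data)
print(f"SD = {sd:.2f}, mean absolute deviation = {mad:.2f}, IQR = {iqr:.2f}")
```

For these values the single extreme observation inflates the standard deviation far more than it inflates the other two measures.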
Data transformation. When the sample observations happen to be drawn from a population that has a skewed distribution (e.g., a constituent of blood or the growth rate of a tumor), a transformation may change the shape of their distribution so that the distribution of the transformed observations is more symmetric (14, 23, 26, 32, 48). Common transformations include the logarithmic, inverse, square root, and arc sine transformations. The APPENDIX reviews a useful family of data transformations.
In the next section, we revisit the unknown discrepancy between the sample estimate of a population parameter and the population parameter itself.
| |
ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER |
|---|
In the sampling exercise from USING SAMPLES TO LEARN ABOUT POPULATIONS, the sample mean $\bar{y}$ = −8.2 (Eq. 2) estimated the population mean µ1 = −15. If we had calculated this sample mean from experimental observations, then we would be uncertain about the magnitude of the discrepancy between the sample estimate $\bar{y}$ and the population parameter µ1. The ability to estimate the level of uncertainty about the value of a population parameter by using the sample estimate of that parameter is a powerful aspect of statistics (47).
Suppose we measure the same response variable, the change in systolic blood pressure, in a second sample of 10 independent observations drawn from the same population. We know beforehand that because of random sampling the mean of the second sample will differ from the mean of the first sample, $\bar{y}$ = −8.2. If we measure the change in systolic blood pressure in 100 samples of 10 independent observations, then we expect 100 different estimates of the population mean µ1; for example
|
and
|
We can generalize from this empirical distribution of sample means to a theoretical distribution of the sample mean for a sample of size n. Consider a random variable Y that is distributed normally with mean µ and variance σ², which are known; the notation for this normal distribution is Y ~ N(µ, σ²). If an infinite number of samples, each with n independent observations, is drawn from this normal distribution, then the sample means $\bar{y}$ will also be distributed normally.8 The average of the sample means, Ave {$\bar{y}$}, is the population mean µ, but the variance of the sample means, Var {$\bar{y}$}, is smaller than the population variance σ² by a factor of 1/n
$$\mathrm{Var}\{\bar{y}\} = \frac{\sigma^2}{n}$$
Therefore, the standard deviation of the sample means, $\sigma_{\bar{y}}$, is
$$\sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}} \tag{3}$$
As the sample size n increases, the standard deviation of the sample means will decrease: that is, the more sample observations we have, the more certain we will be that the point estimate $\bar{y}$ is near the actual population mean µ.
|
The standard deviation of the theoretical distribution of the sample mean is known also as the standard error of the sample mean, that is
$$\mathrm{SE}\{\bar{y}\} = \sigma_{\bar{y}} = \frac{\sigma}{\sqrt{n}}$$
Confidence intervals. When we construct a confidence interval for the population mean, we assign numerical bounds to the expected discrepancy between the sample mean $\bar{y}$ and the population mean µ. In essence, a confidence interval is a range that we expect, with some level of confidence, to include the actual value of the population mean. Below, we use the theoretical distribution of the sample mean to derive the confidence interval for the population mean µ.10
Suppose Y ~ N(µ, σ²), where µ and σ² are known. Then 100(1 − α)% of the possible sample means is included in the interval
$$[\mu - a,\ \mu + a] \tag{4}$$
where the allowance a is
$$a = z_{\alpha/2}\,\sigma_{\bar{y}} \tag{5}$$
In Eq. 5, $z_{\alpha/2}$ is the 100[1 − (α/2)]th percentile from the standard normal distribution, i.e., a normal distribution with mean 0 and variance 1, and $\sigma_{\bar{y}}$ is defined by Eq. 3. Therefore, when the population standard deviation σ is known, 95% of the possible sample means are within 1.96 $\sigma_{\bar{y}}$ of the population mean µ.
The interval in Eq. 4 can be written as the probability expression
$$\Pr\{\mu - a \le \bar{y} \le \mu + a\} = 1 - \alpha$$
that a sample mean lies within the interval [µ − a, µ + a]. After algebraic rearrangement, this expression can be written
$$\Pr\{\bar{y} - a \le \mu \le \bar{y} + a\} = 1 - \alpha$$
In this expression, the variability resides in the sample mean $\bar{y}$, not in the actual parameter µ. In this form, the interval
$$[\bar{y} - a,\ \bar{y} + a] \tag{6}$$
is the 100(1 − α)% confidence interval for the population mean µ.
In practice, the sample standard deviation s estimates the population standard deviation σ, which means that $s/\sqrt{n}$ estimates the standard error of the mean (Eq. 3). In calculating a 100(1 − α)% confidence interval for the mean µ, this uncertainty about the actual value of σ is handled by replacing $z_{\alpha/2}$ in Eq. 5 with $t_{\alpha/2,\nu}$, the 100[1 − (α/2)]th percentile from a Student t distribution with ν = n − 1 degrees of freedom. Therefore, the allowance applied to the sample mean $\bar{y}$ to obtain the 100(1 − α)% confidence interval for the population mean (Eq. 6) is
$$a = t_{\alpha/2,\nu}\,\frac{s}{\sqrt{n}}$$
Note that this allowance exceeds the allowance in Eq. 5: there is greater uncertainty about the value of the population mean µ. This happens because if ν < ∞, then $t_{\alpha/2,\nu}$ > $z_{\alpha/2}$ for all values of α.
Suppose we want to calculate a confidence interval for the population mean µ1 = −15 by using the observations −33, −15, ... , −7 of the first sample. The mean and standard deviation of these 10 observations are $\bar{y}$ = −8.2 and s = 15.2. Therefore, the estimated standard error of the mean is
$$\mathrm{SE}\{\bar{y}\} = \frac{s}{\sqrt{n}} = \frac{15.2}{\sqrt{10}} = 4.81$$
with ν = n − 1 = 9 degrees of freedom. If we want a 95% confidence interval, then α = 0.05, $t_{\alpha/2,\nu}$ = 2.26, and the allowance a = 2.26 × 4.81 = 10.9. Therefore, the 95% confidence interval is
$$[-8.2 - 10.9,\ -8.2 + 10.9] = [-19.1,\ +2.7]$$
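The arithmetic of this example can be reproduced with a short Python sketch. The function name is ours, and the critical value 2.26 is supplied by hand from the text rather than computed, because Python's standard library has no Student t quantile function.

```python
import math


def mean_confidence_interval(y_bar, s, n, t_crit):
    """Confidence interval for a population mean from summary statistics.

    t_crit is the 100[1 - (alpha/2)]th percentile of Student t with n - 1
    degrees of freedom, looked up by the caller (assumption of this sketch).
    """
    se = s / math.sqrt(n)   # estimated standard error of the mean
    a = t_crit * se         # allowance applied to the sample mean
    return y_bar - a, y_bar + a, se


lower, upper, se = mean_confidence_interval(y_bar=-8.2, s=15.2, n=10, t_crit=2.26)
print(f"SE = {se:.2f}; 95% CI = [{lower:.1f}, {upper:+.1f}]")
# prints: SE = 4.81; 95% CI = [-19.1, +2.7]
```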
Bear in mind that a single confidence interval either does or does not include the value of the population parameter; in experimental situations, we are uncertain about which of these outcomes has occurred. Instead, the level of confidence in a confidence interval is based on the concept of drawing a large number of samples, each with n observations, from the population. When we measured the change in systolic blood pressure in 100 random samples, we obtained 100 different sample means and 100 different sample standard deviations. As a consequence, we will calculate 100 different 100(1 − α)% confidence intervals; we expect ~100(1 − α)% of these observed confidence intervals to include the actual value of the population mean (see Fig. 4).
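This long-run interpretation can be illustrated by simulation. The sketch below assumes a normal population with µ = −15 and σ = 20, samples of n = 10, an arbitrary seed, and a hand-entered critical value of 2.262 (the 97.5th percentile of Student t with 9 degrees of freedom); it counts how often the 95% confidence interval includes µ. We use 10,000 samples rather than 100 only to make the observed fraction more stable.

```python
import math
import random
import statistics

T_CRIT_9DF = 2.262  # 97.5th percentile of Student t with 9 df, entered by hand


def coverage_fraction(mu, sigma, n, n_samples, t_crit, seed=0):
    """Fraction of simulated confidence intervals that include the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        y_bar = statistics.fmean(sample)
        allowance = t_crit * statistics.stdev(sample) / math.sqrt(n)
        if y_bar - allowance <= mu <= y_bar + allowance:
            hits += 1
    return hits / n_samples


fraction = coverage_fraction(mu=-15.0, sigma=20.0, n=10, n_samples=10_000,
                             t_crit=T_CRIT_9DF)
print(f"fraction of intervals that include mu: {fraction:.3f}")
```

Any single interval in the loop either includes µ or does not; only the fraction over many samples approaches the stated level of confidence.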
|
| |
STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER |
|---|
Hypothesis testing, as the primary scientific use of statistics, has a drawback: the result of a hypothesis test conveys mere statistical significance. In contrast, estimation conveys scientific significance.11 This distinction is obvious if we use the results of a recent clinical trial. In this trial, the Systolic Hypertension in the Elderly Program (SHEP) Cooperative Research Group (45) evaluated the impact of antihypertensive drugs on the incidence of stroke in persons with isolated systolic hypertension. When compared with placebo, these drugs reduced by 36% (P = 0.0003) the incidence of stroke. Associated with this reduced incidence of stroke was a greater decrease in systolic blood pressure.
To appreciate the distinction between statistical significance and scientific importance, consider two populations that represent the theoretical distributions of the decreases in systolic blood pressure for the two groups. Let the decrease in systolic blood pressure of the placebo group be designated Y1 and that of the drug treatment group be designated Y2. Assume that Y1 and Y2 are distributed normally
$$Y_i \sim N(\mu_i,\ \sigma_i^2), \quad i = 1, 2$$
This model, when the sample means and variances, $\bar{y}_i$ and $s_i^2$, are substituted for the population means and variances, generates the population distributions depicted in Fig. 5
|
|
|
Suppose our objective is to estimate the difference between population means, µ2 − µ1, in order to assess whether this difference, which represents the greater decrease in systolic blood pressure after drug therapy, is important. To estimate µ2 − µ1, we would sample at random from each population: the difference between sample means, $\bar{y}_2 - \bar{y}_1$, estimates the difference between population means, µ2 − µ1.
By drawing samples of 2-128 observations from each population (Table 2) and by forcing $\bar{y}_2 - \bar{y}_1$ = −10 (see Fig. 5), the distinction between statistical significance and scientific importance becomes clear. As sample size n grows, the statistical significance increases, from P = 0.71 for n = 2 to P < 0.001 for n = 128. Regardless of sample size, one aspect of scientific importance, that reflected by the difference $\bar{y}_2 - \bar{y}_1$, remains constant. As sample size increases, uncertainty about the actual difference µ2 − µ1, another aspect of scientific importance characterized by the numerical bounds of the confidence interval, decreases.
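The mechanism behind this pattern can be sketched numerically. In the Python code below, the difference between sample means is fixed at −10 as in the text, but the sample standard deviations (20 in each group) are hypothetical values chosen for illustration, and the standard error uses the usual equal-sample-size form √[(s1² + s2²)/n]. The sketch does not reproduce the values in Table 2.

```python
import math


def se_difference(s1, s2, n):
    """Estimated standard error of the difference between two sample means
    when both samples have n observations."""
    return math.sqrt((s1 ** 2 + s2 ** 2) / n)


diff = -10.0    # difference between sample means, fixed as in the text
s1 = s2 = 20.0  # hypothetical sample standard deviations (assumption)

results = []
for n in (2, 8, 32, 128):
    se = se_difference(s1, s2, n)
    t = diff / se  # test statistic for the difference
    results.append((n, se, t))
    print(f"n = {n:3d}: difference = {diff}, SE = {se:.2f}, t = {t:.2f}")
```

The estimated difference never changes, yet the standard error shrinks and the test statistic grows in magnitude as n increases: statistical significance tracks sample size, not the size of the effect.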
|
Practical considerations. In experimental situations, the distinction between statistical significance and scientific importance can be maintained by routinely addressing two questions: how likely is it that the experimental effect is real, and is the experimental effect large enough to be relevant? The first question can be answered simply: compare the P value, obtained in the hypothesis test, with the critical significance level α, chosen before any data are collected; if P < α, then the experimental effect is likely to be real. The second question can be answered in two steps: calculate a confidence interval for the population parameter, and then assess the numerical bounds of that confidence interval for scientific importance; if either bound of the confidence interval is important from a scientific perspective, then the experimental effect may be large enough to be relevant.
A statistical result that is not significant may nevertheless have scientific importance. When the 95% confidence interval for µ2 − µ1 is [−25, +5], uncertainty about the actual impact of drug treatment on systolic blood pressure is relatively large. Note, however, that the additional decrease in systolic blood pressure gained by drug treatment may have been as pronounced as 25 mmHg. From a scientific perspective, further studies, designed with greater statistical power, are warranted.
To illustrate that a significant statistical result may have little scientific importance, imagine that systolic blood pressure had been measured in mmH2O rather than in mmHg. Consider the results when 128 sample observations were drawn from the two populations: the greater decrease in systolic blood pressure after drug therapy was compelling from a statistical perspective (P < 0.001). If the confidence interval [−15, −5] is expressed in mmHg (by dividing each bound by 13.6), then the investigator can declare, with 95% confidence, that the magnitude of the greater decrease in systolic blood pressure was 0.4-1.1 mmHg. In this example, the investigator can be quite certain of a trivial experimental effect.
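The change of units is simple arithmetic, sketched here in Python with a helper name of our own choosing. The conversion factor 13.6 is the one quoted in the text.

```python
MMH2O_PER_MMHG = 13.6  # conversion factor quoted in the text


def to_mmhg(bounds_mmh2o):
    """Convert confidence bounds expressed in mmH2O to mmHg."""
    return tuple(bound / MMH2O_PER_MMHG for bound in bounds_mmh2o)


lower, upper = to_mmhg((-15.0, -5.0))
print(f"95% CI in mmHg: [{lower:.1f}, {upper:.1f}]")
```

Changing the units rescales both bounds but leaves the P value untouched, which is why the P value alone cannot convey whether the effect is large enough to matter.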
Whatever the statistical result of a hypothesis test, assessment of the corresponding confidence interval incorporates the scientific importance of the experimental result.
| |
LIMITATIONS OF STATISTICS |
|---|
Although the process of scientific discovery requires an understanding of fundamental concepts in statistics, the use of statistics does have limitations. For example, not many of us would accept, solely on the basis of a close temporal relationship, that solar radiation governs stock market prices (Fig. 6). The limitations of statistics are more subtle if an association is plausible.
|
Imagine this scenario: a neurological syndrome results from impaired production of some neurotransmitter. Drugs A and B, derivatives of the same parent compound, both stimulate production of this neurotransmitter. Just one of the drugs, however, continues to increase neurotransmitter production over its entire therapeutic range. At higher doses, the second drug becomes less effective at boosting neurotransmitter production and causes neurotoxicity. For each drug, Table 3 lists administered drug concentrations and measured increases in neurotransmitter production. If you rely on only the regression statistics in Table 3, which drug is which? If you are unfortunate and happen to have this hypothetical syndrome, then your choice assumes added importance.
|
From the regression statistics alone, it is impossible to differentiate the drugs. Their identities are plain, however, when the data are plotted (Fig. 7): drug A increases neurotransmitter production over the entire range of drug concentrations; the increase in neurotransmitter production begins to fall at higher concentrations of drug B.
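The same point can be demonstrated with a well-known public data set. The sketch below does not use the data of Table 3; it substitutes two of the four data sets from Anscombe's quartet, one roughly linear and one curved, and fits a least-squares line to each with a small helper function of our own.

```python
def simple_regression(x, y):
    """Least-squares slope, intercept, and r-squared for y regressed on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Anscombe's quartet, data sets I and II (shared x values).
x = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
y_linear = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y_curved = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]

fit_linear = simple_regression(x, y_linear)
fit_curved = simple_regression(x, y_curved)
for label, fit in (("linear", fit_linear), ("curved", fit_curved)):
    slope, intercept, r_squared = fit
    print(f"{label}: slope = {slope:.2f}, intercept = {intercept:.2f}, "
          f"r^2 = {r_squared:.2f}")
```

To two decimal places the two fits agree, although in the second data set y rises and then falls as x increases; only a plot of the data exposes the difference.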
|
Practical considerations. Data graphics are essential also if the requisite assumptions behind a particular statistical technique are to be verified. For examples in regression, see chapt. 3 in Ref. 23.
| |
SUMMARY |
|---|
It is depressing to find how much good biological work is in danger of being wasted through incompetent and misleading analysis ...
Frank Yates and Michael J. R. Healy (1964)
This scathing remark, written almost 35 years ago (50) but relevant even now (4), reflects the frustrations felt by statisticians over the statistical misconceptions held by scientists. These misconceptions exist in large part because of shortcomings in the cursory statistics education we received in graduate or medical school (4, 11, 12). The major defect in most introductory courses in statistics is that fundamental concepts in statistics, the cornerstone of scientific inquiry (47), are neglected rather than emphasized (4, 7, 17, 44, 50). Statisticians share responsibility with other faculty for ensuring that introductory courses in statistics are relevant and sound (7, 44, 50).
In this review, we have reiterated the primary role of statistics within science to be one of estimation: estimation of a population parameter or estimation of the uncertainty about the value of that parameter. Moreover, we have demonstrated the essential distinction between statistical significance and scientific importance; of the two, scientific importance merits more consideration. We have shown also that without data graphics, data analysis is a game of chance. And last, that this review was written by a physiologist and two statisticians embodies one of the most basic notions in all science: collaboration.
| |
APPENDIX |
|---|
This APPENDIX reviews the lognormal distribution (a distribution that reveals limitations of the standard deviation as an estimate of variability), a versatile family of data transformations, the theoretical distribution of the sample mean, tolerance intervals, the statistical equations required to perform the significance sampling exercise, and the confidence interval for the difference between two population means.
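Lognormal distribution. The lognormal distribution is a common probability distribution model for skewed data. The random variable Y is distributed lognormally if the logarithm of Y is distributed normally with mean θ and variance τ², or ln Y ~ N(θ, τ²). Formally, the lognormal probability density function g is
$$g(y) = \frac{1}{y\,\tau\sqrt{2\pi}} \exp\left\{-\frac{(\ln y-\theta)^2}{2\tau^2}\right\}, \quad y > 0 \tag{A1}$$
The mean µg and variance σ²g of the lognormal distribution specified by Eq. A1 are
$$\mu_g = \exp\{\theta + \tau^2/2\} \quad \text{and} \quad \sigma_g^2 = \exp\{2\theta + \tau^2\}\,(\exp\{\tau^2\} - 1)$$
For example, when θ = 1.803 and τ² = 1, µg = 10 and σ²g = 172.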
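The numerical example can be verified with the standard formulas for the mean and variance of a lognormal distribution. In the Python sketch below, `mean_log` and `var_log` are our names for the mean and variance of ln Y.

```python
import math


def lognormal_moments(mean_log, var_log):
    """Mean and variance of Y when ln Y is normal with the given mean and variance."""
    mean = math.exp(mean_log + var_log / 2)
    variance = math.exp(2 * mean_log + var_log) * (math.exp(var_log) - 1)
    return mean, variance


mean_g, var_g = lognormal_moments(mean_log=1.803, var_log=1.0)
print(f"mean = {mean_g:.1f}, variance = {var_g:.0f}")
# prints: mean = 10.0, variance = 172
```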
A family of data transformations. Box and Cox (14) have described a family of power transformations in which an observed variable y is transformed into the variable w by using the parameter λ
$$w = \begin{cases} (y^{\lambda} - 1)/\lambda & \text{for } \lambda \ne 0 \\ \ln y & \text{for } \lambda = 0 \end{cases}$$
The logarithmic (λ = 0), inverse (λ = −1), and square root transformations (λ = 0.5) are members of this family. Draper and Smith (Ref. 23, p. 225-226) summarize the steps required to estimate the parameter λ so that the distribution of w is as normal (Gaussian) as possible.
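A minimal Python sketch of this family of power transformations follows. The function name is ours, and the sketch covers only the transformation for a given value of the parameter; it does not estimate the parameter, for which the steps summarized by Draper and Smith would be needed.

```python
import math


def box_cox(y, lam):
    """Box-Cox power transformation of a single positive observation y."""
    if y <= 0:
        raise ValueError("the transformation requires y > 0")
    if lam == 0:
        return math.log(y)        # limiting case: logarithmic transformation
    return (y ** lam - 1) / lam   # power transformation for lam != 0


# Hypothetical right-skewed observations (not data from this review).
observations = [1.0, 2.0, 4.0, 8.0, 64.0]
for lam in (-1, 0, 0.5):
    transformed = [box_cox(y, lam) for y in observations]
    print(lam, [round(w, 2) for w in transformed])
```

As the parameter approaches 0, the power form approaches ln y, which is why the logarithmic transformation belongs to the family.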
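Theoretical distribution of the sample mean. Suppose some random variable X is distributed normally with mean µ and variance σ²: that is, X ~ N(µ, σ²). When a sample of n independent observations, x1, x2, ... , xn, is drawn repeatedly from this distribution, the observed sample means can be treated as observations. These sample means will be distributed normally with mean µ and variance σ²/n, or
$$\bar{x} \sim N(\mu,\ \sigma^2/n)$$
This result follows from the properties of a linear combination $L = \sum_{i=1}^{m} a_i X_i$ of m independent random variables, where $X_i \sim N(\mu_i, \sigma_i^2)$. The mean of L, Ave {L}, is
$$\mathrm{Ave}\{L\} = \sum_{i=1}^{m} a_i \mu_i$$
and the variance of L, Var {L}, is
$$\mathrm{Var}\{L\} = \sum_{i=1}^{m} a_i^2 \sigma_i^2$$
If L is $\bar{x}$, the mean of the n sample observations x1, x2, ... , xn, then m = n, and furthermore, for i = 1, 2, ... , n
$$a_i = 1/n, \quad \mu_i = \mu, \quad \sigma_i^2 = \sigma^2$$
Therefore
$$\mathrm{Ave}\{\bar{x}\} = \mu \quad \text{and} \quad \mathrm{Var}\{\bar{x}\} = \sigma^2/n$$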
Tolerance intervals. A tolerance interval identifies the bounds that are expected to contain some percentage of a population, not just a single population parameter such as the mean (41). If a normal distribution has mean µ and variance σ², which are known, then the 100γ% tolerance interval is
$$[\mu - z_{(1-\gamma)/2}\,\sigma,\ \mu + z_{(1-\gamma)/2}\,\sigma]$$
where $z_{(1-\gamma)/2}$ is the 100[1 − {(1 − γ)/2}]th percentile from the standard normal distribution, i.e., N(0, 1). This tolerance interval covers exactly 100γ% of the distribution. If γ = 0.95, then $z_{(1-\gamma)/2}$ = 1.96. For the population that represented the change in systolic blood pressure after some intervention (see USING SAMPLES TO LEARN ABOUT POPULATIONS), µ = −15 and σ = 20; therefore, the exact 95% tolerance interval is
$$[-15 - 1.96(20),\ -15 + 1.96(20)] = [-54.2,\ +24.2]$$
In practice, the sample statistics $\bar{y}$ and s are used to estimate the population parameters µ and σ. This element of uncertainty about the values of µ and σ is handled by replacing $z_{(1-\gamma)/2}$ with the confidence coefficient k, where k depends on γ and α as well as the sample size n. Therefore, the estimated 100γ% tolerance interval is
$$[\bar{y} - ks,\ \bar{y} + ks]$$
[If γ = 0.95 and n = ∞, then k = $z_{(1-\gamma)/2}$ = 1.96 as above, when µ and σ were known.] The coefficient k is chosen to enable the declaration, with 100(1 − α)% confidence, that the estimated tolerance interval covers 100γ% of the distribution (see Table XIV in Ref. 41).
For the observations listed in USING SAMPLES TO LEARN ABOUT POPULATIONS, $\bar{y}$ = −8.2 and s = 15.2. Suppose we want to estimate with 95% confidence a 90% tolerance interval based on these results. When we use these percentages and the sample size of 10, the coefficient k = 2.839. Therefore, the tolerance interval is
$$[-8.2 - 2.839(15.2),\ -8.2 + 2.839(15.2)] = [-51.4,\ +35.0]$$
We can declare, with 95% confidence, that 90% of the population will have a change in systolic blood pressure between −51 and +35 mmHg after the intervention. Note that this statement differs markedly from our previous assertion, made also with 95% confidence, that the population mean µ was included in the interval [−19.1, +2.7].
The tolerance intervals outlined above are appropriate only if the distribution of the underlying population is normal; other formulas exist to construct tolerance intervals when the population is distributed nonnormally.
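Both tolerance intervals can be reproduced with a short Python sketch. The function names are ours; the exact interval uses the standard normal percentile from `statistics.NormalDist`, whereas the coefficient k = 2.839 is taken from the text rather than computed, because it comes from a published table.

```python
from statistics import NormalDist


def exact_tolerance_interval(mu, sigma, coverage):
    """Tolerance interval when the population mean and SD are known."""
    z = NormalDist().inv_cdf(1 - (1 - coverage) / 2)
    return mu - z * sigma, mu + z * sigma


def estimated_tolerance_interval(y_bar, s, k):
    """Tolerance interval from sample statistics and a tabulated coefficient k."""
    return y_bar - k * s, y_bar + k * s


exact = exact_tolerance_interval(mu=-15.0, sigma=20.0, coverage=0.95)
estimated = estimated_tolerance_interval(y_bar=-8.2, s=15.2, k=2.839)

print(f"exact 95% tolerance interval: [{exact[0]:.1f}, {exact[1]:+.1f}]")
print(f"estimated 90% tolerance interval: [{estimated[0]:.0f}, {estimated[1]:+.0f}]")
```

The estimated interval is much wider than the confidence interval for the mean computed from the same sample, because it describes where individual observations fall rather than where the population mean lies.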
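Equations for the significance sampling exercise. For two samples of equal size n, the standard error of the difference between sample means, SE {$\bar{y}_2 - \bar{y}_1$}, is estimated as
$$\mathrm{SE}\{\bar{y}_2 - \bar{y}_1\} = \sqrt{\frac{s_1^2 + s_2^2}{n}}$$
The 100(1 − α)% confidence interval for µ2 − µ1, the difference between population means, is
$$[(\bar{y}_2 - \bar{y}_1) - a,\ (\bar{y}_2 - \bar{y}_1) + a] \tag{A2}$$
where the allowance a is
$$a = t_{\alpha/2,\nu}\ \mathrm{SE}\{\bar{y}_2 - \bar{y}_1\}$$
and $t_{\alpha/2,\nu}$ is the 100[1 − (α/2)]th percentile from a Student t distribution with ν = 2n − 2 degrees of freedom. In this sampling exercise, we use the t distribution because we assume the standard deviations of the populations are unknown (42).
The test statistic used to evaluate statistical significance of the difference $\bar{y}_2 - \bar{y}_1$ is
$$t = \frac{\bar{y}_2 - \bar{y}_1}{\mathrm{SE}\{\bar{y}_2 - \bar{y}_1\}} \tag{A3}$$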
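These calculations can be collected in one small Python function. The sketch assumes equal sample sizes and the equal-n form of the standard error, √[(s1² + s2²)/n]; the function name, the sample means and standard deviations in the usage example, and the hand-entered critical value of 2.00 (approximately the 97.5th percentile of Student t with 62 degrees of freedom) are all hypothetical and are not the values of the sampling exercise.

```python
import math


def difference_summary(y1_bar, y2_bar, s1, s2, n, t_crit):
    """Difference between two sample means, its standard error, the test
    statistic, and the confidence interval, for two samples of equal size n.

    t_crit is the Student t percentile with 2n - 2 degrees of freedom,
    looked up by the caller (assumption of this sketch).
    """
    diff = y2_bar - y1_bar
    se = math.sqrt((s1 ** 2 + s2 ** 2) / n)
    t_stat = diff / se
    allowance = t_crit * se
    return diff, se, t_stat, (diff - allowance, diff + allowance)


# Hypothetical summary statistics for n = 32 observations per group.
diff, se, t_stat, interval = difference_summary(
    y1_bar=-5.0, y2_bar=-15.0, s1=20.0, s2=20.0, n=32, t_crit=2.00)

print(f"difference = {diff}, SE = {se:.2f}, t = {t_stat:.2f}")
print(f"95% CI = [{interval[0]:.1f}, {interval[1]:.1f}]")
```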
Confidence interval for the difference between population means. In the significance sampling exercise (see STATISTICAL AND SCIENTIFIC SIGNIFICANCE DIFFER), we calculated a confidence interval for the difference between two population means. Rather than construct a confidence interval for this difference, a researcher could construct a confidence interval for each population mean: if the two confidence intervals fail to overlap, the researcher would conclude that the population means differ. This approach is conservative.
Consider the results when 32 sample observations were drawn from the placebo and drug treatment populations: when compared with placebo, drug therapy was associated with a greater decrease in systolic blood pressure (P = 0.04), and the 95% confidence interval for the difference between population means was [−19, −1]. That this
confidence interval excludes 0 corroborates that the population means
differ at the α = 0.05 level.
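One way to see why comparing two separate confidence intervals is conservative is to compare the size of difference each approach requires. The Python sketch below does so under loudly stated assumptions: both groups are given the same standard error of the mean, 3.32 (this appendix quotes that value once, and applying it to both groups is hypothetical); the t percentile for 62 degrees of freedom is rounded to 2.00; and the observed difference is taken as the midpoint of the rounded interval [−19, −1].

```python
import math

t_single = 2.04   # t percentile for nu = n - 1 = 31 (quoted in this appendix)
se_single = 3.32  # standard error of one sample mean (quoted); assumed equal in both groups
t_diff = 2.00     # assumption: rounded t percentile for nu = 2n - 2 = 62

# Allowance for each separate 95% confidence interval.
allowance_single = t_single * se_single

# With equal standard errors, SE of the difference is sqrt(2) times the single-mean SE.
se_diff = math.sqrt(2) * se_single
allowance_diff = t_diff * se_diff

# Assumption: observed difference taken as the midpoint of the reported interval [-19, -1].
observed_diff = (-19 + -1) / 2

# The interval for the difference excludes 0 when |difference| exceeds its allowance.
excludes_zero = abs(observed_diff) > allowance_diff

# Two separate intervals fail to overlap only when |difference| exceeds both allowances combined.
intervals_separate = abs(observed_diff) > 2 * allowance_single

print("needed to exclude 0:      %.1f mmHg" % allowance_diff)
print("needed for no overlap:    %.1f mmHg" % (2 * allowance_single))
print("difference excludes 0:    %s" % excludes_zero)
print("separate intervals apart: %s" % intervals_separate)
```

Under these assumptions a difference of about 10 mmHg is enough for the interval for the difference to exclude 0, yet the two separate intervals would still overlap, because non-overlap demands a noticeably larger difference.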
The observed sample means for the placebo and drug treatment groups
were
|
|
$\nu = n - 1 = 31$ degrees of freedom.
If we want a 95% confidence interval for each population mean (Eq. 6), then
$\alpha = 0.05$, $t_{\alpha/2,\nu} = 2.04$, and
the allowance $a = 2.04 \times 3.32 = 6.8$. Therefore, the
95% confidence interval for the mean of the placebo population is
|
|
| |
ACKNOWLEDGEMENTS |
|---|
We thank Brenda B. Rauner, Publications Manager and Executive Editor, APS Publications, for providing the information about research manuscripts published by the American Physiological Society.
| |
FOOTNOTES |
|---|
This review was supported in part by the Dept. of Pediatrics (M. Douglas Jones, Jr., Chair); by a Grant-in-Aid from the American Heart Association of Colorado and Wyoming (to D. Curran-Everett); and by the National Science Foundation Grant DMS 95-10435 (to K. Kafadar).
1 The adjective "null" can be misleading: this hypothesis need not be one of no difference. The use of null persists because of historical inertia.
2 Kruskal (38) reviews other drawbacks to hypothesis testing.
3 Sir Ronald Fisher, the author of this phrase, developed many statistical procedures, including the analysis of variance.
4 A parameter is a numerical constant: for example, the population mean.
5 References 2, 9, 42, and 48 discuss other aspects of sampling.
6 Statistical calculations and exercises were executed by using SAS Release 6.04 (SAS Institute, Cary, NC, 1987).
7 We address the discrepancy between the value of the sample estimate of a population parameter and the value of the population parameter itself in ESTIMATING UNCERTAINTY ABOUT A POPULATION PARAMETER.
8 The Central Limit Theorem states that, when the sample size is sufficiently large, the theoretical distribution of the sample mean will be approximately normal, regardless of the distribution of the original observations (35, 42). If the distribution of the original observations happens to be normal, then the theoretical distribution of the sample mean will be exactly normal.
9 References 2, 9, 42, and 48 discuss the calculation of confidence intervals for other population parameters.
10 Moses (Ref. 42, p. 113-117) illustrates further the concept of a confidence interval by using empirical examples.
11 The word "significance," when used to refer to scientific consequence, is ambiguous. Hereafter, we use the word "importance."
Address for reprint requests: D. Curran-Everett, Dept. of Pediatrics, B-195, Univ. of Colorado Health Sciences Center, 4200 East 9th Ave., Denver, CO 80262 (E-mail: dcurranevere{at}castle.cudenver.edu).
| |
REFERENCES |
|---|
|
|
|---|
1. Altman, D. G. Misuse of statistics is unethical. In: Statistics in Practice, edited by S. M. Gore and D. G. Altman. London: Br. Med. Assoc., 1982, p. 1-2.
2. Altman, D. G. Practical Statistics for Medical Research. New York: Chapman & Hall, 1991.
3. Altman, D. G. Statistics in medical journals: developments in the 1980s. Stat. Med. 10: 1897-1913, 1991.
4. Altman, D. G., and J. M. Bland. Improving doctors' understanding of statistics. J. R. Stat. Soc. Ser. A 154: 223-267, 1991.
5. Altman, D. G., S. M. Gore, M. J. Gardner, and S. J. Pocock. Statistical guidelines for contributors to medical journals. Br. Med. J. 286: 1489-1493, 1983.
6. Anscombe, F. J. Graphs in statistical analysis. Am. Statistician 27: 17-21, 1973.
7. Appleton, D. R. What statistics should we teach medical undergraduates and graduates? Stat. Med. 9: 1013-1021, 1990.
8. Arbuthnot, J. Of the Laws of Chance. London: Benj. Motte, 1692.
9. Armitage, P., and G. Berry. Statistical Methods in Medical Research (3rd ed.). Cambridge, MA: Blackwell Scientific, 1994.
10. Bailar, J. C., III, and F. Mosteller. Guidelines for statistical reporting in articles for medical journals. Ann. Intern. Med. 108: 266-273, 1988.
11. Bland, J. M., and D. G. Altman. Caveat doctor: a grim tale of medical statistics textbooks. Br. Med. J. 295: 979, 1987.
12. Bland, J. M., and D. G. Altman. Misleading statistics: errors in textbooks, software and manuals. Int. J. Epidemiol. 17: 245-247, 1988.
13. Boring, E. G. Mathematical vs. scientific significance. Psychol. Bull. 16: 335-338, 1919.
14. Box, G. E. P., and D. R. Cox. An analysis of transformations. J. R. Stat. Soc. Ser. B 26: 211-243, 1964.
15. Burkholder, D. L., and J. Pfanzagl. Estimation. In: International Encyclopedia of the Social Sciences, edited by D. L. Sills. New York: Macmillan & The Free Press, 1968, vol. 5, p. 142-157.
16. Burnand, B., W. N. Kernan, and A. R. Feinstein. Indexes and boundaries for "quantitative significance" in statistical decisions. J. Clin. Epidemiol. 43: 1273-1284, 1990.
17. Colditz, G. A., and J. D. Emerson. The statistical content of published medical research: some implications for biomedical education. Med. Educ. 19: 248-255, 1985.
18. Colton, T. Statistics in Medicine. Boston, MA: Little, Brown, 1974.
19. Cox, D. R. Statistical significance tests. Br. J. Clin. Pharmacol. 14: 325-331, 1982.
20. Denenberg, V. H. Some statistical and experimental considerations in the use of the analysis-of-variance procedure. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R403-R408, 1984.
21. Denham, M. J., A. Foster, and D. A. J. Tyrrell. Work of a district ethical committee. Br. Med. J. 2: 1042-1045, 1979.
22. DiStefano, J. J., III, and E. M. Landaw. Multiexponential, multicompartmental, and noncompartmental modeling. I. Methodological limitations and physiological interpretations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R651-R664, 1984.
23. Draper, N. R., and H. Smith. Applied Regression Analysis (2nd ed.). New York: Wiley, 1981.
24. Evans, S. J. W., P. Mills, and J. Dawson. The end of the p value? Br. Heart J. 60: 177-180, 1988.
25. Fisher, R. A. Statistical Methods and Scientific Inference (3rd ed.). New York: Hafner, 1973.
26. Flynn, F. V., K. A. J. Piper, P. Garcia-Webb, K. McPherson, and M. J. R. Healy. The frequency distributions of commonly determined blood constituents in healthy blood donors. Clin. Chim. Acta 52: 163-171, 1974.
27. Garcia-Mata, C., and F. I. Shaffner. Solar and economic relationships: a preliminary report. Q. J. Economics 49: 1-51, 1934.
28. Gardner, M. J., and D. G. Altman. Confidence intervals rather than P values: estimation rather than hypothesis testing. Br. Med. J. 292: 746-750, 1986.
29. Garfinkel, D., and K. A. Fegley. Fitting physiological models to data. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R641-R650, 1984.
30. Gray, B. H., R. A. Cooke, and A. S. Tannenbaum. Research involving human subjects. Science 201: 1094-1101, 1978.
31. Healy, M. J. R. Significance tests. Arch. Dis. Child. 66: 1457-1458, 1991.
32. Healy, M. J. R. Data transformations. Arch. Dis. Child. 69: 260-264, 1993.
33. Hill, A. B. Principles of medical statistics. XII. Common fallacies and difficulties. Lancet i: 706-708, 1937.
34. Hofacker, C. F. Abuse of statistical packages: the case of the general linear model. Am. J. Physiol. 245 (Regulatory Integrative Comp. Physiol. 14): R299-R302, 1983.
35. Hogg, R. V., and A. T. Craig. Introduction to Mathematical Statistics (4th ed.). New York: Macmillan, 1978.
36. Iberall, A. S. The problem of low-dose radiation toxicity. Am. J. Physiol. 244 (Regulatory Integrative Comp. Physiol. 13): R7-R13, 1983.
37. Jackson, T. E. Comparison of a class of regression equations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R271-R276, 1984.
38. Kruskal, W. H. Tests of significance. In: International Encyclopedia of the Social Sciences, edited by D. L. Sills. New York: Macmillan & The Free Press, 1968, vol. 14, p. 238-250.
39. Land, T. A., and M. Secic. How to Report Statistics in Medicine. Philadelphia, PA: Am. College Physicians, 1997.
40. Landaw, E. M., and J. J. DiStefano III. Multiexponential, multicompartmental, and noncompartmental modeling. II. Data analysis and statistical considerations. Am. J. Physiol. 246 (Regulatory Integrative Comp. Physiol. 15): R665-R677, 1984.
41. Montgomery, D. C., and G. C. Runger. Applied Statistics and Probability for Engineers. New York: Wiley, 1994, p. 361-363.
42. Moses, L. E. Think and Explain with Statistics. Reading, MA: Addison-Wesley, 1986.
43. Mosteller, F., and J. W. Tukey. Data Analysis and Regression. Reading, MA: Addison-Wesley, 1977.
44. Murray, G. D. How we should approach the future. Stat. Med. 9: 1063-1068, 1990.
45. SHEP Cooperative Research Group. Prevention of stroke by antihypertensive drug treatment in older persons with isolated systolic hypertension. Final results of the systolic hypertension in the elderly program (SHEP). JAMA 265: 3255-3264, 1991.
46. Slinker, B. K., and S. A. Glantz. Multiple regression for physiological data analysis: the problem of multicollinearity. Am. J. Physiol. 249 (Regulatory Integrative Comp. Physiol. 18): R1-R12, 1985.
47. Snedecor, G. W. The statistical part of the scientific method. Ann. NY Acad. Sci. 52: 792-799, 1950.
48. Snedecor, G. W., and W. G. Cochran. Statistical Methods (7th ed.). Ames: Iowa State Univ. Press, 1980.
49. Tuininga, Y. S., D. J. van Veldhuisen, J. Brouwer, J. Haaksma, H. J. G. M. Crijns, A. J. Man in't Veld, and K. I. Lie. Heart rate variability in left ventricular dysfunction and heart failure: effects and implications of drug treatment. Br. Heart J. 72: 509-513, 1994.
50. Yates, F., and M. J. R. Healy. How should we reform the teaching of statistics? J. R. Stat. Soc. Ser. A 127: 199-210, 1964.
51. Yates, F. E. Contribution of statistics to ethics of science. Am. J. Physiol. 244 (Regulatory Integrative Comp. Physiol. 13): R3-R5, 1983.