## Abstract

The volume of red blood cells (V_{RBC}) is used routinely in the diagnostic workup of polycythemia, in assessing the efficacy of erythropoietin administration, and to study factors affecting oxygen transport. However, errors of various methods of measurement of V_{RBC} and related parameters are not well characterized. We meta-analyzed 346 estimates of error of measurement of V_{RBC} for techniques based on Evans blue (V_{RBC,Evans}), ^{51}chromium-labeled red blood cells (V_{RBC,51Cr}), and carbon monoxide (CO) rebreathing (V_{RBC,CO}), as well as hemoglobin mass with the carbon-monoxide method (M_{Hb,CO}), in athletes and active and inactive subjects undergoing various experimental and control treatments lasting minutes to months. Subject characteristics and experimental treatments had little effect on error of measurement, but measures with the smallest error showed some increase in error with increasing time between trials. Adjusted to 1 day between trials and expressed as coefficients of variation, mean errors for M_{Hb,CO} (2.2%; 90% confidence interval 1.4–3.5%) and V_{RBC,51Cr} (2.8%; 2.4–3.2%) were much less than those for V_{RBC,Evans} (6.7%; 4.9–9.4%) and V_{RBC,CO} (6.7%; 3.4–14%). Most of the error of V_{RBC,Evans} was due to error in measurement of volume of plasma via Evans blue dye (6.0%; 4.5–7.8%), which is the basis of V_{RBC,Evans}. Most of the error in V_{RBC,CO} was due to estimates from laboratories with a relatively large error in M_{Hb,CO}, the basis of V_{RBC,CO}. V_{RBC,51Cr} and M_{Hb,CO} are the best measures for research on blood-related changes in oxygen transport. With care, V_{RBC,Evans} is suitable for clinical applications of blood-volume measurement.

- reliability
- hemoglobin mass
- volume of red blood cells
- Evans blue dye
- carbon monoxide

Blood volume (V_{Blood}) and its component volumes of red blood cells (V_{RBC}) and plasma (V_{Plasma}) have been studied for more than 100 years (58). As oxygen transport is closely associated with circulating mass of hemoglobin (M_{Hb}) (13, 49, 65), this component of the V_{RBC} has also been studied extensively. Early investigators of V_{Blood} parameters were concerned mostly with establishing the methodology and providing normal values for body size, age, and sex (133). In the last 50 yr, the techniques have been applied clinically in the diagnostic workup of polycythemia rubra vera and anemia (31, 32, 105, 106), assessing response to erythropoietin administration (103), and in red cell survival studies (141). Additionally, V_{Blood} and related measures have featured in studies of relationships between exercise and aging (24), changes in V_{Blood}, V_{Plasma}, and V_{RBC} with bed rest (90, 104, 108) and spaceflight (1), and, among chronic exercisers (136), changes in cardiac function with training-induced hypervolemia (134) and body fluid redistribution during exercise (97), as well as in investigations of thermoregulatory stress (124), the contributions of V_{Plasma} and hemoglobin to oxygen transport and performance (28, 45, 53, 125), mechanisms of blood doping (27, 103), and adaptation to altitude (48, 59, 88).

The methods used to measure V_{Blood} values are all indirect and based on dilution of tracers injected into the circulation (31). The tracers are red blood cells labeled with radioactive chromium (^{51}Cr) for measurement of V_{RBC} (51), albumin labeled with radioactive iodine (^{131}I or ^{125}I) for measurement of V_{Plasma} (128), and the dye Evans blue, which delineates V_{Plasma} by staining plasma proteins (43). Measurement of M_{Hb} is also based on dilution of a tracer, inhaled carbon monoxide (CO), which binds to and changes the color of hemoglobin (58). The ^{51}Cr method for estimating V_{RBC} (V_{RBC,51Cr}) is regarded as the criterion method by the International Committee for Standardization in Haematology, on the “basis of reliability, reproducibility and ease of use in routine clinical use” (75). Iodine-labeled albumin is recommended for estimating V_{Plasma} (74, 75) in clinical settings, whereas researchers concerned about the health risks of radioactive iodine have used the Evans blue (12, 47, 90, 104, 108) or CO (25, 55, 99, 113, 151) methods extensively.

Application of a patient's V_{RBC} measurement to a clinical assessment requires an appreciation of population distribution of values (106) and of normal physiological within-person variation (31). Without supporting data, it is often assumed that the patient's V_{RBC} has been measured reliably; that is, if the test were repeated a few days later, a similar value would be obtained (32). However, meaningful interpretation of serial changes in blood parameters in myeloproliferative disease or experimentally induced erythropoiesis requires quantification of the errors of the measurement techniques (31). A reliable method to determine V_{RBC} or M_{Hb} will have small measurement error; moreover, reliability is a prerequisite of validity, the extent to which a test actually measures what it intends to measure. Measurement error is calculated as a coefficient of variation (CV) (109, 110) and is usually expressed as a percentage. The measurement error is also known as typical error (67) and includes random error (analytic error arising from using the method-specific apparatus and intraindividual biological variation) but not systematic error (bias). Measurement error can be estimated from studies that include interventions, where it will include the analytic error, the day-to-day biological variation, and the interindividual variation in response to the intervention. Measurement errors can arise in numerous ways for the different methods used to estimate V_{Blood} or M_{Hb}, as described in Table 1.

There has been no systematic review of errors in the various measures of V_{Blood} and related parameters, and it was apparent to us that the errors for V_{RBC}, V_{Plasma}, and M_{Hb} ranged widely, between 1 and 10%. We have, therefore, performed a meta-analysis of the errors of measurement to characterize the contribution of sampling variation, differences between laboratories, effects of subject and study characteristics, and true differences between the errors of methods used to estimate V_{Blood} values and M_{Hb}.

## METHODS

### Data Sources, Techniques, and Method Variations

The data used in the meta-analysis were obtained from original publications that reported values of individual subjects and from published studies whose authors provided us with their de-identified raw data (1, 2, 4–6, 10–12, 15–17, 21, 24, 25, 33, 35–37, 39–44, 47–50, 55, 63, 65, 77–80, 83, 86–91, 93, 94, 98–102, 104, 107, 108, 111, 113, 114, 117–120, 122, 132, 137–139, 141, 143, 144, 146, 148–151). Three unpublished data sets have also been included in the analysis, courtesy of the respective senior investigator responsible for data collection (Gore CJ, Slater GJ and Schmidt W, personal communications). The first two of these studies were approved by the Australian Institute of Sport Human Ethics Committee and the third was approved by the Human Ethics Committee of the Friedrich-Alexander-Universität Erlangen-Nürnberg, Germany.

The V_{Blood} techniques selected for the meta-analysis are those most commonly used to estimate V_{RBC} and M_{Hb} (Table 1). We were able to obtain acceptable amounts of data for each of the following: CO rebreathing for M_{Hb} (M_{Hb,CO}); CO rebreathing for V_{RBC} (V_{RBC,CO}); V_{RBC,51Cr}; and Evans blue dye for V_{Plasma}, V_{Blood}, and V_{RBC} (V_{Plasma,Evans}, V_{Blood,Evans}, and V_{RBC,Evans}). Some data sets included values for hematocrit (Hct) and/or hemoglobin concentration ([Hb]). Only one study (30) provided sufficient data for V_{Plasma}, estimated using ^{131}I-labeled albumin, and none provided data for ^{125}I.

Modification of the CO-rebreathing technique may affect the magnitude of measurement error for M_{Hb,CO} and its related volumes. We have previously recommended that, to minimize likely sources of error, researchers should use relatively large doses of CO, a small rebreathing volume, and at least four replicate measures of percent carboxyhemoglobin (%HbCO) (15). Others have generally used smaller doses of CO, and/or a larger rebreathing volume, and/or duplicate or single measures of %HbCO. We took this opportunity to compare the error of measurement of the “Burge and Skinner” (15) method (4–6, 15, 48–50, 122) with that of others (10, 39–41, 55, 63, 65, 117, 119, 120).

Methodological variations of the Evans blue technique could affect the magnitude of measurement error for V_{Plasma,Evans} and its related volumes. These variations are in terms of extraction (column chromatography) or not of Evans blue dye from plasma. There are also variants that do or do not back-extrapolate a multitime point disappearance curve. Some authors do not back-extrapolate but use the simple change in plasma dye concentration obtained from one pre- and one postdye injection specimen (with post being obtained 10–20 min afterwards) (11, 24, 47, 55). The use of just one postinjection sample is justified by the statement of Greenleaf et al. (52) that a single 10-min post-dye-injection specimen gives the same V_{Plasma} value as using the more rigorous back-extrapolation method, of which there are also several mathematical approaches (43) (Table 1). The three common method variations for V_{Plasma,Evans} in the data that we acquired were as follows: extracted/not back-extrapolated (11, 12, 24, 44, 47, 55), not extracted/not back-extrapolated (2, 16, 21, 33, 36, 37, 42, 43, 50, 77, 78, 86–91, 93, 94, 101, 102, 104, 107, 108, 111, 118, 132, 143, 149, 150), and not extracted/back-extrapolated (17, 146).

Classically, [Hb] has been determined by pipetting blood into Drabkin's reagent, to form cyanmethemoglobin that is measured spectrophotometrically at 540 nm (76). We distinguished between those studies that used the traditional, manual cyanmethemoglobin method to determine [Hb] and those that used fully automated analyzers.

There are no substantial variants of the CO-based method for estimation of V_{RBC,CO}. Several altitude studies from which we obtained data (55, 98, 113, 119, 151) calculated the circulatory volumes by combining M_{Hb} with [Hb] or Hct (Table 1: secondary equations for CO); other studies (5, 40, 64, 117, 122, 127) reported only M_{Hb,CO} (15) (Table 1, primary method equation for CO).

### Calculation of Error of Measurement

Studies were included in the meta-analysis if there were at least two assays of V_{RBC}, V_{Blood}, V_{Plasma}, or M_{Hb} in at least three individuals receiving the same experimental treatment. The error of measurement for each blood-related measure was calculated for pairs of assays as follows (67): after log-transformation, the standard deviation of the differences was calculated, and the result was divided by √2 and back-transformed to a CV, which was recorded along with its degrees of freedom (df = sample size − 1). These calculations were performed in a spreadsheet designed for analysis of crossovers (68). Each estimate of error was inflated by a factor 1 + 1/(4 df) to correct for small-sample bias (56).
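The calculation described above can be sketched as follows. This is a minimal illustration, not the spreadsheet used in the study: the function name `typical_error_cv` is hypothetical, the back-transformation 100 × (e^{s} − 1) is one common convention, and applying the small-sample factor to the back-transformed CV (rather than to the log-scale value) is an assumption.

```python
import math
from statistics import stdev


def typical_error_cv(trial1, trial2):
    """Return (CV in percent, df) for paired assays on the same subjects.

    Hypothetical helper: illustrates the log-transform approach only.
    """
    if len(trial1) != len(trial2) or len(trial1) < 3:
        raise ValueError("need paired values from at least three subjects")

    # Differences between the log-transformed assay values of each subject.
    diffs = [math.log(b) - math.log(a) for a, b in zip(trial1, trial2)]
    df = len(diffs) - 1

    # Typical error on the log scale: SD of the differences divided by sqrt(2).
    te_log = stdev(diffs) / math.sqrt(2)

    # Back-transform to a coefficient of variation in percent (assumed convention).
    cv = 100 * (math.exp(te_log) - 1)

    # Inflate by 1 + 1/(4 df) to correct for small-sample bias
    # (assumption: the factor is applied to the back-transformed CV).
    cv *= 1 + 1 / (4 * df)
    return cv, df
```

A uniform shift between trials (for example, every subject rising by 10%) changes the mean but leaves the differences identical, so the function returns an error of essentially zero; only scatter in the individual changes contributes to the CV.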

In studies in which more than two assays were performed per subject, one assay was chosen as the reference, and the error of measurement and change in the mean were calculated for each of the other assays paired against this reference. Where two or more baseline assays were obtained before an experimental treatment, the reference was the assay closest in time to the experimental treatment. Time between pairs of assays was recorded in days.

The values of individual subjects giving rise to an estimate of error of measurement were examined for outliers in a plot of the subjects' change score between assays against the subjects' value of the reference assay. The following outliers were excluded from the error of measurement calculation: a subject with a 40% decline in V_{Plasma,CO} in Poulsen et al. (111), and a subject with ∼40% changes in several Evans blue measures in Grover et al. (55). Individual values for the residuals and random effects in the meta-analyses (see below) were also displayed graphically for identification of other possible outliers (*t* values ∼3 or more), but none was observed. These analyses, therefore, were performed with essentially no subjective filtering of data, and any apparently large errors of measurement obtained in our analysis are not attributable to the presence of outliers.

### Coding of Predictor Variables

#### Experimental treatments.

The treatment that occurred between each pair of assays was coded either as “altitude,” “none,” or “other.” Treatments were coded as altitude for assays separated by a period of real or simulated altitude of 600–7,000 m, for exposures of 0.5–24 h/day, for durations of 0.3–40 days. Altitude treatments that included any time spent at or near sea level before, during, and/or after the altitude exposure were coded as altitude. Treatments were coded as none for pairs of assays from reliability studies, pairs of assays during a baseline period in experimental studies, and pairs of assays in any control group receiving no intervention that would be expected to change V_{Blood} parameters. Treatments coded as other were as follows: training, ingestion of propranolol during altitude exposure, effects of menstrual cycle and/or oral contraceptive pill or placebo in combination with altitude, feeding before and after Evans blue injection, bed rest with head-down tilt and/or supine cycle ergometer training, acute exercise, living 12.8 m below the ocean surface, spaceflight, weight loss before rowing races, V_{Plasma} expansion, heat exposure, iron supplementation, and venesection.

#### Subject characteristics.

Sex and fitness were included in some meta-analyses. Sex was coded as a covariate representing the proportion of men in the sample (examples: 0 for all women; 0.375 for 5 women and 3 men; 1 for all men). Two studies using the ^{51}Cr method (114, 144) and one study using the Evans blue method (143) did not state the sex of the subjects. For these studies, we assigned a sex covariate value of 0.6, which was the mean value of all of the studies. Fitness was coded as “inactive” (sedentary), “active” (but not a competitive athlete), and “athlete” (competitive). Hospital inpatients [2 studies (30, 143)] were coded as “inactive.” Four studies using the Evans blue method (24, 43, 98, 111), two using the ^{51}Cr method (114, 144), and three using V_{RBC,CO} (25, 80, 98) did not indicate physical activity or fitness of the subjects and were coded as inactive.

#### Laboratories.

The effect of individual laboratories (or research groups) on error of measurement was modeled as a random effect in the meta-analysis. If a laboratory changed a particular technique substantially, a new identifier was assigned to the subsequent estimates.

### Meta-analysis

The details of our novel meta-analytic approach and its interpretation are contained in the appendix. In summary, we used a mixed-model meta-analysis in which the dependent variable was the log-transformed error of measurement, the fixed effects were the method of measurement and characteristics of the study, and the random effects were within- and between-study variation. An unbiased weighting factor for each estimate was derived from the estimate's df. Inferences about the substantiveness of true differences between two errors of measurement were made in accordance with extent of overlap of the confidence interval of their ratio with the thresholds for substantial ratios (0.9 and 1.1).
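The threshold-based inference can be illustrated with a simplified sketch. The full decision rules are in the appendix and are not reproduced here; the function name `classify_ratio` and the category labels are hypothetical, and the only elements taken from the text are the thresholds 0.9 and 1.1 and the idea of comparing the confidence limits of a ratio against them.

```python
def classify_ratio(lower, upper, low_threshold=0.9, high_threshold=1.1):
    """Classify a ratio of two errors from its confidence limits.

    Simplified, assumed rules; labels are illustrative only.
    """
    if lower > high_threshold:
        return "clearly higher"      # whole interval above 1.1
    if upper < low_threshold:
        return "clearly lower"       # whole interval below 0.9
    if lower < low_threshold and upper > high_threshold:
        return "unclear"             # interval spans both substantial thresholds
    if lower >= low_threshold and upper <= high_threshold:
        return "trivial"             # interval lies within 0.9 to 1.1
    if upper > high_threshold:
        return "possibly higher"     # could be trivial or substantially higher
    return "possibly lower"          # could be trivial or substantially lower


# Confidence limits of 0.5 to 1.3 span both thresholds.
print(classify_ratio(0.5, 1.3))  # unclear
```

Under these assumed rules, a ratio with limits of 0.3 to 0.8 is classified as clearly lower, and one with limits of 0.9 to 1.4 as possibly higher.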

## RESULTS

### Raw Data

The numbers of estimates of error of measurement for each blood parameter were as follows: 69 M_{Hb,CO}, 32 V_{Blood,Evans}, 64 V_{Plasma,Evans}, 29 V_{RBC,CO}, 9 V_{RBC,51Cr}, 69 V_{RBC,Evans}, 83 [Hb], and 57 Hct. The mean df for each parameter ranged from 5 to 10. The number of estimates for each of the methods for measurement of V_{Plasma,Evans} were as follows: 19 extracted/not back-extrapolated, 41 not extracted/back-extrapolated, 3 not extracted/not back-extrapolated, and 1 unclear. Thirty-six of the estimates of error for M_{Hb,CO} used the method of Burge and Skinner, and 33 used the other method. Of the 83 estimates for [Hb], 52 used automated methods, and 31 used the cyanmethemoglobin method. The median time between measurements giving rise to the estimates of error was 22–30 days for M_{Hb,CO}, V_{Blood,Evans}, V_{Plasma,Evans}, V_{RBC,CO}, V_{RBC,Evans}, [Hb], and Hct, but only 1 day for V_{RBC,51Cr}. There were 159 estimates involving real or simulated altitude, 132 estimates involving other treatments, and 121 involving no treatment (from control groups, baseline pairs of measurements in treatment groups, or reliability studies). The median real or simulated altitude was 3,000 m (range 600–7,000 m), and median time spent at altitude was 12 days (0.3–40 days). The estimates were collected on 84 recreationally active, 165 athletic, and 161 inactive groups. For the entire data set, there was a greater proportion of men (60%) than women (40%).

### Four Comparable Hematology Parameters for the Red Cell Compartment

The measurement errors for M_{Hb,CO}, V_{RBC,CO}, V_{RBC,51Cr}, and V_{RBC,Evans} are displayed in Fig. 1 as a function of time between measurements. A 10-fold increase in time between measurements had a substantial effect on the error for M_{Hb,CO} (by a factor of 1.5; 90% confidence limits 1.2–1.9), trivial or small effects on the errors for V_{RBC,51Cr} (by a factor of 1.1; 1.0–1.2) and V_{RBC,Evans} (by a factor of 1.1; 0.9–1.2), and an unclear trivial effect on the error for V_{RBC,CO} (by a factor of 1.0; 0.6–1.6). The mean error of measurement predicted for 1 day between measurements and averaged over treatments ranged from 2.2 to 6.7% for the four methods (Table 2). M_{Hb,CO} had substantially less error than V_{RBC,51Cr} (ratio of errors of M_{Hb,CO} to V_{RBC,51Cr} = 0.8; 0.5–1.3), although the confidence limits indicate that the real difference was unclear. Errors for both methods were clearly less than (about one-third) those for V_{RBC,CO} and V_{RBC,Evans}. Estimates of the typical between-laboratory variations in the errors of measurement for M_{Hb,CO}, V_{RBC,CO}, and V_{RBC,Evans} had considerable uncertainty, but were substantial over the range of their confidence limits: observed values (and confidence limits) were ×/÷1.8 (1.5–3.0), ×/÷1.5 (1.3–4.2), and ×/÷1.6 (1.3–2.8), respectively. Thus, for example, a new user of the M_{Hb,CO} method could anticipate that, in their hands, the error of measurement could easily be consistently as high as 4.0% (= 2.2% × 1.8) or as low as 1.2% (= 2.2% ÷ 1.8). The between-laboratory variation for V_{RBC,51Cr} was not estimable, owing to a paucity of data.
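The arithmetic behind the ×/÷ factors and the time effect can be sketched as below. The function names are hypothetical, and the time scaling assumes that the effect of a 10-fold increase in time applies log-linearly (error multiplied by the factor raised to the power log10 of the number of days), which is our reading of the model rather than a formula stated in the text.

```python
import math


def lab_range(mean_error, factor):
    """Typical low and high errors for a laboratory, given a x/÷ factor."""
    return mean_error / factor, mean_error * factor


def error_at(days, error_1day, factor_per_10fold):
    """Predicted error after `days`, assuming a log-linear time effect."""
    return error_1day * factor_per_10fold ** math.log10(days)


# M_Hb,CO: 2.2% mean error at 1 day, between-laboratory factor of 1.8.
low, high = lab_range(2.2, 1.8)
print(round(low, 1), round(high, 1))      # 1.2 4.0

# M_Hb,CO: factor of 1.5 per 10-fold increase in time, evaluated at 30 days.
print(round(error_at(30, 2.2, 1.5), 1))   # 4.0
```

Because the factors act multiplicatively, the range around the mean is symmetric on a log scale rather than on the raw percentage scale.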

The clearly smaller magnitude of measurement errors for M_{Hb,CO} and V_{RBC,51Cr} implies that these blood measures are better than V_{RBC,CO} and V_{RBC,Evans} for quantifying effects of subject characteristics and experimental treatment on error of measurement. Only M_{Hb,CO} had enough estimates of error for such an analysis. The effect of a 10-fold increase in time between measurements on the error of measurement differed substantially among the three treatments (factors of 1.3, 1.1, and 1.5 for none, altitude, and other, respectively), but these differences were unclear, owing to large uncertainty in their ratios. The other effects on error for M_{Hb,CO} were, therefore, estimated under the assumption that the effect of a 10-fold increase in time was the same for all three treatments; this effect was reasonably clear (a factor of 1.3; confidence limits 0.9–1.8), albeit somewhat less than that provided by the simpler model above. The observed error of M_{Hb,CO} for women was slightly greater than that for men (ratio 1.1; 0.9–1.4), indicating that women might have substantially greater error than men but were very unlikely to have less error. Athletes were likely to have substantially greater error of measurement than active subjects (ratio 1.4; 1.0–1.9), but the greater error for athletes compared with inactive subjects was unclear, owing to a wide confidence interval for the comparison (ratio 1.5; 0.8–2.8). The active/inactive subjects comparison was also unclear (ratio 1.1; 0.6–2.0). The mean measurement error of the Burge and Skinner method of estimating M_{Hb,CO} (1.7%; 1.0–2.8%) was less than one-half that of the other variant (3.9%; 2.4–6.4%); the difference was definitive (ratio of Burge and Skinner/other 0.4; 0.3–0.8). There was little observed effect of treatment on error (pairwise ratios for the three treatments all 1.0), but confidence limits for the ratios (all ×/÷1.2) allowed for the possibility of small, true differences.

### V_{Blood} Compartments with Evans Blue

The mean error of measurement for V_{Blood,Evans}, V_{Plasma,Evans}, and V_{RBC,Evans} was ∼5–7% (Table 3), and there were no clear differences between the ratios of their errors. However, the variation in error between laboratories for these measures was substantial: ×/÷1.6 (1.4–2.0). A 10-fold increase in the time between measurements did not substantially increase the errors in V_{Blood,Evans}, V_{Plasma,Evans}, and V_{RBC,Evans}; the ratio for the pooled data of all three volumes was ×/÷1.0 (0.9–1.1). There was little difference between errors for the pooled treatments (6.8%; 5.6–8.4%), altitude treatments (6.3%; 5.1–7.7%), and no treatment (6.0%; 4.9–7.3%), although confidence limits allowed for small differences between some treatment groups (range of confidence limits for ratios: 0.8–1.2).

The mean errors of measurement for V_{Plasma,Evans}, determined with or without extraction and/or back-extrapolation, differed substantially but not conclusively, owing to wide confidence limits. The errors (and confidence limits) were 4.8% (2.3–10%) for not extracted/not back-extrapolated, 5.7% (3.4–9.6%) for extracted/not back-extrapolated, and 6.7% (4.5–10%) for not extracted/back-extrapolated. The estimate of the typical between-laboratory variation in the errors of measurement was substantial: ×/÷1.6 (1.4–2.9).

### V_{Plasma} with ^{131}I

In the only study providing usable data for the ^{131}I method, the subjects were five female and eight male patients, and there was no treatment between the two trials, which were separated by 150 min. The error of measurement for V_{Plasma} was 4.9% (confidence limits 3.7–7.0%).

### [Hb] and Hct

The mean errors of measurement for [Hb] were 2.5% (confidence limits 2.3–2.7%) and 2.1% (1.8–2.5%) for the automated and cyanmethemoglobin methods, respectively, whereas that for Hct was 3.2% (2.6–4.0%). The variation in error between laboratories for [Hb] was trivial (×/÷1.1), and its upper confidence limit was only marginally substantial (×/÷1.2). On the other hand, the variation in error between laboratories for Hct was substantial, although small: ×/÷1.2 (1.1–2.8). There was little difference between men and women for the error in [Hb] (ratio men/women 1.1; 1.0–1.3) and Hct (ratio 1.0; 0.8–1.2). A 10-fold increase in time between measurements was associated with a decisive substantial increase in the error for [Hb] (ratio 1.3; 1.3–1.4); in contrast, the error in Hct was, at most, only marginally larger (ratio 1.1; 1.0–1.2) for a 10-fold increase in time.

## DISCUSSION

The most important results in the present study are the meta-analytic estimates of error of measurement of V_{RBC} and M_{Hb}, the blood parameters directly related to oxygen transport. The short-term errors for V_{RBC,51Cr} and M_{Hb,CO} were ∼2.5%, whereas those for V_{RBC,Evans} or V_{RBC,CO} were about threefold greater. Over a period of 1 mo, the errors for V_{RBC,51Cr} and M_{Hb,CO} were ∼3.5%, about one-half of those for V_{RBC,Evans} or V_{RBC,CO}. The errors of measurement for M_{Hb,CO}, V_{RBC,CO}, and V_{RBC,Evans} also showed wide variation between laboratories, typically by a factor of approximately ×/÷1.6. Thus a poor laboratory assessing M_{Hb,CO} and a good laboratory assessing V_{RBC,Evans} could have similar errors of measurement (∼4%) and obtain similar precision in the estimates of effects on M_{Hb} and red cell volume with a given sample size, but an even greater disparity between the two methods is also possible. Unfortunately, we were unable to estimate whether V_{RBC,51Cr} shows substantial variation from laboratory to laboratory, owing to a paucity of data.

The substantial increase in the error for M_{Hb,CO} with increasing time between measurements can be accounted for partly by an increase in the contribution of biological variation. The error of measurement consists of biological variation and analytic error that are independent and combine as variances: (error of measurement)^{2} = (biological variation)^{2} + (analytic error)^{2}. If we assume that the 1-day error for M_{Hb,CO} (2.2%) comprises minimal biological variation, the 30-day error (∼4%) indicates additional error of 3.3% during this interval, irrespective of treatment. The sources of technical error (Table 1) should be independent of time between measurements, so this additional error appears to be entirely biological. An intriguing possibility is that it arises from cyclical variation in hematopoiesis with a period of approximately weeks, similar to but of lesser magnitude than that described in some hematological disorders (62). On the other hand, the additional error could be an artifact of studies of longer duration being conducted by laboratories with poorer measurement error. Frequent serial measurements of M_{Hb} and other blood parameters over several months should resolve this issue.
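The subtraction of variances used above can be checked with a short sketch. The function name `additional_error` is hypothetical; the calculation assumes, as the text does, that the two components are independent and that the 1-day error contains negligible biological variation.

```python
import math


def additional_error(total_error, baseline_error):
    """Error component that, combined in quadrature with the baseline,
    gives the total: sqrt(total^2 - baseline^2)."""
    if total_error < baseline_error:
        raise ValueError("total error must be at least the baseline error")
    return math.sqrt(total_error ** 2 - baseline_error ** 2)


# 30-day error of ~4% versus 1-day error of 2.2% for M_Hb,CO.
print(round(additional_error(4.0, 2.2), 1))  # 3.3
```

Because the components add as variances, the additional error (3.3%) is larger than the simple difference between the 30-day and 1-day errors (1.8%).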

The measurement errors of V_{RBC,CO}, V_{Blood,Evans}, V_{Plasma,Evans}, and V_{RBC,Evans} did not increase substantially with increasing time between measures, in contrast to that of M_{Hb,CO}. This finding is likely attributable to large measurement error swamping any biological variation. For example, the measurement error for V_{Plasma,Evans} was 6% after 1 day, which, combined with biological variation of ≤3% (as described above), results in a total error of 6.7% [= √(6^{2} + 3^{2})] for 30 days between measures.

### Sources of Error

Sources of measurement error of the common blood measurement methods are summarized in Table 1. Several of these sources of method-specific measurement error are discussed below in relation to the results obtained in the meta-analysis.

#### M_{Hb,CO}.

The major contributions to test-retest random analytic error for the CO rebreathing method include gas leaks in the mouthpiece, noseclip, and rebreathing system, inadequate CO dose, and using a rebreathing bag with excessive volume (15). The importance of careful attention to these sources of error is reinforced by the result that, in our hands, the Burge and Skinner method has less than one-half the error of the equivalent method used by others. An adequate dose of CO becomes particularly important if estimating %HbCO levels with commercial CO-oximeters that display readings only to a single decimal place (usually ±0.1%) (15). With progressively lower doses of CO, a 0.1% difference in the %HbCO is associated with a substantial increase in the measurement error of M_{Hb,CO} (15); for example, doses of 75 and 25 ml produce errors of 1.3 and 4.1%, respectively, for a woman with 600 g of hemoglobin. Investigators can perform replicate measurements with each specimen to improve the single-replicate precision (15, 18). We have successfully used a CO dose of 1.25–1.5 ml/kg in athletes with low body mass index to induce a pre- to postdose change in %HbCO of ∼6.5% and performed at least five replicate measures of %HbCO (OSM-3, Radiometer, Copenhagen, Denmark) on each blood sample (50). In clinical situations, a CO dose of 1 ml/kg at sea level (aiming to increase %HbCO by ∼6%) to a maximum dose of 100 ml is adequate, with appropriate dose reduction for patients with significant anemia and/or morbid obesity. Use of high-precision gas chromatography to measure %HbCO (23) allows much smaller doses of CO to be administered but requires well-trained technical staff. On the other hand, well-maintained CO-oximeters require minimal staff training and are routinely found in major hospital emergency departments and intensive care units and now increasingly in exercise physiology laboratories. 
The increased CO dose required when using CO-oximetry to obtain adequate precision makes no substantial difference to the safety of the method (15).

Our meta-analysis indicates that the measurement error of M_{Hb,CO} is small in most laboratories. Nevertheless, the CO method has sustained criticism of its validity, particularly the claim that inspired CO leaves the vascular space and binds to extravascular porphyrin moieties, such as myoglobin (73, 115, 126), leading to overestimation of M_{Hb,CO} (84, 100, 123, 126). However, at relatively low HbCO saturation (<15% HbCO) and at the elevated blood oxygen tension induced in the CO method (15, 48), HbCO remains stable (9), and oxygen rather than CO is taken up by myoglobin (19, 95). Observations on the effects of ischemia on deoxymyoglobin in resting muscle in the presence of 20% HbCO using magnetic resonance spectroscopy (116) support our earlier findings (15) that measures of M_{Hb,CO} do not need to be corrected for loss of CO to myoglobin. Mathematical modeling of CO uptake during 40 min of rebreathing (14) indicates that loss of CO to extravascular sites is, at most, a negligible ∼1% of the CO dose over 10 min. Thus concerns of poor validity of M_{Hb} with CO rebreathing appear to be unwarranted. Additionally, the low rate of extravascular loss of CO appears to be consistent within an individual and, therefore, will have little or no substantial effect on the random error of ∼2%.

#### V_{RBC,51Cr}.

V_{RBC,51Cr} is the defined criterion method of the International Committee for Standardization in Haematology (75). Our meta-analysis confirms the low measurement error of V_{RBC,51Cr}, as concluded 25 years ago (75) and more recently (31, 32). A common misconception, however, is that the ^{51}Cr method primarily measures V_{RBC}. The primary estimate is actually V_{Blood,51Cr} (32), because counts in a whole blood specimen are compared with the assay reference counts. V_{RBC,51Cr} must then be secondarily derived from V_{Blood,51Cr} using the Hct (Table 1). It is difficult, and potentially unethical, to perform multiple ^{51}Cr studies on individuals simply for the sake of establishing the method measurement error. With multiple estimations in the same subject, residual radioactivity from previous estimations increases measurement error, unless compensated by progressively larger doses of ^{51}Cr (0.5–4 MBq for one estimate, 4–20 MBq after three consecutive estimates within ∼28 days). The radiation exposure is thus a concern, especially if the studies are conducted in healthy people for nondiagnostic purposes. Hence, use of the ^{51}Cr method is rare. A variation of the chromium-labeling of red blood cells that may warrant further investigation is the use of the nonradioactive isotope ^{53}Cr (142).

Fairbanks and colleagues (32) simulated best case scenario sources of variability affecting the measurement error of V_{RBC,51Cr} as follows: whole blood ^{51}Cr dose (∼8,000 counts: pipetting error 1%, blood resuspension and reinjection errors 2.0%), whole blood specimen (pipetting error 1.0%, mixing error 1.0%), scintillation count error (1.1%), and Hct (1.7% biological within-person variability, 1.0% analytic variability), giving an overall estimate of the measurement error of V_{RBC,51Cr} of 3.4%. Scintillation count errors of 1.5% (100) and 1.6% (148) have also been reported. The measurement error obtained in our meta-analysis for V_{RBC,51Cr} was of similar magnitude to that from Fairbanks' modeling (Table 2).

In our meta-analysis, the mean measurement error for V_{RBC,51Cr} was substantially more than that for M_{Hb,CO}, but the confidence limits of the ratio of errors indicate that the true difference between the methods is unclear (Table 2). If the real difference is substantial, it could be due to the fact that V_{RBC,51Cr}, unlike M_{Hb,CO}, requires Hct for its estimation (Table 1). An interesting anomaly obtained in our results, however, was that the measurement error for V_{RBC,51Cr} is somewhat less than our estimate of the measurement error for Hct, which is not possible in reality (31). Our result can be explained as follows: the studies that we meta-analyzed for V_{RBC,51Cr} all had measurement errors for Hct < 3%, which is near the lower 90% confidence limit for Hct. Our results are consistent with the model of Fairbanks and associates (32) that the random, biological variation in Hct from day to day is the major source of error in V_{RBC,51Cr} (31).

#### V_{RBC,Evans}.

The primary estimate of the Evans blue method is V_{Plasma,Evans}, and V_{RBC,Evans} must be derived subsequently with the Hct (Table 1). It follows that the measurement errors for V_{RBC,Evans} consist of those for V_{Plasma,Evans} (6.0%) and Hct (3.2%) that combine as variances: (error in V_{RBC,Evans})^{2} = (error in V_{Plasma,Evans})^{2} + (error in Hct)^{2}. Thus error in V_{RBC,Evans} = √(6.0^{2} + 3.2^{2}) = 6.8%, which is consistent with our meta-analytic estimate of 6.7%. It is, therefore, clear that most of the error in V_{RBC,Evans} derives from the measurement of V_{Plasma,Evans}.

The existence of a wide variety of Evans blue methods is suggestive of attempts by researchers to improve method reliability. However, errors of V_{Plasma,Evans} were not decisively different, whether a dye-extraction procedure was used (11, 12, 47, 55) or not used (43, 90, 94, 104, 108, 118), and whether (43, 88, 93, 111, 118, 143) or not (11, 12, 17, 24, 55) multiple data points were back-extrapolated to obtain the volume of distribution of Evans blue. Consequently, our analysis does not support use of one method variant over another. Analysis of multiple replicates of specimens improves analytic precision, and it is noteworthy that Gordon et al. (47), who achieved one of the lowest measurement errors for V_{Plasma,Evans}, routinely performed triplicate spectrophotometric assays of Evans blue dye concentration. Among the other studies from which we acquired data for meta-analysis, only Branch et al. (11, 12), Chien et al. (17), and Levine et al. (personal communication) reported at least duplicate spectrophotometric assays. Different numbers of assayed replicates likely contribute to some of the variation in measurement error between laboratories, but it is unclear whether the use of multiple assays would be sufficient to make the error for V_{Plasma,Evans} comparable to that of M_{Hb,CO}.

The Evans blue method has sustained criticism about its validity, particularly that variable loss of the dye from the vascular space causes overestimation of V_{Plasma} as well as a relatively high measurement error (8, 29, 54). Evans blue is primarily an estimate of the albumin space (112), but the dye also binds to globulins (85), fibrinogen (8), and connective tissue (129). The albumin space includes variable flux among the vascular, interstitial, and lymphatic spaces (8, 34); postinjection disappearance curves are consistent with a rapid mixing phase and a slow disappearance phase (143), as initially described by Gibson and Evans in 1937 (43). The rates of decline in the two phases vary between individuals (96), and substantial change within an individual between two trials would increase measurement error, especially if there were changes in extravascular flux due to increased vascular permeability. Capillary permeability to albumin increases during acute exposure to high altitude (60, 61, 92, 152) but appears to be an inconsistent response (81, 82) at the moderate altitudes used by athletes. Serum vascular endothelial growth factor [VEGF; also known as vascular permeability factor (20, 130, 131)] increases substantially at moderate altitude (3), secondary to increased oxygen-regulated gene expression (147) mediated by hypoxia inducible factor-1 (26). An increase in VEGF provides a possible means by which vascular permeability is increased at moderate altitude, and its mechanism of action appears related to disruption of endothelial tight junctions (145). The chronic increase in vascular permeability resulting from an increase in VEGF persists for at least 24 h, but not 72 h (7). Consequently, Evans blue estimates of V_{Plasma} conducted at altitude, or 1 or 2 days after return to sea level, may be affected by an enhanced rate of loss of dye. 
Our analysis indicates little difference in measurement error for Evans blue volumes with various treatments, but additional error arising from a small and variable change in vascular permeability at or after altitude may have been overwhelmed by the noise of the Evans blue method.

#### V_{RBC,CO}.

The meta-analysis demonstrated that there was threefold greater 1-day error for V_{RBC,CO} than for M_{Hb,CO}. Part of this difference arises from the contributions that [Hb] and Hct make, along with the contribution of M_{Hb}, to the error in V_{RBC,CO} (Table 1). If we assume little or no biological variation in M_{Hb}, [Hb], and Hct over 1 day, the errors in these three measures are methodological and, therefore, independent. Thus (V_{RBC,CO} error)^{2} = (M_{Hb,CO} error)^{2} + (Hct error)^{2} + ([Hb] error)^{2}. In the meta-analysis, the 1-day error was 2.2% for M_{Hb,CO}, 3.2% for Hct, and 2.5% for [Hb] via the cyanmethemoglobin method. The expected error in V_{RBC,CO} is, therefore, √(2.2^{2} + 3.2^{2} + 2.5^{2}) = 4.6%, which is still substantially less than the meta-analyzed value of 6.7%. The simple explanation for this difference is that errors in V_{RBC,CO} came from studies with a higher than average error for M_{Hb}, which is certainly the case for the studies that provided estimates for both of these measures (55, 119, 120).
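The variance-addition arithmetic used for the derived red cell volumes can be checked with a short sketch. The function name `combine_errors` is hypothetical; the input values are the 1-day meta-analytic errors quoted in the text. Adding CVs in quadrature assumes the component errors are independent and small.

```python
import math

def combine_errors(*cvs_percent):
    """Combine independent errors (CVs, in percent) by adding their variances."""
    return math.sqrt(sum(cv ** 2 for cv in cvs_percent))

# 1-day errors quoted in the text (percent)
evans = combine_errors(6.0, 3.2)      # V_Plasma,Evans and Hct
co = combine_errors(2.2, 3.2, 2.5)    # M_Hb,CO, Hct, and [Hb]

print(f"Expected V_RBC,Evans error: {evans:.1f}%")
print(f"Expected V_RBC,CO error:    {co:.1f}%")
```

The first value sits close to the meta-analyzed 6.7% for V_{RBC,Evans}, whereas the second falls well short of the meta-analyzed 6.7% for V_{RBC,CO}, which is the gap the text attributes to laboratories with above-average error in M_{Hb,CO}.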

Our estimate of total error for Hct agrees well with Thirup's (140) value of 4.2% for day-to-day variation in centrifuged micro-Hct, which he estimated as consisting of 3% biological variation and 3% analytic variation. Analytic error has been reported to be 2.3% for automated Hct and 0.8% for [Hb] (38). Using the lowest values from above (2.2% for M_{Hb,CO}, 2.3% for Hct, and 0.8% for [Hb]), the likely minimal measurement error for V_{RBC,CO} with 1 day between measures = √(2.2^{2} + 2.3^{2} + 0.8^{2}) = 3.3%. This indicates that, in careful hands, the total error for V_{RBC,CO} should approach that of M_{Hb,CO}. However, the additional propagated error in V_{RBC,CO} from the measurement of Hct and [Hb] does little to support the former method if one needs to monitor small changes in the red cell compartment.

### Implications for Monitoring Individuals

What are the implications of measurement error for a clinician's uncertainty in the assessment of an individual? Applying statistical first principles, the observed value of a measurement plus or minus the error of measurement is 68% confidence limits for the true value of a normally distributed single measurement, or 52% confidence limits for a change between two such measurements; the observed value plus or minus twice the measurement error is, respectively, 95 and 84% confidence limits (66). For the clinician to be reasonably confident that an observed small but substantial positive value is not, in reality, substantially negative, it follows that the error of measurement has to be less than the least clinically important difference (66, 69, 70). In other words, to measure a signal confidently, the noise must be less than the signal (67). A measure with an error of 7% is, therefore, useful only for characterizing differences >7%. Differences in V_{Blood} of this magnitude are probably the least clinically important difference in most clinical situations, when one considers that a healthy individual can donate ∼500 ml of blood from a total volume of ∼5 liters with little risk. Evans blue used carefully would, therefore, be suitable in many clinical settings, as an alternative to ^{131}I or ^{125}I (32). Clinicians should, nevertheless, be aware that the error of measurement with Evans blue will sometimes produce unrealistically large differences in V_{Blood}. Discounting such differences to an extent guided by clinical experience is sensible and an appropriate application of Bayesian reasoning (22, 70).
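The confidence levels quoted above follow from the normal distribution, as the following sketch illustrates. The function name `coverage` is hypothetical. The sketch assumes normally distributed, independent errors, so that a change score (the difference between two measurements) has a standard deviation of √2 times the error of measurement.

```python
from statistics import NormalDist

def coverage(multiple_of_error, change=False):
    """Confidence level for: observed value +/- (multiple x error of measurement)."""
    # A change score is a difference of two measurements: its SD is sqrt(2) x error
    sd_in_error_units = 2 ** 0.5 if change else 1.0
    z = multiple_of_error / sd_in_error_units
    return 2 * NormalDist().cdf(z) - 1

for k in (1, 2):
    single = round(100 * coverage(k))
    delta = round(100 * coverage(k, change=True))
    print(f"+/- {k} x error: {single}% (single measurement), {delta}% (change)")
```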

### Implications for Controlled Trials

Adequacy of sample size is one of the most important issues in quantitative studies. With the traditional requirement of 80% chance of statistical significance at the 5% level for the smallest clinically important difference, statistical first principles show that the sample size of each group in a randomized controlled trial needs to be 32 e^{2}/d^{2}, where d is the smallest difference and e is the standard error of measurement. In studies of athletes, a 2% increase in M_{Hb} or V_{RBC} would be important, because it would likely produce a useful change in endurance performance. A researcher interested in detecting such an increase in M_{Hb} using the CO method with an error of measurement of ∼2% would, therefore, need 32 subjects in each of the control and experimental groups. Samples in the studies that we reviewed were typically less than one-half this size. Detecting the same change in V_{RBC} using an Evans blue method with an error of ∼7% would require an unprecedented nine times as many subjects in each group. Less conservative approaches to sample size estimation based on adequate precision of estimation would make the usual sample sizes adequate for characterizing the smallest changes in M_{Hb,CO}, but these sample sizes would be adequate for changes in V_{RBC,Evans} only when the changes are greater than the error of measurement for V_{RBC,Evans}: ∼7% for the typical laboratory.
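A minimal sketch of the sample-size arithmetic, assuming group size scales with the square of the error of measurement relative to the smallest important change (n = 32 e^{2}/d^{2}). The function name `n_per_group` is hypothetical, and the errors of 2.2% and 6.7% used for the comparison are the meta-analyzed means for M_{Hb,CO} and V_{RBC,Evans}.

```python
import math

def n_per_group(error, smallest_change):
    """Subjects per group for 80% power at the 5% level: 32 * e^2 / d^2."""
    return math.ceil(32 * error ** 2 / smallest_change ** 2)

# Error of ~2% and a smallest important change of 2%
print(n_per_group(2.0, 2.0))

# Relative sample size for V_RBC,Evans versus M_Hb,CO: the 32 and d cancel,
# leaving the squared ratio of the two errors
relative_n = (6.7 / 2.2) ** 2
print(f"{relative_n:.1f} times as many subjects")
```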

In our view, mean changes in V_{RBC} in excess of 7% are likely only following radical interventions, such as exposure to high altitude, administration of large doses of erythropoietin, and direct substantial manipulation of red cell volume by venesection or blood transfusion. There are, nevertheless, claims in the literature for even larger changes in V_{RBC,Evans} following less severe interventions, for example 4 wk of training at sea level (88) and 4 wk living and training at moderate altitude (16, 88). The fact that the findings were not consistent in similar studies of V_{RBC,CO} at higher altitude (55, 113, 151) points to type I error as an alternative explanation to such large physiological effects. Publication and other biases, such as deleting apparent outliers before analysis (e.g., a physiologically improbable large negative change in V_{RBC}), would make the rate of reporting of such large effects substantially greater than the usual type I error rate of 5%. It might help to limit such biases if researchers routinely reported the magnitude of the error of measurement in their studies, not only from any reliability study but also from the data in the control and experimental groups.

### Methods with the Smallest Error

For researchers and clinicians seeking to improve their measurement error, we have identified the studies that demonstrated the smallest measurement errors for each of the four methods illustrated in Fig. 1.

#### M_{Hb,CO}.

The lowest measurement error of 0.9% [90% confidence limits (CL), 0.5–1.5%] was obtained by Burge and Skinner (15).

#### V_{RBC,51Cr}.

The smallest measurement error of 1.4% (90% CL = 0.6–2.9%) was obtained by Johnson et al. (79).

#### V_{RBC,Evans}.

Although the lowest estimates were ∼2%, these were derived from studies of only three to four subjects [Faura and Reynafarje (33), and a subset of the 1992–2002 data from Levine and coworkers (16, 42, 86–88, 150)]. The 90% uncertainties in these estimates are, therefore, a factor of approximately ×/÷2.5. Allowing also for regression to the mean with extreme values when there is such large sampling error, it is likely that the true error in these studies was substantially greater.

#### V_{RBC,CO}.

This method would also have acceptable error in a laboratory with a good technique for M_{Hb,CO} and good reliability for Hct and [Hb]. The lowest errors (∼2–4%) were obtained by Myhre et al. (99), although the small sample size (4–5 subjects) implies considerable uncertainty in these estimates.

Finally, it is critical that, for all measures reliant on Hct or [Hb] to derive V_{RBC}, V_{Plasma}, or V_{Blood}, subjects adopt a consistent posture for at least 20 min before blood sampling (15). Venous stasis of any duration should also be avoided, because it introduces substantial and unquantifiable error due to regionally increased hydrostatic pressure and localized hemoconcentration.

In conclusion, M_{Hb,CO} has error of measurement similar to that of V_{RBC,51Cr}, which is considered the gold standard (32, 75, 106). Both V_{RBC,51Cr} and M_{Hb,CO} have only about one-third of the error of V_{RBC,Evans} or V_{RBC,CO} and thus are more suitable to monitor small changes in red cell volume and M_{Hb}, respectively. Given the relative ease of handling CO compared with ^{51}Cr, the fact that M_{Hb,CO} is independent of biological variation in Hct, and the shorter biological half-life of CO (46), this review supports the routine use of CO rebreathing in clinical as well as research situations to monitor changes. Our results also reinforce the importance of researchers estimating and reporting the error of measurement of the method in their hands to improve the analysis and interpretation of their data.

## APPENDIX: DETAILS OF THE META-ANALYTIC APPROACH AND ITS INTERPRETATION

### Meta-Analysis

The main outcome from a meta-analysis is a weighted mean of values of the statistic of interest (measurement error in this instance) from the various studies, where the weighting factor is the inverse of the square of the sampling standard error of the statistic. We performed mixed-linear model meta-analyses, where effects of blood measures, treatment, and subject characteristics on the estimates were estimated as fixed effects, and the remaining unexplained true variation (heterogeneity) within and between studies was estimated as one or more random effects. Mixed-model meta-analysis is more realistic than traditional meta-analysis (in which “outlier” estimates are progressively eliminated until the test of heterogeneity is no longer statistically significant).
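The weighting described above can be illustrated with a simple inverse-variance weighted mean. This is a sketch only: it omits the fixed and random effects of the mixed model, it works on the log scale described under Data transformation, and the function name and the three example estimates are hypothetical.

```python
import math

def weighted_mean_error(cvs_percent, log_standard_errors):
    """Inverse-variance weighted mean of errors of measurement (log scale)."""
    # Weight = inverse of the squared sampling standard error of each log estimate
    weights = [1 / se ** 2 for se in log_standard_errors]
    logs = [math.log(cv) for cv in cvs_percent]
    mean_log = sum(w * x for w, x in zip(weights, logs)) / sum(weights)
    return math.exp(mean_log)  # back-transform to a CV in percent

# Hypothetical estimates: CVs (%) and sampling standard errors of their logs
cvs = [2.0, 3.0, 2.5]
ses = [0.2, 0.4, 0.3]
print(f"Weighted mean error: {weighted_mean_error(cvs, ses):.2f}%")
```

Estimates with smaller sampling standard errors (typically those from larger samples) pull the mean toward their own value.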

#### Data transformation.

Meta-analysis of untransformed errors of measurement was not an option, because it would result potentially in negative values for means and confidence limits, which is not possible in reality. Analysis of the errors as variances was attractive for several reasons but suffers from the same problem. We, therefore, opted for log transformation, which was used successfully in an earlier meta-analysis of error of measurement (assessing reliability of power in physical performance tests) (72). Meta-analysis with mixed linear modeling provides confidence limits based on the assumption that random effects (including the residuals) are normally distributed and that any nonnormality in the individual observations is normalized in the outcome statistics by the central limit theorem. Gross departures from normality in the distribution of the individual observations are, therefore, to be avoided. Accordingly, we used simulation to check that the sampling distribution of a log-transformed standard deviation derived from small sample sizes (>3) had an acceptably near-normal distribution.

#### Weighting factor.

The sampling standard error for the log-transformed error of measurement was derived semi-empirically by initially assuming it was defined approximately by the 68.4% confidence limits of the standard error of measurement. These limits were derived from the usual formula involving the χ^{2} distribution and df, and then converted to a single times/divide factor (×/÷ factor) by taking the square root of the upper limit divided by the lower limit. The squared log of this ×/÷ factor should approximate the inverse of the weighting factor used in the meta-analysis. The accuracy of the inverse weighting factor was checked by comparing the average of its value computed for a standard deviation from each of 10,000 samples of a given size with the mean value of the variance of the log of the standard deviations of the same samples. The inverse weighting factor was found to be biased low for small sample sizes, and a correction factor of 1 + 1/[2(df + 1)] was found by trial and error. This factor corrected the bias to within 1% for df ≥ 2.
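The derivation of the weighting factor can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it handles only even df so that the χ^{2} distribution function has a closed form, it replaces the 10,000-sample simulation with the exact variance of the log of a standard deviation from normal data (trigamma(df/2)/4), and it applies the correction factor to the inverse weighting factor. All function names are hypothetical.

```python
import math

def chi2_cdf_even(x, df):
    """Chi-square CDF for even df (closed form: 1 - exp(-x/2) * partial sum)."""
    half, term, total = x / 2, 1.0, 1.0
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return 1 - math.exp(-half) * total

def chi2_quantile_even(p, df):
    """Invert the CDF by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf_even(hi, df) < p:
        hi *= 2
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def inverse_weight(df, level=0.684):
    """Squared log of the x/÷ factor from chi-square limits for an SD."""
    tail = (1 - level) / 2
    q_lo = chi2_quantile_even(tail, df)
    q_hi = chi2_quantile_even(1 - tail, df)
    # SD limits are s*sqrt(df/q_hi) and s*sqrt(df/q_lo); the observed s cancels
    factor = math.sqrt(math.sqrt(q_hi / q_lo))
    return math.log(factor) ** 2

def true_var_log_sd(df):
    """Exact variance of ln(SD) for normal data, even df: trigamma(df/2) / 4."""
    trigamma = math.pi ** 2 / 6 - sum(1 / k ** 2 for k in range(1, df // 2))
    return trigamma / 4

def correction(df):
    return 1 + 1 / (2 * (df + 1))

for df in (2, 4, 10):
    raw, exact = inverse_weight(df), true_var_log_sd(df)
    print(df, f"raw/exact = {raw / exact:.3f}",
          f"corrected/exact = {raw * correction(df) / exact:.3f}")
```

Ratios near 1 in the last column indicate that the correction removes most of the small-sample bias; the exact size of the residual bias depends on the confidence level assumed for the limits.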

#### Fixed effects.

The fixed-effects model varied with the blood measures under analysis as follows.

##### M_{Hb}, V_{RBC,Evans}, V_{RBC,CO}, AND V_{RBC,51Cr}.

For this main comparison of error of measures, a paucity of data for V_{RBC,51Cr} dictated a simple model, consisting of the interaction of measure (representing these four blood parameters) with treatment (two levels: none and other) and with log_{10} time (the base-10 logarithm of time between pairs of measurements). Sex and fitness level were excluded from the analysis because the subtle effects of these predictors were masked by the considerably larger errors of measurement for V_{RBC,Evans} and V_{RBC,CO}.

##### M_{Hb}.

A more complex fixed-effects model was used in the analysis of M_{Hb}, because of its low error of measurement (see results) and a relatively large amount of available data. The predictors were sex (proportion of men in the sample), fitness (three levels: athlete, active nonathlete, and inactive), treatment (three levels: none, altitude, and other), method (Burge and Skinner and other), and log_{10} time. The interaction of treatment with log_{10} time was included in a preliminary analysis.

##### EVANS BLUE VOLUMES.

The predictors were measure (three levels: V_{Blood,Evans}, V_{Plasma,Evans}, and V_{RBC,Evans}), treatment (three levels: none, altitude, and other), and log_{10} time.

##### EVANS BLUE METHOD VARIATIONS.

The error of measurement of V_{Plasma} was the dependent variable in a model where three Evans-blue method variations (three levels: extracted/not back-extrapolated, not extracted/not back-extrapolated, and not extracted/back-extrapolated) were coded as levels of measure. The other predictors were treatment (two levels: none and other) and log_{10} time.

##### [Hb] AND Hct.

In separate analyses for these parameters, the models were the same as for M_{Hb}, with the inclusion of method (two levels: automated and cyanmethemoglobin) for [Hb].

All meta-analyzed estimates of measurement error for a given blood parameter shown in the results are values predicted for 1 day between measurements and, where relevant, for equal contributions from each level of predictors in the model and for equal proportions of men and women. Comparisons of effects of different levels of individual predictors (e.g., no treatment vs. all other treatments) on measurement error are shown as ×/÷ factors, derived by back-transformation of the log-transformed measurement error. For example, the difference between the error of measurement (ratio of CV between measures) for M_{Hb,CO} and that for V_{RBC,Evans} is a ratio of 3.0 (Table 2). Ratios can be reinterpreted as percent differences, for example, a ratio of 1.25 represents 25% more error, but we have opted to present all differences as ratios to minimize potential confusion with the percent units of the error of measurement. Because the time between measurements was log_{10} transformed, its effect on measurement error is shown as a factor per 10-fold multiple of time.

#### Random effects.

For the main comparison of blood parameters and for the analyses of M_{Hb,CO}, [Hb], and Hct, the random effects coded for each measure were an identifier for the estimate (to provide the pure within-laboratory, between-estimate variation), an identifier for the laboratory technique (to provide pure between-laboratory variation), and the residual. Owing to a paucity of data for some Evans blue volumes and Evans blue methods, the between-laboratory random effect in each analysis was modeled to have the same magnitude for the three volumes and for the three method variations. The between-laboratory variance for [Hb] was negative and, therefore, set to zero; the confidence limits for the between-estimate variation were estimable only by allowing the lower confidence limit of the variance to be negative. In initial analyses, an identifier for each series of estimates of error coming from the same subjects was included as a random effect, but the paucity of data too often resulted in negative variance for one or more of the random effects. This identifier was, therefore, excluded from all analyses.

When the residual variance is scaled to unity (153), the standard deviation of the estimate identifier gives the typical variation of mean measurement error (CV) obtained for a method performed routinely in a specified laboratory (within-laboratory, between-estimate variation). The standard deviation of combined variances for the estimate and laboratory identifiers represents typical variation in mean method measurement error obtained between laboratories. The df for the combined variances was estimated by using the Satterthwaite (121) approximation and applied to the χ^{2} distribution to estimate confidence limits for the combined variance (in the same manner that confidence limits for the individual variances are estimated in the statistics program). The square root of the variances and their confidence limits were back-transformed to times/divide factors. As an example, a between-laboratory typical variation of ×/÷1.80 obtained for a blood measure with mean measurement error (CV) of 6.7% is interpreted as follows: researchers applying the method will obtain, on average, a true CV of 6.7%, but the true value for any given researcher could be typically between 12.1% (6.7% × 1.80) and 3.7% (6.7% ÷ 1.80). The uncertainty is actually wider, because the uncertainty represented by the confidence limits for the 6.7% has not been taken into account in this example. However, since the focus of the present study is primarily examination of the fixed effects, we have shown the random effects only to illustrate the generally wide variation in the errors of measurement between assays (within laboratories) and between laboratories. These estimates of error of measurement apply only to true values or to values from large sample sizes, because sampling variation will further inflate the variation between observed estimates of error of measurement when sample sizes are small (for example, by a factor of ×/÷1.52 for samples of 10 subjects).
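The two calculations in this paragraph can be sketched briefly. The Satterthwaite function below uses the standard formula for the df of a sum of independent variance estimates; its example inputs are hypothetical. The ×/÷ example reproduces the worked interpretation in the text. Function names are illustrative only.

```python
def satterthwaite_df(variances, dfs):
    """Approximate df for a sum of independent variance estimates."""
    total = sum(variances)
    return total ** 2 / sum(v ** 2 / d for v, d in zip(variances, dfs))

def typical_range(mean_cv, factor):
    """Typical lower and upper values implied by a x/÷ factor."""
    return mean_cv / factor, mean_cv * factor

# Hypothetical variance components (log scale) and their df
print(f"Combined df: {satterthwaite_df([0.10, 0.25], [12, 6]):.1f}")

# Worked example from the text: mean CV of 6.7% with x/÷1.80
low, high = typical_range(6.7, 1.80)
print(f"Typical true CV between {low:.1f}% and {high:.1f}%")
```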

#### SAS mixed-modeling procedure.

The meta-analyses were performed with the mixed-modeling procedure (Proc Mixed) in the Statistical Analysis System (version 8.2, SAS Institute, Cary, NC). Key elements of the code of the Proc Mixed step, adapted from Yang (153), are shown below for the analysis of the four related blood measures: M_{Hb,CO}, V_{RBC,Evans}, V_{RBC,CO}, and V_{RBC,51Cr}. The statements for estimating the fixed effects and for back-transforming the fixed and random effects have been omitted:

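```sas
/* covtest and cl request tests and confidence limits for the covariance
   parameters; alpha=0.1 gives 90% limits */
proc mixed covtest cl alpha=0.1;
  class ErrorID LabID Measure Treatment;
  /* weighting factor for each log-transformed estimate of error */
  weight inverr;
  /* fixed effects; ddfm=sat requests Satterthwaite degrees of freedom */
  model LnError=Measure*Treatment Measure*Log10Time/cl ddfm=sat outp=pred;
  /* between-estimate and between-laboratory variances, separately by measure */
  random ErrorID LabID/s group=Measure;
  /* starting values for the eight random-effect variances; the ninth
     parameter (residual) is held at 1 */
  parms (10)(10)(10)(10)(10)(10)(10)(10)(1)/hold=9;
```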

#### Statistical inferences.

We made inferences about population (true) values of statistics via precision of estimation (using confidence limits) rather than via hypothesis testing (using *P* values and statistical significance). We also used 90% rather than 95% confidence limits to discourage attempts to reinterpret the limits in terms of statistical significance at the 5% level (135). Furthermore, 90% confidence limits represent adequate precision for making probabilistic inferences, because the probabilities that the true value is less than the lower confidence limit or more than the upper confidence limit are both only 0.05, which we interpreted as very unlikely. Inferences about the substantiveness of true differences between two errors of measurement were made by interpreting the confidence limits of the ratio in relation to the thresholds for substantial ratios (0.9 and 1.1). Ratios >1.1 or <0.9 were considered substantial on the basis of the impact of error on sample size: in controlled trials, sample size is proportional to the square of the error of measurement (71), so an increase in error by a factor of 1.1 represents an increase in sample size by a factor of 1.21, or 21%. The difference between two errors was inferred to be clear, decisive, or conclusive in relation to one or other threshold, if the confidence interval for the ratio of the errors did not overlap the threshold. Thus one error would be clearly greater than another, if the entire confidence interval of their ratio was >1.1; one error would be possibly greater than another, if the confidence interval overlapped 1.1 but not 0.9; the difference in the errors would be clearly or decisively trivial, if the confidence interval was within or abutted 0.9 to 1.1; and the difference would be unclear or indecisive, if the confidence interval overlapped 0.9 and 1.1.
Overlap of a threshold that was slight relative to the width of the confidence interval was characterized by terms such as “marginal” and “unlikely,” whereas greater overlap was characterized by “possible” and “likely.”
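The decision rules for interpreting the confidence interval of a ratio of errors can be sketched as a small classifier. The function name and the example intervals are hypothetical, and the "less" outcomes are assumed by symmetry with the "greater" outcomes described in the text. The finer gradations ("marginal", "likely") are not modeled.

```python
def classify_ratio(lower, upper, low_threshold=0.9, high_threshold=1.1):
    """Interpret 90% confidence limits for a ratio of two errors of measurement."""
    overlaps_low = lower < low_threshold < upper
    overlaps_high = lower < high_threshold < upper
    if overlaps_low and overlaps_high:
        return "unclear"
    if lower >= high_threshold:
        return "clearly greater"
    if upper <= low_threshold:
        return "clearly less"          # assumed by symmetry
    if overlaps_high:
        return "possibly greater"
    if overlaps_low:
        return "possibly less"         # assumed by symmetry
    return "clearly trivial"           # interval within or abutting 0.9 to 1.1

# Hypothetical confidence intervals for ratios of errors
for limits in [(2.0, 4.5), (0.95, 1.6), (0.92, 1.08), (0.7, 1.4)]:
    print(limits, "->", classify_ratio(*limits))
```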

## Acknowledgments

We thank all those who contributed de-identified raw data for us to meta-analyze. These include Clarence Alfrey, Michael Ashenden, David Branch, Kevin Davy, Birgit Friedmann, Christopher Gordon, Robert Grover, Katja Heinicke, Benjamin Levine, Jack Loeppky, Jack Reeves, Paul Robach, Philo Saunders, Walter Schmidt, Gary Slater, Ian Stewart, Jim Stray-Gundersen, and Darren Warburton.

A detailed summary table of all data sources used in the meta-analysis is available from the corresponding author on request.

## Footnotes

The costs of publication of this article were defrayed in part by the payment of page charges. The article must therefore be hereby marked “*advertisement*” in accordance with 18 U.S.C. Section 1734 solely to indicate this fact.

- Copyright © 2005 the American Physiological Society