Changes in lung function and structure were studied using hyperpolarized 3He MRI in an elastase-induced murine model of emphysema. The combined analysis of the apparent diffusion coefficient (ADC) and fractional ventilation (R) were used to distinguish emphysematous changes and also to develop a model for classifying sections of the lung into diseased and normal. Twelve healthy male BALB/c mice (26 ± 2 g) were randomized into healthy and elastase-induced mice and studied ∼8–11 wk after model induction. ADC and R were measured at a submillimeter planar resolution. Chord length (Lx) data were analyzed from histology samples from the corresponding imaged slices. Logistic regression was applied to estimate the probability that an imaged pixel came from a diseased animal, and bootstrap methods (1,000 samples) were used to compare the regression results for the morphological and imaging results. Multivariate ANOVA (MANOVA) was used to analyze transformed ADC (ADCBC), and R (RBC) data and also to control for the experiment-wide error rate. MANOVA and ANOVA showed that elastase induced a statistically measureable change in the average transformed Lx and ADCBC but not in the average RBC. Marginal mean analysis demonstrated that ADCBC was on average 0.19 [95% confidence interval (CI): 0.16, 0.22] higher in the emphysema group, whereas RBC was on average 0.05 (95% CI: 0.04, 0.06) lower. Logistic regression supported the hypothesis that ADCBC and RBC, together, were better at differentiating normal from diseased tissue than either measurement alone. The odds ratios for ADCBC and RBC were 7.73 (95% CI: 5.23, 11.42) and 9.14 × 10−5 (95% CI: 3.33 × 10−5, 25.06 × 10−5), respectively. Using a 50% probability cutoff, this model classified 70.6% of pixels correctly. The sensitivity and specificity of this model at the 50% cutoff were 74.9% and 65.2%, respectively. The area under the receiver operating characteristic curve was 0.76 (95% CI: 0.74, 0.78). The regression model presented can be used to map MRI data to disease probability maps. These probability maps present a future possibility of using both measurements in a more clinically feasible method of diagnosing this disease.
- lung physiology
- relationship of lung function and structure
- regional fractional ventilation
- apparent diffusion coefficient
- hyperpolarized gas magnetic resonance imaging
defined by the Global Initiative for Chronic Obstructive Lung Disease (GOLD) as a disease state characterized by airflow limitation that is not fully reversible (6), chronic obstructive pulmonary disease (COPD) is the fourth leading cause of death in the United States, afflicting an estimated 16 million Americans (3). Emphysema is a type of COPD characterized by the destruction of alveolar walls and enlargement of airspaces distal to the terminal bronchioles (6, 9, 30, 41). Because emphysema is defined anatomically, it is a presumptive diagnosis for most patients. Slightly <2% of noninstitutionalized adults in the United States have at some point been diagnosed with emphysema (4).
One troubling aspect of emphysema is that the disease can go undetected for extended periods of time. Physical examination of patients in the early stages of emphysema is often unable to detect the disease, and many COPD patients experience symptoms such as cough, sputum production, and dyspnea on exertion (39, 41) for months to years before seeking medical attention (54). Since patients may have been afflicted long before diagnosis, an objective and accurate method for disease classification and staging is necessary for patient prognosis and management. Because pulmonary function tests have been shown to be a good prognostic indicator (9, 54), the GOLD spirometry classification of severity is used today as the primary method for staging COPD (31, 41). Computed tomographic imaging can also be used to assist with the classification of COPD and is the current definitive test for establishing the presence or absence of emphysema in living patients (20, 21, 24).
Recently, there has been considerable interest within the scientific community in using hyperpolarized (HP) 3He magnetic resonance (MR) imaging (MRI) technology to study emphysema, since it can be used to directly measure alveolar destruction, airflow obstruction, and gas exchange (15, 36, 38, 48). Most HP MRI emphysema studies have regarded each of these aspects of the disorder as separate indicators (5, 45).
Since emphysema is graded using morphological measures of airway size and is staged using spirometric measures of airway obstruction (40), we hypothesized that combined HP MRI measures of airway size and obstruction can be used to distinguish emphysematous regions from normal regions within the lung. More specifically, we hypothesized that the combined analysis of the apparent diffusion coefficient (ADC; a measure of the lung's microstructure) and fractional ventilation (R; a measure of regional ventilation) will more accurately classify the lung into diseased and normal segments than either measure alone. We based this hypothesis on three suppositions. The first is that ADC will be a sensitive measure of alveolar destruction and that it can be used to grade tissue destruction radiographically, comparable to the use of chord length (Lx) to grade emphysema using histological samples (18, 33, 52). The second supposition is that HP MRI fractional ventilation is a sensitive measure of the airflow obstruction induced by emphysema with the potential to determine contrasts significant enough to differentiate diseased from normal lung segments (13, 45). The third supposition is that these measurements are only partially correlated and can thus supply independent information to the classification process (46), in analogy to the clinical phenotyping of COPD as primarily chronic bronchitic or emphysematous disease. These hypotheses were tested in a controlled murine experiment. As part of the analysis, we also introduced a probability distribution function that can be used to map HP MRI results to the regional likelihood that a section of lung is emphysematous or normal. We speculated that these probability maps may be of clinical interest.
MATERIALS AND METHODS
This study was performed under protocols approved by the respective Institutional Animal Care and Use Committees of the University of Pennsylvania and GlaxoSmithKline. Twelve healthy male BALB/c mice (Charles River Laboratories, Kingston, NY) of 3–4 mo in age with a mean body weight of 26 ± 2 g were randomized into two cohorts: 1) a healthy control group and 2) an elastase-induced emphysema group. Each group contained 6 mice. Animals were sedated with 4% isoflurane (Abbott Laboratories, Abbot Park, IL). Normal saline (50 μl) was instilled across both nares of each mouse in the control group, whereas 1.2 IU of type III porcine pancreatic elastase (Sigma, St. Louis, MO) was instilled across both nares of each mouse in the emphysema group. Elastase was diluted with normal saline for a total instillation volume of 50 μl. The method and volume of instillation were based on the work of Southam et al. (43), and the dose of elastase used was derived from Ito et al. (27). Based on previous studies on elastase-induced mice, we assumed that model induction reached steady-state conditions in 8 wk and that minimal airway remodeling occurred after this time point (2, 16, 25, 26, 40).
Mice were imaged over a time period of 3 wk at a frequency of two imaging sessions per week, with one animal from each cohort per imaging session. Control and emphysema animals 1, 2, 3, 4, 5, and 6 were imaged 8, 8.5, 9, 9.5, 10, and 10.5 wk, respectively, after model induction (see Table 1) Animals were sedated with 100 mg/kg ip ketamine and 10 mg/kg ip xylazine, a tracheotomy was performed, and a 1.5-mm endotracheal tube was used to intubate the trachea. Blood O2 saturation and heart rate were monitored using an MR-compatible rodent pulseoximeter (Nonin Medical, Plymouth, MN). Temperature was monitored with an MR-compatible rectal probe (SA Instruments, Stony Brook, NY) and controlled using a flow of warm air through the bore of the MRI scanner. The anesthesia was maintained by subsequent injections of ketamine through an intraperitoneal line accessible from outside the scanner bore.
Mice were connected to an MRI-compatible and programmable custom-built ventilator and ventilated with a tidal volume of 1.2 ml/100 g body wt at 110 breaths/min; the inhale-to-exhale ratio was 1:2, and the fraction of inspired O2 was 21%. The ventilator valves and delivery lines were integrated in the radiofrequency (RF) coil in the vicinity of the animal to minimize dead space and to allow for fast switching of valves. The ventilator is capable of delivering a specified tidal volume with an error of <50 μl and a dynamic dead space of <100 μl. During imaging, the ventilation gas was switched to a mixture of HP 3He and O2 in a four-to-one ratio with identical ventilation parameters. The helium gas (Spectra Gases, Branchburg, NJ), with a nominal concentration of 99.19% 3He and 0.81% N2, was hyperpolarized using the spin exchange optical pumping method, as previously described by Walker et al. (50), using a custom-made helium polarizer. A typical polarization level of ∼30% was achieved after 14 h of optical pumping.
Imaging was performed on a 50-cm 4.7-T MRI scanner (Varian, Palo Alto, CA) equipped with 12-cm 25 G/cm gradients. Mice were placed in a supine position in the bore of the magnet with a 1-in. surface coil tuned to the 3He resonance frequency of 152.95 MHz placed upon their chest. Animals were localized in space with scout images, and the middle coronal slice was selected as the imaging plane for the experiment. Pulse width calibration was performed on the loaded RF coil to determine the applied flip angle (α) for each animal.
Images were acquired using a fast gradient echo pulse sequence with the following parameters: field of view = 3 × 3 cm2, slice thickness = 4 mm, α = 20° (R) and 45° (ADC), matrix size = 64 × 64 pixels (planar resolution of ∼470 μm), and echo time/repetition time = 2.4/3.2 ms (R) and 4.3/5.5 ms (ADC). Animals were fixed in position during the experiment to ensure that the images would be coregistered.
R values were measured using the technique previously described by Deninger et al. (11). This technique is based on the buildup of HP 3He magnetization during a series of washin experiments. During each cycle of the experiment, an animal is ventilated with a large number of room air breaths to wash out all HP 3He from the lungs. A series of HP 3He breaths is then delivered in succession, and an image is obtained during the final HP breath under breath-hold conditions. The process is repeated several times with an ever-increasing number of polarized breaths per cycle. The available MR signal in the lungs at each step of this sequence [S(j)], in its simplified form, can be recursively expressed for the jth HP breath as follows: (1) where SS is the source magnetization, t(j) is the time elapsed from the beginning of the measurement, and τ is the duration of each breath including the imaging breath hold. T1,bag and T1,O2 are the relaxation time constants of HP 3He in the reservoir bag due to wall effects and in the lungs due to interactions with O2, respectively. T1,O2 is a function of Po2, i.e., T1,O2 = ξ/Po2, where ξ ≈ 2.6 bar·s under physiological conditions (42). Equation 1 can be made to fit the image data to derive the desired R values. In this study, ventilation images were acquired corresponding to 1, 2, 3, 4, 5, 8, 13, and 20 HP 3He breaths, and images from the three slices were acquired during a 500-ms breath hold after each series of HP 3He breaths. Only the central slice corresponding to the ADC imaging plane was used for data analysis. Two sets of normalization images were acquired right before and after the R sequence to estimate the T1,bag value based on the overall depolarization of 3He in the reservoir (∼30–40 min for all experiments). From a practical standpoint, the O2 decay effect has a negligible effect on signal buildup (14), and, therefore, a nominal value of Po2 = 140 mbar was assumed to hold throughout the lung.
Apparent diffusion coefficient.
ADC images were obtained by introducing an interleaved diffusion-sensitizing bipolar gradient in the gradient echo imaging pulse sequence, with diffusion time = 1 ms, gradient ramp time = 180 μs, and b values = 0.0 and 2.18 s/cm2. The gradients were applied along the phase encoding direction. The images corresponding to the nonzero diffusion gradient were acquired using both positive and negative bipolar gradient values (for a total number of 4 b values). The geometric mean of the two images was then used to fit to the following equation to yield the ADC value for each pixel: (2) where S0 and S1 correspond to the MR signal of zero and nonzero b values, respectively, and α is the applied RF pulse flip angle. ADC images were generated using 64 HP 3He breaths. Each k-space line position was acquired at the end of each breath by applying four RF pulses and acquiring the signal corresponding to each of the four b values. This procedure was repeated 64 times until the entire k-space for the imaged slice was completely covered. α was estimated using the two nondiffusion-weighted images, and ADC was estimated from Eq. 2. This approach allowed us to use a much larger α (∼45°), which yielded an improved signal-to-noise ratio.
Histology and measurement of Lx.
After being imaged, each animal was euthanized using intravenous potassium chloride. The trachea and lungs were immediately removed en bloc. The lungs were inflated with 10% formalin solution for 24 h, as described by Churg et al. (7, 8), and at a transpulmonary pressure of 20 cmH2O, similar to Massaro et al. (32). The lungs were then embedded in paraffin, and representative central 10-μm-thick whole lung coronal sections were cut from the paraffin blocks. These sections were stained with hematoxylin and eosin. All lungs changed <20% in size during the tissue-processing step.
Histology slides were scanned using a ×20 slide scanner (AperioTechnoligies, Vista, CA), and the resulting images were analyzed using MATLAB (The MathWorks, Natick, MA). Each 18 × 18-mm2 lung slide, encompassing the whole coronal lung field, was scanned at 0.5-μm isotropic resolution; the resulting images were downsampled to 20% of their original size, yielding 7,000 × 7,000-pixel images. The central airways and vasculature were masked, and the images were converted to a binary scale by thresholding. The thresholding process effectively removed the background signal but left holes in the alveolar walls that were typically smaller than two pixels wide. A ball, one pixel in diameter, was rolled around each connected region of pixels above threshold, and the ball's path was added to the image to close these gaps. The ball rolling process thickened the alveolar walls, so the boundaries of the connected regions were subsequently skeletonized to the original wall thickness, and then a low-pass filter was applied to remove background speckle noise. A grid of parallel lines spaced at 150 μm was superimposed on each image, and Lx, defined by the intercept of the lines with the airspace walls, was measured and stored. Lx values of <7.5 μm and >250 μm were discarded. The low-pass and high-pass cutoffs eliminate optical and histological noise and are similar to those used in previous Lx analyses (29, 44). The upper bound also excludes measurements from airspaces much larger than alveoli not accounted for by masking; 7.7% of airspace measurements were discarded, and 1.6% of the measurements were above the upper bound. Mean linear intercept values depend on the their direction relative to the lung tissue slice (35). For better conformity to ADC measurements, Lx values were measured in the direction of the diffusion sensitizing gradients.
We chose a rather large spacing between Lx measurements in an effort to reduce the correlation between measurements, i.e., to minimize the number of times that the same airspace was sampled, obviating the need for a repeated-measure design in our data analysis. Bootstrap analysis of subsampled data suggested that this assumption would hold for our analysis (the Huynh-Feldt epsilon is close to 1).
We measured the area of each enclosed airspace by counting the number of pixels in each airspace and measured scaling by the area of an individual pixel. An airspace was defined as a connected region of white space in the preprocessed binary airway image. Connectivity was defined using the four-pixel connected neighborhood method. We then calculated the corresponding equivalent airspace diameter (di) using the following equation: (3) where Ai is the area of the enclosed region i. di values were used to derive the first and second ratios of di moments (D1 and D2, respectively) as follows: (4) (5) as proposed by Parameswaran et al. (37). μ and σ2 are the mean and variance of the of the airspace diameters for a region of interest (ROI), and γ is the skewness of the diameter distribution.
To elucidate the effects of aggregating data on classification accuracy, histological images were segmented in a piecewise fashion into 200 × 200-pixel (500 × 500-μm) ROIs. This size was chosen to approximate the MRI imaging resolution. D1, D2, and mean Lx (Lm) were calculated for each ROI. Airspaces and corresponding Lx values that were shared by adjacent ROIs were assigned to a particular ROI based on the location of their centroids. To increase the samples of Lx values per ROI, Lx values were sampled along lines spaced at 25-μm intervals instead of the 150 μm used for the whole lung analysis. No thresholding was used for the D1 and D2 analyses.
Morphological data were analyzed using nested ANOVA to discern the difference between diseased and normal control animals using a standard signficance value of α = 0.05. Regression analysis of the ANOVA was used to clarify the model results. We also performed logistic regression to derive estimates of the probability that sections of the lung came from diseased animals or diseased sections of the lung. Imaging provided two metrics for analysis (ADC and R), so multivariate ANOVA was used to analyze these data and control for the experiment-wide error rate. Post hoc ANOVA and multivariate regression were used to further explain the results. Because both the sample size and effect sizes were small, the significance level for this portion of the study was set at α = 0.10. Logistic regression was used to estimate the probability that an imaged pixel came from a diseased animal. Bootstrap methods were used to estimate distributions for the classification results to permit comparisons between the logistic regression results for the morphological and imaging results. One thousand bootstrap samples were used to estimate each distribution. Bootstrap samples were drawn from normalized confusion matrixes for the sensitivity and specificity comparisons (17, 22, 23). Data analysis was performed using STATA 10 (StataCorp LP, College Station, TX) and Matlab 7.6 (The MathWorks).
Representative histological sections and Lx measurements from the diseased and normal animal groups are shown in Fig. 1. The histograms of the measured Lx values were positively skewed (average skewness: 2.00) and peaked (average kurtosis: 7.03). We used logarithmic transformation to transform the data to normality (4). The average skewness and kurtosis of the transformed variables were 0.54 and 2.56, respectively. We assumed that our statistical methods were resilient to violations of normality due to the large sample size (226,176 data points) and our transformation (3, 19). The data were heteroscedastic (see Table 2), and Bartlett's test for equal variances was significant χ2(11) = 7.5 × 103 (P < 0.001). We performed a Monte Carlo simulation to determine the effect that unequal variances of the order encountered in this study had on ANOVA and concluded that the F test would be robust under our conditions of unequal variances; therefore, we proceeded with our analysis. The means of the natural logarithm of Lx [l̄og(Lx)] for mice in the control and emphysema groups are shown in Table 2. The l̄og(Lx) values in the emphysema group were larger than those in the control group, suggesting that we succeeded in increasing airspace size in the emphysema group. To assess significance, we tested the hypothesis that the l̄og(Lx) of the control group was different from the emphysema group using nested two-factor ANOVA. The animal factor with six levels (one for each animal) was nested within the disease factor that had two levels (control and emphysema). The animal factor was assumed random and the disease factor fixed. The control and emphysema groups were significantly different from each other [F(1, 10) = 7.07, P = 0.024, η2 = 0.01]. Marginal means calculated from the regression model showed that the emphysema group was on average 0.16 [95% confidence intervals (CIs): 0.15, 0.17] larger than the control group. The animal nested in disease factor was also statistically significant [F(10, 226,602) = 427.83, P < 0.001, η2 = 0.02].
Values of l̄og(Lx) for animals 3 and 5 in the emphysema group appeared close to normal. Post hoc testing was performed to compare control animals together with the individual animals in the emphysema group. Bonferroni's method was used to account for multiple comparisons. P values for animals 1, 2, 4, and 6 in the emphysema group were all <0.01. For animals 3 and 5 in the emphysema group, we found P = 0.10 [F(1, 226,602) = 5.79] and P = 1.00 [F(1, 226,602) = 1.92], respectively. Histopathological analysis showed that animal 5 in the emphysema group was indistinguishable from a normal animal, i.e., we were unsuccessful at inducing emphysema in this animal, so we removed it from further analysis. Animal 3 had a region of emphysema induced in the upper lung fields, and we retained it for further analysis. This balanced the final study design because the MRI scanner malfunctioned during the data collection of animal 3 in the control group, so this animal was dropped from further data analysis as well.
Aggregate histograms of log(Lx) values for the control and emphysema groups are shown in Fig. 2. The significant overlap between the two groups explains the small effect size for the disease factor. The histograms had a bimodal distribution and were fit to a sum of two Gaussian functions; the distributions with the lower means were indistinguishable based on their first two cumulants [control: μ = 3.30 (95% CI: 3.25, 3.35) and σ2 = 0.42 (95% CI: 0.39, 0.46); and emphysema: μ = 3.30 (95% CI: 3.23, 3.36) and σ2 = 0.43 (95% CI: 0.39, 0.47)]. We hypothesized that the distributions described by these histograms mainly corresponded to normal alveolar-sized airspaces and biased estimates of larger airways, whereas the distributions with higher means corresponded to larger airway structures and portions of the lung in which model induction was effective. We formally separated the log(Lx) distributions into high and low groups using k-means clustering. This gave a cut value of 3.86 for separating the two groups.
We performed three-factor nested ANOVA with post hoc testing to compare l̄og(Lx) values in the control and emphysema groups. L̄og(Lx) values as a function of cut value are shown in Table 3. We were particularly interested in the similarities and differences between the high and low groups. The ANOVA model used was similar to the one presented earlier with the addition of a new two-member (high and low) fixed-group factor that interacts with the disease and animal terms; the cut point for the high/low group was determined from the k-means clustering presented earlier. The disease factor remained significant [F(1, 8) = 10.27, P = 0.01, η2 = 0.002]. The group factor was also significant [F(1, 8) = 2,705.17, P < 0.001, η2 = 0.66]. This effect size was large, as was expected from the histograms shown in Fig. 2, i.e., there was a large spread between the high and low modes. Marginal mean analysis of the regression model showed that the high and low groups differed on average by 1.28 (95% CI: 1.275, 1.283). Post hoc testing was performed to compare the average of l̄og(Lx) values for the control low group with the individual l̄og(Lx) values in the low emphysema group. All P values were >0.05. In contrast, for the high group, all P values were <0.01. These results support our conjecture that the low groups are indistinguishable from each other and that all of the contrast between the control and emphysema cohorts will come from the high group.
For later comparisons, we performed logistic regression to calculate the probability that a Lx measurement came from a diseased animal [P(Disease)]. We used a binary variable as the dependent variable. The variable equaled one if the Lx measurement came from an animal in the emphysema group and zero if it came from the control group. log(Lx) served as the independent variable. The results showed that the probability of disease increased as log(Lx) increased. The odds ratio for the log(Lx) term was 1.39 and was statistically significant (z = 50.84, P < 0.001). The overall model was also significant at less than the 0.001 level according to the model likelihood ratio χ2-statistic [χ2(1) = 2,625, P < 0.001]. Coefficients of the regression model were −1.29 (95% CI: −1.34,−1.25) for the constant term and 0.33 (95% CI: 0.32,0.34) for the log(Lx) term. This led to the following expression for the probability distribution function: (6) Using a 50% probability cutoff, this model correctly classified 55.77% of Lx values. The 50% probability value corresponded to a log(Lx) of 3.91, which is close to the high/low group cut point determined from cluster analysis. The sensitivity and specificity were 36.0% and 73.7%, respectively.
We were interested in obtaining the probability that a Lx comes from a diseased segment of the lung. We speculated that, because of incomplete induction and the stochastic nature of Lx measures, the diseased animal high group was a better surrogate for disease than was the binary variable indicating whether model induction was performed. Therefore, we created a new dependent variable indicating whether a Lx belongs to this class. We repeated the logistic regression with this dependent variable, retaining log(Lx) as the independent variable. As before, the probability of disease increased as log(Lx) increased. The odds ratio using this model became 17.93 (95% CI: 17.41,18.45), dramatically higher than before. The model likelihood ratio χ2-statistic [χ2(1) = 73, 181.12] was significant (P < 0.001). The model coefficients were −13.06 (95% CI: −13.18, −12.94) for the constant term and 2.89 (955 CI: 2.85, 2.92) for the log(Lx) term, yielding the following probability distribution: (7) This model correctly classified 85.6% of the Lx values. The sensitivity and specificity of the model were 50.0% and 93.4%, respectively, and the 50% cut point corresponded to a log(Lx) of 4.52. Plots of Eqs. 1 and 2 are shown in Fig. 3. The receiver operating characteristic (ROC) curves corresponding to classification schemes based on Eqs. 1 and 2 are shown in Fig. 4.
Some comments on the upper Lx bound (250 μm) are in order. The upper bound has minimal impact on hypothesis testing outcomes, e.g., the 95% CIs for the difference between the bound and unbound F-ratios for the first ANOVA presented include zero: (−0.934, 0.579). These differences were determined using 1,000 bootstrap samples/ANOVA. This is a consequence of the large sample size, Lx distribution of the transformed variable, and the relatively small number of measurements above the threshold. There was a statistically significant difference [χ2(1) = 28.40, P < 0.001] between the bound [0.549 (95% CI: 0.547, 0.551)] and unbound [0.558 (95% CI: 0.556, 0.560)] area under the ROC curves for log(Lx). The increase in area is expected, since the large Lx values come predominantly from the diseased animals and thus have a high probability of being correctly classified. This difference is small, and the significance must be interpreted in light of the large sample size. The upper bound is conservative with respect to the apparent change in Lm induced by elastase, i.e., the difference in Lm between diseased and normal animals is roughly 40% smaller with thresholding than without. This makes model induction appear less effective in the bound Lm domain, as shown in Table 2.
Figure 5 shows R and ADC maps and distribution histograms for representative animals from the control and disease groups. The ADC histograms were slightly skewed (average skewness: 1.86) and peaked (average kurtosis: 10.34). The R histograms showed a similar behavior (average skewness: 1.01 and average kurtosis: 8.45). The Box-Cox (4) power transformation was used to transform each data set to normality; λ for the ADC and R transformations were 0.40 and 0.65, respectively. The average skewness after transformation for ADC and R were both <0.001. The average kurtosis dropped to 4.75 and 4.95 for ADC and R, respectively. We assumed that our statistical methods were robust to violations in normality due to the transformation and large sample size (4, 19, 51), >4,000 data points for each imaging modality. The means of the transformed ADC (ĀDCBC) and R (R̄BC) and their respective SDs are shown in Table 4. Both data sets were heteroscedastic [Bartlett's tests for equal variances were significant: χADC2(9) = 36.74, P < 0.001, and χR2(9) = 714.05, P < 0.001], but Monte Carlo simulations suggested that our statistical methods were valid under these conditions. A scatterplot of the transformed variables is shown in Fig. 6. The variables appeared poorly correlated, and the Pearson product moment correlation coefficient was low [r = −0.172 (95% CI: −0.190, −0.137)].
We note that the Box-Cox transformation was rank preserving and that the ĀDCBC values were larger in the emphysema group than in the normal group. In contrast, R̄BC values appeared to be lower in the emphysema group than in the control group. We hypothesized that these observations were statistically significant. Nested two-factor MANOVA was performed to test this hypothesis. This study design paralleled the Lx ANOVA study presented earlier. The independent variables were ADCBC and RBC. The animal factor with five levels (one for each animal in a group) was nested in the two-level disease factor (control group and emphysema group). As before, the animal factor was assumed random and the disease factor fixed. We choose α = 0.10 as our level of significance for this portion of the analysis because of the small sample and effect size. MANOVA showed that the disease effect was significant [Wilk's λ: 0.47, Pillai's trace: 0.53, Lawley-Hotelling trace: 1.15, and Roy's largest root: 1.15; F(2, 7) = 4.01 for all, P = 0.069 for all]. Analysis of the multivariate regression MANOVA model showed that ADCBC was on average 0.19 (95% CI: 0.16, 0.22) higher in the emphysema group than in the control group, whereas RBC was on average 0.05 (95% CI: 0.04,0.06) lower in the emphysema group than in the control group. Post hoc ANOVA for the dependent variables ADCBC and RBC, using the same design, showed that the disease factor contrast came from the ADCBC term, i.e., the disease effect was significant in the ADCBC ANOVA [F(1, 8) = 10.72, P = 0.011, η2 = 0.06] but not in the RBC ANOVA [F(1, 8) = 1.99, P = 0.20, η2 = 0.09]. Further analysis indicated that animal 6 in the control group seemed to be an outlier and that the lack of significance could be attributed to this animal.
As with the Lx analysis, we were interested in determining the probability that a ROI in the MRI scan came from a diseased animal. We generated a new dependent variable indicating whether or not an image pixel came from a diseased or control animal and performed logistic regression using ADCBC as the independent variable. As we did with the Lx analysis, we found that the probability of disease increased as ADCBC increased. The odds ratio for ADCBC was 15.23 (95% CI: 11.10, 20.90); this ratio was statistically significant (z = 16.87, P < 0.001). The model likelihood ratio [χ2(1) = 338.71] was significant (P < 0.001). The model coefficients were 4.10 (95% CI: 3.63, 4.57) for the constant term and 2.72 (95% CI: 2.41, 3.04) for the ADCBC term; 62.9% of pixels were correctly classified using a cutoff probability of 50%, and the 50% cut point corresponded to an ADCBC of −1.51. The sensitivity and specificity were 68.8% and 56.3%, respectively. Repeating the regression using RBC as the independent variable showed that the probability of disease decreased as RBC increased, i.e., the odds ratio for RBC was 6.13 × 10−4 (95% CI: 3.05 × 10−4, 12.3 × 10−4). This observation was statistically significant (z = −20.81, P < 0.001), as was the model likelihood ratio [χ2(1) = 521.15, P < 0.001]. The coefficients for this model were −8.79 (95% CI: −9.51, −1,787) for the constant term and −7.40 (95% CI: −8.10, 6.70) for the RBC term. This model correctly classified 66.3% of pixels with a sensitivity and specificity of 61.8% and 70.8%, respectively. The 50% probability cut point used for classifying pixels corresponded to a RBC of −1.19.
Repeating the regression for a third time using both ADCBC and RBC as independent variables confirmed our earlier observation that diseased animals tended to have high ADCBC and low RBC, whereas normal animals tended to have low ADCBC and high RBC. The odds ratios for ADCBC and RBC were 7.73 (95% CI: 5.23, 11.42) and 9.14 × 10−5 (95% CI: 3.33 × 10−5, 25.06 × 10−5), respectively. Both ratios were statistically significant (ADCBC: z = 10.27, P < 0.001; RBC: z = −18.07, P < 0.001), as was the model as a whole [model likelihood ratio of χ2(1) = 521.15, P < 0.001]. The model coefficients were −7.66 (95% CI: −8.98, −6.33), 2.05 (95% CI: 1.65, 2.44), and −9.30 (95% CI: −10.31, −8.29) for the constant, ADCBC, and RBC terms, respectively, yielding the following expression for the probability that a ROI came from a diseased animal: (8) Using a 50% probability cutoff, this model classified 70.6% of pixels correctly. The sensitivity and specificity of this model at the 50% cutoff were 74.9% and 65.2%, respectively. A plot of the sensitivity and specificity as a function of cut probability is shown in Fig. 7. Likelihood ratio tests for the full model compared with nested models presented earlier were significant, suggesting that both interaction terms contribute significantly to the full model. ROC curves for the three image classification models are shown in Fig. 4. In the region of high specificity, where these models were tested, i.e., at the 50% cut probability, the combined model appeared to be the best classifier. Interestingly, as the specificity dropped, i.e., the cut probability increased, the combined curves and RBC curves approached each other. The areas for the combined, ADCBC, and RBC curves were 0.76 (95% CI: 0.74,0.78), 0.66 (95% CI: 0.64, 0.68), and 0.74 (95% CI: 0.72, 0.76), respectively. The areas were compared using the algorithm suggested by DeLong et al. (10); the differences between the areas were found to be statistically significant: χ2(2) = 277.51 (P < 0.001) for the simultaneous comparison of all three areas, χ2(1) = 14.19 (P < 0.001) for the comparison of combined and RBC areas, and χ2(1) = 40.32 (P < 0.001) for the comparison of ADCBC and RBC areas.
Histology versus imaging.
It is informative to compare classification results. In this experiment, ADCBC was better than Lx at determining whether a measurement came from a diseased or control animal. The 95% CIs for the difference in classification results (9.4%, 9.6%) did not include zero. RBC was slightly better than ADCBC at classifying pixels (1.7%, 2.9%), whereas ADCBC and RBC together were better than either HP gas-derived metric alone (4.5%, 5.0%) at categorizing pixels. Finally, we noted that when Lx was used to classify whether a measurement came from a diseased or normal segment of the lung, it performed statistically significantly better than when ADCBC and RBC together were used to determine if a pixel came from a diseased or normal animal (9.5%, 9.9%). Bonferroni's method was used to adjust CIs to account for multiple comparisons.
Comparison of the sensitivities and specificities of Eqs. 7 and 8, assuming a 50% cut value, showed that Eq. 8 was more sensitive than Eq. 7 (11.4%, 11.5%) but less specific (−11.4%, −11.5%). Note that the CIs for the sensitivity and specificity comparisons were for the normalized confusion matrixes, which is why the ranges do not contain the difference in sensitivities and specificities.
Effects of aggregation.
Aggregating morphological data increases classification performance. A direct comparison of regionally aggregated morphology data with imaging data is beyond the scope of this work because of registration issues. Nonetheless, it is still informative to examine classification performance of aggregated morphological data and then indirectly compare these results with imaging results. We choose three morphology parameters: Lm as well as D1 and D2, as proposed by Parameswaran et al. (37), for this analysis. Lm is a directional quantity like ADC. D1 and D2 are direction independent and include more information about the distribution of airspace sizes in their description of airspace morphology. The histology images were divided into square segments approximating the imaging pixel size, as described in materials and methods. These parameters were then calculated for the individual segments and used as inputs for logistic regression-based classifiers. Disease induction, as in the imaging-based classifiers, was used as the dependent variable. Average values for each parameter are shown in Table 5. We plotted the ROC curves for D1, D2, and Lm (Fig. 8). The regression equations are shown in Fig. 8. The area under the D1, D2, and Lm curves were 0.658 (95% CI: 0.643, 0.673), 0.637 (95% CI: 0.621, 0.652), and 0.730 (95% CI: 0.717, 0.744), respectively. The observed differences in area were significant [χ2(2) = 92.46, P < 0.001]. The D1 and D2 areas were similar in magnitude, whereas the Lm area under the ROC curve appeared to be larger than both D1 and D2 areas. Post hoc testing, corrected for multiple comparisons using the Bonferroni method, supported these observations [D1 to D2: χ2(1) = 3.88, P = 0.147; D1 to Lm: χ2(1) = 49.69, P < 0.001; D2 to Lm: χ2(1) = 81.57, P < 0.001].
The ADCBC ROC curve from Fig. 4 was also included in Fig. 8. Subjectively, over the entire parameter space, it appeared that the ADCBC and ratio of moments classifiers performed similarly, i.e., the areas under their respective ROC curves were similar [χ2(2) = 4.97, P = 0.0832], and that the Lm classifier overall performed best [χ2(3) = 97.48, P < 0.001]. Qualitatively, it appeared as if the Lm and D1 classifiers were the best at distinguishing normal from diseased segments of the lung when cut probabilities (and, consequently, their respective measurement values) were high, that is, the ROC curves rose sharply in the low false positive regime. It was also apparent that for low ratios, the ratio of moment classifiers did not perform well, i.e., the ROC curves approached the reference line when specificity was low. The ADCBC classifier displayed the best performance of all classifiers in the low specificity range. These differences were a consequence of the distinct contrast mechanisms of the different modalities.
The sensitivity, specificity, positive predictive value, negative predictive value, and percent correctly classified for all four aggregated classification schemes are shown in Table 6. At the 50% cut probability, the ADCBC classifier had the highest sensitivity followed by Lm, D2, and D1, in descending order. The difference between ADCBC and Lm sensitivities was statistically significant (95% CI: 0.029, 0.093, with 95% CIs for difference adjusted for multiple comparisons). The differences between Lm and D2 (95% CI: −0.024, 0.034) and D2 and D1 (95% CI: −0.002, 0.0599) sensitivities were not significant. The Lm classifier had the highest specificity followed by D1, ADCBC, and D2 classifiers, in descending order. The differences between Lm and D1 (95% CI: 0.068, 0.134) and D1 and ADCBC (95% CI: 0.045, 0.116) specificities were statistically significant, whereas the difference between ADCBC and D2 specificity (95% CI: −0.034, 0.036) was not.
Diagnosing emphysema is challenging because it shares many clinical features with chronic bronchitis. Pathological evaluation remains as the gold standard, but the invasive nature of this approach makes it impractical for routine clinical use (53). While a variety of clinical and experimental strategies for dealing with these issues exist, we are interested in determining if HP MRI can be used to differentiate between normal and emphysematous regions of the lung. This area of inquiry represents a major step in developing MRI-based methods for studying and eventually diagnosing emphysema. Our central hypothesis was that HP MRI measures of structure and function together will be superior to measures of structure or function alone in classifying disease.
Emphysema is defined anatomically by the destruction of alveoli (47, 49), so we choose the short time limit ADC as our measure of structure. This parameter is the most commonly used HP MRI measure for the study of emphysema and has been specifically evaluated in rodent models of this pulmonary disease (5, 13, 34, 46). Dugas et al. (12), for instance, evaluated 3He ADC changes in both smoking- and elastase-treated mice and observed an overall increase of 25% in ADC values in the elastase-treated group compared with controls. Comparing ADC measurement between different studies, however, requires taking into account the specifics of the diffusion sensitizing gradients. In the Dugas et al. study (12), ADC measurements were performed with diffusion time = 480 μs and δ = 400 μs compared with diffusion time = 1 ms and δ = 200 μs used in the present report. Noting that a smaller diffusion time in general results in a larger ADC for a given set of b values, they reported, overall, 25% larger ADC values of 0.20 ± 0.01 versus 0.25 ± cm2/s for healthy controls and elastase-treated animals, respectively, compared with 0.10 ± 0.01 and 0.13 ± 0.02 cm2/s in this work.
The level of airway obstruction, as measured by spirometery, is a prognostic indicator and forms the basis of the GOLD spirometry classification of severity (41). We assumed that regional measures of ventilation will also be sensitive to disease progression and chose R, a normalized ventilation defined as the increase in airspace volume in a region of interest during inspiration normalized by the total airspace volume of the ROI at the end of inspiration (11), as our second variable. Like regional ventilation, R decreases in areas of obstruction and increases with increasing tidal volume. Both have been found to indicate emphysema in animal models (13, 46).
There are four major findings in this report. First, elastase induces a statistically measureable change in the average transformed linear intercept and ADC but not in the average transformed R. The first two observations were expected and consistent with the known behavior of elastase-induced lung injury, but the third observation was unexpected and contrary to previously published works in which elastase-induced lung injury was found to decrease R (45). There are two likely explanations. The first is that the emphysema group may have been overventilated or the control group underventilated, such that the marginal means of one of the groups was artificially shifted into the range of the other. Because R, like regional ventilation, is dependent on inhalation volume, the tidal volume effect can be partially mitigated through a transformation process proposed by Emami et al. (13), which makes the transformed averaged ventilation a strong function of the variance and covariance of the regional ventilation measurements rather than their absolute values. However, this correction did not rectify the problem, suggesting that the unexpected result was not due to a tidal volume effect. The second possible explanation is that animal 6 in the control group is an outlier with an unusually low R̄BC that shifts the marginal means of the control group lower than what they should be. This possibility was evident when inspecting a scatterplot of R̄BC versus l̄og(Lx) (see Fig. 9). Imputation modeling confirmed that R̄BC of animal 6 in the control group was significantly lower than expected; reanalysis with the imputed value showed that the disease factor would have been statistically significant had the R̄BC variable been closer to the imputed value. We suspect R̄BC was low because of experimental error, most likely due to ventilator failure where less polarized gas was delivered to the animal than was anticipated, making the buildup of signal in the lungs appear slower than expected. This would lead to a lower measured R while preserving lung size (see Eq. 1). An alternate explanation is that low R̄BC is a property of some normal BALB/c mice. Further work will be required to elucidate which is the more likely explanation.
The results shown in Fig. 9 are also notable for their apparent piece-wise linear structure. In the normal to high normal range, R̄BC decreased slowly as a function of l̄og(Lx). This behavior continued until a critical point was reached, after which it dropped rapidly. This pattern suggests that the lungs are robust to elastase damage until a threshold is reached; once that threshold is exceeded, further injury causes dramatic changes in function. This seems to be consistent with the clinical observation that patients with early disease tend to do well with minor lung insult, whereas those with late disease can degrade rapidly. It also suggests that R may not be as sensitive to early changes in lung morphology but may become very sensitive once the disease exceeds a critical point. ĀDCBC and l̄og(Lx), as previously described (18, 28), are strongly correlated (product moment correlation coefficient: 0.85, P = 0.002), and ĀDCBC monotonically increases as l̄og(Lx) increases, so we anticipate that the ADC, like l̄og(Lx), will be sensitive to early changes in airway structure and, thus, suitable for early disease detection.
Our second major finding is that ADCBC and RBC are better at differentiating normal from diseased tissue than either measurement alone. The ROC curves for the three image-based classification schemes are shown in Fig. 9. It is interesting to note that the ROC curves for the combined ADCBC and RBC classifier and the RBC classifier are similar in the high sensitivity regime, but the RBC classifier regresses to the reference line more quickly than the combined classifier in the low sensitivity zone, suggesting that the combined test will be important when high specificity is required. It is also apparent that ADCBC alone has the least desirable ROC curve. The RBC and the combined classifier suffer substantially in the low specificity range because of the outlier: animal 6 in the control group. This also affects overall classification performance; for example, the ROC area for the RBC classifier with normal animal 6 removed was 0.796 (95% CI: 0.784, 0.809), whereas that for the combined classifier was 0.850 (95% CI: 0.834, 0.865). The performance advantage that R displays over ADC may be due to the severity of disease induced, with poorer performance expected with milder disease, as defined by the vertex shown in Fig. 9. The ROC curves for the two nonaggregated morphology classification schemes are also shown for comparison. The image ROC curves are bounded by the morphology curves.
The transformed Lx histograms provide insight into the classification process (see Fig. 2). The control and emphysema histograms are broad and have a bimodal distribution with nearly indistinguishable dominant lower modes. From a classification standpoint, this implies that large portions of the diseased lungs appear indistinguishable from their normal counterparts. Significant overlap also occurs between the higher and lower modes as well as between the higher modes themselves. These higher modes overlap but are shifted relative to each other, resulting in most of the contrast between the control and emphysema groups. The significant overlap between modes and the relatively low surface areas of the high modes demonstrates how challenging the classification process is and suggests that performance will be poor if model induction is used as the only reference variable because there is no ideal cut point that clearly separates the two histograms. By accounting for the commonality of the lower modes and assuming that the lower modes represent normal lung tissue, classification performance is greatly improved because the problem is changed to that of finding a cut point that adequately separates the high modes, rather than the entire histogram. This improvement is manifested in the large increase in area under the high disease reference variable classification scheme ROC curve relative to the disease (model induction) factor classification scheme ROC curve. This finding also signifies the fact that sample size calculation for studies of this type, and similar clinical studies, should not be made using the traditional estimate of the dominant mode of the histogram but with the second mode that more likely indicates the diseased portion of the lung. It seems reasonable to assume that the ADC-based classification suffers from the same pathology and that classification performance would be greatly enhanced if the sample size were large enough to perform a similar analysis for the ADC data. An alternate approach would be to use a second coregistered modality to further define the reference variable during the model development stage.
Individual Lx measurements represent a biased estimate of airway caliber due to the stochastic nature of the sampling process. The random sampling process also explains why the histograms are broad and overlap, despite disease induction. Averaging produces a less biased estimate and improves classification performance significantly. This is shown in Fig. 8. It is interesting to note that despite a strong correlation between ĀDCBC and l̄og(Lx) the classifiers based on these parameters performed quite differently with statistically significant differences in global measures of performance. In this case, Lm performed best, whereas ADCBC and the ratio of moments classifiers performed similarly with respect to area under the ROC curves. In our example, the combined ADCBC and RBC classifier and the RBC classifier globally performed better than the histology-based classifiers. These comparisons suffer from some limitations, an important one being the lack of registration between pathological and imaging data sets. This adds variability to the comparisons because, for example, the origin of segmenting grid used to divide the histology data set into imaging pixel-sized units is arbitrary. For our data set, the area under the curve for the primary histological classifier of interest, Lm, was robust to grid translations. The estimated SD for the area under the Lm ROC curve was 0.002 (calculated from a random sample of 6,199 origins), suggesting the trends presented are likely to be accurate. The effects of tissue distortion require further investigation.
The correct classification percentages for all classification schemes and the relative portion of the lung above the high/low cut point are shown in Table 7. These classification percentages were calculated at the 50% probability cut point and illustrate many of the points discussed above, but in tabular form.
Our third significant finding is that the structure of the data suggests a nested ANOVA study design may be a poor choice because a significant amount of discriminating power is lost during the partitioning process; note the degrees of freedom for the denominator for the disease factor and the relatively small effect sizes. In addition, as described above, the use of model induction or the presence of disease, which would be used in a clinical study, is suboptimal due to the distribution of the airway sizes. These findings suggest that large study sizes are ideal, particularly if they lead to partitioning of the diseased cohort's data distributions into normal and abnormal categories. If this is not possible, alternate strategies, like the use of another imaging modality, can be used to further refine the reference variables. It also seems reasonable to include nonimaging parameters and neighboring pixels values in the regression model to further enhance performance.
Finally, the regression model presented can be used to map the MRI data to probability maps, which graphically display the likelihood that a region of lung is diseased or normal. It is informative to look at example probability maps. In Fig. 10, we plotted the Lx, raw ADC, and raw R maps and the probability of disease generated from the regression models for a sample emphysema animal. We also plotted the aggregated histology results and the corresponding probability maps for comparison. The results shown in Fig. 10 illustrate some important points. The first is that there are obvious geometric differences between the histology and imaging data, which prevent direct coregistration of the data sets. There are two reasons for this. First, the histology sections are much thinner than the image slices; this implies that the histology data sets are undersampled from a morphological standpoint. The second is tissue-processing artifacts. Both of these effects make tissue-to-image registration difficult. Further study is required before the morphology can be used to drive and test imaging classification results on a pixel-by-pixel scale.
Additionally, interpretation of both the histological and imaging data sets is facilitated by generation of disease probability maps; the Lx probability map was generated from Eq. 7. We note that, in the Lx domain, the lung uniformly showed a high probability of disease with the exception of the bottom right lung field, which contained some normal-appearing lung segments (Fig. 10). It is interesting to note that the less biased aggregated morphology metrics confirmed that the airspaces are smaller in this area but that the airspaces may not be entirely normal, because they have an intermediate probability of disease, particularly in the Lm and D1 space. In this example, the probability maps generated from higher ordered ratio of moment metric D2 performed less well, in the sense that the certainty of the discrimination between the diseased and normal sections of the lung was less, as indicated by the greater amounts of green and yellow in Fig. 10. This suggests D2 provides less contrast between diseased and normal tissue under our experimental conditions. Despite the decreased discrimination, the D2 classifier's performance was, at the 50% cut probability, consistent with other morphological classifiers. In addition, the seemingly poor performance of D1 and D2 classifiers should not be interpreted as a defect or a suggestion that these parameters are inferior to other morphological methods; it is rather a comment on the method by which the classifier was constructed and the definition of disease. It is clear from the structure of the D1 and D2 distributions that given more samples to perform the same analysis as Eq. 7 that the improvement for classifier's performance, particularly for D2, would be dramatic.
The ADC probability maps showed a uniformly high likelihood of all voxels coming from a diseased animal. The R probability maps were also relatively uniform and showed an intermediate to high probability of disease. The difference between morphology and imaging results in the bottom right lung field area probably comes from sampling error and tissue distortion from the processing, as the lung does not appear entirely normal in this area. We note a qualitative difference between the center of the lung and the periphery in both the ADC and R maps and attribute this to an imaging artifact. The combined map definitively shows that the entire lung appears to come from a diseased animal. The probability map showed similar contrast to the individual ADC and R maps, but it had several advantages. First, it simplified assimilation of the data by combining two measurements into a single parameter. Second, the combined probability parameter was a better classifier for disease than the individual parameters upon which it is based. Finally, and we speculate most importantly, it is an intuitive measure, unlike R and ADC, which makes it attractive from a clinical standpoint. We note in passing that the performance of the classifier, i.e., the sensitivity, specificity, and classification performance, can be adjusted based on the regression model and the results shown in Figs. 4 and 7.
In conclusion, we have verified our hypothesis that under experimental conditions, a combination of functional and structural HP 3He MRI measures is better at classifying disease than functional and structural parameters alone. Our animal study was designed to prototype a clinical trial. Our results suggest that such a trial would benefit from a large population so as to accurately estimate the distributions of HP imaging parameters. Such a study would also benefit from a secondary source of information that would allow us to discern whether certain regions of the lung were definitively normal. Coregistered MRI/computed tomography images may be useful for such a study. As the regression models improve, we believe that the probability maps generated from these improved models may become an important method for displaying HP MRI data due to simplicity in interpretation of the study results.
This work was supported by National Institutes of Health grants R01-HL-064741, R01-HL-077241, P50-HL-084945, and R21-EB-008173.
No conflicts of interest, financial or otherwise, are declared by the author(s).
Author contributions: M.I., K.E., Y.X., N.K., and S.K. analyzed data; M.I., K.E., S.K., and R.R.R. interpreted results of experiments; M.I., K.E., Y.X., and S.K. prepared figures; M.I., K.E., Y.X., and S.K. drafted manuscript; M.I., K.E., Y.X., A.B., S.K., and R.R.R. edited and revised manuscript; M.I., K.E., S.K., and R.R.R. approved final version of manuscript; K.E., C.J.K., G.A.L., S.K., P.L.P., and R.R.R. conception and design of research; K.E., Y.X., C.J.K., G.A.L., E.C., J.P.M.-W., J.Z., S.P., and P.L.P. performed experiments.
GlaxoSmithKline provided the diseased and control animals. Imaging and tissue processing was performed at the University of Pennsylvania. Data analysis was performed at Johns Hopkins University and the University of Pennsylvania.
Authors acknowledge the support of the Small Animal Imaging Facility of the Department of Radiology, University of Pennsylvania. Authors also thank Harmony Li for thorough proofreading and editing of the manuscript.
- Copyright © 2012 the American Physiological Society