The Use of Singular Value Variance-Decomposition Proportions in FT-IR Analysis of Gases and Vapors

APPLIED SPECTROSCOPY, Volume 46, Number 5, 1992, p. 800. 0003-7028/92/4605-0800$2.00/0. © 1992 Society for Applied Spectroscopy

STUART BATTERMAN
School of Public Health, University of Michigan, Ann Arbor, Michigan 48109-2029

Issues in the application of spectral least-squares methods to Fourier transform infrared spectrophotometry are discussed with emphasis on collinearity problems in vapor and gas reference spectra. Excessive collinearity may degrade the accuracy and reliability of prediction. Several methods that detect and diagnose collinearity are developed and tested with the use of numerical experiments. The condition number is based on the singular values in the reference spectra matrix and provides a stable and composite measure of collinearity. Variance-decomposition proportions and auxiliary regressions identify the spectra that form dependencies. Application of the methods to spectra of common vapors and gases shows complex and potentially degrading dependencies that would not be seen by examining correlation coefficients or other statistics measuring only pairwise dependencies. A method to estimate the precision of results for specified signal-to-noise ratios and degree of collinearity is evaluated.

Index Headings: Collinearity; Fourier transform infrared spectrometry; Least-squares methods; Trace gas analysis.

INTRODUCTION

The rapid evolution of Fourier transform infrared (FT-IR) spectrometry hardware and software has encouraged new applications such as near real-time environmental and occupational air monitoring.[1-3] Accurate identification and quantification of compounds may be difficult in some of these applications, e.g., monitoring of relatively complex mixtures of vapors and gases. Sophisticated analyses can aid the identification and quantification of compounds deduced from IR spectra. Several investigators are pursuing the development of automated and versatile systems for this purpose.[4,5]

This paper develops procedures that diagnose collinearity in sets of IR spectra. Collinearity, instrument noise, and limited spectral resolution may degrade predictions based on the interpretation of IR spectra, especially results based on least-squares procedures. Methods are needed to anticipate and minimize these problems. Previous work closest to that developed here used principal components analyses to select subsets of spectra libraries[4] and search prefilters.[6] Also, orthonormalized reference spectra have been used to directly identify substances.[7] The method detailed here detects and diagnoses degrading collinearity in reference spectra and can be used in any least-squares procedure. While techniques such as examination of covariance matrices and eigenvectors may be seen more frequently, the singular value variance-decomposition proportion technique can more successfully assess the degree and effects of collinearity.[8] This information permits the selection of reference spectra that avoid unacceptable collinearity. Also, it provides an a priori estimate of the accuracy of ordinary least-squares (OLS) predictions.

THEORY

OLS Procedures and Pitfalls. A variety of identification and fitting approaches are used to interpret IR spectra. Ordinary least-squares procedures, for instance, permit simultaneous identification and quantification of compounds. OLS regresses the test spectrum (the dependent variable) against a set of reference spectra (the independent variables).[9] Detected compounds have regression coefficients that are statistically significant. Iterative least-squares methods may be used to improve the fit.[10] Since spectral least-squares methods use the shape of absorption features as well as peak locations, sensitivity and selectivity are enhanced as compared with results for simple peak-matching techniques.
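The regression described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the three Gaussian bands are hypothetical stand-ins for reference spectra, the helper name `ols_fit` is invented here, and the fit omits an intercept; none of this is the paper's own program.

```python
import numpy as np

def ols_fit(X, y):
    """Regress a test spectrum y on reference spectra (columns of X), no intercept."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (n - p)            # residual variance
    cov = sigma2 * np.linalg.inv(X.T @ X)       # covariance of the estimates
    se = np.sqrt(np.diag(cov))                  # standard errors
    return beta, se

# Hypothetical data: 301 points from 600 to 1200 cm^-1 (2 cm^-1 spacing).
wn = np.linspace(600.0, 1200.0, 301)

def band(center, width):
    """Gaussian absorption band used as a placeholder reference spectrum."""
    return np.exp(-0.5 * ((wn - center) / width) ** 2)

X = np.column_stack([band(700, 15), band(900, 25), band(1050, 20)])
true_conc = np.array([2.0, 0.0, 1.0])           # second compound is absent
rng = np.random.default_rng(0)
y = X @ true_conc + rng.normal(0.0, 0.01, wn.size)

beta, se = ols_fit(X, y)
t_stat = beta / se                              # large |t| suggests a detected compound
```

A compound would be reported as detected when its coefficient is statistically significant, for example when the magnitude of its t statistic is well above 2.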

Many factors can restrict the accuracy and reliability of spectral least-squares or other analysis methods. These include the following: (1) Wide, overlapping, and nonunique absorbance peaks that lead to collinearity between spectra. (2) Necessary trade-offs between spectral resolution and the instrument's signal-to-noise ratio, speed, and cost. (3) The need for reference spectra (e.g., from pure compounds), which must be available and must match measured spectra. In practice, however, spectra of the same substance rarely match exactly, due to changes in external conditions and instrument noise.[6] (4) The presence of high-concentration gases and vapors (e.g., CO2, H2O, and CH4) that may obscure portions of the IR band and absorption features of trace compounds. Additional factors that may bias OLS predictions include (5) omission of any (unknown) components from the reference spectra set; (6) baseline drift and other instrument errors; (7) nonlinearities in the Beer-Lambert relationship; and (8) changes in environmental variables such as temperature, pressure, humidity, the presence of other gases, etc. Collinearity, noise, mismatches between reference and test spectra, and instrument errors can seriously diminish the accuracy of FT-IR results.

Focusing on the issue of collinearity, spectral least-squares methods require that each compound have unique absorption spectra; that is, IR spectra must be linearly independent. Collinearity may be caused by pairwise dependencies involving two spectra, for example, the use of two spectra (perhaps at different concentrations) of the same compound in the reference spectra set. This situation may degrade OLS predictions for some or possibly all compounds, causing errors and inflated variances in the concentrations predicted. Additionally, it may decrease the goodness of fit as measured by the coefficient of determination or R², and the robustness of predictions may be diminished (e.g., results may change greatly from small perturbations in the data). Correlation or covariance matrices can identify pairwise dependencies. Three or more spectra also can cause dependencies. For example, compound A may have two characteristic absorption peaks. Compound B may have a feature with the same shape and frequency as one of A's peaks, while compound C may resemble the second peak. OLS results may indicate only compound A, compounds B and C, or some combination of the three compounds. Again, collinearity has seriously degraded OLS performance. This situation may escape detection by pairwise measures of dependencies, e.g., correlation coefficients would not indicate an unusual degree of collinearity. Collinearity problems will intensify as the number of reference spectra increases (especially with spectra that have broad features rather than unique absorption lines), as the spectral resolution decreases, and as the spectral bandwidth decreases.
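The three-compound case can be made concrete with an idealized sketch. The spectra below are hypothetical Gaussian bands, and A is constructed as exactly B + C, which is the extreme form of the situation described in the text; real spectra would only approximate it.

```python
import numpy as np

wn = np.linspace(600.0, 1200.0, 301)      # wavenumber grid, cm^-1

def band(center, width=15.0):
    """Hypothetical Gaussian absorption band."""
    return np.exp(-0.5 * ((wn - center) / width) ** 2)

B = band(750.0)                            # matches A's first peak
C = band(1000.0)                           # matches A's second peak
A = B + C                                  # exact three-way dependency

X = np.column_stack([A, B, C])
r = np.corrcoef(X, rowvar=False)           # pairwise Pearson correlations
mu = np.linalg.svd(X, compute_uv=False)    # singular values, descending
rank = np.linalg.matrix_rank(X)
```

In this construction the pairwise correlation between A and B is only moderate, yet the matrix has rank 2 and its smallest singular value is numerically zero, so a pairwise screen would miss a dependency that the singular values expose.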

Singular Values. Collinearity in reference spectra may be determined in several ways; however, the singular value and variance-decomposition proportion procedure has important theoretical and practical advantages. Details concerning the derivation of this procedure can be found in Belsley et al.,[6] and only the most important aspects are mentioned here. Singular value decomposition (SVD) decomposes and diagonalizes a matrix $X$ as

$$X = U D V^{T} \tag{1}$$

where $U^{T}U = V^{T}V = I$, and $D$ is diagonal with nonnegative diagonal elements, $\mu_k$, called the singular values of $X$. In spectral least-squares applications, columns of $X$ are the reference spectra. Any matrix $X$ can be decomposed to provide information that encompasses the eigensystem of $X^{T}X$. Singular values (SV) resemble eigenvalues in showing the magnitude of the orthogonal components of the matrix. However, SVD has several advantages including (1) it applies directly to $X$, not $X^{T}X$; (2) SVs are computed differently and with much greater numerical stability than eigenvalues; and (3) SVs are utilized in the variance-decomposition procedure shown below that identifies sources of collinearity.
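As a quick numerical check of Eq. 1, the sketch below decomposes a placeholder matrix with NumPy and compares the squared singular values with the eigenvalues of XᵀX. The random matrix stands in for four reference spectra and is purely an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.random((301, 4))                   # placeholder for 4 reference spectra

# Thin SVD: X = U @ diag(mu) @ Vt, singular values in descending order.
U, mu, Vt = np.linalg.svd(X, full_matrices=False)
recon = U @ np.diag(mu) @ Vt

# Eigenvalues of X^T X (ascending from eigvalsh, reversed to descending).
eig = np.linalg.eigvalsh(X.T @ X)[::-1]
```

The squared singular values equal the eigenvalues of XᵀX, but the SVD obtains them without ever forming XᵀX.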

Condition Index and Condition Number. The condition index (CI) $\eta_k$ is

$$\eta_k = \mu_{\max} / \mu_k, \qquad k = 1, \ldots, p \tag{2}$$

where $\mu_k$ is the kth singular value, $\mu_{\max}$ is the maximum singular value, and $p$ is the number of columns (spectra) in $X$. The number of CIs that are large, e.g., exceeding 10 to 30, represents the number of strong dependencies in the reference spectra. The ratio $\eta_k$ shows the relative strength or degree of orthogonality of the kth component of $X$.

The condition number $\kappa$ is a composite measure of collinearity in the matrix:

$$\kappa = \mu_{\max} / \mu_{\min}. \tag{3}$$

Condition numbers (CN) that are large, e.g., above 30, indicate sensitivity to numerical problems in the inversion of the $X^{T}X$ matrix needed to solve the least-squares problem. Few inversion routines, for example, will be able to invert a matrix if $\kappa \ge 10^{2}$, and most routines will indicate that the matrix is singular. While this is an extreme case, even apparently minor computational problems may limit the accuracy of the OLS solutions. Such computation problems can never be fully eliminated.[6]
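A minimal sketch of Eqs. 2 and 3 follows. It assumes the columns are scaled to unit length before the decomposition (the option the Computations section says was used for reported results); the function name `condition_indices` and the toy matrices are hypothetical.

```python
import numpy as np

def condition_indices(X):
    """Return eta_k = mu_max / mu_k for the unit-length-scaled columns of X."""
    Xs = X / np.linalg.norm(X, axis=0)          # unit-scale each column
    mu = np.linalg.svd(Xs, compute_uv=False)    # descending singular values
    return mu[0] / mu

# Orthogonal columns: every condition index equals 1.
eta_orth = condition_indices(np.eye(5)[:, :3])

# Two nearly parallel columns: one large condition index.
x1 = np.array([1.0, 0.00, 0.0, 0.0, 0.0])
x2 = np.array([1.0, 0.01, 0.0, 0.0, 0.0])
eta_near = condition_indices(np.column_stack([x1, x2]))
kappa_near = eta_near.max()                     # condition number, Eq. 3
```

For the nearly parallel pair the condition number is about 200, far above the 10 to 30 range that the text treats as large.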

The CN represents a multiplicative factor indicating the effect of imprecision in the data on the accuracy of OLS predictions. Given data in reference and unknown spectra known to $d$ significant digits and $\kappa = 10^{r}$, a small change in the data (e.g., in the last or dth digit) can affect the solution in the $(d - 2r)$th place.[8] As an example, assume that spectral absorbances are known to three significant places, implying a signal-to-noise (S/N) ratio of $10^{3}$. With $\kappa = 10$, only the first $(3 - 2 = 1)$ place of the results is trustworthy. An S/N ratio of $10^{4}$ is required to obtain two significant places. With $\kappa = 10^{2}$, none of the digits could be trusted. Obtaining two significant places requires an unattainable S/N ratio of $10^{6}$.
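The digit-counting rule can be restated as a small helper. The function name is hypothetical, and clipping negative results to zero is an assumption made here to express "none of the digits could be trusted".

```python
import math

def trustworthy_digits(d, kappa):
    """Digits left after collinearity: d - 2r, where kappa = 10**r (floored at 0)."""
    r = math.log10(kappa)
    return max(0.0, d - 2.0 * r)

# The four cases worked in the text (d = log10 of the S/N ratio).
cases = {
    (3, 10): trustworthy_digits(3, 10),     # S/N = 10^3, kappa = 10
    (4, 10): trustworthy_digits(4, 10),     # S/N = 10^4, kappa = 10
    (3, 100): trustworthy_digits(3, 100),   # S/N = 10^3, kappa = 10^2
    (6, 100): trustworthy_digits(6, 100),   # S/N = 10^6, kappa = 10^2
}
```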

Variance-Decomposition Proportion. The variance-decomposition proportion $\Pi_{jk}$ is the fraction of the variance of the kth regression coefficient associated with the jth component (or singular value) of $X$. These proportions are calculated as

$$\Pi_{jk} = \phi_{kj} / \phi_k, \qquad j = 1, \ldots, p; \quad k = 1, \ldots, p \tag{4}$$

$$\phi_k = \sum_{j=1}^{p} \phi_{kj}, \qquad k = 1, \ldots, p \tag{5}$$

$$\phi_{kj} = (v_{kj} / \mu_j)^{2}, \qquad j = 1, \ldots, p; \quad k = 1, \ldots, p \tag{6}$$

where $v_{kj}$ are elements of $V$ in Eq. 1. Condition indices $\eta_k$ and variance proportions $\Pi_{jk}$ are tabulated in a variance proportion table (shown later) to aid interpretation.
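The proportions can be computed directly from the SVD. The sketch below follows the standard Belsley-style definition, in which the variance of coefficient k is proportional to the sum over components j of v_kj²/μ_j²; unit-length column scaling, the function name, and the toy matrix are assumptions made for illustration.

```python
import numpy as np

def variance_decomposition(X):
    """Return condition indices eta and the table Pi[j, k] (rows: components)."""
    Xs = X / np.linalg.norm(X, axis=0)             # unit-scale columns
    _, mu, Vt = np.linalg.svd(Xs, full_matrices=False)
    V = Vt.T                                       # V[k, j] = v_kj
    phi = (V / mu) ** 2                            # phi[k, j] = (v_kj / mu_j)^2
    phi_k = phi.sum(axis=1)                        # total for coefficient k
    Pi = (phi / phi_k[:, None]).T                  # Pi[j, k] = phi_kj / phi_k
    return mu[0] / mu, Pi

# Toy example: columns 1 and 2 nearly parallel, column 3 orthogonal to both.
X_demo = np.array([
    [1.0, 1.00, 0.0],
    [0.0, 0.01, 0.0],
    [0.0, 0.00, 1.0],
    [0.0, 0.00, 0.0],
])
eta, Pi = variance_decomposition(X_demo)
```

Each column of `Pi` sums to one. In the toy example the row with the largest condition index carries nearly all the variance of the first two coefficients and none of the third, which is the signature of a dependency between the first two columns.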

The variance-decomposition proportions and CIs together detect collinearity that may degrade regression estimates. Two conditions are needed: (1) There must be a SV with a high CI, greater than 10 to 30. (2) The same CI must contain high variance-decomposition proportions, each greater than about 0.5, for at least two regression coefficients. These conditions hold for a "dominating" dependency, i.e., one strong dependency as indicated by only one large CI. Several large CIs of similar magnitude show "competing" dependencies. Here, variance-decomposition proportions must be summed over the large CIs. If several sums are large, over about 0.5, degrading collinearity may be present. For competing dependencies, auxiliary regressions can identify spectra forming dependencies. These procedures are demonstrated below.

EXPERIMENTAL

Data Sets. The diagnostic procedures are tested on three data sets of increasing complexity (Table I). Set A contains seven compounds and is based on the "BTEX" compounds often found as environmental contaminants. Set B includes 15 compounds, the BTEX compounds plus three alcohols, three aliphatic hydrocarbons, and two aromatic hydrocarbons. Set C includes 41 spectra similar to those used by Wythoff et al.[5] that represent materials common in industrial and commercial settings. This set also contains the BTEX compounds. A spectral range in the "fingerprint" region and below water vapor absorption, 1200 to 600 cm⁻¹, is used at 2 cm⁻¹ resolution, resulting in 301 absorbance values for each spectrum. Set C was also used at 8 cm⁻¹ resolution. Block averaging was used to degrade resolution. Spectra were obtained with a Digilab FTS-40 spectrometer equipped with a wide-band, N2-cooled, mercury-cadmium-telluride detector with a resolution of 0.5 cm⁻¹ (Infrared Analysis, Inc., Anaheim, CA). Pure vapors or gases in nitrogen or air were measured with the use of a single-pass 10-cm cell.

TABLE I. Reference compounds and concentration-pathlength products used in study.

Set A

| No. | Vapor or gas | ppm-m | No. | Vapor or gas | ppm-m |
|---|---|---|---|---|---|
| 1 | Benzene | 27 | 5 | Isopropanol | 980 |
| 2 | Cyclohexane | 221 | 6 | o-Xylene | 496 |
| 3 | Ethylbenzene | 496 | 7 | Toluene | 496 |
| 4 | n-Hexane | 171 | | | |

Set B

| No. | Vapor or gas | ppm-m | No. | Vapor or gas | ppm-m |
|---|---|---|---|---|---|
| 1 | Benzene | 27 | 9 | Ethanol | 600 |
| 2 | n-Butane | 99 | 10 | Ethylbenzene | 496 |
| 3 | Cyclohexane | 221 | 11 | n-Hexane | 171 |
| 4 | Isobutanol | 141 | 12 | Methanol | 27 |
| 5 | Isobutane | 640 | 13 | o-Xylene | 496 |
| 6 | Isopropanol | 980 | 14 | p-Xylene | 496 |
| 7 | m-Xylene | 496 | 15 | Toluene | 496 |
| 8 | Ethane | 900 | | | |

Set C

| No. | Vapor or gas | ppm-m | No. | Vapor or gas | ppm-m |
|---|---|---|---|---|---|
| 1 | Water | 705 | 22 | Trichlorofluoromethane (F-11) | 86 |
| 2 | Ozone | 356 | 23 | Dichlorofluoromethane (F-12) | 26 |
| 3 | 1,1,1-Trichloroethane | 131 | 24 | Chlorotrifluoromethane (F-13) | 43 |
| 4 | 1,1-Dichloroethane | 474 | 25 | Carbon tetrafluoride (F-14) | 13 |
| 5 | 1,2-Dibromoethane | 210 | 26 | Furan | 133 |
| 6 | 1,2-Dichloroethane | 210 | 27 | Hexane | 171 |
| 7 | Acetonitrile | 680 | 28 | Isopropanol | 980 |
| 8 | Acrylonitrile | 652 | 29 | Methyl acetate | 469 |
| 9 | Acetaldehyde | 300 | 30 | Methylene chloride | 194 |
| 10 | Acetone | 188 | 31 | Methyl ethyl ketone | 762 |
| 11 | 1,3-Butadiene | 615 | 32 | Methyl vinyl ketone | 474 |
| 12 | Benzene | 27 | 33 | o-Dichlorobenzene | 153 |
| 13 | Chlorobenzene | 170 | 34 | o-Xylene | 496 |
| 14 | Chloroform | 170 | 35 | Propylene oxide | 700 |
| 15 | Carbon tetrachloride | 282 | 36 | Styrene | 423 |
| 16 | Cyclohexane | 221 | 37 | Tetrachloroethylene | 186 |
| 17 | Cyclopentene | 282 | 38 | 1,1,1,2-Tetrachloroethane | 536 |
| 18 | Dimethylsulfide | 300 | 39 | Toluene | 496 |
| 19 | Ethylbenzene | 496 | 40 | Trichloroethylene | 282 |
| 20 | Ethyl ether | 267 | 41 | Vinyl chloride | 564 |
| 21 | Ethyl oxide | 700 | | | |

Examples use a synthetic spectrum that represents a mixture of set A compounds. This spectrum is a linear combination of pure spectra adjusted to 10 ppm-m. Results, however, depend only on the relative (not absolute) levels of the gases. To deliberately degrade the signal-to-noise ratio, one adds random Gaussian noise to pure spectra. The noise has a mean of zero and a standard deviation equal to 1% of the range of absorptions found in the pure spectra from 1200 to 600 cm⁻¹.
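The noise recipe can be sketched as follows. The Gaussian bands are hypothetical stand-ins for pure spectra, the equal mixture weights are arbitrary, and `add_noise` is a name invented here; the original work used Array Basic programs rather than this code.

```python
import numpy as np

def add_noise(spectrum, fraction, rng):
    """Add zero-mean Gaussian noise with sd = fraction * (max - min) of the spectrum."""
    sd = fraction * (spectrum.max() - spectrum.min())
    return spectrum + rng.normal(0.0, sd, size=spectrum.shape), sd

wn = np.linspace(600.0, 1200.0, 301)             # 1200 to 600 cm^-1 at 2 cm^-1
pure = np.column_stack(
    [np.exp(-0.5 * ((wn - c) / 20.0) ** 2) for c in (680.0, 900.0, 1100.0)]
)

# Synthetic mixture: a linear combination of the pure spectra.
mixture = pure @ np.array([1.0, 1.0, 1.0])

# Degrade one pure spectrum with 1% noise (S/N of about 100).
rng = np.random.default_rng(42)
noisy, sd = add_noise(pure[:, 0], 0.01, rng)
noise = noisy - pure[:, 0]
```

Passing 0.02, 0.05, or 0.10 as `fraction` would give the other noise levels used later in the study.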

Computations. An SVD program was written in FORTRAN based on code provided by Press.[11] The program calculates CIs and variance-decomposition proportions, and includes options to select, center, and unit scale spectra. Results reported use only the last of these options, as generally recommended.[2] An OLS program was written in SpectraCalc Array Basic (Galactic Industries, Salem, NH). To minimize numerical errors, the program uses double-precision arithmetic and centers and unit scales matrices and vectors. Array Basic programs also were used to manipulate spectra, e.g., adding noise and adjusting resolution. Some regressions and correlations were performed with Systat (Systat Inc., Evanston, IL).

RESULTS

Set A. The first example demonstrates the diagnostic procedure using set A. Pearson correlation coefficients in Table II show dependencies between pairs of spectra. The highest correlation coefficients, 0.61 and 0.62, are found between cyclohexane and n-hexane, and between toluene and ethylbenzene, respectively. These coefficients indicate that 37 to 39% of the variance, a small amount, is shared. These correlations imply that regression results are unlikely to be degraded by collinearity.

TABLE II. Pearson correlation coefficients for set A compounds.

| Compound | Benzene | Cyclohexane | Ethylbenzene | n-Hexane | Isopropanol | o-Xylene | Toluene |
|---|---|---|---|---|---|---|---|
| Benzene | 1.000 | | | | | | |
| Cyclohexane | -0.232 | 1.000 | | | | | |
| Ethylbenzene | 0.127 | -0.372 | 1.000 | | | | |
| n-Hexane | -0.261 | 0.607 | -0.051 | 1.000 | | | |
| Isopropanol | -0.155 | 0.243 | -0.277 | 0.180 | 1.000 | | |
| o-Xylene | -0.047 | -0.141 | 0.380 | 0.171 | -0.138 | 1.000 | |
| Toluene | 0.063 | -0.219 | 0.623 | 0.202 | -0.203 | 0.470 | 1.000 |

TABLE III. Singular value variance-decomposition proportions for set A.

| No. | Cond. index | Benzene | Cyclohexane | Ethylbenzene | n-Hexane | Isopropanol | o-Xylene | Toluene |
|---|---|---|---|---|---|---|---|---|
| 1 | 1.0 | 0.000 | 0.017 | 0.001 | 0.128 | 0.006 | 0.000 | 0.001 |
| 2 | 2.5 | 0.020 | 0.109 | 0.057 | 0.106 | 0.055 | 0.033 | 0.041 |
| 3 | 3.2 | 0.117 | 0.340 | 0.008 | 0.153 | 0.019 | 0.094 | 0.049 |
| 4 | 4.2 | 0.631 | 0.257 | 0.001 | 0.299 | 0.133 | 0.026 | 0.002 |
| 5 | 6.6 | 0.001 | 0.103 | 0.717 | 0.019 | 0.000 | 0.039 | 0.763 |
| 6 | 4.6 | 0.200 | 0.116 | 0.003 | 0.282 | 0.724 | 0.126 | 0.027 |
| 7 | 5.0 | 0.031 | 0.059 | 0.213 | 0.013 | 0.064 | 0.681 | 0.117 |

Table III shows the variance-decomposition proportions table for set A. CIs range from 1.0 to 6.6; the largest is the CN of this seven-compound matrix. Variance-decomposition proportions apportion the variance of regression estimates for each compound among the SVs; thus each column sums to unity. For example, 71.7% of the variance of the ethylbenzene estimate is associated with the fifth SV, which has the largest CI. Variance-decomposition results depend only on the reference spectra, not the test spectrum. If the reference spectra were orthogonal, the variance-decomposition matrix would be an identity matrix.

Dominating dependencies that cause collinearity problems require one large CI and two or more proportions in the same row that exceed 0.5. Competing dependencies require that proportions be summed across rows with large CIs, and the sum must exceed 0.5. The values in Table III suggest only weak dependencies among the spectra. SVs 4 to 7 are comparable and show that most of the spectra are involved in multiple but weak dependencies. However, degrading collinearity is not present.

Table III reflects the correlation structure seen in Table II. For example, the fifth and largest CI is attributable to shared dependencies between toluene and ethylbenzene (correlation coefficient r = 0.62). Cyclohexane and n-hexane are involved in multiple dependencies that include benzene and isopropanol as seen by relatively high decomposition proportions for CIs 4, 5, and 6.
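The decision rule can be applied mechanically to the Table III values. In the sketch below the numbers are transcribed from Table III, while the function name and the choice of 10 as the default CI threshold (the lower end of the 10 to 30 range) are assumptions for illustration.

```python
import numpy as np

names = ["Benzene", "Cyclohexane", "Ethylbenzene", "n-Hexane",
         "Isopropanol", "o-Xylene", "Toluene"]

# Condition indices and proportions from Table III (one row per singular value).
eta = np.array([1.0, 2.5, 3.2, 4.2, 6.6, 4.6, 5.0])
Pi = np.array([
    [0.000, 0.017, 0.001, 0.128, 0.006, 0.000, 0.001],
    [0.020, 0.109, 0.057, 0.106, 0.055, 0.033, 0.041],
    [0.117, 0.340, 0.008, 0.153, 0.019, 0.094, 0.049],
    [0.631, 0.257, 0.001, 0.299, 0.133, 0.026, 0.002],
    [0.001, 0.103, 0.717, 0.019, 0.000, 0.039, 0.763],
    [0.200, 0.116, 0.003, 0.282, 0.724, 0.126, 0.027],
    [0.031, 0.059, 0.213, 0.013, 0.064, 0.681, 0.117],
])

def flag_dependencies(eta, Pi, names, ci_threshold=10.0, prop_threshold=0.5):
    """Sum proportions over rows with large CIs; report compounds whose sum is high."""
    large = eta >= ci_threshold
    if not large.any():
        return []                                  # no large CI, no degrading collinearity
    summed = Pi[large].sum(axis=0)
    involved = [n for n, s in zip(names, summed) if s > prop_threshold]
    return involved if len(involved) >= 2 else []  # a dependency needs two or more spectra

column_sums = Pi.sum(axis=0)                       # each should be close to 1
flagged = flag_dependencies(eta, Pi, names)        # default threshold of 10
weak = flag_dependencies(eta, Pi, names, ci_threshold=6.0)
```

With the default threshold nothing is flagged, in line with the conclusion that degrading collinearity is absent in set A. Lowering the threshold to 6 isolates the fifth row and returns ethylbenzene and toluene, the weak dependency discussed above.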

Auxiliary Regressions. Dependencies may be identified with the use of auxiliary regressions in which each compound's spectrum is regressed against all other spectra. These regressions must be performed for each variable in the set. A forward stepwise procedure (with a maximum type I error of $\alpha = 0.15$ to enter and exit variables) is used. As an example, the ethylbenzene (EB) spectrum can be predicted as a linear combination of the cyclohexane (CH), isopropanol (IP), o-xylene (OX), and toluene (TO) spectra:

$$\mathrm{EB} = 0.001 - 2.216\,\mathrm{CH} - 0.076\,\mathrm{IP} + 0.093\,\mathrm{OX} + 0.468\,\mathrm{TO}, \qquad R^{2} = 0.465. \tag{7}$$

All coefficients (except the intercept) are significant at the 95% confidence level. These results identify variables involved in dependencies that may not be clearly indicated by the variance-decomposition proportions. The R² shows that 47% of the variance in the ethylbenzene spectrum is explained by the four spectra. An R² near unity indicates strong dependencies, which can degrade OLS results. If ethylbenzene is regressed against all 41 spectra in set C, 22 variables are significant predictors and 95% of the variance is explained, a large amount that implies undesirable collinearity. As will be shown, this degree of collinearity is sufficient to degrade predictions, especially with small signal-to-noise ratios.
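A simplified version of the auxiliary regressions is sketched below. It regresses each column on all of the others with an intercept, rather than using the forward stepwise selection described in the text, so it illustrates the idea and does not reproduce the paper's procedure. The random columns are hypothetical data.

```python
import numpy as np

def auxiliary_r2(X):
    """R^2 from regressing each column of X on all other columns plus an intercept."""
    n, p = X.shape
    r2 = np.empty(p)
    for k in range(p):
        y = X[:, k]
        others = np.delete(X, k, axis=1)
        A = np.column_stack([np.ones(n), others])   # intercept + remaining spectra
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        resid = y - A @ coef
        r2[k] = 1.0 - (resid @ resid) / ((y - y.mean()) ** 2).sum()
    return r2

# Hypothetical data: column 3 is nearly the sum of columns 1 and 2; column 4 is unrelated.
rng = np.random.default_rng(3)
c1 = rng.normal(size=301)
c2 = rng.normal(size=301)
c3 = c1 + c2 + 0.05 * rng.normal(size=301)
c4 = rng.normal(size=301)
r2 = auxiliary_r2(np.column_stack([c1, c2, c3, c4]))
```

The three columns that take part in the dependency all show R² near unity, while the unrelated column stays near zero, which is how the auxiliary regressions point to the spectra involved.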

To illustrate results from the auxiliary regressions, Fig. 1 shows ethylbenzene, cyclohexane, o-xylene, and toluene spectra, and two predictions of the ethylbenzene spectrum. Prediction 1 uses Eq. 7 and poorly matches the ethylbenzene peak at 680 cm⁻¹. None of the other six spectra in set A have strong absorptions at this wavenumber; the closest are o-xylene and toluene. A poor fit is highly desirable; an excellent match indicates strong and potentially degrading collinearity. Prediction 2 uses the 22 spectra selected from set C. The location and shape of the major ethylbenzene peak at 680 cm⁻¹ are matched, as are minor features at 1000-1100 cm⁻¹, although some noise is present.

TABLE IV. Summary of auxiliary regression results for set A compounds. Constant is excluded from variable count.

| Group | Statistic | Benzene | Cyclohexane | Ethylbenzene | n-Hexane | Isopropanol | o-Xylene | Toluene |
|---|---|---|---|---|---|---|---|---|
| Set A | R² | 0.09 | 0.52 | 0.47 | 0.52 | 0.11 | 0.25 | 0.51 |
| Set A | No. variables | 3 | 4 | 4 | 5 | 3 | 4 | 4 |
| Set B | R² | 0.32 | 0.56 | 0.56 | 0.56 | 0.31 | 0.31 | 0.57 |
| Set B | No. variables | 4 | 9 | 6 | 6 | 7 | 5 | 9 |
| Set C at 2 cm⁻¹ | R² | 0.17 | 0.67 | 0.95 | 0.81 | 0.87 | 0.81 | 0.87 |
| Set C at 2 cm⁻¹ | No. variables | 5 | 13 | 22 | 15 | 21 | 16 | 12 |
| Set C at 8 cm⁻¹ | R² | 0.77 | 0.95 | 0.95 | 0.90 | 0.91 | 0.98 | 0.98 |
| Set C at 8 cm⁻¹ | No. variables | 8 | 8 | 8 | 15 | 26 | 18 | 8 |

Auxiliary regressions for all set A compounds have low values of R² (Table IV). These values and small CIs indicate collinearity problems will not degrade OLS predictions with the use of set A spectra. Thus, these 2 cm⁻¹ resolution spectra can discriminate and quantify BTEX levels, assuming that other OLS assumptions are met (e.g., no other gases or vapors are present).

To test the effect of collinearity and noise on OLS estimates, one regresses the synthetic spectrum (repre- senting a 10 ppm-m mixture of the seven BTEX com- pounds with 1% noise) against the reference spectra. OLS results show that six components are estimated accurately (column S/N = ~ in Table V). Hexane is overpredicted by 21%, an error resulting from the added noise. The reference spectra then were contaminated by 1, 2, 5, and 10% noise, corresponding to S/N ratios of 100, 50, 20, and 10, respectively. The same synthetic spectrum was regressed against the degraded reference spectra. Several trends are apparent. As the S/N ratio

=o

o

Toluene

o-Xylene

oloh xan

Predict ion

Predict ion

, , ~Ethy lb2~nzene

600 650 700 750 800 850 900 9SO 1000

Wavenumbers (cm-1)

Fro. 1. Spectral absorbances of ethylbenzene, o-xylene, isopropanol and toluene, and two predictions of ethylbenzene.

decreases, OLS predictions differ from the true values; the variance of the predictions increases and thus their statistical significance decreases; and the R 2 decreases (Table V). Results are significantly degraded for hexane

TABLE V. OLS results for set ,4 compounds for various data sets at specified signal-to-noise ratios. Asterisk denotes statistically insignificant coefficient (P > 0.05).

S/N = ~ S /N = 100 S /N = 50 S /N = 20 S /N = 10

Set A Benzene 10.0 _+ 0.0 Cyclohexane 9.8 _+0.4 E thy lbenzene 10.1 _+ 0.1 Hexane 12.1 _+ 1.2 I sopropanol 10.2 _+ 0.1 o-Xylene 10.0 _+0.1 To luene 9.8 _+0.1

Average 10.3 _+ 0.3 R 2 0.999

Set B Benzene 10.0 -+0.0 Cyclohexane 9.9 _+ 0.4 E thy lbenzene 10.1 _+ 0.1 Hexane 11.5 _+ 1.2 I sopropanol 10.2 _+0.1 o-Xylene 10.0 _+0.1 To luene 9.9 _+ 0.1

Average 10.2 _+0.3 R 2 0.999

Set C a t 2 cm -1

Benzene 10.0 -+0.0 Cyclohexane 9.3 _+0.5 E thy lbenzene 10.3 _+0.3 Hexane 9.9 _+ 1.9 I sopropanol 10.1 _+0.3 o-Xylene 10.2 _+0.1 To luene 10.0 _+0.2

Average 10.0 _+ 0.5 R 2 0.999

Set C a t 8 c m - '

Benzene 9.9 _+0.1 Cyclohexane 10.6 _+ 2.0 E thy lbenzene 9.7 _+0.6 Hexane 13.6 -+ 3.5 I sopropanol 10.3 _+0.4 o-Xylene 9.5 _+0.6 To luene 11.3 _+ 0.6

Average 10.7 _+ 1.1 R 2 1.000

10.0 - 0 . 0 9.9 _+0.1 9.8 _0 .1 8.7 _+0.2 9.8 _+0.6 9.5 _+1.0 6.9 _+2.3 4.5* _+3.9

10.2 -+_0.1 10.2 _+0.2 9.7 -+0.5 10.6 _+0.9 13.3 _+1.8 9.3 _+2.8 14.6" _+6.9 - 1 . 5 " _+11.3 10.0 _+0.1 10.4 _+0.2 10.2 _+0.6 10.2 _+ 1.0 10.1 _+0.1 9.9 -+0.2 10.4 _+0.4 8.4 _+0.7

9.8 _+0.1 10.0 _+0.2 9.3 _+0.5 10.5 _+0.8

10.4 _+0.4 9.9 +0.7 10.1 _+1.6 7.3 _+2.7 0.998 0.996 0.972 0.916

10.0 +0.0 9.9 _+0.1 9.5 _+0.1 8.8 +0.2 9.9 ___0.6 10.3 -+1.1 8.5 _+2.4 5.8* _+4.0

10.1 -+0.1 10.1 ___0.2 10.0 _+0.6 10.2 _+0.9 10.0 _+1.8 7.1 _+3.2 16.3 _+7.2 - 7 . 5 * _+12.1 10.4 _+0.2 10.5 -+0.3 9.1 _+0.7 9.5 _+1.3 10.0 _+0.1 10.0 -+0.2 9.1 _+0.4 8.6 _+0.7 10.0 _+0.1 10.1 -+0.2 9.6 -+0.5 10.6 +-0.9

10.1 -+0.4 9.7 _+0.8 10.3 +1.7 6.5 ___2.9 0.998 0.995 0.973 0.917

10.0 _+0.0 10.0 -+0.1 9.5 -+0.1 8.7 -+0.3 9.6 _+0.7 12.2 _+1.3 8.9 _+2.8 5.9* _+5.3

10.3 _+0.4 10.9 ___0.7 8.1 _+1.4 9.7 ___2.2 8.7 -+2.8 5.3* _+4.8 6.8* _+10.4 19.2" -+18.3 9.4 -+0.4 9.6 _+0.7 9.4 _+1.4 8.9 _+2.5 9.9 _+0.2 9.8 _+0.4 8.7 -+0.8 5.7 -+1.3

10.0 _+0.2 10.3 _+0.4 10.0 -+0.9 9.8 + 1.5

9.7 _+0.7 9.7 _+1.2 8.8 _-+2.5 9.7 -+4.5 0.999 0.996 0.978 0.912

9.8 -+0.1 10.1 _+0.1 9.9 _+0.1 9.3 -+0.6 13.3 _+2.8 10.4 _+3.8 11.4 _+2.2 27.9* _+14.4

9.0 _+0.8 11.3 _+1.1 9.6 _+0.7 15.7 _+4.0 9.9* _+4.8 12.6" _+6.7 12.9 _+3.9 28.0* _+32.4

10.6 _+ 0.6 9.4 _+ 0.8 10.4 _+ 0.5 10.5 _+ 3.7 8.7 _+0.9 8.8 _+1.1 10.0 _+0.7 2.9* _+3.8

12.6 _+0.8 9.2 _+1.1 11.1 _+0.7 8.8 -+3.7

10.6 _+ 1.6 10.3 _+2.1 10.8 _+ 1.3 14.7 _+9.0 1.000 0.999 1.000 0.985

804 Volume 46, Number 5, 1992

Page 6: The Use of Singular Value Variance-Decomposition Proportions in FT-IR Analysis of Gases and Vapors

TABLE VI. Singular value variance-decomposition proportions for set B.

             Variance-decomposition proportions
Cond. Index  Benzene  n-Butane  Cyclohexane  Isobutanol  Isobutane  Isopropanol  m-Xylene  Ethane  Ethanol  Ethylbenzene  n-Hexane  Methanol  o-Xylene  p-Xylene  Toluene
1.0          0.000    0.005     0.006        0.000       0.001      0.002        0.000     0.002   0.000    0.000         0.024     0.000     0.000     0.000     0.000
2.5          0.005    0.001     0.046        0.015       0.000      0.006        0.007     0.020   0.014    0.009         0.021     0.014     0.003     0.002     0.005
3.2          0.005    0.002     0.001        0.010       0.046      0.037        0.010     0.003   0.009    0.036         0.001     0.019     0.030     0.001     0.030
3.5          0.012    0.032     0.038        0.006       0.005      0.015        0.004     0.104   0.005    0.008         0.005     0.009     0.043     0.064     0.030
4.1          0.229    0.001     0.081        0.001       0.011      0.026        0.031     0.062   0.002    0.004         0.026     0.001     0.048     0.122     0.005
4.5          0.010    0.197     0.001        0.000       0.343      0.011        0.003     0.006   0.000    0.011         0.036     0.000     0.001     0.009     0.017
4.6          0.155    0.008     0.026        0.000       0.000      0.000        0.501     0.006   0.000    0.010         0.000     0.001     0.020     0.000     0.028
4.9          0.030    0.010     0.102        0.008       0.009      0.079        0.002     0.079   0.003    0.003         0.206     0.003     0.052     0.326     0.003
5.4          0.473    0.050     0.245        0.000       0.044      0.025        0.034     0.010   0.010    0.004         0.008     0.000     0.028     0.202     0.007
5.8          0.031    0.059     0.038        0.021       0.148      0.063        0.000     0.021   0.018    0.027         0.062     0.064     0.161     0.210     0.064
6.6          0.031    0.157     0.012        0.015       0.057      0.022        0.067     0.000   0.041    0.061         0.052     0.063     0.543     0.018     0.070
7.4          0.011    0.003     0.313        0.051       0.073      0.161        0.000     0.572   0.047    0.097         0.318     0.039     0.016     0.046     0.005
11.5         0.002    0.385     0.008        0.460       0.261      0.266        0.011     0.004   0.837    0.016         0.133     0.000     0.029     0.000     0.003
9.7          0.006    0.083     0.070        0.352       0.000      0.284        0.060     0.106   0.016    0.005         0.017     0.772     0.018     0.000     0.054
9.0          0.000    0.008     0.014        0.059       0.003      0.003        0.271     0.005   0.001    0.709         0.089     0.014     0.006     0.000     0.681

and cyclohexane with S/N ratios below 20; i.e., the mean error of estimation is 1.6 ppm-m. On the basis of results from many regressions, these errors are representative. Overall, set A compounds yield robust OLS predictions, in that considerable noise can be tolerated without degrading predictions.

Set B. The second example uses 15 compounds in set B. Pearson correlation coefficients for four pairs of compounds are above 0.6, namely, cyclohexane and n-hexane (r = 0.61); toluene and ethylbenzene (r = 0.62); isobutanol and methanol (r = 0.75); and ethanol and methanol (r = 0.65). The third and strongest correlation explains 56% of the variation. On the basis of these results alone, a higher CN than that found for set A is anticipated. The CN is 11.5, and three CIs approach or exceed 10, indicating moderate and competing collinearity (Table VI). For rows with CIs above 7, variance proportions summing above 0.5 include isobutanol, isopropanol, ethane, ethanol, ethylbenzene, methanol, and toluene. These compounds probably are involved in one or several dependencies. Of the 15 auxiliary regressions, eight have R² above 0.5; the highest (ethanol's) is 0.73, considerably higher than implied by the correlation coefficient matrix. Table IV shows summary results for set A compounds. On average, each auxiliary regression employs six predictors, showing complex but not especially strong dependencies.
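The condition indexes and variance-decomposition proportions of the kind tabulated in Table VI can be obtained from a singular value decomposition of the reference spectra matrix. The following is a minimal sketch in Python with NumPy, assuming the formulation of Belsley, Kuh, and Welsch with columns scaled to unit length and not centered. The function name `collinearity_diagnostics` and the four-column synthetic matrix are illustrative assumptions, not the code or data used in this work.

```python
import numpy as np


def collinearity_diagnostics(X):
    """Return condition indexes and variance-decomposition proportions.

    Rows of the returned proportion matrix correspond to singular values,
    columns to variables (reference spectra), the layout used in Table VI.
    """
    X = np.asarray(X, dtype=float)
    # Assumption: each column is scaled to unit length before the SVD.
    Xs = X / np.linalg.norm(X, axis=0)
    _, d, Vt = np.linalg.svd(Xs, full_matrices=False)
    cond_index = d.max() / d                   # largest entry is the condition number
    phi = (Vt.T ** 2) / d ** 2                 # phi[k, j] = v_kj^2 / d_j^2
    props = (phi / phi.sum(axis=1, keepdims=True)).T
    return cond_index, props


# Hypothetical data: column 3 is nearly the sum of columns 0 and 1.
rng = np.random.default_rng(0)
X = rng.random((200, 4))
X[:, 3] = X[:, 0] + X[:, 1] + 0.01 * rng.standard_normal(200)

cond_index, props = collinearity_diagnostics(X)
worst = int(np.argmax(cond_index))             # row with the largest condition index
involved = np.flatnonzero(props[worst] > 0.5)  # variables tied to that dependency
print(np.round(cond_index, 1))
print(np.round(props, 3))
print("variables in the strongest dependency:", involved)
```

Each column of `props` sums to one. A row with a large condition index and two or more proportions above 0.5 flags the variables taking part in a dependency; with competing dependencies, the proportions may need to be summed over several high-index rows, as was done for the rows with CIs above 7.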

Set B spectra were regressed against the same synthetic spectrum used previously. Results for sets A and B are similar, although B shows slightly greater degradation (Table V). For example, the average errors of estimation for an S/N ratio of 20 are 1.6 and 1.7 ppm-m for sets A and B, respectively. These results indicate that the ability to predict BTEX compounds is largely unaltered with the use of set B. Thus, adding eight compounds to the seven in set A does not cause degrading collinearity.
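A numerical experiment of this general kind can be sketched as follows: a synthetic spectrum is built from known concentrations, noise is added at a chosen S/N ratio, and the OLS estimates are compared with the true values. This is only an illustration under stated assumptions. The S/N definition (peak of the noise-free spectrum divided by the noise standard deviation), the Gaussian-band reference spectra, the true value of 10 for every compound, and the function name `mean_estimation_error` are all hypothetical and are not taken from the experiments reported here.

```python
import numpy as np


def mean_estimation_error(X, c_true, snr, trials=200, seed=0):
    """Mean absolute OLS estimation error over repeated noisy synthetic spectra."""
    rng = np.random.default_rng(seed)
    clean = X @ c_true
    # Assumption: S/N is the peak of the clean spectrum over the noise std. dev.
    sigma = clean.max() / snr
    errors = []
    for _ in range(trials):
        noisy = clean + sigma * rng.standard_normal(clean.shape)
        c_hat, *_ = np.linalg.lstsq(X, noisy, rcond=None)
        errors.append(np.abs(c_hat - c_true).mean())
    return float(np.mean(errors))


# Hypothetical reference spectra: seven overlapping Gaussian bands.
grid = np.linspace(0.0, 1.0, 400)
centers = np.linspace(0.2, 0.8, 7)
X = np.exp(-0.5 * ((grid[:, None] - centers[None, :]) / 0.05) ** 2)
c_true = np.full(7, 10.0)                      # assumed true value for every compound

for snr in (1000, 100, 20, 10):
    print(snr, round(mean_estimation_error(X, c_true, snr), 3))
```

Running the same loop on two candidate sets of reference spectra gives a direct comparison of how much estimation error each set incurs at a given noise level.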

Set C. The third example examines dependencies among the 41 compounds in set C. The large size (41 × 41) of the correlation and variance-decomposition matrices prohibits their reproduction here. At 8 cm⁻¹ resolution, the CN is 118, indicating strong collinearity. Twelve CIs above 30 and many variables with large variance-decomposition proportions indicate competing dependencies. At 2 cm⁻¹, the CN is 53. Again, competing dependencies are shown by six CIs that exceed 30.

Auxiliary regressions for the BTEX compounds have high R², expressing strong dependencies (Table IV). For example, at 8 cm⁻¹ resolution, six compounds share 90% or more of their variance, four of which share over 95%. Regression results using the synthetic spectrum are degraded at S/N ratios below 100 at 8 cm⁻¹ resolution, and below 50 at 2 cm⁻¹ resolution (Table V). At 8 cm⁻¹ resolution, sizable errors are made with the use of even pure reference spectra, due to the 1% noise added to the synthetic spectrum. A resolution of 8 cm⁻¹ is too coarse for accurate predictions of the compounds in set C.
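An auxiliary regression regresses one reference spectrum on all the others; its R² is the share of that spectrum's variance explained by the rest of the set. The sketch below, in Python with NumPy, assumes an intercept term in each regression, which may differ from the procedure used here, and the function name `auxiliary_r2` and the synthetic data are hypothetical. The example is built so that every pairwise correlation stays near or below 0.5 while every auxiliary R² is close to one, the situation that pairwise statistics cannot reveal.

```python
import numpy as np


def auxiliary_r2(X):
    """R^2 of each column regressed on all remaining columns plus an intercept."""
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    r2 = np.empty(p)
    for k in range(p):
        y = X[:, k]
        others = np.column_stack([np.ones(n), np.delete(X, k, axis=1)])
        coef, *_ = np.linalg.lstsq(others, y, rcond=None)
        resid = y - others @ coef
        r2[k] = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2


# Hypothetical data: the last column is the sum of four independent columns.
rng = np.random.default_rng(1)
X = rng.standard_normal((500, 5))
X[:, 4] = X[:, :4].sum(axis=1) + 0.1 * rng.standard_normal(500)

corr = np.corrcoef(X, rowvar=False)
max_pairwise = np.abs(corr - np.eye(5)).max()  # largest off-diagonal correlation
r2 = auxiliary_r2(X)
print("largest pairwise |r|:", round(float(max_pairwise), 2))
print("auxiliary R^2:", np.round(r2, 3))
```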

Accuracy and Computations. The CN roughly indicates the effect of imprecision in the data on the accuracy of OLS predictions. Less degradation was experienced than expected theoretically. For example, for a CN of about 10², roughly two significant digits were obtained for an S/N ratio of 10², and one significant digit for an S/N ratio of 10 (e.g., set C results in Table V). Theory would indicate that S/N ratios should be at least 10 times higher to obtain this precision. The small sample size and approximate relationship between CN and precision account for these discrepancies.

Another result of collinearity was the inability to invert the XᵀX matrix with the use of single or double precision inversion routines when the CN was about 10² or more. After the scaling and centering of columns of this matrix, matrices with slightly higher CNs could be inverted. However, the remaining numerical errors may limit the usefulness of the results.
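One reason inversion of XᵀX is fragile is that its condition number is roughly the square of that of X. The sketch below illustrates this on hypothetical, deliberately collinear data, shows the effect of scaling columns to unit length, and uses an SVD-based least-squares solver that never forms XᵀX. The solver is offered as a common alternative, not as the routine used in this work, and all data and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
base = rng.random((300, 1))
X = base + 1e-3 * rng.standard_normal((300, 5))  # five nearly identical columns
X[:, 0] *= 100.0                                  # one column on a much larger scale

kappa_X = np.linalg.cond(X)                       # ratio of extreme singular values
kappa_XtX = np.linalg.cond(X.T @ X)               # approximately kappa_X ** 2

# Scaling columns to unit length removes the part of the CN due to units alone.
Xs = X / np.linalg.norm(X, axis=0)
kappa_scaled = np.linalg.cond(Xs)

# SVD-based least squares avoids forming or inverting X^T X.
c_true = np.arange(1.0, 6.0)
y = X @ c_true
c_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

print(kappa_X, kappa_XtX, kappa_scaled)
print(np.round(c_hat, 6))
```

Scaling helps only with the ill-conditioning that comes from disparate column magnitudes; the part caused by genuine dependencies among spectra remains after scaling.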

DISCUSSION

The numerical experiments demonstrate the detrimental effects of collinearity among IR spectra on the ability to predict concentrations using spectral least-squares methods. These effects include inaccurate regression estimates, inflated variance of regression estimates, and decreased goodness of fit. Collinearity or dependencies are best indicated by the condition number, the ratio of the maximum to the minimum singular value in the reference spectra matrix. This composite measure of collinearity shows the strength of dependencies that exist among two or more spectra. This parameter is fairly easily and very stably computed.

APPLIED SPECTROSCOPY 805

On the basis of the three spectral sets examined, dependencies among IR spectra of vapors and gases at moderate (2 cm⁻¹) resolution are complex and involve many spectra. Competing rather than dominating dependencies prevail. Simple measures that show only pairwise relationships (e.g., correlation coefficients) do not diagnose the collinearity actually present. Variables involved in the dependencies can be identified with the use of variance-decomposition proportions and auxiliary regressions. The information obtained can be used to select a set of reference spectra that minimizes dependencies. For competing dependencies, the decomposition proportions are less specific, and auxiliary regressions may more clearly identify dependencies. With large numbers of spectra, however, it is tedious to compute and evaluate all auxiliary regressions. It may be more efficient to compute the variance-decomposition proportions and then use auxiliary regressions on the variables indicated.

The experiments indicated that collinearity had generally negligible effects until S/N ratios fell below about 100. In practice, S/N ratios would be considerably higher. However, the numerical experiments implicitly assumed that perfect reference spectra were available and that all other OLS assumptions held. These assumptions would not be achieved in practice; thus degrading effects may occur at the S/N ratios commonly encountered.

Coarse spectral resolution increases collinearity as seen in set C, where 8 cm⁻¹ resolution gave a CN of 118 that resulted in numerical problems and inaccurate predictions. Unfortunately, higher resolutions do not necessarily translate to enhanced performance, lower detection limits, etc., given the inevitable increases in instrument noise. To optimize performance, one must evaluate the trade-off among resolution, signal-to-noise, and collinearity. The suggested diagnostics can show the potential performance of various combinations of spectra, resolution, and bandwidths. The coarsest resolution that yields acceptable collinearity should obtain the best performance. Importantly, results depend only on the reference spectra. Thus, results can be generalized to spectra collected under any conditions. The diagnostic measures should be applied before a set of reference spectra is used.
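The link between resolution and collinearity can be explored by recomputing the condition number of a candidate reference set at several resolutions. The sketch below is a rough illustration only. It stands in for instrument resolution with simple boxcar smoothing, ignores the accompanying change in instrument noise, and uses three hypothetical Gaussian bands rather than measured spectra. The function name `condition_number_at_resolution` is likewise an assumption.

```python
import numpy as np


def condition_number_at_resolution(spectra, step, resolution):
    """CN of unit-length-scaled spectra after boxcar smoothing to `resolution`."""
    width = max(1, int(round(resolution / step)))
    kernel = np.ones(width) / width
    smoothed = np.column_stack(
        [np.convolve(col, kernel, mode="same") for col in spectra.T]
    )
    scaled = smoothed / np.linalg.norm(smoothed, axis=0)
    return np.linalg.cond(scaled)


# Hypothetical spectra: three narrow bands 4 cm^-1 apart on a 0.5 cm^-1 grid.
step = 0.5
wavenumber = np.arange(900.0, 1100.0, step)
centers = np.array([1000.0, 1004.0, 1008.0])
spectra = np.exp(-0.5 * ((wavenumber[:, None] - centers[None, :]) / 1.0) ** 2)

for resolution in (2.0, 8.0):
    cn = condition_number_at_resolution(spectra, step, resolution)
    print(resolution, round(float(cn), 2))
```

In this toy case the broader smoothing makes neighboring bands overlap more, so the CN rises at the coarser setting; a fuller evaluation would weigh that rise against the lower noise that coarser resolution usually brings.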

Collinearity is only one of several important sources of errors in spectral fitting procedures. Other OLS assumptions must be valid. However, spectral sets which minimize collinearity will provide more accurate and stable results. Additionally, the suggested diagnostic analyses appear amenable to automation and could be used as a prefilter in selecting sets of reference spectra for use in spectral least-squares methods.

1. Y. Li-Shi, S. P. Levine, C. R. Strang, and W. F. Herget, Am. Ind. Hyg. Assoc. 50, 354 (1989).

2. J. R. Gosz, C. D. Dahm, and P. G. Risser, Ecology 69, 1326 (1988).

3. M. L. Spartz, M. R. Witkowski, J. H. Fately, J. S. White, J. V. Paukstelis, R. M. Hammaker, W. G. Fateley, R. E. Carter, M. Thomas, D. D. Lane, G. G. A. Marotz, B. J. Fairless, T. Holloway, J. L. Hudson, and D. F. Gurka, Am. Env. Lab. Nov., 15 (1989).

4. J. M. Bjerga and G. W. Small, Anal. Chem. 62, 226 (1990).

5. B. Wythoff, X. Hong-Kui, S. P. Levine, and S. A. Tomellini, "Computer Assisted Infrared Identification of Vapor-Phase Mixture Components," J. Chem. Inf. Comp. Sci., accepted (1991).

6. M. R. Nyden, J. E. Pallister, D. T. Sparks, and A. Slari, Appl. Spectrosc. 41, 63 (1987).

7. C. P. Wang and T. L. Isenhour, Appl. Spectrosc. 36, 185 (1987).

8. D. Belsley, E. Kuh, and R. Welsch, Regression Diagnostics: Identifying Influential Data and Sources of Collinearity (Wiley and Sons, New York, 1980).

9. D. M. Haaland and R. G. Easterling, Appl. Spectrosc. 34, 539 (1980).

10. X. Hong-kui, S. P. Levine, and J. B. D'Arcy, Anal. Chem. 61, 2708 (1989).

11. W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes: The Art of Scientific Computing (Cambridge University Press, New York, 1987).


