
A Monte Carlo Study of the Effect of Noise on Wavelength Selection During Computerized Wavelength...


A Monte Carlo Study of the Effect of Noise on Wavelength Selection During Computerized Wavelength Searches

HOWARD MARK
Bran and Luebbe Analyzing Technologies, 103 Fairview Industrial Park, Elmsford, New York 10523

The process of selecting wavelengths for performing quantitative analysis in the near-infrared is notorious for its instability. A Monte Carlo technique was used to investigate the sensitivity of the wavelength selection process to the noise content of the spectra. The random nature of the noise causes the wavelengths to be selected at random; this seems to be sufficient to explain the instability of the selection process. The statistics of the selection process are insensitive to error in the dependent variable, and, within limits, also insensitive to the amount of noise in the spectral data. The statistics are sensitive to the number of samples in the data set and to the nature of the distribution of the noise.

Index Headings: Computer applications; Reflectance spectroscopy; NIR analysis; Chemometrics; Calibration techniques.

INTRODUCTION

It has been many years since the initial development of the spectroscopic technique now called Near Infrared Reflectance Analysis (NIRA).¹ In all that time, one of the most intractable problems in the use of this analytical technology has revolved around the question of developing a methodology for selecting the wavelengths to use for a given analysis.

In practice, many empirical methods have been developed for selecting the analytical wavelengths (see the discussion in Ref. 2 for an excellent review of the techniques currently in use). However, most of these methods are subject to a certain difficulty. This difficulty arises because an automatic computerized wavelength selection technique is almost invariably included in the method development process.

There are a number of excellent reasons for this, among them the fact that many of the analyses of interest are for powdered solid materials. Data from such specimens show an effect variously called the "particle size effect" or the "repack effect." This phenomenon causes repeat readings of the same aliquot of the specimen, if dumped out of and then repacked into the sample holder, to show an offset which is systematic across the wavelengths but random in direction and magnitude from pack to pack. Clearly, a measurement at some wavelength is needed to account for this extraneous variation of the spectral data, but just as clearly, "particle size effect" does not have a characteristic absorbance band, such as chemical constituents possess. Furthermore, many of the constituents in natural products, where this technology has its most widespread use, have at least some absorbance at virtually every wavelength in the near-infrared region. In many cases the desired analysis is for some characteristic of the sample defined by the process, such as "baking quality," where the chemistry and spectroscopy of the analysis are ill-defined or unknown.

Received 25 April 1988; revision received 8 June 1988.

For all these reasons, automatic wavelength searches are used. One chooses the "best" wavelength so as to optimize the calibration by finding the wavelengths where one of the calibration statistics shows the calibration to be best. This optimum value could be either a maximum or a minimum value, depending upon the statistic used: if correlation coefficient, for example, is the chosen statistic, it would be maximized, while if Standard Error of Estimate (SEE) is the test statistic, it would be minimized.
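For a fixed calibration set and a fixed number of wavelengths in the model, these two statistics rank candidate wavelengths identically, which a short derivation makes explicit. The notation below (n samples, p wavelengths in the model, reference values y_i, fitted values \hat{y}_i) is introduced here for illustration and is not taken from the paper; an ordinary least-squares fit with an intercept is assumed.

```latex
\mathrm{SSE} = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{SST} = \sum_{i=1}^{n} (y_i - \bar{y})^2, \qquad
R^2 = 1 - \frac{\mathrm{SSE}}{\mathrm{SST}}

\mathrm{SEE} = \sqrt{\frac{\mathrm{SSE}}{n - p - 1}}
             = \sqrt{\frac{(1 - R^2)\,\mathrm{SST}}{n - p - 1}}
```

Because SST and n - p - 1 do not depend on which wavelengths are tried, the wavelength that maximizes the correlation coefficient is, under these assumptions, also the one that minimizes SEE.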

The difficulty that arises when automatic wavelength selection methods are used is that, when such an experiment is done more than once (i.e., with a set of data from similar but not identical samples), the wavelengths that are chosen are rarely the same in the different experiments. Furthermore, the interactive nature of the data at different wavelengths in the NIR spectra causes large shifts in all the wavelengths that are chosen by such searches.

Currently there is a trend toward use of calibration methods, such as Principal Component Analysis (PCA) and Partial Least-Squares (PLS), that do not require wavelength selection because data at all available wavelengths are used. However, this is not a complete solution to the problem either. Not only are there already a large number of instruments currently in use that are being calibrated by regression methods that require wavelength selection, but in some cases the regression methods are preferred. For example, in routine use the analysis time is approximately proportional to the number of wavelengths at which data must be measured. Therefore, use of calibration methods that require only a few wavelengths (which implies that the wavelengths must be selected from a larger set) will result in shorter analysis time in routine use. The time savings can be appreciable; often, acceptable results are obtainable with as few as three or even two wavelengths. Contrast this with a calibration that requires measurement of the six, ten, twenty, or even several hundred wavelengths that are required if a principal component calibration is applied to data from a spectrophotometer.

APPLIED SPECTROSCOPY, Volume 42, Number 8, 1988, p. 1427; 0003-7028/88/4208-1427$2.00/0; © 1988 Society for Applied Spectroscopy

In addition, the problem is of interest in itself.

We therefore investigated the behavior of both real and simulated spectroscopic data, to try to determine some of the factors that affect the wavelengths that are chosen. The approach used was to take the test spectra and add random noise to them, and to determine any patterns of wavelength selection that might arise.

THEORY

One of the notable characteristics of all modern instruments used in the NIR spectral region is the extremely low noise levels attained.⁴ The small signals available from reflectance measurements in the NIR spectral region (in this case, "signal" means change of absorbance with constituent concentration rather than raw optical signal) impose a requirement for low-noise instrumentation that was previously almost unheard of; instrument noise levels on the order of 10⁻⁵ absorbance units [where absorbance is defined as log(1/reflectance)] are common.

One idea was that this extremely low noise level was the source of the wavelength selection instability. A thought experiment reveals the reason for this suggestion, by taking this concept to the limit of zero noise. Consider the synthetic spectra of Fig. 1, which shows a set of spectra consisting of a single absorbance band. Imagine that a spectroscopic analysis is to be performed for a material with such a spectrum and that the conditions for the analysis are ideal: the measurement is to be made in transmission mode so that Beer's law holds exactly, the analyte has a single absorbance band, it is dissolved in a completely nonabsorbing solvent, and there are no interferences. Now let us suppose further that there is no noise whatsoever in the measured spectral values and no error in the reference laboratory values for the analyte. Under these conditions, the spectra of a set of calibration samples would look like Fig. 1, where the three "spectra" shown represent spectra of samples at different concentrations.

For such a set of calibration samples, any wavelength within the confines of the absorbance band will allow a calibration line to be determined with no error. The question then arises: What wavelength should be used? The human analyst will, of course, choose the absorbance maximum, but this is done partly from habit and partly in order to maximize signal-to-noise ratio. In the case of our thought experiment, however, there is no noise, and therefore nothing to maximize; hence any wavelength will be equally suitable.

FIG. 1. A set of "ideal" spectra: these curves represent a set of spectra taken from a material with a single absorbance band, dissolved in a nonabsorbing solvent, with no interferences; Beer's law holds exactly, and there is no noise on the spectra. In such a case it is impossible to objectively select a wavelength that will give the best results. (Plot axes: vertical scale 0.0 to 1.0; horizontal axis WAVELENGTH, 1200.0 to 2400.0.)

A computer given this assignment would likely throw up its electronic hands in disgust at being required to select a wavelength in the face of such data, probably coming up with a "divide by zero" or some such error message; but if the programming prevented this, the computer would probably select the first wavelength encountered that was actually on the absorbance band (or maybe the last one; in either case it would be the worst possible choice for real data).

Now let us change the scenario slightly and imagine the same situation, except that now the spectroscopic data contain a small amount of noise. Now we are justified in expecting the peak of the absorbance band to correspond to the largest signal-to-noise ratio. But what does this mean?

In order to analyze the situation, we must draw a distinction between the population value of the noise and the value of the noise for any given set of readings. This distinction is well discussed in the statistical literature and is also the subject of a tutorial article.⁴ Without getting into all the gory details, we note here that, when we say that the signal-to-noise ratio is largest at the absorbance peak, we are comparing the signal, as represented by the absorbance band, to the population value of the noise, which is the same at all wavelengths.

In reality, however, the actual noise at any given wavelength will not equal the population value, but rather, the noise at each wavelength will be the value determined by the sampling distribution of the noise. We would usually expect the distribution of the variance of the noise to be χ². Consequently, the actual noise at any given wavelength will vary randomly and by fairly large amounts. The term signal-to-noise ratio in this context therefore properly refers only to the probability of achieving a given value, rather than to the value itself. Thus, when computer programs are written so as to choose a wavelength on the basis of optimizing some calibration statistic, the wavelength chosen would be expected to vary randomly. The one actually chosen will depend upon which wavelength showed the best signal-to-actual-noise, which would vary as the random character of the actual noise of any given wavelength. For random noise, this is describable only by the probabilistic characteristics of the noise.
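The distinction between the population noise and the noise actually realized in a small set of readings can be illustrated numerically. The following is a minimal sketch, not the author's program: the function name, the seed, and the choice of ten readings per wavelength are assumptions made only for this illustration.

```python
import random
import statistics

def sample_noise_sd(n_readings, population_sd, rng):
    """Standard deviation actually realized in one small set of noise readings."""
    readings = [rng.gauss(0.0, population_sd) for _ in range(n_readings)]
    return statistics.stdev(readings)

rng = random.Random(1)      # fixed seed, chosen arbitrarily
population_sd = 0.01        # the same population noise at every wavelength
n_readings = 10             # assumed: e.g. ten calibration spectra

# One realized noise SD per "wavelength"; all share the same population value.
realized = [sample_noise_sd(n_readings, population_sd, rng) for _ in range(101)]
print(min(realized), max(realized))
```

Although every wavelength draws from the same population, the realized values scatter both above and below 0.01, so the wavelength with the best signal-to-actual-noise ratio is not necessarily the one with the largest signal.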

This work was conducted in order to investigate these probabilities. We noted above that, in natural products, absorbances are found at virtually every wavelength. This fact, combined with the weakness of NIR absorbances and the superimposition of (often much larger) variations due to "particle size effect," makes it difficult to determine the signal-to-noise ratio by ordinary means. Consequently the approach taken was to perform many wavelength searches, each time with a different set of random noise but with each set coming from the same population of random noise. Then, in the long run, since the actual signal-to-(population)noise ratio would reflect the probability of choosing a given wavelength, the number of times a wavelength was actually chosen would indicate the probability. This, in turn, would be proportional to the signal (since the population value of noise was fixed).

Thus a Monte Carlo approach was used. Starting with a fixed set of spectra, "noise" in the form of computer-generated random numbers was added to each spectral data point. Then a wavelength search was performed, and the wavelength selected as "optimum" was noted. Then, for the same set of starting spectra, another set of similar random noise was added, and a new search done. The process of adding noise and performing a wavelength search was repeated many times, until a histogram of the number of times each wavelength was chosen could be generated.

In practice, the number of searches was chosen to be sufficient to ensure that the random fluctuations of the histogram smoothed out and approximated a smooth curve to the eye.

Both synthetic (computer-generated) and real spectra were used as the starting set of spectra for the study. Synthetic spectra were used because that approach seemed to be the best way to be able to control the "signal" part of the signal-to-noise ratio. This step, in turn, allows a determination of the distribution of wavelength selections, which estimates the probability for selection of a given wavelength as a function of the signal-to-noise ratio.

The use of real data allows the inverse operation: given knowledge of the distribution, which was obtained from the synthetic data, the histogram of wavelength selection gives an indication of the relative signal-to-noise ratios at the different wavelengths. Making the further assumption that the population noise is constant, this step then gives an indication of the "signal" portion of the signal-to-noise ratio. This approach, then, is another method of reconstructing the spectrum of the constituent under study.

EXPERIMENTAL

For the synthetic calibration spectra, a set of straight lines was generated. A straight line is the preferable form of "spectrum" in this case, even though it is not a good approximation to an actual absorbance band. This is because it provides a continuum of "signal" values between zero signal and the maximum. Since wavelengths were used individually and without any reference to neighboring wavelengths, to actually model a real absorbance band was unnecessary. For each synthetic spectrum, 101 points were generated, corresponding to 101 wavelengths with simulated absorbances varying continuously between 0 and the maximum value of 1 absorbance. A set of such spectra, the noise-free starting set, is shown in Fig. 2A.

Several parameters were varied; these included the number of searches performed, the number of spectra in the set, the noise level, the distribution of the noise, and the effect of error in the Y-variable as well as in the X-variables.

The synthetic spectra were generated in such a manner that the maximum value of the absorbance corresponding to the highest-absorbing "sample" was unity. Other spectra were generated by scaling the absorbances of the highest-absorbing spectrum by increments of 1/n, where n was the total number of spectra in the set. Each spectrum was assigned a "constituent" value proportional to its absorbance; the value was chosen so that it equaled the absorbance of the spectrum at the highest-absorbing wavelength.
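The construction described above can be sketched in a few lines. This is an illustrative reading of the description, not the original FORTRAN: the function name is hypothetical, the scale factors are assumed to be k/n for k = 1 to n (so that no all-zero spectrum is included), and the absorbance is assumed to rise with the point index.

```python
def make_synthetic_spectra(n_spectra, n_points=101):
    """Noise-free straight-line 'spectra' and their 'constituent' values.

    Spectrum k (k = 1..n_spectra) is the highest-absorbing line scaled by
    k/n_spectra; the top spectrum rises from 0 to 1 absorbance.
    """
    ramp = [j / (n_points - 1) for j in range(n_points)]  # 0.0 ... 1.0
    spectra = []
    constituents = []
    for k in range(1, n_spectra + 1):
        scale = k / n_spectra
        spectra.append([scale * a for a in ramp])
        # Constituent value equals the absorbance at the highest-absorbing wavelength.
        constituents.append(scale)
    return spectra, constituents

spectra, constituents = make_synthetic_spectra(10)
print(len(spectra), len(spectra[0]), constituents[0], constituents[-1])
```

With ten spectra the constituent values run from 0.1 to 1.0, and each point along a line gives a different "signal" level between zero and the maximum.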

Noise in the form of random numbers was then added to the spectra as generated above; the (population) standard deviation of the noise was varied between 0.001 and 0.1 absorbance by scaling the output from the random-number generator. Figure 2B shows the same spectra as shown in Fig. 2A, with an added noise level whose standard deviation is 0.01 absorbance (sometimes called RMS).
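Scaling a unit-variance generator to the desired noise level might look like the sketch below. Python's standard `random` module stands in for the generator actually used, and the function name and the flat dummy spectrum are assumptions for illustration only.

```python
import random
import statistics

def add_gaussian_noise(spectra, noise_sd, rng):
    """Return a noisy copy: independent Normal noise on every data point."""
    # A unit-variance Normal deviate is scaled to the requested standard deviation.
    return [[a + noise_sd * rng.gauss(0.0, 1.0) for a in spectrum]
            for spectrum in spectra]

rng = random.Random(42)           # arbitrary fixed seed
clean = [[0.0] * 2000]            # one flat dummy "spectrum" for checking the level
noisy = add_gaussian_noise(clean, 0.01, rng)
observed_sd = statistics.pstdev(noisy[0])
print(round(observed_sd, 4))
```

The observed standard deviation of the added noise should lie close to the requested 0.01 absorbance, and the noise-free starting spectra are left unchanged so that they can be reused for the next search.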

The random number generator used was a modification of the one in the IBM (FORTRAN) Scientific Subroutines package. Upon testing this program, we found that the unmodified subroutine had a cycle length of less than one million random numbers. To circumvent this, we changed critical internal variables of the subroutine to double precision. This caused the cycle length to increase beyond our patience to measure it. The subroutine otherwise appeared to be satisfactory.

An "experiment" consisted of performing regressions of the synthetic data against the constituent values assigned to the spectra. A wavelength search was performed by regressing the data at each wavelength against the constituent and noting which wavelength gave the best results [as defined by the smallest sum-squared error (SSE), the quantity from which the Standard Error of Estimate is calculated in a calibration]. We then repeated the experiment, adding a different set of random "noise" values to the synthetic spectra and again performing a wavelength search. The purpose of all this was to note which wavelength was selected each time. By repeating the wavelength search many times, we could generate a histogram showing the relative number of times each wavelength was selected. This resulted in an empirical determination of the distribution of wavelength selections.
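The search-and-repeat procedure can be sketched as follows. This is a small illustration under stated assumptions rather than a reconstruction of the original program: the function names are hypothetical, the constituent is treated as the dependent variable in a straight-line fit with intercept, Python's `random` module replaces the original generator, and only 300 searches are run instead of the 120,000 used in the study.

```python
import random
from collections import Counter

def sse_of_fit(x, y):
    """Sum-squared error of the least-squares line y = a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    syy = sum((yi - my) ** 2 for yi in y)
    if sxx == 0.0:              # flat column: no usable signal
        return syy
    return syy - sxy * sxy / sxx

def search(spectra, y):
    """Index of the wavelength whose single-wavelength fit has the smallest SSE."""
    n_points = len(spectra[0])
    sses = [sse_of_fit([s[j] for s in spectra], y) for j in range(n_points)]
    return min(range(n_points), key=sses.__getitem__)

def monte_carlo(clean, y, noise_sd, n_searches, rng):
    """Histogram (Counter) of selected wavelength indices over repeated searches."""
    counts = Counter()
    for _ in range(n_searches):
        noisy = [[a + noise_sd * rng.gauss(0.0, 1.0) for a in s] for s in clean]
        counts[search(noisy, y)] += 1
    return counts

# Demo: 10 straight-line spectra with 101 points each, noise SD 0.01.
n, m = 10, 101
clean = [[(k / n) * j / (m - 1) for j in range(m)] for k in range(1, n + 1)]
y = [k / n for k in range(1, n + 1)]
counts = monte_carlo(clean, y, 0.01, 300, random.Random(0))
print(sorted(counts.items()))
```

Even in this short run the selected index is not always the highest-signal point; the selections scatter over a range of points near the high-signal end, which is the behavior the histograms are intended to quantify.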

The real spectra used were spectra of thirty samples of Hard Red Spring wheat, measured on a Technicon InfraAlyzer Model 500. Spectra were measured over the range 1100 to 2500 nm at 4-nm increments, resulting in 351-point spectra.

FIG. 2. The synthetic spectra used for the test wavelength searches. Part A: the initial noise-free spectra. Part B: the spectra with random, Normally distributed noise added; the noise shown had a standard deviation of 0.01 absorbance. (Plot axes: vertical scale 0.0 to 1.0; horizontal axis WAVELENGTH, 1200.0 to 2400.0.)

Eight spectra of each specimen were measured, with the use of a previously defined protocol.² For each specimen, all eight readings at each wavelength were averaged together. This procedure minimized the inherent noise of the measurements, which include sampling error and repack error, as well as electronic noise.⁵ Synthetically generated noise was then added to these spectra, in a manner similar to that used with the synthetic spectra, and similar wavelength searches were performed.

Computations on the synthetic data were done with an IBM PC-AT computer. For the real data the computations were performed on an IBM PS-2 Model 80 computer. All programs were written in Ryan-McFarland FORTRAN.

RESULTS

Since the results were strongly dependent upon the randomness of the data, the first experiment performed was to determine the number of wavelength searches needed for the histogram generated to settle down and give a reliable indication of the relative performance of the different wavelengths. Figures 3 and 4 show the histograms obtained from this part of the tests. The added noise was Normally distributed and had a value of 0.01 absorbance RMS. The number of wavelength searches varied from 100 to 240,000. As might be expected, improvement was more rapid when the number of searches was small, and improvement continued, but more and more slowly, as the number of searches increased. On the basis of these results, further tests performed used 120,000 wavelength searches to generate the histograms. Fewer searches (e.g., 60,000 or perhaps even 30,000) would probably have been satisfactory, but at the time these curves were developed, it was felt that prudence dictated using the higher number, to provide a safety margin. Also, while 120,000 searches were excessive for the synthetic spectra, that number of searches was barely adequate for the real spectra.
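The slowing rate of improvement is what simple counting statistics would predict. As a rough guide, if each search independently selects a given wavelength with probability p, the count in that histogram bin is binomial, and its relative standard error falls as one over the square root of the number of searches. The sketch below uses that textbook approximation; the function name and the value p = 0.01 are illustrative assumptions, not figures from the study.

```python
import math

def relative_bin_error(p, n_searches):
    """Approximate relative standard error of a bin with selection probability p."""
    return math.sqrt((1.0 - p) / (p * n_searches))

for n_searches in (100, 1000, 10000, 30000, 120000, 240000):
    print(n_searches, round(relative_bin_error(0.01, n_searches), 3))
```

Under this approximation, halving the relative scatter of any bin requires four times as many searches, so each successive increase in the number of searches buys less smoothing than the one before.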

The histograms of Figs. 3 and 4 also illustrate what is probably the most crucial single fact to come out of this study: the wide range of signal-to-population-noise levels at which wavelengths are chosen. While the probabilities continually decrease, these two figures show that wavelengths are selected even when the signal-to-noise is less than half the maximum value. Below we will consider this in more detail.

The next experiment was to test the effect of the signal-to-noise level used. Three histograms were generated and compared. Ten "spectra" were used in each set. In one run, the RMS noise level was set to 0.001 absorbance; in a second run, the noise was 0.1 absorbance; the third run was the same one as in Fig. 3C: 0.01 absorbance.

Figure 5 shows the results of this test. As shown in Fig. 5A, when the noise level was at its highest value (0.1 absorbance RMS) the distribution of selected wavelengths was spread slightly more than when the noise was set to lower levels. However, the two runs at the lower levels give virtually identical results. Figure 5B shows a section of Fig. 5A expanded. It is clear that the two curves corresponding to 0.01 and 0.001 absorbance noise are following the same path, with only random variations between them.

It is not completely clear why two of the curves are the same and the third is different, but a reasonable hypothesis is that this effect is due to the fact that, when a regression analysis is applied to data with error in the independent variable, the calculated regression coefficient is not an unbiased estimator of the "true" regression coefficient. However, if the noise is small enough, the amount of bias is so small that it has no observable effect. Thus, in the case at hand, only when the added noise was at its highest level was the amount of bias in the coefficient sufficient to influence the wavelength selection. It would appear that, below this point, the distribution of selected wavelengths is independent of the actual amount of noise.
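The size of the bias invoked in this hypothesis can be estimated with the classical errors-in-variables result, under which the fitted slope shrinks by roughly var(x) / (var(x) + noise variance) in large samples. The sketch below applies that standard approximation to the ten-spectrum synthetic set at its highest-absorbing wavelength; the function name is hypothetical, and the calculation is offered only as a plausibility check on the hypothesis, not as part of the original study.

```python
import statistics

def attenuation_factor(true_x, noise_sd):
    """Classical errors-in-variables factor by which the fitted slope shrinks."""
    var_x = statistics.pvariance(true_x)
    return var_x / (var_x + noise_sd ** 2)

# Noise-free absorbances of ten spectra at the highest-absorbing wavelength.
true_x = [k / 10 for k in range(1, 11)]
for noise_sd in (0.001, 0.01, 0.1):
    print(noise_sd, round(attenuation_factor(true_x, noise_sd), 4))
```

On this estimate the slope shrinks by roughly 11% at 0.1 absorbance noise but by only about 0.1% at 0.01 absorbance and far less at 0.001, which is at least consistent with only the highest noise level producing a visibly different histogram.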

The next series of computer runs was to determine the effect of the number of samples in the data set. For this experiment, the noise level was kept constant, at a standard deviation of 0.01 absorbance; and sets containing 10, 20, 30, and 50 spectra were used to generate histograms. The ten-spectra set shown was actually the same one used in the previous runs. These runs give the four curves shown in Fig. 6.

From Fig. 6A it is clear that increasing the number of spectra in the calibration set has a consistent and marked effect upon the range of wavelengths that are chosen during an automatic search. The more spectra in the calibration data set, the "tighter" the selection becomes.

Figure 6B shows an expanded-scale plot illustrating that, even with many spectra, there is still a wide range of wavelengths that can be chosen. Even with fifty spectra in the data set, a wavelength can be selected when the signal-to-population-noise ratio is almost as low as half of the signal-to-noise ratio at the absorbance band maximum (although, to be sure, selection of such a wavelength is a low-probability event). When fewer spectra are available, wavelengths can be selected when the signal-to-noise ratio is as low as one-fifth the maximum.

Another experiment using synthetic spectra was to determine the effect of the distribution of the noise. All the previous runs were performed with noise that was Normally (Gaussian) distributed. In this experiment, noise with a uniform distribution was added to the spectra instead. In order to be comparable to the Normally distributed noise, the noise level of the uniformly distributed noise was adjusted so that it also had a standard deviation of 0.01 absorbance, and the number of spectra used was set at ten for Fig. 7A and fifty for Fig. 7B.
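Matching the standard deviation of uniform noise to that of the Normal noise amounts to choosing the width of the interval, since a uniform distribution on [-a, a] has standard deviation a/√3. The following sketch shows one way this could be done; the function name and seed are assumptions, and Python's `random` module again stands in for the original generator.

```python
import math
import random
import statistics

def uniform_noise(noise_sd, rng):
    """One uniformly distributed noise value with the requested standard deviation."""
    half_width = noise_sd * math.sqrt(3.0)  # uniform on [-a, a] has SD a/sqrt(3)
    return rng.uniform(-half_width, half_width)

rng = random.Random(7)  # arbitrary fixed seed
values = [uniform_noise(0.01, rng) for _ in range(20000)]
print(round(statistics.pstdev(values), 4), round(max(abs(v) for v in values), 4))
```

For a standard deviation of 0.01 absorbance the uniform noise is confined to about ±0.0173 absorbance, whereas Normally distributed noise of the same standard deviation has no such bound.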

In both cases the wavelength selections were spread more widely when the noise was Normally distributed than when it was uniformly distributed, and more markedly so for the histograms generated from the larger sets of synthetic samples. This phenomenon was somewhat surprising, as the Central Limit Theorem of statistics⁶ indicates that the usual effect of larger numbers of samples is to diminish the effect of the distribution of the data. At this point there seems to be no explanation for the contrary behavior observed.

FIG. 3. Tests of the effect of the number of wavelength searches used. There were ten synthetic spectra in the set, and the noise level was fixed at 0.01 absorbance RMS. Part A: 100 searches. Part B: 1000 searches. Part C: 10,000 searches. Part D: 30,000 searches.

FIG. 4. Continuation of Fig. 3; tests of the effect of the number of wavelength searches used. Part A: 60,000 searches. Part B: 90,000 searches. Part C: 120,000 searches. Part D: 240,000 searches.

FIG. 5. Effect of noise level. Ten synthetic spectra in each set; noise levels of 0.001, 0.01, and 0.1 absorbance RMS were added. There were 120,000 wavelength searches used to generate each histogram. Part A: the full histograms for all three cases. Part B: an expansion of a section of Part A. [Horizontal axis: wavelength; labeled curves include 0.10 abs RMS and 0.01 abs RMS.]

The last experiment with the synthetic spectra was to evaluate the effect of error in Y on the distribution of the wavelength selection. For all the histograms generated to this point, all the error was in the synthetic optical data. If there had been no error in the synthetic optical data, then a perfect fit to the synthetic constituent would have been obtained.

The histograms shown in Figs. 8 and 9 present comparisons between the distribution of wavelength selections when no error is present in the dependent variable, and when a Normally distributed error with a standard deviation of 0.1 is added to the synthetic dependent variable. Figures 8 and 9 represent the effect of the error in Y when 10 and 50 "samples," respectively, are present in the calibration set. In both cases, there is no difference between the two histograms obtained, beyond random variations.

The number of searches performed for each case was sufficient to allow us to obtain a very good estimate of the fraction of times a wavelength would be chosen, when the signal level is above or below any given fraction of the maximum signal in the set of spectra. Thus these data are suitable for compiling tables of critical values for this statistic, and such a set of tables is presented as Table I. The several sections of Table I represent the results from the different experiments performed [number of spectra, different noise level, different noise distribution, error in the dependent (Y) variable]. This table also contains an entry not often found in statistical tables, but included because it was deemed to be of interest to the spectroscopic community. This entry is the one on the right-hand side of the table: the signal level (relative to the maximum signal) below which no wavelengths were selected at any time. This value, indicated by the heading of alpha = 0, is the cutoff value for the signal level.
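The way such critical values follow from the search results can be sketched as an empirical quantile calculation. The code below is an illustration under stated assumptions, not the procedure used to compile Table I: it assumes that each search has already been reduced to the relative signal level (signal at the selected wavelength divided by the maximum signal), and the function name and demonstration data are hypothetical.

```python
def critical_values(relative_signals, alphas):
    """Map each alpha to the signal level below which a fraction alpha of selections fell."""
    ordered = sorted(relative_signals)
    n = len(ordered)
    table = {}
    for alpha in alphas:
        # Index of the first selection that is not among the lowest alpha fraction.
        # alpha = 0 returns the smallest signal level ever selected (the cutoff).
        k = min(int(alpha * n), n - 1)
        table[alpha] = ordered[k]
    return table


# Hypothetical input: 100 selections with relative signal levels 0.01, 0.02, ..., 1.00.
demo = [i / 100 for i in range(1, 101)]
cv = critical_values(demo, [0.5, 0.25, 0.0])
```

With the demonstration data, half of the selections lie below the value returned for alpha = 0.5, and the alpha = 0 entry is simply the lowest relative signal level observed in any search.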

Figure 10 presents the results of performing multiple wavelength searches on data obtained by adding random Gaussian noise to real data obtained as described in the experimental section. The calibrations were for the protein content of wheat. Three wavelengths were used in each calibration; for each part of Fig. 10, two of the three wavelengths were kept constant, while the third was swept through the available wavelengths. Naturally, no selection was permitted at the wavelengths that were forced into the calibration; hence, each histogram in Fig. 10 has two places where the selection probability drops to zero.
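The exclusion around the forced wavelengths can be expressed as a simple filter on the candidate list. In the sketch below, the 8 nm minimum separation is taken from the caption of Fig. 10; the 2 nm grid from 1100 to 2500 nm and the function name are assumptions made only for illustration.

```python
def candidate_wavelengths(grid, forced, min_gap_nm=8.0):
    """Wavelengths the search may try: at least `min_gap_nm` from every forced wavelength."""
    return [w for w in grid if all(abs(w - f) >= min_gap_nm for f in forced)]


grid = list(range(1100, 2501, 2))            # hypothetical 2 nm instrument grid
cands = candidate_wavelengths(grid, forced=(1680, 2180))
```

Every wavelength removed by this filter necessarily has zero counts in the histogram, which produces the two gaps in each part of Fig. 10.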

Interestingly, the probability peaks do not necessarily correspond to any one particular constituent. The peak at approximately 1440 nm in all three parts of Fig. 10, as well as the one at 1900 nm, appears to correspond to the absorbance bands of water. On the other hand, part B of Fig. 10 also shows a peak at 2100 nm; this is usually attributed to starch. Also, all three parts of Fig. 10 show a peak at about 1580 nm; the assignment of absorbance bands in this region is very uncertain.

CONCLUSIONS

It is clear that the presence of noise in the optical data has a marked effect on the variability of the wavelengths selected by an automatic computerized wavelength search routine. Since the change in the selected wavelength is caused by purely random differences in the way the calibration line fits the data, it should not matter what type of search algorithm is used. Any algorithm that depends upon finding the "best" wavelength on the basis of comparisons of calibrations using data at the different wavelengths available will be subject to the effect of this random phenomenon. Furthermore, since this wavelength instability is caused by random noise at the most fundamental level, then, regardless of any transformations to which the data may be subjected, the instability should be unaffected by the other common sources of variation of the data, such as particle size effects, repack effects (Ref. 7), or even sample inhomogeneity (which is a contribution to error in the dependent variable, which, as we found, has no effect).

In this sense the use of band assignments to relate the wavelength selection process to the underlying chemical factors is a one-way street. For the analyst to select wavelengths on the basis of knowledge of the existing bands of the compounds in the materials under study is a proper method of proceeding. However, to try to go the other way, and assign meaning to a computer-selected set of wavelengths, is a process fraught with danger, insofar as we have seen that the selection process is overwhelmingly contaminated with a random process.

FIG. 6. The effect of the number of spectra in the data set. For each run, 120,000 wavelength searches were used. The noise level was set to 0.01 absorbance RMS; sets of spectra containing 10, 20, 30, and 50 synthetic spectra were used. Part A: the full histograms. Part B: a scale-expansion of Part A to show how far down the signal can fall and wavelengths still be selected. [Horizontal axis: wavelength; curves labeled 10, 20, 30, and 50 samples.]

FIG. 7. A comparison of the effect of the addition of Normally distributed noise to that of uniformly distributed noise. In all histograms in this figure the population noise level was 0.01 absorbance. Part A: 10 samples in the set. Part B: 50 samples in the set. [Horizontal axis: wavelength; curves labeled UNIFORM and NORMAL.]

FIG. 8. Comparison between histograms generated when there is no error in the "constituent" and when there is error in the "constituent" as well as in the "optical" data. The calibration set consisted of ten "samples." A: the full histogram. B: expansion of one section. [Horizontal axis: wavelength.]

FIG. 9. Comparison between histograms generated when there is no error in the "constituent" and when there is error in the "constituent" as well as in the "optical" data. The calibration set consisted of fifty "samples." A: the full histogram. B: expansion of one section. [Horizontal axis: wavelength.]

FIG. 10. Histograms obtained upon calibrating for protein concentration in Hard Red Spring Wheat. In each part of the figure two wavelengths were kept constant by being forced into the calibration, and the histograms are the result of performing automatic searches for the third wavelength. The program was not permitted to select a wavelength less than 8 nm from the ones that were forced, giving the breaks seen in the curves. A: 1680 nm and 2180 nm were fixed. B: 1680 nm and 2208 nm were fixed. C: 2100 nm and 2208 nm were fixed. [Horizontal axis: wavelength.]

TABLE I. Critical values for the signal level as a fraction of the maximum signal at which the probability of selecting a wavelength falls below the specified alpha value. For example, with 20 spectra in the data set, there is only a five percent probability of selecting a wavelength where the signal is less than 0.78 of the maximum signal in the spectra. The table also lists the signal level (relative to the maximum signal) below which no wavelengths were selected (effectively, alpha = 0).

A. For different numbers of spectra (noise = 0.01 abs RMS, Normally distributed).

No. of                        Alpha
spectra      0.5     0.25    0.10    0.05    0.01    0
10           0.91    0.83    0.73    0.66    0.53    0.23
20           0.94    0.89    0.82    0.78    0.69    0.38
30           0.95    0.91    0.85    0.82    0.74    0.55
50           0.96    0.93    0.88    0.86    0.80    0.62

B. For different amounts of noise (Normally distributed, ten samples in the set).

                              Alpha
Noise level  0.5     0.25    0.10    0.05    0.01    0
0.1 abs      0.90    0.81    0.70    0.62    0.48    0.09
0.01 abs     0.91    0.83    0.73    0.66    0.53    0.23
0.001 abs    0.91    0.83    0.73    0.66    0.53    0.20

C. Noise uniformly distributed (noise level = 0.01 absorbance). These values may be compared with Table I.A for the corresponding values when the noise is Normally distributed.

No. of                        Alpha
spectra      0.5     0.25    0.10    0.05    0.01    0
10           0.92    0.84    0.75    0.68    0.55    0.27
50           0.98    0.95    0.92    0.90    0.85    0.82

We have seen that the presence or absence of noise in the dependent variable has no effect on the distribution of the selection process. Furthermore, the amount of noise in the independent variable (the optical data) has no effect as long as the noise level is sufficiently small for it not to bias the coefficients of the calibration. The extremely low noise levels present in the current NIR instrumentation would seem to satisfy that requirement adequately.
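One way to see why the noise level drops out is to write the fit criterion for a single-wavelength calibration explicitly. The derivation below is a sketch under assumptions not stated in this paper: the absorbance at a given wavelength is modeled as x_i = b c_i + sigma z_i, where b is the band strength at that wavelength, c_i is the constituent value, z_i is unit-variance noise, and the search is taken to rank wavelengths by the squared correlation r^2 between x and the constituent. S denotes a sum of squares or cross products about the mean.

```latex
\begin{align*}
x_i &= b\,c_i + \sigma z_i, \qquad y_i = c_i, \\
S_{xx} &= b^2 S_{cc} + 2 b \sigma S_{zc} + \sigma^2 S_{zz}, \qquad
S_{xy} = b S_{cc} + \sigma S_{zc}, \qquad
S_{yy} = S_{cc}, \\
1 - r^2 &= \frac{S_{xx} S_{yy} - S_{xy}^2}{S_{xx} S_{yy}}
         = \frac{\sigma^2 \left( S_{zz} S_{cc} - S_{zc}^2 \right)}{S_{xx} S_{cc}}
         \approx \frac{\sigma^2}{b^2} \cdot
           \frac{S_{zz} S_{cc} - S_{zc}^2}{S_{cc}^2}
         \qquad \left( \sigma^2 S_{zz} \ll b^2 S_{cc} \right).
\end{align*}
```

In this approximation the factor sigma^2 multiplies the criterion at every wavelength equally, so the wavelength giving the smallest value of 1 - r^2 does not depend on the noise level; only the band strength b and the particular noise values matter. The approximation no longer holds once the noise becomes comparable to the signal, because the noise terms in S_xx can then no longer be neglected.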

The only factors that appear to have an effect are the distribution of the noise and the number of samples present in the calibration data set. However, from examination of the histograms presented in the various figures it is clear that even the best-case conditions hardly limit the wavelengths that conceivably could be chosen in any given wavelength search. Thus, any search technique that relies on finding a "best" set of wavelengths will be subject to this source of variability.

There might be some concern that an automatic wavelength search could select a wavelength set that was actually "bad," in the sense that a calibration equation using those wavelengths would be a poor predictor. The results obtained here should dispel such qualms. Table I shows that, in the absence of pathological calibration conditions, no wavelength is chosen unless the signal level is above some defined minimum value (the actual level varies, of course, with the calibration conditions). Therefore, any selected wavelength will perform satisfactorily, regardless of the possibility of better ones existing.

By shedding light on the nature of the wavelengths selected by automatic computerized searches, we feel we have solved the scientific problem of explaining why different wavelengths are selected under apparently similar circumstances. This still leaves us with the "engineering" problem of deciding how to select wavelengths. Nothing in this paper is intended to imply that computer-aided wavelength searches for calibration purposes are to be considered a nonutilitarian endeavor. The long history of success using such empirical procedures that NIR spectroscopy enjoys is testimony to the practical utility of computer-aided wavelength searches. Furthermore, there is theoretical justification for using this procedure in the need to accommodate particle size effects, which have no distinct absorbance bands, as was discussed in the introductory section of this paper. Thus, computerized wavelength searches play an important role in the practical aspects of generating usable and accurate calibrations over the wide variety of materials that NIR spectroscopy is suitable for. However, we must issue a strong caveat concerning the dangers of assigning meaning to analytical wavelengths selected this way. Alternative approaches are also available. Fourier transforms (Ref. 8), principal components (Refs. 5, 9), and partial least-squares (Ref. 10) are all methods of utilizing the entire available spectrum for calibration purposes, thereby sidestepping the wavelength-selection problem entirely.

The science of spectroscopy is, perhaps, undergoing a fundamental change. "Classical" spectroscopic theory (Beer's law, etc.) has been based upon deterministic considerations of the effects of various parameters. With the advent of widespread computerization of instruments, and the ready availability of computer programs for implementing complicated and sophisticated multivariate methods of data handling, we are entering into an era where results are explainable only by probabilistic considerations. The study of such phenomena has long been the province of mathematicians and statisticians, who have laid extensive groundwork for understanding of this type of behavior, in the twin disciplines of probability theory and statistics. Engineers have found the study of these disciplines very fruitful in explaining the effects of noise on data transmission and weak-signal detection. It is time now for spectroscopists, and more generally chemists, to start to make more widespread use of these fields of knowledge.

1. K. H. Norris, Trans. Amer. Soc. Agr. Engr. 7, 240 (1964).
2. W. Hrushka, "Data Analysis: Wavelength Selection Techniques," in Near Infrared Technology in the Agricultural and Food Industries, P. Williams and K. Norris, Eds. (American Association of Cereal Chemists, St. Paul, Minnesota, 1987).
3. H. Mark and D. L. Tunnell, Anal. Chem. 57, 1449 (1985).
4. H. Mark and J. Workman, Spectroscopy 2(3), 47 (1987).
5. H. Mark, Anal. Chem. 58, 2814 (1986).
6. H. Mark and J. Workman, Spectroscopy 3(1), 44 (1988).
7. H. Mark and J. Workman, Anal. Chem. 58, 1454 (1986).
8. F. G. Giesbrecht, W. F. McClure, and A. Hamid, Appl. Spectrosc. 35, 210 (1981).
9. H. Mark, ChimicaOGGI, p. 57, Sept. (1987).
10. T. Naes and H. Martens, Commun. Statist.-Simula. Computa. 14, 545 (1985).

1440 Volume 42, Number 8, 1988

