A Comparison of Positive Matrix Factorization and the Weighted Multivariate Curve Resolution Method. Application to Environmental Data

Ivana Stanimirova,† Romà Tauler,‡ and Beata Walczak†,*

†Department of Analytical Chemistry, Institute of Chemistry, The University of Silesia, 9 Szkolna Street, 40-006 Katowice, Poland
‡Institute of Environmental Assessment and Water Studies, IDAEA-CSIC, C/Jordi Girona 18-26, 08034 Barcelona, Spain

ARTICLE. Published: October 26, 2011. © 2011 American Chemical Society. Environmental Science & Technology (Environ. Sci. Technol.) 2011, 45, 10102–10110. dx.doi.org/10.1021/es201024m. pubs.acs.org/est

ABSTRACT:

In recent years, positive matrix factorization, PMF, has gained popularity in environmental sciences and it has been recommended by the U.S. Environmental Protection Agency as a general modeling tool in air quality control. Among the attractive features contributing to its popularity is that measurement uncertainty information can be incorporated into the PMF model, which allows the handling of missing measurements and data below the reporting limits. In addition, the solutions obtained from PMF obey constraints such as the non-negativity of the source compositions and source contributions of samples that make their interpretation physically meaningful. A less popular multivariate curve resolution method based on a weighted alternating least-squares algorithm, MCR-WALS, also incorporates the measurement error information and non-negativity constraints, which makes this method a potential tool when obtaining composition and contribution profiles of environmental data. Both methods use the same loss function, but they differ in the way the profiles are obtained. The goal of this study was to compare the performance of PMF with the performance of MCR-WALS for data sets simulated with different correlation and error structures. The results showed that the profiles extracted by both methods are virtually the same for data with different error structures.

Received: March 27, 2011
Revised: October 23, 2011
Accepted: October 26, 2011

INTRODUCTION

Over the past several years there has been an increased interest in developing and applying chemometric approaches that allow for the incorporation of the measurement uncertainty information in the chemometric analysis of the collected chemical data. This has been driven by the possibility of improving the extraction of chemical information including a priori knowledge about sampling error, instrumentation noise or other possible sources of variation. Maximum likelihood principal component analysis, MLPCA, multivariate curve resolution-weighted alternating least-squares, MCR-WALS and positive matrix factorization, PMF, are some of the methods that have been proposed in the literature¹⁻⁴ as counterparts of the classic principal components analysis, PCA,⁵,⁶ multivariate curve resolution-alternating least-squares, MCR-ALS⁷,⁸ and traditional factor analysis.⁹,¹⁰ The general goal of all of these methods is to solve a bilinear problem which can be expressed mathematically in the following way:

$$X = GF^{T} + E \qquad (1)$$

In other words, with these methods the original high-dimensional experimental data organized as a matrix X (m × n) with m rows (samples or objects) and n columns (chemical species or variables) is decomposed into two lower-rank matrices G (m × p) and F^T (p × n) the p components of which, in general, are related to the chemical information in the data, whereas matrix E (m × n) contains information about the experimental error, that is, the part of variation not explained by the model of definite complexity p. Different constraints can be applied to the G and F^T matrices under various assumptions for the error matrix, E, which implies the use of different algorithms in order to perform the decomposition described by eq 1. In PCA and classic factor analysis, the columns of the scores matrix G, often called principal components or latent factors, are obtained by maximizing the variance of the projected data and are constrained to be orthogonal to each other, whereas the rows of the loadings matrix F^T are orthonormal, that is, orthogonal and normalized to unit length. Obtaining a unique solution of the bilinear problem is an attractive feature in chemical studies; however, neither method provides such a solution due to the rotational freedom of factors. Furthermore, including the orthogonality constraints into the score and loading matrices often results in identified sources that are difficult to interpret or are physically unrealistic. In order to enrich the interpretability of the obtained factors, either orthogonal (varimax, quartimax, equimax) or oblique rotations⁹ can be applied. Another more general solution to the bilinear modeling problem can be obtained using the multivariate curve resolution approach, MCR. In contrast to PCA, the factors found with MCR are not orthogonal and in general an infinite number of solutions for the contribution and composition matrices, G and F^T, exist. However, the imposition of various constraints like non-negativity, unimodality, closure or others to the G and/or F^T matrices via the alternating least-squares, ALS, algorithm gives the possibility of obtaining a unique or nearly unique solution with a physically meaningful interpretation. The MCR-ALS approach with non-negativity constraints has found applications in kinetic studies,¹¹ modeling of environmental data,¹²,¹³ analysis of microarray data,² resolving signals of coeluting mixture components in chromatography, etc.¹⁴,¹⁵
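The rotational freedom mentioned above can be illustrated numerically. The following is a minimal sketch, not taken from the paper: the matrix sizes, the random seed, and the mixing matrix `T` are illustrative assumptions. It shows that any invertible p × p matrix T turns one exact factorization of X into another one with the same fit.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, p = 30, 50, 4                       # illustrative sizes (assumption)
G = rng.lognormal(size=(m, p))            # contribution profiles, m x p
Ft = rng.lognormal(size=(p, n))           # composition profiles F^T, p x n
X = G @ Ft                                # noise-free bilinear model (E = 0)

# Hypothetical invertible mixing matrix: identity plus a small perturbation.
T = np.eye(p) + 0.1 * rng.standard_normal((p, p))
G_rot = G @ T                             # rotated contributions
Ft_rot = np.linalg.solve(T, Ft)           # T^{-1} F^T, rotated compositions

# Both pairs reproduce X equally well, so the fit alone cannot pick one.
same_fit = np.allclose(G_rot @ Ft_rot, X)
# Whether the rotated factors remain non-negative depends on T.
still_nonneg = bool((G_rot >= 0).all() and (Ft_rot >= 0).all())
print(same_fit, still_nonneg)
```

Constraints such as non-negativity shrink the set of admissible T, which is why they can lead to a unique or nearly unique solution.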

A general problem with the classic methods is that the decomposition defined by eq 1 is only optimal when the errors for all measurements in X are independent and identically distributed (i.i.d.) normal.¹⁶ Even though the measurements obtained from spectroscopic and chromatographic methods are, in the majority of cases, precise with low and relatively uniform uncertainties, this assumption is often not fulfilled when analyzing complex natural samples such as environmental samples. With these types of data the measurement uncertainty varies systematically with the magnitude of the signal. Therefore, chemical components present in low concentrations/contents may be neglected during chemometric analysis with the classic methods, although they have the same signal-to-noise ratio as those chemical components present in high concentrations/contents. A straightforward way of dealing with this problem is to apply approaches that offer the possibility of including information about the magnitude of the measurement errors during the modeling process. In recent years, positive matrix factorization, PMF, has gained popularity in the field of environmental sciences⁴,¹⁷⁻²² and has been recommended by the U.S. Environmental Protection Agency as a modeling tool in air quality control. An EPA-supported version of the PMF model and user's guide are available from the U.S. EPA Web site (http://www.epa.gov/heasd/products/pmf/pmf.html). On the other hand, MCR with weighted alternating least-squares (MCR-WALS) has been successfully used to analyze time course DNA-microarrays data² and in aerosol source apportioning studies.¹³ The objective function in PMF and MCR-WALS is defined in the same way, but the two methods differ algorithmically. Therefore, in this work we investigated how the algorithmic differences may influence the contribution and composition profile matrices obtained from both methods. A comparison of the results obtained by MCR-ALS, MCR-WALS, and PMF for raw and scaled environmental data collected during a monitoring campaign of aerosol pollution was recently presented in Tauler et al.¹³ In this work, we offer a comprehensive comparative study of the performances of PMF and MCR-WALS using data sets simulated with different correlation and error structures and a real environmental data set.
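The point about low-concentration components being neglected can be made concrete with a small numerical sketch. The two species, their concentrations (100 and 1), and the 10% proportional uncertainty are illustrative assumptions, not values from the paper. Both species have the same signal-to-noise ratio, yet the unweighted sum of squared residuals is dominated by the abundant one, whereas weighting by the error variance balances the two.

```python
import numpy as np

rng = np.random.default_rng(1)
true = np.tile([100.0, 1.0], (1000, 1))   # abundant and trace species (assumed values)
sigma = 0.10 * true                       # 10% proportional uncertainty for both
x = true + sigma * rng.standard_normal(true.shape)

resid = x - true
ssr_per_species = (resid ** 2).sum(axis=0)               # unweighted squared residuals
wssr_per_species = ((resid / sigma) ** 2).sum(axis=0)    # residuals weighted by 1/sigma^2

# Fraction of the total loss contributed by each species.
share_unweighted = ssr_per_species / ssr_per_species.sum()
share_weighted = wssr_per_species / wssr_per_species.sum()
print(share_unweighted, share_weighted)
```

In the unweighted case the trace species contributes a negligible share of the loss, so a least-squares fit has little incentive to model it well.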

In the following section, the algorithmic aspects of MCR-WALS and PMF are described in detail starting with a comparison of customary MCR-ALS and its weighted version. Then, attention is focused on the simulation study in which the performances of the methods are compared for data sets with different error structures. The data sets used in this study are described in the Experimental Section. The results of the study are presented and discussed in the Results and Discussion Section.

THEORY

As was already mentioned in the Introduction, MCR-ALS, MCR-WALS, and PMF aim to solve the general bilinear model defined by eq 1. With MCR-ALS the contribution and composition profile matrices, G and F^T of a definite number of factors, p, are found by minimizing the sum of squared residuals, SSR, via the alternating least-squares algorithm.

$$\mathrm{SSR} = \sum_{i=1}^{m} \sum_{j=1}^{n} \left( x_{ij} - \tilde{x}_{ij} \right)^{2} \qquad (2)$$

In this equation, $x_{ij}$ is the measurement of the j-th (j = 1, ..., n) chemical component for the i-th (i = 1, ..., m) sample and $\tilde{x}_{ij}$ is the measurement predicted by the model of a given complexity. The ALS algorithm is designed to work when both matrices G and F^T are not known in advance. This is possible by simplifying the problem of finding the solution of the bilinear problem in eq 1 by performing two simpler least-squares calculations in an alternating way, which means estimating one matrix given the estimation of the other under suitable constraints. Thus, after the selection of the model's complexity, p, the next step of the algorithm is to initialize either the G or F^T matrix. Let us assume that the F^T matrix is to be initialized. This can be done in several possible ways. The elements of F^T can be chosen completely at random, which may slow down the convergence of the algorithm and does not guarantee that an optimal solution is found. Another way may be to randomly select p rows from the original data X or to use, for example, the needle searching method,²³ the orthogonal projection approach, OPA,²⁴ or the simple-to-use interactive self-modeling analysis (SIMPLISMA).²⁵ No matter which of the initialization approaches is chosen, it is highly recommended that the algorithm be run several times in order to ensure the reliability of the contribution and composition profile matrices obtained. Once the F^T matrix is initialized, the elements of the G matrix are found using the following projection:

$$G = XF(F^{T}F)^{-1} \qquad (3)$$
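As an illustration of the initialization and of the projection in eq 3, here is a minimal sketch. The function names `init_from_rows` and `project_G` are hypothetical, and the projection is computed with a least-squares solver rather than by forming the inverse explicitly, which is a numerical choice of this sketch and not something stated in the text.

```python
import numpy as np

def init_from_rows(X, p, rng):
    """Initialize F^T with p randomly selected rows of X (one option named above)."""
    idx = rng.choice(X.shape[0], size=p, replace=False)
    return X[idx, :].copy()

def project_G(X, Ft):
    """Unconstrained least-squares estimate G = X F (F^T F)^{-1} (eq 3)."""
    # Solve F G^T = X^T in the least-squares sense, with F = Ft.T (n x p).
    G_t, *_ = np.linalg.lstsq(Ft.T, X.T, rcond=None)
    return G_t.T

# Usage on a small noise-free example (illustrative sizes).
rng = np.random.default_rng(0)
G_true = rng.random((6, 2))
Ft_true = rng.random((2, 5))
X_demo = G_true @ Ft_true
G_est = project_G(X_demo, Ft_true)        # recovers G_true when F^T is exact
Ft_init = init_from_rows(X_demo, 2, rng)  # a possible starting point for ALS
```

Nothing in this projection keeps the elements of G non-negative; that constraint has to be imposed separately.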

An easy way to incorporate the non-negativity constraint is to set the negative elements of G to zero. In our work, we used the fast non-negativity-constrained least-squares algorithm, FNNLS, proposed by Bro and de Jong,²⁶,²⁷ which minimizes the sum of squared residuals in X so that all elements of G are equal to or larger than zero. The estimation obtained for G is then used to recalculate the profiles matrix F^T under the non-negativity constraint:

$$F^{T} = (G^{T}G)^{-1}G^{T}X \qquad (4)$$

At each iteration, the two steps described by eqs 3 and 4 are performed sequentially and the sum of squared residuals in X is estimated using the interim model. The iterative procedure continues as long as the convergence criterion is not fulfilled, that is, the difference in the SSR values obtained in two consecutive iterations is not smaller than a predefined number (e.g., 10^−8). Different constraints mentioned earlier can be incorporated within the ALS algorithm, but only the non-negativity of G and F^T was considered in this work. Recently, further work has been done in understanding and calculating the rotation ambiguity of the solution obtained from MCR-ALS.²⁸⁻³⁰
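The alternating steps of eqs 3 and 4 can be sketched as follows. This is an illustrative simplification and not the authors' implementation: non-negativity is imposed by setting negative elements to zero (the "easy way" mentioned above) instead of FNNLS, and the function name `mcr_als`, the iteration cap `max_iter`, and the default tolerance are assumptions of this sketch.

```python
import numpy as np

def mcr_als(X, Ft0, tol=1e-8, max_iter=500):
    """Alternating least squares with clipping to zero (a stand-in for FNNLS)."""
    Ft = np.clip(np.asarray(Ft0, dtype=float), 0.0, None)
    ssr_old = np.inf
    for _ in range(max_iter):
        # Eq 3: least-squares estimate of G given F^T, then clip negatives.
        G = np.linalg.lstsq(Ft.T, X.T, rcond=None)[0].T
        G = np.clip(G, 0.0, None)
        # Eq 4: least-squares estimate of F^T given G, then clip negatives.
        Ft = np.linalg.lstsq(G, X, rcond=None)[0]
        Ft = np.clip(Ft, 0.0, None)
        # Eq 2: sum of squared residuals of the interim model.
        ssr = float(((X - G @ Ft) ** 2).sum())
        # Stop when the SSR change between consecutive iterations is below tol.
        if abs(ssr_old - ssr) < tol:
            break
        ssr_old = ssr
    return G, Ft, ssr
```

Clipping does not generally give the constrained least-squares optimum at each step, so a proper non-negative least-squares solver would be preferable in practice; the loop structure and the stopping rule are the parts this sketch is meant to show.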

Unlike MCR-ALS, the core of the MCR-WALS algorithm is the inclusion of information about error variances in order to account for the level and structure of the noise. MCR-WALS² minimizes the weighted sum of squared residuals, WSSR, expressed by

$$\mathrm{WSSR} = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{\left( x_{ij} - \tilde{x}_{ij} \right)^{2}}{\sigma_{ij}^{2}} \qquad (5)$$

where $\sigma_{ij}$ is the calculated standard deviation of the measurement. This objective function holds for uncorrelated errors and the respective maximum likelihood projections implemented in the steps of the ALS algorithm can be performed in the following way. Suppose that matrix Σ of dimension m × n contains the error variances for the measurements. After the initialization of F^T, the rows of the original data X, $x_{i\cdot}$, are projected on F^T as follows:

$$x^{*}_{i\cdot} = x_{i\cdot} S_{i}^{-1} F \left( F^{T} S_{i}^{-1} F \right)^{-1} F^{T} \qquad (6)$$

Here, $S_{i}$ is a diagonal matrix, the diagonal elements of which are those of the i-th row of Σ. In this maximum likelihood projection, the measurements in a row with larger uncertainties are downweighted more and therefore have less influence on the final model. The next step of the algorithm is to find the elements of the G matrix according to eq 3 using X* under non-negativity or other constraints. Then in order to re-estimate F^T, the maximum likelihood projection of the columns of X, $x_{\cdot j}$, is executed in the space of G:

$$x^{*}_{\cdot j} = G \left( G^{T} S_{j}^{-1} G \right)^{-1} G^{T} S_{j}^{-1} x_{\cdot j} \qquad (7)$$

The matrix $S_{j}$ is the respective error covariance matrix, the diagonal elements of which are those of the j-th column of Σ. Finally, the least-squares solution for F^T is found under the desired constraints using X* in eq 4. All of these steps are repeated iteratively as long as the convergence criterion is not fulfilled. Similar to the ALS algorithm presented earlier, the difference in the values of the loss function in two consecutive iterations can be used as a stopping criterion.
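The two maximum likelihood projections (eqs 6 and 7) and the loss of eq 5 can be written compactly for the uncorrelated-error case. This is a sketch under stated assumptions: the function names are hypothetical, `Var` plays the role of Σ (element-wise error variances, all strictly positive), and each row or column is processed in a plain loop for readability rather than speed.

```python
import numpy as np

def ml_project_rows(X, Var, Ft):
    """Eq 6: project each row of X onto the row space of F^T with weights 1/variance."""
    F = Ft.T                                      # n x p
    X_star = np.empty_like(X, dtype=float)
    for i in range(X.shape[0]):
        w = 1.0 / Var[i, :]                       # diagonal of S_i^{-1}
        FtW = Ft * w                              # F^T S_i^{-1}, p x n
        coef = np.linalg.solve(FtW @ F, FtW @ X[i, :])
        X_star[i, :] = F @ coef                   # transposed form of eq 6
    return X_star

def ml_project_cols(X, Var, G):
    """Eq 7: project each column of X onto the column space of G with weights 1/variance."""
    X_star = np.empty_like(X, dtype=float)
    for j in range(X.shape[1]):
        w = 1.0 / Var[:, j]                       # diagonal of S_j^{-1}
        GtW = G.T * w                             # G^T S_j^{-1}, p x m
        coef = np.linalg.solve(GtW @ G, GtW @ X[:, j])
        X_star[:, j] = G @ coef
    return X_star

def wssr(X, X_hat, Var):
    """Eq 5: weighted sum of squared residuals."""
    return float((((X - X_hat) ** 2) / Var).sum())
```

When all variances are equal, both functions reduce to the ordinary orthogonal projections, which is a convenient sanity check for an implementation.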

Positive matrix factorization, PMF, is an iterative algorithm³¹ which, in general, minimizes the weighted sum of squares defined by eq 5. When non-negativity constraints are imposed on the G and F^T matrices, this objective function is enhanced by regularization and logarithmic penalty terms, and is expressed in the following way:

$$Q = \sum_{i=1}^{m} \sum_{j=1}^{n} \frac{e_{ij}^{2}}{\sigma_{ij}^{2}} - \alpha \sum_{i=1}^{m} \sum_{k=1}^{p} \log g_{ik} - \beta \sum_{k=1}^{p} \sum_{j=1}^{n} \log f_{kj} + \gamma \sum_{i=1}^{m} \sum_{k=1}^{p} c_{i} g_{ik}^{2} + \delta \sum_{k=1}^{p} \sum_{j=1}^{n} d_{j} f_{kj}^{2} \qquad (8)$$

In this expression, $e_{ij} = x_{ij} - \tilde{x}_{ij}$ is the residual that appears in eq 5, α and β are the penalty coefficients which prevent the elements of G and F^T from becoming negative, whereas the regularization coefficients, γ and δ, reduce the rotational freedom in the obtained factors. The coefficients $c_{i}$ and $d_{j}$ are added in order to remove the scaling differences in the rows in G and in the columns of F^T. To solve the nonlinear least-squares problem expressed by eq 8, the classic Gauss–Newton method is used. The algorithm can only be applied to minimize the sum of squared function values and therefore, the logarithmic penalty functions need to be represented in quadratic forms. The core of the iterative algorithm is the update of the elements of the G and F^T matrices with two respective incremental matrices, the (m + n) × p elements of which are obtained by solving a system of m × n equations taking into account the logarithmic and penalty terms. The iterative procedure continues as long as the difference in values of the sum of weighted squared residuals obtained in two consecutive iterations is not smaller than a predefined value, for example, 10^−8. To start the iterative procedure, the elements of G, F, α, β, γ, and δ need to be initialized, while the scaling coefficients $c_{i}$ and $d_{j}$ are estimated during the iterative procedure. The speed of reaching the minimum value of the objective function and the quality of the solution obtained depend on the initialization and effective optimization of the many parameters of the PMF model. Specifically, in order to perform the PMF calculations in this work we used the algorithm described by Lu and Wu³¹ and the programming guide presented therein to optimize the different steps within the iterative procedure. As the authors report, this algorithm may differ in the way the optimization of the PMF parameters is performed from the one proposed by Paatero and Tapper³,³² and the one implemented in the software that can be freely downloaded from the webpage (http://www.epa.gov/heasd/products/pmf/pmf.html) of the U.S. Environmental Protection Agency. Details related to the initialization and optimization of the PMF parameters which were used in this study will be described in the Results and Discussion section. For comparative reasons, results obtained using the EPA PMF program for solving the PMF problem are also presented. The so-called multilinear engine (ME-2) program developed by Paatero³³ is the core of the available U.S. EPA software.
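To make the structure of eq 8 explicit, the following sketch only evaluates the enhanced objective Q for given factors; it does not implement the Gauss–Newton optimization described above. The function name `pmf_objective` is hypothetical, and defaulting the scaling coefficients c_i and d_j to one is an assumption of this sketch (in the algorithm they are estimated during the iterations). G and F^T must be strictly positive for the logarithms to be defined.

```python
import numpy as np

def pmf_objective(X, G, Ft, sigma, alpha, beta, gamma, delta, c=None, d=None):
    """Evaluate Q of eq 8 for strictly positive G (m x p) and F^T (p x n)."""
    m, p = G.shape
    n = Ft.shape[1]
    c = np.ones(m) if c is None else np.asarray(c, dtype=float)   # row scaling c_i
    d = np.ones(n) if d is None else np.asarray(d, dtype=float)   # column scaling d_j
    E = X - G @ Ft                                    # residuals e_ij
    fit = ((E / sigma) ** 2).sum()                    # weighted sum of squares (eq 5)
    barrier = -alpha * np.log(G).sum() - beta * np.log(Ft).sum()  # logarithmic penalties
    ridge = (gamma * (c[:, None] * G ** 2).sum()
             + delta * (d[None, :] * Ft ** 2).sum())  # regularization terms
    return float(fit + barrier + ridge)

# Usage on a tiny example where each term can be checked by hand.
G_demo = np.ones((3, 2))
Ft_demo = np.ones((2, 4))
X_demo = G_demo @ Ft_demo                             # exact fit, so the first term is 0
Q_demo = pmf_objective(X_demo, G_demo, Ft_demo, np.ones((3, 4)),
                       alpha=0.5, beta=0.5, gamma=1.0, delta=2.0)
```

In the example all logarithms are zero, so Q reduces to γ·(3·2) + δ·(2·4) = 6 + 16 = 22.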

EXPERIMENTAL SECTION

All calculations, except those with EPA PMF, were performed with in-house routines implemented in MATLAB 7.0 (R14) on a personal computer (Intel(R) Pentium(R) M, 1.60 GHz with 2 GB RAM) using the Microsoft Windows XP (service pack 2) operating system. A general MATLAB code for the MCR-ALS can also be obtained.³⁴

The available EPA PMF software (v3.0) was used to obtain results from the multilinear engine, ME-2, program.

Simulation of the Data Sets. The capability of the methods described earlier was investigated using several simulated data sets. Four data sets, each of rank four (i.e., with four factors) containing 30 samples and 50 variables were simulated. Each error-free matrix, Y, was generated by multiplying a 30 × 4 matrix, G, of elements from a log-normal distribution (mean = 0.01, standard deviation = 1) by a 4 × 50 matrix, F^T, also drawn from a log-normal distribution with the same parameters. The pairwise correlation between the underlying composition profiles in F^T was set to 0.0, 0.3, 0.6, or 0.9. Simulation of the composition profiles with a joint distribution of a definite correlation structure was performed using a copula function.³⁵
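One possible way to generate such correlated log-normal profiles is sketched below. The text does not say which copula was used, so the Gaussian copula here is an assumption, as is the reading of "mean = 0.01, standard deviation = 1" as the parameters of the underlying normal distribution. The function name `simulate_profiles` and the seed are hypothetical.

```python
import numpy as np

def simulate_profiles(p, n, rho, mu=0.01, sd=1.0, rng=None):
    """Draw p log-normal profiles of length n whose Gaussian scores have pairwise correlation rho."""
    rng = np.random.default_rng() if rng is None else rng
    R = np.full((p, p), rho) + (1.0 - rho) * np.eye(p)   # equicorrelation matrix
    L = np.linalg.cholesky(R)
    Z = L @ rng.standard_normal((p, n))                  # correlated standard normals
    return np.exp(mu + sd * Z)                           # log-normal marginals

rng = np.random.default_rng(0)
Ft = simulate_profiles(4, 50, rho=0.6, rng=rng)          # 4 x 50 composition profiles
G = rng.lognormal(mean=0.01, sigma=1.0, size=(30, 4))    # 30 x 4 contributions
Y = G @ Ft                                               # error-free data, 30 x 50
```

With this construction rho is the correlation on the Gaussian scale; the Pearson correlation between the resulting log-normal profiles is generally somewhat lower, so matching a target correlation on the original scale would need an adjustment.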

Errors at different levels and with various structures were added to each of the four error-free data sets with different correlation structures. Here, we studied only cases of uncorrelated measurement errors.¹⁶ First, measurement errors that were i.i.d. normal with standard deviations of 1%, 5%, or 10% of the maximum value of the corresponding error-free data matrix were considered. For this purpose, each 30 × 50 matrix of standard deviations (1%, 5%, or 10%) was multiplied element-by-element by a matrix, the elements of which were generated from the normal distribution, N(0,1), (mean = 0, standard deviation = 1). Each error matrix was then added to each of the error-free data sets to obtain 12 data set configurations with different correlation structures and different levels of noise. Three data sets were simulated from each type resulting in a total of 36 sets.

Next, the most common case where the measurements had uncorrelated proportional errors was considered. Three matrices of standard deviations with fixed proportions of 5%, 10%, and 30% of the measurement values were simulated. Again each of the three standard deviation matrices was multiplied by a matrix with normally distributed elements and the error matrices were added to each of the error-free data to obtain 12 data set configurations with proportional error structures. For consistency of the study, three data sets were also generated from each type.

Finally, data sets with constant and proportional errors were also simulated. The elements of each matrix of standard deviations, $\sigma_{ij}$, in this case, were obtained as the square root of the sum of the squares of the constant part, $a$, and the proportional part, $b\,y_{ij}$, using the following measurement error model:

$$\sigma_{ij} = \sqrt{a^{2} + b^{2} y_{ij}^{2}} \qquad (9)$$

The constant part was taken to be 1% of the maximum value of the error-free data, Y, while the proportional part was calculated as 5%, 10%, or 30% of the elements, $y_{ij}$, in the error-free data matrix. As was done previously, the final error matrix in each case was found by multiplying element-by-element the matrix of normally distributed numbers by the matrix of standard deviations and was added to the error-free data sets giving 12 data set configurations. Again three data sets were considered from each configuration.

A total of 108 data sets were simulated in order to compare the capability of the methods.

Figure 1. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCR-WALS and for the composition profiles, F^T, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets with independent and identically distributed (i.i.d.) normal measurement errors. The radius and the intensity of color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error.

Description of a Real Data Set. The real environmental data set used here for an illustrative purpose is a part of a larger data set which was analyzed previously using other chemometric approaches.³⁶,³⁷ The data set contains concentrations of 13 chemical components (K⁺, Ca²⁺, Mg²⁺, NO₃⁻, SO₄²⁻, C, Cd, Cu, Fe, Mn, Pb, V, Zn) and the particulate matter mass measured in spring, summer, autumn, and winter, in five different particle size fractions (PM0.04–PM0.1, PM0.1–PM0.4, PM0.4–PM1.6, PM1.6–PM6.4, PM6.4–PM25) at Arnoldstein, which is located in the Austrian province of Carinthia. The water-soluble ions K⁺, Ca²⁺, Mg²⁺, NO₃⁻, SO₄²⁻ were determined using two ion-chromatographic systems, whereas the concentrations of heavy metals were analyzed using atomic absorption spectrometry.³⁸ The analyzed data set is of the dimensions 20 × 14 and the measurements are presented as values obtained by averaging the respective parameters analyzed in quadruplicate samples of the five size fractions collected in four seasonal sampling campaigns. Uncertainties of the data set were estimated as the standard deviations of those mean values and were considered in the analyses with PMF, EPA PMF, and MCR-WALS. In fact, these uncertainties are associated with the different sources of variability related to sampling procedures, preparation, and measurement. In general, a proportional error model is considered.
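For reference, the three uncorrelated error structures used for the simulated data in this section (i.i.d., proportional, and constant plus proportional, eq 9) can be sketched as follows. The function names are hypothetical and the example matrix is illustrative; the levels follow the percentages quoted in the text.

```python
import numpy as np

def sigma_iid(Y, level):
    """Constant standard deviation: a fraction `level` of the maximum of Y."""
    return np.full_like(Y, level * Y.max(), dtype=float)

def sigma_proportional(Y, b):
    """Standard deviation proportional to each element of Y."""
    return b * Y

def sigma_combined(Y, a_level, b):
    """Eq 9: constant part a (fraction of max of Y) combined with proportional part b*y_ij."""
    a = a_level * Y.max()
    return np.sqrt(a ** 2 + (b * Y) ** 2)

def add_noise(Y, sigma, rng):
    """Multiply sigma element-by-element by N(0,1) draws and add the result to Y."""
    return Y + sigma * rng.standard_normal(Y.shape)

# Usage on a small illustrative error-free matrix.
rng = np.random.default_rng(0)
Y_demo = np.array([[1.0, 2.0], [3.0, 4.0]])
S_demo = sigma_combined(Y_demo, a_level=0.01, b=0.05)   # 1% constant, 5% proportional
X_demo = add_noise(Y_demo, S_demo, rng)
```

Note that with additive normal noise the simulated measurements can become negative when the noise level is high relative to small elements of Y.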

RESULTS AND DISCUSSION

First, the results of the simulation study will be discussed andthen a comparison of the performance of PMF and MCR-WALSwill be presented for the real data set described earlier.Results from the Simulation Study. All data sets were

simulated with complexity four. In order to limit the problemof the rotational ambiguity of the solutions, non-negativity con-straints were imposed to G and FT and the known compositionprofiles FT were used as an initialization matrix in each of theapproaches. The PMF algorithm by Lu and Wu is not verysensitive to the selected initial values because the optimization ofthe regularization and penalty coefficients is performed in adynamic way. The solution with the lowest sum of weightedsquared residuals was used for the final comparison. Since thetrue contribution and composition profiles are known before-hand, one way to compare the performance of the methods is toestimate how well those profiles are recovered by each of themethods run under the same constraints. Therefore, as a figure ofmerit in this comparison, the root mean square, RMS, differencebetween each profile obtained from each of the methods andthe respective true profile is calculated and then averaged overthe four G or FT factors. To be precise, one RMS error value

characterizes the resolution of four G or FT profiles for eachmethod. All profiles including the true ones are normalized tounit length in order to obtain RMS errors, at comparablemagnitudes. For each data combination of correlation, leveland type of noise, three data sets were simulated and therefore,the final RMS errors are presented as mean values of the triplicateestimations. Another, but similar method of a comparison is touse the correlation coefficient estimated between the respectiveprofiles obtained from each of the two methods, but thevisualization of such a comparison is difficult to display for allof the data variants considered in this simulation study, andtherefore, only a discussion of such results will be providedfurther in the text.The influence of noise and correlation levels on the root mean

square, RMS, errors averaged for the contribution and com-position profiles obtained from MCR-ALS, PMF, EPA PMF, orMCR-WALS for data sets with i.i.d normal errors are presentedin Figure 1.The horizontal axis of the maps shows the level of noise,

whereas the vertical axis presents the strength of the pairwisecorrelation between the underlying composition profiles, FT.The radius and the intensity of the color (in grayscale) of eachcircle marker are proportional to the magnitude of the estimatedRMS error.Comparable RMS error values were obtained for the con-

tribution and composition profiles (see Figure 1), which indi-cated the same performance of all methods for data sets withuniformly distributed measurement errors. As was expected, theprofiles are recovered with higher RMS errors when data havehigher levels of measurement errors. It is worth rememberingthat errors at a level of 5% or 10% of the maximum value of the

Figure 2. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCR-WALS and for the composition profiles, FT, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets withmeasurements that have independent proportional errors. The radius and the intensity of color (in grayscale) of each circle marker are proportional tothe magnitude of the estimated RMS error.

Page 6: A Comparison of Positive Matrix Factorization and the Weighted Multivariate Curve Resolution Method. Application to Environmental Data

10107 dx.doi.org/10.1021/es201024m |Environ. Sci. Technol. 2011, 45, 10102–10110

Environmental Science & Technology ARTICLE

corresponding error-free data matrix can already be considered as an extreme case under normal operating conditions. It is also not surprising that higher noise and higher correlation imply greater difficulty in recovering the true profiles by any method, as can be seen from the magnitude of the RMS errors shown in Figure 1. The pairwise correlation coefficients calculated between the corresponding profiles obtained from PMF and MCR-WALS are ca. 0.99, which supports the conclusion that the two methods resolve the same contribution and composition profiles for data with i.i.d. measurement errors. Somewhat lower correlation coefficients, ca. 0.96, are estimated between the respective contribution and composition profiles of PMF and EPA PMF, which indicates some small differences between the two solutions.

The results for the estimated RMS errors for data sets with measurements that have uncorrelated proportional errors are presented in Figure 2.

Compared with the case of i.i.d. normal errors in the data, the MCR-ALS method (see Figure 2a and e) has greater difficulty than PMF, EPA PMF, or MCR-WALS in recovering the true profiles for data with higher levels of proportional errors (e.g., 30%), as indicated by the higher RMS errors observed for MCR-ALS. This is to be expected, since the uncertainty information is not included in the steps of the ALS algorithm. For lower levels of proportional noise, up to 5%, MCR-ALS gives RMS errors comparable with those obtained from PMF, EPA PMF, or MCR-WALS. Although MCR-WALS shows somewhat lower recovery RMS errors than PMF and EPA PMF (compare Figure 2b and c with Figure 2d, and Figure 2f and g with Figure 2h), the performance of the methods is comparable. The pairwise correlation coefficients calculated between the corresponding profiles from the PMF and MCR-WALS algorithms, as well as between the respective profiles obtained from EPA PMF and MCR-WALS, are close to 0.99, which also indicates that virtually the same profiles are recovered.

The real advantage of the methods that incorporate the uncertainty information in obtaining the solution is demonstrated for data sets with constant and proportional error parts (see Figure 3).

Compared to the recovery RMS errors obtained from PMF, EPA PMF, and MCR-WALS, relatively higher RMS errors are found for MCR-ALS in all data sets simulated with different levels of measurement errors and correlation (compare Figure 3a with Figure 3b, c, and d, as well as Figure 3e with Figure 3f, g, and h). Again, there is the tendency, observed earlier, that the recovery of contribution and composition profiles in data with a higher correlation and higher measurement errors presents greater difficulties for all methods. The same magnitude of RMS errors is observed in Figure 3b, c, and d, as well as in Figure 3f, g, and h, for PMF, EPA PMF, and MCR-WALS, respectively, which suggests, once more, that virtually the same contribution and composition profiles are obtained. To support this finding, the correlation coefficients between the respective profiles of each pair of methods were calculated once again and found to be above 0.98 in all cases.

Figure 3. Root mean square, RMS, errors averaged for the contribution, G, profiles obtained from (a) MCR-ALS, (b) PMF, (c) EPA PMF, (d) MCR-WALS and for the composition profiles, F^T, obtained from (e) MCR-ALS, (f) PMF, (g) EPA PMF, and (h) MCR-WALS for data sets with constant and proportional error parts. The paired numbers on the horizontal axis of the maps show the magnitudes of the constant and proportional error parts. The radius and the color (in grayscale) of each circle marker are proportional to the magnitude of the estimated RMS error.

Results from the Real Data Study. The aim of this study was to compare the contribution and composition profiles obtained from MCR-ALS, MCR-WALS, and the PMF algorithm by Lu and Wu for real noisy environmental data, rather than to interpret the profiles obtained, which has already been done in several studies published elsewhere.36,37 Since the results of repeated measurements were available, information about the uncertainty was incorporated during the analyses with MCR-WALS and PMF. The data set of interest was first preprocessed to eliminate the differences in measurement units by scaling each variable to unit standard deviation. The uncertainty data were also scaled using the standard deviations of the original measurement data. This procedure gives the variables the same importance during the analysis. It is worth noting that centering, which is usually performed on environmental data in order to remove the offset, was not applied because non-negativity constraints were imposed on both contribution and composition profiles.

The first step in an analysis is to decide on the number of factors in the model. Here, one can use the classic PCA method to account for the amount of variability explained by each factor in the data. Due to the weighting scheme in MCR-WALS and PMF, the selection of complexity can be performed using a plot of the sum of weighted residuals as a function of the model's complexity. With PCA, three factors explain 92.5% of the variance, whereas with MCR-WALS and PMF a relatively large reduction in the weighted sum of squared residuals for models with complexities larger than three was not observed. Therefore, the solutions with three factors were further considered in this study.

As was mentioned earlier, the initialization step in the algorithms can be performed in different ways. To make it more likely that the optimal solution is found, the numerical values of the three composition profiles, F^T (3 × 14), were generated completely at random at each rerun (e.g., 300 times) of the algorithms. The final solution is the one with the lowest value of the respective objective function. The results of the comparative study are presented in Figure 4.

All of the panels in Figure 4 show that the contribution and composition profiles obtained from MCR-WALS (dash-dotted red line) overlap closely with the respective profiles found with PMF (solid black line). This is also indicated by the values of the pairwise correlation coefficients presented in Table 1, which are ca. 0.99.
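As an illustration, pairwise correlation coefficients of the kind reported in Table 1 could be computed along the following lines. This is a minimal sketch, not the procedure used in the study: the function name `matched_profile_correlations` and the exhaustive pairing step are assumptions introduced here. The pairing is included because the order in which a factorization returns its factors is arbitrary. The Pearson coefficient is unaffected by a positive rescaling of a profile, so the intensity ambiguity between two solutions does not influence the result.

```python
from itertools import permutations

import numpy as np


def matched_profile_correlations(profiles_a, profiles_b):
    """Pair the profiles of two solutions and return their Pearson correlations.

    profiles_a, profiles_b: arrays of shape (n_points, k), one profile per column.
    Returns a list of (index_in_a, index_in_b, r) tuples.
    """
    a = np.asarray(profiles_a, dtype=float)
    b = np.asarray(profiles_b, dtype=float)
    k = a.shape[1]
    # Cross block of the stacked correlation matrix: r[i, j] = corr(a[:, i], b[:, j]).
    r = np.corrcoef(a.T, b.T)[:k, k:]
    # Try every one-to-one pairing (cheap for k = 3) and keep the pairing
    # with the highest total correlation.
    best = max(permutations(range(k)),
               key=lambda p: sum(r[i, p[i]] for i in range(k)))
    return [(i, j, float(r[i, j])) for i, j in enumerate(best)]
```

For the three-factor models considered here, `profiles_a` and `profiles_b` would hold, for example, the three contribution profiles from MCR-WALS and from PMF; the same call applies to the composition profiles when they are arranged one profile per column.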

Thus, both methods found virtually the same profiles for the studied data. Since the data had a high level of proportional errors, it was expected that the profiles obtained from the traditional MCR-ALS approach would differ from those obtained from MCR-WALS and PMF. These differences are also reflected in the relatively lower values of the correlation coefficients calculated between the MCR-ALS profiles and the respective PMF or MCR-WALS profiles.

Figure 4. Comparison of (a) the first contribution (g1) profiles, (b) the second contribution (g2) profiles, (c) the third contribution (g3) profiles, (d) the first composition (f1) profiles, (e) the second composition (f2) profiles, and (f) the third composition (f3) profiles obtained from the MCR-ALS, PMF, and MCR-WALS methods applied to the real data set.

Table 1. Pairwise Correlation Coefficients Calculated between the Respective Contribution (gk) and Composition (fk) Profiles Obtained from the MCR-ALS, MCR-WALS, and PMF Models Applied to Measurement Data with Complexity Three

                     g1     g2     g3     f1     f2     f3
MCR-WALS/PMF         0.989  0.999  0.999  0.995  0.998  0.999
MCR-ALS/PMF          0.616  0.952  0.929  0.558  0.844  0.814
MCR-ALS/MCR-WALS     0.708  0.961  0.937  0.606  0.864  0.821

The general conclusion of the simulation and the real data studies is that MCR-WALS and PMF (or EPA PMF) extract virtually the same contribution and composition profiles for data sets with i.i.d. normal measurement errors and for data with measurements that have uncorrelated proportional errors or constant and proportional error parts. This was confirmed by the high values of the correlation coefficients (r > 0.98) calculated between the respective contribution profiles obtained from MCR-WALS and PMF or MCR-WALS and EPA PMF, as well as between the composition profiles obtained using each pair of methods. As was expected, a higher level of noise and a higher correlation present greater difficulty in recovering the true profiles by any of the methods. With the conventional MCR-ALS method, higher recovery errors were obtained for the profiles extracted from data with proportional errors or data with constant and proportional error parts in comparison with MCR-WALS, EPA PMF, and PMF.

The major outcome of this study is that the MCR-WALS, EPA PMF, and PMF methods give the same results for the major factors. MCR-WALS and the two versions of the PMF method share some attractive features, such as the possibility to (i) incorporate information about the measurement uncertainty, which makes them capable of handling missing values and data below the reporting limits, and (ii) impose different constraints on the obtained solutions. However, compared to MCR-WALS, PMF and EPA PMF are considerably more complicated algorithms, which require the optimization of a number of parameters. The convergence speed and the quality of the solution greatly depend on the way in which the algorithmic steps are optimized.31 Moreover, the imposition of different constraints such as selectivity, local rank, equality and inequality, or unimodality (i.e., for reaction and chromatographic profiles) in PMF is also not an easy task. MCR-WALS, as an alternating least-squares algorithm, is a simpler procedure that does not require the optimization of many initial parameters, and the implementation of the above-mentioned constraints is not difficult. Nevertheless, both methods share some common problems that require further study, such as the choice of the model's complexity, obtaining optimal profiles, and controlling the rotational freedom of the solution.39 In general, having knowledge about the measurement errors in the data, one could choose the model's complexity using MLPCA.1
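To make the alternating least-squares idea concrete, the following is a minimal sketch of an element-wise weighted, non-negativity-constrained ALS loop with random restarts. It is an illustration under stated assumptions, not the MCR-WALS implementation compared in this study and not the PMF algorithm of Lu and Wu: the function names `weighted_q`, `weighted_nn_als`, and `best_of_restarts` are hypothetical, the non-negative least-squares steps use SciPy's `nnls` rather than the FNNLS routine cited in the references, and the weighting is applied element by element through the uncertainty matrix S. The objective is the sum of squared residuals divided by their uncertainties, the quantity referred to above as the weighted sum of squared residuals.

```python
import numpy as np
from scipy.optimize import nnls


def weighted_q(X, S, G, F):
    """Sum of squared residuals, each scaled by its measurement uncertainty."""
    return float(np.sum(((X - G @ F) / S) ** 2))


def weighted_nn_als(X, S, k, n_iter=50, seed=None):
    """Illustrative weighted ALS with non-negative G (n x k) and F (k x m)."""
    rng = np.random.default_rng(seed)
    n, m = X.shape
    F = rng.random((k, m))  # random start for the composition profiles
    G = np.zeros((n, k))
    history = []
    for _iteration in range(n_iter):
        # Contributions: one weighted NNLS problem per sample (row of X).
        for i in range(n):
            G[i] = nnls(F.T / S[i][:, None], X[i] / S[i])[0]
        # Compositions: one weighted NNLS problem per variable (column of X).
        for j in range(m):
            F[:, j] = nnls(G / S[:, j][:, None], X[:, j] / S[:, j])[0]
        history.append(weighted_q(X, S, G, F))
    return G, F, history


def best_of_restarts(X, S, k, n_restarts=10, n_iter=50, seed=0):
    """Rerun from random starts and keep the fit with the lowest objective."""
    runs = [weighted_nn_als(X, S, k, n_iter, seed + r) for r in range(n_restarts)]
    return min(runs, key=lambda run: run[2][-1])
```

Because each inner step solves its block problem exactly, the objective should not increase from one iteration to the next, which gives a simple check on an implementation. Plotting the final objective from `best_of_restarts` against k is one way to reproduce the kind of complexity plot described in the real data study.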

AUTHOR INFORMATION

Corresponding Author
*Phone: +48-32-359-1219; fax: +48-32-259-9978; e-mail: [email protected].

ACKNOWLEDGMENT

I.S. gratefully acknowledges the financial support of the Human Capital Programme at the University of Silesia, Katowice, Poland. We are grateful to the anonymous reviewers of this article for their insightful comments.

REFERENCES

(1) Wentzell, P. D.; Andrews, D. T.; Hamilton, D. C.; Faber, K.; Kowalski, B. R. Maximum likelihood principal component analysis. J. Chemom. 1997, 11, 339–366.

(2) Wentzell, P. D.; Karakach, T. K.; Roy, S.; Martinez, M. J.; Allen, C. P.; Werner-Washburne, M. Multivariate curve resolution of time course microarray data. BMC Bioinf. 2006, 7, 343.

(3) Paatero, P.; Tapper, U. Positive matrix factorization: A non-negative factor model with optimal utilization of error estimates of data values. Environmetrics 1994, 5, 111–126.

(4) Juntto, S.; Paatero, P. Analysis of daily precipitation data by positive matrix factorization. Environmetrics 1994, 5, 127–144.

(5) Wold, S.; Esbensen, K.; Geladi, P. Principal component analysis. Chemom. Intell. Lab. Syst. 1987, 2, 37–52.

(6) Massart, D. L.; Vandeginste, B. G. M.; Buydens, L. M. C.; de Jong, S.; Lewi, P. J.; Smeyers-Verbeke, J. Handbook of Chemometrics and Qualimetrics: Part A; Elsevier: Amsterdam, The Netherlands, 1997.

(7) Rutan, S. C.; de Juan, A.; Tauler, R. Introduction to multivariate curve resolution. In Comprehensive Chemometrics; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; Vol. 2, pp 249.

(8) de Juan, A.; Tauler, R. Chemometrics applied to unravel multicomponent processes and mixtures. Revisiting latest trends in multivariate resolution. Anal. Chim. Acta 2003, 500, 195–210.

(9) Malinowski, E. R. Factor Analysis in Chemistry; John Wiley & Sons: New York, NY, 1991.

(10) Armstrong, J. S. Derivation of theory by means of factor analysis or Tom Swift and his electric factor analysis machine. Am. Stat. 1967, 21, 17–21.

(11) Maeder, M.; Neuhold, Y.-M. Kinetic modeling of multivariate measurements with nonlinear regression. In Practical Guide to Chemometrics; Gemperline, P. J., Ed.; CRC Press, Taylor & Francis Group: New York, 2006; pp 218.

(12) Peré-Trepat, E.; Lacorte, S.; Tauler, R. Alternative calibration approaches for LC-MS quantitative determination of coeluted compounds in complex environmental mixtures using multivariate curve resolution. Anal. Chim. Acta 2007, 595, 228–237.

(13) Tauler, R.; Viana, M.; Querol, X.; Alastuey, A.; Flight, R. M.; Wentzell, P. D.; Hopke, P. K. Comparison of the results obtained by four receptor modelling methods in aerosol source apportionment studies. Atmos. Environ. 2009, 43, 3989–3997.

(14) Gemperline, P. J. Target Transformation Factor Analysis with linear inequality constraints applied to spectroscopic-chromatographic data. Anal. Chem. 1986, 58, 2656–2663.

(15) Gemperline, P. J.; Cash, E. Advantages of soft versus hard constraints in self-modeling curve resolution problems. Alternating least squares with penalty functions. Anal. Chem. 2003, 75, 4236–4243.

(16) Wentzell, P. D. Other topics in soft-modeling: maximum likelihood-based soft-modeling methods. In Comprehensive Chemometrics; Brown, S. D., Tauler, R., Walczak, B., Eds.; Elsevier: Amsterdam, 2009; Vol. 2, pp 507.

(17) Kim, E.; Hopke, P. K. Improving source identification of fine particles in a rural northeastern U.S. area utilizing temperature-resolved carbon fractions. J. Geophys. Res. 2004, 109, D09204.

(18) Paatero, P.; Tapper, U.; Aalto, P.; Kulmala, M. Matrix factorization methods for analysing diffusion battery data. J. Aerosol Sci. 1991, 22, 273–276.

(19) Anttila, P.; Paatero, P.; Tapper, U.; Järvinen, O. Source identification of bulk wet deposition in Finland by positive matrix factorization. Atmos. Environ. 1995, 29, 1705–1718.

(20) Brinkman, G.; Vance, G.; Hannigan, M. P.; Milford, J. B. Use of synthetic data to evaluate positive matrix factorization as a source apportionment tool for PM2.5 exposure data. Environ. Sci. Technol. 2006, 40, 1892–1901.

(21) Logue, J. M.; Small, M. J.; Robinson, A. L. Identifying priority pollutant sources: apportioning air toxics risks using positive matrix factorization. Environ. Sci. Technol. 2009, 43, 9439–9444.

(22) Yakovleva, E.; Hopke, P. K.; Wallace, L. Receptor modeling assessment of particle total exposure assessment methodology data. Environ. Sci. Technol. 1999, 33, 3645–3652.

(23) de Juan, A.; van den Bogaert, B.; Cuesta Sánchez, F.; Massart, D. L. Application of the needle algorithm for exploratory analysis and resolution of HPLC-DAD data. Chemom. Intell. Lab. Syst. 1996, 33, 133–145.

(24) Cuesta Sánchez, F.; Toft, J.; van den Bogaert, B.; Massart, D. L. Orthogonal projection approach applied to peak purity assessment. Anal. Chem. 1996, 68, 79–85.

(25) Windig, W.; Guilment, J. Interactive self-modeling mixture analysis. Anal. Chem. 1991, 63, 1425–1432.

(26) Bro, R.; de Jong, S. A. A fast non-negativity-constrained least squares algorithm. J. Chemom. 1997, 11, 393–401.

(27) FCNNLS.M; http://www.cc.gatech.edu/~hpark/software/fcnnls.m (accessed on October 21, 2011).

(28) Tauler, R. Calculation of maximum and minimum band boundaries of feasible solutions for species profiles obtained by multivariate curve resolution. J. Chemom. 2001, 15, 627–646.

(29) Abdollahi, H.; Maeder, M.; Tauler, R. Calculation and meaning of feasible band boundaries in multivariate curve resolution of a two-component system. Anal. Chem. 2009, 81, 2115–2122.

(30) Jaumot, J.; Tauler, R. A user friendly MATLAB program for the evaluation of rotation ambiguities in multivariate curve resolution. Chemom. Intell. Lab. Syst. 2010, 103, 96–107.


(31) Lu, J.; Wu, L. Technical details and programming guide for a general two-way positive matrix factorization algorithm. J. Chemom. 2004, 18, 519–525.

(32) Paatero, P. Least squares formulation of robust non-negative factor analysis. Chemom. Intell. Lab. Syst. 1997, 37, 23–35.

(33) Paatero, P. The multilinear engine: A table-driven least squares program for solving multilinear problems, including the n-way parallel factor analysis model. J. Comput. Graphical Stat. 1999, 8, 1–35.

(34) http://www.ub.edu/mcr/web_mcr/download.html (accessed on March 21, 2011).

(35) Nelsen, R. B. An Introduction to Copulas; Springer: New York, NY, 2006.

(36) Stanimirova, I.; Simeonov, V. Modeling of environmental four-way data from air quality control. Chemom. Intell. Lab. Syst. 2005, 77, 115–121.

(37) Kroonenberg, P. M. Applied Multiway Data Analysis; John Wiley & Sons: Hoboken, NJ, 2008.

(38) Lavric, T. Composition and Sources of Aerosols (PM10) in the border region Carynthia–Slovenia–Italy. Ph.D. Dissertation, Technical University of Vienna, Vienna, 2002.

(39) Paatero, P.; Hopke, P.; Song, X.-H.; Ramadan, Z. Understanding and controlling rotations in factor analytic models. Chemom. Intell. Lab. Syst. 2002, 60, 253–264.

