
Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Date post: 17-Jan-2016
Upload: dwain-julius-parks
Page 1: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Journal Club: Journal of Chemometrics

May 2010

August 23, 2010

Page 2: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

An efficient nonlinear programming strategy for PCA models with incomplete data sets

Rodrigo López-Negrete de la Fuente, Salvador García-Muñoz and Lorenz T. Biegler

J. Chemometrics 2010; 24: 301–311

Page 3: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Questions addressed:

– How to obtain the parameters of PCA models from incomplete data sets using a nonlinear programming (NLP) strategy.

– Why the nonlinear programming approach is better suited when large amounts of data are missing.

Page 4: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with full data-set:

– Given: X = T P' + Rx, where T: scores (projections), P: loadings, Rx: residuals

Problem 1:

Solution: the loading vector is the eigenvector associated with the largest eigenvalue of the covariance matrix of X (proportional to X'X for mean-centered X). The largest eigenvalue corresponds to the largest variance.

Problem 2:

Solution: it has the same form as the solution of the maximization problem.
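For orientation, the two problems can be sketched in their standard textbook form for a single component. This uses assumed notation (mean-centered X, loading vector p, score vector t) and may differ in detail from the numbering and symbols used in the paper:

```latex
% Problem 1 (maximization of captured variance), assumed standard form
\max_{p} \; p^{\top} X^{\top} X \, p
\quad \text{s.t.} \quad p^{\top} p = 1
\qquad \Longrightarrow \qquad X^{\top} X \, p = \lambda \, p

% Problem 2 (minimization of the residual), assumed standard form
\min_{t,\,p} \; \lVert X - t \, p^{\top} \rVert_F^{2}
\quad \text{s.t.} \quad p^{\top} p = 1

% With t = X p and p^T p = 1, the residual becomes
\lVert X - X p \, p^{\top} \rVert_F^{2}
  = \operatorname{tr}\!\left( X^{\top} X \right) - p^{\top} X^{\top} X \, p
```

Since the trace term does not depend on p, minimizing the residual is the same as maximizing the captured variance, which is why both solutions take the same eigenvector form.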

Page 5: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with full data-set:

• Using SVD:

The two problems have the same solution.
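As a quick numerical check of this equivalence, the sketch below compares the leading eigenvector of X'X with the first right singular vector from the SVD. The random test matrix and all variable names are illustrative assumptions, not taken from the paper.

```python
import numpy as np

# Illustrative data (assumed): 50 observations, 5 correlated variables.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)  # mean-center each column

# Route 1: eigenvector of X'X with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)  # eigenvalues in ascending order
p_eig = eigvecs[:, -1]

# Route 2: first right singular vector of X.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
p_svd = Vt[0]

# The two loading vectors agree up to sign.
same_direction = np.isclose(abs(p_eig @ p_svd), 1.0)

# The largest eigenvalue equals the squared largest singular value.
same_variance = np.isclose(eigvals[-1], s[0] ** 2)

# The rank-one residual equals the sum of the remaining squared singular values.
t = X @ p_svd
residual = np.linalg.norm(X - np.outer(t, p_svd)) ** 2
residual_expected = np.sum(s[1:] ** 2)
```

The sign ambiguity is expected: a loading vector and its negative describe the same principal direction.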

Page 6: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with full data-set:

Principal Components via NIPALS algorithm:
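The NIPALS iteration for complete data can be sketched as follows. This is a minimal textbook-style version; the function name `nipals_pca`, the starting vector, the tolerance, and the demo data are all assumptions made for illustration.

```python
import numpy as np

def nipals_pca(X, n_components, tol=1e-10, max_iter=1000):
    """Sequential PCA by NIPALS on a complete matrix (illustrative sketch)."""
    X = np.asarray(X, dtype=float)
    E = X - X.mean(axis=0)                 # mean-centered working residual
    n, m = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        # Start from the column with the largest variance (a common choice).
        t = E[:, np.argmax(E.var(axis=0))].copy()
        for _ in range(max_iter):
            p = E.T @ t / (t @ t)          # regress columns of E on t
            p /= np.linalg.norm(p)         # normalize the loading
            t_new = E @ p                  # regress rows of E on p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a] = t
        P[:, a] = p
        E = E - np.outer(t, p)             # deflate before the next component
    return T, P

# Demo on assumed data with well-separated variances.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 4)) * np.array([10.0, 5.0, 2.0, 1.0])
T_demo, P_demo = nipals_pca(X_demo, 2)

# Reference loading from the SVD of the centered matrix.
_, _, Vt_demo = np.linalg.svd(X_demo - X_demo.mean(axis=0), full_matrices=False)
first_loading_match = abs(P_demo[:, 0] @ Vt_demo[0])
```

With complete data the loadings come out orthonormal and the first loading matches the SVD result up to sign.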

Page 7: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with incomplete data-set:

Taking the gradient with respect to t and p:
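A sketch of this step for a single component, in assumed notation: the index set O of observed entries, with O_i the observed columns of row i and O_j the observed rows of column j, is introduced here for illustration and may not match the paper's symbols.

```latex
% Least-squares objective over the observed entries only
J(t, p) = \sum_{(i,j) \in O} \left( x_{ij} - t_i \, p_j \right)^{2}

% Stationarity with respect to each score t_i
\frac{\partial J}{\partial t_i}
  = -2 \sum_{j \in O_i} \left( x_{ij} - t_i \, p_j \right) p_j = 0
\quad \Longrightarrow \quad
t_i = \frac{\sum_{j \in O_i} x_{ij} \, p_j}{\sum_{j \in O_i} p_j^{2}}

% Stationarity with respect to each loading p_j
\frac{\partial J}{\partial p_j}
  = -2 \sum_{i \in O_j} \left( x_{ij} - t_i \, p_j \right) t_i = 0
\quad \Longrightarrow \quad
p_j = \frac{\sum_{i \in O_j} x_{ij} \, t_i}{\sum_{i \in O_j} t_i^{2}}
```

Each update is an ordinary least-squares regression restricted to the entries that were actually observed.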

Page 8: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with incomplete data-set:

Principal Components via modified NIPALS algorithm:
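A minimal sketch of a NIPALS variant that skips missing entries is shown below, assuming missing values are encoded as NaN. The function name `nipals_missing`, the small denominator guard, and the demo data are illustrative assumptions; the paper's exact algorithm may differ in its details.

```python
import numpy as np

def nipals_missing(X, n_components, tol=1e-8, max_iter=500):
    """NIPALS with regressions restricted to observed entries (sketch)."""
    X = np.asarray(X, dtype=float)
    observed = ~np.isnan(X)
    W = observed.astype(float)                  # 1 where observed, 0 where missing
    col_means = np.nanmean(X, axis=0)           # means over observed values only
    E = np.where(observed, X - col_means, 0.0)  # centered, missing entries zeroed
    n, m = E.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = E[:, np.argmax((E ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            # p_j: regress column j on t using only the rows observed in column j.
            p = (E.T @ t) / np.maximum(W.T @ t ** 2, 1e-12)
            p /= np.linalg.norm(p)
            # t_i: regress row i on p using only the columns observed in row i.
            t_new = (E @ p) / np.maximum(W @ p ** 2, 1e-12)
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        T[:, a] = t
        P[:, a] = p
        E = np.where(observed, E - np.outer(t, p), 0.0)  # deflate observed entries
    return T, P

# Demo on assumed data: two latent directions embedded in 8 variables.
rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(8, 2)))
X_full = (rng.normal(size=(60, 2)) * np.array([5.0, 1.0])) @ Q.T
X_full += 0.01 * rng.normal(size=X_full.shape)

# Without missing values the method reduces to ordinary NIPALS.
T_full, P_full = nipals_missing(X_full, 2)
_, _, Vt_full = np.linalg.svd(X_full - X_full.mean(axis=0), full_matrices=False)
first_loading_match = abs(P_full[:, 0] @ Vt_full[0])

# With roughly 20% of the entries removed, the scores need not be orthogonal.
X_holes = X_full.copy()
X_holes[rng.random(X_full.shape) < 0.2] = np.nan
T_miss, P_miss = nipals_missing(X_holes, 2)
score_overlap = T_miss[:, 0] @ T_miss[:, 1]
```

Inspecting `score_overlap` on data with missing entries shows how far the scores drift from orthogonality, which this sequential scheme does not enforce.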

Page 9: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with incomplete data-set:

- X is the data matrix in which the missing elements have been zeroed out.

- Constraint (20b) forces the loadings to be orthonormal.

- Constraint (20c) makes the score vectors orthogonal to each other.

- Constraint (20d) forces the scores to have zero mean.

- If there are no missing values, problem (20) reduces to problem (4) (the minimization problem) for the first a principal components.

Let Yi,j = Xi,j + Zi,j, where Xi,j are the data values, set to zero for the missing elements, and Zi,j are the imputed values, which must be zero for the nonmissing elements.
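The decomposition Y = X + Z and the three constraints can be illustrated with a small sketch. The placeholder imputed value, the helper `constraint_violations`, and the test data are hypothetical and only show the structure; they do not reproduce the paper's NLP formulation.

```python
import numpy as np

# Assumed toy data with some entries marked missing (NaN).
rng = np.random.default_rng(2)
data = rng.normal(size=(6, 4))
data[rng.random(data.shape) < 0.3] = np.nan
observed = ~np.isnan(data)

X = np.where(observed, data, 0.0)   # data, zero at the missing positions
Z = np.where(observed, 0.0, 1.23)   # placeholder imputed values, zero where observed
Y = X + Z                           # completed matrix

def constraint_violations(T, P):
    """Largest absolute violation of each PCA constraint (hypothetical helper)."""
    a = P.shape[1]
    loadings_orthonormal = np.abs(P.T @ P - np.eye(a)).max()
    G = T.T @ T
    scores_orthogonal = np.abs(G - np.diag(np.diag(G))).max()
    scores_zero_mean = np.abs(T.mean(axis=0)).max()
    return loadings_orthonormal, scores_orthogonal, scores_zero_mean

# A PCA of complete, mean-centered data satisfies all three constraints.
full = rng.normal(size=(30, 5))
full = full - full.mean(axis=0)
U, s, Vt = np.linalg.svd(full, full_matrices=False)
T_ref = U[:, :2] * s[:2]            # scores
P_ref = Vt[:2].T                    # loadings
violations = constraint_violations(T_ref, P_ref)
```

Running the same check on scores and loadings from any missing-data method gives a direct measure of how well it respects the PCA model.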

Page 10: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Methods

• PCA with incomplete data-set:

Because the constrained problem is solved directly, the scores and loadings obtained with the NLP are orthogonal, as the PCA model requires. This is not true for the modified NIPALS.

Page 11: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

RESULTS

Numerical simulations were done by generating a data set with 1000 rows and 100 columns from a known four-dimensional latent space with added random Gaussian noise. Values were then removed to generate data sets with missing value percentages ranging from 1 to 70%.
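A data set of this kind could be generated along the following lines. The noise level, the uniformly random removal pattern, the chosen percentages, and the function name `simulate_incomplete` are assumptions; the slides do not state how the paper chose them.

```python
import numpy as np

def simulate_incomplete(n_rows=1000, n_cols=100, n_latent=4,
                        noise_sd=0.1, missing_fraction=0.3, seed=0):
    """Low-rank data plus Gaussian noise, with entries removed at random (sketch)."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n_rows, n_latent))      # latent scores
    P = rng.normal(size=(n_cols, n_latent))      # latent loadings
    X = T @ P.T + noise_sd * rng.normal(size=(n_rows, n_cols))
    # Remove an exact number of entries, chosen uniformly at random (assumed).
    n_missing = int(round(missing_fraction * X.size))
    idx = rng.choice(X.size, size=n_missing, replace=False)
    X_missing = X.copy()
    X_missing.flat[idx] = np.nan
    return X, X_missing

# One incomplete data set per missing percentage (illustrative grid).
datasets = {pct: simulate_incomplete(missing_fraction=pct / 100)[1]
            for pct in (1, 10, 30, 50, 70)}
missing_rates = {pct: np.isnan(Xm).mean() for pct, Xm in datasets.items()}
```

Using the same seed for every percentage keeps the underlying complete matrix identical, so differences between runs come only from the removed entries.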

Page 12: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

RESULTS

Page 13: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

RESULTS

Industrial Example:

- Data from 76 common pharmaceutical materials were made available (Pfizer Inc.), and the data span over 10 years of testing. Due to the reasons outlined above, approximately 61% of the data were missing.

- For this example, three principal components were used in all models for two reasons: a sudden drop in the eigenvalue for the fourth component (from 13 to 2) in the NIPALS model, and the very low percentage of the total variance for the fourth component (1.4%) in the NLP.

Page 14: Journal Club Journal of Chemometrics May 2010 August 23, 2010.

Conclusion

• The NLP solutions take less time and fewer iterations than the current state-of-the-art algorithms, while still satisfying the constraints of the PCA model.

• The current platform allows the potential inclusion of a large number of observations that otherwise would be excluded from the model building exercise, still yielding a robust model with desirable properties.

• In the presence of large amounts of missing data, this method reduces the computational time (and number of iterations) required to calculate the model parameters.

