Journal Club Journal of Chemometrics May 2010 August 23, 2010.

An efficient nonlinear programming strategy for PCA models with incomplete data sets

Rodrigo López-Negrete de la Fuente, Salvador García-Muñoz and Lorenz T. Biegler

J. Chemometrics 2010; 24: 301–311

Questions addressed:

– How to obtain the parameters of PCA models from incomplete data sets using a nonlinear programming (NLP) strategy.

– How the nonlinear programming approach is better suited when there are large amounts of missing values.

Methods

• PCA with full data-set:

– Given: X = TP’ + Rx, where T: scores (projections), P: loadings, Rx: residuals

Problem 1:

Solution: the loading vector is the eigenvector associated with the largest eigenvalue of the covariance matrix of X. The largest eigenvalue of XX’ corresponds to the largest variance.

Problem 2:

Solution: has the same form as the solution of the maximization problem.

Methods

• PCA with full data-set:

• Using SVD:

The two problems have the same solution.
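The equivalence can be checked numerically. The sketch below is illustrative only: it assumes the rows of X are observations (so the covariance matrix is proportional to X'X), uses made-up random data, and compares the leading eigenvector with the first right singular vector from the SVD.

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: 50 observations, 5 variables, mean-centred by column.
X = rng.normal(size=(50, 5)) @ rng.normal(size=(5, 5))
X = X - X.mean(axis=0)

# Maximization view: first loading = eigenvector of X'X with the largest eigenvalue.
eigvals, eigvecs = np.linalg.eigh(X.T @ X)   # eigh returns eigenvalues in ascending order
p_eig = eigvecs[:, -1]

# SVD view: X = U S V'; first loading = first right singular vector.
U, S, Vt = np.linalg.svd(X, full_matrices=False)
p_svd = Vt[0]

# Eigenvectors are only defined up to sign, so align signs before comparing.
same_direction = np.allclose(p_eig * np.sign(p_eig @ p_svd), p_svd, atol=1e-6)
# The largest eigenvalue of X'X equals the squared largest singular value.
same_variance = np.isclose(eigvals[-1], S[0] ** 2)

# Minimization view: squared residual left after removing the first component.
t = X @ p_svd
residual = np.linalg.norm(X - np.outer(t, p_svd)) ** 2
```

The residual equals the sum of the remaining squared singular values, which is why maximizing the captured variance and minimizing the residual select the same loading.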

Methods

• PCA with full data-set:

Principal Components via NIPALS algorithm:
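The algorithm itself is not reproduced in the text, so the following is a minimal sketch of the standard NIPALS iteration for complete data rather than the slide's exact listing. The function name `nipals`, the tolerance, the starting vector and the demo data are all assumptions.

```python
import numpy as np

def nipals(X, n_components, tol=1e-10, max_iter=500):
    """Sketch of NIPALS PCA for a complete, mean-centred matrix X."""
    X = np.array(X, dtype=float)          # work on a copy; it gets deflated
    n, m = X.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        # Start from the column with the largest variance (a common choice).
        t = X[:, np.argmax(X.var(axis=0))].copy()
        for _ in range(max_iter):
            p = X.T @ t / (t @ t)         # regress the columns of X on t
            p /= np.linalg.norm(p)        # normalise the loading
            t_new = X @ p                 # regress the rows of X on p
            converged = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if converged:
                break
        X -= np.outer(t, p)               # deflate before the next component
        T[:, a], P[:, a] = t, p
    return T, P

# Hypothetical demo data with three well-separated latent directions.
rng = np.random.default_rng(1)
loadings = np.linalg.qr(rng.normal(size=(6, 3)))[0]      # 6 x 3, orthonormal columns
X_demo = (rng.normal(size=(40, 3)) * [10.0, 5.0, 1.0]) @ loadings.T
X_demo -= X_demo.mean(axis=0)
T, P = nipals(X_demo, 2)
```

On complete data the loadings come out orthonormal and match the leading right singular vectors of X up to sign.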

Methods

• PCA with incomplete data-set:

Taking the gradient with respect to t and p

Methods

• PCA with incomplete data-set:

Principal Components via modified NIPALS algorithm:
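The slide's listing is not reproduced in the text. A common form of the modification, sketched here under that assumption, restricts each regression to the observed entries. Missing values are marked with NaN; the name `nipals_missing` and the demo data are hypothetical.

```python
import numpy as np

def nipals_missing(X, n_components, tol=1e-9, max_iter=1000):
    """Sketch of NIPALS that skips missing entries (NaN) in each regression."""
    M = ~np.isnan(X)                       # True where a value is observed
    W = M.astype(float)                    # same mask as 0/1 weights
    Xz = np.where(M, X, 0.0)               # missing elements zeroed out
    n, m = Xz.shape
    T = np.zeros((n, n_components))
    P = np.zeros((m, n_components))
    for a in range(n_components):
        t = Xz[:, np.argmax((Xz ** 2).sum(axis=0))].copy()
        for _ in range(max_iter):
            # Column-wise regression on t using only the observed rows.
            p = (Xz.T @ t) / (W.T @ t ** 2)
            p /= np.linalg.norm(p)
            # Row-wise regression on p using only the observed columns.
            t_new = (Xz @ p) / (W @ p ** 2)
            done = np.linalg.norm(t_new - t) <= tol * np.linalg.norm(t_new)
            t = t_new
            if done:
                break
        Xz = np.where(M, Xz - np.outer(t, p), 0.0)   # deflate observed entries only
        T[:, a], P[:, a] = t, p
    return T, P

# Hypothetical demo: two latent directions plus a little noise.
rng = np.random.default_rng(2)
load = np.linalg.qr(rng.normal(size=(8, 2)))[0]          # 8 x 2, orthonormal columns
X_full = (rng.normal(size=(60, 2)) * [10.0, 4.0]) @ load.T
X_full += 0.1 * rng.normal(size=X_full.shape)
X_full -= X_full.mean(axis=0)

X_miss = X_full.copy()
X_miss[rng.random(X_full.shape) < 0.2] = np.nan          # roughly 20% missing at random

T_full, P_full = nipals_missing(X_full, 2)               # no missing values
T_miss, P_miss = nipals_missing(X_miss, 2)
overlap = abs(P_miss[:, 0] @ P_miss[:, 1])               # generally not exactly zero
```

With no missing values the weights are all one and the iteration reduces to ordinary NIPALS. With missing values nothing in the iteration enforces orthogonality between components, so `overlap` is generally nonzero.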


– X is the matrix of data where the missing elements have been zeroed out.

– Constraint (20b) forces the loadings to be orthonormal.

– Constraint (20c) makes the score vectors orthogonal to each other.

– Constraint (20d) forces the scores to have zero mean.

– It is clear that if there are no missing values, problem (20) will reduce to problem (4) (min problem) for the first a principal components.
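The three constraints can be expressed as simple numerical checks on a candidate solution. This is a sketch only: the helper name `pca_constraint_violations` is hypothetical, and the demo uses an SVD solution on made-up complete data, which should satisfy all three.

```python
import numpy as np

def pca_constraint_violations(T, P):
    """Largest absolute violation of each PCA constraint for scores T and loadings P."""
    a = P.shape[1]
    loadings = np.abs(P.T @ P - np.eye(a)).max()      # (20b) orthonormal loadings
    G = T.T @ T
    scores = np.abs(G - np.diag(np.diag(G))).max()    # (20c) mutually orthogonal scores
    means = np.abs(T.mean(axis=0)).max()              # (20d) zero-mean scores
    return loadings, scores, means

# Hypothetical complete, mean-centred data and its SVD-based PCA solution.
rng = np.random.default_rng(3)
X = rng.normal(size=(30, 5))
X -= X.mean(axis=0)
U, S, Vt = np.linalg.svd(X, full_matrices=False)
T, P = (U * S)[:, :3], Vt[:3].T
violations = pca_constraint_violations(T, P)
```

The same checks can be applied to the output of any algorithm to see which constraints it actually satisfies.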

Let Yi,j = Xi,j + Zi,j where Xi,j are the values of the data that are equal to zero for the missing elements, and Zi,j are the imputed values that should be zero for the nonmissing elements.
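A small illustration of this split, assuming missing entries are marked with NaN. The constant 9.0 is only a placeholder for whatever values the model would impute; it is not taken from the paper.

```python
import numpy as np

data = np.array([[1.0, np.nan, 3.0],
                 [np.nan, 5.0, 6.0]])
observed = ~np.isnan(data)

X = np.where(observed, data, 0.0)          # zeros at the missing positions
model_fit = np.full(data.shape, 9.0)       # placeholder for model predictions
Z = np.where(observed, 0.0, model_fit)     # zeros at the nonmissing positions
Y = X + Z                                  # completed matrix
```

Because X and Z are never nonzero at the same position, their elementwise product is zero everywhere.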


The constrained problem is solved directly, so the scores and loadings obtained with the NLP are orthogonal, as required by the PCA model; this is not true for the modified NIPALS.

RESULTS

Numerical simulations were done by generating a data set with 1000 rows and 100 columns from a known four-dimensional latent space with added random Gaussian noise. Values were then removed to generate data sets with missing value percentages ranging from 1 to 70%.
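A data set of this kind could be generated along the following lines. This is a sketch under stated assumptions: the noise level, the seeds, the standard-normal scores and loadings, and removal completely at random are all choices made here, since the text does not specify them. The names `simulate` and `remove_values` are hypothetical.

```python
import numpy as np

def simulate(n_rows=1000, n_cols=100, n_latent=4, noise_sd=0.1, seed=0):
    """Low-rank data from a latent space plus Gaussian noise (assumed noise level)."""
    rng = np.random.default_rng(seed)
    T = rng.normal(size=(n_rows, n_latent))     # latent scores
    P = rng.normal(size=(n_cols, n_latent))     # latent loadings
    return T @ P.T + noise_sd * rng.normal(size=(n_rows, n_cols))

def remove_values(X, fraction, seed=0):
    """Return a copy of X with the given fraction of entries set to NaN at random."""
    rng = np.random.default_rng(seed)
    X_missing = X.copy()
    n_remove = int(round(fraction * X.size))
    idx = rng.choice(X.size, size=n_remove, replace=False)
    X_missing.flat[idx] = np.nan
    return X_missing

X_sim = simulate()
X_70 = remove_values(X_sim, 0.70)               # the 70% end of the range
```

Looping `remove_values` over fractions from 0.01 to 0.70 would give a series of test sets covering the range quoted above.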

RESULTS

Industrial example:

– Data from 76 common pharmaceutical materials were made available (Pfizer Inc.); the data span over 10 years of testing. Due to the reasons outlined above, approximately 61% of the data were missing.

– Three principal components were used in all models because of a sudden drop in the eigenvalue for the fourth component (from 13 to 2) in the NIPALS model and the very low percentage of the total variance for the fourth component (1.4%) in the NLP model.

Conclusion

• The NLP solutions take less time and fewer iterations than the current state-of-the-art algorithms, while still satisfying the constraints of the PCA model.

• The current platform allows the potential inclusion of a large number of observations that otherwise would be excluded from the model building exercise, still yielding a robust model with desirable properties.

• In the presence of large amounts of missing data, this method reduces the computational time (and number of iterations) required to calculate the PCA model parameters.
