4th Iranian Chemometrics Workshop (ICW)
Zanjan-2004
The Problem of Factor Selection in PCA-Based
Calibration Methods
By:
Bahram Hemmateenejad
Medicinal & Natural Products Chemistry Research Center,
Shiraz University of Medical Science
Multivariate Calibration
Regression equation relating measurements on m samples to k different variables:
y = X b
y (m×1): Dependent variable or predicted variable
X (m×k): Independent variables or predictor variables
b (k×1): Regression coefficients
Multicomponent Analysis
y: concentration of the analyte
X: Analytical signals recorded at k different channels, e.g. absorbance at different wavelengths
QSAR/QSPR Studies
y: chemical property or biological activity
X: Molecular descriptors representing structural features of molecules numerically
Problems associated with MLR
• Collinearity between the independent variables (X)
• Number of independent variables (k) should be much lower than the number of samples (m)
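The effect of collinearity on MLR can be illustrated numerically. The following is a minimal sketch on simulated data (the sample size, noise level and helper names are assumptions made for illustration, not part of the original slides): two nearly identical predictors inflate the condition number of X and make the estimated coefficients unstable.

```python
import numpy as np

rng = np.random.default_rng(0)
m = 30  # number of samples (hypothetical)

# Two nearly identical predictors (collinear) versus two independent ones.
x1 = rng.normal(size=m)
X_collinear = np.column_stack([x1, x1 + 1e-4 * rng.normal(size=m)])
X_independent = np.column_stack([x1, rng.normal(size=m)])


def mlr_coefficients(X, y):
    """Least-squares solution of y = X b."""
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    return b


def coefficient_spread(X, b_true, noise=0.01, repeats=200):
    """Standard deviation of the estimated b over repeated noisy y vectors."""
    estimates = [
        mlr_coefficients(X, X @ b_true + noise * rng.normal(size=X.shape[0]))
        for _ in range(repeats)
    ]
    return np.std(estimates, axis=0)


b_true = np.array([1.0, 1.0])
cond_collinear = np.linalg.cond(X_collinear)
cond_independent = np.linalg.cond(X_independent)
spread_collinear = coefficient_spread(X_collinear, b_true)
spread_independent = coefficient_spread(X_independent, b_true)
```

Comparing `spread_collinear` with `spread_independent` shows how the same measurement noise in y translates into much larger coefficient scatter when the predictors are collinear.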
Reduced number of variables must be used
Feature selection: The variables are selected based on their generalization ability using selection methods such as stepwise variable selection, genetic algorithm, simulated annealing, …
Feature extraction: The variables are transformed into new coordinate axes with lower dimension, using Principal Component Analysis (PCA) or Factor Analysis (FA)
PCA or FA or PFA
X = T P
X (m×k)
T (m×k)
P (k×k)
T = [t1 t2 t3 t4 t5 … tk]   Scores
Pᵀ = [p1ᵀ p2ᵀ p3ᵀ p4ᵀ p5ᵀ … pkᵀ]   Loadings
λ = [λ1 λ2 λ3 λ4 λ5 … λk]   Eigen-values
λ1 > λ2 > λ3 > λ4 > λ5 > … > λk
Each vector of T or P is called an eigen-vector, PC or factor
λi is the amount of variance in the X matrix that is explained by the corresponding eigen-vector (ti or pi)
A reduced set of PCs is necessary to reproduce the original data matrix without losing significant information
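The decomposition X = T P can be sketched with a singular value decomposition. This is a minimal illustration on simulated data; column centering of X and the choice of the SVD as the algorithm are assumptions, since the slides do not specify either.

```python
import numpy as np


def pca(X):
    """Decompose column-centered X as X = T P via the SVD."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    T = U * s              # scores, one column per PC
    P = Vt                 # loadings, one row per PC
    eigenvalues = s ** 2   # sum of squares explained by each PC
    return T, P, eigenvalues


rng = np.random.default_rng(1)
# Hypothetical data: 20 samples, 6 variables generated by 2 underlying factors.
X = rng.normal(size=(20, 2)) @ rng.normal(size=(2, 6))
X = X + 0.01 * rng.normal(size=X.shape)

T, P, eigenvalues = pca(X)
explained = eigenvalues / eigenvalues.sum()   # fraction of variance per PC

# Reproduce the (centered) data with a reduced set of f PCs.
f = 2
X_reduced = T[:, :f] @ P[:f, :]
```

Here `explained` plays the role of the eigen-values expressed as fractions of the total variance, and `X_reduced` is the reproduction of the centered data from the first f factors.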
X (m×k) = T (m×f) P (f×k)
f is the number of significant factors
f is the rank of the original data matrix
f describes the complexity of the X matrix
Ideally, f is the number of nonzero eigen-values
f can be determined by the theory of FA
Scree plot, indicator function, imbedded error, real error, …
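As an illustration of these criteria, the sketch below computes the real error (RE), imbedded error (IE) and indicator function (IND) from the eigen-values, using the commonly quoted definitions of Malinowski for an r×c matrix with r ≥ c: RE = [Σ λj / (r(c−n))]^(1/2) summed over j = n+1 … c, IE = RE (n/c)^(1/2), IND = RE/(c−n)². The simulated data, the absence of centering and the function name are assumptions for the example.

```python
import numpy as np


def factor_indicators(X):
    """RE, IE and IND for n = 1 .. c-1 retained factors (uncentered X)."""
    r, c = X.shape
    if r < c:                      # the formulas assume rows >= columns
        X = X.T
        r, c = c, r
    eigenvalues = np.linalg.svd(X, compute_uv=False) ** 2
    n = np.arange(1, c)
    # Sum of the eigen-values left out when n factors are retained.
    residual = np.array([eigenvalues[k:].sum() for k in n])
    re = np.sqrt(residual / (r * (c - n)))
    ie = re * np.sqrt(n / c)
    ind = re / (c - n) ** 2
    return n, re, ie, ind


rng = np.random.default_rng(2)
# Hypothetical three-component mixture data with small random noise.
C = rng.uniform(size=(100, 3))
S = rng.uniform(size=(3, 12))
X = C @ S + 0.001 * rng.normal(size=(100, 12))

n, re, ie, ind = factor_indicators(X)
f_estimated = int(n[np.argmin(ind)])   # IND is expected to reach its minimum at n = f
```

For data of this kind the minimum of IND is usually taken as the estimate of f, while RE is compared with the known experimental error when that is available.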
PCA-based regression methods
MLR (Classical Least Squares)
y = X b
b = (XᵀX)⁻¹ Xᵀ y
ynew = xnew b
Principal Component Regression (PCR)
X = T P
y = T b
b = (TᵀT)⁻¹ Tᵀ y
tnew = xnew Pᵀ
ynew = tnew b
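The PCR equations can be put together in a short sketch. The example assumes mean centering of X and y and uses simulated three-component spectra with more channels than samples; the function names and data are hypothetical.

```python
import numpy as np


def pcr_fit(X, y, f):
    """Fit a PCR model with the first f principal components."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    P = Vt[:f]                                   # loadings (f x k)
    T = (X - x_mean) @ P.T                       # scores (m x f)
    # b = (T^T T)^-1 T^T y, solved without forming the inverse explicitly
    b = np.linalg.solve(T.T @ T, T.T @ (y - y_mean))
    return {"x_mean": x_mean, "y_mean": y_mean, "P": P, "b": b}


def pcr_predict(model, X_new):
    """Project new samples onto the loadings and apply the regression."""
    T_new = (X_new - model["x_mean"]) @ model["P"].T
    return T_new @ model["b"] + model["y_mean"]


rng = np.random.default_rng(3)
C = rng.uniform(size=(40, 3))                    # concentrations of 3 components
S = rng.uniform(size=(3, 50))                    # pure spectra at 50 channels
X = C @ S + 0.001 * rng.normal(size=(40, 50))
y = C[:, 0]                                      # analyte of interest

model = pcr_fit(X[:30], y[:30], f=3)             # calibration set
y_pred = pcr_predict(model, X[30:])              # prediction set
rmse = np.sqrt(np.mean((y_pred - y[30:]) ** 2))
```

Because k = 50 exceeds the number of calibration samples, XᵀX is singular and MLR cannot be applied directly, whereas the PCR model only needs to invert the small f×f matrix TᵀT.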
Some Questions
1. How many PCs must be used in PCR?
2. Which PCs should be considered in PCR modeling?
3. Is the magnitude of an eigen-value necessarily a measure of its significance for the calibration?
Significance of factor selection
Top-down eigen-value ranking (ER)
Factors are entered into the model one after the other in order of decreasing eigen-value
Each time a new factor is entered, the regression model is built and its performance is validated by an existing procedure such as cross-validation
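The ER procedure can be sketched as a loop over the number of entered factors, with leave-one-out cross-validation as the validation step. This is only an illustration: the choice of leave-one-out, the recomputation of the PCA inside each cross-validation round, the simulated data and the function names are all assumptions.

```python
import numpy as np


def press_loo(X, y, pcs):
    """Leave-one-out PRESS of a PCR model built on the PCs listed in `pcs`."""
    press = 0.0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        X_train, y_train = X[keep], y[keep]
        x_mean, y_mean = X_train.mean(axis=0), y_train.mean()
        _, _, Vt = np.linalg.svd(X_train - x_mean, full_matrices=False)
        P = Vt[pcs]                                   # selected loadings
        T = (X_train - x_mean) @ P.T                  # selected scores
        b, *_ = np.linalg.lstsq(T, y_train - y_mean, rcond=None)
        y_hat = (X[i] - x_mean) @ P.T @ b + y_mean    # left-out prediction
        press += (y[i] - y_hat) ** 2
    return press


def eigenvalue_ranking(X, y, max_factors):
    """PRESS after entering PC 1, then PCs 1-2, ... in eigen-value order."""
    return np.array(
        [press_loo(X, y, list(range(f))) for f in range(1, max_factors + 1)]
    )


rng = np.random.default_rng(4)
C = rng.uniform(size=(30, 3))
X = C @ rng.uniform(size=(3, 40)) + 0.001 * rng.normal(size=(30, 40))
y = C[:, 0]

press_er = eigenvalue_ranking(X, y, max_factors=8)
best_f = int(np.argmin(press_er)) + 1   # number of factors with minimum PRESS
```

The resulting `press_er` curve is what is normally inspected to decide how many factors to keep.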
[Figure: logarithm of eigen-value plotted for factors 1 to 16]
Top-down Correlation Ranking (CR)
First, the correlation between each factor and the dependent variable (concentration, y) is determined
Then the factors are entered into the model consecutively in order of decreasing correlation
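A minimal sketch of the ranking step is given below. It assumes that the absolute value of the Pearson correlation coefficient is used for ranking, and it uses simulated data in which y depends on a low-variance factor; the function name and data are hypothetical.

```python
import numpy as np


def correlation_ranking(X, y):
    """Rank PCs by the absolute correlation of their scores with y."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    keep = s > 1e-10 * s[0]          # drop numerically empty components
    T = U[:, keep] * s[keep]         # scores
    r = np.array(
        [abs(np.corrcoef(T[:, j], y)[0, 1]) for j in range(T.shape[1])]
    )
    order = np.argsort(-r)           # PC indices, most correlated first
    return order, r


rng = np.random.default_rng(5)
m, k = 200, 8
scales = np.array([10.0, 5.0, 2.0, 1.0, 0.5])
Z = rng.normal(size=(m, 5)) * scales              # five latent factors
Q, _ = np.linalg.qr(rng.normal(size=(k, 5)))      # orthonormal mixing (k x 5)
X = Z @ Q.T + 0.001 * rng.normal(size=(m, k))
y = Z[:, 3] + 0.05 * rng.normal(size=m)           # y follows the 4th factor

order, r = correlation_ranking(X, y)
```

In this example the PC with the fourth largest eigen-value is ranked first by CR, which illustrates that a large eigen-value does not by itself make a factor relevant for predicting y. The entry order in `order` can then replace the eigen-value order in a cross-validated PCR loop.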
[Figure: correlation coefficient with y plotted for factors 1 to 16]
Other factor selection methods
• Stepwise selection procedure
• Search algorithms
  • Simulated annealing
  • Genetic algorithm
Some references
1. Xie YL, Kalivas JH. Evaluation of principal component selection methods to form a global prediction model by principal component regression. Anal. Chim. Acta 1997; 348: 19-27.
2. Sutter JM, Kalivas JH. Which principal components to utilize for principal component regression. J. Chemometrics 1992; 6: 217-225.
3. Sun J. A correlation principal component regression analysis of NIR data. J. Chemometrics 1995; 9: 21-29.
4. Depczynski U, Frost VJ, Molt K. Genetic algorithms applied to the selection of factors in principal component regression. Anal. Chim. Acta 2000; 420: 217-227.
5. Barros AS, Rutledge DN. Genetic algorithm applied to the selection of principal components. Chemometrics Intell. Lab. Syst. 1998; 40: 65-81.
6. Verdu-Andres J, Massart DL. Comparison of prediction- and correlation-based methods to select the best subset of principal components for principal component regression and detect outlying objects. Appl. Spect. 1998; 52: 1425-1434.
7. Xie YL, Kalivas JH. Local prediction models by principal component regression. Anal. Chim. Acta 1997; 348: 29-38.
8. Ferre L. Selection of components in principal component analysis: a comparison of methods. Comput. Stat. Data Anal. 1995; 19: 669-682.
A QSPR example
• Quantitative Structure-Electrochemistry Relationship Study of Some Organic Compounds
• Dependent variable
  • Half-wave reduction potential (E1/2) of 69 compounds
• Independent variables
  • 1150 theoretical molecular descriptors calculated by DRAGON software
[Figure: cumulative percent of variance plotted for PCs 1 to 31]
[Figure: correlation coefficient plotted for PCs 1 to 31]
[Figure: PRESS(CV) versus number of entered PCs for the ER and CR procedures]
Principal Component-Artificial Neural Network (PC-ANN)
• ANN is a nonlinear, non-parametric modeling method
• Feature selection is more important for ANN
• Feature selection-based ANN modeling is a complex procedure
• Orthogonalization of the variables before introducing them to the network substantially decreases the computational time and increases the overall performance of the ANN
• PC-ANN is a feature extraction-based algorithm
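The PC-ANN idea, feeding PC scores instead of the original variables to a network, can be sketched as follows. This is not the network used in the cited studies: the single tanh hidden layer, plain gradient descent, the hyperparameters and the simulated data are all assumptions chosen to keep the example short.

```python
import numpy as np


def pc_scores(X, f):
    """First f PC scores of column-centered X, scaled to unit variance."""
    Xc = X - X.mean(axis=0)
    U, s, _ = np.linalg.svd(Xc, full_matrices=False)
    T = U[:, :f] * s[:f]
    return T / T.std(axis=0)


def train_ann(T, y, hidden=4, epochs=4000, lr=0.05, seed=0):
    """One-hidden-layer network trained by full-batch gradient descent."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(T.shape[1], hidden))
    b1 = np.zeros(hidden)
    W2 = rng.normal(scale=0.5, size=hidden)
    b2 = 0.0
    history = []
    for _ in range(epochs):
        H = np.tanh(T @ W1 + b1)            # hidden layer output
        y_hat = H @ W2 + b2                 # linear output neuron
        err = y_hat - y
        history.append(np.mean(err ** 2))
        # Backpropagation of the mean squared error.
        g_out = 2 * err / len(y)
        g_hid = np.outer(g_out, W2) * (1 - H ** 2)
        W2 -= lr * (H.T @ g_out)
        b2 -= lr * g_out.sum()
        W1 -= lr * (T.T @ g_hid)
        b1 -= lr * g_hid.sum(axis=0)
    return (W1, b1, W2, b2), np.array(history)


rng = np.random.default_rng(6)
Z = rng.normal(size=(150, 3))                       # hidden structural factors
X = Z @ rng.normal(size=(3, 10)) + 0.01 * rng.normal(size=(150, 10))
y = np.tanh(Z[:, 0]) + 0.3 * Z[:, 1] ** 2           # nonlinear response

T = pc_scores(X, f=3)
params, history = train_ann(T, y)
```

Because the scores are mutually orthogonal and only f inputs are used, the network stays small; which PCs to feed in is the factor selection question raised in this talk.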
PC-GA-ANN Algorithm
• Genetic algorithm applied to the selection of factors in PC-ANN modeling
• The set of PCs selected by GA could model the structure-antagonist activity of the calcium channel blockers better than the ER procedure
• B. Hemmateenejad, M. Akhond, R. Miri, M. Shamsipur, J. Chem. Inf. Comput. Sci. 43 (2003) 1328.
• How are the factors ranked based on their correlation coefficient in PC-ANN?
CR-PC-ANN Algorithm
• Correlation ranking procedure for factor selection in PC-ANN modeling
• The nonlinear relationship between each PC and the dependent variable (y) was modeled by a separate ANN model
• The subset of PCs selected by CR was found to be nearly the same as that selected by GA, so the two factor selection procedures gave similar results
• B. Hemmateenejad, Chemometrics and Intelligent Laboratory Systems, 2004, Accepted.
1. Application of ab initio theory to QSAR study of the 1,4-dihydropyridine-based calcium channel blockers using GA-MLR and PC-GA-ANN procedures, B. Hemmateenejad, M.A. Safarpour, R. Miri, F. Taghavi, Journal of Computational Chemistry 25 (2004) 1495.
2. Highly Correlating Distance-Connectivity-Based Topological Indices. 2: Prediction of 15 Properties of a Large Set of Alkanes Using a Stepwise Factor Selection-Based PCR Analysis, M. Shamsipur, R. Ghavami, B. Hemmateenejad, H. Sharghi, QSAR & Combinatorial Science, 2004, Accepted.
3. Quantitative Structure-Electrochemistry Relationship Study of some Organic Compounds using PCR and PC-ANN, B. Hemmateenejad, M. Shamsipur, Internet Electronic Journal of Molecular Design 3 (2004) 316.
4. Toward an Optimal Procedure for PC-ANN Model Building: Prediction of the Carcinogenic Activity of a Large Set of Drugs, B. Hemmateenejad, M.A. Safarpour, R. Miri, N. Nesari, Journal of Chemical Information and Computer Sciences, Revised
5. Optimal QSAR analysis of the carcinogenic activity of drugs by correlation ranking and genetic algorithm-based PCR, B. Hemmateenejad, Journal of Chemometrics, Submitted.
Future Works
1. Selection of latent variables in PLS
2. Application of other selection algorithms such as the successive projections algorithm
3. Comparison between the importance of factor selection in multicomponent analysis and QSAR/QSPR studies
4. Application of the factor selection-based ANN modeling in multicomponent analysis
5. Validation of the different factor selection algorithms by new criteria