SAS/STAT® 15.2 User's Guide: The PLS Procedure
SAS® Documentation, November 06, 2020

This document is an individual chapter from SAS/STAT® 15.2 User’s Guide.

The correct bibliographic citation for this manual is as follows: SAS Institute Inc. 2020. SAS/STAT® 15.2 User's Guide. Cary, NC: SAS Institute Inc.

SAS/STAT® 15.2 User’s Guide

Copyright © 2020, SAS Institute Inc., Cary, NC, USA

All Rights Reserved. Produced in the United States of America.

For a hard-copy book: No part of this publication may be reproduced, stored in a retrieval system, or transmitted, in any form or by any means, electronic, mechanical, photocopying, or otherwise, without the prior written permission of the publisher, SAS Institute Inc.

For a web download or e-book: Your use of this publication shall be governed by the terms established by the vendor at the time you acquire this publication.

The scanning, uploading, and distribution of this book via the Internet or any other means without the permission of the publisher is illegal and punishable by law. Please purchase only authorized electronic editions and do not participate in or encourage electronic piracy of copyrighted materials. Your support of others' rights is appreciated.

U.S. Government License Rights; Restricted Rights: The Software and its documentation is commercial computer software developed at private expense and is provided with RESTRICTED RIGHTS to the United States Government. Use, duplication, or disclosure of the Software by the United States Government is subject to the license terms of this Agreement pursuant to, as applicable, FAR 12.212, DFAR 227.7202-1(a), DFAR 227.7202-3(a), and DFAR 227.7202-4, and, to the extent required under U.S. federal law, the minimum restricted rights as set out in FAR 52.227-19 (DEC 2007). If FAR 52.227-19 is applicable, this provision serves as notice under clause (c) thereof and no other notice is required to be affixed to the Software or documentation. The Government's rights in Software and documentation shall be only those set forth in this Agreement.

SAS Institute Inc., SAS Campus Drive, Cary, NC 27513-2414

November 2020

SAS® and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration.

Other brand and product names are trademarks of their respective companies.

SAS software may be provided with certain third-party software, including but not limited to open-source software, which is licensed under its applicable third-party software license agreement. For license information about third-party software distributed with SAS software, refer to http://support.sas.com/thirdpartylicenses.


Chapter 94

The PLS Procedure

Contents
Overview: PLS Procedure ........................................... 7654
    Basic Features ................................................ 7654
Getting Started: PLS Procedure .................................... 7655
    Spectrometric Calibration ..................................... 7655
Syntax: PLS Procedure ............................................. 7664
    PROC PLS Statement ............................................ 7664
    BY Statement .................................................. 7671
    CLASS Statement ............................................... 7671
    EFFECT Statement .............................................. 7672
    ID Statement .................................................. 7674
    MODEL Statement ............................................... 7674
    OUTPUT Statement .............................................. 7674
Details: PLS Procedure ............................................ 7675
    Regression Methods ............................................ 7675
    Cross Validation .............................................. 7680
    Centering and Scaling ......................................... 7681
    Missing Values ................................................ 7682
    Displayed Output .............................................. 7682
    ODS Table Names ............................................... 7683
    ODS Graphics .................................................. 7684
Examples: PLS Procedure ........................................... 7686
    Example 94.1: Examining Model Details ......................... 7686
    Example 94.2: Examining Outliers .............................. 7693
    Example 94.3: Choosing a PLS Model by Test Set Validation ..... 7695
    Example 94.4: Partial Least Squares Spline Smoothing .......... 7701
References ........................................................ 7707


Overview: PLS Procedure

The PLS procedure fits models by using any one of a number of linear predictive methods, including partial least squares (PLS). Ordinary least squares regression, as implemented in SAS/STAT procedures such as PROC GLM and PROC REG, has the single goal of minimizing sample response prediction error, seeking linear functions of the predictors that explain as much variation in each response as possible. The techniques implemented in the PLS procedure have the additional goal of accounting for variation in the predictors, under the assumption that directions in the predictor space that are well sampled should provide better prediction for new observations when the predictors are highly correlated. All of the techniques implemented in the PLS procedure work by extracting successive linear combinations of the predictors, called factors (also called components, latent vectors, or latent variables), which optimally address one or both of these two goals: explaining response variation and explaining predictor variation. In particular, the method of partial least squares balances the two objectives, seeking factors that explain both response and predictor variation.

Note that the name "partial least squares" also applies to a more general statistical method that is not implemented in this procedure. The partial least squares method was originally developed in the 1960s by the econometrician Herman Wold (1966) for modeling "paths" of causal relation between any number of "blocks" of variables. However, the PLS procedure fits only predictive partial least squares models, with one "block" of predictors and one "block" of responses. If you are interested in fitting more general path models, you should consider using the CALIS procedure.

Basic Features

The techniques implemented by the PLS procedure are as follows:

- principal components regression, which extracts factors to explain as much predictor sample variation as possible

- reduced rank regression, which extracts factors to explain as much response variation as possible. This technique, also known as (maximum) redundancy analysis, differs from multivariate linear regression only when there are multiple responses.

- partial least squares regression, which balances the two objectives of explaining response variation and explaining predictor variation. Two different formulations for partial least squares are available: the original predictive method of Wold (1966) and the SIMPLS method of De Jong (1993).

The number of factors to extract depends on the data. Basing the model on more extracted factors improves the model fit to the observed data, but extracting too many factors can cause overfitting, that is, tailoring the model too much to the current data, to the detriment of future predictions. The PLS procedure enables you to choose the number of extracted factors by cross validation, that is, fitting the model to part of the data, minimizing the prediction error for the unfitted part, and iterating with different portions of the data in the roles of fitted and unfitted. Various methods of cross validation are available, including one-at-a-time validation and splitting the data into blocks. The PLS procedure also offers test set validation, where the model is fit to the entire primary input data set and the fit is evaluated over a distinct test data set.
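The test set validation approach just described can be sketched as follows. This is only a minimal illustration based on the CV=TESTSET option documented later in this chapter; the data set names train and test and the variable names are hypothetical placeholders, not part of the original example:

```sas
/* Hypothetical sketch: fit the model to data set "train" and
   evaluate prediction error on the separate data set "test".
   All data set and variable names here are assumptions. */
proc pls data=train cv=testset(test);
   model y = x1-x10;
run;
```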


You can use the general linear modeling approach of the GLM procedure to specify a model for your design, allowing for general polynomial effects as well as classification or ANOVA effects. You can save the model fit by the PLS procedure in a data set and apply it to new data by using the SCORE procedure.

The PLS procedure uses ODS Graphics to create graphs as part of its output. For general information about ODS Graphics, see Chapter 23, "Statistical Graphics Using ODS." For specific information about the statistical graphics available with the PLS procedure, see the PLOTS options in the PROC PLS statement and the section "ODS Graphics" on page 7684.

Getting Started: PLS Procedure

Spectrometric Calibration

The example in this section illustrates basic features of the PLS procedure. The data are reported in Umetrics (1995); the original source is Lindberg, Persson, and Wold (1983). Suppose that you are researching pollution in the Baltic Sea, and you would like to use the spectra of samples of seawater to determine the amounts of three compounds present in samples from the Baltic Sea: lignin sulfonate (ls: pulp industry pollution), humic acids (ha: natural forest products), and optical whitener from detergent (dt). Spectrometric calibration is a type of problem in which partial least squares can be very effective. The predictors are the spectral emission intensities at different frequencies in the sample spectrum, and the responses are the amounts of various chemicals in the sample.

For the purposes of calibrating the model, samples with known compositions are used. The calibration data consist of 16 samples of known concentrations of ls, ha, and dt, with spectra based on 27 frequencies (or, equivalently, wavelengths). The following statements create a SAS data set named Sample for these data.

data Sample;
   input obsnam $ v1-v27 ls ha dt @@;
   datalines;
EM1  2766 2610 3306 3630 3600 3438 3213 3051 2907 2844 2796
     2787 2760 2754 2670 2520 2310 2100 1917 1755 1602 1467
     1353 1260 1167 1101 1017  3.0110 0.0000 0.00
EM2  1492 1419 1369 1158  958  887  905  929  920  887  800
      710  617  535  451  368  296  241  190  157  128  106
       89   70   65   56   50  0.0000 0.4005 0.00
EM3  2450 2379 2400 2055 1689 1355 1109  908  750  673  644
      640  630  618  571  512  440  368  305  247  196  156
      120   98   80   61   50  0.0000 0.0000 90.63
EM4  2751 2883 3492 3570 3282 2937 2634 2370 2187 2070 2007
     1974 1950 1890 1824 1680 1527 1350 1206 1080  984  888
      810  732  669  630  582  1.4820 0.1580 40.00
EM5  2652 2691 3225 3285 3033 2784 2520 2340 2235 2148 2094
     2049 2007 1917 1800 1650 1464 1299 1140 1020  909  810
      726  657  594  549  507  1.1160 0.4104 30.45
EM6  3993 4722 6147 6720 6531 5970 5382 4842 4470 4200 4077
     4008 3948 3864 3663 3390 3090 2787 2481 2241 2028 1830
     1680 1533 1440 1314 1227  3.3970 0.3032 50.82
EM7  4032 4350 5430 5763 5490 4974 4452 3990 3690 3474 3357
     3300 3213 3147 3000 2772 2490 2220 1980 1779 1599 1440
     1320 1200 1119 1032  957  2.4280 0.2981 70.59
EM8  4530 5190 6910 7580 7510 6930 6150 5490 4990 4670 4490
     4370 4300 4210 4000 3770 3420 3060 2760 2490 2230 2060
     1860 1700 1590 1490 1380  4.0240 0.1153 89.39
EM9  4077 4410 5460 5857 5607 5097 4605 4170 3864 3708 3588
     3537 3480 3330 3192 2910 2610 2325 2064 1830 1638 1476
     1350 1236 1122 1044  963  2.2750 0.5040 81.75
EM10 3450 3432 3969 4020 3678 3237 2814 2487 2205 2061 2001
     1965 1947 1890 1776 1635 1452 1278 1128  981  867  753
      663  600  552  507  468  0.9588 0.1450 101.10
EM11 4989 5301 6807 7425 7155 6525 5784 5166 4695 4380 4197
     4131 4077 3972 3777 3531 3168 2835 2517 2244 2004 1809
     1620 1470 1359 1266 1167  3.1900 0.2530 120.00
EM12 5340 5790 7590 8390 8310 7670 6890 6190 5700 5380 5200
     5110 5040 4900 4700 4390 3970 3540 3170 2810 2490 2240
     2060 1870 1700 1590 1470  4.1320 0.5691 117.70
EM13 3162 3477 4365 4650 4470 4107 3717 3432 3228 3093 3009
     2964 2916 2838 2694 2490 2253 2013 1788 1599 1431 1305
     1194 1077  990  927  855  2.1600 0.4360 27.59
EM14 4380 4695 6018 6510 6342 5760 5151 4596 4200 3948 3807
     3720 3672 3567 3438 3171 2880 2571 2280 2046 1857 1680
     1548 1413 1314 1200 1119  3.0940 0.2471 61.71
EM15 4587 4200 5040 5289 4965 4449 3939 3507 3174 2970 2850
     2814 2748 2670 2529 2328 2088 1851 1641 1431 1284 1134
     1020  918  840  756  714  1.6040 0.2856 108.80
EM16 4017 4725 6090 6570 6354 5895 5346 4911 4611 4422 4314
     4287 4224 4110 3915 3600 3240 2913 2598 2325 2088 1917
     1734 1587 1452 1356 1257  3.1620 0.7012 60.00
;

Fitting a PLS Model

To isolate a few underlying spectral factors that provide a good predictive model, you can fit a PLS model to the 16 samples by using the following SAS statements:

proc pls data=sample;
   model ls ha dt = v1-v27;
run;

By default, the PLS procedure extracts at most 15 factors. The procedure lists the amount of variation accounted for by each of these factors, both individual and cumulative; this listing is shown in Figure 94.1.


Figure 94.1 PLS Variation Summary

The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors

Number of
Extracted            Model Effects           Dependent Variables
Factors           Current       Total       Current       Total
    1             97.4607     97.4607       41.9155     41.9155
    2              2.1830     99.6436       24.2435     66.1590
    3              0.1781     99.8217       24.5339     90.6929
    4              0.1197     99.9414        3.7898     94.4827
    5              0.0415     99.9829        1.0045     95.4873
    6              0.0106     99.9935        2.2808     97.7681
    7              0.0017     99.9952        1.1693     98.9374
    8              0.0010     99.9961        0.5041     99.4415
    9              0.0014     99.9975        0.1229     99.5645
   10              0.0010     99.9985        0.1103     99.6747
   11              0.0003     99.9988        0.1523     99.8270
   12              0.0003     99.9991        0.1291     99.9561
   13              0.0002     99.9994        0.0312     99.9873
   14              0.0004     99.9998        0.0065     99.9938
   15              0.0002    100.0000        0.0062    100.0000

Note that all of the variation in both the predictors and the responses is accounted for by only 15 factors; this is because there are only 16 sample observations. More important, almost all of the variation is accounted for with even fewer factors: one or two for the predictors and three to eight for the responses.

Selecting the Number of Factors by Cross Validation

A PLS model is not complete until you choose the number of factors. You can choose the number of factors by using cross validation, in which the data set is divided into two or more groups. You fit the model to all groups except one, and then you check the capability of the model to predict responses for the group omitted. Repeating this for each group, you then can measure the overall capability of a given form of the model. The predicted residual sum of squares (PRESS) statistic is based on the residuals generated by this process.

To select the number of extracted factors by cross validation, you specify the CV= option with an argument that says which cross validation method to use. For example, a common method is split-sample validation, in which the different groups are composed of every nth observation beginning with the first, every nth observation beginning with the second, and so on. You can use the CV=SPLIT option to specify split-sample validation with n = 7 by default, as in the following SAS statements:

proc pls data=sample cv=split;
   model ls ha dt = v1-v27;
run;

The resulting output is shown in Figure 94.2 and Figure 94.3.


Figure 94.2 Split-Sample Validated PRESS Statistics for Number of Factors

The PLS Procedure

Split-sample Validation for the Number of Extracted Factors

Number of          Root Mean
Extracted Factors      PRESS
    0               1.107747
    1               0.957983
    2               0.931314
    3               0.520222
    4               0.530501
    5               0.586786
    6               0.475047
    7               0.477595
    8               0.483138
    9               0.485739
   10                0.48946
   11               0.521445
   12               0.525653
   13               0.531049
   14               0.531049
   15               0.531049

Minimum root mean PRESS        0.4750
Minimizing number of factors   6

Figure 94.3 PLS Variation Summary for Split-Sample Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of
Extracted            Model Effects           Dependent Variables
Factors           Current       Total       Current       Total
    1             97.4607     97.4607       41.9155     41.9155
    2              2.1830     99.6436       24.2435     66.1590
    3              0.1781     99.8217       24.5339     90.6929
    4              0.1197     99.9414        3.7898     94.4827
    5              0.0415     99.9829        1.0045     95.4873
    6              0.0106     99.9935        2.2808     97.7681

The absolute minimum PRESS is achieved with six extracted factors. Notice, however, that this is not much smaller than the PRESS for three factors. By using the CVTEST option, you can perform a statistical model comparison suggested by van der Voet (1994) to test whether this difference is significant, as shown in the following SAS statements:

proc pls data=sample cv=split cvtest(seed=12345);
   model ls ha dt = v1-v27;
run;

The model comparison test is based on a rerandomization of the data. By default, the seed for this randomization is based on the system clock, but it is specified here. The resulting output is shown in Figure 94.4 and Figure 94.5.

Figure 94.4 Testing Split-Sample Validation for Number of Factors

The PLS Procedure

Split-sample Validation for the Number of Extracted Factors

Number of          Root Mean
Extracted Factors      PRESS        T**2    Prob > T**2
    0               1.107747    9.272858         0.0010
    1               0.957983    10.62305         0.0010
    2               0.931314    8.950878         0.0010
    3               0.520222    5.133259         0.1440
    4               0.530501    5.168427         0.1340
    5               0.586786    6.437266         0.0150
    6               0.475047           0         1.0000
    7               0.477595    2.809763         0.4750
    8               0.483138    7.189526         0.0110
    9               0.485739    7.931726         0.0060
   10                0.48946    6.612597         0.0150
   11               0.521445    6.666235         0.0130
   12               0.525653    7.092861         0.0080
   13               0.531049    7.538298         0.0030
   14               0.531049    7.538298         0.0030
   15               0.531049    7.538298         0.0030

Minimum root mean PRESS                    0.4750
Minimizing number of factors               6
Smallest number of factors with p > 0.1    3


Figure 94.5 PLS Variation Summary for Tested Split-Sample Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of
Extracted            Model Effects           Dependent Variables
Factors           Current       Total       Current       Total
    1             97.4607     97.4607       41.9155     41.9155
    2              2.1830     99.6436       24.2435     66.1590
    3              0.1781     99.8217       24.5339     90.6929

The p-value of 0.1440 shown in Figure 94.4 in comparing the cross validated residuals from models with 6 and 3 factors indicates that the difference between the two models is insignificant; therefore, the model with fewer factors is preferred. The variation summary shows that over 99% of the predictor variation and over 90% of the response variation are accounted for by the three factors.

Predicting New Observations

Now that you have chosen a three-factor PLS model for predicting pollutant concentrations based on sample spectra, suppose that you have two new samples. The following SAS statements create a data set containing the spectra for the new samples:

data newobs;
   input obsnam $ v1-v27 @@;
   datalines;
EM17 3933 4518 5637 6006 5721 5187 4641 4149 3789
     3579 3447 3381 3327 3234 3078 2832 2571 2274
     2040 1818 1629 1470 1350 1245 1134 1050  987
EM25 2904 2997 3255 3150 2922 2778 2700 2646 2571
     2487 2370 2250 2127 2052 1713 1419 1200  984
      795  648  525  426  351  291  240  204  162
;

You can apply the PLS model to these samples to estimate pollutant concentration. To do so, append the new samples to the original 16, and specify that the predicted values for all 18 be output to a data set, as shown in the following statements:

data all;
   set sample newobs;
run;

proc pls data=all nfac=3;
   model ls ha dt = v1-v27;
   output out=pred p=p_ls p_ha p_dt;
run;

proc print data=pred;
   where (obsnam in ('EM17','EM25'));
   var obsnam p_ls p_ha p_dt;
run;


The new observations are not used in calculating the PLS model, since they have no response values. Their predicted concentrations are shown in Figure 94.6.

Figure 94.6 Predicted Concentrations for New Observations

Obs   obsnam       p_ls       p_ha      p_dt
 17   EM17       2.54261    0.31877   81.4174
 18   EM25      -0.24716    1.37892   46.3212

Finally, if ODS Graphics is enabled, PLS also displays by default a plot of the amount of variation accounted for by each factor, as well as a correlation loading plot that summarizes the first two dimensions of the PLS model. The following statements, which are the same as the previous split-sample validation analysis but with ODS Graphics enabled, additionally produce Figure 94.7 and Figure 94.8:

ods graphics on;

proc pls data=sample cv=split cvtest(seed=12345);
   model ls ha dt = v1-v27;
run;

Figure 94.7 Split-Sample Cross Validation Plot


Figure 94.8 Correlation Loading Plot

The cross validation plot in Figure 94.7 gives a visual representation of the selection of the optimum number of factors discussed previously. The correlation loading plot is a compact summary of many features of the PLS model. For example, it shows that the first factor is highly positively correlated with all spectral values, indicating that it is approximately an average of them all; the second factor is positively correlated with the lowest frequencies and negatively correlated with the highest, indicating that it is approximately a contrast between the two ends of the spectrum. The observations, represented by their number in the data set on this plot, are generally spaced well apart, indicating that the data give good information about these first two factors. For more details on the interpretation of the correlation loading plot, see the section "ODS Graphics" on page 7684 and Example 94.1.

The default correlation loading plot for just the first two factors depicts most of the model information. In order to see correlation loadings for all three of the selected factors, you use the NFAC= suboption for the PLOT=CORRLOAD option, as in the following:

proc pls data=sample nfac=3 plot=corrload(nfac=3);
   model ls ha dt = v1-v27;
run;

The resulting plot is shown in Figure 94.9.


Figure 94.9 Correlation Loading Plot Matrix


Syntax: PLS Procedure

The following statements are available in the PLS procedure. Items within the angle brackets are optional.

PROC PLS < options > ;
   BY variables ;
   CLASS variables < / option > ;
   EFFECT name=effect-type (variables < / options >) ;
   ID variables ;
   MODEL dependent-variables = effects < / options > ;
   OUTPUT OUT=SAS-data-set < options > ;

To analyze a data set, you must use the PROC PLS and MODEL statements. You can use the other statements as needed. CLASS and EFFECT statements, if present, must precede the MODEL statement.

PROC PLS Statement

PROC PLS < options > ;

The PROC PLS statement invokes the PLS procedure. Optionally, you can also indicate the analysis data and method in the PROC PLS statement. Table 94.1 summarizes the options available in the PROC PLS statement.

Table 94.1 PROC PLS Statement Options

Option        Description
CENSCALE      Displays the centering and scaling information
CV=           Specifies the cross validation method to be used
CVTEST        Specifies that van der Voet's (1994) randomization-based
              model comparison test be performed
DATA=         Names the SAS data set
DETAILS       Displays the details of the fitted model
METHOD=       Specifies the general factor extraction method to be used
MISSING=      Specifies how observations with missing values are to be
              handled in computing the fit
NFAC=         Specifies the number of factors to extract
NOCENTER      Suppresses centering of the responses and predictors
              before fitting
NOCVSTDIZE    Suppresses re-centering and rescaling of the responses and
              predictors when cross validating
NOPRINT       Suppresses the normal display of results
NOSCALE       Suppresses scaling of the responses and predictors before
              fitting
PLOTS         Controls the plots produced through ODS Graphics
VARSCALE      Specifies that continuous model variables be centered and
              scaled
VARSS         Displays the amount of variation accounted for in each
              response and predictor


The following options are available.

CENSCALE
    lists the centering and scaling information for each response and predictor.

CV=ONE
CV=SPLIT < (n) >
CV=BLOCK < (n) >
CV=RANDOM < (cv-random-options) >
CV=TESTSET(SAS-data-set)
    specifies the cross validation method to be used. By default, no cross validation is performed. The method CV=ONE requests one-at-a-time cross validation, CV=SPLIT requests that every nth observation be excluded, CV=BLOCK requests that n blocks of consecutive observations be excluded, CV=RANDOM requests that observations be excluded at random, and CV=TESTSET(SAS-data-set) specifies a test set of observations to be used for validation (formally, this is called "test set validation" rather than "cross validation"). You can, optionally, specify n for CV=SPLIT and CV=BLOCK; the default is n = 7. You can also specify the following optional cv-random-options in parentheses after the CV=RANDOM option:

    NITER=n
        specifies the number of random subsets to exclude. The default value is 10.

    NTEST=n
        specifies the number of observations in each random subset chosen for exclusion. The default value is one-tenth of the total number of observations.

    SEED=n
        specifies an integer used to start the pseudo-random number generator for selecting the random test set. If you do not specify a seed, or specify a value less than or equal to zero, the seed is by default generated from reading the time of day from the computer's clock.
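Putting the CV=RANDOM suboptions together, a hedged sketch using the Sample data set from the Getting Started example might look like the following; the particular suboption values (20 subsets of 4 observations, a fixed seed) are illustrative choices, not recommendations:

```sas
/* Sketch: random-subset cross validation with 20 random subsets
   of 4 observations each and a fixed seed for reproducibility.
   The suboption values shown are arbitrary illustrations. */
proc pls data=sample cv=random(niter=20 ntest=4 seed=12345);
   model ls ha dt = v1-v27;
run;
```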

CVTEST < (cvtest-options) >
    specifies that van der Voet's (1994) randomization-based model comparison test be performed to test models with different numbers of extracted factors against the model that minimizes the predicted residual sum of squares; for more information, see the section "Cross Validation" on page 7680. You can also specify the following cvtest-options in parentheses after the CVTEST option:

    PVAL=n
        specifies the cutoff probability for declaring an insignificant difference. The default value is 0.10.

    STAT=test-statistic
        specifies the test statistic for the model comparison. You can specify either T2, for Hotelling's T^2 statistic, or PRESS, for the predicted residual sum of squares. The default value is T2.

    NSAMP=n
        specifies the number of randomizations to perform. The default value is 1000.


    SEED=n
        specifies the seed value for randomization generation (the clock time is used by default).

DATA=SAS-data-set
    names the SAS data set to be used by PROC PLS. The default is the most recently created data set.

DETAILS
    lists the details of the fitted model for each successive factor. The details listed are different for different extraction methods; for more information, see the section "Displayed Output" on page 7682.

METHOD=PLS < (PLS-options) > | SIMPLS | PCR | RRR
    specifies the general factor extraction method to be used. The value PLS requests partial least squares, SIMPLS requests the SIMPLS method of De Jong (1993), PCR requests principal components regression, and RRR requests reduced rank regression. The default is METHOD=PLS. You can also specify the following optional PLS-options in parentheses after METHOD=PLS:

    ALGORITHM=NIPALS | SVD | EIG | RLGW
        names the specific algorithm used to compute extracted PLS factors. NIPALS requests the usual iterative NIPALS algorithm, SVD bases the extraction on the singular value decomposition of X'Y, EIG bases the extraction on the eigenvalue decomposition of Y'XX'Y, and RLGW is an iterative approach that is efficient when there are many predictors. ALGORITHM=SVD is the most accurate but least efficient approach; the default is ALGORITHM=NIPALS.

    MAXITER=n
        specifies the maximum number of iterations for the NIPALS and RLGW algorithms. The default value is 200.

    EPSILON=n
        specifies the convergence criterion for the NIPALS and RLGW algorithms. The default value is 10^-12.
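As a sketch of how these extraction options combine (the data set and variables are those of the Getting Started example; the particular choices of method and suboption values are illustrative only):

```sas
/* Sketch: SIMPLS extraction instead of the default PLS method */
proc pls data=sample method=simpls nfac=3;
   model ls ha dt = v1-v27;
run;

/* Sketch: default PLS method, but with the more accurate SVD
   algorithm in place of NIPALS (MAXITER value is arbitrary) */
proc pls data=sample method=pls(algorithm=svd maxiter=500);
   model ls ha dt = v1-v27;
run;
```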

MISSING=NONE | AVG | EM < (EM-options) >
    specifies how observations with missing values are to be handled in computing the fit. The default is MISSING=NONE, for which observations with any missing variables (dependent or independent) are excluded from the analysis. MISSING=AVG specifies that the fit be computed by filling in missing values with the average of the nonmissing values for the corresponding variable. If you specify MISSING=EM, then the procedure first computes the model with MISSING=AVG and then fills in missing values by their predicted values based on that model and computes the model again. For both methods of imputation, the imputed values contribute to the centering and scaling values, and the difference between the imputed values and their final predictions contributes to the percentage of variation explained. You can also specify the following optional EM-options in parentheses after MISSING=EM:

    MAXITER=n
        specifies the maximum number of iterations for the imputation/fit loop. The default value is 1. If you specify a large value of MAXITER=, then the loop will iterate until it converges (as controlled by the EPSILON= option).


    EPSILON=n
        specifies the convergence criterion for the imputation/fit loop. The default value is 10^-8. This option is effective only if you specify a large value for the MAXITER= option.
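The EM-options above can be sketched as follows; this is only an illustration of the syntax with the Getting Started data set, and the MAXITER value is an arbitrary large cap chosen so that the imputation/fit loop runs until the EPSILON= criterion is met:

```sas
/* Sketch: iterate the EM imputation/fit loop to convergence
   (MAXITER value is an arbitrary illustrative cap) */
proc pls data=sample missing=em(maxiter=100 epsilon=1e-8);
   model ls ha dt = v1-v27;
run;
```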

NFAC=n
    specifies the number of factors to extract. The default is min(15, p, N), where p is the number of predictors (the number of dependent variables for METHOD=RRR) and N is the number of runs (observations). This is probably more than you need for most applications. Extracting too many factors can lead to an overfit model, one that matches the training data too well, sacrificing predictive ability. Thus, if you use the default NFAC= specification, you should also either use the CV= option to select the appropriate number of factors for the final model or consider the analysis to be preliminary and examine the results to determine the appropriate number of factors for a subsequent analysis.

NOCENTER
    suppresses centering of the responses and predictors before fitting. This is useful if the analysis variables are already centered and scaled. For more information, see the section "Centering and Scaling" on page 7681.

NOCVSTDIZE
    suppresses re-centering and rescaling of the responses and predictors before each model is fit in the cross validation. For more information, see the section "Centering and Scaling" on page 7681.

NOPRINT
    suppresses the normal display of results. This is useful when you want only the output statistics saved in a data set. Note that this option temporarily disables the Output Delivery System (ODS); for more information, see Chapter 22, "Using the Output Delivery System."

NOSCALE
    suppresses scaling of the responses and predictors before fitting. This is useful if the analysis variables are already centered and scaled. For more information, see the section "Centering and Scaling" on page 7681.
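A minimal sketch of the NOCENTER and NOSCALE options together, for the case the options describe (analysis variables standardized in advance); the data set name stdSample is a hypothetical placeholder for a pre-standardized copy of the Getting Started data:

```sas
/* Sketch: skip internal centering and scaling because the
   variables in (hypothetical) stdSample are already standardized */
proc pls data=stdSample nocenter noscale nfac=3;
   model ls ha dt = v1-v27;
run;
```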

PLOTS < (global-plot-options) > < = plot-request< (options) > >PLOTS < (global-plot-options) > < = (plot-request< (options) > < . . . plot-request< (options) > >) >

controls the plots produced through ODS Graphics. When you specify only one plot-request, you can omit the parentheses from around the plot request. For example:

plots=none
plots=cvplot
plots=(diagnostics cvplot)
plots(unpack)=diagnostics
plots(unpack)=(diagnostics corrload(trace=off))

ODS Graphics must be enabled before plots can be requested. For example:

ods graphics on;

proc pls data=pentaTrain;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;

ods graphics off;


For more information about enabling and disabling ODS Graphics, see the section “Enabling and Disabling ODS Graphics” on page 651 in Chapter 23, “Statistical Graphics Using ODS.”

If ODS Graphics is enabled but you do not specify the PLOTS= option, then PROC PLS produces by default a plot of the R-square analysis and a correlation loading plot summarizing the first two factors.

The global-plot-options include the following:

FLIP
interchanges the X-axis and Y-axis dimensions for the score, weight, and loading plots.

ONLY
suppresses the default plots. Only plots specifically requested are displayed.

UNPACKPANEL
UNPACK
suppresses paneling. By default, multiple plots can appear in some output panels. Specify UNPACKPANEL to get each plot in a separate panel. You can specify PLOTS(UNPACKPANEL) to unpack only the default plots. You can also specify UNPACKPANEL as a suboption for certain specific plots, as discussed in the following.

The plot-requests include the following:

ALL
produces all appropriate plots. You can specify other options with ALL—for example, to request all plots and unpack only the residuals, specify PLOTS=(ALL RESIDUALS(UNPACK)).

CORRLOAD < (options) >
produces a correlation loading plot (default). You can specify the following options:

TRACE=OFF | ON
controls how points that correspond to the X-loadings are depicted. You can specify the following values:

OFF specifies that all X-loadings be depicted in the plot by their names plotted at the corresponding point on the graph.

ON specifies that the positions for all the X-loadings be depicted by a “trace” through the corresponding points.

By default, TRACE=ON if there are more than 20 predictors, and TRACE=OFF otherwise.

NFAC=n
specifies the number of factors for which to display correlation loading plots. By default, NFAC=2, which corresponds to a single plot for the first two factors. If you specify a value of n greater than 2, then the n(n-1)/2 plots are displayed together in a matrix of correlation loading plots. The maximum number of factors that can be displayed in such a matrix is 8.


UNPACK
requests that the n(n-1)/2 correlation loading plots be produced separately instead of in a matrix. This option has no effect unless the NFAC=n option is also specified, with a value of n greater than 2.

CVPLOT
produces a cross validation and R-square analysis. This plot requires the CV= option to be specified, and is displayed by default in this case.

DIAGNOSTICS < (UNPACK) >
produces a summary panel of the fit for each dependent variable. The summary by default consists of a panel for each dependent variable, with plots depicting the distribution of residuals and predicted values. You can use the UNPACK suboption to specify that the subplots be produced separately.

DMOD
produces the DMODX, DMODY, and DMODXY plots.

DMODX
produces a plot of the distance of each observation to the X model.

DMODXY
produces plots of the distance of each observation to the X and Y models.

DMODY
produces a plot of the distance of each observation to the Y model.

FIT
produces both the fit diagnostic panel and the ParmProfiles plot.

NONE
suppresses the display of graphics.

PARMPROFILES
produces profiles of the regression coefficients.

RESIDUALS < (UNPACK) >
plots the residuals for each dependent variable against each independent variable. Residual plots are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately.

SCORES < (UNPACK | FLIP) >
produces the XScores, YScores, XYScores, and DModXY plots. You can use the UNPACK suboption to specify that the subplots for scores be produced separately, and the FLIP option to interchange their default X-axis and Y-axis dimensions.

VIP
produces profiles of variable importance factors.


WEIGHTS < (UNPACK | FLIP) >
produces all X and Y loading and weight plots, as well as the VIP plot. You can use the UNPACK suboption to specify that the subplots for weights and loadings be produced separately, and the FLIP option to interchange their default X-axis and Y-axis dimensions.

XLOADINGPLOT < (UNPACK | FLIP) >
produces a scatter plot matrix of X-loadings against each other. Loading scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately, and the FLIP option to interchange the default X-axis and Y-axis dimensions.

XLOADINGPROFILES
produces profiles of the X-loadings.

XSCORES < (UNPACK | FLIP) >
produces a scatter plot matrix of X-scores against each other. Score scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately, and the FLIP option to interchange the default X-axis and Y-axis dimensions.

XWEIGHTPLOT < (UNPACK | FLIP) >
produces a scatter plot matrix of X-weights against each other. Weight scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately, and the FLIP option to interchange the default X-axis and Y-axis dimensions.

XWEIGHTPROFILES
produces profiles of the X-weights.

XYSCORES < (UNPACK) >
produces a scatter plot matrix of X-scores against Y-scores. Score scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately.

YSCORES < (UNPACK | FLIP) >
produces a scatter plot matrix of Y-scores against each other. Score scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately, and the FLIP option to interchange the default X-axis and Y-axis dimensions.

YWEIGHTPLOT < (UNPACK | FLIP) >
produces a scatter plot matrix of Y-weights against each other. Weight scatter plot matrices are by default composed of multiple plots combined into a single panel. You can use the UNPACK suboption to specify that the subplots be produced separately, and the FLIP option to interchange the default X-axis and Y-axis dimensions.


VARSCALE
specifies that continuous model variables be centered and scaled prior to centering and scaling the model effects in which they are involved. The rescaling specified by the VARSCALE option is sometimes more appropriate if the model involves crossproducts between model variables; however, the VARSCALE option still might not produce the model you expect. For more information, see the section “Centering and Scaling” on page 7681.

VARSS
lists, in addition to the average response and predictor sum of squares accounted for by each successive factor, the amount of variation accounted for in each response and predictor.

BY Statement

BY variables ;

You can specify a BY statement in PROC PLS to obtain separate analyses of observations in groups that are defined by the BY variables. When a BY statement appears, the procedure expects the input data set to be sorted in order of the BY variables. If you specify more than one BY statement, only the last one specified is used.

If your input data set is not sorted in ascending order, use one of the following alternatives:

• Sort the data by using the SORT procedure with a similar BY statement.

• Specify the NOTSORTED or DESCENDING option in the BY statement in the PLS procedure. The NOTSORTED option does not mean that the data are unsorted but rather that the data are arranged in groups (according to values of the BY variables) and that these groups are not necessarily in alphabetical or increasing numeric order.

• Create an index on the BY variables by using the DATASETS procedure (in Base SAS software).

For more information about BY-group processing, see the discussion in SAS Language Reference: Concepts. For more information about the DATASETS procedure, see the discussion in the Base SAS Procedures Guide.

CLASS Statement

CLASS variables < / TRUNCATE > ;

The CLASS statement names the classification variables to be used in the model. The PLS procedure uses a singular, GLM parameterization for effects involving CLASS variables, as discussed in the section “GLM Parameterization of Classification Variables and Effects” on page 390 in Chapter 19, “Shared Concepts and Topics.” Typical classification variables are Treatment, Sex, Race, Group, and Replication. If you use the CLASS statement, it must appear before the MODEL statement.

Classification variables can be either character or numeric. By default, class levels are determined from the entire set of formatted values of the CLASS variables.


NOTE: Prior to SAS 9, class levels were determined by using no more than the first 16 characters of the formatted values. To revert to this previous behavior, you can use the TRUNCATE option in the CLASS statement.

In any case, you can use formats to group values into levels. See the discussion of the FORMAT procedure in the Base SAS Procedures Guide and the discussions of the FORMAT statement and SAS formats in SAS Formats and Informats: Reference.

You can specify the following option in the CLASS statement after a slash (/):

TRUNCATE
specifies that class levels should be determined by using only up to the first 16 characters of the formatted values of CLASS variables. When formatted values are longer than 16 characters, you can use this option to revert to the levels as determined in releases prior to SAS 9.

EFFECT Statement

EFFECT name=effect-type (variables < / options >) ;

The EFFECT statement enables you to construct special collections of columns for design matrices. These collections are referred to as constructed effects to distinguish them from the usual model effects formed from continuous or classification variables, as discussed in the section “GLM Parameterization of Classification Variables and Effects” on page 390 in Chapter 19, “Shared Concepts and Topics.”

The following effect-types are available.

COLLECTION specifies a collection effect that defines one or more variables as a single effect with multiple degrees of freedom. The variables in a collection are considered as a unit for estimation and inference.

LAG specifies a classification effect in which the level that is used for a particular period corresponds to the level in the preceding period.

MULTIMEMBER | MM specifies a multimember classification effect whose levels are determined by one or more variables that appear in a CLASS statement.

POLYNOMIAL | POLY specifies a multivariate polynomial effect in the specified numeric variables.

SPLINE specifies a regression spline effect whose columns are univariate spline expansions of one or more variables. A spline expansion replaces the original variable with an expanded or larger set of new variables.

Table 94.2 summarizes the options available in the EFFECT statement.

Table 94.2 EFFECT Statement Options

Option          Description

Collection Effects Options
DETAILS         Displays the constituents of the collection effect

Lag Effects Options
DESIGNROLE=     Names a variable that controls to which lag design an observation is assigned
DETAILS         Displays the lag design of the lag effect
NLAG=           Specifies the number of periods in the lag
PERIOD=         Names the variable that defines the period. This option is required.
WITHIN=         Names the variable or variables that define the group within which each period is defined. This option is required.

Multimember Effects Options
NOEFFECT        Specifies that observations with all missing levels for the multimember variables should have zero values in the corresponding design matrix columns
WEIGHT=         Specifies the weight variable for the contributions of each of the classification effects

Polynomial Effects Options
DEGREE=         Specifies the degree of the polynomial
MDEGREE=        Specifies the maximum degree of any variable in a term of the polynomial
STANDARDIZE=    Specifies centering and scaling suboptions for the variables that define the polynomial

Spline Effects Options
BASIS=          Specifies the type of basis (B-spline basis or truncated power function basis) for the spline effect
DEGREE=         Specifies the degree of the spline effect
KNOTMETHOD=     Specifies how to construct the knots for the spline effect

For more information about the syntax of these effect-types and how columns of constructed effects are computed, see the section “EFFECT Statement” on page 400 in Chapter 19, “Shared Concepts and Topics.”


ID Statement

ID variables ;

The ID statement names variables whose values are used to label observations in plots. If you do not specify an ID statement, then each observation is labeled in plots by its corresponding observation number.

MODEL Statement

MODEL response-variables = predictor-effects < / options > ;

The MODEL statement names the responses and the predictors, which determine the Y and X matrices of the model, respectively. Usually you simply list the names of the predictor variables as the model effects, but you can also use the effects notation of PROC GLM to specify polynomial effects and interactions; for more information, see the section “Specification of Effects” on page 4083 in Chapter 52, “The GLM Procedure.” The MODEL statement is required. You can specify only one MODEL statement (in contrast to the REG procedure, for example, which allows several MODEL statements in the same PROC REG run).

You can specify the following options in the MODEL statement after a slash (/).

INTERCEPT
By default, the responses and predictors are centered; thus, no intercept is required in the model. You can specify the INTERCEPT option to override the default.

SOLUTION
lists the coefficients of the final predictive model for the responses. The coefficients for predicting the centered and scaled responses based on the centered and scaled predictors are displayed, as well as the coefficients for predicting the raw responses based on the raw predictors.

OUTPUT Statement

OUTPUT OUT=SAS-data-set keyword=names < ... keyword=names > ;

You use the OUTPUT statement to specify a data set to receive quantities that can be computed for every input observation, such as extracted factors and predicted values. The following keywords are available:

PREDICTED predicted values for responses

YRESIDUAL residuals for responses

XRESIDUAL residuals for predictors

XSCORE extracted factors (X-scores, latent vectors, latent variables, T)

YSCORE extracted responses (Y-scores, U)

STDY standardized (centered and scaled) responses

STDX standardized (centered and scaled) predictors

H approximate leverage


PRESS approximate predicted residuals

TSQUARE scaled sum of squares of score values

STDXSSE sum of squares of residuals for standardized predictors

STDYSSE sum of squares of residuals for standardized responses

Suppose that there are Nx predictors and Ny responses and that the model has Nf selected factors.

• The keywords XRESIDUAL and STDX define an output variable for each predictor, so Nx names are required after each one.

• The keywords PREDICTED, YRESIDUAL, STDY, and PRESS define an output variable for each response, so Ny names are required after each of these keywords.

• The keywords XSCORE and YSCORE specify an output variable for each selected model factor. For these keywords, you provide only one base name, and the variables corresponding to each successive factor are named by appending the factor number to the base name. For example, if Nf = 3, then a specification of XSCORE=T would produce the variables T1, T2, and T3.

• Finally, the keywords H, TSQUARE, STDXSSE, and STDYSSE each specify a single output variable, so only one name is required after each of these keywords.

Details: PLS Procedure

Regression Methods

All of the predictive methods implemented in PROC PLS work essentially by finding linear combinations of the predictors (factors) to use to predict the responses linearly. The methods differ only in how the factors are derived, as explained in the following sections.

Partial Least Squares

Partial least squares (PLS) works by extracting one factor at a time. Let X = X0 be the centered and scaled matrix of predictors and let Y = Y0 be the centered and scaled matrix of response values. The PLS method starts with a linear combination t = X0w of the predictors, where t is called a score vector and w is its associated weight vector. The PLS method predicts both X0 and Y0 by regression on t:

X̂0 = tp′, where p′ = (t′t)^-1 t′X0
Ŷ0 = tc′, where c′ = (t′t)^-1 t′Y0

The vectors p and c are called the X- and Y-loadings, respectively.

The specific linear combination t = X0w is the one that has maximum covariance t′u with some response linear combination u = Y0q. Another characterization is that the X- and Y-weights w and q are proportional to the first left and right singular vectors of the covariance matrix X0′Y0 or, equivalently, the first eigenvectors of X0′Y0Y0′X0 and Y0′X0X0′Y0, respectively.


This accounts for how the first PLS factor is extracted. The second factor is extracted in the same way by replacing X0 and Y0 with the X- and Y-residuals from the first factor:

X1 = X0 − X̂0
Y1 = Y0 − Ŷ0

These residuals are also called the deflated X and Y blocks. The process of extracting a score vector and deflating the data matrices is repeated for as many extracted factors as are wanted.
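The extract-and-deflate scheme above can be sketched in a few lines of matrix code. The following NumPy sketch is illustrative only, not the PROC PLS implementation; the function name extract_pls_factors and its interface are hypothetical, and the inputs are assumed to be already centered and scaled.

```python
import numpy as np

def extract_pls_factors(X0, Y0, nfac):
    """Illustrative PLS extraction: weights from the SVD of X'Y, then
    score, loadings, and deflation, one factor at a time."""
    X, Y = X0.copy(), Y0.copy()
    scores, xloads, yloads = [], [], []
    for _ in range(nfac):
        # w = first left singular vector of X'Y: the weight vector whose
        # score t = Xw has maximum covariance with a response combination
        U, _, _ = np.linalg.svd(X.T @ Y, full_matrices=False)
        w = U[:, 0]
        t = X @ w                    # score vector
        p = X.T @ t / (t @ t)        # X-loadings: p' = (t't)^-1 t'X
        c = Y.T @ t / (t @ t)        # Y-loadings: c' = (t't)^-1 t'Y
        X = X - np.outer(t, p)       # deflate the X block
        Y = Y - np.outer(t, c)       # deflate the Y block
        scores.append(t)
        xloads.append(p)
        yloads.append(c)
    return np.array(scores).T, np.array(xloads).T, np.array(yloads).T
```

Because each deflation leaves the residual X block orthogonal to t, the successive score vectors extracted this way are mutually orthogonal.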

SIMPLS

Note that each extracted PLS factor is defined in terms of different X-variables Xi. This leads to difficulties in comparing different scores, weights, and so forth. The SIMPLS method of De Jong (1993) overcomes these difficulties by computing each score ti = Xri in terms of the original (centered and scaled) predictors X. The SIMPLS X-weight vectors ri are similar to the eigenvectors of SS′ = X′YY′X, but they satisfy a different orthogonality condition. The r1 vector is just the first eigenvector e1 (so that the first SIMPLS score is the same as the first PLS score), but whereas the second eigenvector maximizes

e1′SS′e2  subject to  e1′e2 = 0

the second SIMPLS weight r2 maximizes

r1′SS′r2  subject to  r1′X′Xr2 = t1′t2 = 0

The SIMPLS scores are identical to the PLS scores for one response but slightly different for more than one response; see De Jong (1993) for details. The X- and Y-loadings are defined as in PLS, but since the scores are all defined in terms of X, it is easy to compute the overall model coefficients B:

Ŷ = Σi ti ci′ = Σi X ri ci′ = XB,  where B = RC′

Principal Components Regression

Like the SIMPLS method, principal components regression (PCR) defines all the scores in terms of the original (centered and scaled) predictors X. However, unlike both the PLS and SIMPLS methods, the PCR method chooses the X-weights/X-scores without regard to the response data. The X-scores are chosen to explain as much variation in X as possible; equivalently, the X-weights for the PCR method are the eigenvectors of the predictor covariance matrix X′X. Again, the X- and Y-loadings are defined as in PLS; but, as in SIMPLS, it is easy to compute overall model coefficients for the original (centered and scaled) responses Y in terms of the original predictors X.
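Under the same conventions, the PCR factor choice can be sketched as follows. This is an illustrative NumPy sketch, not PROC PLS itself; the name pcr_fit is hypothetical, and X is assumed centered and scaled. With all factors retained, the coefficients collapse to the ordinary least squares solution.

```python
import numpy as np

def pcr_fit(X, Y, nfac):
    """Illustrative PCR: X-weights are the leading eigenvectors of X'X
    (chosen without regard to Y); Y is then regressed on the scores."""
    eigval, eigvec = np.linalg.eigh(X.T @ X)
    R = eigvec[:, np.argsort(eigval)[::-1][:nfac]]  # top-nfac eigenvectors
    T = X @ R                                       # X-scores
    C = np.linalg.lstsq(T, Y, rcond=None)[0]        # Y-loadings
    return R @ C                                    # coefficients B, with Yhat = X B
```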


Reduced Rank Regression

As discussed in the preceding sections, partial least squares depends on selecting factors t = Xw of the predictors and u = Yq of the responses that have maximum covariance, whereas principal components regression effectively ignores u and selects t to have maximum variance, subject to orthogonality constraints. In contrast, reduced rank regression selects u to account for as much variation in the predicted responses as possible, effectively ignoring the predictors for the purposes of factor extraction. In reduced rank regression, the Y-weights qi are the eigenvectors of the covariance matrix Ŷ′LS ŶLS of the responses predicted by ordinary least squares regression; the X-scores are the projections of the Y-scores Yqi onto the X space.

Relationships between Methods

When you develop a predictive model, it is important to consider not only the explanatory power of the model for current responses, but also how well sampled the predictive functions are, since this affects how well the model can extrapolate to future observations. All of the techniques implemented in the PLS procedure work by extracting successive factors, or linear combinations of the predictors, that optimally address one or both of these two goals—explaining response variation and explaining predictor variation. In particular, principal components regression selects factors that explain as much predictor variation as possible, reduced rank regression selects factors that explain as much response variation as possible, and partial least squares balances the two objectives, seeking factors that explain both response and predictor variation.

To see the relationships between these methods, consider how each one extracts a single factor from the following artificial data set consisting of two predictors and one response:

data data;
   input x1 x2 y;
   datalines;
 3.37651  2.30716  0.75615
 0.74193 -0.88845  1.15285
 4.18747  2.17373  1.42392
 0.96097  0.57301  0.27433
-1.11161 -0.75225 -0.25410
-1.38029 -1.31343 -0.04728
 1.28153 -0.13751  1.00341
-1.39242 -2.03615  0.45518
 0.63741  0.06183  0.40699
-2.52533 -1.23726 -0.91080
 2.44277  3.61077 -0.82590
;

proc pls data=data nfac=1 method=rrr;
   model y = x1 x2;
run;

proc pls data=data nfac=1 method=pcr;
   model y = x1 x2;
run;

proc pls data=data nfac=1 method=pls;
   model y = x1 x2;
run;


The amount of model and response variation explained by the first factor for each method is shown in Figure 94.10 through Figure 94.12.

Figure 94.10 Variation Explained by First Reduced Rank Regression Factor

The PLS Procedure

Percent Variation Accounted for by Reduced Rank Regression Factors

Number of             Model Effects         Dependent Variables
Extracted Factors    Current      Total     Current       Total
        1            15.0661    15.0661    100.0000    100.0000

Figure 94.11 Variation Explained by First Principal Components Regression Factor

The PLS Procedure

Percent Variation Accounted for by Principal Components

Number of             Model Effects         Dependent Variables
Extracted Factors    Current      Total     Current       Total
        1            92.9996    92.9996      9.3787      9.3787

Figure 94.12 Variation Explained by First Partial Least Squares Regression Factor

The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors

Number of             Model Effects         Dependent Variables
Extracted Factors    Current      Total     Current       Total
        1            88.5357    88.5357     26.5304     26.5304

Notice that, while the first reduced rank regression factor explains all of the response variation, it accounts for only about 15% of the predictor variation. In contrast, the first principal components regression factor accounts for most of the predictor variation (93%) but only 9% of the response variation. The first partial least squares factor accounts for only slightly less predictor variation than principal components but about three times as much response variation.

Figure 94.13 illustrates how partial least squares balances the goals of explaining response and predictor variation in this case.


Figure 94.13 Depiction of First Factors for Three Different Regression Methods

The ellipse shows the general shape of the 11 observations in the predictor space, with the contours of increasing y overlaid. Also shown are the directions of the first factor for each of the three methods. Notice that, while the predictors vary most in the x1 = x2 direction, the response changes most in the orthogonal x1 = -x2 direction. This explains why the first principal component accounts for little variation in the response and why the first reduced rank regression factor accounts for little variation in the predictors. The direction of the first partial least squares factor represents a compromise between the other two directions.


Cross Validation

None of the regression methods implemented in the PLS procedure fit the observed data any better than ordinary least squares (OLS) regression; in fact, all of the methods approach OLS as more factors are extracted. The crucial point is that, when there are many predictors, OLS can overfit the observed data; biased regression methods with fewer extracted factors can provide better predictability of future observations. However, as the preceding observations imply, the quality of the observed data fit cannot be used to choose the number of factors to extract; the number of extracted factors must be chosen on the basis of how well the model fits observations not involved in the modeling procedure itself.

One method of choosing the number of extracted factors is to fit the model to only part of the available data (the training set) and to measure how well models with different numbers of extracted factors fit the other part of the data (the test set). This is called test set validation. However, it is rare that you have enough data to make both parts large enough for pure test set validation to be useful. Alternatively, you can make several different divisions of the observed data into training set and test set. This is called cross validation, and there are several different types. In one-at-a-time cross validation, the first observation is held out as a single-element test set, with all other observations as the training set; next, the second observation is held out, then the third, and so on. Another method is to hold out successive blocks of observations as test sets—for example, observations 1 through 7, then observations 8 through 14, and so on; this is known as blocked validation. A similar method is split-sample cross validation, in which successive groups of widely separated observations are held out as the test set—for example, observations {1, 11, 21, ...}, then observations {2, 12, 22, ...}, and so on. Finally, test sets can be selected from the observed data randomly; this is known as random sample cross validation.

Which validation you should use depends on your data. Test set validation is preferred when you have enough data to make a division into a sizable training set and test set that represent the predictive population well. You can specify that the number of extracted factors be selected by test set validation by using the CV=TESTSET(data set) option, where data set is the name of the data set containing the test set. If you do not have enough data for test set validation, you can use one of the cross validation techniques. The most common technique is one-at-a-time validation (which you can specify with the CV=ONE option or just the CV option), unless the observed data are serially correlated, in which case either blocked or split-sample validation might be more appropriate (CV=BLOCK or CV=SPLIT); you can specify the number of test sets in blocked or split-sample validation with a number in parentheses after the CV= option. Note that CV=ONE is the most computationally intensive of the cross validation methods, since it requires a recomputation of the PLS model for every input observation. Also, note that using random subset selection with CV=RANDOM might lead two different researchers to produce different PLS models on the same data (unless the same seed is used).
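The one-at-a-time scheme can be sketched generically: hold out each observation, refit with each candidate number of factors, and accumulate PRESS. This NumPy sketch is illustrative only (press_loo and its fit_predict callback are hypothetical names, not part of PROC PLS), but it shows why CV=ONE requires a full refit of the model for every input observation.

```python
import numpy as np

def press_loo(X, Y, fit_predict, max_fac):
    """Illustrative one-at-a-time cross validation: for each candidate
    number of factors, hold out each observation in turn, refit on the
    rest, and accumulate the predicted residual sum of squares (PRESS).

    fit_predict(Xtrain, Ytrain, Xtest, nfac) is any biased-regression
    fitter (PLS, PCR, RRR, ...) returning predictions for Xtest."""
    n = X.shape[0]
    press = np.zeros(max_fac)
    for i in range(n):                      # each observation is a test set
        keep = np.arange(n) != i
        for k in range(1, max_fac + 1):     # refit for every factor count
            yhat = fit_predict(X[keep], Y[keep], X[i:i + 1], k)
            press[k - 1] += np.sum((Y[i:i + 1] - yhat) ** 2)
    return press  # the usual choice is the factor count minimizing PRESS
```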

Whichever validation method you use, the number of factors chosen is usually the one that minimizes the predicted residual sum of squares (PRESS); this is the default choice if you specify any of the CV methods with PROC PLS. However, often models with fewer factors have PRESS statistics that are only marginally larger than the absolute minimum. To address this, Van der Voet (1994) has proposed a statistical test for comparing the predicted residuals from different models; when you apply van der Voet’s test, the number of factors chosen is the fewest with residuals that are insignificantly larger than the residuals of the model with minimum PRESS.

To see how van der Voet’s test works, let R(i,jk) be the jth predicted residual for response k for the model with i extracted factors; the PRESS statistic is Σjk R(i,jk)². Also, let imin be the number of factors for which PRESS is minimized. The critical value for van der Voet’s test is based on the differences between squared predicted residuals

D(i,jk) = R(i,jk)² − R(imin,jk)²

One alternative for the critical value is Ci = Σjk D(i,jk), which is just the difference between the PRESS statistics for i and imin factors; alternatively, van der Voet suggests Hotelling’s T² statistic Ci = di,·′ Si^-1 di,·, where di,· is the sum of the vectors di,j = {D(i,j1), ..., D(i,jNy)}′ and Si is the sum of squares and crossproducts matrix

Si = Σj di,j di,j′

Virtually, the significance level for van der Voet’s test is obtained by comparing Ci with the distribution of values that result from randomly exchanging R(i,jk)² and R(imin,jk)². In practice, a Monte Carlo sample of such values is simulated and the significance level is approximated as the proportion of simulated critical values that are greater than Ci. If you apply van der Voet’s test by specifying the CVTEST option, then, by default, the number of extracted factors chosen is the least number with an approximate significance level that is greater than 0.10.
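The randomization step can be sketched as follows. This NumPy sketch is a simplified illustration (the name vdvoet_pvalue is hypothetical; it uses the simple critical value Ci = Σjk D(i,jk) and exchanges cell by cell, whereas the procedure's exact resampling scheme may differ), but it shows how the Monte Carlo significance level is formed.

```python
import numpy as np

def vdvoet_pvalue(D, nsim=1000, seed=0):
    """Simplified Monte Carlo significance level in the spirit of van der
    Voet's test, using the simple critical value C_i = sum of D(i,jk).

    D holds the differences R(i,jk)^2 - R(imin,jk)^2.  Exchanging the two
    squared residuals in a cell flips the sign of that D(i,jk), so the
    null distribution is simulated here by random sign flips."""
    rng = np.random.default_rng(seed)
    c_obs = D.sum()
    signs = rng.choice([-1.0, 1.0], size=(nsim,) + D.shape)
    c_sim = (signs * D).sum(axis=tuple(range(1, D.ndim + 1)))
    # proportion of simulated critical values exceeding the observed one
    return np.mean(c_sim > c_obs)
```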

Centering and Scaling

By default, the predictors and the responses are centered and scaled to have mean 0 and standard deviation 1. Centering the predictors and the responses ensures that the criterion for choosing successive factors is based on how much variation they explain, in either the predictors or the responses or both. (See the section “Regression Methods” on page 7675 for more details on how different methods explain variation.) Without centering, both the mean variable value and the variation around that mean are involved in selecting factors. Scaling serves to place all predictors and responses on an equal footing relative to their variation in the data. For example, if Time and Temp are two of the predictors, then scaling says that a change of std(Time) in Time is roughly equivalent to a change of std(Temp) in Temp.

Usually, both the predictors and responses should be centered and scaled. However, if their values already represent variation around a nominal or target value, then you can use the NOCENTER option in the PROC PLS statement to suppress centering. Likewise, if the predictors or responses are already all on comparable scales, then you can use the NOSCALE option to suppress scaling.
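For reference, the default standardization amounts to the following column-wise operation (a NumPy sketch with a hypothetical function name; SAS computes the standard deviation with the n-1 denominator):

```python
import numpy as np

def center_scale(A):
    """Center each column to mean 0 and scale it to standard deviation 1
    (using the n-1 denominator, as SAS does)."""
    return (A - A.mean(axis=0)) / A.std(axis=0, ddof=1)
```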

Note that, if the predictors involve crossproduct terms, then, by default, the variables are not standardized before standardizing the crossproduct. That is, if the ith values of two predictors are denoted $x^1_i$ and $x^2_i$, then the default standardized ith value of the crossproduct is

$$\frac{x^1_i x^2_i - \mathrm{mean}_j(x^1_j x^2_j)}{\mathrm{std}_j(x^1_j x^2_j)}$$

If you want the crossproduct to be based instead on standardized variables

$$\frac{x^1_i - m_1}{s_1} \times \frac{x^2_i - m_2}{s_2}$$


where $m_k = \mathrm{mean}_j(x^k_j)$ and $s_k = \mathrm{std}_j(x^k_j)$ for $k = 1, 2$, then you should use the VARSCALE option in the PROC PLS statement. Standardizing the variables separately is usually a good idea, but unless the model also contains all crossproducts nested within each term, the resulting model might not be equivalent to a simple linear model in the same terms. To see this, note that a model involving the crossproduct of two standardized variables

$$\frac{x^1_i - m_1}{s_1} \times \frac{x^2_i - m_2}{s_2} = x^1_i x^2_i \frac{1}{s_1 s_2} - x^1_i \frac{m_2}{s_1 s_2} - x^2_i \frac{m_1}{s_1 s_2} + \frac{m_1 m_2}{s_1 s_2}$$

involves both the crossproduct term and the linear terms for the unstandardized variables.
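The identity can be checked numerically; this Python fragment (illustration only) confirms that the two sides agree term by term:

```python
import numpy as np

rng = np.random.default_rng(1)
x1, x2 = rng.normal(size=20), rng.normal(size=20)
m1, s1 = x1.mean(), x1.std(ddof=1)
m2, s2 = x2.mean(), x2.std(ddof=1)

# left side: crossproduct of the standardized variables
lhs = ((x1 - m1) / s1) * ((x2 - m2) / s2)
# right side: crossproduct plus linear terms in the raw variables
rhs = (x1 * x2 / (s1 * s2)
       - x1 * m2 / (s1 * s2)
       - x2 * m1 / (s1 * s2)
       + m1 * m2 / (s1 * s2))
assert np.allclose(lhs, rhs)
```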

When cross validation is performed for the number of effects, there is some disagreement among practitioners as to whether each cross validation training set should be retransformed. By default, PROC PLS does so, but you can suppress this behavior by specifying the NOCVSTDIZE option in the PROC PLS statement.

Missing Values

By default, PROC PLS handles missing values very simply. Observations with any missing independent variables (including all classification variables) are excluded from the analysis, and no predictions are computed for such observations. Observations with no missing independent variables but any missing dependent variables are also excluded from the analysis, but predictions are computed.

However, the MISSING= option in the PROC PLS statement provides more sophisticated ways of modeling in the presence of missing values. If you specify MISSING=AVG or MISSING=EM, then all observations in the input data set contribute to both the analysis and the OUTPUT OUT= data set. With MISSING=AVG, the fit is computed by filling in missing values with the average of the nonmissing values for the corresponding variable. With MISSING=EM, the procedure first computes the model with MISSING=AVG, then fills in missing values with their predicted values based on that model and computes the model again. Alternatively, you can specify MISSING=EM(MAXITER=n) with a large value of n in order to perform this imputation/fit loop until convergence.
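The two strategies can be sketched as follows in Python (illustrative only; the function names and the pluggable fit_predict callable are my own, standing in for a full model fit):

```python
import numpy as np

def fill_missing_avg(X):
    """MISSING=AVG idea: replace each missing cell with the mean of
    the nonmissing values in its column."""
    X = np.array(X, dtype=float)
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def fill_missing_em(X, fit_predict, max_iter=10, tol=1e-8):
    """MISSING=EM idea: start from the mean fill, then alternately
    refit the model and replace missing cells with its predictions
    until the imputed values stop changing."""
    X = np.array(X, dtype=float)
    miss = np.isnan(X)
    Xf = fill_missing_avg(X)
    for _ in range(max_iter):
        pred = fit_predict(Xf)
        change = np.abs(Xf[miss] - pred[miss]).max() if miss.any() else 0.0
        Xf[miss] = pred[miss]
        if change < tol:
            break
    return Xf
```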

Displayed Output

By default, PROC PLS displays just the amount of predictor and response variation accounted for by each factor.

If you perform a cross validation for the number of factors by specifying the CV option in the PROC PLS statement, then the procedure displays a summary of the cross validation for each number of factors, along with information about the optimal number of factors.

If you specify the DETAILS option in the PROC PLS statement, then details of the fitted model are displayed for each successive factor. These details for each number of factors include the following:

• the predictor loadings

• the predictor weights

• the response weights

• the coded regression coefficients (for METHOD=SIMPLS, PCR, or RRR)

If you specify the CENSCALE option in the PROC PLS statement, then centering and scaling information for each response and predictor is displayed.

If you specify the VARSS option in the PROC PLS statement, the procedure displays, in addition to the average response and predictor sum of squares accounted for by each successive factor, the amount of variation accounted for in each response and predictor.

If you specify the SOLUTION option in the MODEL statement, then PROC PLS displays the coefficients of the final predictive model for the responses. The coefficients for predicting the centered and scaled responses based on the centered and scaled predictors are displayed, as well as the coefficients for predicting the raw responses based on the raw predictors.

ODS Table Names

PROC PLS assigns a name to each table it creates. You can use these names to reference the table when using the Output Delivery System (ODS) to select tables and create output data sets. These names are listed in Table 94.3. For more information about ODS, see Chapter 22, “Using the Output Delivery System.”

Table 94.3 ODS Tables Produced by PROC PLS

ODS Table Name       Description                                        Statement   Option
CVResults            Results of cross validation                        PROC PLS    CV
CenScaleParms        Parameter estimates for centered and scaled data   MODEL       SOLUTION
CodedCoef            Coded coefficients                                 PROC PLS    DETAILS
MissingIterations    Iterations for missing value imputation            PROC PLS    MISSING=EM
ModelInfo            Model information                                  PROC PLS    Default
NObs                 Number of observations                             PROC PLS    Default
ParameterEstimates   Parameter estimates for raw data                   MODEL       SOLUTION
PercentVariation     Variation accounted for by each factor             PROC PLS    Default
ResidualSummary      Residual summary from cross validation             PROC PLS    CV
XEffectCenScale      Centering and scaling information for              PROC PLS    CENSCALE
                     predictor effects
XLoadings            Loadings for independents                          PROC PLS    DETAILS
XVariableCenScale    Centering and scaling information for              PROC PLS    CENSCALE and
                     predictor variables                                            VARSCALE
XWeights             Weights for independents                           PROC PLS    DETAILS
YVariableCenScale    Centering and scaling information for              PROC PLS    CENSCALE
                     responses
YWeights             Weights for dependents                             PROC PLS    DETAILS


ODS Graphics

Statistical procedures use ODS Graphics to create graphs as part of their output. ODS Graphics is described in detail in Chapter 23, “Statistical Graphics Using ODS.”

Before you create graphs, ODS Graphics must be enabled (for example, by specifying the ODS GRAPHICS ON statement). For more information about enabling and disabling ODS Graphics, see the section “Enabling and Disabling ODS Graphics” on page 651 in Chapter 23, “Statistical Graphics Using ODS.”

The overall appearance of graphs is controlled by ODS styles. Styles and other aspects of using ODS Graphics are discussed in the section “A Primer on ODS Statistical Graphics” on page 650 in Chapter 23, “Statistical Graphics Using ODS.”

When ODS Graphics is enabled, by default the PLS procedure produces a plot of the variation accounted for by each extracted factor, as well as a correlation loading plot for the first two extracted factors (if the final model has at least two factors). The plot of the variation accounted for can take several forms:

• If the PLS analysis does not include cross validation, then the plot shows the total R square for both model effects and the dependent variables against the number of factors.

• If you specify the CV= option to select the number of factors in the final model by cross validation, then the plot shows the R-square analysis discussed previously as well as the root mean PRESS from the cross validation analysis, with the selected number of factors identified by a vertical line.

The correlation loading plot for the first two factors summarizes many aspects of the two most significant dimensions of the model. It consists of overlaid scatter plots of the scores of the first two factors, the loadings of the model effects, and the loadings of the dependent variables. The loadings are scaled so that the amount of variation in the variables that is explained by the model is proportional to the distance from the origin; circles indicating various levels of explained variation are also overlaid on the correlation loading plot. Also, the correlation between the model approximations for any two variables is proportional to the length of the projection of the point corresponding to one variable on a line through the origin passing through the point corresponding to the other variable; the sign of the correlation corresponds to which side of the origin the projected point falls on.

The R square and the first two correlation loadings are plotted by default when ODS Graphics is enabled, but you can produce many other plots for the PROC PLS analysis.

ODS Graph Names

PROC PLS assigns a name to each graph it creates using ODS. You can use these names to reference the graphs when using ODS. The names are listed in Table 94.4.

Table 94.4 Graphs Produced by PROC PLS

ODS Graph Name           Plot Description                         Option
CorrLoadPlot             Correlation loading plot (default)       PLOT=CORRLOAD(option)
CVPlot                   Cross validation and R-square            CV=
                         analysis (default, as appropriate)
DModXPlot                Distance of each observation to          PLOT=DMODX
                         the X model
DModXYPlot               Distance of each observation to          PLOT=DMODXY
                         the X and Y models
DModYPlot                Distance of each observation to          PLOT=DMODY
                         the Y model
DiagnosticsPanel         Panel of diagnostic plots for the fit    PLOT=DIAGNOSTICS
AbsResidualByPredicted   Absolute residual by predicted values    PLOT=DIAGNOSTICS(UNPACK)
ObservedByPredicted      Observed by predicted                    PLOT=DIAGNOSTICS(UNPACK)
QQPlot                   Residual Q-Q plot                        PLOT=DIAGNOSTICS(UNPACK)
ResidualByPredicted      Residual by predicted values             PLOT=DIAGNOSTICS(UNPACK)
ResidualHistogram        Residual histogram                       PLOT=DIAGNOSTICS(UNPACK)
RFPlot                   RF plot                                  PLOT=DIAGNOSTICS(UNPACK)
ParmProfiles             Profiles of regression coefficients      PLOT=PARMPROFILES
R2Plot                   R-square analysis (default,
                         as appropriate)
ResidualPlots            Residuals for each dependent variable    PLOT=RESIDUALS
VariableImportancePlot   Profile of variable importance factors   PLOT=VIP
XLoadingPlot             Scatter plot matrix of X-loadings        PLOT=XLOADINGPLOT
                         against each other
XLoadingProfiles         Profiles of the X-loadings               PLOT=XLOADINGPROFILES
XScorePlot               Scatter plot matrix of X-scores          PLOT=XSCORES
                         against each other
XWeightPlot              Scatter plot matrix of X-weights         PLOT=XWEIGHTPLOT
                         against each other
XWeightProfiles          Profiles of the X-weights                PLOT=XWEIGHTPROFILES
XYScorePlot              Scatter plot matrix of X-scores          PLOT=XYSCORES
                         against Y-scores
YScorePlot               Scatter plot matrix of Y-scores          PLOT=YSCORES
                         against each other
YWeightPlot              Scatter plot matrix of Y-weights         PLOT=YWEIGHTPLOT
                         against each other


Examples: PLS Procedure

Example 94.1: Examining Model Details

This example, from Umetrics (1995), demonstrates different ways to examine a PLS model. The data come from the field of drug discovery. New drugs are developed from chemicals that are biologically active. Testing a compound for biological activity is an expensive procedure, so it is useful to be able to predict biological activity from cheaper chemical measurements. In fact, computational chemistry makes it possible to calculate certain chemical measurements without even making the compound. These measurements include size, lipophilicity, and polarity at various sites on the molecule. The following statements create a data set named pentaTrain, which contains these data.

data pentaTrain;
   input obsnam $ S1 L1 P1 S2 L2 P2
                  S3 L3 P3 S4 L4 P4
                  S5 L5 P5 log_RAI @@;
   n = _n_;
   datalines;
VESSK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         1.9607 -1.6324  0.5746  1.9607 -1.6324  0.5746
         2.8369  1.4092 -3.1398  0.00
VESAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         1.9607 -1.6324  0.5746  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.28
VEASK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  1.9607 -1.6324  0.5746
         2.8369  1.4092 -3.1398  0.20
VEAAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.51
VKAAK   -2.6931 -2.5271 -1.2871  2.8369  1.4092 -3.1398
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.11
VEWAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
        -4.7548  3.6521  0.8524  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  2.73
VEAAP   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
        -1.2201  0.8829  2.2253  0.18
VEHAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         2.4064  1.7438  1.1057  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  1.53
VAAAK   -2.6931 -2.5271 -1.2871  0.0744 -1.7333  0.0902
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398 -0.10
GEAAK    2.2261 -5.3648  0.3049  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398 -0.52
LEAAK   -4.1921 -1.0285 -0.9801  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.40
FEAAK   -4.9217  1.2977  0.4473  3.0777  0.3891 -0.0701
         0.0744 -1.7333  0.0902  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.30
VEGGK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
         2.2261 -5.3648  0.3049  2.2261 -5.3648  0.3049
         2.8369  1.4092 -3.1398 -1.00
VEFAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
        -4.9217  1.2977  0.4473  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  1.57
VELAK   -2.6931 -2.5271 -1.2871  3.0777  0.3891 -0.0701
        -4.1921 -1.0285 -0.9801  0.0744 -1.7333  0.0902
         2.8369  1.4092 -3.1398  0.59
;

You would like to study the relationship between these measurements and the activity of the compound, represented by the logarithm of the relative Bradykinin activating activity (log_RAI). Notice that these data consist of many predictors relative to the number of observations. Partial least squares is especially appropriate in this situation as a useful tool for finding a few underlying predictive factors that account for most of the variation in the response. Typically, the model is fit for part of the data (the “training” or “work” set), and the quality of the fit is judged by how well it predicts the other part of the data (the “test” or “prediction” set). For this example, the first 15 observations serve as the training set and the rest constitute the test set (see Ufkes et al. 1978, 1982).

When you fit a PLS model, you hope to find a few PLS factors that explain most of the variation in both predictors and responses. Factors that explain response variation provide good predictive models for new responses, and factors that explain predictor variation are well represented by the observed values of the predictors. The following statements fit a PLS model with two factors and save predicted values, residuals, and other information for each data point in a data set named outpls.
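To make the factor extraction concrete, the following Python sketch implements bare-bones PLS1 (single-response) factor extraction by NIPALS-style deflation. It is an illustration of the underlying computation under my own naming and conventions, not PROC PLS's actual implementation, and it assumes X and y are already centered and scaled:

```python
import numpy as np

def pls1_nipals(X, y, n_factors):
    """Extract PLS factors for a single response and return the
    regression coefficients of the centered/scaled model."""
    X, y = X.copy().astype(float), y.copy().astype(float)
    W, P, Q = [], [], []
    for _ in range(n_factors):
        w = X.T @ y
        w /= np.linalg.norm(w)        # predictor weights
        t = X @ w                     # X-scores
        p = X.T @ t / (t @ t)         # predictor loadings
        q = (y @ t) / (t @ t)         # response loading
        X -= np.outer(t, p)           # deflate predictors
        y -= q * t                    # deflate response
        W.append(w); P.append(p); Q.append(q)
    W, P = np.array(W).T, np.array(P).T
    # coefficients such that fitted y = X @ B for the original (centered) X
    return W @ np.linalg.inv(P.T @ W) @ np.array(Q)
```

Each pass chooses weights proportional to the covariance of the current predictors with the response, so successive factors explain as much response variation as possible while remaining well represented in the predictors.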

proc pls data=pentaTrain;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;

The PLS procedure displays a table, shown in Output 94.1.1, showing how much predictor and response variation is explained by each PLS factor.


Output 94.1.1 Amount of Training Set Variation Explained

The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors

Number of                 Model Effects          Dependent Variables
Extracted Factors       Current      Total       Current      Total

1 16.9014 16.9014 89.6399 89.6399

2 12.7721 29.6735 7.8368 97.4767

3 14.6554 44.3289 0.4636 97.9403

4 11.8421 56.1710 0.2485 98.1889

5 10.5894 66.7605 0.1494 98.3383

6 5.1876 71.9481 0.2617 98.6001

7 6.1873 78.1354 0.2428 98.8428

8 7.2252 85.3606 0.1926 99.0354

9 6.7285 92.0891 0.0725 99.1080

10 7.9076 99.9967 0.0000 99.1080

11 0.0033 100.0000 0.0099 99.1179

12 0.0000 100.0000 0.0000 99.1179

13 0.0000 100.0000 0.0000 99.1179

14 0.0000 100.0000 0.0000 99.1179

15 0.0000 100.0000 0.0000 99.1179

From Output 94.1.1, note that 97% of the response variation is already explained by just two factors, but only 29% of the predictor variation is explained.

The graphics in PROC PLS, available when ODS Graphics is enabled, make it easier to see features of the PLS model.

If ODS Graphics is enabled, then in addition to the tables discussed previously, PROC PLS displays a graphical depiction of the R-square analysis as well as a correlation loading plot summarizing the model based on the first two PLS factors. The following statements perform the previous analysis with ODS Graphics enabled, producing Output 94.1.2 and Output 94.1.3.

ods graphics on;

proc pls data=pentaTrain;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;


Output 94.1.2 Plot of Proportion of Variation Accounted For


Output 94.1.3 Correlation Loading Plot

The plot in Output 94.1.2 of the proportion of variation explained (or R square) makes it clear that there is a plateau in the response variation after two factors are included in the model. The correlation loading plot in Output 94.1.3 summarizes many features of this two-factor model, including the following:

• The X-scores are plotted as numbers for each observation. You should look for patterns or clearly grouped observations. If you see a curved pattern, for example, you might want to add a quadratic term. Two or more groupings of observations indicate that it might be better to analyze the groups separately, perhaps by including classification effects in the model. This plot appears to show most of the observations close together, with a few being more spread out with larger positive X-scores for factor 2. There are no clear grouping patterns, but observation 13 stands out.

• The loadings show how much variation in each variable is accounted for by the first two factors, jointly by the distance of the corresponding point from the origin and individually by the distance for the projections of this point onto the horizontal and vertical axes. That the dependent variable is well explained by the model is reflected in the fact that the point for log_RAI is near the 100% circle.

• You can also use the projection interpretation to relate variables to each other. For example, projecting other variables' points onto the line that runs through the log_RAI point and the origin, you can see that the PLS approximation for the predictor L3 is highly positively correlated with log_RAI, S3 is somewhat less correlated but in the negative direction, and several predictors including L1, L5, and S5 have very little correlation with log_RAI.

Other graphics enable you to explore more of the features of the PLS model. For example, you can examine the X-scores versus the Y-scores to explore how partial least squares chooses successive factors. For a good PLS model, the first few factors show a high correlation between the X- and Y-scores. The correlation usually decreases from one factor to the next. When ODS Graphics is enabled, you can plot the X-scores versus the Y-scores by using the PLOT=XYSCORES option, as shown in the following statements.

proc pls data=pentaTrain nfac=4 plot=XYScores;
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;

The plot of the X-scores versus the Y-scores for the first four factors is shown in Output 94.1.4.

Output 94.1.4 X-Scores versus Y-Scores

For this example, Output 94.1.4 shows high correlation between X- and Y-scores for the first factor but somewhat lower correlation for the second factor and sharply diminishing correlation after that. This adds strength to the judgment that NFAC=2 is the right number of factors for these data and this model. Note that observation 13 is again extreme in the first two plots. This run might be overly influential for the PLS analysis; thus, you should check to make sure it is reliable.

As explained earlier, you can draw some inferences about the relationship between individual predictors and the dependent variable from the correlation loading plot. However, the regression coefficient profile and the variable importance plot give a more direct indication of which predictors are most useful for predicting the dependent variable. The regression coefficients represent the importance each predictor has in the prediction of just the response. The variable importance plot, on the other hand, represents the contribution of each predictor in fitting the PLS model for both predictors and response. It is based on the Variable Importance for Projection (VIP) statistic of Wold (1994), which summarizes the contribution a variable makes to the model. If a predictor has a relatively small coefficient (in absolute value) and a small value of VIP, then it is a prime candidate for deletion. Wold in Umetrics (1995) considers a value less than 0.8 to be “small” for the VIP. The following statements fit a two-factor PLS model and display these two additional plots.
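For reference, the VIP statistic is simple to compute from a fitted model's weights, scores, and response loadings. The following Python sketch (single response; my own notation, not PROC PLS code) follows Wold's definition, in which each predictor's squared weights are averaged over factors with the response variation each factor explains as the weighting:

```python
import numpy as np

def vip(W, T, q):
    """Variable Importance for Projection for a single response.
    W: (p, A) predictor weights, T: (n, A) X-scores, q: (A,) response
    loadings from an already-fitted PLS model with A factors."""
    p, A = W.shape
    # response variation explained by each factor
    ssy = np.array([(q[a] ** 2) * (T[:, a] @ T[:, a]) for a in range(A)])
    Wn = W / np.linalg.norm(W, axis=0)    # normalize weights factor by factor
    return np.sqrt(p * ((Wn ** 2) @ ssy) / ssy.sum())
```

Because the squared VIP values average to 1 across predictors, the rule of thumb quoted above flags predictors with VIP below about 0.8.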

proc pls data=pentaTrain nfac=2 plot=(ParmProfiles VIP);
   model log_RAI = S1-S5 L1-L5 P1-P5;
run;

ods graphics off;

The additional graphics are shown in Output 94.1.5 and Output 94.1.6.

Output 94.1.5 Variable Importance Plots


Output 94.1.6 Regression Parameter Profile

In these two plots, the variables L1, L2, P2, S5, L5, and P5 have small absolute coefficients and small VIP. Looking back at the correlation loading plot in Output 94.1.3, you can see that these variables tend to be the ones near zero for both PLS factors. You should consider dropping these variables from the model.

Example 94.2: Examining Outliers

This example is a continuation of Example 94.1.

Standard diagnostics for statistical models focus on the response, allowing you to look for patterns that indicate the model is inadequate or for outliers that do not seem to follow the trend of the rest of the data. However, partial least squares effectively models the predictors as well as the responses, so you should consider the pattern of the fit for both. The DModX and DModY statistics give the distance from each point to the PLS model with respect to the predictors and the responses, respectively, and ODS Graphics enables you to plot these values. No point should be dramatically farther from the model than the rest. If there is a group of points that are all farther from the model than the rest, they might have something in common, in which case they should be analyzed separately.
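The distance-to-model idea is straightforward: after the extracted factors are removed, each observation's remaining residual is summarized by its size. The following Python fragment is an illustrative sketch only (PROC PLS additionally applies normalizing constants so that distances are comparable across models):

```python
import numpy as np

def dmod(residuals):
    """Distance of each observation to the model: root mean squared
    residual across the columns (predictor residuals for the DModX
    idea, response residuals for DModY)."""
    R = np.asarray(residuals, dtype=float)
    return np.sqrt((R ** 2).sum(axis=1) / R.shape[1])
```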


The following statements fit a reduced model to the data discussed in Example 94.1 and plot a panel of standard diagnostics as well as the distances of the observations to the model.

ods graphics on;

proc pls data=pentaTrain nfac=2 plot=(diagnostics dmod);
   model log_RAI = S1 P1
                   S2
                   S3 L3 P3
                   S4 L4;
run;

The plots are shown in Output 94.2.1 and Output 94.2.2.

Output 94.2.1 Model Fit Diagnostics


Output 94.2.2 Predictor versus Response Distances to the Model

There appear to be no profound outliers in either the predictor space or the response space.

Example 94.3: Choosing a PLS Model by Test Set Validation

This example demonstrates issues in spectrometric calibration. The data (Umetrics 1995) consist of spectrographic readings on 33 samples containing known concentrations of two amino acids, tyrosine and tryptophan. The spectra are measured at 30 frequencies across the overall range of frequencies. For example, Output 94.3.1 shows the observed spectra for three samples, one with only tryptophan, one with only tyrosine, and one with a mixture of the two, all at a total concentration of 10^-6.


Output 94.3.1 Spectra for Three Samples of Tyrosine and Tryptophan

Of the 33 samples, 18 are used as a training set and 15 as a test set. The data originally appear in McAvoy et al. (1989).

These data were created in a lab, with the concentrations fixed in order to provide a wide range of applicability for the model. You want to use a linear function of the logarithms of the spectra to predict the logarithms of tyrosine and tryptophan concentration, as well as the logarithm of the total concentration. Actually, because of the possibility of zeros in both the responses and the predictors, slightly different transformations are used. The following statements create SAS data sets containing the training and test data, named ftrain and ftest, respectively.

data ftrain;
   input obsnam $ tot tyr f1-f30 @@;
   try = tot - tyr;
   if (tyr) then tyr_log = log10(tyr); else tyr_log = -8;
   if (try) then try_log = log10(try); else try_log = -8;
   tot_log = log10(tot);
   datalines;
17mix35 0.00003 0
-6.215 -5.809 -5.114 -3.963 -2.897 -2.269 -1.675 -1.235
-0.900 -0.659 -0.497 -0.395 -0.335 -0.315 -0.333 -0.377
-0.453 -0.549 -0.658 -0.797 -0.878 -0.954 -1.060 -1.266
-1.520 -1.804 -2.044 -2.269 -2.496 -2.714
19mix35 0.00003 3E-7
-5.516 -5.294 -4.823 -3.858 -2.827 -2.249 -1.683 -1.218
-0.907 -0.658 -0.501 -0.400 -0.345 -0.323 -0.342 -0.387
-0.461 -0.554 -0.665 -0.803 -0.887 -0.960 -1.072 -1.272
-1.541 -1.814 -2.058 -2.289 -2.496 -2.712
21mix35 0.00003 7.5E-7
-5.519 -5.294 -4.501 -3.863 -2.827 -2.280 -1.716 -1.262
-0.939 -0.694 -0.536 -0.444 -0.384 -0.369 -0.377 -0.421
-0.495 -0.596 -0.706 -0.824 -0.917 -0.988 -1.103 -1.294
-1.565 -1.841 -2.084 -2.320 -2.521 -2.729
23mix35 0.00003 1.5E-6

... more lines ...

mix6 0.0001 0.00009
-1.140 -0.757 -0.497 -0.362 -0.329 -0.412 -0.513 -0.647
-0.772 -0.877 -0.958 -1.040 -1.104 -1.162 -1.233 -1.317
-1.425 -1.543 -1.661 -1.804 -1.877 -1.959 -2.034 -2.249
-2.502 -2.732 -2.964 -3.142 -3.313 -3.576
;

data ftest;
   input obsnam $ tot tyr f1-f30 @@;
   try = tot - tyr;
   if (tyr) then tyr_log = log10(tyr); else tyr_log = -8;
   if (try) then try_log = log10(try); else try_log = -8;
   tot_log = log10(tot);
   datalines;
43trp6 1E-6 0
-5.915 -5.918 -6.908 -5.428 -4.117 -5.103 -4.660 -4.351
-4.023 -3.849 -3.634 -3.634 -3.572 -3.513 -3.634 -3.572
-3.772 -3.772 -3.844 -3.932 -4.017 -4.023 -4.117 -4.227
-4.492 -4.660 -4.855 -5.428 -5.103 -5.428
59mix6 1E-6 1E-7
-5.903 -5.903 -5.903 -5.082 -4.213 -5.083 -4.838 -4.639
-4.474 -4.213 -4.001 -4.098 -4.001 -4.001 -3.907 -4.001
-4.098 -4.098 -4.206 -4.098 -4.213 -4.213 -4.335 -4.474
-4.639 -4.838 -4.837 -5.085 -5.410 -5.410
51mix6 1E-6 2.5E-7
-5.907 -5.907 -5.415 -4.843 -4.213 -4.843 -4.843 -4.483
-4.343 -4.006 -4.006 -3.912 -3.830 -3.830 -3.755 -3.912
-4.006 -4.001 -4.213 -4.213 -4.335 -4.483 -4.483 -4.642
-4.841 -5.088 -5.088 -5.415 -5.415 -5.415
49mix6 1E-6 5E-7

... more lines ...

tyro2 0.0001 0.0001
-1.081 -0.710 -0.470 -0.337 -0.327 -0.433 -0.602 -0.841
-1.119 -1.423 -1.750 -2.121 -2.449 -2.818 -3.110 -3.467
-3.781 -4.029 -4.241 -4.366 -4.501 -4.366 -4.501 -4.501
-4.668 -4.668 -4.865 -4.865 -5.109 -5.111
;

The following statements fit a PLS model with 10 factors.

proc pls data=ftrain nfac=10;
   model tot_log tyr_log try_log = f1-f30;
run;

The table shown in Output 94.3.2 indicates that only three or four factors are required to explain almost all of the variation in both the predictors and the responses.

Output 94.3.2 Amount of Training Set Variation Explained

The PLS Procedure

Percent Variation Accounted for by Partial Least Squares Factors

Number of                 Model Effects          Dependent Variables
Extracted Factors       Current      Total       Current      Total

1 81.1654 81.1654 48.3385 48.3385

2 16.8113 97.9768 32.5465 80.8851

3 1.7639 99.7407 11.4438 92.3289

4 0.1951 99.9357 3.8363 96.1652

5 0.0276 99.9634 1.6880 97.8532

6 0.0132 99.9765 0.7247 98.5779

7 0.0052 99.9817 0.2926 98.8705

8 0.0053 99.9870 0.1252 98.9956

9 0.0049 99.9918 0.1067 99.1023

10 0.0034 99.9952 0.1684 99.2707

In order to choose the optimal number of PLS factors, you can explore how well models based on the training data with different numbers of factors fit the test data. To do so, use the CV=TESTSET option, with an argument pointing to the test data set ftest. The following statements also employ the ODS Graphics features in PROC PLS to display the cross validation results in a plot.

ods graphics on;

proc pls data=ftrain nfac=10 cv=testset(ftest)
         cvtest(stat=press seed=12345);
   model tot_log tyr_log try_log = f1-f30;
run;

The tabular results of the test set validation are shown in Output 94.3.3, and the graphical results are shown in Output 94.3.4. They indicate that, although five PLS factors give the minimum predicted residual sum of squares, the residuals for four factors are insignificantly different from those for five. Thus, the smaller model is preferred.
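The criterion tabulated for each candidate number of factors is simple to state in code. This Python fragment is an illustrative sketch (PROC PLS pools the residuals across all responses) of the root mean PRESS for a set of test set predictions:

```python
import numpy as np

def root_mean_press(y_true, y_pred):
    """Square root of the mean of the squared prediction errors
    (PRESS divided by the number of test residuals)."""
    r = np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean(r ** 2)))
```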


Output 94.3.3 Test Set Validation for the Number of PLS Factors

The PLS Procedure

Test Set Validation for the Number of Extracted Factors

Number of               Root Mean
Extracted Factors       PRESS          Prob > PRESS

0 3.056797 <.0001

1 2.630561 <.0001

2 1.00706 0.0070

3 0.664603 0.0020

4 0.521578 0.3800

5 0.500034 1.0000

6 0.513561 0.5100

7 0.501431 0.6870

8 1.055791 0.1530

9 1.435085 0.1010

10 1.720389 0.0320

Minimum root mean PRESS 0.5000

Minimizing number of factors 5

Smallest number of factors with p > 0.1 4

Percent Variation Accounted for by Partial Least Squares Factors

Number of                 Model Effects          Dependent Variables
Extracted Factors       Current      Total       Current      Total

1 81.1654 81.1654 48.3385 48.3385

2 16.8113 97.9768 32.5465 80.8851

3 1.7639 99.7407 11.4438 92.3289

4 0.1951 99.9357 3.8363 96.1652


Output 94.3.4 Test Set Validation Plot

The factor loadings show how the PLS factors are constructed from the centered and scaled predictors. For spectral calibration, it is useful to plot the loadings against the frequency. In many cases, the physical meanings that can be attached to factor loadings help to validate the scientific interpretation of the PLS model. You can use ODS Graphics with PROC PLS to plot the loadings for the four PLS factors against frequency, as shown in the following statements.

proc pls data=ftrain nfac=4 plot=XLoadingProfiles;
   model tot_log tyr_log try_log = f1-f30;
run;

ods graphics off;

The resulting plot is shown in Output 94.3.5.


Output 94.3.5 Predictor Loadings across Frequencies

Notice that all four factors handle frequencies below and above about 7 or 8 differently. For example, the first factor is very nearly a simple contrast between the averages of the two sets of frequencies, and the second factor appears to be approximately a weighted sum of only the frequencies in the first set.

Example 94.4: Partial Least Squares Spline Smoothing

The EFFECT statement makes it easy to construct a wide variety of linear models. In particular, you can use the spline effect to add smoothing terms to a model. A particular benefit of using spline effects in PROC PLS is that, when operating on spline basis functions, the partial least squares algorithm effectively chooses the amount of smoothing automatically, especially if you combine it with cross validation for selecting the number of factors. This example employs the EFFECT statement to demonstrate partial least squares spline smoothing of agricultural data.

Weibe (1935) presents data from a study of uniformity of wheat yields over a certain rectangular plot of land. The following statements read these wheat yield measurements, indexed by row and column distances, into the SAS data set Wheat:


data Wheat;
   keep Row Column Yield;
   input Yield @@;
   iRow = int((_N_-1)/12);
   iCol = mod( _N_-1 ,12);
   Column = iCol*15 + 1;   /* Column distance, in feet */
   Row    = iRow* 1 + 1;   /* Row distance, in feet    */
   Row    = 125 - Row + 1; /* Invert rows              */
   datalines;
715 595 580 580 615 610 540 515 557 665 560 612
770 710 655 675 700 690 565 585 550 574 511 618
760 715 690 690 655 725 665 640 665 705 644 705
665 615 685 555 585 630 550 520 553 616 573 570
755 730 670 580 545 620 580 525 495 565 599 612
745 670 585 560 550 710 590 545 538 587 600 664
645 690 550 520 450 630 535 505 530 536 611 578
585 495 455 470 445 555 500 450 420 461 531 559

... more lines ...

570 585 635 765 550 675 765 620 608 705 677 660
505 500 580 655 470 565 570 555 537 585 589 619
465 430 510 680 460 600 670 615 620 594 616 784
;

The following statements use the PLS procedure to smooth these wheat yields using two spline effects, one for rows and another for columns, in addition to their crossproduct. Each spline effect has, by default, seven basis columns; thus their crossproduct has 49 = 7² columns, for a total of 63 parameters in the full linear model. However, the predictive PLS model does not actually need to have 63 degrees of freedom. Rather, the degree of smoothing is controlled by the number of PLS factors, which in this case is chosen automatically by random subset validation with the CV=RANDOM option.

ods graphics on;

proc pls data=Wheat cv=random(seed=1) cvtest(seed=12345)
         plot(only)=contourfit(obs=gradient);
   effect splCol = spline(Column);
   effect splRow = spline(Row   );
   model Yield = splCol|splRow;
run;

These statements produce the output shown in Output 94.4.1 through Output 94.4.4.
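The way the number of extracted factors controls model complexity can be sketched outside SAS. The following Python code is a minimal, illustrative NIPALS loop for a single response (PLS1); it is not the production algorithm in PROC PLS, and the synthetic data are assumptions made purely for demonstration.

```python
import numpy as np

def nipals_pls1(X, y, n_factors):
    """Minimal NIPALS for one response (PLS1): each pass extracts a
    score t = Xw, regresses X and y on t, and deflates. The number of
    factors, not the number of columns in X, sets the effective model
    complexity -- which is why PLS can smooth a wide spline basis."""
    X = X - X.mean(axis=0)            # PROC PLS also centers by default
    y = y - y.mean()
    fitted = np.zeros_like(y)
    for _ in range(n_factors):
        w = X.T @ y
        w /= np.linalg.norm(w)        # weight vector
        t = X @ w                     # extracted factor (score)
        p = X.T @ t / (t @ t)         # X loading
        q = (y @ t) / (t @ t)         # y loading
        X = X - np.outer(t, p)        # deflate predictors
        y = y - t * q                 # deflate response
        fitted += t * q
    return fitted, y                  # fitted values, residual

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 63))        # 63 columns, like the spline basis
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=100)
_, r2 = nipals_pls1(X, y, 2)
_, r8 = nipals_pls1(X, y, 8)
print(np.sum(r8**2) < np.sum(r2**2))  # True: more factors, closer fit
```

Because each factor removes the projection of the response on its score, the training residual shrinks monotonically with the number of factors; cross validation is what keeps that flexibility from turning into overfitting.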


Output 94.4.1 Default Spline Basis: Model and Data Information

The PLS Procedure

Data Set                        WORK.WHEAT
Factor Extraction Method        Partial Least Squares
PLS Algorithm                   NIPALS
Number of Response Variables    1
Number of Predictor Parameters  63
Missing Value Handling          Exclude
Maximum Number of Factors       15
Validation Method               10-fold Random Subset Validation
Random Subset Seed              1
Validation Testing Criterion    Prob T**2 > 0.1
Number of Random Permutations   1000
Random Permutation Seed         12345
Number of Observations Read     1500
Number of Observations Used     1500

Output 94.4.2 Default Spline Basis: Random Subset Validated PRESS Statistics for Number of Factors

Random Subset Validation for the Number of Extracted Factors

Number of
Extracted    Root Mean
Factors      PRESS         T**2         Prob > T**2
 0           1.066355      251.8793     <.0001
 1           0.826177      123.8161     <.0001
 2           0.745877       61.6035     <.0001
 3           0.725181       44.99644    <.0001
 4           0.701464       23.20199    <.0001
 5           0.687164        8.369711    0.0030
 6           0.683917        8.775847    0.0010
 7           0.677969        2.907019    0.0830
 8           0.676423        2.190871    0.1340
 9           0.676966        3.191284    0.0600
10           0.675026        1.334638    0.2390
11           0.673906        0.556455    0.4470
12           0.673653        1.257292    0.2790
13           0.672669        0           1.0000
14           0.673596        2.386014    0.1190
15           0.672828        0.02962     0.8820

Minimum root mean PRESS                    0.6727
Minimizing number of factors               13
Smallest number of factors with p > 0.1    8


Output 94.4.3 Default Spline Basis: PLS Variation Summary for Split-Sample Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of
Extracted    Model Effects           Dependent Variables
Factors      Current      Total      Current      Total
1            11.5269      11.5269    40.2471      40.2471
2             7.2314      18.7583    10.4908      50.7379
3             6.9147      25.6730     2.6523      53.3902
4             3.8433      29.5163     2.8806      56.2708
5             6.4795      35.9958     1.3197      57.5905
6             7.6201      43.6159     1.1700      58.7605
7             7.3214      50.9373     0.7186      59.4790
8             4.8363      55.7736     0.4548      59.9339

Output 94.4.4 Default Spline Basis: Smoothed Yield

The cross validation results in Output 94.4.2 point to a model with eight PLS factors; this is the smallest model whose predicted residual sum of squares (PRESS) is insignificantly different from the model with the absolute minimum PRESS. The variation summary in Output 94.4.3 shows that this model accounts for about 60% of the variation in the Yield values. The OBS=GRADIENT suboption for the PLOT=CONTOURFIT option specifies that the observations in the resulting plot, Output 94.4.4, be colored according to the same scheme as the surface of predicted yield. This coloration enables you to easily tell which observations are above the surface of predicted yield and which are below.
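The selection rule that PROC PLS reports ("smallest number of factors with p > 0.1") can be paraphrased in a few lines of Python. The p-values below are transcribed from Output 94.4.2 (with "<.0001" rounded to 0.0001); the function name is illustrative, and the underlying p-values come from van der Voet's randomization test comparing each model's PRESS to the minimum-PRESS model.

```python
# p-values from Output 94.4.2, indexed by number of extracted factors
p_values = [0.0001, 0.0001, 0.0001, 0.0001, 0.0001, 0.0030, 0.0010,
            0.0830, 0.1340, 0.0600, 0.2390, 0.4470, 0.2790, 1.0000,
            0.1190, 0.8820]

def select_n_factors(p_values, alpha=0.1):
    """Smallest number of factors whose PRESS is not significantly
    worse than the minimum-PRESS model (p > alpha) -- the rule that
    PROC PLS reports under CVTEST."""
    for n, p in enumerate(p_values):
        if p > alpha:
            return n
    return len(p_values) - 1          # fall back to the largest model

print(select_n_factors(p_values))     # 8, matching Output 94.4.2
```

Applied to the leading p-values in Output 94.4.6 (0.0001, 0.0001, 0.7740, ...), the same rule returns 2, matching the result reported there.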

The surface of predicted yield is somewhat smoother than what Weibe (1935) settled on originally, with a predominance of simple, elliptically shaped contours. You can easily specify a potentially more granular model by increasing the number of knots in the spline bases. Even though the more granular model increases the number of predictor parameters, cross validation can still protect you from overfitting the data. The following statements are the same as those shown before, except that the spline effects now have twice as many basis functions:

ods graphics on;

proc pls data=Wheat cv=random(seed=1) cvtest(seed=12345)
         plot(only)=contourfit(obs=gradient);
   effect splCol = spline(Column / knotmethod=equal(14));
   effect splRow = spline(Row    / knotmethod=equal(14));
   model Yield = splCol|splRow;
run;

The resulting output is shown in Output 94.4.5 through Output 94.4.8.
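As a rough check on the parameter count, a cubic B-spline basis has (number of knots + degree + 1) columns, so KNOTMETHOD=EQUAL(14) gives 18 columns per spline effect and 18 + 18 + 18² = 360 parameters in all. The following Python sketch reproduces this arithmetic together with one common convention for equally spaced interior knots; the helper name and the assumption that the knots fall strictly inside the data range are illustrative, not a statement of exactly how SAS places them.

```python
import numpy as np

def equal_interior_knots(lo, hi, n_knots):
    # n_knots equally spaced knots strictly between lo and hi,
    # analogous in spirit to KNOTMETHOD=EQUAL(n) (illustrative).
    return np.linspace(lo, hi, n_knots + 2)[1:-1]

degree = 3                        # cubic splines
cols = 14 + degree + 1            # 18 basis columns per spline effect
total = cols + cols + cols**2     # splCol + splRow + crossproduct
print(total)                      # 360, matching Output 94.4.5

knots = equal_interior_knots(1, 166, 14)   # column-distance range, say
print(len(knots))                 # 14
```

The same arithmetic explains the earlier model: the default three knots give 3 + 3 + 1 = 7 columns per effect, and 7 + 7 + 49 = 63 parameters.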

Output 94.4.5 More Granular Spline Basis: Model and Data Information

The PLS Procedure

Data Set                        WORK.WHEAT
Factor Extraction Method        Partial Least Squares
PLS Algorithm                   NIPALS
Number of Response Variables    1
Number of Predictor Parameters  360
Missing Value Handling          Exclude
Maximum Number of Factors       15
Validation Method               10-fold Random Subset Validation
Random Subset Seed              1
Validation Testing Criterion    Prob T**2 > 0.1
Number of Random Permutations   1000
Random Permutation Seed         12345
Number of Observations Read     1500
Number of Observations Used     1500


Output 94.4.6 More Granular Spline Basis: Random Subset Validated PRESS Statistics for Number of Factors

Random Subset Validation for the Number of Extracted Factors

Number of
Extracted    Root Mean
Factors      PRESS         T**2         Prob > T**2
 0           1.066355      247.9268     <.0001
 1           0.652658       20.68858    <.0001
 2           0.615087        0.074822    0.7740
 3           0.614128        0           1.0000
 4           0.615268        0.197678    0.6490
 5           0.618001        1.372038    0.2340
 6           0.622949        5.035504    0.0180
 7           0.626482        7.296797    0.0080
 8           0.633316       13.66045    <.0001
 9           0.635239       16.16922    <.0001
10           0.636938       18.02295    <.0001
11           0.636494       16.9881     <.0001
12           0.63682        16.83341    <.0001
13           0.637719       16.74157    <.0001
14           0.637627       15.79342    <.0001
15           0.638431       16.12327    <.0001

Minimum root mean PRESS                    0.6141
Minimizing number of factors               3
Smallest number of factors with p > 0.1    2

Output 94.4.7 More Granular Spline Basis: PLS Variation Summary for Split-Sample Validated Model

Percent Variation Accounted for by Partial Least Squares Factors

Number of
Extracted    Model Effects          Dependent Variables
Factors      Current     Total      Current      Total
1            1.7967      1.7967     64.7792      64.7792
2            1.3719      3.1687      6.3163      71.0955


Output 94.4.8 More Granular Spline Basis: Smoothed Yield

Output 94.4.5 shows that the model now has 360 parameters, many more than before. In Output 94.4.6 you can see that with more granular spline effects, fewer PLS factors are required; only two, in fact. However, Output 94.4.7 shows that this model now accounts for over 70% of the variation in the Yield values, and the contours of predicted values in Output 94.4.8 are less inclined to be simple elliptical shapes.

References

De Jong, S. (1993). “SIMPLS: An Alternative Approach to Partial Least Squares Regression.” Chemometrics and Intelligent Laboratory Systems 18:251–263.

De Jong, S., and Kiers, H. (1992). “Principal Covariates Regression.” Chemometrics and Intelligent Laboratory Systems 14:155–164.

Dijkstra, T. K. (1983). “Some Comments on Maximum Likelihood and Partial Least Squares Methods.” Journal of Econometrics 22:67–90.


Dijkstra, T. K. (1985). Latent Variables in Linear Stochastic Models: Reflections on Maximum Likelihood and Partial Least Squares Methods. 2nd ed. Amsterdam: Sociometric Research Foundation.

Frank, I., and Friedman, J. (1993). “A Statistical View of Some Chemometrics Regression Tools.” Technometrics 35:109–135.

Geladi, P., and Kowalski, B. (1986). “Partial Least-Squares Regression: A Tutorial.” Analytica Chimica Acta 185:1–17.

Haykin, S. (1994). Neural Networks: A Comprehensive Foundation. New York: Macmillan.

Helland, I. S. (1988). “On the Structure of Partial Least Squares Regression.” Communications in Statistics—Simulation and Computation 17:581–607.

Hoerl, A., and Kennard, R. (1970). “Ridge Regression: Biased Estimation for Non-orthogonal Problems.”Technometrics 12:55–67.

Lindberg, W., Persson, J.-A., and Wold, S. (1983). “Partial Least-Squares Method for Spectrofluorimetric Analysis of Mixtures of Humic Acid and Ligninsulfonate.” Analytical Chemistry 55:643–648.

McAvoy, T. J., Wang, N. S., Naidu, S., Bhat, N., Gunter, J., and Simmons, M. (1989). “Interpreting Biosensor Data via Backpropagation.” International Joint Conference on Neural Networks 1:227–233.

Naes, T., and Martens, H. (1985). “Comparison of Prediction Methods for Multicollinear Data.” Communications in Statistics—Simulation and Computation 14:545–576.

Ränner, S., Lindgren, F., Geladi, P., and Wold, S. (1994). “A PLS Kernel Algorithm for Data Sets with Many Variables and Fewer Objects.” Journal of Chemometrics 8:111–125.

Sarle, W. S. (1994). “Neural Networks and Statistical Models.” In Proceedings of the Nineteenth Annual SAS Users Group International Conference, 1538–1550. Cary, NC: SAS Institute Inc. https://support.sas.com/resources/papers/proceedings-archive/SUGI94/Sugi-94-255%20Sarle.pdf.

Shao, J. (1993). “Linear Model Selection by Cross-Validation.” Journal of the American Statistical Association 88:486–494.

Tobias, R. D. (1995). “An Introduction to Partial Least Squares Regression.” In Proceedings of the Twentieth Annual SAS Users Group International Conference, 1250–1257. Cary, NC: SAS Institute Inc. http://support.sas.com/rnd/app/stat/papers/pls.pdf.

Ufkes, J. G. R., Visser, B. J., Heuver, G., and van der Meer, C. (1978). “Structure-Activity Relationships of Bradykinin Potentiating Peptides.” European Journal of Pharmacology 50:119–122.

Ufkes, J. G. R., Visser, B. J., Heuver, G., Wynne, H. J., and van der Meer, C. (1982). “Further Studies on the Structure-Activity Relationships of Bradykinin-Potentiating Peptides.” European Journal of Pharmacology 79:155–158.

Umetrics (1995). Multivariate Analysis. Three-day course. Winchester, MA: Umetrics.

Van den Wollenberg, A. L. (1977). “Redundancy Analysis: An Alternative to Canonical Correlation Analysis.” Psychometrika 42:207–219.


Van der Voet, H. (1994). “Comparing the Predictive Accuracy of Models Using a Simple Randomization Test.” Chemometrics and Intelligent Laboratory Systems 25:313–323.

Weibe, G. A. (1935). “Variation and Correlation in Grain Yield among 1,500 Wheat Nursery Plots.” Journal of Agricultural Research 50:331–354.

Wold, H. (1966). “Estimation of Principal Components and Related Models by Iterative Least Squares.” In Multivariate Analysis, edited by P. R. Krishnaiah, 391–420. New York: Academic Press.

Wold, S. (1994). “PLS for Multivariate Linear Modeling.” In QSAR: Chemometric Methods in Molecular Design; Methods and Principles in Medicinal Chemistry, edited by H. van de Waterbeemd, 195–218. Weinheim, Germany: Verlag-Chemie.


Subject Index

components
    PLS procedure, 7654

constructed effects
    PLS procedure, 7672

cross validation
    PLS procedure, 7665, 7680

factors
    PLS procedure, 7654

latent variables
    PLS procedure, 7654

latent vectors
    PLS procedure, 7654

ODS examples
    PLS procedure, 7700

ODS graph names
    PLS procedure, 7684

options summary
    EFFECT statement, 7672

partial least squares, 7654, 7675

PLS procedure
    algorithms, 7666
    centering, 7681
    compared to other procedures, 7654
    components, 7654
    computation method, 7666
    constructed effects, 7672
    cross validation, 7654, 7680
    cross validation method, 7665
    examples, 7686
    factors, 7654
    factors, selecting the number of, 7657
    introductory example, 7655
    latent variables, 7654
    latent vectors, 7654
    missing values, 7666
    ODS graph names, 7684
    ODS table names, 7683
    outlier detection, 7693
    output data sets, 7674
    output keywords, 7674
    partial least squares regression, 7654, 7675
    predicting new observations, 7660
    principal components regression, 7654, 7676
    reduced rank regression, 7654, 7677
    scaling, 7681
    SIMPLS method, 7676
    spline smoothing, 7701
    test set validation, 7680, 7695

principal components
    regression (PLS), 7654, 7676

reduced rank regression, 7654
    PLS procedure, 7677

regression
    partial least squares (PROC PLS), 7654, 7675
    principal components (PROC PLS), 7654, 7676
    reduced rank (PROC PLS), 7654, 7677

SIMPLS method
    PLS procedure, 7676

test set validation
    PLS procedure, 7680

variable importance for projection, 7692
VIP, 7692


Syntax Index

ALGORITHM= option
    PROC PLS statement, 7666

BY statement
    PLS procedure, 7671

CENSCALE option
    PROC PLS statement, 7665

CLASS statement
    PLS procedure, 7671

CV= option
    PROC PLS statement, 7665

CVTEST= option
    PROC PLS statement, 7665

DATA= option
    PROC PLS statement, 7666

DETAILS option
    PROC PLS statement, 7666

EFFECT statement
    PLS procedure, 7672

EPSILON= option
    PROC PLS statement, METHOD=PLS option, 7666
    PROC PLS statement, MISSING=EM option, 7667

ID statement
    PLS procedure, 7674

INTERCEPT option
    MODEL statement (PLS), 7674

MAXITER= option
    PROC PLS statement, METHOD=PLS option, 7666
    PROC PLS statement, MISSING=EM option, 7666

METHOD= option
    PROC PLS statement, 7666

MISSING= option
    PROC PLS statement, 7666

MODEL statement
    PLS procedure, 7674

NFAC= option
    PROC PLS statement, 7667

NITER= option
    PROC PLS statement, 7665

NOCENTER option
    PROC PLS statement, 7667

NOCVSTDIZE option
    PROC PLS statement, 7667

NOPRINT option
    PROC PLS statement, 7667

NOSCALE option
    PROC PLS statement, 7667, 7671

NTEST= option
    PROC PLS statement, 7665

OUTPUT statement
    PLS procedure, 7674

PLOTS= option
    PROC PLS statement, 7667

PLS procedure
    syntax, 7664

PLS procedure, BY statement, 7671
PLS procedure, CLASS statement, 7671
    TRUNCATE option, 7672
PLS procedure, EFFECT statement, 7672
PLS procedure, ID statement, 7674
PLS procedure, MODEL statement, 7674
    INTERCEPT option, 7674
    SOLUTION option, 7674
PLS procedure, OUTPUT statement, 7674
PLS procedure, PROC PLS statement, 7664
    ALGORITHM= option, 7666
    CENSCALE option, 7665
    CV= option, 7665
    CVTEST= option, 7665
    DATA= option, 7666
    DETAILS option, 7666
    METHOD= option, 7666
    MISSING= option, 7666
    NFAC= option, 7667
    NITER= option, 7665
    NOCENTER option, 7667
    NOCVSTDIZE option, 7667
    NOPRINT option, 7667
    NOSCALE option, 7667, 7671
    NTEST= option, 7665
    PLOTS= option, 7667
    PVAL= option, 7665
    SEED= option, 7665, 7666
    STAT= option, 7665
    VARSCALE option, 7671
PLS procedure, PROC PLS statement, METHOD=PLS option
    EPSILON= option, 7666
    MAXITER= option, 7666
PLS procedure, PROC PLS statement, MISSING=EM option
    EPSILON= option, 7667
    MAXITER= option, 7666

PROC PLS statement, see PLS procedure

PVAL= option
    PROC PLS statement, 7665

SEED= option
    PROC PLS statement, 7665, 7666

SOLUTION option
    MODEL statement (PLS), 7674

STAT= option
    PROC PLS statement, 7665

TRUNCATE option
    CLASS statement (PLS), 7672

VARSCALE option
    PROC PLS statement, 7671

