A general class of powerful and flexible modeling techniques, spline smoothing has attracted a great deal of research attention in recent years and has been widely used in many application areas, from medicine to economics. Smoothing Splines: Methods and Applications covers basic smoothing spline models, including polynomial, periodic, spherical, thin-plate, L-, and partial splines, as well as more advanced models, such as smoothing spline ANOVA, extended and generalized smoothing spline ANOVA, vector spline, nonparametric nonlinear regression, semiparametric regression, and semiparametric mixed-effects models. It also presents methods for model selection and inference.
The book provides unified frameworks for estimation, inference, and software implementation by using the general forms of nonparametric/semiparametric, linear/nonlinear, and fixed/mixed smoothing spline models. The theory of reproducing kernel Hilbert space (RKHS) is used to present various smoothing spline models in a unified fashion. Although this approach can be technical and difficult, the author makes the advanced smoothing spline methodology based on RKHS accessible to practitioners and students. He offers a gentle introduction to RKHS, keeps theory at a minimum level, and explains how RKHS can be used to construct spline models.
Smoothing Splines offers a balanced mix of methodology, computation, implementation, software, and applications. It uses R to perform all data analyses and includes a host of real data examples from astronomy, economics, medicine, and meteorology. The code for all examples, along with related developments, can be found on the book’s web page.
Smoothing Splines: Methods and Applications
Yuedong Wang
Monographs on Statistics and Applied Probability 121
MONOGRAPHS ON STATISTICS AND APPLIED PROBABILITY
General Editors
F. Bunea, V. Isham, N. Keiding, T. Louis, R. L. Smith, and H. Tong
1 Stochastic Population Models in Ecology and Epidemiology M.S. Bartlett (1960)
2 Queues D.R. Cox and W.L. Smith (1961)
3 Monte Carlo Methods J.M. Hammersley and D.C. Handscomb (1964)
4 The Statistical Analysis of Series of Events D.R. Cox and P.A.W. Lewis (1966)
5 Population Genetics W.J. Ewens (1969)
6 Probability, Statistics and Time M.S. Bartlett (1975)
7 Statistical Inference S.D. Silvey (1975)
8 The Analysis of Contingency Tables B.S. Everitt (1977)
9 Multivariate Analysis in Behavioural Research A.E. Maxwell (1977)
10 Stochastic Abundance Models S. Engen (1978)
11 Some Basic Theory for Statistical Inference E.J.G. Pitman (1979)
12 Point Processes D.R. Cox and V. Isham (1980)
13 Identification of Outliers D.M. Hawkins (1980)
14 Optimal Design S.D. Silvey (1980)
15 Finite Mixture Distributions B.S. Everitt and D.J. Hand (1981)
16 Classification A.D. Gordon (1981)
17 Distribution-Free Statistical Methods, 2nd edition J.S. Maritz (1995)
18 Residuals and Influence in Regression R.D. Cook and S. Weisberg (1982)
19 Applications of Queueing Theory, 2nd edition G.F. Newell (1982)
20 Risk Theory, 3rd edition R.E. Beard, T. Pentikäinen and E. Pesonen (1984)
21 Analysis of Survival Data D.R. Cox and D. Oakes (1984)
22 An Introduction to Latent Variable Models B.S. Everitt (1984)
23 Bandit Problems D.A. Berry and B. Fristedt (1985)
24 Stochastic Modelling and Control M.H.A. Davis and R. Vinter (1985)
25 The Statistical Analysis of Composition Data J. Aitchison (1986)
26 Density Estimation for Statistics and Data Analysis B.W. Silverman (1986)
27 Regression Analysis with Applications G.B. Wetherill (1986)
28 Sequential Methods in Statistics, 3rd edition G.B. Wetherill and K.D. Glazebrook (1986)
29 Tensor Methods in Statistics P. McCullagh (1987)
30 Transformation and Weighting in Regression R.J. Carroll and D. Ruppert (1988)
31 Asymptotic Techniques for Use in Statistics O.E. Barndorff-Nielsen and D.R. Cox (1989)
32 Analysis of Binary Data, 2nd edition D.R. Cox and E.J. Snell (1989)
33 Analysis of Infectious Disease Data N.G. Becker (1989)
34 Design and Analysis of Cross-Over Trials B. Jones and M.G. Kenward (1989)
35 Empirical Bayes Methods, 2nd edition J.S. Maritz and T. Lwin (1989)
36 Symmetric Multivariate and Related Distributions K.T. Fang, S. Kotz and K.W. Ng (1990)
37 Generalized Linear Models, 2nd edition P. McCullagh and J.A. Nelder (1989)
38 Cyclic and Computer Generated Designs, 2nd edition J.A. John and E.R. Williams (1995)
39 Analog Estimation Methods in Econometrics C.F. Manski (1988)
40 Subset Selection in Regression A.J. Miller (1990)
41 Analysis of Repeated Measures M.J. Crowder and D.J. Hand (1990)
42 Statistical Reasoning with Imprecise Probabilities P. Walley (1991)
43 Generalized Additive Models T.J. Hastie and R.J. Tibshirani (1990)
44 Inspection Errors for Attributes in Quality Control N.L. Johnson, S. Kotz and X. Wu (1991)
45 The Analysis of Contingency Tables, 2nd edition B.S. Everitt (1992)
46 The Analysis of Quantal Response Data B.J.T. Morgan (1992)
47 Longitudinal Data with Serial Correlation—A State-Space Approach R.H. Jones (1993)
48 Differential Geometry and Statistics M.K. Murray and J.W. Rice (1993)
49 Markov Models and Optimization M.H.A. Davis (1993)
50 Networks and Chaos—Statistical and Probabilistic Aspects O.E. Barndorff-Nielsen, J.L. Jensen and W.S. Kendall (1993)
51 Number-Theoretic Methods in Statistics K.-T. Fang and Y. Wang (1994)
52 Inference and Asymptotics O.E. Barndorff-Nielsen and D.R. Cox (1994)
53 Practical Risk Theory for Actuaries C.D. Daykin, T. Pentikäinen and M. Pesonen (1994)
54 Biplots J.C. Gower and D.J. Hand (1996)
55 Predictive Inference—An Introduction S. Geisser (1993)
56 Model-Free Curve Estimation M.E. Tarter and M.D. Lock (1993)
57 An Introduction to the Bootstrap B. Efron and R.J. Tibshirani (1993)
58 Nonparametric Regression and Generalized Linear Models P.J. Green and B.W. Silverman (1994)
59 Multidimensional Scaling T.F. Cox and M.A.A. Cox (1994)
60 Kernel Smoothing M.P. Wand and M.C. Jones (1995)
61 Statistics for Long Memory Processes J. Beran (1995)
62 Nonlinear Models for Repeated Measurement Data M. Davidian and D.M. Giltinan (1995)
63 Measurement Error in Nonlinear Models R.J. Carroll, D. Ruppert and L.A. Stefanski (1995)
64 Analyzing and Modeling Rank Data J.J. Marden (1995)
65 Time Series Models—In Econometrics, Finance and Other Fields D.R. Cox, D.V. Hinkley and O.E. Barndorff-Nielsen (1996)
66 Local Polynomial Modeling and its Applications J. Fan and I. Gijbels (1996)
67 Multivariate Dependencies—Models, Analysis and Interpretation D.R. Cox and N. Wermuth (1996)
68 Statistical Inference—Based on the Likelihood A. Azzalini (1996)
69 Bayes and Empirical Bayes Methods for Data Analysis B.P. Carlin and T.A. Louis (1996)
70 Hidden Markov and Other Models for Discrete-Valued Time Series I.L. MacDonald and W. Zucchini (1997)
71 Statistical Evidence—A Likelihood Paradigm R. Royall (1997)
72 Analysis of Incomplete Multivariate Data J.L. Schafer (1997)
73 Multivariate Models and Dependence Concepts H. Joe (1997)
74 Theory of Sample Surveys M.E. Thompson (1997)
75 Retrial Queues G. Falin and J.G.C. Templeton (1997)
76 Theory of Dispersion Models B. Jørgensen (1997)
77 Mixed Poisson Processes J. Grandell (1997)
78 Variance Components Estimation—Mixed Models, Methodologies and Applications P.S.R.S. Rao (1997)
79 Bayesian Methods for Finite Population Sampling G. Meeden and M. Ghosh (1997)
80 Stochastic Geometry—Likelihood and Computation O.E. Barndorff-Nielsen, W.S. Kendall and M.N.M. van Lieshout (1998)
81 Computer-Assisted Analysis of Mixtures and Applications—Meta-analysis, Disease Mapping and Others D. Böhning (1999)
82 Classification, 2nd edition A.D. Gordon (1999)
83 Semimartingales and their Statistical Inference B.L.S. Prakasa Rao (1999)
84 Statistical Aspects of BSE and vCJD—Models for Epidemics C.A. Donnelly and N.M. Ferguson (1999)
85 Set-Indexed Martingales G. Ivanoff and E. Merzbach (2000)
86 The Theory of the Design of Experiments D.R. Cox and N. Reid (2000)
87 Complex Stochastic Systems O.E. Barndorff-Nielsen, D.R. Cox and C. Klüppelberg (2001)
88 Multidimensional Scaling, 2nd edition T.F. Cox and M.A.A. Cox (2001)
89 Algebraic Statistics—Computational Commutative Algebra in Statistics G. Pistone, E. Riccomagno and H.P. Wynn (2001)
90 Analysis of Time Series Structure—SSA and Related Techniques N. Golyandina, V. Nekrutkin and A.A. Zhigljavsky (2001)
91 Subjective Probability Models for Lifetimes Fabio Spizzichino (2001)
92 Empirical Likelihood Art B. Owen (2001)
93 Statistics in the 21st Century Adrian E. Raftery, Martin A. Tanner, and Martin T. Wells (2001)
94 Accelerated Life Models: Modeling and Statistical Analysis Vilijandas Bagdonavicius and Mikhail Nikulin (2001)
95 Subset Selection in Regression, Second Edition Alan Miller (2002)
96 Topics in Modelling of Clustered Data Marc Aerts, Helena Geys, Geert Molenberghs, and Louise M. Ryan (2002)
97 Components of Variance D.R. Cox and P.J. Solomon (2002)
98 Design and Analysis of Cross-Over Trials, 2nd Edition Byron Jones and Michael G. Kenward (2003)
99 Extreme Values in Finance, Telecommunications, and the Environment Bärbel Finkenstädt and Holger Rootzén (2003)
100 Statistical Inference and Simulation for Spatial Point Processes Jesper Møller and Rasmus Plenge Waagepetersen (2004)
101 Hierarchical Modeling and Analysis for Spatial Data Sudipto Banerjee, Bradley P. Carlin, and Alan E. Gelfand (2004)
102 Diagnostic Checks in Time Series Wai Keung Li (2004)
103 Stereology for Statisticians Adrian Baddeley and Eva B. Vedel Jensen (2004)
104 Gaussian Markov Random Fields: Theory and Applications Håvard Rue and Leonhard Held (2005)
105 Measurement Error in Nonlinear Models: A Modern Perspective, Second Edition Raymond J. Carroll, David Ruppert, Leonard A. Stefanski, and Ciprian M. Crainiceanu (2006)
106 Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood Youngjo Lee, John A. Nelder, and Yudi Pawitan (2006)
107 Statistical Methods for Spatio-Temporal Systems Bärbel Finkenstädt, Leonhard Held, and Valerie Isham (2007)
108 Nonlinear Time Series: Semiparametric and Nonparametric Methods Jiti Gao (2007)
109 Missing Data in Longitudinal Studies: Strategies for Bayesian Modeling and Sensitivity Analysis Michael J. Daniels and Joseph W. Hogan (2008)
110 Hidden Markov Models for Time Series: An Introduction Using R Walter Zucchini and Iain L. MacDonald (2009)
111 ROC Curves for Continuous Data Wojtek J. Krzanowski and David J. Hand (2009)
112 Antedependence Models for Longitudinal Data Dale L. Zimmerman and Vicente A. Núñez-Antón (2009)
113 Mixed Effects Models for Complex Data Lang Wu (2010)
114 Introduction to Time Series Modeling Genshiro Kitagawa (2010)
115 Expansions and Asymptotics for Statistics Christopher G. Small (2010)
116 Statistical Inference: An Integrated Bayesian/Likelihood Approach Murray Aitkin (2010)
117 Circular and Linear Regression: Fitting Circles and Lines by Least Squares Nikolai Chernov (2010)
118 Simultaneous Inference in Regression Wei Liu (2010)
119 Robust Nonparametric Statistical Methods, Second Edition Thomas P. Hettmansperger and Joseph W. McKean (2011)
120 Statistical Inference: The Minimum Distance Approach Ayanendranath Basu, Hiroyuki Shioya, and Chanseok Park (2011)
121 Smoothing Splines: Methods and Applications Yuedong Wang (2011)
Monographs on Statistics and Applied Probability 121
Smoothing Splines: Methods and Applications

Yuedong Wang
University of California
Santa Barbara, California, USA
CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742

© 2011 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business

No claim to original U.S. Government works
Version Date: 20110429
International Standard Book Number-13: 978-1-4200-7756-8 (eBook - PDF)
This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.
Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.
For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com

and the CRC Press Web site at
http://www.crcpress.com
TO
YAN, CATHERINE, AND KEVIN
Contents
1 Introduction 1
1.1 Parametric and Nonparametric Regression . . . 1
1.2 Polynomial Splines . . . 4
1.3 Scope of This Book . . . 7
1.4 The assist Package . . . 9

2 Smoothing Spline Regression 11
2.1 Reproducing Kernel Hilbert Space . . . 11
2.2 Model Space for Polynomial Splines . . . 14
2.3 General Smoothing Spline Regression Models . . . 16
2.4 Penalized Least Squares Estimation . . . 17
2.5 The ssr Function . . . 20
2.6 Another Construction for Polynomial Splines . . . 22
2.7 Periodic Splines . . . 24
2.8 Thin-Plate Splines . . . 26
2.9 Spherical Splines . . . 29
2.10 Partial Splines . . . 30
2.11 L-splines . . . 39
2.11.1 Motivation . . . 39
2.11.2 Exponential Spline . . . 41
2.11.3 Logistic Spline . . . 44
2.11.4 Linear-Periodic Spline . . . 46
2.11.5 Trigonometric Spline . . . 48

3 Smoothing Parameter Selection and Inference 53
3.1 Impact of the Smoothing Parameter . . . 53
3.2 Trade-Offs . . . 57
3.3 Unbiased Risk . . . 62
3.4 Cross-Validation and Generalized Cross-Validation . . . 64
3.5 Bayes and Linear Mixed-Effects Models . . . 67
3.6 Generalized Maximum Likelihood . . . 71
3.7 Comparison and Implementation . . . 72
3.8 Confidence Intervals . . . 75
3.8.1 Bayesian Confidence Intervals . . . 75
3.8.2 Bootstrap Confidence Intervals . . . 81
3.9 Hypothesis Tests . . . 84
3.9.1 The Hypothesis . . . 84
3.9.2 Locally Most Powerful Test . . . 85
3.9.3 Generalized Maximum Likelihood Test . . . 86
3.9.4 Generalized Cross-Validation Test . . . 87
3.9.5 Comparison and Implementation . . . 87

4 Smoothing Spline ANOVA 91
4.1 Multiple Regression . . . 91
4.2 Tensor Product Reproducing Kernel Hilbert Spaces . . . 92
4.3 One-Way SS ANOVA Decomposition . . . 93
4.3.1 Decomposition of R^a: One-Way ANOVA . . . 95
4.3.2 Decomposition of W_2^m[a, b] . . . 96
4.3.3 Decomposition of W_2^m(per) . . . 97
4.3.4 Decomposition of W_2^m(R^d) . . . 97
4.4 Two-Way SS ANOVA Decomposition . . . 98
4.4.1 Decomposition of R^a ⊗ R^b: Two-Way ANOVA . . . 99
4.4.2 Decomposition of R^a ⊗ W_2^m[0, 1] . . . 100
4.4.3 Decomposition of W_2^{m_1}[0, 1] ⊗ W_2^{m_2}[0, 1] . . . 103
4.4.4 Decomposition of R^a ⊗ W_2^m(per) . . . 106
4.4.5 Decomposition of W_2^{m_1}(per) ⊗ W_2^{m_2}[0, 1] . . . 107
4.4.6 Decomposition of W_2^2(R^2) ⊗ W_2^m(per) . . . 108
4.5 General SS ANOVA Decomposition . . . 110
4.6 SS ANOVA Models and Estimation . . . 111
4.7 Selection of Smoothing Parameters . . . 114
4.8 Confidence Intervals . . . 116
4.9 Examples . . . 117
4.9.1 Tongue Shapes . . . 117
4.9.2 Ozone in Arosa — Revisit . . . 126
4.9.3 Canadian Weather — Revisit . . . 131
4.9.4 Texas Weather . . . 133

5 Spline Smoothing with Heteroscedastic and/or Correlated Errors 139
5.1 Problems with Heteroscedasticity and Correlation . . . 139
5.2 Extended SS ANOVA Models . . . 142
5.2.1 Penalized Weighted Least Squares . . . 142
5.2.2 UBR, GCV and GML Criteria . . . 144
5.2.3 Known Covariance . . . 147
5.2.4 Unknown Covariance . . . 148
5.2.5 Confidence Intervals . . . 150
5.3 Variance and Correlation Structures . . . 150
5.4 Examples . . . 153
5.4.1 Simulated Motorcycle Accident — Revisit . . . . 153
5.4.2 Ozone in Arosa — Revisit . . . . . . . . . . . . . 154
5.4.3 Beveridge Wheat Price Index . . . . . . . . . . . 157
5.4.4 Lake Acidity . . . . . . . . . . . . . . . . . . . . 158
6 Generalized Smoothing Spline ANOVA 163
6.1 Generalized SS ANOVA Models . . . . . . . . . . . . . 163
6.2 Estimation and Inference . . . . . . . . . . . . . . . . . 164
6.2.1 Penalized Likelihood Estimation . . . . . . . . . 164
6.2.2 Selection of Smoothing Parameters . . . . . . . . 167
6.2.3 Algorithm and Implementation . . . . . . . . . . 168
6.2.4 Bayes Model, Direct GML and Approximate Bayesian Confidence Intervals . . . 170
6.3 Wisconsin Epidemiological Study of Diabetic Retinopathy . . . 172
6.4 Smoothing Spline Estimation of Variance Functions . . 176
6.5 Smoothing Spline Spectral Analysis . . . . . . . . . . . 182
6.5.1 Spectrum Estimation of a Stationary Process . . 182
6.5.2 Time-Varying Spectrum Estimation of a Locally Stationary Process . . . 183
6.5.3 Epileptic EEG . . . . . . . . . . . . . . . . . . . 185
7 Smoothing Spline Nonlinear Regression 195
7.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . 195
7.2 Nonparametric Nonlinear Regression Models . . . . . . 196
7.3 Estimation with a Single Function . . . . . . . . . . . . 197
7.3.1 Gauss–Newton and Newton–Raphson Methods . 197
7.3.2 Extended Gauss–Newton Method . . . . . . . . . 199
7.3.3 Smoothing Parameter Selection and Inference . . 201
7.4 Estimation with Multiple Functions . . . . . . . . . . . 204
7.5 The nnr Function . . . . . . . . . . . . . . . . . . . . . 205
7.6 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 206
7.6.1 Nonparametric Regression Subject to Positive Constraint . . . 206
7.6.2 Nonparametric Regression Subject to Monotone Constraint . . . 207
7.6.3 Term Structure of Interest Rates . . . . . . . . . 212
7.6.4 A Multiplicative Model for Chickenpox Epidemic 218
7.6.5 A Multiplicative Model for Texas Weather . . . . 223
8 Semiparametric Regression 227
8.1 Motivation . . . 227
8.2 Semiparametric Linear Regression Models . . . 228
8.2.1 The Model . . . 228
8.2.2 Estimation and Inference . . . 229
8.2.3 Vector Spline . . . 233
8.3 Semiparametric Nonlinear Regression Models . . . 240
8.3.1 The Model . . . 240
8.3.2 SNR Models for Clustered Data . . . 241
8.3.3 Estimation and Inference . . . 242
8.3.4 The snr Function . . . 245
8.4 Examples . . . 247
8.4.1 Canadian Weather — Revisit . . . 247
8.4.2 Superconductivity Magnetization Modeling . . . 254
8.4.3 Oil-Bearing Rocks . . . 257
8.4.4 Air Quality . . . 259
8.4.5 The Evolution of the Mira Variable R Hydrae . . . 262
8.4.6 Circadian Rhythm . . . 267

9 Semiparametric Mixed-Effects Models 273
9.1 Linear Mixed-Effects Models . . . 273
9.2 Semiparametric Linear Mixed-Effects Models . . . 274
9.2.1 The Model . . . 274
9.2.2 Estimation and Inference . . . 275
9.2.3 The slm Function . . . 279
9.2.4 SS ANOVA Decomposition . . . 280
9.3 Semiparametric Nonlinear Mixed-Effects Models . . . 283
9.3.1 The Model . . . 283
9.3.2 Estimation and Inference . . . 284
9.3.3 Implementation and the snm Function . . . 286
9.4 Examples . . . 288
9.4.1 Ozone in Arosa — Revisit . . . 288
9.4.2 Lake Acidity — Revisit . . . 291
9.4.3 Coronary Sinus Potassium in Dogs . . . 294
9.4.4 Carbon Dioxide Uptake . . . 305
9.4.5 Circadian Rhythm — Revisit . . . 310

A Data Sets 323
A.1 Air Quality Data . . . 324
A.2 Arosa Ozone Data . . . 324
A.3 Beveridge Wheat Price Index Data . . . 324
A.4 Bond Data . . . 324
A.5 Canadian Weather Data . . . 325
A.6 Carbon Dioxide Data . . . 325
A.7 Chickenpox Data . . . 325
A.8 Child Growth Data . . . 326
A.9 Dog Data . . . 326
A.10 Geyser Data . . . 326
A.11 Hormone Data . . . 327
A.12 Lake Acidity Data . . . 327
A.13 Melanoma Data . . . 327
A.14 Motorcycle Data . . . 328
A.15 Paramecium caudatum Data . . . 328
A.16 Rock Data . . . 328
A.17 Seizure Data . . . 328
A.18 Star Data . . . 329
A.19 Stratford Weather Data . . . 329
A.20 Superconductivity Data . . . 329
A.21 Texas Weather Data . . . 330
A.22 Ultrasound Data . . . 330
A.23 USA Climate Data . . . 331
A.24 Weight Loss Data . . . 331
A.25 WESDR Data . . . 331
A.26 World Climate Data . . . 332

B Codes for Fitting Strictly Increasing Functions 333
B.1 C and R Codes for Computing Integrals . . . 333
B.2 R Function inc . . . 336

C Codes for Term Structure of Interest Rates 339
C.1 C and R Codes for Computing Integrals . . . 339
C.2 R Function for One Bond . . . 341
C.3 R Function for Two Bonds . . . 342
References 347
Author Index 355
Subject Index 359
List of Tables
2.1 Bases of null spaces and RKs for linear and cubic splines under the construction in Section 2.2 with X = [0, b] . . . 21
2.2 Bases of null spaces and RKs for linear and cubic splines under the construction in Section 2.6 with X = [0, 1] . . . 23

5.1 Standard varFunc classes . . . 151
5.2 Standard corStruct classes for serial correlation structures . . . 152
5.3 Standard corStruct classes for spatial correlation structures . . . 153
A.1 List of all data sets . . . . . . . . . . . . . . . . . . . . . 323
List of Figures
1.1 Geyser data, observations, the straight line fit, and residuals . . . 2
1.2 Motorcycle data, observations, and a polynomial fit . . . 2
1.3 Geyser data, residuals, and the cubic spline fits . . . 3
1.4 Motorcycle data, observations, and the cubic spline fit . . . 4
1.5 Relationship between functions in the assist package and some of the existing R functions . . . 10

2.1 Motorcycle data, the linear, and cubic spline fits . . . 22
2.2 Arosa data, observations, and the periodic spline fits . . . 25
2.3 USA climate data, the thin-plate spline fit . . . 28
2.4 World climate data, the spherical spline fit . . . 31
2.5 Geyser data, the partial spline fit, residuals, and the AIC and GCV scores . . . 33
2.6 Motorcycle data, the partial spline fit, and the AIC and GCV scores . . . 34
2.7 Arosa data, the partial spline estimates of the month and year effects . . . 36
2.8 Canadian weather data, estimate of the weight function, and confidence intervals . . . 38
2.9 Weight loss data, observations and the nonlinear regression, cubic spline, and exponential spline fits . . . 43
2.10 Paramecium caudatum data, observations and the nonlinear regression, cubic spline, and logistic spline fits . . . 45
2.11 Melanoma data, observations, and the cubic spline and linear-periodic spline fits . . . 48
2.12 Arosa data, the overall fits and their projections . . . 51

3.1 Stratford weather data, observations, and the periodic spline fits with different smoothing parameters . . . 54
3.2 Weights of the periodic spline filter . . . 57
3.3 Stratford data, degrees of freedom, and residual sum of squares . . . 59
3.4 Squared bias, variance, and MSE from a simulation . . . 61
3.5 PSE and UBR functions . . . 65
3.6 PSE, CV, and GCV functions . . . . . . . . . . . . . . . 68
3.7 Geyser data, estimates of the smooth components in the cubic and partial spline models . . . 79
3.8 Motorcycle data, partial spline fit, and t-statistics . . . . 80
3.9 Pointwise coverages and across-the-function coverages . 83
4.1 Ultrasound data, 3-d plots of observations . . . 93
4.2 Ultrasound data, observations, fits, confidence intervals, and the mean curves among three environments . . . 118
4.3 Ultrasound data, the overall interaction . . . 119
4.4 Ultrasound data, effects of environment . . . 120
4.5 Ultrasound data, estimated tongue shapes as functions of length and time . . . 122
4.6 Ultrasound data, the estimated time effect . . . 123
4.7 Ultrasound data, estimated tongue shape as a function of environment, length and time . . . 125
4.8 Ultrasound data, the estimated environment effect . . . 126
4.9 Arosa data, estimates of the interactions and smooth component . . . 128
4.10 Arosa data, estimates of the main effects . . . 129
4.11 Canadian weather data, temperature profiles of stations in four regions and the estimated profiles . . . 132
4.12 Canadian weather data, the estimated region effects to temperature . . . 134
4.13 Texas weather data, observations as curves . . . 135
4.14 Texas weather data, observations as surfaces . . . 135
4.15 Texas weather data, the location effects for four selected stations . . . 137
4.16 Texas weather data, the month effects for January, April, July, and October . . . 138

5.1 WMSEs and coverages of Bayesian confidence intervals with the presence of heteroscedasticity . . . 140
5.2 Cubic spline fits when data are correlated . . . 141
5.3 Cubic spline fits and estimated autocorrelation functions for two simulations . . . 149
5.4 Motorcycle data, estimates of the mean and variance functions . . . 154
5.5 Arosa data, residual variances and PWLS fit . . . 155
5.6 Beveridge data, time series and cubic spline fits . . . 157
5.7 Lake acidity data, effects of calcium and geological location . . . 160
6.1 WESDR data, the estimated probability functions . . . 176
6.2 Motorcycle data, estimates of the variance function based on three procedures . . . 178
6.3 Motorcycle data, DGML function, and estimate of the variance and mean functions . . . 182
6.4 Seizure data, the baseline and preseizure IEEG segments . . . 186
6.5 Seizure data, periodograms, estimates of the spectra based on the iterative UBR method and confidence intervals . . . 187
6.6 Seizure data, estimates of the time-varying spectra based on the iterative UBR method . . . 189
6.7 Seizure data, estimates of the time-varying spectra based on the DGML method . . . 192
7.1 Nonparametric regression under positivity constraint . . . 207
7.2 Nonparametric regression under monotonicity constraint . . . 210
7.3 Child growth data, cubic spline fit, fit under monotonicity constraint and estimate of the velocity . . . 210
7.4 Bond data, unconstrained and constrained estimates of the discount functions, forward rates and credit spread . . . 214
7.5 Chickenpox data, time series plot and the fits by multiplicative and SS ANOVA models . . . 219
7.6 Chickenpox data, estimates of the mean and amplitude functions in the multiplicative model . . . 222
7.7 Chickenpox data, estimates of the shape function in the multiplicative model and its projections . . . 222
7.8 Texas weather data, estimates of the mean and amplitude functions in the multiplicative model . . . 224
7.9 Texas weather data, temperature profiles . . . 225
7.10 Texas weather data, the estimated interaction against the estimated main effect for two stations . . . 226
8.1 Separate and joint fits from a simulation . . . . . . . . . 236
8.2 Estimates of the differences . . . . . . . . . . . . . . . . 238
8.3 Canadian weather data, estimated region effects to pre-cipitation . . . . . . . . . . . . . . . . . . . . . . . . . . 249
8.4 Canadian weather data, estimate of the coefficient func-tion for the temperature effect . . . . . . . . . . . . . . 249
8.5 Canadian weather data, estimates of the intercept andweight functions . . . . . . . . . . . . . . . . . . . . . . 253
8.6 Superconductivity data, observations, and the fits by non-linear regression, cubic spline, nonlinear partial spline,and L-spline . . . . . . . . . . . . . . . . . . . . . . . . . 254
8.7 Superconductivity data, estimates of departures from the straight line model and the “interpolation formula” . . . . . . 256
8.8 Rock data, estimates of functions in the projection pursuit regression model . . . . . . 259
8.9 Air quality data, estimates of functions in SNR models . . . . . . 261
8.10 Star data, observations, and the overall fit . . . . . . 262
8.11 Star data, folded observations, estimates of the common shape function and its projection . . . . . . 264
8.12 Star data, estimates of the amplitude and period functions . . . . . . 266
8.13 Hormone data, cortisol concentrations for normal subjects and the fits based on an SIM . . . . . . 268
8.14 Hormone data, estimate of the common shape function in an SIM and its projection for normal subjects . . . . . . 270
9.1 Arosa data, the overall fit, seasonal trend, and long-term trend . . . . . . 290
9.2 Arosa data, the overall fit, seasonal trend, long-term trend and local stochastic trend . . . . . . 292
9.3 Lake acidity data, effects of calcium and geological location . . . . . . 294
9.4 Dog data, coronary sinus potassium concentrations over time . . . . . . 295
9.5 Dog data, estimates of the group mean response curves . . . . . . 302
9.6 Dog data, estimates of the group mean response curves under new penalty . . . . . . 304
9.7 Dog data, predictions for four dogs . . . . . . 305
9.8 Carbon dioxide data, observations and fits by the NLME and SNM models . . . . . . 307
9.9 Carbon dioxide data, overall estimate and projections of the nonparametric shape function . . . . . . 309
9.10 Hormone data, cortisol concentrations for normal subjects, and the fits based on a mixed-effects SIM . . . . . . 312
9.11 Hormone data, cortisol concentrations for depressed subjects, and the fits based on a mixed-effects SIM . . . . . . 313
9.12 Hormone data, cortisol concentrations for subjects with Cushing’s disease, and the fits based on a mixed-effects SIM . . . . . . 314
9.13 Hormone data, estimates of the common shape functions in the mixed-effects SIM . . . . . . 315
9.14 Hormone data, plot of the estimated 24-hour mean levels against amplitudes . . . . . . 321
Symbol Description
(x)+                  max{x, 0}
x ∧ z                 min{x, z}
x ∨ z                 max{x, z}
det+                  Product of the nonzero eigenvalues
kr(x)                 Scaled Bernoulli polynomials
(·, ·)                Inner product
‖ · ‖                 Norm
X                     Domain of a function
S                     Unit sphere
H                     Function space
L                     Linear functional
N                     Nonlinear functional
P                     Projection
A                     Averaging operator
M                     Model space
R(x, z)               Reproducing kernel
R^d                   Euclidean d-space
NS2m(t1, . . . , tk)  Natural polynomial spline space
W^m_2[a, b]           Sobolev space on [a, b]
W^m_2(per)            Sobolev space on the unit circle
W^m_2(R^d)            Thin-plate spline model space
W^m_2(S)              Sobolev space on the unit sphere
⊕                     Direct sum of function spaces
⊗                     Tensor product of function spaces
Preface
Statistical analysis often involves building mathematical models that examine the relationship between dependent and independent variables. This book is about a general class of powerful and flexible modeling techniques, namely, spline smoothing.

Research on smoothing spline models has attracted a great deal of attention in recent years, and the methodology has been widely used in many areas. This book provides an introduction to some basic smoothing spline models, including polynomial, periodic, spherical, thin-plate, L-, and partial splines, as well as an overview of more advanced models, including smoothing spline ANOVA, extended and generalized smoothing spline ANOVA, vector spline, nonparametric nonlinear regression, semiparametric regression, and semiparametric mixed-effects models. Methods for model selection and inference are also presented.

The general forms of nonparametric/semiparametric, linear/nonlinear, and fixed/mixed smoothing spline models in this book provide unified frameworks for estimation, inference, and software implementation. This book draws on the theory of reproducing kernel Hilbert space (RKHS) to present various smoothing spline models in a unified fashion. On the other hand, the subject of spline smoothing in the context of RKHS and regularization is often regarded as technical and difficult. One of my main goals is to make the advanced smoothing spline methodology based on RKHS more accessible to practitioners and students. With this in mind, the book focuses on methodology, computation, implementation, software, and application. It provides a gentle introduction to RKHS, keeps theory at a minimum level, and provides details on how RKHS can be used to construct spline models.

User-friendly software is key to the routine use of any statistical method. The assist library in R implements the methods presented in this book for fitting various nonparametric/semiparametric, linear/nonlinear, and fixed/mixed smoothing spline models. The assist library can be obtained at
http://www.r-project.org
Much of the exposition is based on the analysis of real examples. Rather than formal analysis, these examples are intended to illustrate the power and versatility of the spline smoothing methodology. All data analyses are performed in R, and most of them use functions in the assist library. Codes for all examples and further developments related to this book will be posted on the web page
http://www.pstat.ucsb.edu/faculty/yuedong/book.html
This book is intended for those wanting to learn about smoothing splines. It can be a reference book for statisticians and scientists who need advanced and flexible modeling techniques. It can also serve as a text for an advanced-level graduate course on the subject. In fact, topics in Chapters 1–4 were covered in a quarter class at the University of California, Santa Barbara, and at the University of Science and Technology of China.

I was fortunate indeed to have learned smoothing splines from Grace Wahba, whose pioneering work has paved the way for much ongoing research and made this book possible. I am grateful to Chunlei Ke, my former student and collaborator, for developing the assist package. Special thanks go to Anna Liu for reading the draft carefully and correcting many mistakes. Several people have helped me over various phases of writing this book: Chong Gu, Wensheng Guo, David Hinkley, Ping Ma, and Wendy Meiring. I must thank my editor, David Grubbes, for his patience and encouragement. Finally, I would like to thank several researchers who kindly shared their data sets for inclusion in this book; they are cited where their data are introduced.
Yuedong Wang
Santa Barbara
December 2010
Chapter 1
Introduction
1.1 Parametric and Nonparametric Regression
Regression analysis builds mathematical models that examine the relationship of a dependent variable to one or more independent variables. These models may be used to predict responses at unobserved and/or future values of the independent variables. In the simple case when both the dependent variable y and the independent variable x are scalar variables, given observations (xi, yi) for i = 1, . . . , n, a regression model relates the dependent and independent variables as follows:
yi = f(xi) + ǫi, i = 1, . . . , n, (1.1)
where f is the regression function and the ǫi are zero-mean independent random errors with a common variance σ². The goal of regression analysis is to construct a model for f and estimate it based on noisy data.

For example, for the Old Faithful geyser in Yellowstone National Park, consider the problem of predicting the waiting time to the next eruption using the length of the previous eruption. Figure 1.1(a) shows the scatter plot of waiting time to the next eruption (y = waiting) against duration of the previous eruption (x = duration) for 272 observations from the Old Faithful geyser. The goal is to build a mathematical model that relates the waiting time to the duration of the previous eruption. A first attempt might be to approximate the regression function f by a straight line
f(x) = β0 + β1x. (1.2)
The least squares straight line fit is shown in Figure 1.1(a). There is no apparent sign of lack of fit. Furthermore, there is no clearly visible trend in the plot of residuals in Figure 1.1(b).

Often f is nonlinear in x. A common approach to dealing with a nonlinear relationship is to approximate f by a polynomial of order m:

f(x) = β0 + β1x + · · · + βm−1x^(m−1). (1.3)
FIGURE 1.1 Geyser data, plots of (a) observations and the least squares straight line fit, and (b) residuals. [Axes: waiting (min) and residuals (min) against duration (min).]
Figure 1.2 shows the scatter plot of acceleration (y = acceleration) against time after impact (x = time) from a simulated motorcycle crash experiment on the efficacy of crash helmets. It is clear that a straight line cannot explain the relationship between acceleration and time. Polynomials with m = 1, . . . , 20 are fitted to the data, and Figure 1.2 shows the best fit selected by Akaike’s information criterion (AIC). There are waves in the fitted curve at both ends of the range. The fit is still not completely satisfactory even when polynomials up to order 20 are considered. Unlike the linear regression model (1.2), except for small m, the coefficients in model (1.3) no longer have nice interpretations.
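The order-selection step can be sketched in code. The snippet below is an illustration in Python/NumPy (the book's own analyses use R), on synthetic data rather than the actual motorcycle measurements, using one common Gaussian-error form of AIC, n log(RSS/n) + 2(m + 1):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the motorcycle data: a wiggly mean plus noise.
x = np.linspace(0.0, 1.0, 120)
y = np.sin(6 * np.pi * x) * np.exp(-2 * x) + rng.normal(0.0, 0.15, x.size)

def fit_poly(x, y, m):
    """Least squares fit of a polynomial of order m (degree m - 1)."""
    X = np.vander(x, m, increasing=True)  # columns 1, x, ..., x^(m-1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return X @ beta

def aic(y, fitted, m):
    n = y.size
    rss = np.sum((y - fitted) ** 2)
    return n * np.log(rss / n) + 2 * (m + 1)  # +1 accounts for the variance

orders = range(1, 21)
scores = {m: aic(y, fit_poly(x, y, m), m) for m in orders}
best = min(scores, key=scores.get)
print("order selected by AIC:", best)
```

For a truly wiggly mean function, the selected order is large, which echoes the point above: high-order global polynomials are needed, and their coefficients are hard to interpret.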
FIGURE 1.2 Motorcycle data, plot of observations, and a polynomial fit. [Axes: acceleration (g) against time (ms).]
In general, a parametric regression model assumes that the form of f is known except for finitely many unknown parameters. The specific form of f may come from scientific theories and/or approximations to mechanics under some simplified assumptions. The assumptions may be too restrictive and the approximations may be too crude for some applications. An inappropriate model can lead to systematic bias and misleading conclusions. In practice, one should always check the assumed form of the function f.

It is often difficult, if not impossible, to obtain a specific functional form for f. A nonparametric regression model does not assume a predetermined form. Instead, it makes assumptions on qualitative properties of f. For example, one may be willing to assume that f is “smooth”, which does not reduce to a specific form with a finite number of parameters. Rather, it usually leads to some infinite dimensional collection of functions. The basic idea of nonparametric regression is to let the data speak for themselves, that is, to let the data decide which function fits best without imposing any specific form on f. Consequently, nonparametric methods are in general more flexible. They can uncover structure in the data that might otherwise be missed.
FIGURE 1.3 Geyser data, plots of (a) residuals from the straight line fit and the cubic spline fit to the residuals, and (b) the cubic spline fit to the original data. [Axes: residuals (min) and waiting (min) against duration (min).]
For illustration, we fit cubic splines to the geyser data. The cubic spline is a special nonparametric regression model that will be introduced in Section 1.2. A cubic spline fit to the residuals from the linear model (1.2) reveals a nonzero trend in Figure 1.3(a). This raises the question of whether a simple linear regression model is appropriate for the geyser data. A cubic spline fit to the original data is shown in Figure 1.3(b). It reveals that there are two clusters in the independent variable, and a different linear model may be required for each cluster. Sections 2.10, 3.8, and 3.9 contain more analysis of the geyser data. A cubic spline fit to the motorcycle data is shown in Figure 1.4. It fits the data much better than the polynomial model. Sections 2.10, 3.8, 5.4.1, and 6.4 contain more analysis of the motorcycle data.
FIGURE 1.4 Motorcycle data, plot of observations, and the cubic spline fit. [Axes: acceleration (g) against time (ms).]
The above simple exposition indicates that the nonparametric regression technique can be applied to different steps in regression analysis: data exploration, model building, testing parametric models, and diagnosis. In fact, as illustrated throughout the book, spline smoothing is a powerful and versatile tool for building statistical models to exploit structures in data.
1.2 Polynomial Splines
The polynomial (1.3) is a global model, which makes it less adaptive to local variations. Individual observations can have undue influence on the fit in remote regions. For example, in the motorcycle data, the behavior of the mean function varies drastically from one region to another. These local variations led to the oscillations at both ends of the range in the polynomial fit. A natural solution to overcome this limitation is to use piecewise polynomials, the basic idea behind polynomial splines.
Let a < t1 < · · · < tk < b be fixed points called knots, and let t0 = a and tk+1 = b. Roughly speaking, polynomial splines are piecewise polynomials joined together smoothly at the knots. Formally, a polynomial spline of order r is a real-valued function f(t) on [a, b] such that

(i) f is a piecewise polynomial of order r on [ti, ti+1), i = 0, 1, . . . , k;

(ii) f has r − 2 continuous derivatives, and the (r − 1)st derivative is a step function with jumps at the knots.

Now consider even orders, written r = 2m. The function f is a natural polynomial spline of order 2m if, in addition to (i) and (ii), it satisfies the natural boundary conditions

(iii) f^(j)(a) = f^(j)(b) = 0, j = m, . . . , 2m − 1.

The natural boundary conditions imply that f is a polynomial of order m on the two outside subintervals [a, t1] and [tk, b]. Denote the space of natural polynomial splines of order 2m with knots t1, . . . , tk by NS2m(t1, . . . , tk).
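Condition (iii) is easy to check numerically for the cubic case. SciPy's CubicSpline with bc_type="natural" builds an interpolating natural cubic spline (the r = 4, m = 2 case; an interpolant, not the smoothing spline discussed below), and its second derivative vanishes at the end knots. A minimal sketch, not the book's R workflow:

```python
import numpy as np
from scipy.interpolate import CubicSpline

# Knots t_i and an interpolating cubic spline (order r = 4, so m = 2);
# bc_type="natural" imposes condition (iii): f''(a) = f''(b) = 0,
# so the fit is linear beyond the boundary knots.
t = np.linspace(0.0, 10.0, 8)
f = CubicSpline(t, np.sin(t), bc_type="natural")

# Second derivative at the two boundary knots (both vanish).
print(float(f(t[0], 2)), float(f(t[-1], 2)))
```

Evaluating derivatives with `f(x, nu)` also lets one verify condition (ii): the third derivative is piecewise constant with jumps only at the knots.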
One approach, known as regression spline, is to approximate f using a polynomial spline or natural polynomial spline. To get a good approximation, one needs to decide the number and locations of the knots. This book covers a different approach, known as smoothing spline. It starts with a well-defined model space for f and introduces a penalty to prevent overfitting. We now describe this approach for polynomial splines.
Consider the regression model (1.1). Suppose f is “smooth”. Specifically, assume that f ∈ W^m_2[a, b], where the Sobolev space

W^m_2[a, b] = {f : f, f′, . . . , f^(m−1) are absolutely continuous, ∫_a^b (f^(m))² dx < ∞}. (1.4)

For any a ≤ x ≤ b, Taylor’s theorem states that

f(x) = Σ_{ν=0}^{m−1} [f^(ν)(a)/ν!] (x − a)^ν + ∫_a^x [(x − u)^(m−1)/(m − 1)!] f^(m)(u) du. (1.5)

The first term on the right is a polynomial of order m; the second, integral term is the remainder, denoted by Rem(x).
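Expansion (1.5) can be verified numerically for a concrete function. A sketch with f = sin, m = 2, a = 0 (so f'(0) = 1 and f'' = −sin), evaluating the remainder integral by quadrature:

```python
import math

from scipy.integrate import quad

m, a, x = 2, 0.0, 1.3

# Polynomial part of order m: f(a) + f'(a)(x - a).
poly = math.sin(a) + math.cos(a) * (x - a)

# Remainder Rem(x): integral over [a, x] of (x - u)^(m-1)/(m-1)! * f^(m)(u) du,
# with f^(2)(u) = -sin(u).
rem, _ = quad(
    lambda u: (x - u) ** (m - 1) / math.factorial(m - 1) * (-math.sin(u)), a, x
)

print(abs(poly + rem - math.sin(x)))  # polynomial + remainder recovers f(x)
```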
It is clear that the polynomial regression model (1.3) ignores the remainder term Rem(x) in the hope that it is negligible. It is often difficult to verify this assumption in practice. The idea behind spline smoothing is to let the data decide how large Rem(x) should be. Since W^m_2[a, b] is an infinite dimensional space, a direct fit to f by minimizing the least squares (LS)

(1/n) Σ_{i=1}^n (yi − f(xi))² (1.6)

leads to interpolation. Therefore, certain control over Rem(x) is necessary. One natural approach is to control how far f is allowed to depart from the polynomial model. Under appropriate norms defined later in Sections 2.2 and 2.6, one measure of the distance between f and polynomials is ∫_a^b (f^(m))² dx. It is then reasonable to estimate f by minimizing the LS (1.6) under the constraint

∫_a^b (f^(m))² dx ≤ ρ (1.7)

for a constant ρ. By introducing a Lagrange multiplier, the constrained minimization problem (1.6)–(1.7) is equivalent to minimizing the penalized least squares (PLS)

(1/n) Σ_{i=1}^n (yi − f(xi))² + λ ∫_a^b (f^(m))² dx. (1.8)

In the remainder of this book, a polynomial spline refers to the solution of the PLS (1.8) in the model space W^m_2[a, b]. A cubic spline is a special case of the polynomial spline with m = 2. Since it measures the roughness of the function f, ∫_a^b (f^(m))² dx is often referred to as a roughness penalty. It is obvious that there is no penalty for polynomials of order less than or equal to m. The smoothing parameter λ balances the trade-off between goodness-of-fit, measured by the LS, and roughness of the estimate, measured by ∫_a^b (f^(m))² dx.
Suppose that n ≥ m and a ≤ x1 < x2 < · · · < xn ≤ b. Then, for fixed 0 < λ < ∞, (1.8) has a unique minimizer f, and f ∈ NS2m(x1, . . . , xn) (Eubank 1988). This result indicates that even though we started with the infinite dimensional space W^m_2[a, b] as the model space for f, the solution to the PLS (1.8) belongs to a finite dimensional space. Specifically, the solution is a natural polynomial spline with knots at the distinct design points. One approach to computing the polynomial spline estimate is to represent f as a linear combination of a basis of NS2m(x1, . . . , xn). Several basis constructions are provided in Section 3.3.3 of Eubank (1988). In particular, the R function smooth.spline implements this approach for the cubic spline using the B-spline basis. For example, the cubic spline fit in Figure 1.4 is produced by the following statements:

> library(MASS); attach(mcycle)
> smooth.spline(times, accel, all.knots=TRUE)
This book presents a different approach. Instead of basis functions, representers of reproducing kernel Hilbert spaces will be used to represent the spline estimate. This approach allows us to deal with many different spline models in a unified fashion. Details of this approach for polynomial splines will be presented in Sections 2.2 and 2.6.
When λ = 0, there is no penalty, and the natural spline that interpolates the observations is the unique minimizer. When λ = ∞, the unique minimizer is the mth order polynomial. As λ varies from ∞ to 0, we have a family of models ranging from the parametric polynomial model to interpolation. The value of λ decides how far f is allowed to depart from the polynomial model. Thus the choice of λ holds the key to the success of a spline estimate. We discuss how to choose λ based on data in Chapter 3.
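The role of λ in (1.8) can be mimicked in a simple discrete setting, replacing ∫(f^(m))² by squared second differences (a Whittaker-type smoother with m = 2; an illustration of the penalized least squares idea, not the exact smoothing spline). Minimizing ‖y − f‖² + λ‖Df‖², with D the second-difference matrix, gives the closed form f̂ = (I + λDᵀD)⁻¹y:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, n)

# Second-difference matrix D: (Df)_i = f_i - 2 f_{i+1} + f_{i+2},
# a discrete analogue of the roughness penalty integral of (f'')^2.
D = np.diff(np.eye(n), n=2, axis=0)

def smooth(y, lam):
    """Minimizer of ||y - f||^2 + lam * ||D f||^2."""
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

f_small = smooth(y, 1e-8)  # lambda -> 0: essentially interpolates the data
f_large = smooth(y, 1e8)   # lambda -> infinity: essentially a straight line

print(np.max(np.abs(f_small - y)))           # near 0: interpolation
print(np.max(np.abs(np.diff(f_large, 2))))   # near 0: second differences vanish
```

The null space of D consists of the linear sequences, paralleling the fact that (1.8) with m = 2 leaves straight lines unpenalized; intermediate λ values trace out the family between the two extremes.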
1.3 Scope of This Book
Driven by many sophisticated applications and fueled by modern computing power, many flexible nonparametric and semiparametric modeling techniques have been developed to relax parametric assumptions and to exploit possible hidden structure. There are many different nonparametric methods. This book concentrates on one of them, the smoothing spline. Existing books on this topic include Eubank (1988), Wahba (1990), Green and Silverman (1994), Eubank (1999), Gu (2002), and Ruppert, Wand and Carroll (2003). The goals of this book are to (a) make the advanced smoothing spline methodology based on reproducing kernel Hilbert spaces more accessible to practitioners and students; (b) provide software and examples so that the spline smoothing methods can be routinely used in practice; and (c) provide a comprehensive coverage of recently developed smoothing spline nonparametric/semiparametric linear/nonlinear fixed/mixed models. We concentrate on methodology, implementation, software, and application. Theoretical results are stated without proofs. All methods will be demonstrated using real data sets and R functions.
The polynomial spline in Section 1.2 concerns functions defined on the domain [a, b]. In many applications, the domain of the regression function is not a continuous interval. Furthermore, the regression function may only be observed indirectly. Chapter 2 introduces general smoothing spline regression models with reproducing kernel Hilbert spaces on general domains as model spaces. Penalized LS estimation, the Kimeldorf–Wahba representer theorem, computation, and the R function ssr will be covered. Explicit constructions of model spaces will be discussed in detail for some popular smoothing spline models, including polynomial, periodic, thin-plate, spherical, and L-splines.
Chapter 3 introduces methods for selecting the smoothing parameter and making inferences about the regression function. The impact of the smoothing parameter and basic concepts for model selection will be discussed and illustrated using an example. Connections between smoothing spline models and Bayes/mixed-effects models will be established. The unbiased risk, generalized cross-validation, and generalized maximum likelihood methods will be introduced for selecting the smoothing parameter. Bayesian and bootstrap confidence intervals will be introduced for the regression function and its components. The locally most powerful, generalized maximum likelihood, and generalized cross-validation tests will also be introduced to test the hypothesis of a parametric model versus a nonparametric alternative.

Analogous to multiple regression, Chapter 4 constructs models for multivariate regression functions based on smoothing spline analysis of variance (ANOVA) decompositions. The resulting models have hierarchical structures that facilitate model selection and interpretation. Smoothing spline ANOVA decompositions for tensor products of some commonly used smoothing spline models will be illustrated. Penalized LS estimation involving multiple smoothing parameters and componentwise Bayesian confidence intervals will be covered.

Chapter 5 presents spline smoothing methods for heterogeneous and correlated observations. The presence of heterogeneity and correlation may lead to a wrong choice of the smoothing parameters and erroneous inference. Penalized weighted LS will be used for estimation. The unbiased risk, generalized cross-validation, and generalized maximum likelihood methods will be extended for selecting the smoothing parameters. Variance and correlation structures will also be discussed.

Analogous to generalized linear models, Chapter 6 introduces smoothing spline ANOVA models for observations generated from a particular distribution in the exponential family, including binomial, Poisson, and gamma distributions. Penalized likelihood will be used for estimation, and methods for selecting the smoothing parameters will be discussed. Nonparametric estimation of variance and spectral density functions will be presented.
Analogous to nonlinear regression, Chapter 7 introduces spline smoothing methods for nonparametric nonlinear regression models where some unknown functions are observed indirectly through nonlinear functionals. In addition to fitting theoretical and empirical nonlinear nonparametric regression models, methods in this chapter may also be used to deal with constraints on the nonparametric function such as positivity or monotonicity. Several algorithms based on the Gauss–Newton, Newton–Raphson, extended Gauss–Newton, and Gauss–Seidel methods will be presented for different situations. Computation and the R function nnr will be covered.

Chapter 8 introduces semiparametric regression models that involve both parameters and nonparametric functions. The mean function may depend on the parameters and the nonparametric functions linearly or nonlinearly. The semiparametric regression models include many well-known models, such as the partial spline, varying coefficients, projection pursuit, single index, multiple index, functional linear, and shape invariant models, as special cases. Estimation, inference, computation, and the R function snr will also be covered.
Chapter 9 introduces semiparametric linear and nonlinear mixed-effects models. Smoothing spline ANOVA decompositions are extended for the construction of semiparametric mixed-effects models that parallel the classical mixed models. Estimation and inference methods, computation, and the R functions slm and snm will be covered as well.
1.4 The assist Package
The assist package was developed for fitting the various smoothing spline models covered in this book. It contains five main functions: ssr, nnr, snr, slm, and snm. The function ssr fits smoothing spline regression models in Chapter 2, smoothing spline ANOVA models in Chapter 4, extended smoothing spline ANOVA models with heterogeneous and correlated observations in Chapter 5, generalized smoothing spline ANOVA models in Chapter 6, and semiparametric linear regression models in Section 8.2. The function nnr fits nonparametric nonlinear regression models in Chapter 7. The function snr fits semiparametric nonlinear regression models in Section 8.3. The functions slm and snm fit semiparametric linear and nonlinear mixed-effects models in Chapter 9. The assist package is available at
http://cran.r-project.org
Figure 1.5 shows how the functions in assist generalize some of the existing R functions for regression analysis.
[Figure 1.5 is a diagram: arrows point from the existing R functions lm, glm, smooth.spline, nls, lme, gam, and nlme to the more general assist functions ssr, nnr, snr, slm, and snm.]

FIGURE 1.5 Functions in assist (dashed boxes) and some existing R functions (solid boxes). An arrow represents an extension to a more general model. lm: linear models. glm: generalized linear models. smooth.spline: cubic spline models. nls: nonlinear regression models. lme: linear mixed-effects models. gam: generalized additive models. nlme: nonlinear mixed-effects models. ssr: smoothing spline regression models. nnr: nonparametric nonlinear regression models. snr: semiparametric nonlinear regression models. slm: semiparametric linear mixed-effects models. snm: semiparametric nonlinear mixed-effects models.
Chapter 2
Smoothing Spline Regression
2.1 Reproducing Kernel Hilbert Space
Polynomial splines concern functions defined on a continuous interval. This is the most common situation in practice. Nevertheless, many applications require modeling functions defined on domains other than a continuous interval. For example, for spatial data with measurements on latitude and longitude, the domain of the function is the Euclidean space R². Specific spline models were developed for different applications. It is desirable to develop methodology and software on a general platform such that special cases are dealt with in a unified fashion. Reproducing kernel Hilbert space (RKHS) provides such a general platform.
This section provides a very brief review of RKHS. Throughout this book, important theoretical results are presented in italics without proofs. Details and proofs related to RKHS can be found in Aronszajn (1950), Wahba (1990), Gu (2002), and Berlinet and Thomas-Agnan (2004).
A nonempty set E of elements f, g, h, . . . forms a linear space if there are two operations: (1) addition, a mapping (f, g) → f + g from E × E into E; and (2) multiplication, a mapping (α, f) → αf from R × E into E, such that for any α, β ∈ R the following conditions are satisfied: (a) f + g = g + f; (b) (f + g) + h = f + (g + h); (c) for every f, g ∈ E, there exists h ∈ E such that f + h = g; (d) α(βf) = (αβ)f; (e) (α + β)f = αf + βf; (f) α(f + g) = αf + αg; and (g) 1f = f. Property (c) implies that there exists a zero element, denoted by 0, such that f + 0 = f for all f ∈ E.

A finite collection of elements f1, . . . , fk in E is called linearly independent if the relation α1f1 + · · · + αkfk = 0 holds only in the trivial case α1 = · · · = αk = 0. An arbitrary collection of elements A is called linearly independent if every finite subcollection is linearly independent. Let A be a subset of a linear space E. Define

span A ≜ {α1f1 + · · · + αkfk : f1, . . . , fk ∈ A, α1, . . . , αk ∈ R, k = 1, 2, . . . }.
11
12 Smoothing Splines: Methods and Applications
A set B ⊂ E is called a basis of E if B is linearly independent and span B = E.
A nonnegative function ‖ · ‖ on a linear space E is called a norm if (a) ‖f‖ = 0 if and only if f = 0; (b) ‖αf‖ = |α| ‖f‖; and (c) ‖f + g‖ ≤ ‖f‖ + ‖g‖. If the function ‖ · ‖ satisfies (b) and (c) only, then it is called a seminorm. A linear space with a norm is called a normed linear space.
Let E be a linear space. A mapping (·, ·) : E × E → R is called an inner product in E if it satisfies (a) (f, g) = (g, f); (b) (αf + βg, h) = α(f, h) + β(g, h); and (c) (f, f) ≥ 0, with (f, f) = 0 if and only if f = 0. An inner product defines a norm: ‖f‖ ≜ √(f, f). A linear space with an inner product is called an inner product space.
Let E be a normed linear space and fn a sequence in E. The sequence fn is said to converge to f ∈ E if lim_{n→∞} ‖fn − f‖ = 0, and f is called the limit point. The sequence fn is called a Cauchy sequence if lim_{l,n→∞} ‖fl − fn‖ = 0. The space E is complete if every Cauchy sequence converges to an element in E. A complete inner product space is called a Hilbert space.
A functional L on a Hilbert space H is a mapping from H to R. L is a linear functional if it satisfies L(αf + βg) = αLf + βLg. L is said to be continuous if lim_{n→∞} Lfn = Lf whenever lim_{n→∞} fn = f. L is said to be bounded if there exists a constant M such that |Lf| ≤ M‖f‖ for all f ∈ H. L is continuous if and only if L is bounded. For every fixed h ∈ H, Lhf ≜ (h, f) defines a continuous linear functional. Conversely, every continuous linear functional L can be represented as an inner product with a representer.

Riesz representation theorem
Let L be a continuous linear functional on a Hilbert space H. There exists a unique hL such that Lf = (hL, f) for all f ∈ H. The element hL is called the representer of L.
Let H be a Hilbert space of real-valued functions from X to R, where X is an arbitrary set. For a fixed x ∈ X, the evaluational functional Lx : H → R is defined as

Lxf ≜ f(x).

Note that the evaluational functional Lx maps a function to a real value, while the function f maps a point x to a real value; Lx applies to all functions in H with a fixed x. Evaluational functionals are linear since Lx(αf + βg) = αf(x) + βg(x) = αLxf + βLxg.
Definition A Hilbert space of real-valued functions H is an RKHS ifevery evaluational functional is continuous.
Let H be an RKHS. Then, for each x ∈ X , the evaluational functionalLxf = f(x) is continuous. By the Riesz representation theorem, thereexists an element Rx in H such that
Lxf = f(x) = (Rx, f),
where the dependence of the representer on x is expressed explicitly as $R_x$. Consider $R_x(z)$ as a bivariate function of x and z, and let $R(x,z) \triangleq R_x(z)$. The bivariate function R(x, z) is called the reproducing kernel (RK) of an RKHS H. The term reproducing kernel comes from the fact that $(R_x, R_z) = R(x, z)$. It is easy to check that an RK is nonnegative definite. That is, R is symmetric, R(x, z) = R(z, x), and for any $\alpha_1,\dots,\alpha_n \in \mathbb{R}$ and $x_1,\dots,x_n \in X$,
$$\sum_{i,j=1}^{n}\alpha_i\alpha_j R(x_i, x_j) \ge 0.$$
Therefore, every RKHS has a unique RK that is nonnegative definite. Conversely, an RKHS can be constructed based on a nonnegative definite function.
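To make the nonnegative definiteness concrete, here is a small numerical check — a Python/NumPy sketch for illustration only, since the book itself works in R. It uses R(x, z) = min{x, z}, which appears later in Table 2.1 as the RK of the linear spline subspace H1 on [0, b]:

```python
import numpy as np

def gram(kernel, x):
    """Gram matrix {R(x_i, x_j)} of a bivariate kernel over the points x."""
    return np.array([[kernel(xi, xj) for xj in x] for xi in x])

# R(x, z) = min{x, z} is an RK (Table 2.1, linear spline), so its Gram
# matrix must be symmetric and nonnegative definite for any set of points.
rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 30))
K = gram(min, x)

assert np.allclose(K, K.T)                   # symmetry: R(x, z) = R(z, x)
assert np.linalg.eigvalsh(K).min() > -1e-10  # nonnegative definiteness
```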
Moore–Aronszajn theorem. For every nonnegative definite function R on X × X, there exists a unique RKHS on X with R as its RK.
The above results indicate that there exists a one-to-one correspondence between RKHSs and nonnegative definite functions. For a finite-dimensional space H with an orthonormal basis $\phi_1(x),\dots,\phi_p(x)$, it is easy to see that
$$R(x,z) \triangleq \sum_{i=1}^{p}\phi_i(x)\phi_i(z)$$
is the RK of H.

The following definitions and results are useful for the construction
and decomposition of model spaces. S is called a subspace of a Hilbert space H if S ⊂ H and αf + βg ∈ S for every α, β ∈ $\mathbb{R}$ and f, g ∈ S. A closed subspace S is a Hilbert space. The orthogonal complement of S is defined as
$$S^\perp \triangleq \{f \in H : (f, g) = 0 \text{ for all } g \in S\}.$$
$S^\perp$ is a closed subspace of H. If S is a closed subspace of a Hilbert space H, then every element f ∈ H has a unique decomposition of the form f = g + h, where g ∈ S and h ∈ $S^\perp$. Equivalently, H is decomposed into two subspaces, $H = S \oplus S^\perp$. This decomposition is called a tensor sum decomposition, and the elements g and h are called projections onto S and $S^\perp$, respectively. Sometimes the notation H ⊖ S will be used to denote the subspace $S^\perp$. Tensor sum decompositions with more than two subspaces can be defined recursively.
All closed subspaces of an RKHS are RKHSs. If H = H0 ⊕ H1, and R, R0, and R1 are the RKs of H, H0, and H1, respectively, then R = R0 + R1. Conversely, suppose H is a Hilbert space and H = H0 ⊕ H1. If H0 and H1 are RKHSs with RKs R0 and R1, respectively, then H is an RKHS with RK R = R0 + R1.
2.2 Model Space for Polynomial Splines
Before introducing the general smoothing spline models, it is instructive to see how the polynomial splines introduced in Section 1.2 can be derived under the RKHS setup. Again, consider the regression model
$$y_i = f(x_i) + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.1)$$
where the domain of the function f is X = [a, b] and the model space for f is the Sobolev space $W_2^m[a,b]$ defined in (1.4). The smoothing spline estimate $\hat f$ is the solution to the PLS (1.8).
Model space construction and decomposition of $W_2^m[a,b]$
The Sobolev space $W_2^m[a,b]$ is an RKHS with the inner product
$$(f,g) = \sum_{\nu=0}^{m-1}f^{(\nu)}(a)g^{(\nu)}(a) + \int_a^b f^{(m)}g^{(m)}\,dx. \qquad (2.2)$$
Furthermore, $W_2^m[a,b] = H_0 \oplus H_1$, where
$$H_0 = \mathrm{span}\{1, (x-a), \dots, (x-a)^{m-1}/(m-1)!\},$$
$$H_1 = \Big\{f : f^{(\nu)}(a) = 0,\ \nu = 0,\dots,m-1,\ \int_a^b (f^{(m)})^2\,dx < \infty\Big\}, \qquad (2.3)$$
are RKHSs with corresponding RKs
$$R_0(x,z) = \sum_{\nu=1}^{m}\frac{(x-a)^{\nu-1}}{(\nu-1)!}\,\frac{(z-a)^{\nu-1}}{(\nu-1)!},$$
$$R_1(x,z) = \int_a^b \frac{(x-u)_+^{m-1}}{(m-1)!}\,\frac{(z-u)_+^{m-1}}{(m-1)!}\,du, \qquad (2.4)$$
where $(x)_+ \triangleq \max\{x, 0\}$.
Details about the foregoing construction can be found in Schumaker (2007). It is clear that H0 contains the polynomials of order m in the Taylor expansion. Note that the basis listed in (2.3), $\phi_\nu(x) = (x-a)^{\nu-1}/(\nu-1)!$ for ν = 1, ..., m, is an orthonormal basis of H0. For any f ∈ H1, it is easy to check that
$$f(x) = \int_a^x f'(u)\,du = \cdots = \int_a^x dx_1\int_a^{x_1}dx_2\cdots\int_a^{x_{m-1}}f^{(m)}(u)\,du$$
$$= \int_a^x dx_1\int_a^{x_1}dx_2\cdots\int_a^{x_{m-2}}(x_{m-2}-u)f^{(m)}(u)\,du = \cdots$$
$$= \int_a^x \frac{(x-u)^{m-1}}{(m-1)!}\,f^{(m)}(u)\,du.$$

Thus the subspace H1 contains the remainder term in the Taylor expansion.
Denote $P_1$ as the orthogonal projection operator onto H1. From the definition of the inner product, the roughness penalty is
$$\int_a^b (f^{(m)})^2\,dx = \|P_1 f\|^2. \qquad (2.5)$$
Therefore, $\int_a^b (f^{(m)})^2\,dx$ measures the distance between f and the parametric polynomial space H0. There is no penalty on functions in H0.
The PLS (1.8) can be rewritten as
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda\|P_1 f\|^2. \qquad (2.6)$$
The solution to (2.6) will be given for the general case in Section 2.4. The above setup for polynomial splines suggests the following ingredients for the construction of a general smoothing spline model:

1. An RKHS H as the model space for f
2. A decomposition of the model space into two subspaces, H = H0 ⊕ H1, where H0 consists of functions that are not penalized
3. A penalty $\|P_1 f\|^2$

Based on prior knowledge and the purpose of the study, different choices can be made for the model space, its decomposition, and the penalty. These options make the spline smoothing method flexible and versatile. Choices of these options will be illustrated throughout the book.
2.3 General Smoothing Spline Regression Models
A general smoothing spline regression (SSR) model assumes that
$$y_i = f(x_i) + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.7)$$
where $y_i$ are observations of the function f evaluated at design points $x_i$, and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$. To deal with different situations in a unified fashion, let the domain of the function f be an arbitrary set X, and the model space be an RKHS H on X with RK R(x, z). The choice of H depends on several factors, including the domain X and prior knowledge about the function f. Suppose H can be decomposed into two subspaces,
H = H0 ⊕H1, (2.8)
where H0 is a finite-dimensional space with basis functions $\phi_1(x),\dots,\phi_p(x)$, and H1 is an RKHS with RK $R_1(x,z)$. H0, often referred to as the null space, consists of functions that are not penalized. In addition to the construction for polynomial splines in Section 2.2, specific constructions of commonly used model spaces will be discussed in Sections 2.6–2.11. The decomposition (2.8) is equivalent to decomposing the function
$$f = f_0 + f_1, \qquad (2.9)$$
where $f_0$ and $f_1$ are projections onto H0 and H1, respectively. The component $f_0$ represents a linear regression model in the space H0, and the component $f_1$ represents systematic variation not explained by $f_0$. Therefore, the magnitude of $f_1$ can be used to check or test whether the parametric model is appropriate. The projections $f_0$ and $f_1$ will be referred to as the "parametric" and "smooth" components, respectively.
Sometimes observations of f are made indirectly through linear functionals. For example, f may be observed in the form $\int_a^b w_i(x)f(x)\,dx$, where $w_i$ are known functions. Another example is when observations are taken on the derivatives $f'(x_i)$. Therefore, it is useful to consider an even more general SSR model
$$y_i = L_i f + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.10)$$
where $L_i$ are bounded linear functionals on H. Model (2.7) is a special case of (2.10) with $L_i$ being evaluational functionals at the design points, defined as $L_i f = f(x_i)$. By the definition of an RKHS, these evaluational functionals are bounded.
2.4 Penalized Least Squares Estimation
The estimation method will be presented for the general model (2.10).
The smoothing spline estimate of f, denoted $\hat f$, is the minimizer of the PLS
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - L_i f)^2 + \lambda\|P_1 f\|^2, \qquad (2.11)$$
where λ is a smoothing parameter controlling the balance between the goodness-of-fit, measured by the least squares term, and the departure from the null space H0, measured by $\|P_1 f\|^2$. Functions in H0 are not penalized since $\|P_1 f\|^2 = 0$ when f ∈ H0. Note that $\hat f$ depends on λ even though the dependence is not expressed explicitly. Estimation procedures presented in this chapter assume that λ has been fixed. The impact of the smoothing parameter and methods for selecting it will be discussed in Chapter 3.
Since $L_i$ are bounded linear functionals, by the Riesz representation theorem there exists a representer $\eta_i \in H$ such that $L_i f = (\eta_i, f)$. For a fixed x, consider $R_x(z) \triangleq R(x,z)$ as a univariate function of z. Then, by the properties of the reproducing kernel, we have
$$\eta_i(x) = (\eta_i, R_x) = L_i R_x = L_{i(z)}R(x,z), \qquad (2.12)$$
where $L_{i(z)}$ indicates that $L_i$ is applied to what follows as a function of z. Equation (2.12) implies that the representer $\eta_i$ can be obtained by applying the operator to the RK R. Let $\xi_i = P_1\eta_i$ be the projection of $\eta_i$ onto H1. Since $R(x,z) = R_0(x,z) + R_1(x,z)$, where $R_0$ and $R_1$ are the RKs of H0 and H1, respectively, and $P_1$ is self-adjoint, that is, $(P_1 g, h) = (g, P_1 h)$ for any g, h ∈ H, we have
$$\xi_i(x) = (\xi_i, R_x) = (P_1\eta_i, R_x) = (\eta_i, P_1 R_x) = L_{i(z)}R_1(x,z). \qquad (2.13)$$
Equation (2.13) implies that the representer $\xi_i$ can be obtained by applying the operator to the RK $R_1$. Furthermore, $(\xi_i, \xi_j) = L_{i(x)}\xi_j(x) = L_{i(x)}L_{j(z)}R_1(x,z)$. Denote
$$T = \{L_i\phi_\nu\}_{i=1,\nu=1}^{n,p}, \qquad \Sigma = \{L_{i(x)}L_{j(z)}R_1(x,z)\}_{i,j=1}^{n}, \qquad (2.14)$$
where T is an n × p matrix and Σ is an n × n matrix. For the special case of evaluational functionals $L_i f = f(x_i)$, we have $\xi_i(x) = R_1(x, x_i)$, $T = \{\phi_\nu(x_i)\}_{i=1,\nu=1}^{n,p}$, and $\Sigma = \{R_1(x_i, x_j)\}_{i,j=1}^{n}$.
Write the estimate $\hat f$ as
$$\hat f(x) = \sum_{\nu=1}^{p}d_\nu\phi_\nu(x) + \sum_{i=1}^{n}c_i\xi_i(x) + \rho,$$
where ρ ∈ H1 and $(\rho, \xi_i) = 0$ for i = 1, ..., n. Since $\xi_i = P_1\eta_i$, $\eta_i$ can be written as $\eta_i = \zeta_i + \xi_i$, where $\zeta_i \in H_0$. Therefore,
$$L_i\rho = (\eta_i, \rho) = (\zeta_i, \rho) + (\xi_i, \rho) = 0. \qquad (2.15)$$
Let $y = (y_1,\dots,y_n)^T$ and $\hat{\boldsymbol f} = (L_1\hat f,\dots,L_n\hat f)^T$ be the vectors of observations and fitted values, respectively. Let $d = (d_1,\dots,d_p)^T$ and $c = (c_1,\dots,c_n)^T$. From (2.15), we have
$$\hat{\boldsymbol f} = Td + \Sigma c. \qquad (2.16)$$
Furthermore, $\|P_1\hat f\|^2 = \|\sum_{i=1}^{n}c_i\xi_i + \rho\|^2 = c^T\Sigma c + \|\rho\|^2$. Then the PLS (2.11) becomes
$$\frac{1}{n}\|y - Td - \Sigma c\|^2 + \lambda(c^T\Sigma c + \|\rho\|^2). \qquad (2.17)$$
It is obvious that (2.17) is minimized when ρ = 0, which leads to the following result in Kimeldorf and Wahba (1971).

Kimeldorf–Wahba representer theorem. Suppose T is of full column rank. Then the PLS (2.11) has a unique minimizer given by
$$\hat f(x) = \sum_{\nu=1}^{p}d_\nu\phi_\nu(x) + \sum_{i=1}^{n}c_i\xi_i(x). \qquad (2.18)$$
The above theorem indicates that the smoothing spline estimate $\hat f$ falls in a finite-dimensional space. Equation (2.18) represents the smoothing spline estimate $\hat f$ as a linear combination of the basis of H0 and the representers in H1. The coefficients c and d need to be estimated from the data. Based on (2.18), the PLS (2.17) reduces to
$$\frac{1}{n}\|y - Td - \Sigma c\|^2 + \lambda c^T\Sigma c. \qquad (2.19)$$
Taking the first derivatives leads to the following equations for c and d:
$$(\Sigma + n\lambda I)\Sigma c + \Sigma T d = \Sigma y,$$
$$T^T\Sigma c + T^T T d = T^T y. \qquad (2.20)$$
where I is the identity matrix. The equations in (2.20) are equivalent to
$$\begin{pmatrix}\Sigma + n\lambda I & \Sigma T\\ T^T & T^T T\end{pmatrix}\begin{pmatrix}\Sigma c\\ d\end{pmatrix} = \begin{pmatrix}\Sigma y\\ T^T y\end{pmatrix}.$$
There may be multiple sets of solutions for c when Σ is singular. Nevertheless, all sets of solutions lead to the same estimate of the function f (Gu 2002). Therefore, it is only necessary to derive one set of solutions. Consider the following equations:
$$(\Sigma + n\lambda I)c + Td = y,$$
$$T^T c = 0. \qquad (2.21)$$
It is easy to see that a set of solutions to (2.21) is also a set of solutions to (2.20). The solutions to (2.21) are
$$d = (T^T M^{-1}T)^{-1}T^T M^{-1}y,$$
$$c = M^{-1}\{I - T(T^T M^{-1}T)^{-1}T^T M^{-1}\}y, \qquad (2.22)$$
where $M = \Sigma + n\lambda I$.

To compute the coefficients c and d, consider the QR decomposition of T,
$$T = (Q_1\ Q_2)\begin{pmatrix}R\\ 0\end{pmatrix},$$
where $Q_1$, $Q_2$, and R are n × p, n × (n − p), and p × p matrices; $Q = (Q_1\ Q_2)$ is an orthogonal matrix; and R is upper triangular and invertible. Since $T^T c = R^T Q_1^T c = 0$, we have $Q_1^T c = 0$ and $c = QQ^T c = (Q_1 Q_1^T + Q_2 Q_2^T)c = Q_2 Q_2^T c$. Multiplying the first equation in (2.21) by $Q_2^T$ and using the fact that $Q_2^T T = 0$, we have $Q_2^T M Q_2 Q_2^T c = Q_2^T y$. Therefore,
$$c = Q_2(Q_2^T M Q_2)^{-1}Q_2^T y. \qquad (2.23)$$
Multiplying the first equation in (2.21) by $Q_1^T$, we have $Rd = Q_1^T(y - Mc)$. Thus,
$$d = R^{-1}Q_1^T(y - Mc). \qquad (2.24)$$
Equations (2.23) and (2.24) will be used to compute the coefficients c and d.
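The computations in (2.21)–(2.24) can be sketched in a few lines. The following Python/NumPy snippet is illustrative only — the book's actual implementation is the R function ssr. It builds T, Σ, and M for the evaluational-functional case with the cubic spline RK of Table 2.1, computes c and d via the QR decomposition, and verifies that they satisfy (2.21) and the identity in (2.25):

```python
import numpy as np

def cubic_rk(x, z):
    """Cubic spline RK on [0, b] (Table 2.1): (x^z)^2 {3(x v z) - (x^z)}/6."""
    lo, hi = np.minimum(x, z), np.maximum(x, z)
    return lo**2 * (3*hi - lo) / 6

rng = np.random.default_rng(1)
n, lam = 50, 1e-3
x = np.sort(rng.uniform(0, 1, n))
y = np.sin(2*np.pi*x) + 0.1*rng.standard_normal(n)

T = np.column_stack([np.ones(n), x])        # H0 basis: 1, x  (m = 2)
Sigma = cubic_rk(x[:, None], x[None, :])    # {R1(x_i, x_j)}
M = Sigma + n*lam*np.eye(n)

Q, R = np.linalg.qr(T, mode='complete')     # T = (Q1 Q2)(R; 0)
p = T.shape[1]
Q1, Q2, Ru = Q[:, :p], Q[:, p:], R[:p, :]

c = Q2 @ np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ y)  # equation (2.23)
d = np.linalg.solve(Ru, Q1.T @ (y - M @ c))        # equation (2.24)

# sanity checks against the defining equations (2.21) and (2.25)
assert np.allclose(M @ c + T @ d, y)        # (Sigma + n lam I)c + Td = y
assert np.allclose(T.T @ c, 0)              # T^T c = 0
fitted = T @ d + Sigma @ c                  # fitted values (2.16)
assert np.allclose(fitted, y - n*lam*c)     # equation (2.25)
```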
Based on (2.16), the first equation in (2.21), and equation (2.23), the fitted values are
$$\hat{\boldsymbol f} = Td + \Sigma c = y - n\lambda c = H(\lambda)y, \qquad (2.25)$$
where
$$H(\lambda) \triangleq I - n\lambda Q_2(Q_2^T M Q_2)^{-1}Q_2^T \qquad (2.26)$$
is the so-called hat (influence, smoothing) matrix. The dependence of the hat matrix on the smoothing parameter λ is expressed explicitly. Note that equation (2.25) provides the fitted values, while equation (2.18) can be used to compute estimates at any value of x.
2.5 The ssr Function
The R function ssr in the assist package is designed to fit SSR models. After deciding the model space and the penalty, the estimate $\hat f$ is completely decided by y, T, and Σ. Therefore, these terms need to be specified in the ssr function. A typical call is
ssr(formula, rk)
where formula and rk are required arguments. Together they specify y, T, and Σ. Suppose the vector y and the matrices T and Σ have been created in R. Then formula lists y on the left-hand side and the T matrix on the right-hand side of the operator ~. The argument rk specifies the matrix Σ.
In the most common situation where $L_i$ are evaluational functionals, the fitting can be greatly simplified since $L_i$ are decided by the design points $x_i$. There is no need to compute the T and Σ matrices before calling the ssr function. Instead, they can be computed internally. Specifically, a direct approach to fit the standard SSR model (2.7) is to list y on the left-hand side and $\phi_1(x),\dots,\phi_p(x)$ on the right-hand side of the operator ~ in the formula, and to specify a function for computing $R_1$ in the rk argument. Functions for computing the RKs of some commonly used RKHSs are available in the assist package. Users can easily write their own functions for computing RKs.
There are several optional arguments for the ssr function, some of which will be discussed in the following chapters. In particular, methods for selecting the smoothing parameter λ will be discussed in Chapter 3. For simplicity, unless explicitly specified, all examples in this chapter use the default method, which selects λ by the generalized cross-validation criterion. Bayesian and bootstrap confidence intervals for fitted functions are constructed based on the methods in Section 3.8.
We now show how to fit polynomial splines to the motorcycle data. Consider the construction of polynomial splines in Section 2.2. For simplicity, we first consider the special cases of polynomial splines with m = 1 and m = 2, which are called linear and cubic splines, respectively. Denote x ∧ z = min{x, z} and x ∨ z = max{x, z}. Based on (2.3) and (2.4), Table 2.1 lists bases for the null spaces and the RKs of linear and cubic splines for the special domain X = [0, b].
TABLE 2.1 Bases of null spaces and RKs of linear and cubic splines under the construction in Section 2.2 with X = [0, b]

m | Spline | φν   | R0     | R1
1 | Linear | 1    | 1      | x ∧ z
2 | Cubic  | 1, x | 1 + xz | (x ∧ z)²{3(x ∨ z) − x ∧ z}/6
The functions linear2 and cubic2 in the assist package compute evaluations of $R_1$ in Table 2.1 for linear and cubic splines, respectively. Functions for higher-order polynomial splines are also available. Note that the domain for the functions linear2 and cubic2 is X = [0, b] for any fixed b > 0. The RK on the general domain X = [a, b] can be calculated by a translation, for example, cubic2(x-a).
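As a sanity check on Table 2.1, the closed forms can be compared with the defining integral (2.4) (with a = 0) by numerical quadrature. This Python sketch is for illustration; the helper names linear2_rk and cubic2_rk are hypothetical, while linear2 and cubic2 themselves are R functions in the assist package:

```python
import numpy as np
from math import factorial

def rk_integral(x, z, m, npts=20001):
    """R1(x,z) from (2.4) with a = 0, by trapezoidal quadrature."""
    u = np.linspace(0.0, max(x, z), npts)
    g = (np.where(u < x, (x - u)**(m - 1), 0.0)
         * np.where(u < z, (z - u)**(m - 1), 0.0))
    du = u[1] - u[0]
    return (g[:-1] + g[1:]).sum() * du / 2 / factorial(m - 1)**2

def linear2_rk(x, z):   # m = 1 closed form from Table 2.1
    return min(x, z)

def cubic2_rk(x, z):    # m = 2 closed form from Table 2.1
    lo, hi = min(x, z), max(x, z)
    return lo**2 * (3*hi - lo) / 6

# quadrature agrees with the closed forms
assert abs(rk_integral(0.7, 0.3, 1) - linear2_rk(0.7, 0.3)) < 1e-4
assert abs(rk_integral(0.7, 0.3, 2) - cubic2_rk(0.7, 0.3)) < 1e-4
```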
To fit a cubic spline to the motorcycle data, one may create matricesT and Σ first and then call the ssr function:
> T <- cbind(1, times)
> Sigma <- cubic2(times)
> ssr(accel~T-1, rk=Sigma)
The intercept is automatically included in the formula statement. Therefore, T-1 is used to exclude the intercept since it is already included in the T matrix.
Since $L_i$ are evaluational functionals for the motorcycle example, the ssr function can be called directly:
> ssr(accel~times, rk=cubic2(times))
The inputs for formula and rk can be modified to fit polynomial splines of different orders. For example, the following statements fit linear, quintic (m = 3), and septic (m = 4) splines:
> ssr(accel~1, rk=linear2(times))
> ssr(accel~times+I(times^2), rk=quintic2(times))
> ssr(accel~times+I(times^2)+I(times^3), rk=septic2(times))
The linear and cubic spline fits are shown in Figure 2.1.
FIGURE 2.1 Motorcycle data, plots of observations (circles), the linear spline fit (left panel), and the cubic spline fit (right panel) as solid lines, with 95% Bayesian confidence intervals (shaded regions). [Figure: two panels sharing axes, time (ms) versus acceleration (g).]
2.6 Another Construction for Polynomial Splines
One construction of an RKHS for the polynomial spline was presented in Section 2.2. This section presents an alternative construction for $W_2^m[0,1]$ on the domain X = [0, 1]. In practice, without loss of generality, a continuous interval [a, b] can always be transformed into [0, 1].
Let $k_r(x) = B_r(x)/r!$ be scaled Bernoulli polynomials, where the $B_r$ are defined recursively by $B_0(x) = 1$, $B_r'(x) = rB_{r-1}(x)$, and $\int_0^1 B_r(x)\,dx = 0$ for r = 1, 2, ... (Abramowitz and Stegun 1964). The first four scaled Bernoulli polynomials needed in the following are
$$k_0(x) = 1,$$
$$k_1(x) = x - 0.5,$$
$$k_2(x) = \frac{1}{2}\Big\{k_1^2(x) - \frac{1}{12}\Big\},$$
$$k_4(x) = \frac{1}{24}\Big\{k_1^4(x) - \frac{1}{2}k_1^2(x) + \frac{7}{240}\Big\}. \qquad (2.27)$$
Alternative model space construction and decomposition of $W_2^m[0,1]$

The Sobolev space $W_2^m[0,1]$ is an RKHS with the inner product
$$(f,g) = \sum_{\nu=0}^{m-1}\Big(\int_0^1 f^{(\nu)}\,dx\Big)\Big(\int_0^1 g^{(\nu)}\,dx\Big) + \int_0^1 f^{(m)}g^{(m)}\,dx. \qquad (2.28)$$
Furthermore, $W_2^m[0,1] = H_0 \oplus H_1$, where
$$H_0 = \mathrm{span}\{k_0(x), k_1(x), \dots, k_{m-1}(x)\},$$
$$H_1 = \Big\{f : \int_0^1 f^{(\nu)}\,dx = 0,\ \nu = 0,\dots,m-1,\ \int_0^1 (f^{(m)})^2\,dx < \infty\Big\}, \qquad (2.29)$$
are RKHSs with corresponding RKs
$$R_0(x,z) = \sum_{\nu=0}^{m-1}k_\nu(x)k_\nu(z),$$
$$R_1(x,z) = k_m(x)k_m(z) + (-1)^{m-1}k_{2m}(|x-z|). \qquad (2.30)$$
The foregoing alternative construction was derived by Craven and Wahba (1979). Note that the inner product (2.28) is different from (2.2). Again, H0 contains polynomials, and the basis listed in (2.29), $\phi_\nu(x) = k_{\nu-1}(x)$ for ν = 1, ..., m, is an orthonormal basis of H0. Denote $P_1$ as the orthogonal projection operator onto H1. From the definition of the inner product, the roughness penalty is $\int_0^1 (f^{(m)})^2\,dx = \|P_1 f\|^2$.
Based on (2.29) and (2.30), Table 2.2 lists bases for the null spaces and RKs of linear and cubic splines under the alternative construction in this section.
TABLE 2.2 Bases of the null spaces and RKs for linear and cubic splines under the construction in Section 2.6 with X = [0, 1]

m | Spline | φν       | R0             | R1
1 | Linear | 1        | 1              | k1(x)k1(z) + k2(|x − z|)
2 | Cubic  | 1, k1(x) | 1 + k1(x)k1(z) | k2(x)k2(z) − k4(|x − z|)
The functions linear and cubic in the assist package compute evaluations of $R_1$ in Table 2.2 for linear and cubic splines, respectively. Functions for higher-order polynomial splines are also available. Note that the domain under the construction in this section is restricted to [0, 1]. Thus the scale option is needed when the domain is not [0, 1]. For example, the following statements fit linear and cubic splines to the motorcycle data:
> ssr(accel~1, rk=linear(times), scale=T)
> ssr(accel~times, rk=cubic(times), scale=T)
The scale option scales the independent variable times into the interval [0, 1]. It is good practice to scale a variable before fitting. For example, the following statements lead to the same cubic spline fit:
> x <- (times-min(times))/(max(times)-min(times))
> ssr(accel~x, rk=cubic(x))
2.7 Periodic Splines
Many natural phenomena follow a cyclic pattern. For example, many biochemical, physiological, or behavioral processes in living beings follow a daily cycle called the circadian rhythm, and many Earth processes follow an annual cycle. In these cases the mean function f is known to be a smooth periodic function. Without loss of generality, assume that the domain of the function is X = [0, 1] and f is a periodic function on [0, 1]. Since periodic functions can be regarded as functions defined on the unit circle, periodic splines are often referred to as splines on the circle.
The model space for a periodic spline of order m is
$$W_2^m(per) = \Big\{f : f^{(j)} \text{ absolutely continuous},\ f^{(j)}(0) = f^{(j)}(1),\ j = 0,\dots,m-1,\ \int_0^1 (f^{(m)})^2\,dx < \infty\Big\}. \qquad (2.31)$$
Craven and Wahba (1979) derived the following construction.
Model space construction and decomposition of $W_2^m(per)$

The space $W_2^m(per)$ is an RKHS with the inner product
$$(f,g) = \Big(\int_0^1 f\,dx\Big)\Big(\int_0^1 g\,dx\Big) + \int_0^1 f^{(m)}g^{(m)}\,dx.$$
Furthermore, $W_2^m(per) = H_0 \oplus H_1$, where
$$H_0 = \mathrm{span}\{1\},$$
$$H_1 = \Big\{f \in W_2^m(per) : \int_0^1 f\,dx = 0\Big\}, \qquad (2.32)$$
are RKHSs with corresponding RKs
$$R_0(x,z) = 1,$$
$$R_1(x,z) = (-1)^{m-1}k_{2m}(|x-z|). \qquad (2.33)$$
Again, the roughness penalty is $\int_0^1 (f^{(m)})^2\,dx = \|P_1 f\|^2$. The function periodic in the assist library calculates $R_1$ in (2.33). The order m is specified by the argument order. The default is a cubic periodic spline with order=2.
We now illustrate how to fit a periodic spline using the Arosa data, which contain monthly mean ozone thickness (Dobson units) in Arosa, Switzerland, from 1926 to 1971. Suppose we want to investigate how ozone thickness changes over the months in a year. It is reasonable to assume that the mean ozone thickness is a periodic function of month. Let thick be the dependent variable and x be the independent variable month scaled into the interval [0, 1]. The following statements fit a cubic periodic spline:
> data(Arosa); Arosa$x <- (Arosa$month-0.5)/12
> ssr(thick~1, rk=periodic(x), data=Arosa)
The fit of the periodic spline is shown in Figure 2.2.
FIGURE 2.2 Arosa data, plot of observations (points) and the periodic spline fit (solid line). The shaded region represents 95% Bayesian confidence intervals. [Figure: month (1–12) versus thickness (roughly 300–400 Dobson units).]
2.8 Thin-Plate Splines
Suppose f is a function of a multivariate independent variable $x = (x_1,\dots,x_d) \in \mathbb{R}^d$, where $\mathbb{R}^d$ is the Euclidean d-space. Assume the regression model
$$y_i = f(x_i) + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.34)$$
where $x_i = (x_{i1},\dots,x_{id})$ and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$.
Define the model space for a thin-plate spline as
$$W_2^m(\mathbb{R}^d) = \{f : J_m^d(f) < \infty\}, \qquad (2.35)$$
where
$$J_m^d(f) = \sum_{\alpha_1+\cdots+\alpha_d=m}\frac{m!}{\alpha_1!\cdots\alpha_d!}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\Big(\frac{\partial^m f}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\Big)^2\prod_{j=1}^{d}dx_j. \qquad (2.36)$$
Since $J_m^d(f)$ is invariant under a rotation of the coordinates, the thin-plate spline is especially well suited for spatial data (Wahba 1990, Gu 2002).
Define an inner product as
$$(f,g) = \sum_{\alpha_1+\cdots+\alpha_d=m}\frac{m!}{\alpha_1!\cdots\alpha_d!}\int_{-\infty}^{\infty}\cdots\int_{-\infty}^{\infty}\Big(\frac{\partial^m f}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\Big)\Big(\frac{\partial^m g}{\partial x_1^{\alpha_1}\cdots\partial x_d^{\alpha_d}}\Big)\prod_{j=1}^{d}dx_j. \qquad (2.37)$$
Model space construction of $W_2^m(\mathbb{R}^d)$

With the inner product (2.37), $W_2^m(\mathbb{R}^d)$ is an RKHS if and only if 2m − d > 0.
Details can be found in Duchon (1977) and Meinguet (1979). A thin-plate spline estimate is the minimizer of the PLS
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - f(x_i))^2 + \lambda J_m^d(f) \qquad (2.38)$$
in $W_2^m(\mathbb{R}^d)$. The null space H0 of the penalty functional $J_m^d(f)$ is the space spanned by the polynomials in d variables of total degree up to m − 1. Thus the dimension of the null space is $p = \binom{d+m-1}{d}$. For example, when d = 2 and m = 2,
$$J_2^2(f) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty}\Big\{\Big(\frac{\partial^2 f}{\partial x_1^2}\Big)^2 + 2\Big(\frac{\partial^2 f}{\partial x_1\partial x_2}\Big)^2 + \Big(\frac{\partial^2 f}{\partial x_2^2}\Big)^2\Big\}\,dx_1\,dx_2,$$
and the null space is spanned by $\phi_1(x) = 1$, $\phi_2(x) = x_1$, and $\phi_3(x) = x_2$. In general, denote $\phi_1,\dots,\phi_p$ as the p polynomials of total degree up to m − 1 that span H0. Denote $E_m$ as the Green function for the m-iterated Laplacian, $E_m(x,z) = E(\|x-z\|)$, where $\|x-z\|$ is the Euclidean distance and
$$E(u) = \begin{cases}(-1)^{d/2+1+m}|u|^{2m-d}\log|u|, & d \text{ even},\\ |u|^{2m-d}, & d \text{ odd}.\end{cases}$$
Let $T = \{\phi_\nu(x_i)\}_{i=1,\nu=1}^{n,p}$ and $K = \{E_m(x_i, x_j)\}_{i,j=1}^{n}$. The bivariate function $E_m$ is not the RK of $W_2^m(\mathbb{R}^d)$ since it is not nonnegative definite. Nevertheless, it is conditionally nonnegative definite in the sense that $T^T c = 0$ implies $c^T K c \ge 0$. Referred to as a semi-kernel, the function $E_m$ is sufficient for the purpose of estimation. Assume that T is of full column rank. It can be shown that the unique minimizer of the PLS (2.38) is given by (Wahba 1990, Gu 2002)
$$\hat f(x) = \sum_{\nu=1}^{p}d_\nu\phi_\nu(x) + \sum_{i=1}^{n}c_i\xi_i(x), \qquad (2.39)$$
where $\xi_i(x) = E_m(x_i, x)$. Therefore, $E_m$ plays the same role as the RK $R_1$. The coefficients c and d are solutions to
$$(K + n\lambda I)c + Td = y,$$
$$T^T c = 0. \qquad (2.40)$$
The above equations have the same form as those in (2.21). Therefore, the computations in Section 2.4 carry over with Σ replaced by K. The semi-kernel $E_m$ is calculated by the function tp.pseudo in the assist package. The order m is specified by the order argument, with default order=2.
The USA climate data contain average winter temperatures in 1981 from 1214 stations in the USA. To investigate how average winter temperature (temp) depends on geographical location (long and lat), we fit a thin-plate spline as follows:
> attach(USAtemp)
> ssr(temp~long+lat, rk=tp.pseudo(list(long,lat)))
FIGURE 2.3 USA climate data, contour plot of the thin-plate spline fit. [Figure: contour levels from 10 to 70.]
The contour plot of the fit is shown in Figure 2.3.
A genuine RK for $W_2^m(\mathbb{R}^d)$ is needed later in the computation of posterior variances in Chapter 3 and the construction of tensor product splines in Chapter 4. We now discuss briefly how to derive the genuine RK. Define the inner product
$$(f,g)_0 = \sum_{j=1}^{J}w_j f(u_j)g(u_j), \qquad (2.41)$$
where $u_j$ are fixed points in $\mathbb{R}^d$, and $w_j$ are fixed positive weights such that $\sum_{j=1}^{J}w_j = 1$. The points $u_j$ and weights $w_j$ are selected in such a way that the matrix $\{(\phi_\nu, \phi_\mu)_0\}_{\nu,\mu=1}^{p}$ is nonsingular. Let $\tilde\phi_\nu$, ν = 1, ..., p, be an orthonormal basis derived from $\phi_\nu$ with $\tilde\phi_1(x) = 1$. Let $P_0$ be the projection operator onto H0 defined as $P_0 f = \sum_{\nu=1}^{p}(f, \tilde\phi_\nu)_0\tilde\phi_\nu$. Then it can be shown that (Gu 2002)
$$R_0(x,z) = \sum_{\nu=1}^{p}\tilde\phi_\nu(x)\tilde\phi_\nu(z),$$
$$R_1(x,z) = (I - P_{0(x)})(I - P_{0(z)})E(\|x-z\|) \qquad (2.42)$$
are the RKs of H0 and $H_1 \triangleq W_2^m(\mathbb{R}^d) \ominus H_0$, where $P_{0(x)}$ and $P_{0(z)}$ are projections applied to the arguments x and z, respectively. Assume that $T = \{\phi_\nu(x_i)\}_{i=1,\nu=1}^{n,p}$ is of full column rank. Let $\Sigma = \{R_1(x_i, x_j)\}_{i,j=1}^{n}$. One relatively simple approach to compute $\tilde\phi_\nu$ and Σ is to let J = n, $u_j = x_j$, and $w_j = n^{-1}$. It is easy to see that $\{(\phi_\nu, \phi_\mu)_0\}_{\nu,\mu=1}^{p} = n^{-1}T^T T$, which is nonsingular. Let
$$T = (Q_1\ Q_2)\begin{pmatrix}R\\ 0\end{pmatrix}$$
be the QR decomposition of T. Then
$$(\tilde\phi_1(x),\dots,\tilde\phi_p(x)) = \sqrt{n}\,(\phi_1(x),\dots,\phi_p(x))R^{-1}$$
and
$$\Sigma = Q_2 Q_2^T K Q_2 Q_2^T.$$
The function tp computes evaluations of $R_1$ in (2.42) with J = n, $u_j = x_j$, and $w_j = n^{-1}$.
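The construction of Σ from K can be sketched as follows — a Python/NumPy illustration that repeats the thin-plate setup from above so the snippet stands alone (tp itself is an R function in the assist package). The resulting Σ is a genuine nonnegative definite Gram matrix whose rows and columns are orthogonal to the null space:

```python
import numpy as np

def E(u, m=2, d=2):
    """Thin-plate semi-kernel for d even (zero at u = 0)."""
    with np.errstate(divide='ignore', invalid='ignore'):
        val = (-1)**(d // 2 + 1 + m) * u**(2*m - d) * np.log(u)
    return np.where(u > 0, val, 0.0)

rng = np.random.default_rng(3)
n = 30
X = rng.uniform(size=(n, 2))
D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
K = E(D)
T = np.column_stack([np.ones(n), X])

Q, _ = np.linalg.qr(T, mode='complete')
Q2 = Q[:, T.shape[1]:]
Sigma = Q2 @ Q2.T @ K @ Q2 @ Q2.T   # genuine-RK Gram matrix at the design points

assert np.allclose(Sigma, Sigma.T)
assert np.linalg.eigvalsh(Sigma).min() > -1e-8  # a true RK: nonnegative definite
assert np.allclose(T.T @ Sigma, 0, atol=1e-8)   # orthogonal to the null space
```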
2.9 Spherical Splines
The spherical spline, also called the spline on the sphere, is an extension of both the periodic spline defined on the unit circle and the thin-plate spline defined on $\mathbb{R}^2$. Let the domain be X = S, where S is the unit sphere. Any point x on S can be represented as x = (θ, φ), where θ (0 ≤ θ ≤ 2π) is the longitude and φ (−π/2 ≤ φ ≤ π/2) is the latitude. Define
$$J(f) = \begin{cases}\displaystyle\int_0^{2\pi}\int_{-\pi/2}^{\pi/2}(\Delta^{m/2}f)^2\cos\phi\,d\phi\,d\theta, & m \text{ even},\\[2ex] \displaystyle\int_0^{2\pi}\int_{-\pi/2}^{\pi/2}\Big\{\frac{(\Delta^{(m-1)/2}f)_\theta^2}{\cos^2\phi} + (\Delta^{(m-1)/2}f)_\phi^2\Big\}\cos\phi\,d\phi\,d\theta, & m \text{ odd},\end{cases}$$
where the notation $(g)_z$ represents the partial derivative of g with respect to z, and Δf represents the surface Laplacian on the unit sphere defined as
$$\Delta f = \frac{1}{\cos^2\phi}f_{\theta\theta} + \frac{1}{\cos\phi}(\cos\phi\, f_\phi)_\phi.$$
Consider the model space
$$W_2^m(S) = \Big\{f : \Big|\int_S f\,dx\Big| < \infty,\ J(f) < \infty\Big\}.$$

Model space construction and decomposition of $W_2^m(S)$

$W_2^m(S)$ is an RKHS when m > 1. Furthermore, $W_2^m(S) = H_0 \oplus H_1$, where
$$H_0 = \mathrm{span}\{1\},$$
$$H_1 = \Big\{f \in W_2^m(S) : \int_S f\,dx = 0\Big\},$$
are RKHSs with corresponding RKs
$$R_0(x,z) = 1,$$
$$R_1(x,z) = \sum_{i=1}^{\infty}\frac{2i+1}{4\pi}\,\frac{1}{\{i(i+1)\}^m}\,G_i(\cos\gamma(x,z)),$$
where γ(x, z) is the angle between x and z, and the $G_i$ are the Legendre polynomials.
Details of the above construction can be found in Wahba (1981). The penalty is $\|P_1 f\|^2 = J(f)$. The RK $R_1$ is in the form of an infinite series, which is inconvenient to compute. Closed-form expressions are available only when m = 2 and m = 3. Wahba (1981) proposed replacing J by a topologically equivalent seminorm Q under which closed-form RKs can be derived. The function sphere in the assist package calculates $R_1$ under the seminorm Q for 2 ≤ m ≤ 6. The argument order specifies m, with default order=2.
The world climate data contain average winter temperatures in 1981 from 725 stations around the globe. To investigate how average winter temperature (temp) depends on geographical location (long and lat), we fit a spline on the sphere:
> data(climate)
> ssr(temp~1, rk=sphere(cbind(long,lat)), data=climate)
The contour plot of the spherical spline fit is shown in Figure 2.4.
2.10 Partial Splines
A partial spline model assumes that
$$y_i = s_i^T\beta + L_i f + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.43)$$
FIGURE 2.4 World climate data, contour plot of the spherical spline fit. [Figure: longitude (−150 to 150) versus latitude; contour levels from −40 to 25.]
where s is a q-dimensional vector of independent variables, β is a vector of parameters, $L_i$ are bounded linear functionals, and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$. We assume that f ∈ H, where H is an RKHS on an arbitrary domain X. Model (2.43) contains two components: a parametric linear model and a nonparametric function f. The partial spline model is a special case of the semiparametric linear regression model discussed in Chapter 8.
Suppose H = H0 ⊕ H1, where $H_0 = \mathrm{span}\{\phi_1,\dots,\phi_p\}$ and H1 is an RKHS with RK $R_1$. Denote $P_1$ as the projection onto H1. The function f and the parameters β are estimated as minimizers of the following PLS:
$$\frac{1}{n}\sum_{i=1}^{n}(y_i - s_i^T\beta - L_i f)^2 + \lambda\|P_1 f\|^2. \qquad (2.44)$$
Let $S = (s_1,\dots,s_n)^T$, $T = \{L_i\phi_\nu\}_{i=1,\nu=1}^{n,p}$, $X = (S\ T)$, and $\Sigma = \{L_{i(x)}L_{j(z)}R_1(x,z)\}_{i,j=1}^{n}$. Assume that X is of full column rank. Following arguments similar to those in Section 2.4, it can be shown that the PLS (2.44) has a unique minimizer, and the solution for f is given in (2.18).
Therefore, the PLS (2.44) reduces to
$$\frac{1}{n}\|y - X\alpha - \Sigma c\|^2 + \lambda c^T\Sigma c,$$
where $\alpha = (\beta^T, d^T)^T$. As in Section 2.4, we can solve for α and c from the following equations:
$$(\Sigma + n\lambda I)c + X\alpha = y,$$
$$X^T c = 0. \qquad (2.45)$$
The above equations have the same form as those in (2.21). Thus, the computations in Section 2.4 carry over with T and d replaced by X and α, respectively. The ssr function can be used to fit partial splines. When $L_i$ are evaluational functionals, the partial spline model (2.43) can be fitted by adding the s variables to the right-hand side of the formula argument. When $L_i$ are not evaluational functionals, the matrices X and Σ need to be created and supplied in the formula and rk arguments.
One interesting application of the partial spline model is to fit a nonparametric regression model with potential change-points. A change-point is defined as a discontinuity in the mean function or one of its derivatives. Note that the function $g(x) = (x-t)_+^k$ has a jump in its kth derivative at location t. Therefore, it can be used to model change-points. Specifically, consider the following model:
$$y_i = \sum_{j=1}^{J}\beta_j(x_i - t_j)_+^{k_j} + f(x_i) + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.46)$$
where $x_i \in [a, b]$ are design points, $t_j \in [a, b]$ are change-points, f is a smooth function, and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$. The mean function in model (2.46) has a jump at $t_j$ in its $k_j$th derivative with magnitude $\beta_j$. The choice of model space for f depends on the application. For example, the polynomial or periodic spline space may be used. When $t_j$ and $k_j$ are known, model (2.46) is a special case of the partial spline with q = J, $s_{ij} = (x_i - t_j)_+^{k_j}$, and $s_i = (s_{i1},\dots,s_{iJ})^T$.
We now use the geyser data and motorcycle data to illustrate change-point detection using partial splines. For the geyser data, Figure 1.3(b) indicates that there may be a jump in the mean function between 2.5 and 3.5 minutes. Therefore, we consider the model
$$y_i = \beta(x_i - t)_+^0 + f(x_i) + \epsilon_i, \quad i = 1,\dots,n, \qquad (2.47)$$
where $x_i$ are the duration variable scaled into [0, 1], t is a change-point, and $(x-t)_+^0 = 0$ when x ≤ t and 1 otherwise. We assume that $f \in W_2^2[0,1]$. For a fixed t, say t = 0.397, we can fit the partial spline as follows:
> attach(faithful)
> x <- (eruptions-min(eruptions))/diff(range(eruptions))
> ssr(waiting~x+(x>.397), rk=cubic(x))
The partial spline fit is shown in Figure 2.5(a). No trend is shown in the residual plot in Figure 2.5(b).
The change-point was fixed at t = 0.397 in the above fit. In practice it is often unknown. To search for the location of the change-point t, we compute the AIC and GCV (generalized cross-validation) criteria on a grid of points between 0.25 and 0.55:
> aic <- gcv <- NULL
> for (t in seq(.25,.55,by=.001)) {
fit <- ssr(waiting~x+(x>t), rk=cubic(x))
aic <- c(aic, length(x)*log(sum(fit$resi**2))+2*fit$df)
gcv <- c(gcv, fit$rkpk.obj$score)
}
The vector fit$resi contains residuals. The value fit$df representsthe degrees of freedom (trH(λ)) defined later in Chapter 3, Section 3.2.The GCV criterion is defined in (3.24). Figure 2.5(c) shows the AIC andGCV scores scaled into [0, 1]. The scaled scores are identical and reachthe minimum in the same region with the middle point at t = 0.397.
FIGURE 2.5 Geyser data, plots of (a) observations (points), the partial spline fit (solid line) with 95% Bayesian confidence intervals (shaded region), (b) residuals (points) from the partial spline fit and the cubic spline fit (solid line) to the residuals, and (c) the AIC (dashed line) and GCV (solid line) scores scaled into [0, 1].
For the motorcycle data, it is apparent that the mean curve is flat on the left and there is a sharp corner around 15 ms. The linear and cubic splines fit this region with rounded corners (Figure 2.1). The sharp corner suggests that there may be a change-point in the first derivative of the mean function. Therefore, we consider the model

y_i = β (x_i − t)_+ + f(x_i) + ǫ_i,  i = 1, ..., n,   (2.48)

where x is the time variable scaled into [0, 1] and t is the change-point in the first derivative. We assume that f ∈ W_2^2[0, 1]. Again, we use the AIC and GCV criteria to search for the location of the change-point t. Figure 2.6(b) shows the scaled AIC and GCV scores. They both reach the minimum at t = 0.214. For the fixed t = 0.214, model (2.48) can be fitted as follows:

> t <- .214; s <- (x-t)*(x>t)
> ssr(accel~x+s, rk=cubic(x))

The partial spline fit is shown in Figure 2.6(a). The sharp corner around 15 ms is preserved. Chapter 3 contains more analysis of potential change-points for the motorcycle data.
FIGURE 2.6 Motorcycle data, plots of (a) observations (circles), the partial spline fit (solid line) with 95% Bayesian confidence intervals (shaded region), and (b) the AIC (dashed line) and GCV (solid line) scores scaled into [0, 1].
Often, in practice, there is enough knowledge to model some components of the regression function parametrically. For other, uncertain components, it may be desirable to leave them unspecified. Combining parametric and nonparametric components, partial spline (semiparametric) models are well suited to these situations.
As an illustration, consider the Arosa data. We have investigated how ozone thickness changes over months in a year by fitting a periodic spline (Figure 2.2). Suppose now we want to investigate how ozone thickness changes over time. That is, we need to consider both the month (seasonal) effect and the year (long-term) effect. Let x1 and x2 be the month and year variables scaled into the interval [0, 1]. For the purpose of illustration, suppose the seasonal trend can be well approximated by a simple sinusoidal function. The form of the long-term trend will be left unspecified. Therefore, we consider the following partial spline model

y_i = β_1 + β_2 sin(2πx_{i1}) + β_3 cos(2πx_{i1}) + f(x_{i2}) + ǫ_i,  i = 1, ..., 518,   (2.49)

where y_i represents the average ozone thickness in month x_{i1} of year x_{i2}, and f ∈ W_2^2[0, 1] ⊖ {1}. Note that the constant functions are removed from the model space for f so that f is identifiable in the presence of the constant β_1. The partial spline model (2.49) can be fitted as follows:

> x1 <- (Arosa$month-0.5)/12; x2 <- (Arosa$year-1)/45
> ssr(thick~sin(2*pi*x1)+cos(2*pi*x1)+x2, rk=cubic(x2))

Estimates of the main effect of month, β_2 sin(2πx_1) + β_3 cos(2πx_1), and the main effect of year, f(x_2), are shown in Figure 2.7. The simple sinusoidal model for the seasonal trend is too restrictive. More general models for the Arosa data can be found in Chapter 4, Section 4.9.2.
Functional data are observations in the form of functions. The most common forms of functional data are curves defined on a continuous interval and surfaces defined on R². A functional linear model (FLM) is a linear model that involves functional data as (i) the independent variable, (ii) the dependent variable, or (iii) both the independent and dependent variables. Many methods have been developed for fitting FLMs where the functional data are curves or surfaces (Ramsay and Silverman 2005). We will use the Canadian weather data to illustrate how to fit FLMs using methods in this book. An FLM corresponding to situation (i) is discussed in this section. FLMs corresponding to situations (ii) and (iii) will be introduced in Chapter 4, Section 4.9.3, and Chapter 8, Section 8.4.1, respectively. Note that the methods illustrated in these sections for functional data apply to general functions defined on arbitrary domains. In particular, they can be used to fit surfaces defined on R².

Now consider the Canadian weather data. To investigate the relationship between total annual precipitation and the temperature function,
FIGURE 2.7 Arosa data, plots of estimated main effects with 95% Bayesian confidence intervals. A circle in the left panel represents the average thickness for a particular month minus the overall mean. A circle in the right panel represents the average thickness for a particular year minus the overall mean.
consider the following FLM

y_i = β_1 + ∫_0^1 w_i(x) f(x) dx + ǫ_i,  i = 1, ..., 35,   (2.50)

where y_i is the logarithm of total annual precipitation at station i, β_1 is a constant parameter, x is the variable month transformed into [0, 1], w_i(x) is the temperature function at station i, f(x) is an unknown weight function, and ǫ_i are random errors. Model (2.50) is the same as model (15.2) in Ramsay and Silverman (2005). It is an example where the independent variable is a curve. The goal is to estimate the weight function f. It is reasonable to assume that f is a smooth periodic function. Specifically, we model f using the cubic periodic spline space W_2^2(per). Write f(x) = β_2 + f_1(x), where f_1 ∈ W_2^2(per) ⊖ {1}. Then model (2.50) can be rewritten as a partial spline model

y_i = β_1 + β_2 ∫_0^1 w_i(x) dx + ∫_0^1 w_i(x) f_1(x) dx + ǫ_i ≜ s_i^T β + L_i f_1 + ǫ_i,   (2.51)

where s_i = (1, ∫_0^1 w_i(x) dx)^T, β = (β_1, β_2)^T, and L_i f_1 = ∫_0^1 w_i(x) f_1(x) dx. Assume that the w_i are square integrable. Then the L_i are bounded linear functionals. Let R_1 be the RK of W_2^2(per) ⊖ {1}. From (2.14), the (i, j)th element of Σ equals
L_{i(x)} L_{j(z)} R_1(x, z) = ∫_0^1 ∫_0^1 w_i(s) w_j(t) R_1(s, t) ds dt
  ≈ (1/144) Σ_{k=1}^{12} Σ_{l=1}^{12} w_i(x_k) w_j(x_l) R_1(x_k, x_l)
  = (1/144) w_i^T R_1(x, x) w_j,   (2.52)
where x_k represents the middle point of month k, x = (x_1, ..., x_{12})^T, R_1(x, x) = {R_1(x_k, x_l)}_{k,l=1}^{12}, w_i(x_k) is the temperature of month k at station i, and w_i = (w_i(x_1), ..., w_i(x_{12}))^T. The rectangle rule is used to approximate the integrals; more accurate approximations may be used. Let W = (w_1, ..., w_{35}). Then Σ ≈ W^T R_1(x, x) W/144. The following statements fit the partial spline model (2.51):
> library(fda); attach(CanadianWeather)
> y <- log(apply(monthlyPrecip,2,sum))
> W <- monthlyTemp
> s <- apply(W,2,mean)
> x <- seq(0.5,11.5,1)/12
> Sigma <- t(W)%*%periodic(x)%*%W/144
> canada.fit1 <- ssr(y~s, rk=Sigma, spar="m")
where the vector s contains elements Σ_{j=1}^{12} w_i(x_j)/12, which are approximations of ∫_0^1 w_i(x) dx. The generalized maximum likelihood (GML) method in Chapter 3, Section 3.6, is used to select the smoothing parameter since the GCV estimate is too small due to the small sample size. This is accomplished by setting the option spar="m".
From equation (2.18), the estimate of the weight function is

f̂(x) = d_2 + Σ_{i=1}^{35} c_i L_{i(z)} R_1(x, z),
where d_2 is the estimate of β_2. To compute f̂ at a set of points, say x_0 = (x_{01}, ..., x_{0n_0})^T, we have

f̂(x_0) = d_2 1_{n_0} + Σ_{i=1}^{35} c_i ∫_0^1 R_1(x_0, x) w_i(x) dx
  ≈ d_2 1_{n_0} + Σ_{i=1}^{35} c_i Σ_{j=1}^{12} R_1(x_0, x_j) w_i(x_j)/12
  = d_2 1_{n_0} + Σ_{i=1}^{35} c_i R_1(x_0, x) w_i/12
  = d_2 1_{n_0} + S c,

where f̂(x_0) = (f̂(x_{01}), ..., f̂(x_{0n_0}))^T, 1_m represents an m-vector of all ones, R_1(x_0, x) = (R_1(x_{01}, x), ..., R_1(x_{0n_0}, x))^T, R_1(x_0, x) = {R_1(x_{0i}, x_j)} for i = 1, ..., n_0 and j = 1, ..., 12, S = R_1(x_0, x) W/12, and c = (c_1, ..., c_n)^T. We compute f̂(x_0) at 50 equally spaced points in [0, 1] as follows:
> S <- periodic(seq(0,1,len=50),x)%*%W/12
> fhat <- canada.fit1$coef$d[2]+S%*%canada.fit1$coef$c
Figure 2.8 displays the estimated weight function and 95% bootstrap confidence intervals. The shape of the weight function is similar to that in Figure 15.5 of Ramsay and Silverman (2005). Note that monthly data are used in this book, while daily data were used in Ramsay and Silverman (2005).
FIGURE 2.8 Canadian weather data, plot of the estimated weight function (solid line) and 95% bootstrap confidence intervals (shaded region).
2.11 L-splines
2.11.1 Motivation
In the construction of a smoothing spline model, one needs to decide the penalty functional or, equivalently, the null space H_0 consisting of functions that are not penalized. The squared bias of the spline estimate satisfies the following inequality (Wahba 1990, p. 59):

(1/n) Σ_{i=1}^n {E(L_i f̂) − L_i f}² ≤ λ ||P_1 f||².

That is, the squared bias is bounded by the distance of f to the null space H_0. Therefore, selecting a penalty such that ||P_1 f||² is small can lead to low bias in the spline estimate of the function. This is equivalent to selecting a null space H_0 that is close to the true function. Ideally, one wants to choose H_0 such that ||P_1 f||² = 0, that is, f ∈ H_0. However, it is usually difficult, if not impossible, to specify such a parametric space in practice. Nevertheless, often there is prior information suggesting that f can be well approximated by a parametric model. That is, f is close to, but not necessarily in, the space H_0. Heckman and Ramsay (2000) called such a space a favored parametric model. L-splines allow us to incorporate this kind of indefinite information. L-splines can also be used to check or test parametric models (Chapter 3).
Consider functions on the domain X = [a, b]. Let L be the linear differential operator

L = D^m + Σ_{j=0}^{m−1} ω_j(x) D^j,   (2.53)

where m ≥ 1, D^j denotes the jth derivative operator, and the ω_j are continuous real-valued functions. The minimizer of the following PLS

(1/n) Σ_{i=1}^n (y_i − L_i f)² + λ ∫_a^b (Lf)² dx

is called an L-spline. Note that the penalty is in general different from that of a polynomial or periodic spline. To utilize the general estimation procedure developed in Section 2.4, we need to construct an RKHS such that the penalty ∫_a^b (Lf)² dx = ||P_1 f||² with an appropriately defined inner product.
Suppose f ∈ W_2^m[a, b]. Then Lf exists and is square integrable. Let H_0 = {f : Lf = 0} be the kernel of L. Based on results from differential equations (Coddington 1961), there exist real-valued functions u_1, ..., u_m ∈ W_2^m[a, b] such that u_1, ..., u_m form a basis of H_0. Furthermore, the Wronskian matrix associated with u_1, ..., u_m,

W(x) = [ u_1(x), u_2(x), ..., u_m(x)
         u_1'(x), u_2'(x), ..., u_m'(x)
         ...
         u_1^{(m−1)}(x), u_2^{(m−1)}(x), ..., u_m^{(m−1)}(x) ],

is invertible for all x. Define the inner product

(f, g) = Σ_{ν=0}^{m−1} f^{(ν)}(a) g^{(ν)}(a) + ∫_a^b (Lf)(Lg) dx.   (2.54)

Equation (2.54) defines a proper inner product, since f^{(ν)}(a) = 0 for ν = 0, ..., m−1, together with Lf = 0, leads to f = 0. Denote u(x) = (u_1(x), ..., u_m(x))^T. Let u*(x) = (u*_1(x), ..., u*_m(x))^T be the last column of W^{−1}(x), and let

G(x, s) = u^T(x) u*(s) for s ≤ x, and G(x, s) = 0 for s > x,

be the Green function associated with L.
Model space construction and decomposition of W_2^m[a, b]

The space W_2^m[a, b] is an RKHS with inner product (2.54). Furthermore, W_2^m[a, b] = H_0 ⊕ H_1, where

H_0 = span{u_1, ..., u_m},
H_1 = {f ∈ W_2^m[a, b] : f^{(ν)}(a) = 0, ν = 0, ..., m−1},   (2.55)

are RKHS's with corresponding RKs

R_0(x, z) = u^T(x) {W^T(a) W(a)}^{−1} u(z),
R_1(x, z) = ∫_a^b G(x, s) G(z, s) ds.   (2.56)
The above construction of an L-spline is based on a given differential operator L. In practice, rather than a differential operator, prior knowledge may be in the form of basis functions for the null space H_0. Specifically, suppose prior knowledge suggests that f can be well approximated by a parametric space spanned by a basis u_1, ..., u_m. Then one can solve the following equations to derive the coefficients ω_j in (2.53):

(L u_ν)(x) = u_ν^{(m)}(x) + Σ_{j=0}^{m−1} ω_j(x) u_ν^{(j)}(x) = 0,  ν = 1, ..., m.

Let W(x) be the Wronskian matrix, ω(x) = (ω_0(x), ..., ω_{m−1}(x))^T, and u^{(m)}(x) = (u_1^{(m)}(x), ..., u_m^{(m)}(x))^T. Then the above equations can be written in matrix form: W^T(x) ω(x) = −u^{(m)}(x). Assume that W(x) is invertible. Then ω(x) = −W^{−T}(x) u^{(m)}(x).
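For the exponential basis used in the next subsection (u_1 = 1, u_2 = e^{−γx}), this recipe can be carried out numerically. A small sketch (Python rather than the book's R, with the 2 × 2 system solved by hand; the function name is ours) recovers ω(x) = (0, γ)^T, i.e., L = D² + γD:

```python
import math

def omega_exp(x, gamma):
    # Solve W(x)^T omega = -u''(x) for u1 = 1, u2 = exp(-gamma x).
    # Here W^T(x) = [[1, 0], [e, -gamma*e]] with e = exp(-gamma x),
    # and u''(x) = (0, gamma^2 * e)^T.
    e = math.exp(-gamma * x)
    w0 = 0.0                                      # first row: 1*w0 + 0*w1 = -0
    w1 = (-gamma**2 * e - e * w0) / (-gamma * e)  # second row, back-substitution
    return w0, w1

print(omega_exp(0.3, 2.5))
```

The result does not depend on x, consistent with the constant-coefficient operator L = D² + γD derived below.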
It is clear that the polynomial spline is a special case with L = D^m. Some special L-splines are discussed in the following subsections. More details can be found in Schumaker (2007), Wahba (1990), Dalzell and Ramsay (1993), Heckman (1997), Gu (2002), and Ramsay and Silverman (2005).
2.11.2 Exponential Spline
Assume that f ∈ W_2^2[a, b]. Suppose prior knowledge suggests that f can be well approximated by a linear combination of 1 and exp(−γx) for a fixed γ ≠ 0. Consider the parametric model space

H_0 = span{1, exp(−γx)}.

It is easy to see that H_0 is the kernel of the differential operator L = D² + γD. Nevertheless, we derive this operator following the procedure discussed in Section 2.11.1. Let u_1(x) = 1 and u_2(x) = exp(−γx). The Wronskian matrix is

W(x) = [ 1, exp(−γx)
         0, −γ exp(−γx) ].

Then

(ω_0(x), ω_1(x))^T = − [ 1, 0
                         1/γ, −(1/γ) exp(γx) ] (0, γ² exp(−γx))^T = (0, γ)^T.

Thus we have

L = D² + γD.
Since

{W^T(a) W(a)}^{−1} = [ 1 + 1/γ², −(1/γ²) exp(γa)
                       −(1/γ²) exp(γa), (1/γ²) exp(2γa) ],

the RK of H_0 is

R_0(x, z) = 1 + 1/γ² − (1/γ²) exp(−γx*) − (1/γ²) exp(−γz*) + (1/γ²) exp{−γ(x* + z*)},   (2.57)

where x* = x − a and z* = z − a. The Green function is

G(x, s) = (1/γ)[1 − exp{−γ(x − s)}] for s ≤ x, and G(x, s) = 0 for s > x.
Thus the RK of H_1 is

R_1(x, z) = ∫_a^b G(x, s) G(z, s) ds
  = ∫_0^{x*∧z*} (1/γ²)[1 − exp{−γ(x* − s*)}][1 − exp{−γ(z* − s*)}] ds*
  = (1/γ³){ γ(x*∧z*) + exp(−γx*) + exp(−γz*)
            − exp{γ(x*∧z* − x*)} − exp{γ(x*∧z* − z*)}
            − (1/2) exp{−γ(x* + z*)} + (1/2) exp[γ{2(x*∧z*) − x* − z*}] }.   (2.58)
Evaluations of the RK R_1 for some simple L-splines can be calculated using the lspline function in the assist package. The argument type specifies the type of L-spline. The option type="exp" computes R_1 in (2.58) for the special case when a = 0 and γ = 1. For general a and γ ≠ 0, the RK R_1 can be computed using the simple transformation x̃ = γ(x − a).
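The closed form (2.58) can be sanity-checked against direct numerical integration of the Green functions. A sketch (in Python rather than the book's R; the function names are ours):

```python
import math

def R1_formula(x, z, gamma, a):
    # RK of H1 for the exponential spline, eq. (2.58); xs, zs are x - a, z - a
    xs, zs = x - a, z - a
    m, g = min(xs, zs), gamma
    return (1.0 / g**3) * (
        g * m + math.exp(-g * xs) + math.exp(-g * zs)
        - math.exp(g * (m - xs)) - math.exp(g * (m - zs))
        - 0.5 * math.exp(-g * (xs + zs))
        + 0.5 * math.exp(g * (2 * m - xs - zs)))

def R1_quadrature(x, z, gamma, a, n=20000):
    # Midpoint-rule evaluation of int_a^b G(x,s) G(z,s) ds, where
    # G(x,s) = (1/gamma)[1 - exp(-gamma (x - s))] for s <= x and 0 otherwise,
    # so the integrand vanishes for s > min(x, z).
    upper = min(x, z)
    h = (upper - a) / n
    total = 0.0
    for k in range(n):
        s = a + (k + 0.5) * h
        gx = (1 - math.exp(-gamma * (x - s))) / gamma
        gz = (1 - math.exp(-gamma * (z - s))) / gamma
        total += gx * gz
    return total * h

print(abs(R1_formula(0.7, 0.4, 2.0, 0.1) - R1_quadrature(0.7, 0.4, 2.0, 0.1)))
```

The two evaluations agree to quadrature accuracy, and the formula is symmetric in x and z, as an RK must be.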
The weight loss data contain weight measurements of a male obese patient since the start of a weight rehabilitation program. We now illustrate how to fit an exponential spline to the weight loss data. Observations are shown in Figure 2.9(a). Let y = Weight and x = Days. We first fit a nonlinear regression model as in Venables and Ripley (2002):

y_i = β_1 + β_2 exp(−β_3 x_i) + ǫ_i,  i = 1, ..., 51.   (2.59)
The following statements fit model (2.59):
> library(MASS); attach(wtloss)
> y <- Weight; x <- Days
> weight.nls <- nls(y~b1+b2*exp(-b3*x),
start=list(b1=81,b2=102,b3=.005))
The fit is shown in Figure 2.9(a).
FIGURE 2.9 Weight loss data, plots of (a) observations (circles), the nonlinear regression fit (dotted line), the cubic spline fit (dashed line), and the exponential spline fit (solid line); and (b) the cubic spline fit and 95% Bayesian confidence intervals minus the nonlinear regression fit (dashed lines), and the exponential spline fit and 95% Bayesian confidence intervals minus the nonlinear regression fit (solid lines).
Next we consider the nonparametric regression model (1.1). It is reasonable to assume that the regression function can be well approximated by H_0 = span{1, exp(−γx)}, where γ = β̂_3 = 0.0048 is the LS estimate of β_3 in model (2.59). We now fit the exponential spline:

> r <- coef(weight.nls)[3]
> ssr(y~exp(-r*x), rk=lspline(r*x,type="exp"))

The exponential spline fit in Figure 2.9(a) is essentially the same as that from the nonlinear regression model. The parameter β_3 was fixed at β̂_3 in the above construction of the exponential spline. One may instead treat β_3 as a parameter and estimate it using the GCV criterion as in Gu (2002):

> gcv.fun <- function(r) ssr(y~exp(-r*x),
            rk=lspline(r*x,type="exp"))$rkpk.obj$score
> nlm(gcv.fun,.001)$estimate
0.004884513
For comparison, the cubic spline fit is also shown in Figure 2.9(a). To look at the difference between the cubic and exponential splines more closely, we plot their fits and 95% Bayesian confidence intervals minus the fit from nonlinear regression in Figure 2.9(b). It is clear that the confidence intervals for the exponential spline are narrower.

Figure 2.9 is essentially the same as Figure 4.3 in Gu (2002). A different approach was used to fit the exponential spline in Gu (2002): it is shown that fitting the exponential spline is equivalent to fitting a cubic spline to the transformed variable x̃ = 1 − exp(−γx). Thus the following statements lead to the same exponential spline fit:
> tx <- 1-exp(-r*x)
> ssr(y~tx, rk=cubic2(tx))
2.11.3 Logistic Spline
Assume that f ∈ W_2^2[a, b]. Suppose prior knowledge suggests that f can be well approximated by a logistic model H_0 = span{1/(1 + δ exp(−γx))} for some fixed δ > 0 and γ > 0. It is easy to see that H_0 is the kernel of the differential operator

L = D − δγ exp(−γx)/{1 + δ exp(−γx)}.

The Wronskian is a 1 × 1 matrix W(x) = {1 + δ exp(−γx)}^{−1}. Since {W^T(a) W(a)}^{−1} = {1 + δ exp(−γa)}², the RK of H_0 is

R_0(x, z) = {1 + δ exp(−γa)}² / [{1 + δ exp(−γx)}{1 + δ exp(−γz)}].   (2.60)

The Green function is

G(x, s) = {1 + δ exp(−γs)}/{1 + δ exp(−γx)} for s ≤ x, and G(x, s) = 0 for s > x.

Thus the RK of H_1 is

R_1(x, z) = {1 + δ exp(−γx)}^{−1} {1 + δ exp(−γz)}^{−1} { x∧z − a
            + 2δγ^{−1}[exp(−γa) − exp{−γ(x∧z)}]
            + δ²(2γ)^{−1}[exp(−2γa) − exp{−2γ(x∧z)}] }.   (2.61)
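As with the exponential spline, (2.61) can be checked against direct quadrature of the Green functions. A sketch (Python rather than the book's R; names are ours):

```python
import math

def logit_R1(x, z, a, d, g):
    # RK of H1 for the logistic spline, eq. (2.61)
    m = min(x, z)
    pre = 1.0 / ((1 + d * math.exp(-g * x)) * (1 + d * math.exp(-g * z)))
    return pre * (m - a
                  + 2 * d / g * (math.exp(-g * a) - math.exp(-g * m))
                  + d**2 / (2 * g) * (math.exp(-2 * g * a) - math.exp(-2 * g * m)))

def logit_R1_quad(x, z, a, d, g, n=20000):
    # Midpoint-rule evaluation of int_a^{x ∧ z} G(x,s) G(z,s) ds with
    # G(x,s) = (1 + d e^{-g s}) / (1 + d e^{-g x}) for s <= x and 0 otherwise
    m = min(x, z)
    h = (m - a) / n
    denom = (1 + d * math.exp(-g * x)) * (1 + d * math.exp(-g * z))
    total = sum((1 + d * math.exp(-g * (a + (k + 0.5) * h)))**2 for k in range(n))
    return total * h / denom

print(abs(logit_R1(0.8, 0.5, 0.0, 2.0, 1.5) - logit_R1_quad(0.8, 0.5, 0.0, 2.0, 1.5)))
```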
The paramecium caudatum data consist of growth measurements of a paramecium caudatum population in the medium of Osterhout. We now illustrate how to fit a logistic spline to these data. Observations are shown in Figure 2.10. Let y = density and x = day. We first fit the following logistic growth model

y_i = β_1/{1 + β_2 exp(−β_3 x_i)} + ǫ_i,  i = 1, ..., 25,   (2.62)
using the statements
> data(paramecium); attach(paramecium)
> para.nls <- nls(density~b1/(1 + b2*exp(-b3*day)),
start=list(b1=202,b2=164,b3=0.74))
Initial values are the estimates in Neal (2004). The fit is shown in Figure 2.10.
FIGURE 2.10 Paramecium caudatum data, observations (circles), the nonlinear regression fit (dotted line), the cubic spline fit (dashed line), and the logistic spline fit (solid line).
Now we consider the nonparametric regression model (2.7). It is reasonable to assume that the regression function can be well approximated by H_0 = span{1/(1 + δ exp(−γx))}, where δ = β̂_2 = 705.9496 and γ = β̂_3 = 0.9319 are the LS estimates in model (2.62).

The option type="logit" in the lspline function computes R_1 in (2.61) for the special case when a = 0, δ = 1, and γ = 1. It cannot be adapted to compute R_1 in the general situation. We take this opportunity to show how to write a function for computing an RK.
> logit.rk <- function(x,a,d,r) {
    tmp1 <- x%o%rep(1,length(x))
    tmp2 <- (tmp1+t(tmp1)-abs(tmp1-t(tmp1)))/2   # pairwise minima x ∧ z
    tmp3 <- exp(-r*a)-exp(-r*tmp2)
    tmp4 <- exp(-2*r*a)-exp(-2*r*tmp2)
    tmp5 <- 1/((1+d*exp(-r*x))%o%(1+d*exp(-r*x)))
    (tmp2-a+2*d*tmp3/r+d**2*tmp4/(2*r))*tmp5
  }
> bh <- coef(para.nls)
> ssr(density~I(1/(1+bh[2]*exp(-bh[3]*day)))-1,
      rk=logit.rk(day,0,bh[2],bh[3]), spar="m")
The function logit.rk computes R_1 in (2.61). Since the sample size is small, the GML method is used to select the smoothing parameter. The logistic spline fit is shown in Figure 2.10. The fit is essentially the same as that from the logistic growth model (2.62). For comparison, the cubic spline fit is also shown in Figure 2.10. The logistic spline fit smooths out oscillations after 10 days, while the cubic spline fit preserves them.
To include the constant function in H_0, one may consider the operator

L = D{ D − δγ exp(−γx)/(1 + δ exp(−γx)) }.

Details of this situation can be found in Gu (2002).
2.11.4 Linear-Periodic Spline
Assume that f ∈ W_2^4[a, b]. Suppose prior knowledge suggests that f can be well approximated by the parametric model

H_0 = span{1, x, cos x, sin x}.

It is easy to check that H_0 is the kernel of the differential operator

L = D⁴ + D².

The Wronskian matrix and its inverse are, respectively,

W(x) = [ 1, x, cos x, sin x
         0, 1, −sin x, cos x
         0, 0, −cos x, −sin x
         0, 0, sin x, −cos x ]

and

W^{−1}(x) = [ 1, −x, 1, −x
              0, 1, 0, 1
              0, 0, −cos x, sin x
              0, 0, −sin x, −cos x ].

For simplicity, suppose a = 0. Then

{W^T(0) W(0)}^{−1} = [ 2, 0, −1, 0
                       0, 2, 0, −1
                       −1, 0, 1, 0
                       0, −1, 0, 1 ].

Therefore, the RK of H_0 is

R_0(x, z) = 2 − cos x − cos z + 2xz − x sin z − z sin x + cos x cos z + sin x sin z.   (2.63)
The Green function is

G(x, s) = x − s − sin(x − s) for s ≤ x, and G(x, s) = 0 for s > x.

Thus the RK of H_1 is

R_1(x, z) = −(1/6)(x∧z)³ + (1/2) xz (x∧z) − |x − z| − sin x − sin z
            + x cos z + z cos x + (1/2)(x∧z) cos(z − x)
            + (5/4) sin|x − z| − (1/4) sin(x + z).   (2.64)
The option type="linSinCos" in the lspline function computes R_1 in (2.64) for the special case when a = 0. The translation x − a may be used when a ≠ 0. When the null space is H_0 = span{1, x, cos τx, sin τx} for a fixed τ ≠ 0, the corresponding differential operator is L = D⁴ + τ²D². The linear-periodic spline with a general τ can be fitted using the transformation x̃ = τx.
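The closed form (2.64) can likewise be checked against numerical integration of the Green function G(x, s) = x − s − sin(x − s). A sketch with a = 0 (Python rather than the book's R; names are ours):

```python
import math

def linper_R1(x, z):
    # RK of H1 for the linear-periodic spline with a = 0, eq. (2.64)
    m = min(x, z)
    return (-(m**3) / 6 + x * z * m / 2 - abs(x - z)
            - math.sin(x) - math.sin(z) + x * math.cos(z) + z * math.cos(x)
            + m * math.cos(z - x) / 2
            + 1.25 * math.sin(abs(x - z)) - 0.25 * math.sin(x + z))

def linper_R1_quad(x, z, n=20000):
    # Midpoint quadrature of int_0^{x ∧ z} G(x,s) G(z,s) ds,
    # with G(x,s) = x - s - sin(x - s) for s <= x and 0 otherwise
    m = min(x, z)
    h = m / n
    total = 0.0
    for k in range(n):
        s = (k + 0.5) * h
        total += (x - s - math.sin(x - s)) * (z - s - math.sin(z - s))
    return total * h

print(abs(linper_R1(1.3, 0.9) - linper_R1_quad(1.3, 0.9)))
```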
The melanoma data contain numbers of melanoma cases per 100,000 people in Connecticut from 1936 to 1972. We now illustrate how to fit a linear-periodic spline to the melanoma data. The observations are shown in Figure 2.11. There are two apparent trends: a nearly linear long-term trend over the years, and a cycle of around 10 years corresponding to the sunspot cycle. Let y = cases and x = year. As in Heckman and Ramsay (2000), we fit a linear-periodic spline with L = D⁴ + τ²D², where τ = 0.58.
> library(fda); attach(melanoma)
> x <- year-1936; y <- incidence
> tau <- .58; tx <- tau*x
> ssr(y~tx+cos(tx)+sin(tx),
      rk=lspline(tx,type="linSinCos"), spar="m")
Again, since the sample size is small, the GML method is used to select the smoothing parameter. For comparison, the cubic spline fit with smoothing parameter selected by the GML method is also shown in Figure 2.11.
FIGURE 2.11 Melanoma data, observations (circles), the cubic spline fit (dashed line), and the linear-periodic spline fit (solid line).
2.11.5 Trigonometric Spline
Suppose f is a periodic function on X = [0, 1]. Assume that f ∈ W_2^m(per). Then we have the Fourier expansion

f(x) = a_0 + Σ_{ν=1}^∞ a_ν cos 2πνx + Σ_{ν=1}^∞ b_ν sin 2πνx
     = a_0 + Σ_{ν=1}^{m−1} a_ν cos 2πνx + Σ_{ν=1}^{m−1} b_ν sin 2πνx + Rem(x),   (2.65)

where the first three terms in the second line of (2.65) represent a trigonometric polynomial of degree m−1 and Rem(x) ≜ Σ_{ν=m}^∞ (a_ν cos 2πνx + b_ν sin 2πνx). The penalty used in Section 2.7 for a periodic spline corresponds to L = D^m with kernel H_0 = span{1}. That is, all nonconstant functions, including those in the trigonometric polynomial, are penalized. Analogous to the Taylor expansion and polynomial splines, one may want to include lower-degree trigonometric polynomials in the null space. The operator L = D^m does not decompose W_2^m(per) into lower-degree trigonometric polynomials plus the remainder terms.
Now consider the null space

H_0 = span{1, sin 2πνx, cos 2πνx, ν = 1, ..., m−1},   (2.66)

which includes trigonometric polynomials with degree up to m−1. Assume that f ∈ W_2^{m+1}(per). It is easy to see that H_0 is the kernel of the differential operator

L = D Π_{ν=1}^{m−1} {D² + (2πν)²}.   (2.67)
Model space construction and decomposition of W_2^{m+1}(per)

The space W_2^{m+1}(per) is an RKHS with the inner product

(f, g) = (∫_0^1 f dx)(∫_0^1 g dx)
         + Σ_{ν=1}^{m−1} (∫_0^1 f cos 2πνx dx)(∫_0^1 g cos 2πνx dx)
         + Σ_{ν=1}^{m−1} (∫_0^1 f sin 2πνx dx)(∫_0^1 g sin 2πνx dx)
         + ∫_0^1 (Lf)(Lg) dx,

where L is defined in (2.67). Furthermore, W_2^{m+1}(per) = H_0 ⊕ H_1, where H_0 is given in (2.66) and H_1 = W_2^{m+1}(per) ⊖ H_0. H_0 and H_1 are RKHS's with corresponding RKs

R_0(x, z) = 1 + Σ_{ν=1}^{m−1} cos 2πν(x − z),
R_1(x, z) = Σ_{ν=m}^∞ [2/(2π)^{4m+2}] Π_{j=0}^{m−1} (j² − ν²)^{−2} cos 2πν(x − z).   (2.68)
Sometimes f satisfies the constraint ∫_0^1 f dx = 0. This constraint can be handled easily by removing constant functions in the above construction. Specifically, let

H_0 = span{sin 2πνx, cos 2πνx, ν = 1, ..., m−1}.   (2.69)

Assume that f ∈ W_2^m(per) ⊖ {1}. Then H_0 is the kernel of the differential operator

L = Π_{ν=1}^{m−1} {D² + (2πν)²}.   (2.70)
Model space construction and decomposition of W_2^m(per) ⊖ {1}

The space W_2^m(per) ⊖ {1} is an RKHS with the inner product

(f, g) = Σ_{ν=1}^{m−1} (∫_0^1 f cos 2πνx dx)(∫_0^1 g cos 2πνx dx)
         + Σ_{ν=1}^{m−1} (∫_0^1 f sin 2πνx dx)(∫_0^1 g sin 2πνx dx)
         + ∫_0^1 (Lf)(Lg) dx,

where L is defined in (2.70). Furthermore, W_2^m(per) ⊖ {1} = H_0 ⊕ H_1, where H_0 is given in (2.69) and H_1 = W_2^m(per) ⊖ {1} ⊖ H_0. H_0 and H_1 are RKHS's with RKs

R_0(x, z) = Σ_{ν=1}^{m−1} cos 2πν(x − z),
R_1(x, z) = Σ_{ν=m}^∞ [2/(2π)^{4(m−1)}] Π_{j=1}^{m−1} (j² − ν²)^{−2} cos 2πν(x − z).   (2.71)
In the lspline function, the options type="sine1" and type="sine0" compute the RKs R_1 in (2.68) and (2.71), respectively.

We now use the Arosa data to show how to fit a trigonometric spline and illustrate its difference from periodic and partial splines. Suppose we want to investigate how ozone thickness changes over months in a year. We have fitted a cubic periodic spline with f ∈ W_2^2(per) in Section 2.7. Note that W_2^2(per) = H_{10} ⊕ H_{11}, where H_{10} = {1} and H_{11} = W_2^2(per) ⊖ {1}. Therefore, we can rewrite the periodic spline model as

y_i = f_{10}(x_i) + f_{11}(x_i) + ǫ_i,  i = 1, ..., n,   (2.72)

where f_{10} and f_{11} are projections onto H_{10} and H_{11}, respectively. The estimates of the overall function, f_{10} + f_{11}, and its projections f_{10} (parametric) and f_{11} (smooth) are shown in the first row of Figure 2.12.

It is apparent that the monthly pattern can be well approximated by a simple sinusoidal function in the model space

P = span{1, sin 2πx, cos 2πx}.   (2.73)

Suppose we want to check the departure from the parametric model P. One approach is to add the sine and cosine functions to the null space
FIGURE 2.12 Arosa data, plots of the overall fits (left), the parametric components (middle), and the smooth components (right) of the periodic spline (top), the partial spline (middle), and the L-spline (bottom). Parametric components represent components in the spaces H_{10}, P, and P for the periodic spline, partial spline, and L-spline, respectively. Smooth components represent components in the spaces H_{11}, H_{11}, and W_2^3(per) ⊖ P for the periodic spline, partial spline, and L-spline, respectively. Dotted lines are 95% Bayesian confidence intervals.
H_{10} of the periodic spline. This leads to the following partial spline model

y_i = f_{20}(x_i) + f_{21}(x_i) + ǫ_i,  i = 1, ..., n,   (2.74)

where f_{20} ∈ P and f_{21} ∈ H_{11}. Model (2.74) is fitted as follows:
> ssr(thick~sin(2*pi*x)+cos(2*pi*x), rk=periodic(x))
The estimates of the overall function and its two components f_{20} (parametric) and f_{21} (smooth) are shown in the second row of Figure 2.12.

Another approach is to fit a trigonometric spline with m = 2:

y_i = f_{30}(x_i) + f_{31}(x_i) + ǫ_i,  i = 1, ..., n,   (2.75)

where f_{30} ∈ P and f_{31} ∈ W_2^3(per) ⊖ P. Model (2.75) is fitted as follows:
> ssr(thick~sin(2*pi*x)+cos(2*pi*x),
      rk=lspline(x,type="sine1"))
The estimates of the overall function and its two components f_{30} (parametric) and f_{31} (smooth) are shown in the third row of Figure 2.12.

All three models have similar overall fits. However, their components are quite different. The smooth component of the periodic spline reveals the departure from a constant function. To check the departure from the simple sinusoidal model space P, we can look at the smooth components from the partial and L-splines. Estimates of the smooth components from the partial and L-splines are similar. However, the confidence intervals based on the L-spline are narrower than those based on the partial spline. This is because the two components f_{30} and f_{31} in the L-spline are orthogonal, while the two components f_{20} and f_{21} in the partial spline are not necessarily orthogonal. Therefore, we can expect inference based on the L-spline to be more efficient in general.
Chapter 3
Smoothing Parameter Selection and Inference
3.1 Impact of the Smoothing Parameter
The penalized least squares (2.11) represents a compromise between the goodness-of-fit and a penalty on the departure from the null space H_0. The balance is controlled by the smoothing parameter λ. As λ varies from 0 to ∞, we have a family of estimates, with f̂ ∈ H_0 when λ = ∞.
To illustrate the impact of the smoothing parameter, consider the Stratford weather data, consisting of daily maximum temperatures in Stratford, Texas, during 1990. Observations are shown in Figure 3.1. Consider the regression model (1.1), where n = 73 and f represents expected maximum temperature as a function of time in a year. Denote x as the time variable scaled into [0, 1]. It is reasonable to assume that f is a smooth periodic function. In particular, we assume that f ∈ W_2^2(per). For a fixed λ, say 0.001, one can fit the cubic periodic spline as follows:
> data(Stratford); attach(Stratford)
> ssr(y~1, rk=periodic(x), limnla=log10(73*.001))
where the argument limnla specifies a search range for log10(nλ). To see how a spline fit is affected by the choice of λ, periodic spline fits with six different values of λ are shown in Figure 3.1. It is obvious that the fit with λ = ∞ is a constant, that is, f̂_∞ ∈ H_0. The fit with λ = 0 interpolates the data. A larger λ leads to a smoother fit. Both λ = 0.0001 and λ = 0.00001 lead to visually reasonable fits.
In practice it is desirable to select the smoothing parameter using an objective method rather than visual inspection. In a sense, a data-driven choice of λ allows the data to speak for themselves. Thus, it is no exaggeration to say that the choice of λ is the spirit and soul of nonparametric regression.
We now inspect how λ controls the fit. Again, consider model (1.1) for the Stratford weather data. Let us first consider a parametric approach that approximates f using a trigonometric polynomial up to a certain
FIGURE 3.1 Stratford weather data, plot of observations and the periodic spline fits with different smoothing parameters (λ = 0.001, 10⁻⁴, 10⁻⁵, 10⁻⁶, 0, and ∞).
degree, say k, where 0 ≤ k ≤ K and K = (n − 1)/2 = 36. Denote the corresponding parametric model space for f as

    M_k = span{1, √2 sin 2πνx, √2 cos 2πνx, ν = 1, …, k},    (3.1)
where M_0 = span{1}. For a fixed k, write the regression model based on M_k in matrix form as

    y = X_k β_k + ǫ,

where

    X_k = ⎡ 1  √2 sin 2πx_1  √2 cos 2πx_1  ⋯  √2 sin 2πkx_1  √2 cos 2πkx_1 ⎤
          ⎢ 1  √2 sin 2πx_2  √2 cos 2πx_2  ⋯  √2 sin 2πkx_2  √2 cos 2πkx_2 ⎥
          ⎢ ⋮        ⋮              ⋮      ⋱        ⋮              ⋮       ⎥
          ⎣ 1  √2 sin 2πx_n  √2 cos 2πx_n  ⋯  √2 sin 2πkx_n  √2 cos 2πkx_n ⎦

is the design matrix, x_i = i/n, β_k = (β_1, …, β_{2k+1})^T, and ǫ = (ǫ_1, …, ǫ_n)^T. Since design points are equally spaced, we have the following
orthogonality relations:

    (2/n) ∑_{i=1}^n cos 2πνx_i cos 2πµx_i = δ_{ν,µ},   1 ≤ ν, µ ≤ K,
    (2/n) ∑_{i=1}^n sin 2πνx_i sin 2πµx_i = δ_{ν,µ},   1 ≤ ν, µ ≤ K,
    (2/n) ∑_{i=1}^n cos 2πνx_i sin 2πµx_i = 0,         1 ≤ ν, µ ≤ K,
where δ_{ν,µ} is the Kronecker delta. Therefore, X_k^T X_k = nI_{2k+1}, where I_{2k+1} is an identity matrix of size 2k + 1. Note that X_K/√n is an orthogonal matrix. Define the discrete Fourier transformation ỹ = X_K^T y/n. Then the LS estimate of β_k is β̂_k = (X_k^T X_k)^{-1} X_k^T y = X_k^T y/n = ỹ_k, where ỹ_k consists of the first 2k + 1 elements of the discrete Fourier transformation ỹ. More explicitly,
    β̂_1 = (1/n) ∑_{i=1}^n y_i = ỹ_1,
    β̂_{2ν} = (√2/n) ∑_{i=1}^n y_i sin 2πνx_i = ỹ_{2ν},      1 ≤ ν ≤ k,    (3.2)
    β̂_{2ν+1} = (√2/n) ∑_{i=1}^n y_i cos 2πνx_i = ỹ_{2ν+1},  1 ≤ ν ≤ k.
Now consider modeling f using the cubic periodic spline space W_2^2(per). The exact solution was given in Chapter 2. To simplify the argument, let us consider the following PLS

    min_{f ∈ M_K} { (1/n) ∑_{i=1}^n (y_i − f(x_i))² + λ ∫_0^1 (f″)² dx },    (3.3)

where the model space W_2^2(per) is approximated by M_K. The following discussion holds true for the exact solution in W_2^2(per) (Gu 2002). However, the approximation makes the following argument simpler and more transparent.
Let

    f(x) = α_1 + ∑_{ν=1}^K (α_{2ν} √2 sin 2πνx + α_{2ν+1} √2 cos 2πνx)

be the solution to (3.3). Then f ≜ (f(x_1), …, f(x_n))^T = X_K α, where α = (α_1, …, α_{2K+1})^T. The LS

    (1/n)||y − f||² = (1/n)||(1/√n) X_K^T (y − f)||²
                    = ||(1/n) X_K^T y − (1/n) X_K^T X_K α||²
                    = ||ỹ − α||².
Thus (3.3) reduces to the following ridge regression problem

    (α_1 − ỹ_1)² + ∑_{ν=1}^K {(α_{2ν} − ỹ_{2ν})² + (α_{2ν+1} − ỹ_{2ν+1})²}
        + λ ∑_{ν=1}^K (2πν)^4 (α_{2ν}² + α_{2ν+1}²).    (3.4)
The solutions to (3.4) are

    α̂_1 = ỹ_1,
    α̂_{2ν} = ỹ_{2ν}/(1 + λ(2πν)^4),      ν = 1, …, K,    (3.5)
    α̂_{2ν+1} = ỹ_{2ν+1}/(1 + λ(2πν)^4),  ν = 1, …, K.
Thus the periodic spline is essentially a low-pass filter: components at frequency ν are downweighted by a factor of 1 + λ(2πν)^4. Figure 3.2 shows how λ controls the nature of the filter: more high frequencies are filtered out as λ increases. Comparing (3.2) and (3.5), it is clear that selecting an order k for the trigonometric polynomial model may be viewed as hard thresholding, and selecting the smoothing parameter λ for the periodic spline may be viewed as soft thresholding.
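The filtering interpretation can be checked numerically. The following R sketch is not the book's code (object names such as XK and ytilde are illustrative); it builds the Fourier design matrix of this section, applies the soft-thresholding weights of (3.5), and contrasts them with the hard-thresholding truncation of (3.2).

```r
# Sketch: the periodic spline as a low-pass filter under the M_K approximation.
n <- 73; K <- (n - 1)/2
x <- (1:n)/n
set.seed(1)
y <- sin(4*pi*x^2) + rnorm(n, sd = 0.5)

# Fourier design matrix X_K: columns 1, sqrt(2)sin(2*pi*nu*x), sqrt(2)cos(2*pi*nu*x)
XK <- cbind(1, do.call(cbind, lapply(1:K, function(nu)
  cbind(sqrt(2)*sin(2*pi*nu*x), sqrt(2)*cos(2*pi*nu*x)))))
ytilde <- drop(t(XK) %*% y)/n                 # discrete Fourier transform of y

lambda <- 1e-4
w <- 1/(1 + lambda*(2*pi*(1:K))^4)            # filter weights of Figure 3.2
alpha <- c(ytilde[1], rep(w, each = 2)*ytilde[-1])   # soft thresholding (3.5)
fhat <- drop(XK %*% alpha)                    # smoothed fit

# hard thresholding: keep frequencies up to order k, as in (3.2)
k <- 5
alpha.hard <- ytilde * c(1, rep(as.numeric(1:K <= k), each = 2))
fhat.hard <- drop(XK %*% alpha.hard)
```

With λ = 0 all weights equal one and the fit interpolates the data; as λ → ∞ only the constant term survives, matching the λ = ∞ panel of Figure 3.1.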
Now consider the general spline model (2.10). From (2.26), the hat matrix H(λ) = I − nλQ_2(Q_2^T M Q_2)^{-1} Q_2^T. Let UEU^T be the eigendecomposition of Q_2^T Σ Q_2, where U is an (n−p) × (n−p) orthogonal matrix and E = diag(e_1, …, e_{n−p}). The projection onto the space spanned by T is

    P_T ≜ T(T^T T)^{-1} T^T = Q_1 R(R^T R)^{-1} R^T Q_1^T = Q_1 Q_1^T.

Then

    H(λ) = I − nλ Q_2 U(E + nλI)^{-1} U^T Q_2^T                              (3.6)
         = Q_1 Q_1^T + Q_2 Q_2^T − nλ Q_2 U(E + nλI)^{-1} U^T Q_2^T
         = P_T + Q_2 U V U^T Q_2^T,                                          (3.7)
[FIGURE 3.2 Weights of the periodic spline filter, 1/(1 + λ(2πν)^4), plotted as a function of frequency ν. Six curves from top down correspond to six different λ: 0, 10^{-6}, 10^{-5}, 10^{-4}, 10^{-3}, and ∞.]
where V = diag(e_1/(e_1 + nλ), …, e_{n−p}/(e_{n−p} + nλ)). The hat matrix is divided into two mutually orthogonal matrices: one is the projection onto the space spanned by T, and the other is responsible for shrinking the part of the signal that is orthogonal to T. The smoothing parameter shrinks eigenvalues in the form e_ν/(e_ν + nλ). The choices λ = ∞ and λ = 0 lead to the parametric model H0 and interpolation, respectively.
Equation (3.7) also indicates that the hat matrix H(λ) is nonnegative definite. However, unlike the projection matrix for a parametric model, H(λ) is usually not idempotent. H(λ) has p eigenvalues equal to one and the remaining eigenvalues less than one when λ > 0.
3.2 Trade-Offs
Before introducing methods for selecting the smoothing parameter, it is helpful to discuss some basic concepts and principles for model selection. In general, model selection boils down to compromises between different aspects of a model. Occam's razor has been the guiding principle for these compromises: the model that fits observations sufficiently well in the least complex way should be preferred. To be precise about "fits observations sufficiently well," one needs a quantity that measures how well a model fits the data. One such measure is the LS in (1.6). To be precise about "the least complex way," one needs a quantity that measures the complexity of a
model. For a parametric model, a common measure of model complexity is the number of parameters in the model, often called the degrees of freedom (df). For example, the df of model M_k in (3.1) equals 2k + 1.
What would be a good measure of model complexity for a nonparametric regression procedure? Consider the general nonparametric regression model (2.10). Let f_i = L_i f, and f = (f_1, …, f_n)^T. Let f̂ be an estimate of f based on a modeling procedure M, and f̂_i = L_i f̂. Ye (1998) defined the generalized degrees of freedom (gdf) of M as

    gdf(M) ≜ ∑_{i=1}^n ∂E_f(f̂_i)/∂f_i.    (3.8)
The gdf is an extension of the standard degrees of freedom to general modeling procedures. It can be viewed as the sum of the average sensitivities of the fitted values f̂_i to a small change in the response. It is easy to check that (Efron 2004)

    gdf(M) = (1/σ²) ∑_{i=1}^n Cov(f̂_i, y_i),

where ∑_{i=1}^n Cov(f̂_i, y_i) is the so-called covariance penalty (Tibshirani and Knight 1999). For a spline estimate with a fixed λ, we have f̂ = H(λ)y based on (2.25). Denote the modeling procedure leading to f̂ as M_λ and H(λ) = {h_{ij}}_{i,j=1}^n. Then

    gdf(M_λ) = ∑_{i=1}^n ∂E_f(∑_{j=1}^n h_{ij} y_j)/∂f_i = ∑_{i=1}^n ∂(∑_{j=1}^n h_{ij} f_j)/∂f_i = trH(λ),
where tr represents the trace of a matrix. Even though λ does not have a physical interpretation as k does, trH(λ) is a useful measure of model complexity and will simply be referred to as the degrees of freedom.
For the Stratford weather data, Figure 3.3(a) depicts how trH(λ) for the cubic periodic spline depends on the smoothing parameter λ. It is clear that the degrees of freedom decrease as λ increases.
To illustrate the interplay between the LS and model complexity, we fit trigonometric polynomial models from the smallest model with k = 0 to the largest model with k = K. The square roots of the residual sum of squares (RSS) are plotted against the degrees of freedom (2k + 1) as circles in Figure 3.3(b). Similarly, we fit the periodic spline with a wide range of values for the smoothing parameter λ. Again, we plot the square root of RSS against the degrees of freedom (trH(λ)) as the solid line in Figure 3.3(b). Obviously, RSS decreases to zero (interpolation) as the degrees of freedom increase to n. The square root of RSS keeps
[FIGURE 3.3 Stratford data: plots of (a) degrees of freedom of the periodic spline against the smoothing parameter on the logarithm base 10 scale, and (b) square root of RSS from the trigonometric polynomial model (circles) and periodic spline (line) against the degrees of freedom.]
declining almost linearly after the initial big drop. It is quite clear that the constant model does not fit the data well. However, it is unclear which model fits observations sufficiently well.
Figure 3.3(b) shows that the LS and model complexity are two opposing aspects of a model: the approximation error decreases as the model complexity increases. Our goal is to find the "best" model that strikes a balance between these two conflicting aspects. To make the term "best" meaningful, we need a target criterion that quantifies a model's performance. It is clear that the LS cannot be used as the target because it would lead to the most complex model. Even though there is no universally accepted measure, some criteria are widely accepted and used in practice. We now introduce a criterion that is commonly used for regression models.
Consider the loss function

    L(λ) = (1/n)||f̂ − f||².    (3.9)

Define the risk function, also called mean squared error (MSE), as

    MSE(λ) ≜ EL(λ) = E{(1/n)||f̂ − f||²}.    (3.10)
We want the estimate f̂ to be as close to the true function f as possible. Obviously, MSE is the expectation of the Euclidean distance between the estimates and the true values. It can be decomposed into two components:
    MSE(λ) = (1/n)E||(Ef̂ − f) + (f̂ − Ef̂)||²
           = (1/n)E||Ef̂ − f||² + (2/n)E(Ef̂ − f)^T(f̂ − Ef̂) + (1/n)E||f̂ − Ef̂||²
           = (1/n)||Ef̂ − f||² + (1/n)E||f̂ − Ef̂||²
           = (1/n)||(I − H(λ))f||² + (σ²/n)trH²(λ)
           ≜ b²(λ) + v(λ),    (3.11)
where b² and v represent squared bias and variance, respectively. Note that the bias depends on the true function, while the variance does not. Based on the notation introduced in Section 3.1, let h = (h_1, …, h_{n−p})^T ≜ U^T Q_2^T f. From (3.6), we have
    b²(λ) = (1/n)||(I − H(λ))f||²
          = (1/n) f^T Q_2 U diag{(nλ/(e_1 + nλ))², …, (nλ/(e_{n−p} + nλ))²} U^T Q_2^T f
          = (1/n) ∑_{ν=1}^{n−p} (nλ h_ν/(e_ν + nλ))².
From (3.7), we have

    v(λ) = (σ²/n)trH²(λ) = (σ²/n)tr(P_T + Q_2 U V² U^T Q_2^T)
         = (σ²/n){p + ∑_{ν=1}^{n−p} (e_ν/(e_ν + nλ))²}.
The squared bias measures how well f̂ approximates the true function f, and the variance measures how well the function can be estimated. As λ increases from 0 to ∞, b²(λ) increases from 0 to ∑_{ν=1}^{n−p} h_ν²/n = ||Q_2^T f||²/n, while v(λ) decreases from σ² to pσ²/n. Therefore, the MSE represents a trade-off between bias and variance. Note that Q_2^T f represents the signal that is orthogonal to T. It is easy to check that db²(λ)/dλ|_{λ=0} = 0 and dv(λ)/dλ|_{λ=0} < 0.
Thus, dMSE(λ)/dλ|_{λ=0} < 0, and MSE(λ) has at least one minimizer λ* > 0. Therefore, when MSE(0) ≤ MSE(∞), there exists at least one λ* such that the corresponding PLS estimate performs better in
terms of MSE than the LS estimate in H0 and the interpolation. When MSE(0) > MSE(∞), considering the MSE as a function of δ = 1/λ, we have db²(δ)/dδ|_{δ=0} < 0 and dv(δ)/dδ|_{δ=0} = 0. Then, again, there exists at least one δ* such that the corresponding PLS estimate performs better in terms of MSE than the LS estimate in H0 and the interpolation.
To calculate the MSE, one needs to know the true function f. The following simulation illustrates the bias-variance trade-off. Observations are generated from model (1.1) with f(x) = sin(4πx²) and σ = 0.5. The same design points as in the Stratford weather data are used: x_i = i/n for i = 1, …, n and n = 73. The true function and one realization of observations are shown in Figure 3.4(a). For a fixed λ, the bias, variance, and MSE can be calculated since the true function is known in the simulation. For the cubic periodic spline, Figure 3.4(b) shows b²(λ), v(λ), and MSE(λ) as functions of log10(nλ). Obviously, as λ increases, the squared bias increases and the variance decreases. The MSE represents a compromise between bias and variance.
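Under the M_K approximation of Section 3.1, the hat matrix has the explicit form H(λ) = X_K D X_K^T/n with known diagonal shrinkage factors, so b²(λ), v(λ), and MSE(λ) can be computed exactly. The following R sketch is not from the book; the function name mse.parts is illustrative.

```r
# Sketch: exact squared bias, variance, and MSE for the cubic periodic
# spline under the Fourier approximation M_K (hat matrix H = X_K D X_K^T/n).
n <- 73; K <- (n - 1)/2; sigma <- 0.5
x <- (1:n)/n
f <- sin(4*pi*x^2)                     # true function of the simulation
XK <- cbind(1, do.call(cbind, lapply(1:K, function(nu)
  cbind(sqrt(2)*sin(2*pi*nu*x), sqrt(2)*cos(2*pi*nu*x)))))

mse.parts <- function(lambda) {
  D <- c(1, rep(1/(1 + lambda*(2*pi*(1:K))^4), each = 2))
  H <- XK %*% (D*t(XK))/n              # hat matrix for this lambda
  b2 <- sum(((diag(n) - H) %*% f)^2)/n # squared bias b^2(lambda)
  v <- sigma^2*sum(H*H)/n              # variance: sigma^2 tr(H^2)/n
  c(bias2 = b2, var = v, mse = b2 + v)
}
```

Evaluating mse.parts over a grid of λ reproduces the qualitative picture of Figure 3.4(b): at λ = 0 the bias vanishes and the variance equals σ², while large λ drives the variance toward σ²/n as the bias grows.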
[FIGURE 3.4 Plots of (a) true function (line) and observations (circles), and (b) squared bias b²(λ) (dashed line), variance v(λ) (dotted line), and MSE (solid line) for the cubic periodic spline.]
Another closely related target criterion is the average predictive squared error (PSE)

    PSE(λ) = E{(1/n)||y⁺ − f̂||²},    (3.12)

where y⁺ = f + ǫ⁺ are new observations of f, ǫ⁺ = (ǫ⁺_1, …, ǫ⁺_n)^T are independent of ǫ, and the ǫ⁺_i are independent and identically distributed
with mean zero and variance σ². PSE measures the performance of a model's prediction for new observations. We have

    PSE(λ) = E{(1/n)||(y⁺ − f) + (f − f̂)||²} = σ² + MSE(λ).
Thus PSE differs from MSE only by a constant σ². Ideally, one would want to select the λ that minimizes the MSE (PSE). This is, however, not practical because the MSE (PSE) depends on the unknown true function f that one wants to estimate in the first place. Instead, one may estimate the MSE (PSE) from the data and then minimize the estimated criterion. We discuss unbiased and cross-validation estimates of the PSE (MSE) in Sections 3.3 and 3.4, respectively.
3.3 Unbiased Risk
First consider the case when the error variance σ² is known. Since

    E{(1/n)||y − f̂||²} = E{(1/n)||y − f||² + (2/n)(y − f)^T(f − f̂) + (1/n)||f − f̂||²}
                       = σ² − (2σ²/n)trH(λ) + MSE(λ),    (3.13)

then

    UBR(λ) ≜ (1/n)||(I − H(λ))y||² + (2σ²/n)trH(λ)    (3.14)
is an unbiased estimate of PSE(λ). Since PSE differs from MSE only by a constant σ², one may expect the minimizer of UBR(λ) to be close to the minimizer of the risk function MSE(λ). In fact, a stronger result holds: under certain regularity conditions, UBR(λ) is a consistent estimate of the relative loss function L(λ) + n^{-1}ǫ^Tǫ (Gu 2002). The function UBR(λ) is referred to as the unbiased risk (UBR) criterion, and the minimizer of UBR(λ) is referred to as the UBR estimate of λ. It is obvious that UBR(λ) is an extension of Mallows' C_p criterion.
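For any linear smoother with hat matrix H and known σ², the criterion (3.14) is a one-line computation. A minimal R sketch follows; it is not the book's ssr implementation, and the function name ubr is illustrative.

```r
# Sketch: the UBR criterion (3.14) for a linear smoother yhat = H y,
# assuming the error variance sigma2 is known.
ubr <- function(H, y, sigma2) {
  n <- length(y)
  rss <- sum(((diag(n) - H) %*% y)^2)  # ||(I - H(lambda)) y||^2
  rss/n + 2*sigma2*sum(diag(H))/n      # goodness-of-fit + complexity penalty
}
```

In practice one evaluates ubr over a grid of λ, each λ giving its own hat matrix H(λ), and picks the minimizer.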
The error variance σ² is usually unknown in practice. In general, there are two classes of estimators for σ²: residual-based and difference-based estimators. The first class of estimators is based on residuals from an estimate of f. For example, analogous to parametric regression, an estimator of σ² based on the fit f̂ = H(λ)y is

    σ̂² ≜ ||(I − H(λ))y||²/(n − trH(λ)).    (3.15)
The estimator σ̂² is consistent under certain regularity conditions (Gu 2002). However, it depends critically on the smoothing parameter λ. Thus, it cannot be used in the UBR criterion, since the purpose of this criterion is to select λ. For choosing the amount of smoothing, it is desirable to have an estimator of σ² that does not require fitting the function f first.
The difference-based estimators of σ² do not require an estimate of the mean function f. The basic idea is to remove the mean function f by taking differences based on some well-chosen subsets of the data. Consider the general SSR model (2.10). Let I_j = {i(j,1), …, i(j,K_j)} ⊂ {1, …, n} be a subset of indices and d(j,k) be some fixed coefficients such that

    ∑_{k=1}^{K_j} d²(j,k) = 1,    ∑_{k=1}^{K_j} d(j,k) L_{i(j,k)} f ≈ 0,    j = 1, …, J.
Since

    E{∑_{k=1}^{K_j} d(j,k) y_{i(j,k)}}² ≈ E{∑_{k=1}^{K_j} d(j,k) ǫ_{i(j,k)}}² = σ²,

then

    σ̃² ≜ (1/J) ∑_{j=1}^J {∑_{k=1}^{K_j} d(j,k) y_{i(j,k)}}²    (3.16)

provides an approximately unbiased estimator of σ². The estimator σ̃² is referred to as a difference-based estimator since the d(j,k) are usually chosen to be contrasts such that ∑_{k=1}^{K_j} d(j,k) = 0. The specific choices of subsets and coefficients depend on factors including prior knowledge about f and the domain X.
Several methods have been proposed for the common situation when x is a univariate continuous variable, f is a smooth function, and the L_i are evaluational functionals. Suppose design points are ordered such that x_1 ≤ x_2 ≤ ⋯ ≤ x_n. Since f is smooth, f(x_{j+1}) − f(x_j) ≈ 0 when neighboring design points are close to each other. Setting I_j = {j, j+1} and d(j,1) = −d(j,2) = 1/√2 for j = 1, …, n−1, we have the first-order difference-based estimator proposed by Rice (1984):

    σ̂²_R = (1/(2(n−1))) ∑_{i=2}^n (y_i − y_{i−1})².    (3.17)
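In R, the Rice estimator (3.17) is a one-liner. The sketch below is not from the book; the simulated data are illustrative (true σ = 0.5, so σ̂²_R should be near 0.25).

```r
# Sketch: first-order difference-based variance estimator of Rice (3.17).
set.seed(3)
n <- 200
x <- sort(runif(n))
y <- sin(2*pi*x) + rnorm(n, sd = 0.5)        # true sigma^2 = 0.25
sigma2.rice <- sum(diff(y)^2)/(2*(n - 1))    # differencing removes smooth f
```

The differences y_i − y_{i−1} cancel the smooth trend almost entirely, so the average of their squared values, divided by two, targets σ².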
Hall, Kay, and Titterington (1990) proposed the mth-order difference-based estimator

    σ̂²_HKT = (1/(n−m)) ∑_{j=1}^{n−m} {∑_{k=1}^m δ_k y_{j+k}}²,    (3.18)

where the coefficients δ_k satisfy ∑_{k=1}^m δ_k = 0, ∑_{k=1}^m δ_k² = 1, and δ_1 δ_m ≠ 0. Optimal choices of δ_k are studied in Hall et al. (1990). It is easy to see that σ̂²_HKT corresponds to I_j = {j, …, j+m} and d(j,k) = δ_k for j = 1, …, n−m.
Both σ̂²_R and σ̂²_HKT require an ordering of the design points, which could be problematic for multivariate independent variables. Tong and Wang (2005) proposed a different method for a general domain X. Suppose X is equipped with a norm. Collect squared distances, d_{ij} = ||x_i − x_j||², for all pairs {x_i, x_j}, and half squared differences, s_{ij} = (y_i − y_j)²/2, for all pairs {y_i, y_j}. Then E(s_{ij}) = {f(x_i) − f(x_j)}²/2 + σ². Suppose {f(x_i) − f(x_j)}²/2 can be approximated by βd_{ij} when d_{ij} is small. Then the LS estimate of the intercept in the simple linear model

    s_{ij} = α + βd_{ij} + ǫ_{ij},    d_{ij} ≤ D,    (3.19)

provides an estimate of σ². Denote this estimator by σ̂²_TW. Theoretical properties and the choice of the bandwidth D were studied in Tong and Wang (2005).
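A minimal R sketch of (3.19), not from the book: all pairwise squared distances and half squared differences are formed, and the intercept of a least squares line fitted over small distances estimates σ². The bandwidth rule below (the 10th percentile of the distances) is an ad hoc assumption, not the choice studied by Tong and Wang.

```r
# Sketch: Tong-Wang difference-based variance estimator via (3.19).
set.seed(4)
n <- 100
x <- runif(n)
y <- sin(2*pi*x) + rnorm(n, sd = 0.5)        # true sigma^2 = 0.25
pr <- t(combn(n, 2))                          # all pairs (i, j)
d <- (x[pr[, 1]] - x[pr[, 2]])^2              # squared distances d_ij
s <- (y[pr[, 1]] - y[pr[, 2]])^2/2            # half squared differences s_ij
D <- quantile(d, 0.1)                         # ad hoc bandwidth choice
sigma2.tw <- coef(lm(s ~ d, subset = d <= D))[1]   # intercept estimates sigma^2
```

No ordering of the x_i is needed, which is what makes this estimator usable in multivariate domains.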
To illustrate the UBR criterion as an estimate of the PSE, we generate responses from model (1.1) with f(x) = sin(4πx²), σ = 0.5, x_i = i/n for i = 1, …, n, and n = 73. For the cubic periodic spline, the UBR functions based on 50 replications of simulated data are shown in Figure 3.5, where the true variance is used in (a) and the Rice estimator is used in (b).
3.4 Cross-Validation and Generalized Cross-Validation
Equation (3.13) shows that the RSS underestimates the PSE by the amount 2σ²trH(λ)/n. The second term in the UBR criterion corrects this bias. The bias in RSS is a consequence of using the same data for model fitting and model evaluation. Ideally, these two tasks should be separated using independent samples. This can be achieved by splitting the whole data into two subsamples: a training (calibration) sample for
[FIGURE 3.5 Plots of the PSE function as solid lines, the UBR functions with true σ² as dashed lines (left), and the UBR functions with the Rice estimator σ̂²_R as dashed lines (right). The minimum point of the PSE is marked as long bars at the bottom. The UBR estimates of log10(nλ) are marked as short bars.]
model fitting, and a test (validation) sample for model evaluation. This approach, however, is not efficient unless the sample size is large. The idea behind cross-validation is to recycle data by switching the roles of the training and test samples. For simplicity, we present leaving-out-one cross-validation only. That is, each time one observation is left out as the test sample, and the remaining n − 1 observations are used as the training sample.
Let f̂^[i] be the minimizer of the PLS based on all observations except y_i:

    (1/n) ∑_{j≠i} (y_j − L_j f)² + λ||P_1 f||².    (3.20)

The cross-validation estimate of the PSE is

    CV(λ) ≜ (1/n) ∑_{i=1}^n (L_i f̂^[i] − y_i)².    (3.21)
CV(λ) is referred to as the cross-validation criterion, and the minimizer of CV(λ) is called the cross-validation estimate of the smoothing parameter. Computation of f̂^[i] based on (3.21) for each i = 1, …, n would be costly. Fortunately, this is unnecessary due to the following lemma.
Leaving-out-one Lemma. For any fixed i, f̂^[i] is the minimizer of

    (1/n)(L_i f̂^[i] − L_i f)² + (1/n) ∑_{j≠i} (y_j − L_j f)² + λ||P_1 f||².    (3.22)

[Proof] For any function f, we have

    (1/n)(L_i f̂^[i] − L_i f)² + (1/n) ∑_{j≠i} (y_j − L_j f)² + λ||P_1 f||²
        ≥ (1/n) ∑_{j≠i} (y_j − L_j f)² + λ||P_1 f||²
        ≥ (1/n) ∑_{j≠i} (y_j − L_j f̂^[i])² + λ||P_1 f̂^[i]||²
        = (1/n)(L_i f̂^[i] − L_i f̂^[i])² + (1/n) ∑_{j≠i} (y_j − L_j f̂^[i])² + λ||P_1 f̂^[i]||²,
where the second inequality holds since f̂^[i] is the minimizer of (3.20).

The above lemma indicates that the solution to the PLS (3.20) without the ith observation, f̂^[i], is also the solution to the PLS (2.11) with the ith observation y_i replaced by the fitted value L_i f̂^[i]. Note that the hat matrix H(λ) depends on the model space and the operators L_i only. It does not depend on observations of the dependent variable. Therefore, the fits based on (2.11) and (3.22) have the same hat matrix. That is, f̂ = H(λ)y and f̂^[i] = H(λ)y^[i], where f̂^[i] = (L_1 f̂^[i], …, L_n f̂^[i])^T and y^[i] is the same as y except that the ith element is replaced by L_i f̂^[i].
Denote H(λ) = {h_{ij}}_{i,j=1}^n. Then

    L_i f̂ = ∑_{j=1}^n h_{ij} y_j,
    L_i f̂^[i] = ∑_{j≠i} h_{ij} y_j + h_{ii} L_i f̂^[i].

Solving for L_i f̂^[i], we have

    L_i f̂^[i] = (L_i f̂ − h_{ii} y_i)/(1 − h_{ii}).

Then

    L_i f̂^[i] − y_i = (L_i f̂ − h_{ii} y_i)/(1 − h_{ii}) − y_i = (L_i f̂ − y_i)/(1 − h_{ii}).
Plugging into (3.21), we have

    CV(λ) = (1/n) ∑_{i=1}^n (L_i f̂ − y_i)²/(1 − h_{ii})².    (3.23)
Therefore, the cross-validation criterion can be calculated using the fitbased on the whole sample and the diagonal elements of the hat matrix.
Replacing h_{ii} by the average of the diagonal elements, trH(λ)/n, we have the generalized cross-validation (GCV) criterion

    GCV(λ) ≜ [(1/n) ∑_{i=1}^n (L_i f̂ − y_i)²] / {(1/n)tr(I − H(λ))}².    (3.24)
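The shortcut (3.23) and the criterion (3.24) can be checked numerically for any linear smoother. The R sketch below is not from the book: ridge regression on a small polynomial basis stands in for the spline smoother (all object names are illustrative), and the leave-one-out identity is verified against brute-force refitting.

```r
# Sketch: CV shortcut (3.23) and GCV (3.24) for a linear smoother yhat = H y.
set.seed(2)
n <- 30
x <- sort(runif(n))
y <- sin(2*pi*x) + rnorm(n, sd = 0.2)
X <- cbind(1, x, x^2)                 # a small stand-in basis
lam <- 0.1
H <- X %*% solve(crossprod(X) + lam*diag(3), t(X))   # hat matrix
fhat <- drop(H %*% y)

cv.short <- mean(((fhat - y)/(1 - diag(H)))^2)       # equation (3.23)

cv.brute <- mean(sapply(1:n, function(i) {           # refit without obs i
  b <- solve(crossprod(X[-i, ]) + lam*diag(3),
             crossprod(X[-i, ], y[-i]))
  (drop(X[i, ] %*% b) - y[i])^2
}))

gcv <- mean((fhat - y)^2)/(mean(1 - diag(H)))^2      # equation (3.24)
```

Here cv.short and cv.brute agree to machine precision, which is exactly the content of the leaving-out-one lemma for penalized least squares fits.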
The GCV estimate of λ is the minimizer of GCV(λ). Since trH(λ)/n is usually small in the neighborhood of the optimal λ, we have

    E{GCV(λ)} ≈ {(1/n)||(I − H(λ))f||² + (σ²/n)trH²(λ) + σ² − (2σ²/n)trH(λ)}{1 + (2/n)trH(λ)}
              = PSE(λ){1 + o(1)}.
The above approximation provides a very crude argument supporting the GCV criterion as a proxy for the PSE. More formally, under certain regularity conditions, GCV(λ) is a consistent estimate of the relative loss function. Furthermore, GCV(λ) is invariant to orthogonal transformations of y. See Wahba (1990) and Gu (2002) for details. One distinctive advantage of the GCV criterion over the UBR criterion is that the former does not require an estimate of σ².
To illustrate the CV(λ) and GCV(λ) criteria as estimates of the PSE, we generate responses from model (1.1) with f(x) = sin(4πx²), σ = 0.5, x_i = i/n for i = 1, …, n, and n = 73. For the cubic periodic spline, the CV and GCV scores for 50 replications of simulated data are shown in Figure 3.6.
3.5 Bayes and Linear Mixed-Effects Models
Assume a prior for f as

    F(x) = ∑_{ν=1}^p ζ_ν φ_ν(x) + δ^{1/2} U(x),    (3.25)
[FIGURE 3.6 Plots of the PSE function as solid lines, the CV functions as dashed lines (left), and the GCV functions as dashed lines (right). The minimum point of the PSE is marked as long bars at the bottom. The CV and GCV estimates of log10(nλ) are marked as short bars.]
where ζ_1, …, ζ_p are iid N(0, κ), U(x) is a zero-mean Gaussian stochastic process with covariance function R_1(x, z), the ζ_ν and U(x) are independent, and κ and δ are positive constants. Note that the bounded linear functionals L_i are defined for elements in H. Their application to the random process F(x) is yet to be defined. For simplicity, the subscript i in L_i is ignored in the following definition. Define L(ζ_ν φ_ν) = ζ_ν Lφ_ν. The definition of LU requires the duality between the Hilbert space spanned by a family of random variables and its associated RKHS.
Consider the linear space

    U = {W : W = ∑_j α_j U(x_j), x_j ∈ X, α_j ∈ R}

with inner product (W_1, W_2) = E(W_1 W_2). Let L_2(U) be the Hilbert space that is the completion of U. Note that the RK R_1 of H_1 coincides with the covariance function of U(x). Consider a linear map Ψ : H_1 → L_2(U) such that Ψ{R_1(x_j, ·)} = U(x_j). Since

    (R_1(x, ·), R_1(z, ·)) = R_1(x, z) = E{U(x)U(z)} = (U(x), U(z)),

the map Ψ is inner product preserving. In fact, H_1 is isometrically isomorphic to L_2(U). See Parzen (1961) for details. Since L is a bounded linear functional on H_1, by the Riesz representation theorem, there exists a representer h such that Lf = (h, f). Finally, we define

    LU ≜ Ψh.
Note that LU is a random variable in L_2(U). The application of L to F is defined as LF = ∑_{ν=1}^p ζ_ν Lφ_ν + δ^{1/2} LU.

When L is an evaluational functional, say Lf = f(x_0) for a fixed x_0, we have h(·) = R_1(x_0, ·). Consequently,

    LU = Ψh = Ψ{R_1(x_0, ·)} = U(x_0),

the evaluation of U at x_0. Therefore, as expected, LF = F(x_0) when L is an evaluational functional.
Suppose observations are generated by

    y_i = L_i F + ǫ_i,    i = 1, …, n,    (3.26)

where the prior F is defined in (3.25) and the ǫ_i are iid N(0, σ²). Note that the normality assumption has been made for the random errors.

We now compute the posterior mean E(L_0 F|y) for a bounded linear functional L_0 on H. Note that L_0 is arbitrary and could be quite different from the L_i. For example, suppose f ∈ W_2^m[a, b] and the L_i are evaluational functionals. Setting L_0 f = f′(x_0) leads to an estimate of f′. Using the correspondence between H and L_2(U), we have
    E(L_iU L_jU) = (L_iU, L_jU) = (L_{i(x)} R_1(x, ·), L_{j(z)} R_1(z, ·)) = L_{i(x)} L_{j(z)} R_1(x, z),
    E(L_0U L_iU) = (L_0U, L_iU) = (L_{0(x)} R_1(x, ·), L_{i(z)} R_1(z, ·)) = L_{0(x)} L_{i(z)} R_1(x, z).

Let ζ = (ζ_1, …, ζ_p)^T, φ = (φ_1, …, φ_p)^T, and L_0φ = (L_0φ_1, …, L_0φ_p)^T. Then F(x) = φ^T(x)ζ + δ^{1/2} U(x). It is easy to check that

    y = Tζ + δ^{1/2}(L_1U, …, L_nU)^T + ǫ ∼ N(0, κTT^T + δΣ + σ²I),    (3.27)

and

    L_0F = (L_0φ)^T ζ + δ^{1/2} L_0U ∼ N(0, κ(L_0φ)^T(L_0φ) + δ L_{0(x)} L_{0(z)} R_1(x, z)).    (3.28)

Furthermore,

    E{(L_0F)y} = κT L_0φ + δ L_0ξ,    (3.29)

where

    ξ(x) = (L_{1(z)} R_1(x, z), …, L_{n(z)} R_1(x, z))^T,
    L_0ξ = (L_{0(x)} L_{1(z)} R_1(x, z), …, L_{0(x)} L_{n(z)} R_1(x, z))^T.    (3.30)
Let λ = σ²/nδ and η = κ/δ. Using properties of multivariate normal random variables and equations (3.27), (3.28), and (3.29), we have

    E(L_0F|y) = (L_0φ)^T ηT^T(ηTT^T + M)^{-1} y + (L_0ξ)^T (ηTT^T + M)^{-1} y.    (3.31)

It can be shown (Wahba 1990, Gu 2002) that for any full-column-rank matrix T and symmetric and nonsingular matrix M,

    lim_{η→∞} (ηTT^T + M)^{-1} = M^{-1} − M^{-1}T(T^T M^{-1}T)^{-1}T^T M^{-1},
    lim_{η→∞} ηT^T(ηTT^T + M)^{-1} = (T^T M^{-1}T)^{-1}T^T M^{-1}.    (3.32)

Combining the results in (3.31), (3.32), and (2.22), we have

    lim_{κ→∞} E(L_0F|y) = (L_0φ)^T (T^T M^{-1}T)^{-1}T^T M^{-1} y
                          + (L_0ξ)^T {M^{-1} − M^{-1}T(T^T M^{-1}T)^{-1}T^T M^{-1}} y
                        = (L_0φ)^T d + (L_0ξ)^T c
                        = L_0 f̂.
The above result indicates that the smoothing spline estimate f̂ is a Bayes estimator with a diffuse prior for ζ. From a frequentist perspective, the smoothing spline estimate may be regarded as the best linear unbiased prediction (BLUP) estimate of a linear mixed-effects (LME) model. We now present three corresponding LME models. The first LME model assumes that

    y = Tζ + u + ǫ,    (3.33)

where ζ = (ζ_1, …, ζ_p)^T are deterministic parameters, u = (u_1, …, u_n)^T are random effects with distribution u ∼ N(0, σ²Σ/nλ), ǫ = (ǫ_1, …, ǫ_n)^T are random errors with distribution ǫ ∼ N(0, σ²I), and u and ǫ are independent. The second LME model assumes that

    y = Tζ + Σu + ǫ,    (3.34)

where ζ are deterministic parameters, u are random effects with distribution u ∼ N(0, σ²Σ⁺/nλ), Σ⁺ is the Moore–Penrose inverse of Σ, ǫ are random errors with distribution ǫ ∼ N(0, σ²I), and u and ǫ are independent.

It is inconvenient to use the above two LME models for computation since Σ may be singular. Write Σ = ZZ^T, where Z is an n × m matrix with m = rank(Σ). The third LME model assumes that

    y = Tζ + Zu + ǫ,    (3.35)

where ζ are deterministic parameters, u are random effects with distribution u ∼ N(0, σ²I/nλ), ǫ are random errors with distribution ǫ ∼ N(0, σ²I), and u and ǫ are independent.
It can be shown that the BLUP estimates for each of the three LME models (3.33), (3.34), and (3.35) are the same as the smoothing spline estimate. See Wang (1998b) and Chapter 9 for more details.
3.6 Generalized Maximum Likelihood
The connection between smoothing spline models and Bayes models can be exploited to develop a likelihood-based estimate of the smoothing parameter. From (3.27), the marginal distribution of y is N(0, δ(ηTT^T + M)). Consider the following transformation:

    w_1 = Q_2^T y,    w_2 = (1/√η) T^T y.    (3.36)

It is easy to check that

    w_1 = Q_2^T y ∼ N(0, δQ_2^T M Q_2),
    Cov(w_1, w_2) = (δ/√η) Q_2^T (ηTT^T + M)T → 0,    η → ∞,
    Var(w_2) = (δ/η) T^T (ηTT^T + M)T → δ(T^TT)(T^TT),    η → ∞.
Note that, as η → ∞, the distribution of w_2 does not depend on λ. Therefore, we consider the negative marginal log-likelihood of w_1,

    l(λ, δ|w_1) = (1/2) log|δQ_2^T M Q_2| + (1/2δ) w_1^T (Q_2^T M Q_2)^{-1} w_1 + C_1,    (3.37)

where C_1 is a constant. Minimizing l(λ, δ|w_1) with respect to δ, we have

    δ̂ = w_1^T (Q_2^T M Q_2)^{-1} w_1/(n − p).    (3.38)
The profile negative log-likelihood is

    l_p(λ, δ̂|w_1) = (1/2) log|Q_2^T M Q_2| + ((n−p)/2) log δ̂ + C_2
                  = ((n−p)/2) log [ w_1^T (Q_2^T M Q_2)^{-1} w_1 / {det(Q_2^T M Q_2)^{-1}}^{1/(n−p)} ] + C_2,    (3.39)
where C_2 is another constant. The foregoing profile negative log-likelihood is equivalent to

    GML(λ) ≜ w_1^T (Q_2^T M Q_2)^{-1} w_1 / {det(Q_2^T M Q_2)^{-1}}^{1/(n−p)}
           = y^T (I − H(λ)) y / [det⁺{(I − H(λ))}]^{1/(n−p)},    (3.40)

where the second equality is based on (2.26), and det⁺ represents the product of the nonzero eigenvalues. The function GML(λ) is referred to as the generalized maximum likelihood (GML) criterion, and the minimizer of GML(λ) is called the GML estimate of the smoothing parameter.
From (3.38), a likelihood-based estimate of σ² is

    σ̂² ≜ nλ w_1^T (Q_2^T M Q_2)^{-1} w_1/(n − p) = y^T (I − H(λ)) y/(n − p).    (3.41)
The GML criterion may also be derived from the connection between smoothing spline models and LME models. Consider any one of the three corresponding LME models (3.33), (3.34), and (3.35). The smoothing parameter λ is part of the variance component for the random effects. It is common practice in the mixed-effects literature to estimate the variance components using the restricted likelihood based on an orthogonal contrast of the original observations, where the orthogonal contrast is used to eliminate the fixed effects. Note that w_1 is one such orthogonal contrast since Q_2 is orthogonal to T. Therefore, l(λ, δ|w_1) in (3.37) is the negative log restricted likelihood, and the GML estimate of the smoothing parameter is the restricted maximum likelihood (REML) estimate. Furthermore, the estimate of the error variance in (3.41) is the REML estimate of σ². The connection between a smoothing spline estimate with the GML estimate of the smoothing parameter and a BLUP estimate with the REML estimate of the variance component in a corresponding LME model may be utilized to fit a smoothing spline model using software for LME models. This approach will be adopted in Chapters 5, 8, and 9 to fit smoothing spline models for correlated observations.
3.7 Comparison and Implementation
Theoretical properties of the UBR, GCV, and GML criteria can be found in Wahba (1990) and Gu (2002). The UBR criterion requires an estimate of the variance σ². No distributional assumptions are required for the UBR and GCV criteria, while the normality assumption is required in
the derivation of the GML criterion. Nevertheless, limited simulations suggest that the GML method is quite robust to departures from the normality assumption.
Theoretical comparisons among the UBR, GCV, and GML criteria have been studied using large-sample asymptotics (Wahba 1985, Li 1986, Stein 1990) and finite-sample arguments (Efron 2001). Conclusions based on different perspectives do not always agree with each other. In practice, all three criteria usually perform well and lead to similar estimates. Each method has its own strengths and weaknesses. The UBR and GCV criteria occasionally lead to gross undersmoothing (interpolation) when the sample size is small. Fortunately, this problem diminishes quickly as the sample size increases (Wahba and Wang 1995).
The argument spar in the ssr function specifies which method should be used for selecting the smoothing parameter λ. The options spar="v", spar="m", and spar="u" correspond to the GCV, GML, and UBR methods, respectively. The default choice is the GCV method.
We now use the motorcycle data to illustrate how to specify the spar option. For simplicity, the variable times is first scaled into [0, 1]. We first use the Rice method to estimate the error variance and then use the estimated variance in the UBR criterion:
> x <- (times-min(times))/(max(times)-min(times))
> vrice <- mean((diff(accel))**2)/2
> mcycle.ubr.1 <- ssr(accel~x, rk=cubic(x),
     spar="u", varht=vrice)
> summary(mcycle.ubr.1)
Smoothing spline regression fit by UBR method
...
UBR estimate(s) of smoothing parameter(s) : 8.60384e-07
Equivalent Degrees of Freedom (DF): 12.1624
Estimate of sigma: 23.09297
The option varht specifies the parameter σ² required for the UBR method. The summary function provides a synopsis including the estimate of the smoothing parameter, the degrees of freedom trH(λ), and the estimate of the standard deviation σ.
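As a quick sanity check on the Rice estimator, the following sketch simulates data with a known noise level (the test function, σ, and sample size below are made up for illustration and are not from the book):

```r
# Simulated check of the Rice first-difference variance estimator.
# True sigma = 0.3, so the target is sigma^2 = 0.09.
set.seed(1)
n <- 500
xx <- (1:n)/n
ff <- sin(2*pi*xx)               # a smooth mean function
yy <- ff + rnorm(n, sd = 0.3)
vrice <- mean(diff(yy)^2)/2      # differencing removes the slowly varying
vrice                            # mean, leaving approximately sigma^2
```

Each first difference has variance 2σ² plus a small bias from the change in the mean function, which vanishes as the design points become dense.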
Instead of the Rice estimator, we can estimate the error variance using Tong and Wang's estimator σ̂²_TW. Note that there are multiple observations at some time points. We use these replicates to estimate the error variance. That is, we select all pairs with zero distances:
> d <- s <- NULL
> for (i in 1:132) {
for (j in (i+1):133) {
d <- c(d, (x[i]-x[j])**2)
s <- c(s, (accel[i]-accel[j])**2/2)
}}
> vtw <- coef(lm(s~d,subset=d==0))[1]
> mcycle.ubr.2 <- ssr(accel~x, rk=cubic(x),
     spar="u", varht=vtw)
> summary(mcycle.ubr.2)
Smoothing spline regression fit by UBR method
...
UBR estimate(s) of smoothing parameter(s) : 8.35209e-07
Equivalent Degrees of Freedom (DF): 12.24413
Estimate of sigma: 22.69913
Next we use the GCV method to select the smoothing parameter:
> mcycle.gcv <- ssr(accel~x, rk=cubic(x), spar="v")
> summary(mcycle.gcv)
Smoothing spline regression fit by GCV method
...
GCV estimate(s) of smoothing parameter(s) : 8.325815e-07
Equivalent Degrees of Freedom (DF): 12.25284
Estimate of sigma: 22.65806
Finally, we use the GML method to select the smoothing parameter:
> mcycle.gml <- ssr(accel~x, rk=cubic(x), spar="m")
> summary(mcycle.gml)
Smoothing spline regression fit by GML method
...
GML estimate(s) of smoothing parameter(s) : 4.729876e-07
Equivalent Degrees of Freedom (DF): 13.92711
Estimate of sigma: 22.57701
For the motorcycle data, all three methods lead to similar estimates of the smoothing parameter and the function f.
3.8 Confidence Intervals
3.8.1 Bayesian Confidence Intervals
Consider the Bayes model (3.25) and (3.26). The computation in Section 3.5 can be carried out one step further to derive posterior distributions. In the following arguments, as in Section 3.5, a diffuse prior is assumed for ζ with κ → ∞. For simplicity of notation, the limit is not expressed explicitly.
Let F0ν = ζν φν for ν = 1, . . . , p, and F1 = δ^{1/2} U. Let L0, L01, and L02 be bounded linear functionals. Since F0ν, F1, and ǫi are all normal random variables, the posterior distributions of L0F0ν and L0F1 are normal with the following means and covariances.

Posterior means and covariances
For ν, µ = 1, . . . , p, the posterior means are

    E(L0F0ν|y) = (L0φν) eν^T d,
    E(L0F1|y) = (L0ξ)^T c,                                          (3.42)

and the posterior covariances are

    δ^{−1} Cov(L01F0ν, L02F0µ|y) = (L01φν)(L02φµ) eν^T A eµ,
    δ^{−1} Cov(L01F0ν, L02F1|y) = −(L01φν) eν^T B (L02ξ),           (3.43)
    δ^{−1} Cov(L01F1, L02F1|y) = L01L02R1 − (L01ξ)^T C (L02ξ),

where eν is a vector of dimension p with the νth element equal to one and all other elements equal to zero, the vectors c and d are given in (2.22), and the matrices A = (T^T M^{−1} T)^{−1}, B = A T^T M^{−1}, and C = M^{−1}(I − TB).
The vectors ξ and Lξ are defined in (3.30). Proofs can be found in Wahba (1990) and Gu (2002). Note that H = span{φ1, . . . , φp} ⊕ H1. Then any f ∈ H can be represented as
f = f01 + · · · + f0p + f1, (3.44)
where f0ν ∈ span{φν} for ν = 1, . . . , p, and f1 ∈ H1. The estimate f̂ can also be decomposed similarly:

    f̂ = f̂01 + · · · + f̂0p + f̂1,                                    (3.45)

where f̂0ν = φν dν for ν = 1, . . . , p, and f̂1 = ξ^T c.
The functionals L0, L01, and L02 are arbitrary as long as they are well defined. The equations in (3.42) indicate that the posterior means of the components of F equal the corresponding components of the spline estimate f̂. Equations (3.42) and (3.43) can be used to compute posterior means and variances for any combination of components of F. Specifically, consider the linear combination
    Fγ(x) = Σ_{ν=1}^p γν F0ν(x) + γ_{p+1} F1(x),                    (3.46)

where γν equals 1 when the corresponding component of F is to be included and 0 otherwise, and γ = (γ1, . . . , γ_{p+1})^T. Then, for any linear functional L0,
    E(L0Fγ|y) = Σ_{ν=1}^p γν (L0φν) dν + γ_{p+1} (L0ξ)^T c,

    Var(L0Fγ|y) = Σ_{ν=1}^p Σ_{µ=1}^p γν γµ Cov(L0F0ν, L0F0µ|y)
                  + 2 Σ_{ν=1}^p γν γ_{p+1} Cov(L0F0ν, L0F1|y)
                  + γ_{p+1}² Cov(L0F1, L0F1|y).                     (3.47)
For various reasons it is often desirable to have interpretable confidence intervals for the function f and its components. For example, one may want to decide whether a nonparametric model is more suitable than a particular parametric model. A parametric regression model may be considered unsuitable if a large portion of its estimate is outside the confidence intervals of a smoothing spline estimate.
Consider a collection of points x0j ∈ X, j = 1, . . . , J. For each j, the posterior mean E{Fγ(x0j)|y} and variance Var{Fγ(x0j)|y} can be calculated using the equations in (3.47) by setting L0F = F(x0j). Then 100(1 − α)% Bayesian confidence intervals for

    fγ(x0j) = Σ_{ν=1}^p γν f0ν(x0j) + γ_{p+1} f1(x0j),   j = 1, . . . , J,   (3.48)

are

    E{Fγ(x0j)|y} ± z_{α/2} √Var{Fγ(x0j)|y},   j = 1, . . . , J,     (3.49)

where z_{α/2} is the 1 − α/2 percentile of the standard normal distribution.
In particular, let F = (L1F, . . . , LnF)^T. Applying (3.42) and (3.43), we have

    E(F|y) = H(λ)y,
    Cov(F|y) = σ² H(λ).                                             (3.50)
Therefore, the posterior variances of the fitted values are Var(LiF|y) = σ² hii, where hii are the diagonal elements of the matrix H(λ). When the Li are evaluational functionals, Lif = f(xi), Wahba (1983) proposed the following 100(1 − α)% confidence intervals:

    f̂(xi) ± z_{α/2} σ̂ √hii,                                        (3.51)

where σ̂ is an estimate of σ. Note that confidence intervals for a linear combination of components of f can be constructed similarly.
Though based on a Bayesian argument, the Bayesian confidence intervals have been found to have good frequentist properties provided that the smoothing parameter is estimated properly. They must be interpreted as "across-the-function" rather than pointwise. More precisely, define the average coverage probability (ACP) as
    ACP = (1/n) Σ_{i=1}^n P{f(xi) ∈ C(α, xi)}

for some (1 − α)100% confidence intervals {C(α, xi), i = 1, . . . , n}. Rather than considering a confidence interval for f(τ), where f(·) is the realization of a stochastic process and τ is fixed, one may consider confidence intervals for f(τn), where f is now a fixed function and τn is a point randomly selected from {xi, i = 1, . . . , n}. Then ACP = P{f(τn) ∈ C(α, τn)}. Note that the ACP coverage property is weaker than the pointwise coverage property. For polynomial splines and C(α, xi) being the Bayesian confidence intervals defined in (3.51), under certain regularity conditions, Nychka (1988) showed that ACP ≈ 1 − α.
The predict function in the assist package computes the posterior mean and standard deviation of Fγ(x) in (3.46). The option terms specifies the coefficients γ, and the option newdata specifies a data frame consisting of the values at which predictions are required. We now use the geyser, motorcycle, and Arosa data to illustrate how to use the predict function to compute the posterior means and standard deviations.

For the geyser data, we have fitted a cubic spline in Chapter 1, Section 1.1 and a partial spline in Chapter 2, Section 2.10. In the following we fit a cubic spline using the ssr function, compute posterior means and standard deviations for the estimate of the smooth component P1f using the predict function, and plot the estimate of the smooth component and 95% Bayesian confidence intervals:
> geyser.cub.fit <- ssr(waiting~x, rk=cubic(x))
> grid <- seq(0,1,len=200)
> geyser.cub.pred <- predict(geyser.cub.fit, pstd=T,
terms=c(0,0,1), newdata=data.frame(x=grid))
> grid1 <- grid*diff(range(eruptions))+min(eruptions)
> plot(eruptions, waiting, xlab="duration (mins)",
     ylim=c(-6,6), ylab="smooth component (mins)",
     type="n")
> polygon(c(grid1,rev(grid1)),
c(geyser.cub.pred$fit-1.96*geyser.cub.pred$pstd,
rev(geyser.cub.pred$fit+1.96*geyser.cub.pred$pstd)),
col=gray(0:8/8)[8], border=NA)
> lines(grid1, geyser.cub.pred$fit)
> abline(0,0,lty=2)
where the option pstd specifies whether the posterior standard deviations should be calculated. Note that the option pstd=T can be dropped in the above statement since it is the default. There are in total three components in the cubic spline fit: two basis functions φ1 (constant) and φ2 (linear) for the null space, and the smooth component in the space H1. In the order in which they appear in the ssr function, these three components correspond to the intercept (~1, which is automatically included), the linear basis specified by ~x, and the smooth component specified by rk=cubic(x). Therefore, the option terms=c(0,0,1) was used to compute the posterior means and standard deviations for the smooth component f1 in the space H1. The estimate of the smooth component and 95% Bayesian confidence intervals are shown in Figure 3.7(a). A large portion of the zero constant line is outside the confidence intervals, indicating lack of fit of a linear model (the null space of the cubic spline).
For the partial spline model (2.43), consider the following Bayes model
yi = sTi β + LiF + ǫi, i = 1, . . . , n, (3.52)
where the prior for β is assumed to be N(0, κIq), the prior for F is defined in (3.25), and ǫi ~iid N(0, σ²). Again, it can be shown that the PLS estimates of the components of β and f based on (2.44) equal the posterior means of their corresponding components in the Bayes model as κ → ∞. Posterior covariances and Bayesian confidence intervals for β and fγ can be calculated similarly.
We now refit the partial spline model (2.47) and compute posterior means and standard deviations for the estimate of the smooth component P1f:
FIGURE 3.7 Geyser data, plots of estimates of the smooth components and 95% Bayesian confidence intervals for (a) the cubic spline and (b) the partial spline models. The constant zero is marked as the dotted line in each plot.
> t <- .397; s <- 1*(x>t)
> geyser.ps.fit <- ssr(waiting~x+s, rk=cubic(x))
> geyser.ps.pred <- predict(geyser.ps.fit,
terms=c(0,0,0,1),
newdata=data.frame(x=grid,s=1*(grid>t)))
Since there are in total four components in the partial spline fit, the option terms=c(0,0,0,1) was used to compute the posterior means and standard deviations for the smooth component f1.
Figure 3.7(b) shows the estimate of the smooth component and 95% Bayesian confidence intervals for the partial spline. The zero constant line is well inside the confidence intervals, indicating that this smooth component may be dropped from the partial spline model. That is, a simple linear change-point model may be appropriate for these data.
For the motorcycle data, based on visual inspection, we have searched for a potential change-point t in the first derivative in the interval [0.2, 0.25] for the variable x in Section 2.10. To search for all possible change-points in the first derivative, we fit the partial spline model (2.48) repeatedly with t taking values on a grid in the interval [0.1, 0.9]. We then calculate the posterior mean and standard deviation for β. Define a t-statistic at point t as E(β|y)/√Var(β|y). The t-statistics were calculated as follows:
> tgrid <- seq(0.05,.95,len=200); tstat <- NULL
> for (t in tgrid) {
s <- (x-t)*(x>t)
tmp <- ssr(accel~x+s, rk=cubic(x))
tmppred <- predict(tmp, terms=c(0,0,1,0),
newdata=data.frame(x=.5,s=1))
tstat <- c(tstat, tmppred$fit/tmppred$pstd)
}
Note that terms=c(0,0,1,0) requests the posterior mean and standard deviation for the component β × s, where s = (x − t)+, and s is set to one in the newdata argument. The t-statistics are shown in the bottom panel of Figure 3.8.
FIGURE 3.8 Motorcycle data, plots of t-statistics (bottom) and the partial spline fit with 95% Bayesian confidence intervals (top).
There are three clusters of large t-statistics, suggesting three potential change-points in the first derivative at t1 = 0.2128, t2 = 0.3666, and t3 = 0.5113, respectively (see also Speckman (1995)). So we fit the following partial spline model

    yi = Σ_{j=1}^3 βj (xi − tj)+ + f(xi) + ǫi,   i = 1, . . . , n,   (3.53)

where x is the variable times scaled into [0, 1], tj are the change-points in the first derivative, and f ∈ W_2^2[0, 1]. We fit model (3.53) and compute posterior means and standard deviations as follows:
> t1 <- .2128; t2 <- .3666; t3 <- .5113
> s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)
> s3 <- (x-t3)*(x>t3)
> mcycle.ps.fit2 <- ssr(accel~x+s1+s2+s3, rk=cubic(x))
> grid <- seq(0,1,len=100)
> mcycle.ps.pred2 <- predict(mcycle.ps.fit2,
newdata=data.frame(x=grid, s1=(grid-t1)*(grid>t1),
s2=(grid-t2)*(grid>t2), s3=(grid-t3)*(grid>t3)))
The fit and Bayesian confidence intervals are shown in Figure 3.8.

For the Arosa data, Figure 2.12 in Chapter 2 shows the estimates and 95% confidence intervals for the overall function and its decomposition. In particular, the posterior means and standard deviations for the periodic spline were calculated as follows:
> arosa.per.fit <- ssr(thick~1, rk=periodic(x))
> grid <- data.frame(x=seq(.5/12,11.5/12,length=50))
> arosa.per.pred <- predict(arosa.per.fit,grid,
terms=matrix(c(1,0,0,1,1,1),nrow=3,byrow=T))
where the input for the terms argument is a 3 × 2 matrix with the first row (1,0) specifying the parametric component, the second row (0,1) specifying the smooth component, and the third row (1,1) specifying the overall function.
3.8.2 Bootstrap Confidence Intervals
Consider the general SSR model (2.10). Let f̂ and σ̂² be the estimates of f and σ², respectively. Let

    y*_{i,b} = Li f̂ + ǫ*_{i,b},   i = 1, . . . , n;  b = 1, . . . , B,   (3.54)

be B bootstrap samples, where ǫ*_{i,b} ~iid N(0, σ̂²). The random errors ǫ*_{i,b} may also be drawn from the residuals with replacement when the normality assumption is undesirable. Let f̂*_{γ,b} be the estimate of fγ in (3.48) based on the bth bootstrap sample {y*_{i,b}, i = 1, . . . , n}. For any well-defined functional L0, there are B bootstrap estimates L0 f̂*_{γ,b} for b = 1, . . . , B. Then the 100(1 − α)% percentile bootstrap confidence interval for L0fγ is

    (L0 f̂_{γ,L}, L0 f̂_{γ,U}),                                      (3.55)

where L0 f̂_{γ,L} and L0 f̂_{γ,U} are the lower and upper α/2 quantiles of {L0 f̂*_{γ,b}, b = 1, . . . , B}.
The percentile-t bootstrap confidence intervals can be constructed as follows. Let τ = (L0 f̂γ − L0 fγ)/σ̂, where division by σ̂ is introduced to reduce the dependence on σ. Let τ*_b = (L0 f̂*_{γ,b} − L0 f̂γ)/σ̂*_b be the bootstrap estimates of τ for b = 1, . . . , B, where σ̂*_b is the estimate of σ based on the bth bootstrap sample. Then the 100(1 − α)% percentile-t bootstrap confidence interval for L0fγ is

    (L0 f̂γ − q_{1−α/2} σ̂,  L0 f̂γ − q_{α/2} σ̂),                   (3.56)

where q_{α/2} and q_{1−α/2} are the lower and upper α/2 quantiles of {τ*_b, b = 1, . . . , B}. Note that the bounded linearity condition is not required for L0 in the above construction of bootstrap confidence intervals. Other forms of bootstrap confidence intervals and comparisons between the Bayesian and bootstrap approaches can be found in Wang and Wahba (1995).
For example, the 95% percentile bootstrap confidence intervals in Figure 2.8 in Chapter 2 were computed as follows:
> nboot <- 9999
> fb <- NULL
> for (i in 1:nboot) {
yb <- canada.fit1$fit +
sample(canada.fit1$resi, 35, replace=T)
bfit <- ssr(yb~s, rk=Sigma, spar="m")
fb <- cbind(fb, bfit$coef$d[2]+S%*%bfit$coef$c)
}
> lb <- apply(fb, 1, quantile, prob=.025)
> ub <- apply(fb, 1, quantile, prob=.975)
where random errors for the bootstrap samples were drawn from the residuals with replacement, and lb and ub represent the lower and upper bounds.
We use the following simulation to show the performance of the Bayesian and bootstrap confidence intervals. Observations are generated from model (1.1) with f(x) = exp{−64(x − 0.5)²}, σ = 0.1, xi = i/n for i = 1, . . . , n, and n = 100. We fit a cubic spline and construct 95% Bayesian, percentile bootstrap (denoted as Per), and percentile-t bootstrap (denoted as T) confidence intervals. We set B = 1000 and repeat the simulation 100 times. Figures 3.9(a), (b), and (c) show average pointwise coverages for the Bayesian, percentile bootstrap, and percentile-t bootstrap confidence intervals, respectively. The absolute value of f′′ is also plotted to show the curvature of the function. The pointwise coverage is usually smaller than the nominal value at high-curvature points. Boxplots of the across-the-function coverages for these three methods are shown in Figure 3.9(d). The average and median ACPs are close to the nominal value.
FIGURE 3.9 Plots of pointwise coverages (solid line) and nominal value (dotted line) for (a) the Bayesian confidence intervals, (b) the percentile bootstrap confidence intervals, and (c) the percentile-t bootstrap confidence intervals. Dashed lines in (a), (b), and (c) represent a scaled version of |f′′|. Boxplots of ACPs are shown in (d) with mean coverages marked as pluses and the nominal value plotted as a dotted line.
3.9 Hypothesis Tests
3.9.1 The Hypothesis
One of the most useful applications of nonparametric regression models is to check or suggest a parametric model. When appropriate, parametric models, especially linear models, are preferred in practice because of their simplicity and interpretability. One important step in building a parametric model is to investigate potential departures from a specified model. Tests with specific alternatives are often performed in practice. Such tests may not perform well for other forms of departure from the parametric model, especially those orthogonal to the specific alternative. For example, to detect departure from a straight-line model, one may consider a quadratic polynomial as the alternative. Then departures in the form of higher-order polynomials may be missed. It is desirable to have tools that can detect general departures from a specific parametric model.
One approach to checking a parametric model is to construct confidence intervals for f or its smooth component in an SSR model using the methods in Section 3.8. The parametric model may be deemed unsuitable if a large portion of its estimate is outside the confidence intervals of f̂. When the null space H0 corresponds to the parametric model under consideration, one may check the magnitude of the estimate of the smooth component P1f, since it represents the remaining systematic variation not explained by the parametric model. If a large portion of the confidence intervals for P1f does not contain zero, the parametric model may be deemed unsuitable. This approach was illustrated using the geyser data in Section 3.8.1.
Often the confidence intervals are all one needs in practice. Nevertheless, sometimes it may be desirable to conduct a formal test of the departure from a parametric model. In this section we consider the following hypotheses:
H0 : f ∈ H0, H1 : f ∈ H and f /∈ H0, (3.57)
where H0 corresponds to the parametric model. Note that when the parametric model satisfies Lf = 0 with a differential operator L given in (2.53), the L-spline may be used to check or test the parametric model.
The alternative is equivalent to ||P1f|| > 0. Note that λ = ∞ in (2.11), or equivalently, δ = 0 in the corresponding Bayes model (3.25), leads to f ∈ H0. Thus the hypothesis (3.57) can be reexpressed as

    H0 : λ = ∞,   H1 : λ < ∞,                                       (3.58)

or

    H0 : δ = 0,   H1 : δ > 0.                                       (3.59)
3.9.2 Locally Most Powerful Test
Consider the hypothesis (3.59). As in Section 3.8, we use the marginal log-likelihood of w1, where w1 = Q2^T y ∼ N(0, δ Q2^T M Q2). Since Q2 is orthogonal to T, it is clear that the transformation Q2^T y eliminates the contribution from the model under the null hypothesis. Thus w1 reflects signals, if any, from H1. Note that M = Σ + nλI and Q2^T Σ Q2 = U E U^T, where the notations Q2, U, and E were defined in Section 3.2. Let z ≜ U^T w1 and denote z = (z1, . . . , z_{n−p})^T. Then z ∼ N(0, δE + σ²I). First assume that σ² is known. Note that λ = σ²/(nδ). The negative log-likelihood of z is
    l(δ|z) = (1/2) Σ_{ν=1}^{n−p} log(δeν + σ²) + (1/2) Σ_{ν=1}^{n−p} zν²/(δeν + σ²) + C1,   (3.60)
where C1 is a constant. Let Uδ(δ) and Iδδ(δ) be the score and Fisher information for δ. It is not difficult to check that the score test statistic (Cox and Hinkley 1974) is

    t_score ≜ Uδ(0)/√Iδδ(0) = C2 ( Σ_{ν=1}^{n−p} eν zν² + C3 ),

where C2 and C3 are constants. Therefore, the score test is equivalent to the following test statistic:

    t_LMP = Σ_{ν=1}^{n−p} eν zν².                                   (3.61)
For polynomial splines, Cox, Koh, Wahba and Yandell (1988) showed that there is no uniformly most powerful test and that t_LMP is the locally most powerful (LMP) test.
The variance is usually unknown in practice. Replacing σ² by its MLE under the null hypothesis (3.59), σ̂²_0 = Σ_{ν=1}^{n−p} zν²/(n − p), leads to the approximate LMP test statistic

    t_appLMP = Σ_{ν=1}^{n−p} eν zν² / Σ_{ν=1}^{n−p} zν².            (3.62)
The null hypothesis is rejected for large values of t_appLMP. The test statistic t_appLMP does not follow a simple distribution under H0. Nevertheless, it is straightforward to simulate the null distribution. Under H0, zν ~iid N(0, σ²). Without loss of generality, σ² can be set to one in the simulation for the null distribution since both the numerator and the denominator depend on the zν². Specifically, samples z_{ν,j} ~iid N(0, 1) for ν = 1, . . . , n − p and j = 1, . . . , N are generated, and the statistics t_appLMP,j = Σ_{ν=1}^{n−p} eν z²_{ν,j} / Σ_{ν=1}^{n−p} z²_{ν,j} are computed. Note that the t_appLMP,j are N realizations of the statistic under H0. Then the proportion of the t_appLMP,j greater than t_appLMP provides an estimate of the p-value. This approach usually requires a very large N. The p-value can also be calculated numerically using the algorithm in Davies (1980). The approximation method is very fast and agrees with the results from the Monte Carlo method (Liu and Wang 2004).
3.9.3 Generalized Maximum Likelihood Test
Consider the hypothesis (3.58). Since z ∼ N(0, δ(E + nλI)), the MLE of δ is

    δ̂ = (1/(n − p)) Σ_{ν=1}^{n−p} zν²/(eν + nλ).
From (3.39), the profile likelihood of λ is

    Lp(λ|z) = C { [Σ_{ν=1}^{n−p} zν²/(eν + nλ)] / [Π_{ν=1}^{n−p} (eν + nλ)]^{−1/(n−p)} }^{−(n−p)/2},   (3.63)
where C is a constant. The GML estimate of λ, λ̂_GML, is the maximizer of (3.63). The GML test statistic for the hypothesis (3.58) is

    t_GML ≜ { Lp(λ̂_GML|z) / Lp(∞|z) }^{−2/(n−p)}
          = [Σ_{ν=1}^{n−p} zν²/(eν + nλ̂_GML)] / [Π_{ν=1}^{n−p} (eν + nλ̂_GML)]^{−1/(n−p)} · 1/(Σ_{ν=1}^{n−p} zν²).   (3.64)
It is clear that the GML test is equivalent to a ratio of restricted likelihoods. The null hypothesis is rejected when t_GML is too small. The standard theory for likelihood ratio tests does not apply because the parameter λ lies on the boundary of the parameter space under the null hypothesis. Thus, it is difficult to derive the null distribution of t_GML. The Monte Carlo method described for the LMP test can be adapted to compute an estimate of the p-value. Note that t_GML involves the GML estimate of the smoothing parameter. Therefore, λ̂_GML needs to be estimated for each simulation sample, which makes this approach computationally intensive.
The null distribution of −(n − p) log t_GML can be well approximated by a mixture of χ²₁ and χ²₀, denoted by rχ²₀ + (1 − r)χ²₁. However, the ratio r is not fixed. It is difficult to derive a formula for r since it depends on many factors. One can approximate the ratio r first and then calculate the p-value based on the mixture of χ²₁ and χ²₀ with the approximated r. A relatively small sample size is required to approximate r. See Liu and Wang (2004) for details.
3.9.4 Generalized Cross-Validation Test
Consider the hypothesis (3.58). Let λ̂_GCV be the GCV estimate of λ. Similar to the GML test statistic, the GCV test statistic is defined as the ratio between GCV scores:

    t_GCV ≜ GCV(λ̂_GCV)/GCV(∞)
          = (n − p)² [Σ_{ν=1}^{n−p} zν²/(1 + eν/(nλ̂_GCV))²] / {Σ_{ν=1}^{n−p} 1/(1 + eν/(nλ̂_GCV))}² · 1/(Σ_{ν=1}^{n−p} zν²).   (3.65)
H0 is rejected when t_GCV is too small. Again, similar to the GML test, the Monte Carlo method can be used to compute an estimate of the p-value.
3.9.5 Comparison and Implementation
The LMP and GML tests, derived based on Bayesian arguments, perform well under deterministic models. In terms of the eigenvectors of Q2^T Σ Q2, the LMP test is more powerful in detecting departures in the direction of the first eigenvector, the GML test is more powerful in detecting departures in low frequencies, and the GCV test is more powerful in detecting departures in high frequencies. See Liu and Wang (2004) for more details.
These tests can be carried out using the anova function in the assist package. We now use the geyser and Arosa data to illustrate how to use this function.
For the geyser data, first consider the following hypotheses
    H0 : f ∈ span{1, x},   H1 : f ∈ W_2^2[0, 1] and f ∉ span{1, x}.
We have fitted a cubic spline model where the null space corresponds to the model under H0. Therefore, we can test the hypothesis as follows:
> anova(geyser.cub.fit, simu.size=500)
88 Smoothing Splines: Methods and Applications
Testing H_0: f in the NULL space
test.value simu.size simu.p-value
LMP 0.02288057 500 0
GCV 0.00335470 500 0
where the option simu.size specifies the Monte Carlo sample size N. The simple linear model is rejected. Next we consider the hypotheses
    H0 : f ∈ span{1, x, s},   H1 : f ∈ W_2^2[0, 1] and f ∉ span{1, x, s},
where s = (x − 0.397)^0_+. We have fitted a partial spline model where the null space corresponds to the model under H0. Therefore, we can test the hypothesis as follows:
> anova(geyser.ps.fit, simu.size=500)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value
LMP 0.000351964 500 0.602
GCV 0.003717477 500 0.602
The null hypothesis is not rejected. The conclusions are the same as those based on the Bayesian confidence intervals. To apply the GML test, we need to first fit using the GML method to select the smoothing parameter:
> geyser.ps.fit.m <- ssr(waiting~x+s, rk=cubic(x),
     spar="m")
> anova(geyser.ps.fit.m, simu.size=500)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value approximate.p-value
LMP 0.000352 500 0.634
GML 1.000001 500 0.634 0.5
where the approximate.p-value was computed using the mixture of two chi-square distributions.
For the Arosa data, consider the hypothesis
    H0 : f ∈ P,   H1 : f ∈ W_2^2(per) and f ∉ P,
where P = span{1, sin 2πx, cos 2πx} is the model space for the sinusoidal model. Two approaches can be used to test the above hypothesis: fit a partial spline or fit an L-spline:
> arosa.ps.fit <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
rk=periodic(x), data=Arosa)
> anova(arosa.ps.fit,simu.size=500)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value
LMP 0.001262064 500 0
GCV 0.001832394 500 0
> arosa.ls.fit <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
rk=lspline(x,type="sine1"))
> anova(arosa.ls.fit,simu.size=500)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value
LMP 2.539163e-06 500 0
GCV 0.001828071 500 0
The test based on the L-spline is usually more powerful since the parametric and the smooth components are orthogonal.
Chapter 4
Smoothing Spline ANOVA
4.1 Multiple Regression
Consider the problem of building regression models that examine the relationship between a dependent variable y and multiple independent variables x1, . . . , xd. For generality, let the domain of each xk be an arbitrary set Xk. Denote x = (x1, . . . , xd). Given observations (xi, yi) for i = 1, . . . , n, where xi = (xi1, . . . , xid), a multiple regression model relates the dependent variable to the independent variables as follows:
yi = f(xi) + ǫi, i = 1, . . . , n, (4.1)
where f is a multivariate regression function, and the ǫi are zero-mean independent random errors with a common variance σ². The goal is to construct a model for f and estimate it based on noisy data.
There exist many different methods to construct a model space for f: parametrically, semiparametrically, or nonparametrically. For example, a thin-plate spline model may be used when all xk are univariate continuous variables, and partial spline models may be used if a linear parametric model can be assumed for all but one variable. This chapter introduces a nonparametric approach called the smoothing spline analysis of variance (smoothing spline ANOVA or SS ANOVA) decomposition for constructing model spaces for the multivariate function f.
The multivariate function f is defined on the product domain X = X1 × X2 × · · · × Xd. Note that each Xk is arbitrary: it may be a continuous interval, a discrete set, a unit circle, a unit sphere, or R^d. The construction of model spaces for a single variable was introduced in Chapter 2. Let H(k) be an RKHS on Xk. The choice of the marginal space H(k) depends on the domain Xk and prior knowledge about f as a function of xk. To model the joint function f, we start with the tensor product of these marginal spaces defined in the following section.
4.2 Tensor Product Reproducing Kernel Hilbert Spaces
First consider the simple case when d = 2. Denote the RKs for H(1) and H(2) as R(1) and R(2), respectively. It is known that the product of nonnegative definite functions is nonnegative definite (Gu 2002). As RKs, both R(1) and R(2) are nonnegative definite. Therefore, the bivariate function on X = X1 × X2
    R((x1, x2), (z1, z2)) ≜ R(1)(x1, z1) R(2)(x2, z2)
is nonnegative definite. By the Moore–Aronszajn theorem, there exists a unique RKHS H on X = X1 × X2 such that R is its RK. The resulting RKHS H is called the tensor product RKHS and is denoted as H(1) ⊗ H(2). For d > 2, the tensor product RKHS of H(1), . . . , H(d) on the product domain X = X1 × X2 × · · · × Xd, H(1) ⊗ H(2) ⊗ · · · ⊗ H(d), is defined recursively. Note that the RK of a tensor product space equals the product of the RKs of the marginal spaces. That is, the RK of H(1) ⊗ H(2) ⊗ · · · ⊗ H(d) equals
R(x, z) = R(1)(x1, z1)R(2)(x2, z2) · · ·R(d)(xd, zd),
where x ∈ X, z = (z1, . . . , zd) ∈ X, X = X1 × X2 × · · · × Xd, and R(k) is the RK of H(k) for k = 1, . . . , d.

For illustration, consider the ultrasound data, consisting of tongue shape measurements over time from ultrasound imaging. The data set contains observations on the response variable height (y) and three independent variables: environment (x1), length (x2), and time (x3). The variable x1 is a factor with three levels: x1 = 1, 2, 3 corresponding to 2words, cluster, and schwa, respectively. Both continuous variables x2 and x3 are scaled into [0, 1]. Interpolations of the raw data are shown in Figure 4.1.
In linguistic studies, researchers want to determine (1) how tongue shapes for an articulation differ under different environments, (2) how the tongue shape changes as a function of time, and (3) how changes over time differ under different environments. To address the first question at a fixed time point, we need to model a bivariate regression function $f(x_1, x_2)$. Assume marginal spaces $\mathbb{R}^3$ and $W_2^m[0, 1]$ for variables $x_1$ and $x_2$, respectively. Then we may consider the tensor product space $\mathbb{R}^3 \otimes W_2^m[0, 1]$ for the bivariate function $f$. To address the second question, for a fixed environment, we need to model a bivariate regression function $f(x_2, x_3)$. Assume marginal spaces $W_2^{m_1}[0, 1]$ and $W_2^{m_2}[0, 1]$ for variables
Smoothing Spline ANOVA 93
FIGURE 4.1 Ultrasound data, 3-d plots of observations (panels 2words, cluster, and schwa; axes length (mm), time (ms), and height (mm)).
$x_2$ and $x_3$, respectively. Then we may consider the tensor product space $W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$ for the bivariate function. To address the third question, we need to model a trivariate regression function $f(x_1, x_2, x_3)$. We may consider the tensor product space $\mathbb{R}^3 \otimes W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$ for the trivariate function. Analysis of the ultrasound data is given in Section 4.9.1.
The SS ANOVA decomposition decomposes a tensor product space into subspaces with a hierarchical structure similar to the main effects and interactions in classical ANOVA. The resulting hierarchical structure facilitates model selection and interpretation. Sections 4.3, 4.4, and 4.5 present SS ANOVA decompositions for a single space, the tensor product of two spaces, and the tensor product of $d$ spaces, respectively. More SS ANOVA decompositions can be found in Sections 4.9, 5.4.4, 6.3, and 9.2.4.
4.3 One-Way SS ANOVA Decomposition
SS ANOVA decompositions of tensor product RKHS's are based on decompositions of the marginal spaces for each independent variable. Therefore, decompositions for a single space are introduced first in this section. Spline models for a single independent variable were introduced in Chapter 2. Denote the independent variable as $x$ and the regression function as $f$. It was assumed that $f$ belongs to an RKHS $\mathcal{H}$. The function $f$ was decomposed into a parametric and a smooth component, $f = f_0 + f_1$, or in terms of the model space, $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$. This can be regarded as one form of the SS ANOVA decomposition. We now introduce general decompositions based on averaging operators. Consider a function space $\mathcal{H}$ on the domain $\mathcal{X}$. An operator $A$ is called an averaging operator if $A = A^2$. Although idempotent operator would be the more precise term, the term averaging operator is used because it is motivated by averaging in the classical ANOVA decomposition. Note that an averaging operator does not necessarily involve averaging. As we will see in the following subsections, the commonly used averaging operators are projection operators; thus they are idempotent.
Suppose the model space is $\mathcal{H} = \mathcal{H}_0 \oplus \mathcal{H}_1$, where $\mathcal{H}_0$ is a finite dimensional space with orthogonal basis $\phi_1(x), \ldots, \phi_p(x)$. Let $A_\nu$ be the projection operator onto the subspace $\{\phi_\nu(x)\}$ for $\nu = 1, \ldots, p$, and let $A_{p+1}$ be the projection operator onto $\mathcal{H}_1$. Then the function can be decomposed as

$$f = (A_1 + \cdots + A_p + A_{p+1})f \triangleq f_{01} + \cdots + f_{0p} + f_1. \qquad (4.2)$$

Correspondingly, the model space is decomposed into

$$\mathcal{H} = \{\phi_1(x)\} \oplus \cdots \oplus \{\phi_p(x)\} \oplus \mathcal{H}_1.$$

For simplicity, $\{\cdot\}$ represents the space spanned by the basis functions inside the braces. Some averaging operators $A_\nu$ (subspaces) can be combined. For example, combining $A_1, \ldots, A_p$ leads to the decomposition $f = f_0 + f_1$, a parametric component plus a smooth component. When $\phi_1(x) = 1$, combining $A_2, \ldots, A_{p+1}$ leads to the decomposition $f = f_{01} + \tilde f_1$, where $f_{01}$ is a constant independent of $x$, and $\tilde f_1 = f_{02} + \cdots + f_{0p} + f_1$ collects all components that depend on $x$. Therefore, $f = f_{01} + \tilde f_1$ decomposes the function into a constant plus a nonconstant function.
For the same model space, different SS ANOVA decompositions may be constructed for different purposes. In general, we denote the one-way SS ANOVA decomposition as

$$f = A_1 f + \cdots + A_r f, \qquad (4.3)$$

where $A_1 + \cdots + A_r = I$ and $I$ is the identity operator. The above equality always holds since $f = If$. Equivalently, in terms of the model space, the one-way SS ANOVA decomposition is denoted as

$$\mathcal{H} = \mathcal{H}_{(1)} \oplus \cdots \oplus \mathcal{H}_{(r)}.$$

The following subsections provide one-way SS ANOVA decompositions for some special model spaces.
4.3.1 Decomposition of $\mathbb{R}^a$: One-Way ANOVA

Suppose $x$ is a discrete variable with $a$ levels. The classical one-way mean model assumes that

$$y_{ik} = \mu_i + \epsilon_{ik}, \quad i = 1, \ldots, a; \; k = 1, \ldots, n_i, \qquad (4.4)$$

where $y_{ik}$ represents the observation of the $k$th replication at level $i$ of $x$, $\mu_i$ represents the mean at level $i$, and $\epsilon_{ik}$ represent random errors. Regarding $\mu_i$ as a function of $i$ and writing it explicitly as $f(i) \triangleq \mu_i$, $f$ is a function defined on the discrete domain $\mathcal{X} = \{1, \ldots, a\}$. It is easy to see that the model space for $f$ is the Euclidean $a$-space $\mathbb{R}^a$.
Model space construction and decomposition of $\mathbb{R}^a$

The space $\mathbb{R}^a$ is an RKHS with the inner product $(f, g) = f^T g$. Furthermore, $\mathbb{R}^a = \mathcal{H}_0 \oplus \mathcal{H}_1$, where

$$\mathcal{H}_0 = \{f : f(1) = \cdots = f(a)\}, \qquad \mathcal{H}_1 = \Big\{f : \sum_{i=1}^a f(i) = 0\Big\}, \qquad (4.5)$$

are RKHS's with corresponding RKs

$$R_0(i, j) = \frac{1}{a}, \qquad R_1(i, j) = \delta_{i,j} - \frac{1}{a}, \qquad (4.6)$$

where $\delta_{i,j}$ is the Kronecker delta.
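The two RKs in (4.6) can be tabulated directly; note that they add up to $\delta_{i,j}$, the RK of $\mathbb{R}^a$ itself under $(f, g) = f^T g$, and that as matrices both are projections. A Python sketch (stdlib only):

```python
a = 3
R0 = [[1.0 / a for j in range(a)] for i in range(a)]
R1 = [[(1.0 if i == j else 0.0) - 1.0 / a for j in range(a)] for i in range(a)]

# R0 + R1 recovers delta_{i,j}, the RK of R^a with inner product f^T g.
for i in range(a):
    for j in range(a):
        total = R0[i][j] + R1[i][j]
        assert abs(total - (1.0 if i == j else 0.0)) < 1e-12

def matmul(X, Y):
    return [[sum(X[i][k] * Y[k][j] for k in range(a)) for j in range(a)]
            for i in range(a)]

# Both kernels act as projection matrices onto H0 and H1: idempotent.
for K in (R0, R1):
    K2 = matmul(K, K)
    assert all(abs(K2[i][j] - K[i][j]) < 1e-12 for i in range(a) for j in range(a))
print("R0 + R1 = delta; both are projection matrices")
```

$R_0$ is the matrix of the averaging operator $A_1$ defined next, and $R_1$ is the centering matrix of $A_2 = I - A_1$.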
Details about the above construction can be found in Gu (2002). Define an averaging operator $A_1 : \mathbb{R}^a \to \mathbb{R}^a$ such that

$$A_1 f = \frac{1}{a} \sum_{i=1}^a f(i).$$
The operator $A_1$ maps $f$ to the constant function that equals the average over all indices. Let $A_2 = I - A_1$. The one-way ANOVA effect model is based on the following decomposition of the function $f$:

$$f = A_1 f + A_2 f \triangleq \mu + \alpha_i, \qquad (4.7)$$

where $\mu$ is the overall mean, and $\alpha_i$ is the effect at level $i$. From the definition of the averaging operator, the $\alpha_i$ satisfy the sum-to-zero side condition $\sum_{i=1}^a \alpha_i = 0$. It is clear that $A_1$ and $A_2$ are the projection operators from $\mathbb{R}^a$ onto $\mathcal{H}_0$ and $\mathcal{H}_1$ defined in (4.5). Thus, they divide $\mathbb{R}^a$ into $\mathcal{H}_0$ and $\mathcal{H}_1$.

Under the foregoing construction, $\|P_1 f\|^2 = \sum_{i=1}^a \{f(i) - \bar f\}^2$, where $\bar f = \sum_{i=1}^a f(i)/a$. For balanced designs with $n_i = n$, the solution to the PLS (2.11) is

$$f(i) = \bar y_{i\cdot} - \frac{a\lambda}{1 + a\lambda}(\bar y_{i\cdot} - \bar y_{\cdot\cdot}), \qquad (4.8)$$

where $\bar y_{i\cdot} = \sum_{k=1}^n y_{ik}/n$ and $\bar y_{\cdot\cdot} = \sum_{i=1}^a \bar y_{i\cdot}/a$. It is easy to check that when $n = 1$, $\sigma^2 = 1$, and $\lambda = (a - 3)\big/\big\{a\big(\sum_{i=1}^a \bar y_{i\cdot}^2 - a + 3\big)\big\}$, the spline estimate $\hat f$ is the James–Stein estimator (shrinkage toward the mean). Therefore, in a sense, Stein's shrinkage estimator can be regarded as a spline estimate on a discrete domain.
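The closed form (4.8) is easy to evaluate. The sketch below (Python, with made-up data) computes the spline estimate for a balanced one-way layout: $\lambda = 0$ returns the level means, while $\lambda \to \infty$ pulls every level to the grand mean.

```python
# Balanced one-way data: a levels, n replicates per level (made-up numbers).
y = [[3.1, 2.9, 3.4], [5.0, 4.6, 5.2], [1.2, 0.8, 1.0]]
a, n = len(y), len(y[0])
ybar_i = [sum(row) / n for row in y]   # level means ybar_{i.}
ybar = sum(ybar_i) / a                 # grand mean ybar_{..}

def fit(lam):
    """Spline estimate (4.8): shrink level means toward the grand mean."""
    shrink = a * lam / (1 + a * lam)
    return [yi - shrink * (yi - ybar) for yi in ybar_i]

assert fit(0.0) == ybar_i                          # no smoothing: level means
assert all(abs(v - ybar) < 1e-6 for v in fit(1e12))  # heavy smoothing: grand mean
print([round(v, 3) for v in fit(1.0)])
```

Intermediate values of $\lambda$ interpolate between the two extremes, which is exactly the shrinkage behavior of the James–Stein estimator.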
There exist other ways to decompose $\mathbb{R}^a$. For example, the averaging operator $A_1 f = f(1)$ leads to the same decomposition as (4.7) with the set-to-zero side condition $\alpha_1 = 0$ (Gu 2002).
4.3.2 Decomposition of $W_2^m[a, b]$

Under the construction in Section 2.2, let

$$A_\nu f(x) = f^{(\nu-1)}(a) \frac{(x - a)^{\nu-1}}{(\nu - 1)!}, \quad \nu = 1, \ldots, m. \qquad (4.9)$$

It is easy to check that $A_\nu^2 = A_\nu$. Thus, they are averaging operators. In fact, $A_\nu$ is the projection operator onto $\{(x - a)^{\nu-1}/(\nu - 1)!\}$. Let $A_{m+1} = I - A_1 - \cdots - A_m$. The decomposition

$$f = A_1 f + \cdots + A_m f + A_{m+1} f$$

corresponds to the Taylor expansion (1.5). It decomposes the model space $W_2^m[a, b]$ into

$$W_2^m[a, b] = \{1\} \oplus \{x - a\} \oplus \cdots \oplus \{(x - a)^{m-1}/(m - 1)!\} \oplus \mathcal{H}_1,$$

where $\mathcal{H}_1$ is given in (2.3). For $f \in \mathcal{H}_1$, the conditions $f^{(\nu)}(a) = 0$ for $\nu = 0, \ldots, m - 1$ are analogous to the set-to-zero condition in the classical one-way ANOVA model.
Under the construction in Section 2.6 for $W_2^m[0, 1]$, let

$$A_\nu f(x) = \Big\{\int_0^1 f^{(\nu-1)}(u)\,du\Big\} k_{\nu-1}(x), \quad \nu = 1, \ldots, m.$$

Again, $A_\nu$ is an averaging (projection) operator extracting the polynomial of order $\nu$. In particular, $A_1 f = \int_0^1 f(u)\,du$ is a natural extension of the averaging in the discrete domain. Let $A_{m+1} = I - A_1 - \cdots - A_m$. The decomposition

$$f = A_1 f + \cdots + A_m f + A_{m+1} f$$

decomposes the model space $W_2^m[0, 1]$ into

$$W_2^m[0, 1] = \{1\} \oplus \{k_1(x)\} \oplus \cdots \oplus \{k_{m-1}(x)\} \oplus \mathcal{H}_1,$$

where $\mathcal{H}_1$ is given in (2.29). For $f \in \mathcal{H}_1$, the conditions $\int_0^1 f^{(\nu)}\,dx = 0$ for $\nu = 0, \ldots, m - 1$ are analogous to the sum-to-zero condition in the classical one-way ANOVA model.
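For $m = 2$ the sum-to-zero claim can be verified by numerical quadrature. In the Python sketch below (stdlib only; $k_1(x) = x - 0.5$ is assumed for the first-order kernel of Section 2.6), the smooth component $f_1 = f - A_1 f - A_2 f$ of an example function satisfies $\int_0^1 f_1\,dx = \int_0^1 f_1'\,dx = 0$:

```python
import math

def simpson(g, n=2000):
    """Composite Simpson rule on [0, 1]."""
    h = 1.0 / n
    s = g(0.0) + g(1.0)
    s += sum((4 if i % 2 else 2) * g(i * h) for i in range(1, n))
    return s * h / 3

f = lambda x: math.sin(2 * x) + x ** 2        # example function
fp = lambda x: 2 * math.cos(2 * x) + 2 * x    # its derivative

c0 = simpson(f)                               # A_1 f = int_0^1 f du (a constant)
c1 = simpson(fp)                              # int_0^1 f' du, coefficient of k_1(x)
f1 = lambda x: f(x) - c0 - c1 * (x - 0.5)     # smooth component, k_1(x) = x - 0.5
f1p = lambda x: fp(x) - c1                    # its derivative

assert abs(simpson(f1)) < 1e-8                # sum-to-zero: int f1 = 0
assert abs(simpson(f1p)) < 1e-8               # and int f1' = 0
print("smooth component integrates to zero, as does its derivative")
```

The same pattern extends to higher $m$: each $A_\nu$ removes one integrated-derivative condition from the remainder.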
4.3.3 Decomposition of $W_2^m(\mathrm{per})$

Under the construction in Section 2.7, let

$$A_1 f = \int_0^1 f\,du$$

be an averaging (projection) operator, and let $A_2 = I - A_1$. The decomposition

$$f = A_1 f + A_2 f$$

decomposes the model space as

$$W_2^m(\mathrm{per}) = \{1\} \oplus \Big\{f \in W_2^m(\mathrm{per}) : \int_0^1 f\,du = 0\Big\}.$$
4.3.4 Decomposition of $W_2^m(\mathbb{R}^d)$

For simplicity, consider the special case with $d = 2$ and $m = 2$. Decompositions for general $d$ and $m$ can be derived similarly. Let $\phi_1(x) = 1$, $\phi_2(x) = x_1$, and $\phi_3(x) = x_2$ be polynomials of total degree less than $m = 2$, where $\phi_2$ and $\phi_3$ have been orthonormalized such that $(\phi_\nu, \phi_\mu)_0 = \delta_{\nu,\mu}$ under the norm (2.41). Define two averaging operators

$$A_1 f(x) = \sum_{j=1}^J w_j f(u_j), \qquad A_2 f(x) = \sum_{j=1}^J w_j f(u_j)\{\phi_2(u_j)\phi_2(x) + \phi_3(u_j)\phi_3(x)\}, \qquad (4.10)$$

where $u_j$ are fixed points in $\mathbb{R}^2$, and $w_j$ are fixed positive weights such that $\sum_{j=1}^J w_j = 1$. It is clear that $A_1$ and $A_2$ are projection operators onto the spaces $\{\phi_1\}$ and $\{\phi_2, \phi_3\}$, respectively. To see how they generalize averaging operators, define a probability measure $\mu$ on $\mathcal{X} = \mathbb{R}^2$ by assigning probability $w_j$ to the point $u_j$, $j = 1, \ldots, J$. Then $A_1 f(x) = \int_{\mathbb{R}^2} f \phi_1\,d\mu$ and $A_2 f(x) = \big(\int_{\mathbb{R}^2} f \phi_2\,d\mu\big)\phi_2(x) + \big(\int_{\mathbb{R}^2} f \phi_3\,d\mu\big)\phi_3(x)$. Therefore, $A_1$ and $A_2$ take averages with respect to the discrete probability measure $\mu$. In particular, $\mu$ puts mass $1/n$ on the design points $x_j$ when $J = n$, $w_j = 1/n$, and $u_j = x_j$. A continuous density on $\mathbb{R}^2$ could be used instead; however, the resulting integrals usually do not have closed forms, and approximations such as a quadrature formula would have to be used, which is essentially equivalent to using an approximate discrete probability measure.
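A small numeric sketch of the discrete-measure construction (Python, with hypothetical points $u_j$): Gram–Schmidt under the inner product $(\phi_\nu, \phi_\mu)_0 = \sum_j w_j \phi_\nu(u_j)\phi_\mu(u_j)$ produces the orthonormalized $\phi_2, \phi_3$, and $A_2$ then acts as a projection, reproducing $\phi_2$ exactly:

```python
import random

random.seed(1)
J = 7
U = [(random.random(), random.random()) for _ in range(J)]  # hypothetical points u_j
w = [1.0 / J] * J                                           # weights w_j, sum to 1

def inner(g, h):  # (g, h)_0 = sum_j w_j g(u_j) h(u_j)
    return sum(wj * g(u) * h(u) for wj, u in zip(w, U))

phi1 = lambda x: 1.0                                        # already unit norm
raw2, raw3 = (lambda x: x[0]), (lambda x: x[1])

def orthonormalize(raw, basis):
    """Gram-Schmidt step under the discrete inner product."""
    proj = [inner(raw, b) for b in basis]
    g = lambda x, raw=raw, proj=proj, basis=tuple(basis): \
        raw(x) - sum(c * b(x) for c, b in zip(proj, basis))
    nrm = inner(g, g) ** 0.5
    return lambda x, g=g, nrm=nrm: g(x) / nrm

phi2 = orthonormalize(raw2, [phi1])
phi3 = orthonormalize(raw3, [phi1, phi2])
assert abs(inner(phi2, phi2) - 1) < 1e-10 and abs(inner(phi2, phi3)) < 1e-10

# Averaging operators (4.10); A2 reproduces phi2 exactly (it is a projection).
A1 = lambda f: inner(f, phi1)
A2 = lambda f: (lambda x, c2=inner(f, phi2), c3=inner(f, phi3):
                c2 * phi2(x) + c3 * phi3(x))
g = A2(phi2)
assert all(abs(g(u) - phi2(u)) < 1e-10 for u in U)
print("A1, A2 behave as projections under the discrete measure")
```

With $w_j = 1/n$ and $u_j = x_j$ this is exactly the design-point measure mentioned in the text.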
Let $A_3 = I - A_1 - A_2$. The decomposition

$$f = A_1 f + A_2 f + A_3 f$$

divides the model space as

$$W_2^m(\mathbb{R}^2) = \{1\} \oplus \{\phi_2, \phi_3\} \oplus \{f \in W_2^m(\mathbb{R}^2) : J_2^2(f) = 0\},$$

where $J_m^d(f)$ is defined in (2.36).
4.4 Two-Way SS ANOVA Decomposition
Suppose there are two independent variables, $x_1 \in \mathcal{X}_1$ and $x_2 \in \mathcal{X}_2$. Consider the tensor product space $\mathcal{H}^{(1)} \otimes \mathcal{H}^{(2)}$ on the product domain $\mathcal{X}_1 \times \mathcal{X}_2$. For $f$ as a marginal function of $x_k$, assume the following one-way decomposition based on Section 4.3:

$$f = A_1^{(k)} f + \cdots + A_{r_k}^{(k)} f, \quad k = 1, 2, \qquad (4.11)$$

where the $A_j^{(k)}$ are averaging operators on $\mathcal{H}^{(k)}$ and $\sum_{j=1}^{r_k} A_j^{(k)} = I$. Then, for the joint function, we have

$$f = \big\{A_1^{(1)} + \cdots + A_{r_1}^{(1)}\big\}\big\{A_1^{(2)} + \cdots + A_{r_2}^{(2)}\big\} f = \sum_{j_1=1}^{r_1} \sum_{j_2=1}^{r_2} A_{j_1}^{(1)} A_{j_2}^{(2)} f. \qquad (4.12)$$

The above decomposition of the bivariate function $f$ is referred to as the two-way SS ANOVA decomposition.
Denote

$$\mathcal{H}^{(k)} = \mathcal{H}^{(k)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(k)}_{(r_k)}, \quad k = 1, 2,$$

as the one-way decomposition of $\mathcal{H}^{(k)}$ associated with (4.11). Then, (4.12) decomposes the tensor product space

$$\mathcal{H}^{(1)} \otimes \mathcal{H}^{(2)} = \big\{\mathcal{H}^{(1)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(1)}_{(r_1)}\big\} \otimes \big\{\mathcal{H}^{(2)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(2)}_{(r_2)}\big\} = \bigoplus_{j_1=1}^{r_1} \bigoplus_{j_2=1}^{r_2} \mathcal{H}^{(1)}_{(j_1)} \otimes \mathcal{H}^{(2)}_{(j_2)}.$$

Consider the special case when $r_k = 2$ for $k = 1, 2$. Assume that $A_1^{(k)} f$ is independent of $x_k$, or equivalently, $\mathcal{H}^{(k)}_{(1)} = \{1\}$. Then the decomposition (4.12) can be written as

$$f = A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f \triangleq \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2), \qquad (4.13)$$

where $\mu$ represents the grand mean, $f_1(x_1)$ and $f_2(x_2)$ represent the main effects of $x_1$ and $x_2$, respectively, and $f_{12}(x_1, x_2)$ represents the interaction between $x_1$ and $x_2$.

For general $r_k$, assuming $A_1^{(k)} f$ is independent of $x_k$, the decomposition (4.13) can be derived by combining the operators $A_2^{(k)}, \ldots, A_{r_k}^{(k)}$ into one averaging operator $\tilde A_2^{(k)} = A_2^{(k)} + \cdots + A_{r_k}^{(k)}$. Therefore, decomposition (4.13) combines components in (4.12) and reorganizes them into the overall main effects and interactions.

The following subsections provide two-way SS ANOVA decompositions for combinations of some special model spaces.
4.4.1 Decomposition of $\mathbb{R}^a \otimes \mathbb{R}^b$: Two-Way ANOVA

Suppose both $x_1$ and $x_2$ are discrete variables with $a$ and $b$ levels, respectively. The classical two-way mean model assumes that

$$y_{ijk} = \mu_{ij} + \epsilon_{ijk}, \quad i = 1, \ldots, a; \; j = 1, \ldots, b; \; k = 1, \ldots, n_{ij},$$

where $y_{ijk}$ represents the observation of the $k$th replication at level $i$ of $x_1$ and level $j$ of $x_2$, $\mu_{ij}$ represents the mean at level $i$ of $x_1$ and level $j$ of $x_2$, and $\epsilon_{ijk}$ represent random errors.

Regarding $\mu_{ij}$ as a bivariate function of $(i, j)$ and letting $f(i, j) \triangleq \mu_{ij}$, $f$ is a bivariate function defined on the product domain $\mathcal{X} = \{1, \ldots, a\} \times \{1, \ldots, b\}$. The model space for $f$ is the tensor product space $\mathbb{R}^a \otimes \mathbb{R}^b$. Define two averaging operators $A_1^{(1)}$ and $A_1^{(2)}$ such that

$$A_1^{(1)} f = \frac{1}{a} \sum_{i=1}^a f(i, j), \qquad A_1^{(2)} f = \frac{1}{b} \sum_{j=1}^b f(i, j).$$

$A_1^{(1)}$ and $A_1^{(2)}$ map $f$ to univariate functions by averaging over all levels of $x_1$ and $x_2$, respectively. Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_2^{(2)} = I - A_1^{(2)}$. Then the classical two-way ANOVA effect model is based on the following SS ANOVA decomposition of the function $f$:

$$\begin{aligned} f &= (A_1^{(1)} + A_2^{(1)})(A_1^{(2)} + A_2^{(2)}) f \\ &= A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f \\ &\triangleq \mu + \alpha_i + \beta_j + (\alpha\beta)_{ij}, \end{aligned} \qquad (4.14)$$

where $\mu$ represents the overall mean, $\alpha_i$ represents the main effect of $x_1$, $\beta_j$ represents the main effect of $x_2$, and $(\alpha\beta)_{ij}$ represents the interaction between $x_1$ and $x_2$. The sum-to-zero side conditions are satisfied by the definition of the averaging operators.
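The decomposition (4.14) is exactly the classical two-way effect decomposition. The Python sketch below computes it for a small table of cell means (made-up numbers) via the averaging operators; the sum-to-zero side conditions hold by construction:

```python
# Cell means mu_ij for a = 2 levels of x1 and b = 3 levels of x2 (made-up).
f = [[4.0, 6.0, 5.0],
     [2.0, 8.0, 3.0]]
a, b = len(f), len(f[0])

row = [sum(f[i]) / b for i in range(a)]                        # A_1^(2) f
col = [sum(f[i][j] for i in range(a)) / a for j in range(b)]   # A_1^(1) f
mu = sum(row) / a                                              # A_1^(1) A_1^(2) f

alpha = [r - mu for r in row]                                  # main effect of x1
beta = [c - mu for c in col]                                   # main effect of x2
inter = [[f[i][j] - mu - alpha[i] - beta[j] for j in range(b)]
         for i in range(a)]                                    # interaction

# Sum-to-zero side conditions and exact reconstruction of the table.
assert abs(sum(alpha)) < 1e-12 and abs(sum(beta)) < 1e-12
assert all(abs(sum(inter[i])) < 1e-12 for i in range(a))
assert all(abs(sum(inter[i][j] for i in range(a))) < 1e-12 for j in range(b))
assert all(abs(mu + alpha[i] + beta[j] + inter[i][j] - f[i][j]) < 1e-12
           for i in range(a) for j in range(b))
print(mu, alpha, beta)
```

The SS ANOVA decompositions in the following subsections replace one or both discrete averages by integrals, but the arithmetic pattern is the same.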
Based on the one-way ANOVA decomposition in Section 4.3.1, we have $\mathbb{R}^a = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1$ and $\mathbb{R}^b = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1$, where $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(2)}_0$ are subspaces containing constant functions, and $\mathcal{H}^{(1)}_1$ and $\mathcal{H}^{(2)}_1$ are the orthogonal complements of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(2)}_0$, respectively. The classical two-way ANOVA model decomposes the tensor product space

$$\begin{aligned} \mathbb{R}^a \otimes \mathbb{R}^b &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\}. \end{aligned} \qquad (4.15)$$

The four subspaces in (4.15) contain the components $\mu$, $\alpha_i$, $\beta_j$, and $(\alpha\beta)_{ij}$ in (4.14), respectively.
4.4.2 Decomposition of $\mathbb{R}^a \otimes W_2^m[0, 1]$

Suppose $x_1$ is a discrete variable with $a$ levels, and $x_2$ is a continuous variable in $[0, 1]$. A natural model space for $x_1$ is $\mathbb{R}^a$, and a natural model space for $x_2$ is $W_2^m[0, 1]$. Therefore, we consider the tensor product space $\mathbb{R}^a \otimes W_2^m[0, 1]$ for the bivariate regression function $f(x_1, x_2)$. For simplicity, we derive SS ANOVA decompositions for $m = 1$ and $m = 2$ only. SS ANOVA decompositions for higher-order $m$ can be derived similarly. In this and the remaining sections, the construction in Section 2.6 for the marginal space $W_2^m[0, 1]$ will be used. Similar SS ANOVA decompositions can be derived under the construction in Section 2.2.

Consider the tensor product space $\mathbb{R}^a \otimes W_2^1[0, 1]$ first. Define two averaging operators $A_1^{(1)}$ and $A_1^{(2)}$ as

$$A_1^{(1)} f = \frac{1}{a} \sum_{x_1=1}^a f, \qquad A_1^{(2)} f = \int_0^1 f\,dx_2,$$

where $A_1^{(1)}$ and $A_1^{(2)}$ extract the constant term out of all possible functions for each variable. Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_2^{(2)} = I - A_1^{(2)}$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f \\ &\triangleq \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2). \end{aligned} \qquad (4.16)$$
Obviously, (4.16) is a natural extension of the classical two-way ANOVA decomposition (4.14) from the product of two discrete domains to the product of one discrete and one continuous domain. As in the classical ANOVA model, the components in (4.16) have nice interpretations: $\mu$ represents the overall mean, $f_1(x_1)$ represents the main effect of $x_1$, $f_2(x_2)$ represents the main effect of $x_2$, and $f_{12}(x_1, x_2)$ represents the interaction. The components also have nice interpretations collectively: $\mu + f_2(x_2)$ represents the mean curve among all levels of $x_1$, and $f_1(x_1) + f_{12}(x_1, x_2)$ represents the departure from the mean curve at level $x_1$. Write $\mathbb{R}^a = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1$ and $W_2^1[0, 1] = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1$, where $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ are given in (4.5), $\mathcal{H}^{(2)}_0 = \{1\}$, and $\mathcal{H}^{(2)}_1 = \{f \in W_2^1[0, 1] : \int_0^1 f\,du = 0\}$. Then, in terms of the model space, (4.16) decomposes

$$\begin{aligned} \mathbb{R}^a \otimes W_2^1[0, 1] &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3. \end{aligned} \qquad (4.17)$$

To fit model (4.16), we need to find a basis for $\mathcal{H}_0$ and RKs for $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{H}_3$. It is clear that $\mathcal{H}_0$ contains all constant functions. Thus, $\mathcal{H}_0$ is a one-dimensional space with basis $\phi(x) = 1$. The RKs of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ are given in (4.6), and the RKs of $\mathcal{H}^{(2)}_0$ and $\mathcal{H}^{(2)}_1$ are given in Table 2.2. The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{H}_3$ can be calculated using the fact that the RK of a tensor product space equals the product of the RKs of the involved marginal spaces. For example, the RK of $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$ equals $(\delta_{x_1, z_1} - 1/a)\{k_1(x_2)k_1(z_2) + k_2(|x_2 - z_2|)\}$.
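This product RK is easy to code. The Python sketch below assumes the Section 2.6 kernels $k_1(x) = x - 0.5$ and $k_2(x) = \{k_1(x)^2 - 1/12\}/2$ (scaled Bernoulli polynomials, as tabulated in Table 2.2), and spot-checks symmetry and nonnegative quadratic forms of the resulting RK of $\mathcal{H}_3$:

```python
import random

a = 3
k1 = lambda x: x - 0.5
k2 = lambda x: (k1(x) ** 2 - 1.0 / 12) / 2  # scaled Bernoulli polynomial (assumed)

def R3(x, z):
    """RK of H3 = H1^(1) (x) H1^(2): discrete part times W_2^1[0,1] smooth part."""
    discrete = (1.0 if x[0] == z[0] else 0.0) - 1.0 / a
    smooth = k1(x[1]) * k1(z[1]) + k2(abs(x[1] - z[1]))
    return discrete * smooth

random.seed(2)
pts = [(random.randint(1, a), random.random()) for _ in range(15)]
K = [[R3(p, q) for q in pts] for p in pts]
assert all(abs(K[i][j] - K[j][i]) < 1e-12 for i in range(15) for j in range(15))
for _ in range(50):
    c = [random.gauss(0, 1) for _ in pts]
    qf = sum(c[i] * K[i][j] * c[j] for i in range(15) for j in range(15))
    assert qf >= -1e-9
print("RK of H3 is symmetric with nonnegative quadratic forms")
```

The RKs of $\mathcal{H}_1$ and $\mathcal{H}_2$ follow the same pattern with the corresponding marginal kernels swapped in.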
Now suppose we want to model the effect of $x_2$ using the cubic spline space $W_2^2[0, 1]$. Consider the tensor product space $\mathbb{R}^a \otimes W_2^2[0, 1]$. Define three averaging operators $A_1^{(1)}$, $A_1^{(2)}$, and $A_2^{(2)}$ as

$$A_1^{(1)} f = \frac{1}{a} \sum_{x_1=1}^a f, \qquad A_1^{(2)} f = \int_0^1 f\,dx_2, \qquad A_2^{(2)} f = \Big(\int_0^1 f'\,dx_2\Big)(x_2 - 0.5),$$

where $A_1^{(1)}$ and $A_1^{(2)}$ extract the constant function out of all possible functions for each variable, and $A_2^{(2)}$ extracts the linear function for $x_2$. Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_3^{(2)} = I - A_1^{(2)} - A_2^{(2)}$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)} + A_3^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_1^{(1)} A_3^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_2^{(1)} A_2^{(2)} f + A_2^{(1)} A_3^{(2)} f \\ &\triangleq \mu + \beta \times (x_2 - 0.5) + f_2^s(x_2) + f_1(x_1) + \gamma_{x_1} \times (x_2 - 0.5) + f_{12}^{ss}(x_1, x_2), \end{aligned} \qquad (4.18)$$

where $\mu$ represents the overall mean, $f_1(x_1)$ represents the main effect of $x_1$, $\beta \times (x_2 - 0.5)$ represents the linear main effect of $x_2$, $f_2^s(x_2)$ represents the smooth main effect of $x_2$, $\gamma_{x_1} \times (x_2 - 0.5)$ represents the smooth–linear interaction, and $f_{12}^{ss}(x_1, x_2)$ represents the smooth–smooth interaction. The overall main effect of $x_2$ is

$$f_2(x_2) = \beta \times (x_2 - 0.5) + f_2^s(x_2),$$

and the overall interaction between $x_1$ and $x_2$ is

$$f_{12}(x_1, x_2) = \gamma_{x_1} \times (x_2 - 0.5) + f_{12}^{ss}(x_1, x_2).$$

It is obvious that $f_2$ and $f_{12}$ are the results of combining the averaging operators $A_2^{(2)}$ and $A_3^{(2)}$. One may look at the components in the overall main effects and interactions to decide whether to include them in the model. The first three terms in (4.18) represent the mean curve among all levels of $x_1$, and the last three terms represent the departure from the mean curve. The simple ANCOVA (analysis of covariance) model with $x_2$ modeled by a straight line is a special case of (4.18) with $f_2^s = f_{12}^{ss} = 0$. Thus, checking whether $f_2^s$ and $f_{12}^{ss}$ are negligible provides a diagnostic tool for the ANCOVA model. Write $\mathbb{R}^a = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1$ and $W_2^2[0, 1] = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1 \oplus \mathcal{H}^{(2)}_2$, where $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ are given in (4.5), $\mathcal{H}^{(2)}_0 = \{1\}$, $\mathcal{H}^{(2)}_1 = \{x_2 - 0.5\}$, and $\mathcal{H}^{(2)}_2 = \{f \in W_2^2[0, 1] : \int_0^1 f\,du = \int_0^1 f'\,du = 0\}$. Then, in terms of the model space, (4.18) decomposes

$$\begin{aligned} \mathbb{R}^a \otimes W_2^2[0, 1] &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1 \oplus \mathcal{H}^{(2)}_2\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_2\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_2\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3 \oplus \mathcal{H}_4, \end{aligned} \qquad (4.19)$$

where $\mathcal{H}_0 = \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\} \oplus \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\}$, $\mathcal{H}_1 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_2$, $\mathcal{H}_2 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$, and $\mathcal{H}_4 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_2$. It is easy to see that $\mathcal{H}_0$ is a two-dimensional space with basis functions $\phi_1(x) = 1$ and $\phi_2(x) = x_2 - 0.5$. The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$, and $\mathcal{H}_4$ can be calculated from the RKs of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ given in (4.6) and the RKs of $\mathcal{H}^{(2)}_0$, $\mathcal{H}^{(2)}_1$, and $\mathcal{H}^{(2)}_2$ given in Table 2.2.
4.4.3 Decomposition of $W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$

Suppose both $x_1$ and $x_2$ are continuous variables in $[0, 1]$. $W_2^m[0, 1]$ is a natural model space for the effects of both $x_1$ and $x_2$. Therefore, we consider the tensor product space $W_2^{m_1}[0, 1] \otimes W_2^{m_2}[0, 1]$. For simplicity, we derive SS ANOVA decompositions for the combinations $m_1 = m_2 = 1$ and $m_1 = m_2 = 2$ only. SS ANOVA decompositions for other combinations of $m_1$ and $m_2$ can be derived similarly.

Consider the tensor product space $W_2^1[0, 1] \otimes W_2^1[0, 1]$ first. Define two averaging operators as

$$A_1^{(k)} f = \int_0^1 f\,dx_k, \quad k = 1, 2,$$

where $A_1^{(k)}$ extracts the constant term out of all possible functions of $x_k$. Let $A_2^{(k)} = I - A_1^{(k)}$ for $k = 1, 2$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f \\ &\triangleq \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2). \end{aligned} \qquad (4.20)$$

Obviously, (4.20) is a natural extension of the classical two-way ANOVA decomposition (4.14) from the product of two discrete domains to the product of two continuous domains. The components $\mu$, $f_1(x_1)$, $f_2(x_2)$, and $f_{12}(x_1, x_2)$ represent the overall mean, the main effect of $x_1$, the main effect of $x_2$, and the interaction between $x_1$ and $x_2$, respectively. In terms of the model space, (4.20) decomposes

$$\begin{aligned} W_2^1[0, 1] \otimes W_2^1[0, 1] &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3, \end{aligned}$$

where $\mathcal{H}^{(k)}_0 = \{1\}$ and $\mathcal{H}^{(k)}_1 = \{f \in W_2^1[0, 1] : \int_0^1 f\,dx_k = 0\}$ for $k = 1, 2$; $\mathcal{H}_0 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_1 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_2 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1$, and $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$. $\mathcal{H}_0$ is a one-dimensional space with basis $\phi(x) = 1$. The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{H}_3$ can be calculated from the RKs of $\mathcal{H}^{(k)}_0$ and $\mathcal{H}^{(k)}_1$ given in Table 2.2.
Now suppose we want to model both $x_1$ and $x_2$ using cubic splines. That is, we consider the tensor product space $W_2^2[0, 1] \otimes W_2^2[0, 1]$. Define four averaging operators

$$A_1^{(k)} f = \int_0^1 f\,dx_k, \qquad A_2^{(k)} f = \Big(\int_0^1 f'\,dx_k\Big)(x_k - 0.5), \quad k = 1, 2,$$

where $A_1^{(1)}$ and $A_1^{(2)}$ extract the constant function out of all possible functions for each variable, and $A_2^{(1)}$ and $A_2^{(2)}$ extract the linear function for each variable. Let $A_3^{(k)} = I - A_1^{(k)} - A_2^{(k)}$ for $k = 1, 2$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)} + A_3^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)} + A_3^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_1^{(1)} A_3^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_2^{(1)} A_2^{(2)} f + A_2^{(1)} A_3^{(2)} f + A_3^{(1)} A_1^{(2)} f + A_3^{(1)} A_2^{(2)} f + A_3^{(1)} A_3^{(2)} f \\ &\triangleq \mu + \beta_2 \times (x_2 - 0.5) + f_2^s(x_2) + \beta_1 \times (x_1 - 0.5) + \beta_3 \times (x_1 - 0.5)(x_2 - 0.5) + f_{12}^{ls}(x_1, x_2) + f_1^s(x_1) + f_{12}^{sl}(x_1, x_2) + f_{12}^{ss}(x_1, x_2), \end{aligned} \qquad (4.21)$$

where $\mu$ represents the overall mean; $\beta_1 \times (x_1 - 0.5)$ and $\beta_2 \times (x_2 - 0.5)$ represent the linear main effects of $x_1$ and $x_2$; $f_1^s(x_1)$ and $f_2^s(x_2)$ represent the smooth main effects of $x_1$ and $x_2$; and $\beta_3 \times (x_1 - 0.5)(x_2 - 0.5)$, $f_{12}^{ls}(x_1, x_2)$, $f_{12}^{sl}(x_1, x_2)$, and $f_{12}^{ss}(x_1, x_2)$ represent the linear–linear, linear–smooth, smooth–linear, and smooth–smooth interactions between $x_1$ and $x_2$, respectively. The overall main effect of $x_k$ is

$$f_k(x_k) = \beta_k \times (x_k - 0.5) + f_k^s(x_k), \quad k = 1, 2,$$

and the overall interaction between $x_1$ and $x_2$ is

$$f_{12}(x_1, x_2) = \beta_3 \times (x_1 - 0.5)(x_2 - 0.5) + f_{12}^{ls}(x_1, x_2) + f_{12}^{sl}(x_1, x_2) + f_{12}^{ss}(x_1, x_2).$$

The simple regression model with both $x_1$ and $x_2$ modeled by straight lines is a special case of (4.21) with $f_1^s = f_2^s = f_{12}^{ls} = f_{12}^{sl} = f_{12}^{ss} = 0$. In terms of the model space, (4.21) decomposes

$$\begin{aligned} W_2^2[0, 1] \otimes W_2^2[0, 1] &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1 \oplus \mathcal{H}^{(1)}_2\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1 \oplus \mathcal{H}^{(2)}_2\big\} \\ &= \bigoplus_{j_1=0}^{2} \bigoplus_{j_2=0}^{2} \mathcal{H}^{(1)}_{j_1} \otimes \mathcal{H}^{(2)}_{j_2} \triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3 \oplus \mathcal{H}_4 \oplus \mathcal{H}_5, \end{aligned}$$

where $\mathcal{H}^{(k)}_0 = \{1\}$, $\mathcal{H}^{(k)}_1 = \{x_k - 0.5\}$, and $\mathcal{H}^{(k)}_2 = \{f \in W_2^2[0, 1] : \int_0^1 f\,dx_k = \int_0^1 f'\,dx_k = 0\}$ for $k = 1, 2$; $\mathcal{H}_0 = \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\} \oplus \{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\} \oplus \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\} \oplus \{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\}$; $\mathcal{H}_1 = \mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_0$; $\mathcal{H}_2 = \mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_1$; $\mathcal{H}_3 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_2$; $\mathcal{H}_4 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_2$; and $\mathcal{H}_5 = \mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_2$. $\mathcal{H}_0$ is a four-dimensional space with basis functions $\phi_1(x) = 1$, $\phi_2(x) = x_1 - 0.5$, $\phi_3(x) = x_2 - 0.5$, and $\phi_4(x) = (x_1 - 0.5)(x_2 - 0.5)$. The RKs of $\mathcal{H}_1, \ldots, \mathcal{H}_5$ can be calculated from the RKs of $\mathcal{H}^{(k)}_0$, $\mathcal{H}^{(k)}_1$, and $\mathcal{H}^{(k)}_2$ given in Table 2.2.
4.4.4 Decomposition of $\mathbb{R}^a \otimes W_2^m(\mathrm{per})$

Suppose $x_1$ is a discrete variable with $a$ levels, and $x_2$ is a continuous variable in $[0, 1]$. In addition, suppose that $f$ is a periodic function of $x_2$. A natural model space for $x_1$ is $\mathbb{R}^a$, and a natural model space for $x_2$ is $W_2^m(\mathrm{per})$. Therefore, we consider the tensor product space $\mathbb{R}^a \otimes W_2^m(\mathrm{per})$.

Define two averaging operators $A_1^{(1)}$ and $A_1^{(2)}$ as

$$A_1^{(1)} f = \frac{1}{a} \sum_{x_1=1}^a f, \qquad A_1^{(2)} f = \int_0^1 f\,dx_2.$$

Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_2^{(2)} = I - A_1^{(2)}$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f \\ &\triangleq \mu + f_1(x_1) + f_2(x_2) + f_{12}(x_1, x_2), \end{aligned} \qquad (4.22)$$

where $\mu$ represents the overall mean, $f_1(x_1)$ represents the main effect of $x_1$, $f_2(x_2)$ represents the main effect of $x_2$, and $f_{12}(x_1, x_2)$ represents the interaction between $x_1$ and $x_2$. Write $\mathbb{R}^a = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1$ and $W_2^m(\mathrm{per}) = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1$, where $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ are given in (4.5), $\mathcal{H}^{(2)}_0 = \{1\}$, and $\mathcal{H}^{(2)}_1 = \{f \in W_2^m(\mathrm{per}) : \int_0^1 f\,du = 0\}$. Then, in terms of the model space, (4.22) decomposes

$$\begin{aligned} \mathbb{R}^a \otimes W_2^m(\mathrm{per}) &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3, \end{aligned}$$

where $\mathcal{H}_0 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_1 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_2 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1$, and $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$. $\mathcal{H}_0$ is a one-dimensional space with basis function $\phi(x) = 1$. The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, and $\mathcal{H}_3$ can be calculated from the RKs of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ given in (4.6) and the RKs of $\mathcal{H}^{(2)}_0$ and $\mathcal{H}^{(2)}_1$ given in (2.33).
4.4.5 Decomposition of $W_2^{m_1}(\mathrm{per}) \otimes W_2^{m_2}[0, 1]$

Suppose both $x_1$ and $x_2$ are continuous variables in $[0, 1]$. In addition, suppose $f$ is a periodic function of $x_1$. A natural model space for $x_1$ is $W_2^{m_1}(\mathrm{per})$, and a natural model space for $x_2$ is $W_2^{m_2}[0, 1]$. Therefore, we consider the tensor product space $W_2^{m_1}(\mathrm{per}) \otimes W_2^{m_2}[0, 1]$. For simplicity, we derive the SS ANOVA decomposition for $m_2 = 2$ only.

Define three averaging operators

$$A_1^{(1)} f = \int_0^1 f\,dx_1, \qquad A_1^{(2)} f = \int_0^1 f\,dx_2, \qquad A_2^{(2)} f = \Big(\int_0^1 f'\,dx_2\Big)(x_2 - 0.5).$$

Let $A_2^{(1)} = I - A_1^{(1)}$ and $A_3^{(2)} = I - A_1^{(2)} - A_2^{(2)}$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)} + A_3^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_1^{(1)} A_3^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_2^{(1)} A_2^{(2)} f + A_2^{(1)} A_3^{(2)} f \\ &\triangleq \mu + \beta \times (x_2 - 0.5) + f_2^s(x_2) + f_1(x_1) + f_{12}^{sl}(x_1, x_2) + f_{12}^{ss}(x_1, x_2), \end{aligned} \qquad (4.23)$$

where $\mu$ represents the overall mean, $f_1(x_1)$ represents the main effect of $x_1$, $\beta \times (x_2 - 0.5)$ and $f_2^s(x_2)$ represent the linear and smooth main effects of $x_2$, and $f_{12}^{sl}(x_1, x_2)$ and $f_{12}^{ss}(x_1, x_2)$ represent the smooth–linear and smooth–smooth interactions. The overall main effect of $x_2$ is

$$f_2(x_2) = \beta \times (x_2 - 0.5) + f_2^s(x_2),$$

and the overall interaction between $x_1$ and $x_2$ is

$$f_{12}(x_1, x_2) = f_{12}^{sl}(x_1, x_2) + f_{12}^{ss}(x_1, x_2).$$

Write $W_2^{m_1}(\mathrm{per}) = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1$ and $W_2^2[0, 1] = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1 \oplus \mathcal{H}^{(2)}_2$, where $\mathcal{H}^{(1)}_0 = \{1\}$, $\mathcal{H}^{(1)}_1 = \{f \in W_2^{m_1}(\mathrm{per}) : \int_0^1 f\,du = 0\}$, $\mathcal{H}^{(2)}_0 = \{1\}$, $\mathcal{H}^{(2)}_1 = \{x_2 - 0.5\}$, and $\mathcal{H}^{(2)}_2 = \{f \in W_2^2[0, 1] : \int_0^1 f\,du = \int_0^1 f'\,du = 0\}$. Then, in terms of the model space, (4.23) decomposes

$$\begin{aligned} W_2^{m_1}(\mathrm{per}) \otimes W_2^2[0, 1] &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1 \oplus \mathcal{H}^{(2)}_2\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_2\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_2\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3 \oplus \mathcal{H}_4, \end{aligned}$$

where $\mathcal{H}_0 = \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\} \oplus \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\}$, $\mathcal{H}_1 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_2$, $\mathcal{H}_2 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$, and $\mathcal{H}_4 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_2$. $\mathcal{H}_0$ is a two-dimensional space with basis functions $\phi_1(x) = 1$ and $\phi_2(x) = x_2 - 0.5$. The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$, and $\mathcal{H}_4$ can be calculated from the RKs of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ given in (2.33) and the RKs of $\mathcal{H}^{(2)}_0$, $\mathcal{H}^{(2)}_1$, and $\mathcal{H}^{(2)}_2$ given in Table 2.2.
4.4.6 Decomposition of $W_2^2(\mathbb{R}^2) \otimes W_2^m(\mathrm{per})$

Suppose $x_1 = (x_{11}, x_{12})$ is a bivariate continuous variable in $\mathbb{R}^2$, and $x_2$ is a continuous variable in $[0, 1]$. In addition, suppose that $f$ is a periodic function of $x_2$. We consider the tensor product space $W_2^2(\mathbb{R}^2) \otimes W_2^m(\mathrm{per})$ for the joint regression function $f(x_1, x_2)$.

Let $\phi_1(x_1) = 1$, $\phi_2(x_1) = x_{11}$, and $\phi_3(x_1) = x_{12}$ be polynomials of total degree less than 2, where $\phi_2$ and $\phi_3$ have been orthonormalized such that $(\phi_\nu, \phi_\mu)_0 = \delta_{\nu,\mu}$ under the norm (2.41). Define three averaging operators

$$A_1^{(1)} f = \sum_{j=1}^J w_j f(u_j), \qquad A_2^{(1)} f = \sum_{j=1}^J w_j f(u_j)\{\phi_2(u_j)\phi_2 + \phi_3(u_j)\phi_3\}, \qquad A_1^{(2)} f = \int_0^1 f\,dx_2,$$

where $u_j$ are fixed points in $\mathbb{R}^2$, and $w_j$ are fixed positive weights such that $\sum_{j=1}^J w_j = 1$. Let $A_3^{(1)} = I - A_1^{(1)} - A_2^{(1)}$ and $A_2^{(2)} = I - A_1^{(2)}$. Then

$$\begin{aligned} f &= \big\{A_1^{(1)} + A_2^{(1)} + A_3^{(1)}\big\}\big\{A_1^{(2)} + A_2^{(2)}\big\} f \\ &= A_1^{(1)} A_1^{(2)} f + A_2^{(1)} A_1^{(2)} f + A_3^{(1)} A_1^{(2)} f + A_1^{(1)} A_2^{(2)} f + A_2^{(1)} A_2^{(2)} f + A_3^{(1)} A_2^{(2)} f \\ &\triangleq \mu + \beta_1 \phi_2(x_1) + \beta_2 \phi_3(x_1) + f_1^s(x_1) + f_2(x_2) + f_{12}^{ls}(x_1, x_2) + f_{12}^{ss}(x_1, x_2), \end{aligned} \qquad (4.24)$$

where $\mu$ is the overall mean, $\beta_1 \phi_2(x_1) + \beta_2 \phi_3(x_1)$ is the linear main effect of $x_1$, $f_1^s(x_1)$ is the smooth main effect of $x_1$, $f_2(x_2)$ is the main effect of $x_2$, $f_{12}^{ls}(x_1, x_2)$ is the linear–smooth interaction, and $f_{12}^{ss}(x_1, x_2)$ is the smooth–smooth interaction. The overall main effect of $x_1$ is

$$f_1(x_1) = \beta_1 \phi_2(x_1) + \beta_2 \phi_3(x_1) + f_1^s(x_1),$$

and the overall interaction is

$$f_{12}(x_1, x_2) = f_{12}^{ls}(x_1, x_2) + f_{12}^{ss}(x_1, x_2).$$

Write $W_2^2(\mathbb{R}^2) = \mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1 \oplus \mathcal{H}^{(1)}_2$ and $W_2^m(\mathrm{per}) = \mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1$, where $\mathcal{H}^{(1)}_0 = \{1\}$, $\mathcal{H}^{(1)}_1 = \{\phi_2, \phi_3\}$, $\mathcal{H}^{(1)}_2 = \{f \in W_2^2(\mathbb{R}^2) : J_2^2(f) = 0\}$, $\mathcal{H}^{(2)}_0 = \{1\}$, and $\mathcal{H}^{(2)}_1 = \{f \in W_2^m(\mathrm{per}) : \int_0^1 f\,du = 0\}$. Then, in terms of the model space, (4.24) decomposes

$$\begin{aligned} W_2^2(\mathbb{R}^2) \otimes W_2^m(\mathrm{per}) &= \big\{\mathcal{H}^{(1)}_0 \oplus \mathcal{H}^{(1)}_1 \oplus \mathcal{H}^{(1)}_2\big\} \otimes \big\{\mathcal{H}^{(2)}_0 \oplus \mathcal{H}^{(2)}_1\big\} \\ &= \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_0\big\} \oplus \big\{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1\big\} \oplus \big\{\mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_1\big\} \\ &\triangleq \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \mathcal{H}_2 \oplus \mathcal{H}_3 \oplus \mathcal{H}_4, \end{aligned} \qquad (4.25)$$

where $\mathcal{H}_0 = \{\mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_0\} \oplus \{\mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_0\}$, $\mathcal{H}_1 = \mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_0$, $\mathcal{H}_2 = \mathcal{H}^{(1)}_0 \otimes \mathcal{H}^{(2)}_1$, $\mathcal{H}_3 = \mathcal{H}^{(1)}_1 \otimes \mathcal{H}^{(2)}_1$, and $\mathcal{H}_4 = \mathcal{H}^{(1)}_2 \otimes \mathcal{H}^{(2)}_1$. The basis functions of $\mathcal{H}_0$ are $1$, $\phi_2$, and $\phi_3$. The RKs of $\mathcal{H}^{(1)}_0$ and $\mathcal{H}^{(1)}_1$ are $1$ and $\phi_2(x_1)\phi_2(z_1) + \phi_3(x_1)\phi_3(z_1)$, respectively. The RK of $\mathcal{H}^{(1)}_2$ is given in (2.42). The RKs of $\mathcal{H}^{(2)}_0$ and $\mathcal{H}^{(2)}_1$ are given in (2.33). The RKs of $\mathcal{H}_1$, $\mathcal{H}_2$, $\mathcal{H}_3$, and $\mathcal{H}_4$ can be calculated as products of the RKs of the involved marginal spaces.
4.5 General SS ANOVA Decomposition
Consider the general case with $d$ independent variables $x_1 \in \mathcal{X}_1, x_2 \in \mathcal{X}_2, \ldots, x_d \in \mathcal{X}_d$, and the tensor product space $\mathcal{H}^{(1)} \otimes \mathcal{H}^{(2)} \otimes \cdots \otimes \mathcal{H}^{(d)}$ on $\mathcal{X} = \mathcal{X}_1 \times \mathcal{X}_2 \times \cdots \times \mathcal{X}_d$. For $f$ as a function of $x_k$, assume the following one-way decomposition as in (4.3):

$$f = A_1^{(k)} f + \cdots + A_{r_k}^{(k)} f, \quad 1 \le k \le d, \qquad (4.26)$$

where $A_1^{(k)} + \cdots + A_{r_k}^{(k)} = I$. Then, for the joint function,

$$f = \big\{A_1^{(1)} + \cdots + A_{r_1}^{(1)}\big\} \cdots \big\{A_1^{(d)} + \cdots + A_{r_d}^{(d)}\big\} f = \sum_{j_1=1}^{r_1} \cdots \sum_{j_d=1}^{r_d} A_{j_1}^{(1)} \cdots A_{j_d}^{(d)} f. \qquad (4.27)$$

The above decomposition of the function $f$ is referred to as the SS ANOVA decomposition.

Denote

$$\mathcal{H}^{(k)} = \mathcal{H}^{(k)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(k)}_{(r_k)}, \quad k = 1, 2, \ldots, d,$$

as the one-way decomposition of $\mathcal{H}^{(k)}$ associated with (4.26). Then, (4.27) decomposes the tensor product space

$$\mathcal{H}^{(1)} \otimes \mathcal{H}^{(2)} \otimes \cdots \otimes \mathcal{H}^{(d)} = \big\{\mathcal{H}^{(1)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(1)}_{(r_1)}\big\} \otimes \cdots \otimes \big\{\mathcal{H}^{(d)}_{(1)} \oplus \cdots \oplus \mathcal{H}^{(d)}_{(r_d)}\big\} = \bigoplus_{j_1=1}^{r_1} \cdots \bigoplus_{j_d=1}^{r_d} \mathcal{H}^{(1)}_{(j_1)} \otimes \cdots \otimes \mathcal{H}^{(d)}_{(j_d)}. \qquad (4.28)$$

The RK of $\mathcal{H}^{(1)}_{(j_1)} \otimes \cdots \otimes \mathcal{H}^{(d)}_{(j_d)}$ equals $\prod_{k=1}^d R^{(k)}_{(j_k)}$, where $R^{(k)}_{(j_k)}$ is the RK of $\mathcal{H}^{(k)}_{(j_k)}$ for $k = 1, \ldots, d$.
Consider the special case when $r_k = 2$ for all $k = 1, \ldots, d$. Assume that $A_1^{(k)} f$ is independent of $x_k$, or equivalently, $\mathcal{H}^{(k)}_{(1)} = \{1\}$. Then the decomposition (4.27) can be written as

$$f = \sum_{B \subseteq \{1, \ldots, d\}} \Big\{\prod_{k \in B} A_2^{(k)} \prod_{k \in B^c} A_1^{(k)} f\Big\} = \mu + \sum_{k=1}^d f_k(x_k) + \sum_{k < l} f_{kl}(x_k, x_l) + \cdots + f_{1 \cdots d}(x_1, \ldots, x_d), \qquad (4.29)$$

where $\mu$ represents the grand mean, $f_k(x_k)$ represents the main effect of $x_k$, $f_{kl}(x_k, x_l)$ represents the two-way interaction between $x_k$ and $x_l$, and the remaining terms represent higher-order interactions.
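The index in (4.29) runs over all subsets $B \subseteq \{1, \ldots, d\}$, so there are $2^d$ components in total. A short Python sketch enumerating the component labels for $d = 3$:

```python
from itertools import combinations

d = 3
labels = []
for r in range(d + 1):
    for B in combinations(range(1, d + 1), r):
        if r == 0:
            labels.append("mu")                         # grand mean: B empty
        else:
            labels.append("f_" + "".join(map(str, B)))  # effect of variables in B
assert len(labels) == 2 ** d                            # 2^d components in total
print(labels)
# prints ['mu', 'f_1', 'f_2', 'f_3', 'f_12', 'f_13', 'f_23', 'f_123']
```

This exponential growth in the number of components is the curse of dimensionality discussed in Section 4.6, and is the reason high-order interactions are usually dropped.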
For general $r_k$, assuming $A_1^{(k)} f$ is independent of $x_k$, the decomposition (4.29) can be derived by combining the operators $A_2^{(k)}, \ldots, A_{r_k}^{(k)}$ into one averaging operator $\tilde A_2^{(k)} = A_2^{(k)} + \cdots + A_{r_k}^{(k)}$. Therefore, decomposition (4.29) combines components in (4.27) and reorganizes them into overall main effects and interactions.
When all $x_k$ are discrete variables, the SS ANOVA decomposition (4.27) leads to the classical $d$-way ANOVA model. Therefore, SS ANOVA decompositions are natural extensions of classical ANOVA decompositions from discrete domains to general domains, and from finite dimensional spaces to infinite dimensional spaces. They decompose tensor product RKHS's into meaningful subspaces. Like the classical ANOVA decompositions, SS ANOVA decompositions lead to hierarchical structures that are useful for model selection and interpretation.
Different SS ANOVA decompositions can be derived based on different averaging operators (or, equivalently, different decompositions of the marginal spaces). Therefore, the SS ANOVA decomposition should be regarded as a general prescription for building multivariate nonparametric models rather than as a fixed model. It can also be used to construct submodels for components in more complicated models.
4.6 SS ANOVA Models and Estimation
The curse of dimensionality is a major problem in dealing with multivariate functions. From the SS ANOVA decomposition in (4.27), it is clear that the number of components in the decomposition increases exponentially as the dimension d increases. To overcome this problem, as in classical ANOVA, high-order interactions are often deleted from the model space.
A model containing any subset of components in the SS ANOVA decomposition (4.27) is referred to as an SS ANOVA model. The well-known additive model is a special case that contains main effects only (Hastie and Tibshirani 1990). Given an SS ANOVA model, we can regroup and write the model space as
M = H0 ⊕H1 ⊕ · · · ⊕ Hq, (4.30)
where $H_0$ is a finite-dimensional space collecting all functions that are not going to be penalized, and $H_1, \dots, H_q$ are orthogonal RKHS's with RKs $R_j$ for $j = 1, \dots, q$. The RK $R_j$ equals the product of the RKs of the subspaces involved in the tensor product space $H_j$. The norms on the composite $H_j$ are the tensor product norms induced by the norms on the component subspaces. Details about the induced norm can be found in Aronszajn (1950), and an illustrative example can be found in Chapter 10 of Wahba (1990). Note that

$$
\|f\|^2 = \|P_0 f\|^2 + \sum_{j=1}^{q} \|P_j f\|^2,
$$

where $P_j$ is the orthogonal projection in $\mathcal{M}$ onto $H_j$.

For generality, suppose observations are generated by
yi = Lif + ǫi, i = 1, . . . , n, (4.31)
where f is a multivariate function in the model space $\mathcal{M}$ defined in (4.30), $\mathcal{L}_i$ are bounded linear functionals, and $\epsilon_i$ are zero-mean independent random errors with a common variance $\sigma^2$.
The PLS estimate of f is the solution to

$$
\min_{f \in \mathcal{M}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathcal{L}_i f)^2 + \sum_{j=1}^{q} \lambda_j \|P_j f\|^2 \right\}. \tag{4.32}
$$
Different smoothing parameters $\lambda_j$ allow different penalties for each component. For fixed smoothing parameters, the following rescaling allows us to derive and compute the solution to (4.32) using results in Chapter 2. Let $H_1^* = \oplus_{j=1}^{q} H_j$. Then, for any $f \in H_1^*$,

$$
f(x) = f_1(x) + \cdots + f_q(x), \qquad f_j \in H_j, \; j = 1, \dots, q.
$$
Write $\lambda_j \triangleq \lambda/\theta_j$. The set of parameters $\lambda$ and $\theta_j$ is overparameterized; the penalty is controlled by the ratio $\lambda/\theta_j$, that is, by $\lambda_j$. Define the inner product in $H_1^*$ as

$$
(f, g)_* = \sum_{j=1}^{q} \theta_j^{-1} (f_j, g_j). \tag{4.33}
$$

Then $\|f\|_*^2 = \sum_{j=1}^{q} \theta_j^{-1} \|f_j\|^2$. Let $R_1^* = \sum_{j=1}^{q} \theta_j R_j$. Since
$$
(R_1^*(x, \cdot), f(\cdot))_* = \sum_{j=1}^{q} \theta_j^{-1} (\theta_j R_j(x, \cdot), f_j(\cdot)) = \sum_{j=1}^{q} f_j(x) = f(x),
$$

$R_1^*$ is the RK of $H_1^*$ with the inner product (4.33).

Let $P_1^* = \sum_{j=1}^{q} P_j$ be the orthogonal projection in $\mathcal{M}$ onto $H_1^*$. Then
the minimization problem (4.32) is reduced to

$$
\min_{f \in \mathcal{M}} \left\{ \frac{1}{n} \sum_{i=1}^{n} (y_i - \mathcal{L}_i f)^2 + \lambda \|P_1^* f\|_*^2 \right\}. \tag{4.34}
$$
The PLS (4.34) has the same form as (2.11), with $H_1$ and $P_1$ being replaced by $H_1^*$ and $P_1^*$, respectively. Therefore, results in Section 2.4 apply. Specifically, let $\boldsymbol{\theta} = (\theta_1, \dots, \theta_q)$, let $\phi_1, \dots, \phi_p$ be basis functions of $H_0$, and define

$$
\begin{aligned}
T &= \{\mathcal{L}_i \phi_\nu\}_{i=1,\ \nu=1}^{n,\ p}, \\
\Sigma_k &= \{\mathcal{L}_i \mathcal{L}_j R_k\}_{i,j=1}^{n}, \quad k = 1, \dots, q, \\
\Sigma_\theta &= \theta_1 \Sigma_1 + \cdots + \theta_q \Sigma_q.
\end{aligned} \tag{4.35}
$$
Applying the Kimeldorf–Wahba representer theorem and noting that

$$
\xi_i(x) = \mathcal{L}_{i(z)} R_1^*(x, z) = \sum_{j=1}^{q} \theta_j \mathcal{L}_{i(z)} R_j(x, z),
$$

the solution can be represented as

$$
\hat{f}(x) = \sum_{\nu=1}^{p} d_\nu \phi_\nu(x) + \sum_{i=1}^{n} c_i \sum_{j=1}^{q} \theta_j \mathcal{L}_{i(z)} R_j(x, z), \tag{4.36}
$$

where $d = (d_1, \dots, d_p)^T$ and $c = (c_1, \dots, c_n)^T$ are solutions to

$$
(\Sigma_\theta + n\lambda I) c + T d = y, \qquad T^T c = 0. \tag{4.37}
$$
Equations in (4.37) have the same form as those in (2.21), with $\Sigma$ being replaced by $\Sigma_\theta$. Therefore, the coefficients c and d can be computed similarly.
Let $\hat{f} = (\mathcal{L}_1 \hat{f}, \dots, \mathcal{L}_n \hat{f})^T$ be the vector of fitted values. Let the QR decomposition of T be

$$
T = (Q_1 \;\; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix}
$$

and let $M = \Sigma_\theta + n\lambda I$. Then $\hat{f} = H(\lambda, \boldsymbol{\theta}) y$, where

$$
H(\lambda, \boldsymbol{\theta}) = I - n\lambda Q_2 (Q_2^T M Q_2)^{-1} Q_2^T \tag{4.38}
$$

is the hat matrix.

The ssr function in the assist package can be used to fit the SS ANOVA model (4.31). As in Chapter 2, the observations y and the T matrix can be specified using the argument formula. Instead of a single matrix $\Sigma$, we now have multiple matrices $\Sigma_j$ for $j = 1, \dots, q$. They are specified as elements of a list for the argument rk. Examples are given in Section 4.9.
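The linear system (4.37) and the hat-matrix form (4.38) can be checked numerically. The Python sketch below (an illustration with toy stand-ins for T and Σθ, not the assist implementation) solves (4.37) directly and verifies that the fitted values agree with H(λ,θ)y:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, nlam = 30, 2, 0.5          # nlam plays the role of n*lambda

# toy stand-ins for the quantities in (4.35)
T = np.column_stack([np.ones(n), np.linspace(0.0, 1.0, n)])  # n x p, full rank
Z = rng.standard_normal((n, n))
Sigma_theta = Z @ Z.T / n                                    # symmetric PSD
y = rng.standard_normal(n)

M = Sigma_theta + nlam * np.eye(n)

# solve (4.37): (Sigma_theta + n*lambda*I)c + T d = y, T'c = 0, as one block system
K = np.block([[M, T], [T.T, np.zeros((p, p))]])
sol = np.linalg.solve(K, np.concatenate([y, np.zeros(p)]))
c, d = sol[:n], sol[n:]

# hat matrix (4.38) from the full QR decomposition of T
Q, _ = np.linalg.qr(T, mode="complete")
Q2 = Q[:, p:]                    # orthonormal basis for the null space of T'
H = np.eye(n) - nlam * Q2 @ np.linalg.solve(Q2.T @ M @ Q2, Q2.T)

fitted = Sigma_theta @ c + T @ d  # vector of fitted values
assert np.allclose(H @ y, fitted)
assert np.allclose(T.T @ c, np.zeros(p))
```

Both routes agree because the constraint $T^T c = 0$ forces $c = Q_2 (Q_2^T M Q_2)^{-1} Q_2^T y$, and then $\hat{f} = y - n\lambda c = H(\lambda, \boldsymbol{\theta}) y$.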
Sometimes it may be desirable to use the same smoothing parameter for a subset of penalties in the PLS (4.32). For illustration, suppose we want to solve the PLS (4.32) with $\lambda_{q-1} = \lambda_q$. This can be achieved by combining $H_{q-1}$ and $H_q$ into one space, say $\tilde{H}_{q-1}$, and fitting the SS ANOVA model with model space $\mathcal{M} = H_0 \oplus H_1 \oplus \cdots \oplus \tilde{H}_{q-1}$. The RK of $\tilde{H}_{q-1}$ is $\tilde{R}_{q-1} = R_{q-1} + R_q$. The model can then be fitted by a call to the ssr function with the combined RK. The same approach can be applied to multiple subsets such that the penalties in each subset share the same smoothing parameter. When appropriate, this approach can greatly reduce the computation time when q is large. An example will be given in Section 4.9.1.
4.7 Selection of Smoothing Parameters
The set of parameters $\lambda$ and $\boldsymbol{\theta}$ is overparameterized. Therefore, even though the criteria in this section are presented as functions of $\lambda$ and $\boldsymbol{\theta}$, they should be understood as functions of $(\lambda_1, \dots, \lambda_q)$, where $\lambda_j = \lambda/\theta_j$.
Define the mean squared error (MSE) as

$$
\mathrm{MSE}(\lambda, \boldsymbol{\theta}) = \mathrm{E}\left( \frac{1}{n} \|\hat{f} - f\|^2 \right), \tag{4.39}
$$

where $f = (\mathcal{L}_1 f, \dots, \mathcal{L}_n f)^T$. Following the same arguments as in Section 3.3, it is easy to check that the function

$$
\mathrm{UBR}(\lambda, \boldsymbol{\theta}) \triangleq \frac{1}{n} \|(I - H(\lambda, \boldsymbol{\theta})) y\|^2 + \frac{2\sigma^2}{n} \mathrm{tr}\, H(\lambda, \boldsymbol{\theta}) \tag{4.40}
$$

is an unbiased estimate of $\mathrm{MSE}(\lambda, \boldsymbol{\theta}) + \sigma^2$. The function $\mathrm{UBR}(\lambda, \boldsymbol{\theta})$ is referred to as the unbiased risk (UBR) criterion, and the minimizer of $\mathrm{UBR}(\lambda, \boldsymbol{\theta})$ is referred to as the UBR estimate of $(\lambda, \boldsymbol{\theta})$. The UBR method requires an estimate of the error variance $\sigma^2$. Few methods are available for estimating $\sigma^2$ in a multivariate nonparametric model without needing to estimate the function f first. When the product domain $\mathcal{X}$ is equipped with a norm, the method in Tong and Wang (2005) may be used to estimate $\sigma^2$.
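As a numerical illustration of how (4.40) is used (a toy Python sketch with an arbitrary stand-in kernel matrix, not the assist implementation), the UBR criterion can be evaluated over a grid of smoothing parameters for a family of linear smoothers and minimized by grid search:

```python
import numpy as np

def ubr(H, y, sigma2):
    """UBR criterion (4.40): (1/n)||(I - H)y||^2 + (2*sigma2/n) tr(H)."""
    n = len(y)
    resid = y - H @ y
    return resid @ resid / n + 2.0 * sigma2 * np.trace(H) / n

rng = np.random.default_rng(1)
n = 50
x = np.linspace(0.0, 1.0, n)
K = np.exp(-np.abs(x[:, None] - x[None, :]) / 0.1)  # stand-in kernel matrix
sigma2 = 0.25                                        # assumed known here
y = np.sin(2.0 * np.pi * x) + rng.normal(0.0, np.sqrt(sigma2), n)

# family of linear smoothers H(lam) = K (K + n*lam*I)^{-1}
lams = 10.0 ** np.linspace(-6, 1, 15)
scores = [ubr(K @ np.linalg.inv(K + n * lam * np.eye(n)), y, sigma2)
          for lam in lams]
lam_ubr = lams[int(np.argmin(scores))]   # UBR estimate of lambda
```

In the SS ANOVA setting the search is over $(\lambda_1, \dots, \lambda_q)$ jointly, but the criterion is evaluated in exactly this way from the hat matrix (4.38).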
A parallel derivation as in Section 3.4 leads to the following extension of the GCV criterion:

$$
\mathrm{GCV}(\lambda, \boldsymbol{\theta}) \triangleq \frac{\frac{1}{n} \sum_{i=1}^{n} (\mathcal{L}_i \hat{f} - y_i)^2}{\left\{ \frac{1}{n} \mathrm{tr}(I - H(\lambda, \boldsymbol{\theta})) \right\}^2}. \tag{4.41}
$$

The GCV estimate of $(\lambda, \boldsymbol{\theta})$ is the minimizer of $\mathrm{GCV}(\lambda, \boldsymbol{\theta})$.

We now construct a Bayes model for the SS ANOVA model (4.31).
Assume a prior for f of the form

$$
F(x) = \sum_{\nu=1}^{p} \zeta_\nu \phi_\nu(x) + \delta^{\frac{1}{2}} \sum_{j=1}^{q} \sqrt{\theta_j}\, U_j(x), \tag{4.42}
$$

where $\zeta_\nu \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \kappa)$; $U_j(x)$ are independent, zero-mean Gaussian stochastic processes with covariance functions $R_j(x, z)$; $\zeta_\nu$ and $U_j(x)$ are mutually independent; and $\kappa$ and $\delta$ are positive constants. Suppose observations are generated from

$$
y_i = \mathcal{L}_i F + \epsilon_i, \quad i = 1, \dots, n, \tag{4.43}
$$

where $\epsilon_i \overset{\mathrm{iid}}{\sim} \mathrm{N}(0, \sigma^2)$.
Let $\mathcal{L}_0$ be a bounded linear functional on $\mathcal{M}$. Let $\lambda = \sigma^2/(n\delta)$. The same arguments as in Section 3.6 hold when $M = \Sigma + n\lambda I$ is replaced by $M = \Sigma_\theta + n\lambda I$ in this chapter. Therefore,

$$
\lim_{\kappa \to \infty} \mathrm{E}(\mathcal{L}_0 F \mid y) = \mathcal{L}_0 \hat{f}.
$$

That is, the PLS estimate $\hat{f}$ is a Bayes estimate. Furthermore, we have the following extension of the GML criterion:

$$
\mathrm{GML}(\lambda, \boldsymbol{\theta}) \triangleq \frac{y^T (I - H(\lambda, \boldsymbol{\theta})) y}{\left[ \det{}^{+}\!\big(I - H(\lambda, \boldsymbol{\theta})\big) \right]^{\frac{1}{n-p}}}. \tag{4.44}
$$
All three forms of LME models for the SSR model in Section 3.5 can be extended to the SS ANOVA model. We present the extension of (3.35) only. Let $\Sigma_k = Z_k Z_k^T$, where $Z_k$ is an $n \times m_k$ matrix with $m_k = \mathrm{rank}(\Sigma_k)$. It is not difficult to see that the GML criterion is the REML criterion based on the following linear mixed-effects model:

$$
y = T\zeta + \sum_{k=1}^{q} Z_k u_k + \epsilon, \tag{4.45}
$$

where $\zeta = (\zeta_1, \dots, \zeta_p)^T$ are deterministic parameters, $u_k$ are mutually independent random effects, $u_k \sim \mathrm{N}(0, \sigma^2 \theta_k I_{m_k}/n\lambda)$, $\epsilon \sim \mathrm{N}(0, \sigma^2 I)$, and $u_k$ are independent of $\epsilon$. Details can be found in Chapter 9.
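The factorization $\Sigma_k = Z_k Z_k^T$ with $Z_k$ of size $n \times m_k$, $m_k = \mathrm{rank}(\Sigma_k)$, can be obtained from an eigendecomposition. The Python sketch below shows one of several valid constructions (the text does not prescribe a specific one):

```python
import numpy as np

def factor_rk(Sigma_k, tol=1e-10):
    """Factor a PSD matrix as Sigma_k = Z_k Z_k^T with Z_k of full column rank,
    keeping only eigenvalues above a relative tolerance."""
    w, V = np.linalg.eigh(Sigma_k)          # ascending eigenvalues
    keep = w > tol * w.max()
    return V[:, keep] * np.sqrt(w[keep])    # scale each kept eigenvector column

# example: a rank-5 PSD matrix of size 20 x 20
rng = np.random.default_rng(2)
A = rng.standard_normal((20, 5))
Sigma_k = A @ A.T
Z = factor_rk(Sigma_k)
assert Z.shape[1] == 5
assert np.allclose(Z @ Z.T, Sigma_k)
```

Any other factor (e.g., a pivoted Cholesky factor) differs from this one only by an orthogonal transformation, which leaves the implied random-effects distribution unchanged.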
4.8 Confidence Intervals
Any function $f \in \mathcal{M}$ can be represented as

$$
f = \sum_{\nu=1}^{p} f_{0\nu} + \sum_{j=1}^{q} f_{1j}, \tag{4.46}
$$

where $f_{0\nu} \in \mathrm{span}\{\phi_\nu\}$ for $\nu = 1, \dots, p$, and $f_{1j} \in H_j$ for $j = 1, \dots, q$. Our goal is to construct Bayesian confidence intervals for

$$
\mathcal{L}_0 f_\gamma = \sum_{\nu=1}^{p} \gamma_\nu \mathcal{L}_0 f_{0\nu} + \sum_{j=1}^{q} \gamma_{p+j} \mathcal{L}_0 f_{1j} \tag{4.47}
$$

for any bounded linear functional $\mathcal{L}_0$ and any $\gamma = (\gamma_1, \dots, \gamma_{p+q})^T$, where $\gamma_k = 1$ when the corresponding component in (4.46) is to be included and 0 otherwise.

Let $F_{0\nu} = \zeta_\nu \phi_\nu$ for $\nu = 1, \dots, p$, and $F_{1j} = \sqrt{\delta \theta_j}\, U_j$ for $j = 1, \dots, q$. Let $\mathcal{L}_0$, $\mathcal{L}_{01}$, and $\mathcal{L}_{02}$ be bounded linear functionals.
Posterior means and covariances. For $\nu, \mu = 1, \dots, p$ and $j, k = 1, \dots, q$, the posterior means are

$$
\mathrm{E}(\mathcal{L}_0 F_{0\nu} \mid y) = (\mathcal{L}_0 \phi_\nu)\, e_\nu^T d, \qquad
\mathrm{E}(\mathcal{L}_0 F_{1j} \mid y) = \theta_j (\mathcal{L}_0 \xi_j)^T c, \tag{4.48}
$$

and the posterior covariances are

$$
\begin{aligned}
\delta^{-1} \mathrm{Cov}(\mathcal{L}_{01} F_{0\nu}, \mathcal{L}_{02} F_{0\mu} \mid y) &= (\mathcal{L}_{01} \phi_\nu)(\mathcal{L}_{02} \phi_\mu)\, e_\nu^T A e_\mu, \\
\delta^{-1} \mathrm{Cov}(\mathcal{L}_{01} F_{0\nu}, \mathcal{L}_{02} F_{1j} \mid y) &= -\theta_j (\mathcal{L}_{01} \phi_\nu)\, e_\nu^T B (\mathcal{L}_{02} \xi_j), \\
\delta^{-1} \mathrm{Cov}(\mathcal{L}_{01} F_{1j}, \mathcal{L}_{02} F_{1k} \mid y) &= \delta_{j,k} \theta_j \mathcal{L}_{01} \mathcal{L}_{02} R_j - \theta_j \theta_k (\mathcal{L}_{01} \xi_j)^T C (\mathcal{L}_{02} \xi_k),
\end{aligned} \tag{4.49}
$$

where $e_\nu$ is a vector of dimension p with the $\nu$th element equal to one and all other elements equal to zero, $\delta_{j,k}$ is the Kronecker delta, c and d are solutions to (4.37), $\mathcal{L}\xi_j = (\mathcal{L} \mathcal{L}_1 R_j, \dots, \mathcal{L} \mathcal{L}_n R_j)^T$ for any well-defined $\mathcal{L}$, $M = \Sigma_\theta + n\lambda I$, $A = (T^T M^{-1} T)^{-1}$, $B = A T^T M^{-1}$, and $C = M^{-1}(I - TB)$.
As in Section 3.8.1, even though not explicitly expressed, a diffuse prior is assumed for $\zeta$ with $\kappa \to \infty$. The first two equations in (4.48) state that the projections of $\hat{f}$ onto subspaces are the posterior means of the corresponding components in the Bayes model (4.42). The next three equations in (4.49) can be used to compute posterior covariances of the spline estimates and their projections. Based on these posterior covariances, we construct Bayesian confidence intervals for the overall function f and its components in (4.46). Specifically, the posterior mean and variance for $\mathcal{L}_0 f_\gamma$ in (4.47) can be calculated using the formulae in (4.48) and (4.49). Then a $100(1-\alpha)\%$ Bayesian confidence interval for $\mathcal{L}_0 f_\gamma$ is

$$
\mathrm{E}\{\mathcal{L}_0 F_\gamma \mid y\} \pm z_{\frac{\alpha}{2}} \sqrt{\mathrm{Var}\{\mathcal{L}_0 F_\gamma \mid y\}},
$$

where

$$
F_\gamma(x) = \sum_{\nu=1}^{p} \gamma_\nu F_{0\nu}(x) + \sum_{j=1}^{q} \gamma_{p+j} F_{1j}(x).
$$

Confidence intervals for a collection of points can be constructed similarly.
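Given the posterior mean and variance from (4.48) and (4.49), forming the interval itself is elementary. A minimal Python sketch (the function name is hypothetical, chosen for illustration):

```python
from statistics import NormalDist

def bayes_ci(post_mean, post_var, alpha=0.05):
    """100(1-alpha)% Bayesian confidence interval: mean +/- z_{alpha/2} * sd."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half = z * post_var ** 0.5
    return post_mean - half, post_mean + half

# e.g., posterior mean 1.0 and posterior variance 0.04 (sd 0.2)
lo, hi = bayes_ci(1.0, 0.04)
```

Applied pointwise over a grid, this yields the shaded bands shown in the figures of Section 4.9.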
The same approach as in Section 3.8.2 can be used to construct bootstrap confidence intervals. The extension is straightforward.
4.9 Examples
4.9.1 Tongue Shapes
Consider the ultrasound data. Let y be the response variable height; x1 be the index of environment, with x1 = 1, 2, 3 corresponding to 2words, cluster, and schwa, respectively; x2 be the variable length scaled into [0, 1]; and x3 be the variable time scaled into [0, 1].

We first investigate how tongue shapes for an articulation differ under different environments at a particular time, say, at time 60 ms. Observations are shown in Figure 4.2. Consider a bivariate regression function f(x1, x2), where x1 (environment) is a discrete variable with three levels and x2 (length) is a continuous variable in [0, 1]. Therefore, we model the joint function using the tensor product space $\mathbb{R}^3 \otimes W_2^m[0,1]$. The SS ANOVA decompositions of $\mathbb{R}^a \otimes W_2^m[0,1]$ are given in (4.16) for m = 1 and (4.18) for m = 2.
The following statements fit the SS ANOVA model (4.16):
> data(ultrasound)
> ultrasound$y <- ultrasound$height
> ultrasound$x1 <- ultrasound$env
> ultrasound$x2 <- ident(ultrasound$length)
> ssr(y~1, data=ultrasound, subset=ultrasound$time==60,
      rk=list(shrink1(x1),
              linear(x2),
              rk.prod(shrink1(x1),linear(x2))))

where the ident function transforms a variable into [0, 1], the shrink1 function computes the RK of $H^{(1)}_1 = \{f : \sum_{i=1}^{3} f(i) = 0\}$ using the formula in (4.6), and the rk.prod function computes the product of RKs.

FIGURE 4.2 Ultrasound data, plots of observations (circles), fits (thin lines), 95% Bayesian confidence intervals (shaded regions), and the mean curves among three environments (thicker lines); one panel each for 2words, cluster, and schwa. The tongue root is on the left, and the tongue tip is on the right in each plot.
The following statements fit the SS ANOVA model (4.18) and summarize the fit:
> ultra.el.c.fit <- ssr(y~x2, data=ultrasound,
subset=ultrasound$time==60,
rk=list(shrink1(x1),
cubic(x2),
rk.prod(shrink1(x1),kron(x2-.5)),
rk.prod(shrink1(x1),cubic(x2))))
> summary(ultra.el.c.fit)
...
GCV estimate(s) of smoothing parameter(s) : 6.061108e-02
2.650966e-05 1.557379e-03 3.177218e-05
Equivalent Degrees of Freedom (DF): 13.57631
Estimate of sigma: 2.654545
where the function kron computes the RK, $k_1(x_2) k_1(z_2)$, of the space $H^{(2)}_1 = \{k_1(x_2)\}$.

Estimates of the smoothing parameters $\lambda_3$ and $\lambda_4$ are small, which indicates that both interactions $\gamma_{x_1} \times (x_2 - 0.5)$ and $f^{ss}_{12}(x_1, x_2)$ may not be negligible. We compute the posterior mean and standard deviation for the overall interaction $f_{12}(x_1, x_2) = \gamma_{x_1} \times (x_2 - 0.5) + f^{ss}_{12}(x_1, x_2)$ on grid points as follows:
> grid <- seq(0,1,len=40)
> predict(ultra.el.c.fit, terms=c(0,0,0,0,1,1),
newdata=expand.grid(x2=grid,x1=as.factor(1:3)))
The overall interaction and its 95% Bayesian confidence intervals are shown in Figure 4.3. It is clear that the interaction is nonzero.
FIGURE 4.3 Ultrasound data, plots of the overall interaction (solid lines) and 95% Bayesian confidence intervals (shaded regions); one panel each for 2words, cluster, and schwa. The dashed line in each plot represents the constant function zero.
The posterior mean and standard deviation of the function f(x1, x2) can be calculated as follows:
> predict(ultra.el.c.fit, terms=c(1,1,1,1,1,1),
newdata=expand.grid(x2=grid,x1=as.factor(1:3)))
The default for the option terms is a vector of all 1's. Therefore, this option can be dropped in the above statement. The fits and 95% Bayesian confidence intervals are shown in Figure 4.2.
Note that in model (4.18) the first three terms represent the mean curve among three environments, and the last three terms represent the departure of a particular environment from the mean curve. For comparison, we compute the estimate of the mean curve among three environments:
> predict(ultra.el.c.fit, terms=c(1,1,0,1,0,0),
newdata=expand.grid(x2=grid,x1=as.factor(1)))
The estimate of the mean curve is also displayed in Figure 4.2. The difference between the tongue shape under a particular environment and the average tongue shape can be seen by comparing the two lines in each plot. To look at the effect of each environment more closely, we calculate the estimate of the departure from the mean curve for each environment:
> predict(ultra.el.c.fit, terms=c(0,0,1,0,1,1),
newdata=expand.grid(x2=grid,x1=as.factor(1:3)))
The estimates of the environment effects are shown in Figure 4.4. We can see that, compared with the average shape, the tongue shape for 2words is front-raising, and the tongue shape for cluster is back-raising. The tongue shape for schwa is close to the average shape.
FIGURE 4.4 Ultrasound data, plots of the effects of environment and 95% Bayesian confidence intervals; one panel each for 2words, cluster, and schwa. The dashed line in each plot represents the constant function zero.
The model space of the SS ANOVA model (4.18) is $\mathcal{M} = H_0 \oplus H_1 \oplus H_2 \oplus H_3 \oplus H_4$, where $H_j$ for $j = 0, \dots, 4$ are defined in (4.19). In particular, the spaces $H_3$ and $H_4$ contain smooth–linear and smooth–smooth interactions between x1 and x2. For illustration, suppose now that we want to fit model (4.18) with the same smoothing parameter for the penalties to functions in $H_3$ and $H_4$, that is, to set $\lambda_3 = \lambda_4$ in the PLS. As discussed in Section 4.6, this can be achieved by combining $H_3$ and $H_4$ into one space. The following statements fit the SS ANOVA model (4.18) with $\lambda_3 = \lambda_4$:
> ultra.el.c.fit1 <- ssr(y~x2, data=ultrasound,
   subset=ultrasound$time==60,
rk=list(shrink1(x1),
cubic(x2),
rk.prod(shrink1(x1),kron(x2-.5))+
rk.prod(shrink1(x1),cubic(x2))))
> summary(ultra.el.c.fit1)
...
GCV estimate(s) of smoothing parameter(s) : 5.648863e-02
2.739598e-05 3.766634e-05
Equivalent Degrees of Freedom (DF): 13.52603
Estimate of sigma: 2.65806
Next we investigate how the tongue shapes change over time for each environment. Figure 4.1 shows 3-d plots of the observations. For a fixed environment, consider a bivariate regression function f(x2, x3), where both x2 and x3 are continuous variables. Therefore, we model the joint function using the tensor product space $W_2^{m_1}[0,1] \otimes W_2^{m_2}[0,1]$. The SS ANOVA decompositions of $W_2^{m_1}[0,1] \otimes W_2^{m_2}[0,1]$ were presented in Section 4.4.3. Note that the variables x2 and x3 in this section correspond to x1 and x2 in Section 4.4.3.
The following statements fit the tensor product of linear splines with m1 = m2 = 1, that is, the SS ANOVA model (4.20), under environment 2words (x1 = 1):
> ultrasound$x3 <- ident(ultrasound$time)
> ssr(height~1, data=ultrasound, subset=ultrasound$env==1,
rk=list(linear(x2),
linear(x3),
rk.prod(linear(x2),linear(x3))))
The following statements fit the tensor product of cubic splines with m1 = m2 = 2, that is, the SS ANOVA model (4.21), under environment 2words and calculate estimates at grid points:
> ultra.lt.c.fit[[1]] <- ssr(height~x2+x3+x2*x3,
data=ultrasound, subset=ultrasound$env==1,
rk=list(cubic(x2),
cubic(x3),
rk.prod(kron(x2),cubic(x3)),
rk.prod(cubic(x2),kron(x3)),
rk.prod(cubic(x2),cubic(x3))))
> grid <- seq(0,1,len=20)
> ultra.lt.c.pred <- predict(ultra.lt.c.fit[[1]],
newdata=expand.grid(x2=grid,x3=grid))
FIGURE 4.5 Ultrasound data, 3-d plots of the estimated tongue shapes as functions of length and time based on the SS ANOVA model (4.21); one panel each for 2words, cluster, and schwa.
The SS ANOVA model (4.21) for environments cluster and schwa can be fitted similarly. The estimates for all three environments are shown in Figure 4.5. These surfaces show how the tongue shapes change over time. Note that $\mu + \beta_1 \times (x_2 - 0.5) + f^s_1(x_2)$ represents the mean tongue shape over the time period [0, 140], and the rest of (4.21), $\beta_2 \times (x_3 - 0.5) + f^s_2(x_3) + \beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5) + f^{ls}_{12}(x_2, x_3) + f^{sl}_{12}(x_2, x_3) + f^{ss}_{12}(x_2, x_3)$, represents the departure from the mean shape at time x3. To look at the time effect, we compute posterior means and standard deviations of the departure on grid points:
> predict(ultra.lt.c.fit[[1]], term=c(0,0,1,1,0,1,1,1,1),
newdata=expand.grid(x2=grid,x3=grid))
Figure 4.6 shows the contour plots of the estimated time effect for the three environments. Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey, while regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.
FIGURE 4.6 Ultrasound data, contour plots of the estimated time effect for the three environments (2words, cluster, and schwa) based on the SS ANOVA model (4.21). Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey. Regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.
Finally, we investigate how the changes of tongue shapes over time differ among environments. Consider a trivariate regression function f(x1, x2, x3) in the tensor product space $\mathbb{R}^3 \otimes W_2^{m_1}[0,1] \otimes W_2^{m_2}[0,1]$. For simplicity, we derive the SS ANOVA decomposition for m1 = m2 = 2 only. Define the averaging operators
$$
\begin{aligned}
A^{(1)}_1 f &= \frac{1}{3} \sum_{x_1=1}^{3} f, \\
A^{(k)}_1 f &= \int_0^1 f \, dx_k, \\
A^{(k)}_2 f &= \left( \int_0^1 f' \, dx_k \right)(x_k - 0.5), \quad k = 2, 3,
\end{aligned}
$$
where $A^{(1)}_1$, $A^{(2)}_1$, and $A^{(3)}_1$ extract the constant function out of all possible functions for each variable, and $A^{(2)}_2$ and $A^{(3)}_2$ extract the linear functions for x2 and x3. Let $A^{(1)}_2 = I - A^{(1)}_1$ and $A^{(k)}_3 = I - A^{(k)}_1 - A^{(k)}_2$ for k = 2, 3. Then
$$
\begin{aligned}
f &= \bigl\{A^{(1)}_1 + A^{(1)}_2\bigr\} \bigl\{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3\bigr\} \bigl\{A^{(3)}_1 + A^{(3)}_2 + A^{(3)}_3\bigr\} f \\
&= A^{(1)}_1 A^{(2)}_1 A^{(3)}_1 f + A^{(1)}_1 A^{(2)}_1 A^{(3)}_2 f + A^{(1)}_1 A^{(2)}_1 A^{(3)}_3 f \\
&\quad + A^{(1)}_1 A^{(2)}_2 A^{(3)}_1 f + A^{(1)}_1 A^{(2)}_2 A^{(3)}_2 f + A^{(1)}_1 A^{(2)}_2 A^{(3)}_3 f \\
&\quad + A^{(1)}_1 A^{(2)}_3 A^{(3)}_1 f + A^{(1)}_1 A^{(2)}_3 A^{(3)}_2 f + A^{(1)}_1 A^{(2)}_3 A^{(3)}_3 f \\
&\quad + A^{(1)}_2 A^{(2)}_1 A^{(3)}_1 f + A^{(1)}_2 A^{(2)}_1 A^{(3)}_2 f + A^{(1)}_2 A^{(2)}_1 A^{(3)}_3 f \\
&\quad + A^{(1)}_2 A^{(2)}_2 A^{(3)}_1 f + A^{(1)}_2 A^{(2)}_2 A^{(3)}_2 f + A^{(1)}_2 A^{(2)}_2 A^{(3)}_3 f \\
&\quad + A^{(1)}_2 A^{(2)}_3 A^{(3)}_1 f + A^{(1)}_2 A^{(2)}_3 A^{(3)}_2 f + A^{(1)}_2 A^{(2)}_3 A^{(3)}_3 f \\
&\triangleq \mu + \beta_2 \times (x_3 - 0.5) + f^s_3(x_3) \\
&\quad + \beta_1 \times (x_2 - 0.5) + \beta_3 \times (x_2 - 0.5)(x_3 - 0.5) + f^{ls}_{23}(x_2, x_3) \\
&\quad + f^s_2(x_2) + f^{sl}_{23}(x_2, x_3) + f^{ss}_{23}(x_2, x_3) \\
&\quad + f_1(x_1) + f^{sl}_{13}(x_1, x_3) + f^{ss}_{13}(x_1, x_3) \\
&\quad + f^{sl}_{12}(x_1, x_2) + f^{sll}_{123}(x_1, x_2, x_3) + f^{sls}_{123}(x_1, x_2, x_3) \\
&\quad + f^{ss}_{12}(x_1, x_2) + f^{ssl}_{123}(x_1, x_2, x_3) + f^{sss}_{123}(x_1, x_2, x_3),
\end{aligned} \tag{4.50}
$$
where $\mu$ represents the overall mean; $f_1(x_1)$ represents the main effect of x1; $\beta_1 \times (x_2 - 0.5)$ and $\beta_2 \times (x_3 - 0.5)$ represent the linear main effects of x2 and x3; $f^s_2(x_2)$ and $f^s_3(x_3)$ represent the smooth main effects of x2 and x3; $f^{sl}_{12}(x_1, x_2)$ ($f^{sl}_{13}(x_1, x_3)$) represents the smooth–linear interaction between x1 and x2 (x3); $\beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5)$, $f^{ls}_{23}(x_2, x_3)$, $f^{sl}_{23}(x_2, x_3)$, and $f^{ss}_{23}(x_2, x_3)$ represent the linear–linear, linear–smooth, smooth–linear, and smooth–smooth interactions between x2 and x3; and $f^{sll}_{123}(x_1, x_2, x_3)$, $f^{sls}_{123}(x_1, x_2, x_3)$, $f^{ssl}_{123}(x_1, x_2, x_3)$, and $f^{sss}_{123}(x_1, x_2, x_3)$ represent the three-way interactions between x1, x2, and x3. The overall main effect of $x_k$, $f_k(x_k)$, equals $\beta_{k-1} \times (x_k - 0.5) + f^s_k(x_k)$ for k = 2, 3. The overall interaction between x1 and $x_k$, $f_{1k}(x_1, x_k)$, equals $f^{sl}_{1k}(x_1, x_k) + f^{ss}_{1k}(x_1, x_k)$ for k = 2, 3. The overall interaction between x2 and x3, $f_{23}(x_2, x_3)$, equals $\beta_3 \times (x_2 - 0.5) \times (x_3 - 0.5) + f^{ls}_{23}(x_2, x_3) + f^{sl}_{23}(x_2, x_3) + f^{ss}_{23}(x_2, x_3)$. The overall three-way interaction, $f_{123}(x_1, x_2, x_3)$, equals $f^{sll}_{123}(x_1, x_2, x_3) + f^{sls}_{123}(x_1, x_2, x_3) + f^{ssl}_{123}(x_1, x_2, x_3) + f^{sss}_{123}(x_1, x_2, x_3)$.
We fit model (4.50) as follows:
> ultra.elt.c.fit <- ssr(height~I(x2-.5)+I(x3-.5)+I((x2-.5)*(x3-.5)),
   data=ultrasound,
   rk=list(shrink1(x1),
           cubic(x2),
           cubic(x3),
           rk.prod(shrink1(x1),kron(x2-.5)),
           rk.prod(shrink1(x1),cubic(x2)),
           rk.prod(shrink1(x1),kron(x3-.5)),
           rk.prod(shrink1(x1),cubic(x3)),
           rk.prod(cubic(x2),kron(x3-.5)),
           rk.prod(kron(x2-.5),cubic(x3)),
           rk.prod(cubic(x2),cubic(x3)),
           rk.prod(shrink1(x1),kron(x2-.5),kron(x3-.5)),
           rk.prod(shrink1(x1),kron(x2-.5),cubic(x3)),
           rk.prod(shrink1(x1),cubic(x2),kron(x3-.5)),
           rk.prod(shrink1(x1),cubic(x2),cubic(x3))))

FIGURE 4.7 Ultrasound data, 3-d plots of the estimated tongue shape as a function of environment, length, and time based on the SS ANOVA model (4.50); one panel each for 2words, cluster, and schwa.
The estimates for all three environments are shown in Figure 4.7. Note that the first nine terms in (4.50) represent the mean tongue shape surface over time, and the last nine terms in (4.50) represent the departure of an environment from this mean surface. To look at the environment effect on the tongue shape surface over time, we calculate the posterior mean and standard deviation of the departure for each environment:
> pred <- predict(ultra.elt.c.fit,
newdata=expand.grid(x1=as.factor(1:3),x2=grid,x3=grid),
terms=c(0,0,0,0,1,0,0,1,1,1,1,0,0,0,1,1,1,1))
The contour plots of the estimated departures for the three environments are shown in Figure 4.8. Note that the significant regions at time 60 ms are similar to those in Figure 4.4.
FIGURE 4.8 Ultrasound data, contour plots of the estimated environment effect for the three environments (2words, cluster, and schwa) in the SS ANOVA model (4.50). Regions where the lower bounds of the 95% Bayesian confidence intervals are positive are shaded in dark grey. Regions where the upper bounds of the 95% Bayesian confidence intervals are negative are shaded in light grey.
4.9.2 Ozone in Arosa — Revisit
Suppose we want to investigate how ozone thickness changes over time by considering the effects of both month and year. In Section 2.10 we fitted an additive model (2.49) with the month effect modeled parametrically by a simple sinusoidal function and the year effect modeled nonparametrically by a cubic spline. The following analyses show how to model both effects nonparametrically and investigate their interaction using SS ANOVA decompositions. We also show how to check the partial spline model (2.49).
Let y be the response variable thick, x1 be the independent variable month scaled into [0, 1], and x2 be the independent variable year scaled into [0, 1]. It is reasonable to assume that the mean ozone thickness is a periodic function of x1. We model the effect of x1 using the periodic spline space $W_2^2(per)$ and the effect of x2 using the cubic spline space $W_2^2[0,1]$. Therefore, we consider the SS ANOVA decomposition (4.23) for the tensor product space $W_2^{m_1}(per) \otimes W_2^2[0,1]$. The following statements fit model (4.23) with m1 = 2:
> Arosa$x1 <- (Arosa$month-0.5)/12
> Arosa$x2 <- (Arosa$year-1)/45
> arosa.ssanova.fit1 <- ssr(thick~I(x2-0.5), data=Arosa,
rk=list(periodic(x1),
cubic(x2),
rk.prod(periodic(x1),kron(x2-.5)),
rk.prod(periodic(x1),cubic(x2))))
> summary(arosa.ssanova.fit1)
...
GCV estimate(s) of smoothing parameter(s) : 5.442106e-06
2.154531e-09 3.387917e-06 2.961559e-02
Equivalent Degrees of Freedom (DF): 50.88469
Estimate of sigma: 14.7569
The mean function f(x) in model (4.23) evaluated at the design points, $f = (f(x_1), \dots, f(x_n))^T$, can be represented as

$$
f = \mu \mathbf{1} + f_1 + f_2 + f_{12},
$$

where $\mathbf{1}$ is a vector of all ones, $f_1 = (f_1(x_1), \dots, f_1(x_n))^T$, $f_2 = (f_2(x_1), \dots, f_2(x_n))^T$, $f_{12} = (f_{12}(x_1), \dots, f_{12}(x_n))^T$, and $f_1(x)$, $f_2(x)$, and $f_{12}(x)$ are the main effect of x1, the main effect of x2, and the interaction between x1 and x2. Eliminating the constant term, we have

$$
f^* = f_1^* + f_2^* + f_{12}^*,
$$

where $a^* = a - \bar{a}\mathbf{1}$ and $\bar{a} = \sum_{i=1}^{n} a_i / n$. Let $\hat{f}^*$, $\hat{f}_1^*$, $\hat{f}_2^*$, and $\hat{f}_{12}^*$ be the estimates of $f^*$, $f_1^*$, $f_2^*$, and $f_{12}^*$, respectively. To check the contributions of the main effects and the interaction, we compute the quantities $\pi_k = (\hat{f}_k^*)^T \hat{f}^* / \|\hat{f}^*\|^2$ for k = 1, 2, 12, and the Euclidean norms of $\hat{f}^*$, $\hat{f}_1^*$, $\hat{f}_2^*$, and $\hat{f}_{12}^*$:
> f1 <- predict(arosa.ssanova.fit1, terms=c(0,0,1,0,0,0))
> f2 <- predict(arosa.ssanova.fit1, terms=c(0,1,0,1,0,0))
> f12 <- predict(arosa.ssanova.fit1, terms=c(0,0,0,0,1,1))
> fs1 <- scale(f1$fit, scale=F)
> fs2 <- scale(f2$fit, scale=F)
> fs12 <- scale(f12$fit, scale=F)
> ys <- fs1+fs2+fs12
> pi1 <- sum(fs1*ys)/sum(ys**2)
> pi2 <- sum(fs2*ys)/sum(ys**2)
> pi12 <- sum(fs12*ys)/sum(ys**2)
> print(round(c(pi1,pi2,pi12),4))
0.9375 0.0592 0.0033
> print(round(sqrt(c(sum(fs1**2),sum(fs2**2),
sum(fs12**2))),2))
768.31 186.21 30.25
See Gu (2002) for more details about the cosine diagnostics. It is clear that the contribution from the interaction to the total variation is negligible. We also compute the posterior mean and standard deviation of the interaction f12:
> grid <- seq(0,1,len=50)
> predict(arosa.ssanova.fit1, terms=c(0,0,0,0,1,1),
expand.grid(x1=grid,x2=grid))
The estimate of the interaction is shown in Figure 4.9(a). Except for a narrow region, the zero function is contained in the 95% Bayesian confidence intervals. Thus the interaction is negligible.
FIGURE 4.9 Arosa data, (a) plot of the estimate of the interaction in model (4.23), (b) plot of the estimate of the interaction in model (4.52), and (c) plot of the estimate of the smooth component $f^s_1$ in model (4.53). For plots (a) and (b), regions where the lower bounds of the confidence intervals are positive are shaded in dark grey, while regions where the upper bounds of the confidence intervals are negative are shaded in light grey. For plot (c), the solid line represents the estimate of $f^s_1$, the shaded region represents 95% Bayesian confidence intervals, and the dashed line represents the zero function.
We drop the interaction term and fit the additive model

$$
f(x_1, x_2) = \mu + \beta \times (x_2 - 0.5) + f^s_2(x_2) + f_1(x_1) \tag{4.51}
$$
as follows:
> update(arosa.ssanova.fit1,
rk=list(periodic(x1),cubic(x2)))
The estimates of two main effects are shown in Figure 4.10.
FIGURE 4.10 Arosa data, plots of the estimates of the main effects in the SS ANOVA model (4.51) (left panel: main effect of month; right panel: main effect of year). Circles in the left panel represent monthly average thickness minus the overall mean. Circles in the right panel represent yearly average thickness minus the overall mean. Shaded regions are 95% Bayesian confidence intervals.
Now suppose we want to check whether the partial spline model (2.49) is appropriate for the Arosa data. For the month effect, consider the trigonometric spline model $W_2^3(per)$ with $L = D\{D^2 + (2\pi)^2\}$ defined in Section 2.11.5. The null space $\mathrm{span}\{1, \sin(2\pi x), \cos(2\pi x)\}$ corresponds to the sinusoidal model space assumed for the partial spline model (2.49). Consider the tensor product space $W_2^3(per) \otimes W_2^2[0,1]$. Define the averaging operators
$$
\begin{aligned}
A^{(1)}_1 f &= \int_0^1 f \, dx_1, \\
A^{(1)}_2 f &= \int_0^1 f \sin(2\pi x_1) \, dx_1, \\
A^{(1)}_3 f &= \int_0^1 f \cos(2\pi x_1) \, dx_1, \\
A^{(2)}_1 f &= \int_0^1 f \, dx_2, \\
A^{(2)}_2 f &= \left( \int_0^1 f' \, dx_2 \right)(x_2 - 0.5).
\end{aligned}
$$
Let $A^{(1)}_4 = I - A^{(1)}_1 - A^{(1)}_2 - A^{(1)}_3$ and $A^{(2)}_3 = I - A^{(2)}_1 - A^{(2)}_2$. Then

$$
\begin{aligned}
f &= \bigl\{A^{(1)}_1 + A^{(1)}_2 + A^{(1)}_3 + A^{(1)}_4\bigr\} \bigl\{A^{(2)}_1 + A^{(2)}_2 + A^{(2)}_3\bigr\} f \\
&= A^{(1)}_1 A^{(2)}_1 f + A^{(1)}_2 A^{(2)}_1 f + A^{(1)}_3 A^{(2)}_1 f + A^{(1)}_4 A^{(2)}_1 f \\
&\quad + A^{(1)}_1 A^{(2)}_2 f + A^{(1)}_2 A^{(2)}_2 f + A^{(1)}_3 A^{(2)}_2 f + A^{(1)}_4 A^{(2)}_2 f \\
&\quad + A^{(1)}_1 A^{(2)}_3 f + A^{(1)}_2 A^{(2)}_3 f + A^{(1)}_3 A^{(2)}_3 f + A^{(1)}_4 A^{(2)}_3 f \\
&\triangleq \mu + \beta_1 \times \sin(2\pi x_1) + \beta_2 \times \cos(2\pi x_1) + f^s_1(x_1) \\
&\quad + \beta_3 \times (x_2 - 0.5) + \beta_4 \times \sin(2\pi x_1) \times (x_2 - 0.5) \\
&\quad + \beta_5 \times \cos(2\pi x_1) \times (x_2 - 0.5) + f^{sl}_{12}(x_1, x_2) \\
&\quad + f^s_2(x_2) + f^{1s}_{12}(x_1, x_2) + f^{2s}_{12}(x_1, x_2) + f^{ss}_{12}(x_1, x_2),
\end{aligned} \tag{4.52}
$$
where $\mu$ represents the overall mean; $\beta_1 \times \sin(2\pi x_1)$ and $\beta_2 \times \cos(2\pi x_1)$ represent the parametric main effects of x1; $f^s_1(x_1)$ represents the smooth main effect of x1; $\beta_3 \times (x_2 - 0.5)$ represents the linear main effect of x2; $f^s_2(x_2)$ represents the smooth main effect of x2; and $\beta_4 \times \sin(2\pi x_1) \times (x_2 - 0.5)$, $\beta_5 \times \cos(2\pi x_1) \times (x_2 - 0.5)$, $f^{sl}_{12}(x_1, x_2)$, $f^{1s}_{12}(x_1, x_2)$, $f^{2s}_{12}(x_1, x_2)$, and $f^{ss}_{12}(x_1, x_2)$ represent interactions. We fit model (4.52) and compute the posterior means and standard deviations for the overall interaction as follows:
> arosa.ssanova.fit3 <- ssr(thick~sin(2*pi*x1)+cos(2*pi*x1)
   +I(x2-0.5)+I(sin(2*pi*x1)*(x2-0.5))
   +I(cos(2*pi*x1)*(x2-0.5)), data=Arosa,
   rk=list(lspline(x1,type="sine1"), cubic(x2),
           rk.prod(kron(sin(2*pi*x1)),cubic(x2)),
           rk.prod(kron(cos(2*pi*x1)),cubic(x2)),
           rk.prod(lspline(x1,type="sine1"), kron(x2-.5)),
           rk.prod(lspline(x1,type="sine1"), cubic(x2))))
> ngrid <- 50
> grid1 <- seq(.5/12,11.5/12,length=ngrid)
> grid2 <- seq(0,1,length=ngrid)
> predict(arosa.ssanova.fit3,
expand.grid(x1=grid1,x2=grid2),
terms=c(0,0,0,0,1,1,0,0,1,1,1,1))
The estimate of the interaction is shown in Figure 4.9(b). Except for a narrow region, the zero function is contained in the 95% Bayesian confidence intervals. Therefore, we drop the interaction terms and consider the following additive model:

$$
f(x_1, x_2) = \mu + \beta_1 \times \sin(2\pi x_1) + \beta_2 \times \cos(2\pi x_1) + f^s_1(x_1) + \beta_3 \times (x_2 - 0.5) + f^s_2(x_2). \tag{4.53}
$$

Note that the partial spline model (2.49) is a special case of model (4.53) with $f^s_1(x_1) = 0$. We fit model (4.53) and compute the posterior means and standard deviations for $f^s_1(x_1)$:
> arosa.ssanova.fit4 <- ssr(thick~sin(2*pi*x1)
      +cos(2*pi*x1)+I(x2-0.5), data=Arosa,
      rk=list(lspline(x1,type="sine1"), cubic(x2)))
> predict(arosa.ssanova.fit4, expand.grid(x1=grid1,x2=0),
terms=c(0,0,0,0,1,0))
The estimate of $f_1^{s}(x_1)$ is shown in Figure 4.9(c) with 95% Bayesian confidence intervals. It is clear that $f_1^{s}(x_1)$ is nonzero, which indicates that the simple sinusoidal function is inadequate for modeling the month effect.
4.9.3 Canadian Weather — Revisit
Consider the Canadian weather data with annual temperature profiles from all 35 stations as functional data. To investigate how the weather patterns differ, Ramsay and Silverman (2005) divided Canada into four climatic regions: Atlantic, Continental, Pacific, and Arctic. Let $y$ be the response variable temp, $x_1$ be the independent variable region, and $x_2$ be the independent variable month scaled into $[0, 1]$. The functional ANOVA (FANOVA) model (13.1) in Ramsay and Silverman (2005) assumes that
\[
y_{k,x_1}(x_2) = \eta(x_2) + \alpha_{x_1}(x_2) + \epsilon_{k,x_1}(x_2), \tag{4.54}
\]
where $y_{k,x_1}(x_2)$ is the temperature profile of station $k$ in climate region $x_1$, $\eta(x_2)$ is the average temperature profile across all of Canada, $\alpha_{x_1}(x_2)$ is the departure of the region $x_1$ profile from the population average profile $\eta(x_2)$, and $\epsilon_{k,x_1}(x_2)$ are random errors. It is clear that the FANOVA model can be derived from the SS ANOVA decomposition (4.22) by letting $\eta(x_2) = \mu + f_2(x_2)$ and $\alpha_{x_1}(x_2) = f_1(x_1) + f_{12}(x_1, x_2)$. The side condition for the FANOVA model (4.54), $\sum_{x_1=1}^{4} \alpha_{x_1}(x_2) = 0$ for all $x_2$, is satisfied by the construction of the SS ANOVA decomposition. Model (4.54) is an example of situation (ii) in Section 2.10, where the dependent variable involves functional data.
FIGURE 4.11 Canadian weather data, plots of temperature profiles of stations in four regions (thin lines), and the estimated profiles (thick lines).
Observed temperature profiles are shown in Figure 4.11. The following statements fit model (2.10) and compute posterior means and standard deviations for the four regions:
> x1 <- rep(as.factor(region),rep(12,35))
> x2 <- (rep(1:12,35)-.5)/12
> y <- as.vector(monthlyTemp)
> canada.fit2 <- ssr(y~1,
rk=list(shrink1(x1), periodic(x2),
rk.prod(shrink1(x1),periodic(x2))))
> xgrid <- seq(.5/12,11.5/12,len=50)
> zone <- c("Atlantic","Pacific",
      "Continental","Arctic")
> grid <- data.frame(x1=rep(zone,rep(50,4)),
x2=rep(xgrid,4))
> canada.fit2.p1 <- predict(canada.fit2, newdata=grid)
Estimates of the mean temperature functions for the four regions are shown in Figure 4.11. To look at the region effects $\alpha_{x_1}$ more closely, we compute their posterior means and standard deviations:
> canada.fit2.p2 <- predict(canada.fit2, newdata=grid,
terms=c(0,1,0,1))
Estimates of the region effects and 95% Bayesian confidence intervals are shown in Figure 4.12. These estimates are similar to those in Ramsay and Silverman (2005).
4.9.4 Texas Weather
Instead of dividing stations into geographical regions as in model (4.54), suppose we want to investigate how the weather patterns depend on geographical location in terms of latitude and longitude. For illustration, consider the Texas weather data consisting of average monthly temperatures during 1961–1990 from 48 weather stations in Texas. Denote $x_1 = (\text{lat}, \text{long})$ as the geographical location of a station, and $x_2$ as the month variable scaled into $[0, 1]$. We want to investigate how the expected temperature, $f(x_1, x_2)$, depends on both the location and month variables. Average monthly temperatures are computed using monthly temperatures during 1986–1990 for all 48 stations and are used as observations of $f(x_1, x_2)$. For each fixed station, the annual temperature profile can be regarded as functional data on a continuous interval. Figure 4.13 shows these curves for all 48 stations. For each fixed month, the temperature surface as a function of latitude and longitude can be regarded as functional data on $\mathbb{R}^2$. Figure 4.14 shows contour plots of the observed surfaces for January, April, July, and October.

A natural model space for the location variable is the thin-plate spline space $W_2^2(\mathbb{R}^2)$, and a natural model space for the month variable is $W_2^2(\text{per})$. Therefore, we fit the SS ANOVA model (4.24):
134 Smoothing Splines: Methods and Applications
FIGURE 4.12 Canadian weather data, plots of the estimated region effects on temperature, and 95% Bayesian confidence intervals.
> data(TXtemp); TXtemp1 <- TXtemp[TXtemp$year>1985,]
> y <- gapply(TXtemp1, which=5,
FUN=function(x) mean(x[x!=-99.99]),
group=TXtemp1$stacod*TXtemp1$month)
> tx.dat <- data.frame(y=as.vector(t(matrix(y,48,12,
byrow=F))))
> tx.dat$x2 <- rep((1:12-0.5)/12, 48)
> lat <- TXtemp$lat[seq(1, nrow(TXtemp),by=360)]
> long <- TXtemp$long[seq(1, nrow(TXtemp),by=360)]
> tx.dat$x11 <- rep(scale(lat), rep(12,48))
> tx.dat$x12 <- rep(scale(long), rep(12,48))
> tx.dat$stacod <- rep(TXtemp$stacod[seq(1,nrow(TXtemp),
by=360)],rep(12,48))
> tx.ssanova <- ssr(y~x11+x12, data=tx.dat,
rk=list(tp(list(x11,x12)),
periodic(x2),
rk.prod(tp.linear(list(x11,x12)),periodic(x2)),
rk.prod(tp(list(x11,x12)), periodic(x2))))
FIGURE 4.13 Texas weather data, plot of temperature profiles (°F) for all 48 stations.
FIGURE 4.14 Texas weather data, contour plots of observations for January, April, July, and October. Dots represent station locations.
where tp.linear computes the RK, $\phi_2(x)\phi_2(z) + \phi_3(x)\phi_3(z)$, of the subspace $\{\phi_2, \phi_3\}$.

The location effect equals $\beta_1\phi_1(x_1) + \beta_2\phi_2(x_1) + f_1^{s}(x_1) + f_{12}^{ls}(x_1, x_2) + f_{12}^{ss}(x_1, x_2)$. We compute posterior means and standard deviations of the location effect for the southernmost (Rio Grande City 3W), northernmost (Stratford), westernmost (El Paso WSO AP), and easternmost (Marshall) stations:
> selsta <- c(tx.dat[tx.dat$x11==min(tx.dat$x11),7][1],
tx.dat[tx.dat$x11==max(tx.dat$x11),7][1],
tx.dat[tx.dat$x12==min(tx.dat$x12),7][1],
tx.dat[tx.dat$x12==max(tx.dat$x12),7][1])
> sellat <- sellong <- NULL
> for (i in 1:4) {
sellat <- c(sellat,
tx.dat$x11[tx.dat$stacod==selsta[i]][1])
sellong <- c(sellong,
tx.dat$x12[tx.dat$stacod==selsta[i]][1])
}
> grid <- data.frame(x11=rep(sellat,rep(40,4)),
x12=rep(sellong ,rep(40,4)),
x2=rep(seq(0,1,len=40), 4))
> tx.pred1 <- predict(tx.ssanova, grid,
terms=c(0,1,1,1,0,1,1))
The estimates of these effects and 95% Bayesian confidence intervals are shown in Figure 4.15. The curve in each plot shows how the temperature profile of that particular station differs from the average profile among the 48 stations. It is clear that the temperature in Rio Grande City is higher than average, and the temperature in Stratford is lower than average, especially in the winter. The temperature in El Paso is close to average in the first half of the year and lower than average in the second half. The temperature in Marshall is slightly above average.
The month effect equals $f_2(x_2) + f_{12}^{ls}(x_1, x_2) + f_{12}^{ss}(x_1, x_2)$. We compute posterior means and standard deviations of the month effect:
> tx.pred2 <- predict(tx.ssanova, terms=c(0,0,0,0,1,1,1))
The estimates of these effects for January, April, July, and October are shown in Figure 4.16. Each plot in Figure 4.16 shows how the temperature pattern for that particular month differs from the average pattern among all 12 months. As expected, the temperature in January is colder than average, while the temperature in July is warmer than average. In general, the difference becomes smaller from north to south. The temperatures in April and October are close to average.
FIGURE 4.15 Texas weather data, plots of the location effect (solid lines) for four selected stations with 95% Bayesian confidence intervals (dashed lines).
FIGURE 4.16 Texas weather data, plots of the month effect for January, April, July, and October.
Chapter 5

Spline Smoothing with Heteroscedastic and/or Correlated Errors
5.1 Problems with Heteroscedasticity and Correlation
In the previous chapters we have assumed that observations are independent with equal variances. These assumptions may not be appropriate for many applications. This chapter presents spline smoothing methods for heteroscedastic and/or correlated observations. Before introducing these methods, we first illustrate the potential problems associated with the presence of heteroscedasticity and correlation in spline smoothing.
We use the following simulation to illustrate the potential problems with heteroscedasticity. Observations are generated from model (1.1) with $f(x) = \sin(4\pi x^2)$, $x_i = i/n$ for $i = 1, \dots, n$, and $n = 100$. Random errors are generated independently from the Gaussian distribution with mean zero and variance $\sigma^2 \exp\{\alpha |f(x)|\}$. Therefore, we have unequal variances when $\alpha \neq 0$. We set $\sigma = 0.05$ and $\alpha = 4$. For each simulated data set, we first fit the cubic spline directly using PLS with the GCV and GML choices of smoothing parameters. Note that these direct fits ignore heteroscedasticity. We then fit the cubic spline using the penalized weighted LS (PWLS) introduced in Section 5.2.1 with known weights $W = \mathrm{diag}(\exp\{-4|f(x_1)|\}, \dots, \exp\{-4|f(x_n)|\})$. For each fit, we compute the weighted MSE (WMSE)
\[
\mathrm{WMSE} = \frac{1}{n} \sum_{i=1}^{n} w_i \{\hat{f}(x_i) - f(x_i)\}^2,
\]
where $w_i = \exp\{-4|f(x_i)|\}$. We also construct 95% Bayesian confidence intervals for each fit. The simulation is repeated 100 times. Figure 5.1 shows the performances of the unweighted and weighted methods in terms of WMSE and coverages of Bayesian confidence intervals. Figure 5.1(a)
FIGURE 5.1 (a) Boxplots of WMSEs based on PLS with GCV and GML choices of the smoothing parameter and PWLS with GCV (labeled as GCVW) and GML (labeled as GMLW) choices of the smoothing parameter. Average WMSEs are marked as pluses; (b) boxplots of across-the-function coverages of 95% Bayesian confidence intervals for the PLS fits with the GML choice of the smoothing parameter (labeled as unweighted) and the PWLS fits with the GML choice of the smoothing parameter (labeled as weighted). Average coverages are marked as pluses; (c) plot of pointwise coverages (solid line) of 95% Bayesian confidence intervals for the PLS fits with the GML choice of the smoothing parameter; (d) plot of pointwise coverages (solid line) of 95% Bayesian confidence intervals for the PWLS fits with the GML choice of the smoothing parameter. Dotted lines in (b), (c), and (d) represent the nominal value. Dashed lines in (c) and (d) represent a scaled version of the variance function $\exp\{\alpha|f(x)|\}$.
indicates that, even though they ignore heteroscedasticity, the unweighted methods provide good fits to the function. The weighted methods lead to better fits. Bayesian confidence intervals based on both methods provide the intended across-the-function coverages (Figure 5.1(b)). However, Figure 5.1(c) reveals the problem with heteroscedasticity: pointwise coverages in regions with larger variances are smaller than the nominal value, while pointwise coverages in other regions are larger than the nominal value. Obviously, this is caused by ignoring heteroscedasticity. Bayesian confidence intervals based on the PWLS method overcome this problem (Figure 5.1(d)).
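The WMSE used in this simulation is simple to compute. The following is a language-agnostic numerical sketch in Python (the book's analyses are in R); the helper name `wmse` is our own, not the book's code:

```python
import numpy as np

def wmse(f_hat, f_true, alpha=4.0):
    # weights w_i = exp(-alpha*|f(x_i)|), matching the simulation's
    # variance model sigma^2 * exp{alpha*|f(x)|} with alpha = 4
    w = np.exp(-alpha * np.abs(f_true))
    return np.mean(w * (f_hat - f_true) ** 2)

n = 100
x = np.arange(1, n + 1) / n
f = np.sin(4 * np.pi * x ** 2)
print(wmse(f, f))  # 0.0 for a perfect fit
```

Regions where $|f|$ is large (high variance) are down-weighted, so the criterion measures accuracy on the scale of the true noise level.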
FIGURE 5.2 Plots of the true function (dashed lines), observations (circles), and the cubic spline fits (solid lines) with UBR, GCV, and GML choices of smoothing parameters. The true variance is used in the calculation of the UBR criterion.
Compared with heteroscedasticity, the potential problems associated with correlation are more fundamental and difficult to deal with. For illustration, again we simulate data from model (1.1) with $f(x) = \sin(4\pi x^2)$, $x_i = i/n$ for $i = 1, \dots, n$, and $n = 100$. Random errors $\epsilon_i$ are generated by a first-order autoregressive model (AR(1)) with mean zero, standard deviation 0.2, and first-order correlation 0.6. Figure 5.2 shows the simulated data, the true function, and three cubic spline fits with smoothing parameters chosen by the UBR, GCV, and GML methods, respectively. The true variance is used in the calculation of the UBR criterion. All fits are wiggly, which indicates that the estimated smoothing parameters are too small. This undersmoothing phenomenon is common in the presence of positive correlation. Ignoring correlation, the UBR, GCV, and GML methods perceive all the trend (signal) in the data as due to the mean function $f$ and attempt to incorporate that trend into the estimate. Correlated random errors may induce local trends and thus fool these methods into selecting smaller smoothing parameters so that the local trends can be picked up.
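To see the mechanism concretely, one can generate AR(1) errors as in this simulation and verify their lag-1 correlation; a Python sketch (the book uses R), with a function name of our own choosing:

```python
import numpy as np

def ar1_errors(n, sd=0.2, rho=0.6, rng=None):
    # AR(1): e_t = rho*e_{t-1} + a_t, with innovations scaled so the
    # stationary standard deviation equals `sd`
    rng = np.random.default_rng(0) if rng is None else rng
    e = np.empty(n)
    e[0] = rng.normal(0.0, sd)
    for t in range(1, n):
        e[t] = rho * e[t - 1] + rng.normal(0.0, sd * np.sqrt(1 - rho**2))
    return e

e = ar1_errors(100_000)
print(round(np.corrcoef(e[:-1], e[1:])[0, 1], 2))  # close to 0.6
```

Runs of positively correlated errors drift above or below zero for many consecutive design points, which is exactly the spurious "local trend" that misleads the smoothing parameter selection methods.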
5.2 Extended SS ANOVA Models
Suppose observations are generated by
\[
y_i = \mathcal{L}_i f + \epsilon_i, \quad i = 1, \dots, n, \tag{5.1}
\]
where $f$ is a multivariate function in a model space $\mathcal{M}$, and $\mathcal{L}_i$ are bounded linear functionals. Let $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T$. We assume that $E(\epsilon) = 0$ and $\mathrm{Cov}(\epsilon) = \sigma^2 W^{-1}$. In this chapter we consider the SS ANOVA model space
\[
\mathcal{M} = \mathcal{H}_0 \oplus \mathcal{H}_1 \oplus \cdots \oplus \mathcal{H}_q, \tag{5.2}
\]
where $\mathcal{H}_0$ is a finite-dimensional space collecting all functions that are not going to be penalized, and $\mathcal{H}_1, \dots, \mathcal{H}_q$ are orthogonal RKHS's with RKs $R_j$ for $j = 1, \dots, q$. Model (5.1) is an extension of the SS ANOVA model (4.31) with non-iid random errors. See Section 8.2 for a more general model involving linear functionals and non-iid errors.
5.2.1 Penalized Weighted Least Squares
Our goal is to estimate $f$, as well as $W$ when it is unknown. We first assume that $W$ is fixed and consider the estimation of the nonparametric function $f$. A direct generalization of the PLS (4.32) is the following PWLS:
\[
\min_{f \in \mathcal{M}} \frac{1}{n} (y - f)^T W (y - f) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j f\|^2, \tag{5.3}
\]
where $y = (y_1, \dots, y_n)^T$, $f = (\mathcal{L}_1 f, \dots, \mathcal{L}_n f)^T$, and $P_j$ is the orthogonal projection in $\mathcal{M}$ onto $\mathcal{H}_j$.
Let $\theta = (\theta_1, \dots, \theta_q)$. Denote $\phi_1, \dots, \phi_p$ as basis functions of $\mathcal{H}_0$, $T = \{\mathcal{L}_i \phi_\nu\}_{i=1}^{n}{}_{\nu=1}^{p}$, $\Sigma_k = \{\mathcal{L}_i \mathcal{L}_j R_k\}_{i,j=1}^{n}$ for $k = 1, \dots, q$, and $\Sigma_\theta = \sum_{j=1}^{q} \theta_j \Sigma_j$. As in the previous chapters, we assume that $T$ is of full column rank. The same arguments in Section 2.4 apply to the PWLS. Therefore, by the Kimeldorf–Wahba representer theorem, the solution to (5.3) exists and is unique, and the solution can be represented as
\[
\hat{f}(x) = \sum_{\nu=1}^{p} d_\nu \phi_\nu(x) + \sum_{i=1}^{n} c_i \sum_{j=1}^{q} \theta_j \mathcal{L}_{i(z)} R_j(x, z).
\]
Let $c = (c_1, \dots, c_n)^T$ and $d = (d_1, \dots, d_p)^T$. Let $\hat{f} = (\mathcal{L}_1 \hat{f}, \dots, \mathcal{L}_n \hat{f})^T$ be the vector of fitted values. It is easy to check that $\hat{f} = Td + \Sigma_\theta c$ and $\sum_{j=1}^{q} \theta_j^{-1} \|P_j \hat{f}\|^2 = c^T \Sigma_\theta c$. Then the PWLS (5.3) reduces to
\[
\frac{1}{n} (y - Td - \Sigma_\theta c)^T W (y - Td - \Sigma_\theta c) + \lambda c^T \Sigma_\theta c. \tag{5.4}
\]
Taking the first derivatives leads to the following equations for $c$ and $d$:
\[
\begin{aligned}
(\Sigma_\theta W \Sigma_\theta + n\lambda \Sigma_\theta)c + \Sigma_\theta W T d &= \Sigma_\theta W y, \\
T^T W \Sigma_\theta c + T^T W T d &= T^T W y.
\end{aligned} \tag{5.5}
\]
It is easy to check that a solution to
\[
\begin{aligned}
(\Sigma_\theta + n\lambda W^{-1})c + Td &= y, \\
T^T c &= 0,
\end{aligned} \tag{5.6}
\]
is also a solution to (5.5). Let $M = \Sigma_\theta + n\lambda W^{-1}$ and
\[
T = (Q_1 \;\; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix}
\]
be the QR decomposition of $T$. Then the solutions to (5.6) are
\[
\begin{aligned}
c &= Q_2 (Q_2^T M Q_2)^{-1} Q_2^T y, \\
d &= R^{-1} Q_1^T (y - Mc).
\end{aligned} \tag{5.7}
\]
Based on the first equation in (5.6) and the fact that $\hat{f} = Td + \Sigma_\theta c$, we have
\[
\hat{f} = y - n\lambda W^{-1}c = H(\lambda, \theta) y,
\]
where
\[
H(\lambda, \theta) = I - n\lambda W^{-1} Q_2 (Q_2^T M Q_2)^{-1} Q_2^T \tag{5.8}
\]
is the hat matrix. The dependence of $H$ on the smoothing parameters is expressed explicitly in (5.8). In the remainder of this chapter, for simplicity, the notation $H$ will be used. Note that, different from the independent case, $H$ may be asymmetric.
To solve (5.6), for fixed $W$ and smoothing parameters, consider the transformations $\tilde{y} = W^{1/2}y$, $\tilde{T} = W^{1/2}T$, $\tilde{\Sigma}_\theta = W^{1/2}\Sigma_\theta W^{1/2}$, $\tilde{c} = W^{-1/2}c$, and $\tilde{d} = d$. Then the equations in (5.6) are equivalent to the following equations:
\[
\begin{aligned}
(\tilde{\Sigma}_\theta + n\lambda I)\tilde{c} + \tilde{T}\tilde{d} &= \tilde{y}, \\
\tilde{T}^T \tilde{c} &= 0.
\end{aligned} \tag{5.9}
\]
Note that the equations in (5.9) have the same form as those in (2.21); thus the methods in Section 2.4 can be used to compute $\tilde{c}$ and $\tilde{d}$. Transforming back, we have the solutions for $c$ and $d$.
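The computation just described can be written out numerically. The following Python sketch (the book's software is in R) solves (5.6) through the $W^{1/2}$ transformation and the QR-based formulas (5.7); all names are our own, not the ssr internals:

```python
import numpy as np

def pwls_solve(y, T, Sigma, W, nlam):
    """Solve the PWLS equations (5.6); `nlam` plays the role of n*lambda."""
    vals, vecs = np.linalg.eigh(W)                 # W symmetric positive definite
    Wh = vecs @ (np.sqrt(vals)[:, None] * vecs.T)  # symmetric W^{1/2}
    yt, Tt = Wh @ y, Wh @ T                        # transformed data
    St = Wh @ Sigma @ Wh                           # Sigma-tilde
    n, p = T.shape
    M = St + nlam * np.eye(n)                      # Sigma-tilde + n*lambda*I
    Q, R = np.linalg.qr(Tt, mode="complete")       # QR decomposition of T-tilde
    Q1, Q2, R1 = Q[:, :p], Q[:, p:], R[:p, :p]
    ct = Q2 @ np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ yt)  # c-tilde, as in (5.7)
    d = np.linalg.solve(R1, Q1.T @ (yt - M @ ct))
    return Wh @ ct, d                              # back-transform: c = W^{1/2} c-tilde
```

A solution can be checked by verifying that $(\Sigma_\theta + n\lambda W^{-1})c + Td = y$ and $T^T c = 0$ hold.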
5.2.2 UBR, GCV and GML Criteria
We now extend the UBR, GCV, and GML criteria for the estimation of the smoothing parameters $\lambda$ and $\theta$, as well as $W$ when it is unknown.

Define the weighted version of the loss function
\[
L(\lambda, \theta) = \frac{1}{n} (f - \hat{f})^T W (f - \hat{f}) \tag{5.10}
\]
and the weighted MSE as $\mathrm{WMSE}(\lambda, \theta) \triangleq E L(\lambda, \theta)$. Then
\[
\mathrm{WMSE}(\lambda, \theta) = \frac{1}{n} f^T (I - H^T) W (I - H) f + \frac{\sigma^2}{n} \mathrm{tr}(H^T W H W^{-1}). \tag{5.11}
\]
It is easy to check that an unbiased estimate of $\mathrm{WMSE}(\lambda, \theta) + \sigma^2$ is
\[
\mathrm{UBR}(\lambda, \theta) = \frac{1}{n} y^T (I - H^T) W (I - H) y + \frac{2\sigma^2}{n} \mathrm{tr} H. \tag{5.12}
\]
The UBR estimates of the smoothing parameters, and of $W$ when it is unknown, are the minimizers of $\mathrm{UBR}(\lambda, \theta)$.
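For a given hat matrix, (5.12) is a one-line computation; a Python sketch with names of our own choosing (not the assist package):

```python
import numpy as np

def ubr(y, H, W, sigma2):
    # UBR(lambda, theta) = (1/n) y^T (I - H^T) W (I - H) y + (2*sigma2/n) tr(H)
    n = len(y)
    r = (np.eye(n) - H) @ y   # residual vector (I - H)y
    return (r @ W @ r) / n + 2.0 * sigma2 * np.trace(H) / n
```

In practice $\lambda$ and $\theta$ enter through $H = H(\lambda, \theta)$, and the criterion is minimized over them numerically.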
To extend the CV and GCV methods, we first consider the special case when $W$ is diagonal: $W = \mathrm{diag}(w_1, \dots, w_n)$. Denote $\hat{f}^{[i]}$ as the minimizer of the PWLS (5.3) based on all observations except $y_i$. It is easy to check that the leaving-out-one lemma in Section 3.4 still holds and that $\mathcal{L}_i \hat{f}^{[i]} - y_i = (\mathcal{L}_i \hat{f} - y_i)/(1 - h_{ii})$, where $h_{ii}$ are the diagonal elements of $H$. Then the cross-validation criterion is
\[
\mathrm{CV}(\lambda, \theta) \triangleq \frac{1}{n} \sum_{i=1}^{n} w_i \left( \mathcal{L}_i \hat{f}^{[i]} - y_i \right)^2 = \frac{1}{n} \sum_{i=1}^{n} \frac{w_i (\mathcal{L}_i \hat{f} - y_i)^2}{(1 - h_{ii})^2}. \tag{5.13}
\]
Replacing $h_{ii}$ by the average of the diagonal elements leads to the GCV criterion
\[
\mathrm{GCV}(\lambda, \theta) \triangleq \frac{\frac{1}{n} \|W^{1/2}(I - H)y\|^2}{\left\{ \frac{1}{n} \mathrm{tr}(I - H) \right\}^2}. \tag{5.14}
\]
Next consider the following model for clustered data:
\[
y_{ij} = \mathcal{L}_{ij} f + \epsilon_{ij}, \quad i = 1, \dots, m; \; j = 1, \dots, n_i, \tag{5.15}
\]
where $y_{ij}$ is the $j$th observation in cluster $i$, and $\mathcal{L}_{ij}$ are bounded linear functionals. Let $n = \sum_{i=1}^{m} n_i$, $y_i = (y_{i1}, \dots, y_{in_i})^T$, $y = (y_1^T, \dots, y_m^T)^T$, $\epsilon_i = (\epsilon_{i1}, \dots, \epsilon_{in_i})^T$, and $\epsilon = (\epsilon_1^T, \dots, \epsilon_m^T)^T$. We assume that $\mathrm{Cov}(\epsilon_i) = \sigma^2 W_i^{-1}$ and that observations between clusters are independent. Consequently, $\mathrm{Cov}(\epsilon) = \sigma^2 W^{-1}$, where $W = \mathrm{diag}(W_1, \dots, W_m)$ is a block diagonal matrix. Assume that $f \in \mathcal{M}$, where $\mathcal{M}$ is the model space in (5.2). We estimate $f$ using the PWLS (5.3). Let $\hat{f}$ be the PWLS estimate based on all observations, and $\hat{f}^{[i]}$ be the PWLS estimate based on all observations except those from the $i$th cluster $y_i$. Let $f_i = (\mathcal{L}_{i1} f, \dots, \mathcal{L}_{in_i} f)^T$, $f = (f_1^T, \dots, f_m^T)^T$, $\hat{f}_i = (\mathcal{L}_{i1} \hat{f}, \dots, \mathcal{L}_{in_i} \hat{f})^T$, $\hat{f} = (\hat{f}_1^T, \dots, \hat{f}_m^T)^T$, $\hat{f}^{[i]}_j = (\mathcal{L}_{j1} \hat{f}^{[i]}, \dots, \mathcal{L}_{jn_j} \hat{f}^{[i]})^T$, and $\hat{f}^{[i]} = ((\hat{f}^{[i]}_1)^T, \dots, (\hat{f}^{[i]}_m)^T)^T$.
Leaving-out-one-cluster Lemma

For any fixed $i$, $\hat{f}^{[i]}$ is the minimizer of
\[
\frac{1}{n} \left( \hat{f}^{[i]}_i - f_i \right)^T W_i \left( \hat{f}^{[i]}_i - f_i \right) + \frac{1}{n} \sum_{j \neq i} (y_j - f_j)^T W_j (y_j - f_j) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j f\|^2. \tag{5.16}
\]
[Proof] For any function $f \in \mathcal{M}$, we have
\[
\begin{aligned}
&\frac{1}{n} \left( \hat{f}^{[i]}_i - f_i \right)^T W_i \left( \hat{f}^{[i]}_i - f_i \right) + \frac{1}{n} \sum_{j \neq i} (y_j - f_j)^T W_j (y_j - f_j) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j f\|^2 \\
&\geq \frac{1}{n} \sum_{j \neq i} (y_j - f_j)^T W_j (y_j - f_j) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j f\|^2 \\
&\geq \frac{1}{n} \sum_{j \neq i} \left( y_j - \hat{f}^{[i]}_j \right)^T W_j \left( y_j - \hat{f}^{[i]}_j \right) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j \hat{f}^{[i]}\|^2 \\
&= \frac{1}{n} \left( \hat{f}^{[i]}_i - \hat{f}^{[i]}_i \right)^T W_i \left( \hat{f}^{[i]}_i - \hat{f}^{[i]}_i \right) + \frac{1}{n} \sum_{j \neq i} \left( y_j - \hat{f}^{[i]}_j \right)^T W_j \left( y_j - \hat{f}^{[i]}_j \right) + \lambda \sum_{j=1}^{q} \theta_j^{-1} \|P_j \hat{f}^{[i]}\|^2.
\end{aligned}
\]
The leaving-out-one-cluster lemma implies that $\hat{f} = Hy$ and $\hat{f}^{[i]} = Hy^{[i]}$, where $H$ is the hat matrix and $y^{[i]}$ is the same as $y$ except that $y_i$ is replaced by $\hat{f}^{[i]}_i$. Divide the hat matrix $H$ according to clusters such that $H = \{H_{ik}\}_{i,k=1}^{m}$, where $H_{ik}$ is an $n_i \times n_k$ matrix. Then we have
\[
\begin{aligned}
\hat{f}_i &= \sum_{j=1}^{m} H_{ij} y_j, \\
\hat{f}^{[i]}_i &= \sum_{j \neq i} H_{ij} y_j + H_{ii} \hat{f}^{[i]}_i.
\end{aligned}
\]
Assume that $I - H_{ii}$ is invertible. Then
\[
\hat{f}^{[i]}_i - y_i = (I - H_{ii})^{-1} (\hat{f}_i - y_i).
\]
The cross-validation criterion is
\[
\begin{aligned}
\mathrm{CV}(\lambda, \theta) &\triangleq \frac{1}{n} \sum_{i=1}^{m} \left( \hat{f}^{[i]}_i - y_i \right)^T W_i \left( \hat{f}^{[i]}_i - y_i \right) \\
&= \frac{1}{n} \sum_{i=1}^{m} \left( \hat{f}_i - y_i \right)^T (I - H_{ii})^{-T} W_i (I - H_{ii})^{-1} \left( \hat{f}_i - y_i \right). \tag{5.17}
\end{aligned}
\]
Replacing $I - H_{ii}$ by its generalized average $G_i$ (Ma, Dai, Klein, Klein, Lee and Wahba 2010), we have
\[
\mathrm{GCV}(\lambda, \theta) \triangleq \frac{1}{n} \sum_{i=1}^{m} \left( \hat{f}_i - y_i \right)^T G_i W_i G_i \left( \hat{f}_i - y_i \right), \tag{5.18}
\]
where $G_i = a_i I_{n_i} - b_i 1_{n_i} 1_{n_i}^T$, $a_i = 1/(\delta_i - \gamma_i)$, $b_i = \gamma_i/[(\delta_i - \gamma_i)\{\delta_i + (n_i - 1)\gamma_i\}]$, $\delta_i = (n - \mathrm{tr} H)/m n_i$, $\gamma_i = 0$ when $n_i = 1$ and $\gamma_i = -\sum_{i=1}^{m} \sum_{s \neq t} h^{i}_{st}/m n_i (n_i - 1)$ when $n_i > 1$, $h^{i}_{st}$ is the element in the $s$th row and $t$th column of the matrix $H_{ii}$, $I_{n_i}$ is an identity matrix of size $n_i$, and $1_{n_i}$ is an $n_i$-vector of all ones. The CV and GCV estimates of the smoothing parameters, and of $W$ when it is unknown, are the minimizers of the CV and GCV criteria.
To derive the GML criterion, we first construct a Bayes model for the extended SS ANOVA model (5.1). Assume the same prior for $f$ as in (4.42):
\[
F(x) = \sum_{\nu=1}^{p} \zeta_\nu \phi_\nu(x) + \delta^{1/2} \sum_{j=1}^{q} \sqrt{\theta_j}\, U_j(x), \tag{5.19}
\]
where $\zeta_\nu \overset{iid}{\sim} N(0, \kappa)$; $U_j(x)$ are independent, zero-mean Gaussian stochastic processes with covariance functions $R_j(x, z)$; $\zeta_\nu$ and $U_j(x)$ are mutually independent; and $\kappa$ and $\delta$ are positive constants. Suppose observations are generated from
\[
y_i = \mathcal{L}_i F + \epsilon_i, \quad i = 1, \dots, n, \tag{5.20}
\]
where $\epsilon = (\epsilon_1, \dots, \epsilon_n)^T \sim N(0, \sigma^2 W^{-1})$.

Let $\mathcal{L}_0$ be a bounded linear functional on $\mathcal{M}$, and let $\lambda = \sigma^2/n\delta$. The same arguments in Section 3.6 hold when $M = \Sigma + n\lambda I$ is replaced by $M = \Sigma_\theta + n\lambda W^{-1}$ in this chapter (Wang 1998b). Therefore,
\[
\lim_{\kappa \to \infty} E(\mathcal{L}_0 F \mid y) = \mathcal{L}_0 \hat{f},
\]
and an extension of the GML criterion is
\[
\mathrm{GML}(\lambda, \theta) = \frac{y^T W (I - H) y}{\{\mathrm{det}^+(W(I - H))\}^{\frac{1}{n-p}}}, \tag{5.21}
\]
where $\mathrm{det}^+$ is the product of the nonzero eigenvalues. The GML estimates of the smoothing parameters, and of $W$ when it is unknown, are the minimizers of $\mathrm{GML}(\lambda, \theta)$.

The GML estimator of the variance $\sigma^2$ is (Wang 1998b)
\[
\hat{\sigma}^2 = \frac{y^T W (I - H) y}{n - p}. \tag{5.22}
\]
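Both (5.21) and (5.22) can be evaluated directly for a given hat matrix; a Python sketch, with $\mathrm{det}^+$ computed as the product of the nonzero eigenvalues of $W(I - H)$ (names are ours, not the assist package):

```python
import numpy as np

def gml(y, H, W, p):
    # GML(lambda, theta) = y'W(I-H)y / det+(W(I-H))^{1/(n-p)};
    # the variance estimate (5.22) is y'W(I-H)y / (n-p)
    n = len(y)
    A = W @ (np.eye(n) - H)
    ev = np.linalg.eigvals(A)
    nz = ev[np.abs(ev) > 1e-8 * max(np.abs(ev).max(), 1.0)].real
    num = y @ W @ (np.eye(n) - H) @ y
    return num / np.prod(nz) ** (1.0 / (n - p)), num / (n - p)
```

For the special case $W = I$ and a symmetric projection hat matrix of rank $p$, the nonzero eigenvalues of $I - H$ all equal one, so the criterion reduces to the residual sum of squares.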
5.2.3 Known Covariance
In this section we discuss the implementation of the UBR, GCV, and GML criteria in Section 5.2.2 when $W$ is known. In this situation we only need to estimate the smoothing parameters $\lambda$ and $\theta$. Consider the transformations discussed at the end of Section 5.2.1. Let $\tilde{f} = W^{1/2}\hat{f}$ be the fits to the transformed data, and $\tilde{H}$ be the hat matrix associated with the transformed data. Then $Hy = \hat{f} = W^{-1/2}\tilde{f} = W^{-1/2}\tilde{H}\tilde{y} = W^{-1/2}\tilde{H}W^{1/2}y$ for any $y$. Therefore,
\[
H = W^{-1/2} \tilde{H} W^{1/2}. \tag{5.23}
\]
From (5.23), the UBR, GCV, and GML criteria in (5.12), (5.14), and (5.21) can be rewritten based on the transformed data as
\[
\begin{aligned}
\mathrm{UBR}(\lambda, \theta) &= \frac{1}{n} \|(I - \tilde{H})\tilde{y}\|^2 + \frac{2\sigma^2}{n} \mathrm{tr} \tilde{H}, &\quad (5.24) \\
\mathrm{GCV}(\lambda, \theta) &= \frac{\frac{1}{n} \|(I - \tilde{H})\tilde{y}\|^2}{\left\{ \frac{1}{n} \mathrm{tr}(I - \tilde{H}) \right\}^2}, &\quad (5.25) \\
\mathrm{GML}(\lambda, \theta) &= \frac{C \tilde{y}^T (I - \tilde{H}) \tilde{y}}{\{\mathrm{det}^+(I - \tilde{H})\}^{\frac{1}{n-p}}}, &\quad (5.26)
\end{aligned}
\]
where $C$ in $\mathrm{GML}(\lambda, \theta)$ is a constant independent of $\lambda$ and $\theta$. Equations (5.24), (5.25), and (5.26) indicate that, when $W$ is known, the UBR, GCV, and GML estimates of the smoothing parameters can be calculated based on the transformed data using the method described in Section 4.7.
5.2.4 Unknown Covariance
We now consider the case when $W$ is unknown. When a separate method is available for estimating $W$, one approach is to estimate the function $f$ and the covariance $W$ iteratively. For example, the following procedure is simple and easy to implement: (1) estimate the function with a "sensible" choice of the smoothing parameter; (2) estimate the covariance using residuals; and (3) estimate the function again using the estimated covariance. However, this approach may not work in certain situations due to the interplay between the smoothing parameter and the correlation.
We use the following two simple simulations to illustrate the potential problem associated with the above iterative approach. In the first simulation, $n = 100$ observations are generated according to model (1.1) with $f(x) = \sin(4\pi x^2)$, $x_i = i/n$ for $i = 1, \dots, n$, and $\epsilon_i \overset{iid}{\sim} N(0, 0.2^2)$. We fit a cubic spline with a fixed smoothing parameter $\lambda$ such that $\log_{10}(n\lambda) = -3.5$. Figure 5.3(a) shows the fit, which is slightly oversmoothed. The estimated autocorrelation function (ACF) of the residuals (Figure 5.3(b)) suggests an autoregressive structure even though the true random errors are independent. It is clear that the leftover signal due to oversmoothing shows up in the residuals. In the second simulation, $n = 100$ observations are generated according to model (1.1) with $f(x) \equiv 0$, $x_i = i/n$ for $i = 1, \dots, n$, and $\epsilon_i$ generated by an AR(1) model with mean zero, standard deviation 0.2, and first-order correlation 0.6. Figure 5.3(c) shows the cubic spline fit with the GCV choice of the smoothing parameter. The fit picks up local trends in the AR(1) process, and the estimated ACF of the residuals (Figure 5.3(d)) does not reveal any autoregressive structure. In both cases, the mean functions are incorrectly estimated, and the conclusions about the error structures are erroneous. These two simulations indicate that a wrong choice of the smoothing parameter in the first step will lead to a deceptive serial correlation in the second step.
In the most general setting, where no parametric shape is assumed for the mean or the correlation function, the model is essentially unidentifiable. In the following we model the correlation structure parametrically. Specifically, we assume that $W$ depends on an unknown vector of parameters $\tau$. Models for $W^{-1}$ will be discussed in Section 5.3.
When there is no strong connection between the smoothing and correlation parameters, an iterative procedure as described earlier may be
FIGURE 5.3 For the first simulation, plots of (a) the true function (dashed line), observations (circles), and the cubic spline fit (solid line) with $\log_{10}(n\lambda) = -3.5$, and (b) the estimated ACF of the residuals. For the second simulation, plots of (c) the true function (dashed line), observations (circles), and the cubic spline fit (solid line) with the GCV choice of the smoothing parameter, and (d) the estimated ACF of the residuals.
used to estimate $f$ and $W$. See Sections 5.4.1 and 6.4 for examples. When there is a strong connection, it is helpful to estimate the smoothing and correlation parameters simultaneously. One may use the UBR, GCV, or GML criterion to estimate the smoothing and correlation parameters simultaneously. It was found that the GML method performs better than the UBR and GCV methods (Wang 1998b). Therefore, in the following, we present the implementation of the GML method only.

To compute the minimizers of the GML criterion (5.21), we now construct a corresponding LME model for the extended SS ANOVA model (5.1). Let $\Sigma_k = Z_k Z_k^T$, where $Z_k$ is an $n \times m_k$ matrix with $m_k = \mathrm{rank}(\Sigma_k)$. Consider the following LME model:
\[
y = T\zeta + \sum_{k=1}^{q} Z_k u_k + \epsilon, \tag{5.27}
\]
where $\zeta$ is a $p$-dimensional vector of deterministic parameters; $u_k$ are mutually independent random effects, $u_k \sim N(0, \sigma^2 \theta_k I_{m_k}/n\lambda)$; $I_{m_k}$ is the identity matrix of order $m_k$; $\epsilon \sim N(0, \sigma^2 W^{-1})$; and $u_k$ are independent of $\epsilon$. It is not difficult to show that the spline estimate based on the PWLS is the BLUP estimate, and that the GML criterion (5.21) is the REML criterion based on the LME model (5.27) (Opsomer, Wang and Yang 2001). Details for the more complicated semiparametric mixed-effects models will be given in Chapter 9. In the ssr function, we first calculate $Z_k$ through a Cholesky decomposition. Then we calculate the REML (GML) estimates of $\lambda$, $\theta$, and $\tau$ using the function lme in the nlme library by Pinheiro and Bates (2000).
5.2.5 Confidence Intervals
Consider the Bayes model defined in (5.19) and (5.20). Conditional on $W$, it is not difficult to show that the formulae for posterior means and covariances in Section 4.8 hold when $M = \Sigma_\theta + n\lambda I$ is replaced by $M = \Sigma_\theta + n\lambda W^{-1}$. Bayesian confidence intervals can be constructed similarly. When the matrix $W$ involves unknown parameters $\tau$, it is replaced by $W(\hat{\tau})$ in the construction of Bayesian confidence intervals. This simple plug-in approach does not account for the variation in estimating the covariance parameters $\tau$. The bootstrap approach may also be used to construct confidence intervals. More research is necessary for inferences on the nonparametric function $f$ as well as on the parameters $\tau$.
5.3 Variance and Correlation Structures
In this section we discuss commonly used models for the variance–covariance matrix $W^{-1}$. The matrix $W^{-1}$ can be decomposed as
\[
W^{-1} = V C V,
\]
where $V$ is a diagonal matrix with positive diagonal elements, and $C$ is a correlation matrix with all diagonal elements equal to one. The matrices $V$ and $C$ describe the variance and correlation, respectively. The above decomposition allows us to develop separate models for the variance structure and the correlation structure. We now describe the structures available in the assist package. Details about these structures can be found in Pinheiro and Bates (2000).
Consider the variance structure first. Define the general variance function as
\[
\mathrm{Var}(\epsilon_i) = \sigma^2 v^2(\mu_i, z_i; \zeta), \tag{5.28}
\]
where $v$ is a known variance function of the mean $\mu_i = E(y_i)$ and a vector of covariates $z_i$ associated with the variance, and $\zeta$ is a vector of variance parameters. As in Pinheiro and Bates (2000), when $v$ depends on $\mu_i$, the pseudo-likelihood method will be used to estimate all parameters and the function $f$.
In the ssr function, a variance matrix (vector) or a model for the variance function is specified using the weights argument. The input to the weights argument may be the matrix $W$ when it is known. Furthermore, when $W$ is a known diagonal matrix, the diagonal elements may be specified as a vector by the weights argument. The input to the weights argument may also be a varFunc structure specifying a model for the variance function $v$ in (5.28). All varFunc objects available in the nlme package are available for the assist package. The varFunc structure is defined through the function $v(s; \zeta)$, where $s$ can be either a variance covariate or the fitted value. Standard varFunc classes and their corresponding variance functions $v$ are listed in Table 5.1. Two or more variance models may be combined using the varComb constructor.
TABLE 5.1 Standard varFunc classes

Class            v(s; ζ)
varFixed         √|s|
varIdent         ζ_s, s is a stratification variable
varPower         |s|^ζ
varExp           exp(ζs)
varConstPower    ζ1 + |s|^ζ2
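Since each entry in Table 5.1 is a one-line formula, the variance functions are easy to evaluate directly. The following sketch (written in Python purely for illustration; the book's software is R, and the function names here are ours, not the nlme API) computes a few of them:

```python
import math

# Variance functions v(s; zeta) from Table 5.1; s is the variance
# covariate (or fitted value) and zeta a variance parameter.
def var_fixed(s):                # varFixed: v = sqrt(|s|)
    return math.sqrt(abs(s))

def var_power(s, zeta):          # varPower: v = |s|^zeta
    return abs(s) ** zeta

def var_exp(s, zeta):            # varExp: v = exp(zeta * s)
    return math.exp(zeta * s)

def var_const_power(s, z1, z2):  # varConstPower: v = zeta1 + |s|^zeta2
    return z1 + abs(s) ** z2

# Var(eps) = sigma^2 * v(s)^2; e.g., varPower with zeta = 0.5 makes the
# variance proportional to |s|.
print(var_power(4.0, 0.5))  # 2.0
```

Note that v enters the model as a standard-deviation multiplier, so the variance in (5.28) involves v squared.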
Next consider the correlation structure. Assume the general isotropic correlation structure

cor(ǫi, ǫj) = h(d(pi, pj); ρ),   (5.29)
152 Smoothing Splines: Methods and Applications

where pi and pj are position vectors associated with observations i and j, respectively; h is a known correlation function of the distance d(pi, pj) such that h(0; ρ) = 1; and ρ is a vector of correlation parameters. In the ssr function, correlation structures are specified as corStruct
objects through the correlation argument. There are two common families of correlation structures: serial correlation structures for time series and spatial correlation structures for spatial data.
For time series data, observations are indexed by a one-dimensional position vector, and d(pi, pj) represents lags that take nonnegative integer values. Therefore, a serial correlation structure is determined by the function h(k; ρ) for k = 1, 2, . . . . Standard corStruct classes for serial correlation structures and their corresponding correlation functions h are listed in Table 5.2.
TABLE 5.2 Standard corStruct classes for serial correlation structures

Class         h(k; ρ)
corCompSymm   ρ
corSymm       ρ_k
corAR1        ρ^k
corARMA       correlation function given in (5.30)
The classes corCompSymm and corSymm correspond to the compound symmetry and general correlation structures. An autoregressive moving average (ARMA(p, q)) model assumes that

ǫt = φ1ǫt−1 + · · · + φpǫt−p + θ1at−1 + · · · + θqat−q + at,

where the at are iid random variables with mean zero and a constant variance, φ1, . . . , φp are autoregressive parameters, and θ1, . . . , θq are moving average parameters. The correlation function is defined recursively as

h(k; ρ) = { φ1h(|k − 1|; ρ) + · · · + φph(|k − p|; ρ) + θ1ψ(k − 1; ρ) + · · · + θqψ(k − q; ρ),   k ≤ q,
          { φ1h(|k − 1|; ρ) + · · · + φph(|k − p|; ρ),   k > q,        (5.30)

where ρ = (φ1, . . . , φp, θ1, . . . , θq) and ψ(k; ρ) = E(ǫt−k at)/Var(ǫt). The continuous AR(1) correlation function is defined as

h(s; ρ) = ρ^s,   s ≥ 0, ρ ≥ 0.
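For k > q, (5.30) involves only the autoregressive part, so the correlation function can be computed by direct recursion once the initial values are known. A minimal sketch (Python, for illustration only; the function name is ours) checks the AR(1) case, where the recursion reproduces the corAR1 entry h(k; ρ) = ρ^k of Table 5.2:

```python
def ar1_corr(phi, kmax):
    """Autocorrelation h(k) of an AR(1) process via the recursion
    h(k) = phi * h(k - 1), the k > q branch of (5.30) with p = 1 and
    no moving-average terms; h(0) = 1."""
    h = [1.0]
    for k in range(1, kmax + 1):
        h.append(phi * h[k - 1])
    return h

# The recursion reproduces the closed form h(k) = phi^k.
print(ar1_corr(0.5, 4))  # [1.0, 0.5, 0.25, 0.125, 0.0625]
```

For p > 1 the first p values must instead be obtained by solving the Yule–Walker equations, since h(k) for small k appears on both sides of the recursion.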
The corCAR1 constructor specifies the continuous AR(1) structure. For spatial data, locations pi ∈ R^r. Any well-defined distance may be used, such as the Euclidean distance d(pi, pj) = {Σ_{k=1}^r (pik − pjk)²}^{1/2}, the Manhattan distance d(pi, pj) = Σ_{k=1}^r |pik − pjk|, or the maximum distance d(pi, pj) = max_{1≤k≤r} |pik − pjk|. The correlation structure is defined through the function h(s; ρ), where s ≥ 0. Standard corStruct classes for spatial correlation structures and their corresponding correlation functions h are listed in Table 5.3. The function I(s < ρ) equals 1 when s < ρ and 0 otherwise. When desirable, the following correlation function allows a nugget effect:
hnugg(s, c0; ρ) = { (1 − c0)hcont(s; ρ),   s > 0,
                  { 1,                     s = 0,

where hcont is any standard correlation function that is continuous in s, and 0 < c0 < 1 is the nugget effect.
TABLE 5.3 Standard corStruct classes for spatial correlation structures

Class      h(s; ρ)
corExp     exp(−s/ρ)
corGaus    exp{−(s/ρ)²}
corLin     (1 − s/ρ)I(s < ρ)
corRatio   1/{1 + (s/ρ)²}
corSpher   {1 − 1.5(s/ρ) + 0.5(s/ρ)³}I(s < ρ)
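The entries of Table 5.3 and the nugget construction above translate directly into code. A small sketch (Python, illustration only; the names are ours, not the assist or nlme API) evaluates a few of the correlation functions at a Euclidean distance:

```python
import math

def euclidean(p, q):
    # Euclidean distance between two locations in R^r
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Correlation functions h(s; rho) from Table 5.3, with s >= 0 and range rho > 0
def cor_exp(s, rho):
    return math.exp(-s / rho)

def cor_gaus(s, rho):
    return math.exp(-((s / rho) ** 2))

def cor_spher(s, rho):
    # {1 - 1.5(s/rho) + 0.5(s/rho)^3} I(s < rho)
    return (1 - 1.5 * (s / rho) + 0.5 * (s / rho) ** 3) if s < rho else 0.0

def with_nugget(h_cont, s, c0, rho):
    # Nugget version: a discontinuity of size c0 at s = 0
    return 1.0 if s == 0 else (1 - c0) * h_cont(s, rho)

s = euclidean((0.0, 0.0), (3.0, 4.0))       # s = 5.0
print(cor_exp(s, 5.0))                       # exp(-1), about 0.368
print(with_nugget(cor_exp, 0.0, 0.3, 5.0))  # 1.0 at zero distance
```

All of these decay from 1 at s = 0; corLin and corSpher reach exactly zero at the range ρ, while corExp, corGaus, and corRatio only approach zero asymptotically.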
5.4 Examples
5.4.1 Simulated Motorcycle Accident — Revisit
The plot of the observations in Figure 3.8 suggests that the variances may change over time. For the partial spline fit to model (3.53), we compute the squared residuals and plot the logarithm of the squared residuals over time in Figure 5.4(a). The cubic spline fit to the logarithm of the squared residuals in Figure 5.4(a) indicates that the constant variance assumption may be violated. We then fit model (3.53) again using PWLS with weights fixed at the estimated variances:
> r <- residuals(mcycle.ps.fit2); y <- log(r**2)
> mcycle.v <- ssr(y~x, cubic(x), spar="m")
> mcycle.ps.fit3 <- update(mcycle.ps.fit2, weights=exp(mcycle.v$fit))
> predict(mcycle.ps.fit3,
newdata=data.frame(x=grid, s1=(grid-t1)*(grid>t1),
s2=(grid-t2)*(grid>t2), s3=(grid-t3)*(grid>t3)))
FIGURE 5.4 Motorcycle data, plots of (a) logarithm of the squared residuals (circles) based on model (3.53), and the cubic spline fit (line) to the logarithm of the squared residuals; and (b) observations (circles), the new PWLS fit (line), and 95% Bayesian confidence intervals (shaded region).
The PWLS fit and 95% Bayesian confidence intervals are shown in Figure 5.4(b). The impact of the unequal variances is reflected in the widths of the confidence intervals. The two-step approach adopted here is crude, and the variation in the estimation of the variance function has been ignored in the construction of the confidence intervals. Additional methods for estimating the mean and variance functions will be discussed in Section 6.4.
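To make the two-step logic concrete, here is a stylized miniature (Python, hypothetical data, and a constant mean in place of the partial spline, purely for illustration): fit ignoring heteroscedasticity, estimate variances from squared residuals, then refit with weights equal to the inverse estimated variances:

```python
# Step 1: unweighted fit of a constant mean (stand-in for the spline fit)
y = [1.0, 1.2, 0.8, 3.0, -1.0, 1.2]   # hypothetical observations
group = [0, 0, 0, 1, 1, 1]            # variance stratum of each observation
mean1 = sum(y) / len(y)

# Step 2: estimate a variance for each stratum from the squared residuals
var = [0.0, 0.0]
for g in (0, 1):
    r2 = [(yi - mean1) ** 2 for yi, gi in zip(y, group) if gi == g]
    var[g] = sum(r2) / len(r2)

# Step 3: PWLS-style refit with weights 1/variance
w = [1.0 / var[g] for g in group]
mean2 = sum(wi * yi for wi, yi in zip(w, y)) / sum(w)
print(mean1, mean2)  # the refit leans toward the low-variance stratum
```

As in the motorcycle example, the reweighting pulls the fit toward the precisely measured observations and, in the full spline setting, narrows the confidence intervals where the variance is small.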
5.4.2 Ozone in Arosa — Revisit
Figure 2.2 suggests that the variances may not be constant. Based on the fit of the trigonometric spline model (2.75), we calculate residual variances for each month and plot them on the logarithmic scale in Figure 5.5(a). It is obvious that the variances depend on the time of the year. It seems that a simple sinusoidal function can be used to model the variance function.
> v <- sapply(split(arosa.ls.fit$resi,Arosa$month),var)
> a <- sort(unique(Arosa$x))
> b <- lm(log(v)~sin(2*pi*a)+cos(2*pi*a))
> summary(b)
...
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 5.43715 0.05763 94.341 8.57e-15 ***
sin(2 * pi * a) 0.71786 0.08151 8.807 1.02e-05 ***
cos(2 * pi * a) 0.49854 0.08151 6.117 0.000176 ***
---
Residual standard error: 0.1996 on 9 degrees of freedom
Multiple R-squared: 0.9274, Adjusted R-squared: 0.9113
F-statistic: 57.49 on 2 and 9 DF, p-value: 7.48e-06
FIGURE 5.5 Arosa data, plots of (a) logarithm of residual variances (circles) based on the periodic spline fit in Section 2.7, the sinusoidal fit to the logarithm of squared residuals (dashed line), and the fit from model (5.31) (solid line); and (b) observations (dots), the new PWLS fit (solid line), and 95% Bayesian confidence intervals (shaded region).
The fit of the simple sinusoidal model to the log variance is shown in Figure 5.5(a). We now assume the following variance function for the trigonometric spline model (2.75):

v(x) = exp(ζ1 sin 2πx + ζ2 cos 2πx),   (5.31)
and fit the model as follows:
156 Smoothing Splines: Methods and Applications
> arosa.ls.fit1 <- ssr(thick~sin(2*pi*x)+cos(2*pi*x),
    rk=lspline(x,type="sine1"), spar="m", data=Arosa,
    weights=varComb(varExp(form=~sin(2*pi*x)),
    varExp(form=~cos(2*pi*x))))
> summary(arosa.ls.fit1)
...
GML estimate(s) of smoothing parameter(s) : 3.675780e-09
Equivalent Degrees of Freedom (DF): 6.84728
Estimate of sigma: 15.22466
Combination of:
Variance function structure of class varExp representing
expon
0.3555942
Variance function structure of class varExp representing
expon
0.2497364
The estimated variance parameters, 0.3556 and 0.2497, are very close (up to a scale of 2, by definition) to those in the sinusoidal model based on residual variances, 0.7179 and 0.4985. The fitted variance function is plotted in Figure 5.5(a); it is almost identical to the fit based on residual variances. Figure 5.5(b) plots the trigonometric spline fit to the mean function with 95% Bayesian confidence intervals. Note that these confidence intervals are conditional on the estimated variance parameters. Thus they may have smaller coverage than the nominal value since variation in the estimation of the variance parameters is not accounted for. Nevertheless, we can see that the unequal variances are reflected in the widths of these confidence intervals.
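The "scale of 2" arises because a varFunc multiplies the standard deviation, so the fitted variance is σ² exp(2ζ1 sin 2πx + 2ζ2 cos 2πx), whereas the lm fit above models the log variance directly; its coefficients therefore estimate 2ζ1 and 2ζ2. A quick numerical check of the reported estimates (Python, illustration only):

```python
# Log-variance regression coefficients from the lm fit above
lm_sin, lm_cos = 0.71786, 0.49854
# varExp parameters from the ssr fit; they act on the standard deviation,
# so the implied log-variance coefficients are 2 * zeta
zeta_sin, zeta_cos = 0.3555942, 0.2497364

print(2 * zeta_sin, lm_sin)  # roughly 0.711 vs 0.718
print(2 * zeta_cos, lm_cos)  # roughly 0.499 vs 0.499
```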
Observations close in time may be correlated. We now consider a first-order autoregressive structure for the random errors. Since some observations are missing, we use the continuous AR(1) correlation structure h(s; ρ) = ρ^s, where s represents the distance in terms of calendar time between observations. We refit model (2.75) using the variance structure (5.31) and the continuous AR(1) correlation structure as follows:
> Arosa$time <- Arosa$month+12*(Arosa$year-1)
> arosa.ls.fit2 <- update(arosa.ls.fit1,
corr=corCAR1(form=~time))
> summary(arosa.ls.fit2)
...
GML estimate(s) of smoothing parameter(s) : 3.575283e-09
Equivalent Degrees of Freedom (DF): 7.327834
Estimate of sigma: 15.31094
Correlation structure of class corCAR1 representing
Phi
0.3411414
Combination of:
Variance function structure of class varExp representing
expon
0.3602905
Variance function structure of class varExp representing
expon
0.3009282
where the variable time represents the continuous calendar time in months.
5.4.3 Beveridge Wheat Price Index
The Beveridge data contain the time series of the annual wheat price index from 1500 to 1869. The time series of the price index on the logarithmic scale is shown in Figure 5.6.
FIGURE 5.6 Beveridge data, observations (circles), the cubic spline fit under the assumption of independent random errors (dashed line), and the cubic spline fit under the AR(1) correlation structure (solid line) with 95% Bayesian confidence intervals (shaded region).
Let x be the year scaled into the interval [0, 1] and y be the logarithm of the price index. Consider the following nonparametric regression model:
yi = f(xi) + ǫi, i = 1, . . . , n, (5.32)
where f ∈ W_2^2[0, 1] and ǫi are random errors. Under the assumption that the ǫi are independent with a common variance, model (5.32) can be fitted as follows:
> library(tseries); data(bev)
> y <- log(bev); x <- seq(0,1,len=length(y))
> bev.fit1 <- ssr(y~x, rk=cubic(x))
> summary(bev.fit1)
...
GCV estimate(s) of smoothing parameter(s) : 3.032761e-13
Equivalent Degrees of Freedom (DF): 344.8450
Estimate of sigma: 0.02865570
where the GCV method was used to select the smoothing parameter. The estimate of the smoothing parameter is essentially zero, which indicates that the cubic spline fit interpolates the observations (dashed line in Figure 5.6). The undersmoothing is likely caused by the autocorrelation in the time series. Now consider an AR(1) correlation structure for the random errors:
> bev.fit2 <- update(bev.fit1, spar="m",
    correlation=corAR1(form=~1))
> summary(bev.fit2)
GML estimate(s) of smoothing parameter(s) : 1.249805e-06
Equivalent Degrees of Freedom (DF): 8.091024
Estimate of sigma: 0.2243519
Correlation structure of class corAR1 representing
Phi
0.6936947
The estimate of f and 95% Bayesian confidence intervals under the AR(1) correlation structure are shown in Figure 5.6 as the solid line and the shaded region.
5.4.4 Lake Acidity
The lake acidity data contain measurements of 112 lakes in the southern Blue Ridge mountains area. It is of interest to investigate the dependence of the water pH level on calcium concentration and geological location. To match the notation in Chapter 4, we relabel calcium concentration t1 as variable x1, and geological location x1 (latitude) and x2 (longitude) as variables x21 and x22, respectively. Let x2 = (x21, x22).
First, we fit a one-dimensional thin-plate spline to the response variable ph using one independent variable x1 (calcium):
ph(xi1) = f(xi1) + ǫi, (5.33)
where f ∈ W_2^2(R), and ǫi are zero-mean independent random errors with a common variance.
> data(acid)
> acid$x21 <- acid$x1; acid$x22 <- acid$x2
> acid$x1 <- acid$t1
> acid.tp.fit1 <- ssr(ph~x1, rk=tp(x1), data=acid)
> summary(acid.tp.fit1)
...
GCV estimate(s) of smoothing parameter(s) : 3.433083e-06
Equivalent Degrees of Freedom (DF): 8.20945
Estimate of sigma: 0.281299
Number of Observations: 112
> anova(acid.tp.fit1)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value
LMP 0.003250714 100 0.08
GCV 0.008239078 100 0.01
Both p-values from the LMP and GCV tests suggest that the departure from a straight line model is borderline significant. The estimate of the function f in model (5.33) is shown in the left panel of Figure 5.7.

Observations of the pH level close in geological location are often correlated. Suppose that we want to model potential spatial correlation among random errors in model (5.33) using the exponential spatial correlation structure with a nugget effect for the location variable x2 = (x21, x22). That is, we assume that
hnugg(s, c0; ρ) = { (1 − c0) exp(−s/ρ),   s > 0,
                  { 1,                    s = 0,

where c0 is the nugget effect and s is the Euclidean distance between two geological locations. Model (5.33) with the above correlation structure can be fitted as follows:
FIGURE 5.7 Lake acidity data, the left panel includes observations (circles), the fit from model (5.33) (solid line), the fit from model (5.33) with the exponential spatial correlation structure (dashed line), and the estimate of the constant plus main effect of x1 from model (5.36) (dotted line); the right panel includes the estimate of the main effect of x2 from model (5.36).
> acid.tp.fit2 <- update(acid.tp.fit1, spar="m",
    corr=corExp(form=~x21+x22, nugget=T))
> summary(acid.tp.fit2)
...
GML estimate(s) of smoothing parameter(s) : 53310.63
Equivalent Degrees of Freedom (DF): 2
Estimate of sigma: 0.3131702
Correlation structure of class corExp representing
range nugget
0.02454532 0.62321744
The GML estimate of the smoothing parameter is large, and the spline estimate is essentially a straight line (left panel of Figure 5.7). The smaller smoothing parameter in the first fit under the independence assumption might be caused by the spatial correlation. The equivalent degrees of freedom for f have been reduced from 8.2 to 2.

We can also model the effect of geological location directly in the mean function. That is, we consider the mean pH level as a bivariate function of x1 (calcium) and x2 (geological location):
ph(xi1,xi2) = f(xi1,xi2) + ǫi, (5.34)
where ǫi are zero-mean independent random errors with a common variance. One possible model space for x1 is W_2^2(R), and one possible model space for x2 is W_2^2(R²). Therefore, we consider the tensor product space W_2^2(R) ⊗ W_2^2(R²). Define the averaging operators
A^(1)_1 f = Σ_{j=1}^{J1} w_{j1} f(u_{j1}),

A^(1)_2 f = { Σ_{j=1}^{J1} w_{j1} f(u_{j1}) φ12(u_{j1}) } φ12,

A^(2)_1 f = Σ_{j=1}^{J2} w_{j2} f(u_{j2}),

A^(2)_2 f = Σ_{j=1}^{J2} w_{j2} f(u_{j2}) { φ22(u_{j2}) φ22 + φ23(u_{j2}) φ23 },
where u_{j1} and u_{j2} are fixed points in R and R², and w_{j1} and w_{j2} are fixed positive weights such that Σ_{j=1}^{J1} w_{j1} = Σ_{j=1}^{J2} w_{j2} = 1; φ11 = 1 and φ12 are orthonormal bases for the null space in W_2^2(R) based on the norm (2.41); φ21 = 1, and φ22 and φ23 are orthonormal bases for the null space in W_2^2(R²) based on the norm (2.41). Let A^(1)_3 = I − A^(1)_1 − A^(1)_2 and A^(2)_3 = I − A^(2)_1 − A^(2)_2. Then we have the following SS ANOVA decomposition:
f = { A^(1)_1 + A^(1)_2 + A^(1)_3 } { A^(2)_1 + A^(2)_2 + A^(2)_3 } f
  = A^(1)_1 A^(2)_1 f + A^(1)_2 A^(2)_1 f + A^(1)_3 A^(2)_1 f
    + A^(1)_1 A^(2)_2 f + A^(1)_2 A^(2)_2 f + A^(1)_3 A^(2)_2 f
    + A^(1)_1 A^(2)_3 f + A^(1)_2 A^(2)_3 f + A^(1)_3 A^(2)_3 f
  ≜ µ + β1 φ12(x1) + β2 φ22(x2) + β3 φ23(x2) + β4 φ12(x1)φ22(x2)
    + β5 φ12(x1)φ23(x2) + f^s_1(x1) + f^s_2(x2) + f^{ls}_{12}(x1, x2)
    + f^{sl}_{12}(x1, x2) + f^{ss}_{12}(x1, x2).   (5.35)
Due to the small sample size, we ignore all interactions and consider the following additive model:

y_i = µ + β1 φ12(x_{i1}) + β2 φ22(x_{i2}) + β3 φ23(x_{i2}) + f^s_1(x_{i1}) + f^s_2(x_{i2}) + ǫ_i,   (5.36)
where ǫi are zero-mean independent random errors with a common variance. We fit model (5.36) as follows:
> acid.ssanova.fit <- ssr(ph~x1+x21+x22,
    rk=list(tp(x1), tp(list(x21,x22))),
    data=acid, spar="m")
> summary(acid.ssanova.fit)
...
GML estimate(s) of smoothing parameter(s) : 4.819636e-01
2.870235e-07
Equivalent Degrees of Freedom (DF): 8.768487
Estimate of sigma: 0.2560684
The estimate of the main effect of x1 plus the constant, µ + β1 φ12(x1) + f^s_1(x1), is shown in the left panel of Figure 5.7. This estimate is almost identical to that from model (5.33) with the exponential spatial correlation structure. The estimate of the main effect of x2 is shown in the right panel of Figure 5.7.
An alternative approach to modeling the effect of geological locationusing mixed-effects will be discussed in Section 9.4.2.
Chapter 6
Generalized Smoothing Spline ANOVA
6.1 Generalized SS ANOVA Models
Generalized linear models (GLM) (McCullagh and Nelder 1989) provide a unified framework for the analysis of data from exponential families. Denote (xi, yi) for i = 1, . . . , n as independent observations on independent variables x = (x1, . . . , xd) and dependent variable y. Assume that the yi are generated from a distribution in the exponential family with density function
g(yi; fi, φ) = exp{ [yi h(fi) − b(fi)]/ai(φ) + c(yi, φ) },   (6.1)
where fi = f(xi), h(fi) is a monotone transformation of fi known as the canonical parameter, and φ is a dispersion parameter. The function f models the effect of the independent variables x. Denote µi = E(yi), Gc as the canonical link such that Gc(µi) = h(fi), and G as the link function such that G(µi) = fi. Then h = Gc ◦ G^{−1}, and it reduces to the identity function when the canonical link is chosen for G. The last term c(yi, φ) in (6.1) is independent of f.
A GLM assumes that f(x) = x^T β. As with linear models for Gaussian data, the parametric GLM may be too restrictive for some applications. We consider the nonparametric extension of the GLM in this chapter. In addition to providing more flexible models, the nonparametric extension also provides model building and diagnostic methods for GLMs.
Let the domain of each covariate xk be an arbitrary set Xk. Consider f as a multivariate function on the product domain X = X1 × X2 × · · · × Xd. The SS ANOVA decomposition introduced in Chapter 4 may be applied to construct candidate model spaces for f. In particular, we will assume that f ∈ M, where
M = H0 ⊕H1 ⊕ · · · ⊕ Hq
is an SS ANOVA model space defined in (4.30), H0 = span{φ1, . . . , φp} is a finite dimensional space collecting all functions that are not penalized, and the Hj for j = 1, . . . , q are orthogonal RKHS's with RKs Rj. The same notation as in Chapter 4 will be used in this chapter.
We assume the same density function (6.1) for yi. However, for generality, we assume that f is observed through some linear functionals. Specifically, fi = Lif, where the Li are known bounded linear functionals.
6.2 Estimation and Inference
6.2.1 Penalized Likelihood Estimation
Assume that ai(φ) = a(φ)/ϖi for i = 1, . . . , n, where the ϖi are known constants. Denote σ² = a(φ), y = (y1, . . . , yn)^T, and f = (f1, . . . , fn)^T. Let

li(fi) = ϖi{b(fi) − yi h(fi)},   i = 1, . . . , n,   (6.2)

and l(f) = Σ_{i=1}^n li(fi). Then the log-likelihood

Σ_{i=1}^n log g(yi; fi, φ) = Σ_{i=1}^n { [yi h(fi) − b(fi)]/ai(φ) + c(yi, φ) } = −(1/σ²) l(f) + C,   (6.3)
where C is independent of f. Therefore, up to an additive constant and a multiplying constant, l(f) is the negative log-likelihood. Note that l(f) is independent of the dispersion parameter.

For a GLM, the MLEs of the parameters β are the maximizers of the log-likelihood. For a generalized SS ANOVA model, as in the Gaussian case, a penalty term is necessary to avoid overfitting. We will use the same form of penalty as in Chapter 4. Specifically, we estimate f as the solution to the following penalized likelihood (PL) problem:
min_{f∈M} { l(f) + (n/2) Σ_{j=1}^q λj ‖Pj f‖² },   (6.4)

where Pj is the orthogonal projector in M onto Hj, and the λj are smoothing parameters. The multiplying term 1/σ² is absorbed into the smoothing parameters, and the constant C is dropped since it is independent of f. The multiplying constant n/2 is added such that the PL reduces to the PLS (4.32) for Gaussian data. Under the new inner product defined in (4.33), the PL can be rewritten as
min_{f∈M} { l(f) + (nλ/2) ‖P*_1 f‖²_* },   (6.5)
where λj = λ/θj, and P*_1 = Σ_{j=1}^q Pj is the orthogonal projection in M onto H*_1 = ⊕_{j=1}^q Hj. Note that the RK of H*_1 under the new inner product is R*_1 = Σ_{j=1}^q θj Rj.

It is easy to check that l(f) is convex in f under the canonical link. In general, we assume that l(f) is convex in f and has a unique minimizer in H0. Then the PL (6.5) has a unique minimizer (Theorem 2.9 in Gu (2002)). We now show that the Kimeldorf–Wahba representer theorem holds. Let R0 be the RK of H0 and R = R0 + R*_1. Let ηi be the representer associated with Li. Then, from (2.12),
ηi(x) = Li(z)R(x, z) = Li(z)R0(x, z) + Li(z)R*_1(x, z) ≜ δi(x) + ξi(x).
That is, the representers ηi for Li belong to the finite dimensional subspace S = H0 ⊕ span{ξ1, . . . , ξn}. Let S^c be the orthogonal complement of S. Any f ∈ H can be decomposed as f = ς1 + ς2, where ς1 ∈ S and ς2 ∈ S^c. Then we have
Lif = (ηi, f) = (ηi, ς1) + (ηi, ς2) = (ηi, ς1) = Liς1.
Consequently, for any f ∈ H, the PL (6.5) satisfies
Σ_{i=1}^n li(Lif) + (nλ/2) ‖P*_1 f‖²_*
  = Σ_{i=1}^n li(Liς1) + (nλ/2)( ‖P*_1 ς1‖²_* + ‖P*_1 ς2‖²_* )
  ≥ Σ_{i=1}^n li(Liς1) + (nλ/2) ‖P*_1 ς1‖²_*,

where equality holds iff ‖P*_1 ς2‖_* = ‖ς2‖_* = 0. Thus, the minimizer of the PL falls in the finite dimensional space S, and it can be represented as
f(x) = Σ_{ν=1}^p dν φν(x) + Σ_{i=1}^n ci ξi(x)
     = Σ_{ν=1}^p dν φν(x) + Σ_{i=1}^n ci Σ_{j=1}^q θj Li(z)Rj(x, z).   (6.6)
For simplicity of notation, the dependence of f on the smoothing parameters λ and θ = (θ1, . . . , θq) is not expressed explicitly. Let T = {Liφν}_{i=1,ν=1}^{n,p}, Σk = {LiLjRk}_{i,j=1}^n for k = 1, . . . , q, and Σθ = θ1Σ1 + · · · + θqΣq. Let d = (d1, . . . , dp)^T and c = (c1, . . . , cn)^T. Denote f = (L1f, . . . , Lnf)^T. Then f = Td + Σθc. Note that ‖P*_1 f‖²_* = c^T Σθ c. Substituting (6.6) into (6.5), we need to solve for c and d by minimizing

I(c, d) = l(Td + Σθc) + (nλ/2) c^T Σθ c.   (6.7)
Except for the Gaussian distribution, the function l(f) is not quadratic, and (6.7) cannot be solved directly. For fixed λ and θ, we will apply the Newton–Raphson procedure to compute c and d. Let ui = dli/dfi and wi = d²li/dfi², where fi = Lif. Let u = (u1, . . . , un)^T and W = diag(w1, . . . , wn). Then

∂I/∂c = Σθ u + nλ Σθ c,   ∂I/∂d = T^T u,

∂²I/∂c∂c^T = Σθ W Σθ + nλ Σθ,   ∂²I/∂c∂d^T = Σθ W T,   ∂²I/∂d∂d^T = T^T W T.
The Newton–Raphson procedure iteratively solves the linear system

[ Σθ W− Σθ + nλΣθ    Σθ W− T ] [ c − c− ]   [ −Σθ u− − nλΣθ c− ]
[ T^T W− Σθ          T^T W− T ] [ d − d− ] = [ −T^T u−           ],   (6.8)

where the subscript minus indicates quantities evaluated at the previous Newton–Raphson iteration. The equations in (6.8) can be rewritten as

(Σθ W− Σθ + nλΣθ) c + Σθ W− T d = Σθ W− f− − Σθ u−,
T^T W− Σθ c + T^T W− T d = T^T W− f− − T^T u−,   (6.9)

where f− = T d− + Σθ c−. As discussed in Section 2.4, it is only necessary to derive one set of solutions. Let ỹ = f− − W−^{−1} u−. It is easy to see that a solution to

(Σθ + nλ W−^{−1}) c + T d = ỹ,
T^T c = 0,   (6.10)
is also a solution to (6.9). Note that W− is known at the current Newton–Raphson iteration. Since the equations in (6.10) have the same form as those in (5.6), the methods in Section 5.2.1 can be used to solve (6.10). Furthermore, (6.10) corresponds to the minimizer of the following PWLS problem:

(1/n) Σ_{i=1}^n w_{i−} (ỹi − fi)² + Σ_{j=1}^q λj ‖Pj f‖²,   (6.11)

where ỹi is the ith element of ỹ. Therefore, at each iteration, the Newton–Raphson procedure solves the PWLS criterion with working variables ỹi and working weights w_{i−}. Consequently, the procedure can be regarded as iteratively reweighted PLS.
6.2.2 Selection of Smoothing Parameters
With a canonical link such that h(f) = f, we have E(yi) = b′(fi), Var(yi) = b′′(fi) ai(φ), ui = ϖi{b′(fi) − yi}, and wi = ϖi b′′(fi). Therefore, E(ui/wi) = 0 and Var(ui/wi) = σ² wi^{−1}. Consequently, when f− is close to f and under some regularity conditions, it can be shown (Wang 1994, Gu 2002) that the working variables approximately have the same structure as in (5.1):

ỹi = Lif + ǫi + op(1),   (6.12)

where ǫi has mean 0 and variance σ² wi^{−1}. The Newton–Raphson procedure essentially reformulates the problem to model f on working variables at each iteration.
From the above discussion, and noting that W− is known at the current Newton–Raphson iteration, we can use the UBR, GCV, and GML methods discussed in Section 5.2.3 to select smoothing parameters at each step of the Newton–Raphson procedure. Since the working data are reformulated at each iteration, the target criteria of these iterative smoothing parameter selection methods change throughout the iteration. Therefore, the overall target criteria of these iterative methods are not explicitly defined. A justification for the UBR criterion can be found in Gu (2002).
One practical problem with the iterative methods for selecting smoothing parameters is that they are not guaranteed to converge. Nevertheless, extensive simulations indicate that, in general, the above algorithm works reasonably well in practice (Wang 1994, Wang, Wahba, Chappell and Gu 1995).
Nonconvergence may become a serious problem for certain applications (Liu, Tong and Wang 2007). Some direct noniterative methods for selecting smoothing parameters have been proposed. For the Poisson and gamma distributions, it is possible to derive unbiased estimates of the symmetrized Kullback–Leibler discrepancy (Wong 2006, Liu et al. 2007). Xiang and Wahba (1996) proposed a direct GCV method. Details about the direct GCV method can be found in Gu (2002). A direct GML method will be discussed in Section 6.2.4. These direct methods are usually more computationally intensive. They have not been implemented in the current version of the assist package. It is not difficult to write R functions to implement these direct methods. A simple implementation of the direct GML method for the gamma distribution will be given in Sections 6.4 and 6.5.3.
6.2.3 Algorithm and Implementation
The whole procedure discussed in Sections 6.2.1 and 6.2.2 is summarized in the following algorithm.
Algorithm for generalized SS ANOVA models
1. Compute the matrices T and Σk for k = 1, . . . , q, and set an initial value for f.
2. Compute u−, W−, the transformed quantities W−^{1/2} T and W−^{1/2} Σk W−^{1/2} for k = 1, . . . , q, and the transformed working data W−^{1/2} ỹ, and fit the transformed data with smoothing parameters selected by the UBR, GCV, or GML method.
3. Iterate step 2 until convergence.
The above algorithm is easy to implement. All we need to do is compute the quantities ui and wi. We now compute these quantities for some special distributions.

First consider logistic regression with the logit link function. Assume that y ∼ Binomial(m, p) with density function
g(y) = (m choose y) p^y (1 − p)^{m−y},   y = 0, . . . , m.

Then σ² = 1 and li = −yi fi + mi log(1 + exp(fi)), where fi = log(pi/(1 − pi)). Consequently, ui = −yi + mi pi and wi = mi pi(1 − pi). Note that binary data are a special case with mi = 1.
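These ui and wi are all the Newton–Raphson step needs. As a toy illustration (in Python rather than the book's R; hypothetical data; an intercept-only model with mi = 1 and no penalty), the update f ← f − (Σ ui)/(Σ wi) converges to the logit of the sample proportion:

```python
import math

def fit_intercept_logistic(y, iters=50):
    """Newton-Raphson for an intercept-only binary model (m_i = 1):
    u_i = -y_i + p_i and w_i = p_i (1 - p_i), with p = 1/(1 + e^{-f})."""
    f = 0.0
    for _ in range(iters):
        p = 1.0 / (1.0 + math.exp(-f))
        u = sum(-yi + p for yi in y)   # sum of u_i
        w = len(y) * p * (1.0 - p)     # sum of w_i
        f -= u / w                     # Newton step
    return f

# Hypothetical data: 3 successes out of 4, so the MLE is logit(3/4) = log 3
print(fit_intercept_logistic([1, 1, 1, 0]))  # close to log(3), about 1.0986
```

In the penalized setting each step additionally involves the smoothing terms, which is exactly the PWLS problem (6.11).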
Next consider Poisson regression with the log link function. Assume that y ∼ Poisson(µ) with density function

g(y) = (1/y!) µ^y exp(−µ),   y = 0, 1, . . . .
Then σ² = 1 and li = −yi fi + exp(fi), where fi = log µi. Consequently, ui = −yi + µi and wi = µi.

Last consider gamma regression with the log link function. Assume that y ∼ Gamma(α, β) with density function

g(y) = 1/(Γ(α) β^α) y^{α−1} exp(−y/β),   y > 0,
where α > 0 and β > 0 are shape and scale parameters. We are interested in modeling the mean µ ≜ E(y) as a function of the independent variables. We assume that the shape parameter does not depend on the independent variables. Note that µ = αβ. The density function may be reparametrized as

g(y) = α^α/(Γ(α) µ^α) y^{α−1} exp(−αy/µ),   y > 0.

The canonical parameter −µ^{−1} is negative. To avoid this constraint, we adopt the log link. Then σ² = α^{−1} and li = yi exp(−fi) + fi, where fi = log µi. Consequently, ui = −yi/µi + 1 and wi = yi/µi. Since the wi are nonnegative, l(f) is a convex function of f.
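The formulas for ui and wi above are easy to verify numerically by differentiating li with central differences; a short sketch (Python, illustration only, helper names ours):

```python
import math

def num_deriv(fun, x, h=1e-5):
    # Central first and second differences
    d1 = (fun(x + h) - fun(x - h)) / (2 * h)
    d2 = (fun(x + h) - 2 * fun(x) + fun(x - h)) / h ** 2
    return d1, d2

# l_i for one observation under each logit/log-linked model
def l_binom(f, y, m): return -y * f + m * math.log(1 + math.exp(f))
def l_pois(f, y):     return -y * f + math.exp(f)
def l_gamma(f, y):    return y * math.exp(-f) + f

f, y, m = 0.3, 2.0, 5.0
p, mu = 1 / (1 + math.exp(-f)), math.exp(f)

u, w = num_deriv(lambda t: l_binom(t, y, m), f)
assert abs(u - (-y + m * p)) < 1e-5 and abs(w - m * p * (1 - p)) < 1e-4

u, w = num_deriv(lambda t: l_pois(t, y), f)
assert abs(u - (-y + mu)) < 1e-5 and abs(w - mu) < 1e-4

u, w = num_deriv(lambda t: l_gamma(t, y), f)
assert abs(u - (-y / mu + 1)) < 1e-5 and abs(w - y / mu) < 1e-4
print("u_i and w_i match the analytic formulas")
```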
For the binomial and Poisson distributions, the dispersion parameter σ² is fixed at σ² = 1. For the gamma distribution, the dispersion parameter is σ² = α^{−1}. Since this constant has been separated from the definition of li, we can set σ² = 1 in the UBR criterion. Therefore, for the binomial, Poisson, and gamma distributions, the UBR criterion reduces to

UBR(λ, θ) = (1/n) ‖(I − H)ỹ‖² + (2/n) tr H,   (6.13)

where H is the hat matrix associated with the transformed data.
In general, the weighted average of residuals

σ̂²− = (1/n) Σ_{i=1}^n w_{i−} (u_{i−}/w_{i−})² = (1/n) Σ_{i=1}^n u²_{i−}/w_{i−}   (6.14)

provides an estimate of σ² at the current iteration when it is unknown. Then the UBR criterion reduces to
UBR(λ, θ) = (1/n) ‖(I − H)ỹ‖² + (2σ̂²−/n) tr H.   (6.15)

There are two versions of the UBR criterion, given in equations (6.13) and (6.15). The first version is preferable when σ² is known.
The above algorithm is implemented in the ssr function for the binomial, Poisson, and gamma distributions based on a collection of Fortran subroutines called GRKPACK (Wang 1997). The specific distribution is specified by the family argument. The method for selecting smoothing parameters is specified by the argument spar, with "u", "v", and "m" representing the UBR, GCV, and GML criteria defined in (6.15), (5.25), and (5.26), respectively. The UBR method with fixed dispersion parameter (6.13) is specified as spar="u" together with the option varht for specifying the fixed dispersion parameter. Specifically, for the binomial, Poisson, and gamma distributions with σ² = 1, the combination spar="u" and varht=1 is used.
6.2.4 Bayes Model, Direct GML, and Approximate Bayesian Confidence Intervals

Suppose observations yi are generated from (6.1) with fi = Lif. Assume the same prior for f as in (4.42):
F(x) = Σ_{ν=1}^p ζν φν(x) + δ^{1/2} Σ_{j=1}^q √θj Uj(x),   (6.16)

where the ζν are iid N(0, κ); the Uj(x) are independent, zero-mean Gaussian stochastic processes with covariance functions Rj(x, z); ζν and Uj(x) are mutually independent; and κ and δ are positive constants. Conditional on ζ = (ζ1, . . . , ζp)^T, f|ζ ∼ N(Tζ, δΣθ). Letting κ → ∞ and integrating out ζ, Gu (1992) showed that the marginal density of f is
p(f) ∝ exp{ −(1/2δ) f^T ( Σθ^+ − Σθ^+ T (T^T Σθ^+ T)^{−1} T^T Σθ^+ ) f },

where Σθ^+ is the Moore–Penrose inverse of Σθ. The marginal density of y,
p(y) = ∫ p(y|f) p(f) df,   (6.17)

usually does not have a closed form since, except for the Gaussian distribution, the log-likelihood log p(y|f) is not quadratic in f. Note that
log p(y|f) = Σ_{i=1}^n log g(yi; fi, φ) = −(1/σ²) l(f) + C.
We now approximate l(f) by a quadratic function.
Let $u_{ic}$ and $w_{ic}$ be $u_i$ and $w_i$ evaluated at $\hat f$. Let $u_c = (u_{1c}, \ldots, u_{nc})^T$, $W_c = \mathrm{diag}(w_{1c}, \ldots, w_{nc})$, and $y_c = \hat f - W_c^{-1} u_c$. Note that $\partial l(f)/\partial f\,|_{\hat f} = u_c$ and $\partial^2 l(f)/\partial f \partial f^T\,|_{\hat f} = W_c$. Expanding $l(f)$ as a function of $f$ around the fitted values $\hat f$ to the second order leads to
$$l(f) \approx \frac{1}{2} (f - y_c)^T W_c (f - y_c) + l(\hat f) - \frac{1}{2} u_c^T W_c^{-1} u_c. \qquad (6.18)$$
Note that $\log p(f)$ is quadratic in $f$. It can then be shown that, applying approximation (6.18), the marginal density of $y$ in (6.17) is approximately proportional to (Liu, Meiring and Wang 2005)
$$p(y) \propto |W_c|^{-\frac12}\, |V|^{-\frac12}\, |T^T V^{-1} T|^{-\frac12}\, p(y|\hat f) \exp\left\{ \frac{1}{2\sigma^2} u_c^T W_c^{-1} u_c \right\} \times \exp\left\{ -\frac{1}{2} y_c^T \left( V^{-1} - V^{-1} T (T^T V^{-1} T)^{-1} T^T V^{-1} \right) y_c \right\}, \qquad (6.19)$$
where $V = \delta \Sigma_\theta + \sigma^2 W_c^{-1}$. When $\Sigma_\theta$ is nonsingular, $\hat f$ is the maximizer of the integrand $p(y|f)p(f)$ in (6.17) (Gu 2002). In this case the foregoing approximation is simply the Laplace approximation.
Let $\tilde y_c = W_c^{1/2} y_c$, $\tilde\Sigma_\theta = W_c^{1/2} \Sigma_\theta W_c^{1/2}$, and $\tilde T = W_c^{1/2} T$, and let the QR decomposition of $\tilde T$ be $(Q_1\; Q_2)(R^T\; 0)^T$. Let $U E U^T$ be the eigendecomposition of $Q_2^T \tilde\Sigma_\theta Q_2$, where $E = \mathrm{diag}(e_1, \ldots, e_{n-p})$ and $e_1 \ge e_2 \ge \cdots \ge e_{n-p}$ are eigenvalues. Let $z = (z_1, \ldots, z_{n-p})^T = U^T Q_2^T \tilde y_c$. Then it can be shown (Liu et al. 2005) that (6.19) is equivalent to
$$p(y) \propto |R|^{-1}\, p(y|\hat f) \exp\left\{ \frac{1}{2\sigma^2} u_c^T W_c^{-1} u_c \right\} \times \prod_{\nu=1}^{n-p} (\delta e_\nu + \sigma^2)^{-\frac12} \exp\left\{ -\frac{1}{2} \sum_{\nu=1}^{n-p} \frac{z_\nu^2}{\delta e_\nu + \sigma^2} \right\}.$$
Let $\delta = \sigma^2/n\lambda$. Then an approximation of the negative log-marginal likelihood is
$$\mathrm{DGML}(\lambda, \boldsymbol\theta) = \log|R| + \frac{1}{\sigma^2}\, l(\hat f) - \frac{1}{2\sigma^2} u_c^T W_c^{-1} u_c + \frac{1}{2} \sum_{\nu=1}^{n-p} \left\{ \log(e_\nu/n\lambda + 1) + \frac{z_\nu^2}{\sigma^2 (e_\nu/n\lambda + 1)} \right\}. \qquad (6.20)$$
Notice that $\hat f$, $u_c$, $W_c$, $R$, $e_\nu$, and $z_\nu$ all depend on $\lambda$ and $\boldsymbol\theta$ even though the dependencies are not expressed explicitly. The function $\mathrm{DGML}(\lambda, \boldsymbol\theta)$ is referred to as the direct generalized maximum likelihood (DGML) criterion, and the minimizers of $\mathrm{DGML}(\lambda, \boldsymbol\theta)$ are called the DGML estimates of the smoothing parameters. Section 6.4 shows a simple implementation of the DGML criterion for the gamma distribution.
Let $y_c = (y_{1c}, \ldots, y_{nc})^T$. Based on (6.12), consider the approximation model at convergence
$$y_{ic} = \mathcal{L}_i f + \epsilon_i, \qquad i = 1, \ldots, n, \qquad (6.21)$$
where $\epsilon_i$ has mean 0 and variance $\sigma^2 w_{ic}^{-1}$. Assume prior (6.16) for $f$. Then, as in Section 5.2.5, Bayesian confidence intervals can be constructed for $f$ in the approximation model (6.21). They provide approximate confidence intervals for $f$ in the generalized SS ANOVA model. The bootstrap approach may also be used to construct confidence intervals; the extension is straightforward.
Connections between smoothing spline models and LME models are presented in Sections 3.5, 4.7, and 5.2.4. We now extend this connection to data from exponential families. Consider the following generalized linear mixed-effects model (GLMM) (Breslow and Clayton 1993)
$$G\{E(y|u)\} = Td + Zu, \qquad (6.22)$$
where $G$ is the link function, $d$ are fixed effects, $u = (u_1^T, \ldots, u_q^T)^T$ are random effects, $Z = (I_n, \ldots, I_n)$, $u_k \sim N(0, \sigma^2 \theta_k \Sigma_k / n\lambda)$ for $k = 1, \ldots, q$, and the $u_k$ are mutually independent. Then $u \sim N(0, \sigma^2 D / n\lambda)$, where $D = \mathrm{diag}(\theta_1 \Sigma_1, \ldots, \theta_q \Sigma_q)$. Setting $u_k = \theta_k \Sigma_k c$ as in the Gaussian case (Opsomer et al. 2001) and noting that $Z D Z^T = \Sigma_\theta$, we have $u = D Z^T c$ and
$$u^T \{\mathrm{Cov}(u)\}^{+} u = n\lambda\, c^T Z D D^{+} D Z^T c / \sigma^2 = n\lambda\, c^T Z D Z^T c / \sigma^2 = n\lambda\, c^T \Sigma_\theta c / \sigma^2.$$
Note that $Zu = \Sigma_\theta c$. Therefore the PL (6.7) is the same as the penalized quasi-likelihood (PQL) of the GLMM (6.22) (equation (6) in Breslow and Clayton (1993)).
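The chain of identities above is easy to confirm numerically with the Moore–Penrose pseudo-inverse. The sketch below is an illustration in Python/NumPy; the random positive semidefinite $\Sigma_k$ and the particular constants are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n, q = 6, 2
nlam, sigma2 = 0.3, 1.7            # n*lambda and sigma^2, arbitrary
theta = np.array([0.8, 2.5])

# random positive semidefinite Sigma_k
Sigmas = [(lambda A: A @ A.T)(rng.standard_normal((n, n))) for _ in range(q)]
c = rng.standard_normal(n)

Z = np.hstack([np.eye(n)] * q)     # Z = (I_n, ..., I_n)
D = np.zeros((q * n, q * n))       # D = diag(theta_1 Sigma_1, ..., theta_q Sigma_q)
for k in range(q):
    D[k * n:(k + 1) * n, k * n:(k + 1) * n] = theta[k] * Sigmas[k]

Sigma_theta = sum(t * S for t, S in zip(theta, Sigmas))
u = D @ Z.T @ c                    # u = D Z^T c
quad = u @ np.linalg.pinv(sigma2 * D / nlam) @ u   # u^T {Cov(u)}^+ u
target = nlam * c @ Sigma_theta @ c / sigma2       # n*lambda c^T Sigma_theta c / sigma^2
```

The key step is $D D^{+} D = D$, which holds for any matrix, so the quadratic form collapses to $n\lambda\, c^T Z D Z^T c/\sigma^2$ regardless of whether $D$ is singular.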
6.3 Wisconsin Epidemiological Study of Diabetic Retinopathy
We use the Wisconsin Epidemiological Study of Diabetic Retinopathy (WESDR) data to illustrate how to fit an SS ANOVA model to binary responses. Based on Wahba, Wang, Gu, Klein and Klein (1995), we investigate how the probability of progression to diabetic retinopathy at the first follow-up (prg) depends on the following covariates at baseline: duration of diabetes (dur), glycosylated hemoglobin (gly), and body mass index (bmi).
Let y be the response variable prg where y = 1 represents progressionof retinopathy and y = 0 otherwise. Let x1, x2, and x3 be the covariatesdur, gly, and bmi transformed into [0, 1]. Let x = (x1, x2, x3) and
$f(x) = \mathrm{logit}\, P(y = 1|x)$. We model $f$ using the tensor product space $W_2^2[0,1] \otimes W_2^2[0,1] \otimes W_2^2[0,1]$. The three-way SS ANOVA decomposition can be derived similarly using the method in Chapter 4. For simplicity, we will ignore three-way interactions and start with the following SS ANOVA model with all two-way interactions:
$$\begin{aligned} f(x) = {}& \mu + \beta_1 (x_1 - 0.5) + \beta_2 (x_2 - 0.5) + \beta_3 (x_3 - 0.5) \\ & + \beta_4 (x_1 - 0.5)(x_2 - 0.5) + \beta_5 (x_1 - 0.5)(x_3 - 0.5) \\ & + \beta_6 (x_2 - 0.5)(x_3 - 0.5) + f_1^s(x_1) + f_2^s(x_2) + f_3^s(x_3) \\ & + f_{12}^{ls}(x_1, x_2) + f_{12}^{sl}(x_1, x_2) + f_{12}^{ss}(x_1, x_2) \\ & + f_{13}^{ls}(x_1, x_3) + f_{13}^{sl}(x_1, x_3) + f_{13}^{ss}(x_1, x_3) \\ & + f_{23}^{ls}(x_2, x_3) + f_{23}^{sl}(x_2, x_3) + f_{23}^{ss}(x_2, x_3). \end{aligned} \qquad (6.23)$$
The following statements fit model (6.23) with the smoothing parameters selected by the UBR method:
> data(wesdr); attach(wesdr)
> y <- prg
> x1 <- (dur-min(dur))/diff(range(dur))
> x2 <- (gly-min(gly))/diff(range(gly))
> x3 <- (bmi-min(bmi))/diff(range(bmi))
> wesdr.fit1 <- ssr(y~I(x1-.5)+I(x2-.5)+I(x3-.5)+
I((x1-.5)*(x2-.5))+I((x1-.5)*(x3-.5))+
I((x2-.5)*(x3-.5)),
rk=list(cubic(x1), cubic(x2), cubic(x3),
rk.prod(kron(x1-.5),cubic(x2)),
rk.prod(kron(x2-.5),cubic(x1)),
rk.prod(cubic(x1),cubic(x2)),
rk.prod(kron(x1-.5),cubic(x3)),
rk.prod(kron(x3-.5),cubic(x1)),
rk.prod(cubic(x1),cubic(x3)),
rk.prod(kron(x2-.5),cubic(x3)),
rk.prod(kron(x3-.5),cubic(x2)),
rk.prod(cubic(x2),cubic(x3))),
family="binary", spar="u", varht=1)
> summary(wesdr.fit1)
...
UBR estimate(s) of smoothing parameter(s) :
6.913248e-06 1.920409e+04 9.516636e-01 2.966542e+03
6.005694e+02 6.345814e+01 5.602521e+02 2.472658e+02
1.816387e-07 9.820496e-07 1.481754e+03 2.789458e-07
Equivalent Degrees of Freedom (DF): 12.16
Components corresponding to large values of the smoothing parameters are small. As in Section 4.9.2, we compute the Euclidean norm of the estimate of each component centered around zero:
> norm.cen <- function(x) sqrt(sum((x-mean(x))**2))
> comp.est1 <- predict(wesdr.fit1, pstd=F,
terms=diag(rep(1,19))[-1,])
> comp.norm1 <- apply(comp.est1$fit, 2, norm.cen)
> print(round(comp.norm1,2))
9.13 44.02 15.25 5.40 14.29 21.88 8.51 0.00 0.00
0.00 0.00 0.00 0.00 0.00 5.17 5.39 0.00 2.41
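The screening heuristic above simply measures the size of each fitted component after centering. For readers following along outside R, the centered norm computed by norm.cen can be sketched in Python (a direct translation; the function name is ours):

```python
import numpy as np

def norm_cen(x):
    # Euclidean norm of a fitted component after centering around zero,
    # mirroring the R helper norm.cen defined above
    x = np.asarray(x, dtype=float)
    return float(np.sqrt(np.sum((x - x.mean()) ** 2)))
```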
Both the estimates of the smoothing parameters and the norms indicate that the interaction between dur ($x_1$) and gly ($x_2$) can be dropped. Therefore, we fit the SS ANOVA model (6.23) with the components $f_{12}^{ls}$, $f_{12}^{sl}$, and $f_{12}^{ss}$ eliminated:
> wesdr.fit2 <- update(wesdr.fit1,
rk=list(cubic(x1), cubic(x2), cubic(x3),
rk.prod(kron(x1-.5),cubic(x3)),
rk.prod(kron(x3-.5),cubic(x1)),
rk.prod(cubic(x1),cubic(x3)),
rk.prod(kron(x2-.5),cubic(x3)),
rk.prod(kron(x3-.5),cubic(x2)),
rk.prod(cubic(x2),cubic(x3))))
> comp.est2 <- predict(wesdr.fit2,
terms=diag(rep(1,16))[-1,], pstd=F)
> comp.norm2 <- apply(comp.est2$fit, 2, norm.cen)
> print(round(comp.norm2,2))
9.13 44.02 15.25 5.40 14.29 21.88 8.51 0.00 0.00
0.00 0.00 5.17 5.39 0.00 2.41
Compared to other components, the norms of the nonparametric interactions $f_{13}^{ls}$, $f_{13}^{sl}$, $f_{13}^{ss}$, $f_{23}^{ls}$, $f_{23}^{sl}$, and $f_{23}^{ss}$ are relatively small. We further compute 95% Bayesian confidence intervals for the overall nonparametric interaction between $x_1$ and $x_3$, $f_{13}^{ls} + f_{13}^{sl} + f_{13}^{ss}$, the overall nonparametric interaction between $x_2$ and $x_3$, $f_{23}^{ls} + f_{23}^{sl} + f_{23}^{ss}$, and the proportion of design points for which zero is outside these confidence intervals:
> int.dur.bmi <- predict(wesdr.fit2,
terms=c(rep(0,10),rep(1,3),rep(0,3)))
> mean((int.dur.bmi$fit-1.96*int.dur.bmi$pstd>0)|
(int.dur.bmi$fit+1.96*int.dur.bmi$pstd<0))
0
> int.gly.bmi <- predict(wesdr.fit2,
terms=c(rep(0,13),rep(1,3)))
> mean((int.gly.bmi$fit-1.96*int.gly.bmi$pstd>0)|
(int.gly.bmi$fit+1.96*int.gly.bmi$pstd<0))
0
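The quantity computed above — the proportion of design points whose approximate interval excludes zero — is a simple summary worth isolating. A sketch of the same computation (Python; the function name is ours):

```python
import numpy as np

def prop_zero_outside(fit, pstd, z=1.96):
    # proportion of points whose approximate 95% interval
    # fit +/- z * pstd does not contain zero
    fit = np.asarray(fit, dtype=float)
    pstd = np.asarray(pstd, dtype=float)
    return float(np.mean((fit - z * pstd > 0) | (fit + z * pstd < 0)))
```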
Therefore, for both overall nonparametric interactions, the 95% confidence intervals contain zero at all design points. This suggests that the nonparametric interactions may be dropped. We fit the SS ANOVA model (6.23) with all nonparametric interactions eliminated:
> wesdr.fit3 <- update(wesdr.fit2,
rk=list(cubic(x1), cubic(x2), cubic(x3)))
> summary(wesdr.fit3)
UBR estimate(s) of smoothing parameter(s) :
4.902745e-06 5.474122e+00 1.108322e-05
Equivalent Degrees of Freedom (DF): 11.78733
> comp.est3 <- predict(wesdr.fit3, pstd=F,
terms=diag(rep(1,10))[-1,])
> comp.norm3 <- apply(comp.est3$fit, 2, norm.cen)
> print(round(comp.norm3,2))
7.22 33.71 12.32 4.02 8.99 11.95 10.15 0.00 6.79
Based on the estimates of the smoothing parameters and the norms, the nonparametric main effect of $x_2$, $f_2^s$, can be dropped. Therefore, we fit the final model:
> wesdr.fit4 <- update(wesdr.fit3,
rk=list(cubic(x1), cubic(x3)))
> summary(wesdr.fit4)
...
UBR estimate(s) of smoothing parameter(s) :
4.902693e-06 1.108310e-05
Equivalent Degrees of Freedom (DF): 11.78733
Estimate of sigma: 1
To look at the effect of dur, with gly and bmi fixed at the medians of their observed values, we compute estimates of the probabilities and posterior standard deviations on a grid of dur values. The estimated probability function and the approximate 95% Bayesian confidence intervals are shown in Figure 6.1(a). The risk of progression of retinopathy increases up to a duration of about 10 years and then decreases, possibly caused by censoring due to death in patients with longer durations. Similar plots for the effects of gly and bmi are shown in Figures 6.1(b) and
FIGURE 6.1 WESDR data, plots of (a) the estimated probability as a function of dur with gly and bmi fixed at the medians of their observed values, (b) the estimated probability as a function of gly with dur and bmi fixed at the medians of their observed values, and (c) the estimated probability as a function of bmi with dur and gly fixed at the medians of their observed values. Shaded regions are approximate 95% Bayesian confidence intervals. Rugs on the bottom and the top of each plot are observations of prg.
6.1(c). The risk of progression of retinopathy increases with increasing glycosylated hemoglobin, and the risk increases with increasing body mass index until a value of about 25 kg/m², after which the trend is uncertain due to wide confidence intervals. As expected, the confidence intervals are wider in areas where observations are sparse.
6.4 Smoothing Spline Estimation of Variance Functions
Consider the following heteroscedastic SS ANOVA model
$$y_i = \mathcal{L}_{1i} f_1 + \exp\{\mathcal{L}_{2i} f_2 / 2\}\, \epsilon_i, \qquad i = 1, \ldots, n, \qquad (6.24)$$
where $f_k$ is a function on $\mathcal{X}_k = \mathcal{X}_{k1} \times \mathcal{X}_{k2} \times \cdots \times \mathcal{X}_{kd_k}$ with model space $\mathcal{M}_k = \mathcal{H}_{k0} \oplus \mathcal{H}_{k1} \oplus \cdots \oplus \mathcal{H}_{kq_k}$ for $k = 1, 2$; $\mathcal{L}_{1i}$ and $\mathcal{L}_{2i}$ are bounded linear functionals; and $\epsilon_i \overset{iid}{\sim} N(0, 1)$. The goal is to estimate both the mean function $f_1$ and the variance function $f_2$. Note that both the mean and variance functions are modeled nonparametrically in this section. This is in contrast to the parametric model (5.28) for variance functions in Chapter 5.
One simple approach to estimating the functions f1 and f2 is to use the following two-step procedure:
1. Estimate the mean function f1 as if the random errors are homoscedastic.

2. Estimate the variance function f2 based on squared residuals from the first step.

3. Estimate the mean function again using the estimated variance function.
The PLS estimation method in Chapter 4 can be used in the first step. Denote the estimate at the first step as $\hat f_1$ and $r_i = y_i - \mathcal{L}_{1i} \hat f_1$ as residuals. Let $z_i = r_i^2$. Under suitable conditions, $z_i \approx \exp\{\mathcal{L}_{2i} f_2\}\chi^2_{i,1}$, where the $\chi^2_{i,1}$ are iid Chi-square random variables with one degree of freedom. Regarding the Chi-square distribution as a special case of the gamma distribution, the PL method described in this chapter can be used to estimate the variance function $f_2$ at the second step. Denote the estimate at the second step as $\hat f_2$. Then the PWLS method in Chapter 5 can be used in the third step with known covariance $W^{-1} = \mathrm{diag}(\exp\{\mathcal{L}_{21} \hat f_2\}, \ldots, \exp\{\mathcal{L}_{2n} \hat f_2\})$.
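The logic of the two-step procedure does not depend on splines. The sketch below is an illustration in Python/NumPy with several stand-ins that are our assumptions, not the chapter's method: polynomial least squares replaces the PLS/PL spline fits, the variance step is a crude log-residual regression, and the data are simulated with variance exp{f₂(x)}.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 400
x = np.linspace(0.0, 1.0, n)
f1_true = np.sin(2 * np.pi * x)
f2_true = 2.0 * x - 1.0                    # log variance
y = f1_true + np.exp(f2_true / 2) * rng.standard_normal(n)

# Step 1: estimate f1 ignoring heteroscedasticity
f1_hat = np.polyval(np.polyfit(x, y, 5), x)

# Step 2: regress log squared residuals on x; truncate tiny residuals
z = (y - f1_hat) ** 2
z[z < 1e-5] = 1e-5
slope, intercept = np.polyfit(x, np.log(z), 1)
f2_hat = slope * x + intercept

# Step 3: refit the mean with weights 1/std = exp(-f2_hat/2)
f1_wls = np.polyval(np.polyfit(x, y, 5, w=np.exp(-f2_hat / 2)), x)
```

Step 2's log-linear fit ignores the mean of log χ²₁ (about −1.27); that shifts the intercept but not the relative sizes of the weights, which is all step 3 needs. The chapter's gamma PL fit avoids this crudeness.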
We now use the motorcycle data to illustrate the foregoing two-step procedure. We fitted a cubic spline to the logarithm of squared residuals in Section 5.4.1. Based on model (3.53), consider the heteroscedastic partial spline model
$$y_i = f_1(x_i) + \exp\{f_2(x_i)/2\}\, \epsilon_i, \qquad i = 1, \ldots, n, \qquad (6.25)$$
where $f_1(x) = \sum_{j=1}^{3} \beta_j (x - t_j)_+ + g_1(x)$; $t_1 = 0.2128$, $t_2 = 0.3666$, and $t_3 = 0.5113$ are the change-points in the first derivative; and $\epsilon_i \overset{iid}{\sim} N(0, 1)$. We model both $g_1$ and $f_2$ using the cubic spline model space $W_2^2[0,1]$. The following statements implement the two-step procedure for model (6.25):
# step 1
> t1 <- .2128; t2 <- .3666; t3 <- .5113;
> s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)
> s3 <- (x-t3)*(x>t3)
> mcycle.ps.fit4 <- ssr(accel~x+s1+s2+s3, rk=cubic(x))
# step 2
> z1 <- residuals(mcycle.ps.fit4)**2
> mcycle.v.1 <- ssr(z1~x, cubic(x), limnla=c(-6,-1),
family="gamma", spar="u", varht=1)
# step 3
> mcycle.ps.fit5 <- update(mcycle.ps.fit4,
weights=mcycle.v.1$fit)
In the second step, the search range for log10(nλ) is set to [−6, −1] to avoid numerical problems. The actual estimate of the smoothing parameter is log10(nλ) = −6, which leads to a rough estimate of f2 (Figure 6.2(a)).
FIGURE 6.2 Motorcycle data, plots of the PL cubic spline estimates of f2 (lines) and 95% approximate Bayesian confidence intervals (shaded regions) based on (a) the two-step procedure with squared residuals, (b) the two-step procedure with squared differences, and (c) the backfitting procedure. Circles are (a) the logarithm of the squared residuals based on model (3.53), (b) the logarithm of the squared differences, and (c) the logarithm of the squared residuals based on the final fit to the mean function in the backfitting procedure.
The first step in the two-step procedure may be replaced by a difference-based method. Note that $x_1 \le x_2 \le \cdots \le x_n$ in the motorcycle data. Let $z_i = (y_{i+1} - y_i)^2/2$ for $i = 1, \ldots, n-1$. When $x_{i+1} - x_i$ is small, $y_{i+1} - y_i = f_1(x_{i+1}) - f_1(x_i) + \exp\{f_2(x_{i+1})/2\}\epsilon_{i+1} - \exp\{f_2(x_i)/2\}\epsilon_i \approx \exp\{f_2(\bar x_i)/2\}(\epsilon_{i+1} - \epsilon_i)$, where $\bar x_i = (x_{i+1} + x_i)/2$. Then $z_i \approx \exp\{f_2(\bar x_i)\}\chi^2_{i,1}$, where the $\chi^2_{i,1}$ are Chi-square random variables with one degree of freedom. Ignoring correlations between neighboring observations, the following statements implement this difference-based method:
> z2 <- diff(accel)**2/2; z2[z2<.00001] <- .00001
> n <- length(x); xx <- (x[1:(n-1)]+x[2:n])/2
> mcycle.v.g <- ssr(z2~xx, cubic(xx), limnla=c(-6,-1),
family="gamma", spar="u", varht=1)
> w <- predict(mcycle.v.g, pstd=F,
newdata=data.frame(xx=x))
> mcycle.ps.fit6 <- update(mcycle.ps.fit4,
weights=exp(w$fit))
As in Yuan and Wahba (2004), we set $z_i = \max\{0.00001, (y_{i+1} - y_i)^2/2\}$ to avoid overfitting. Again, the actual estimate of the smoothing parameter reaches the lower bound, log10(nλ) = −6. The estimate of f2 is rough (Figure 6.2(b)).
A more formal procedure is to estimate $f_1$ and $f_2$ in model (6.24) jointly as the minimizers of the following doubly penalized likelihood (DPL) (Yuan and Wahba 2004):
$$\frac{1}{n} \sum_{i=1}^{n} \left\{ (y_i - \mathcal{L}_{1i} f_1)^2 \exp(-\mathcal{L}_{2i} f_2) + \mathcal{L}_{2i} f_2 \right\} + \sum_{k=1}^{2} \sum_{j=1}^{q_k} \lambda_{kj} \|P_{kj} f_k\|^2, \qquad (6.26)$$
where the first part is the negative log-likelihood, $P_{kj}$ is the orthogonal projector in $\mathcal{M}_k$ onto $\mathcal{H}_{kj}$ for $k = 1, 2$, and $\lambda_{kj}$ are smoothing parameters. Following the same arguments as in Section 6.2.1, the minimizers of the DPL can be represented as
$$f_k(x) = \sum_{\nu=1}^{p_k} d_{k,\nu} \phi_{k,\nu}(x) + \sum_{i=1}^{n} c_{k,i} \sum_{j=1}^{q_k} \theta_{kj} \mathcal{L}_{k,i(z)} R_{kj}(x, z), \qquad x, z \in \mathcal{X}_k,\; k = 1, 2, \qquad (6.27)$$
where $\lambda_{kj} = \lambda_k/\theta_{kj}$, $\phi_{k,\nu}$ for $\nu = 1, \ldots, p_k$ are basis functions of $\mathcal{H}_{k0}$, and $R_{kj}$ for $j = 1, \ldots, q_k$ are RKs of $\mathcal{H}_{kj}$. It is difficult to solve for the coefficients $d_{k,\nu}$ and $c_{k,\nu}$ directly. However, it is easy to implement a backfitting procedure by iterating the following two steps until convergence: (a) conditional on the current estimates of $d_{2,\nu}$ and $c_{2,\nu}$, update $d_{1,\nu}$ and $c_{1,\nu}$; (b) conditional on the current estimates of $d_{1,\nu}$ and $c_{1,\nu}$, update $d_{2,\nu}$ and $c_{2,\nu}$. Note that, when $d_{2,\nu}$ and $c_{2,\nu}$ are fixed, $f_2$ in (6.26) is fixed, and the DPL reduces to the PWLS (5.3) with known weights. When $d_{1,\nu}$ and $c_{1,\nu}$ are fixed, $f_1$ in (6.27) is fixed and the DPL reduces to the PL (6.4) for observations $z_i = \exp\{\mathcal{L}_{2i} f_2\}\chi^2_{i,1}$. Therefore, steps (a) and (b) correspond to steps 2 and 3 in the two-step procedure, and the backfitting procedure extends the two-step procedure by iterating steps 2 and 3 until convergence. The above backfitting procedure is essentially the same as the iterative procedure in Yuan and Wahba (2004), where
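For fixed $f_1$ and a constant log variance, the first term of the DPL (6.26) is minimized at the log mean squared residual. The check below (Python/NumPy; the simulated data and the grid search are assumptions for illustration) confirms this, which is step (b) of the backfitting loop in its simplest possible form:

```python
import numpy as np

rng = np.random.default_rng(3)
y = 2.0 * rng.standard_normal(50)
f1 = np.zeros_like(y)                  # fixed mean fit, for illustration

def dpl_lik(c):
    # first part of (6.26) with constant f2 = c (penalty terms omitted)
    return np.mean((y - f1) ** 2 * np.exp(-c) + c)

grid = np.linspace(-3.0, 3.0, 2001)
c_best = grid[int(np.argmin([dpl_lik(c) for c in grid]))]
c_theory = np.log(np.mean((y - f1) ** 2))   # closed-form minimizer
```

Setting the derivative of $\exp(-c)\,\overline{r^2} \cdot n + nc$ to zero gives $\exp(c) = \overline{r^2}$, i.e. the constant fit of $f_2$ is the log mean squared residual.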
different methods were used to select the smoothing parameters. The following is a simple R function that implements the backfitting procedure for the motorcycle data:
> jemv <- function(x, y, prec=1e-6, maxit=30) {
t1 <- .2128; t2 <- .3666; t3 <- .5113
s1 <- (x-t1)*(x>t1); s2 <- (x-t2)*(x>t2)
s3 <- (x-t3)*(x>t3)
err <- 1e20; eta <- rep(1,length(x))
while (err>prec&maxit>0) {
fitf <- ssr(y~x+s1+s2+s3, cubic(x), weights=exp(eta))
z <- fitf$resi**2; z[z<.00001] <- .00001
fitv <- ssr(z~x, cubic(x), limnla=c(-5,-1),
family="gamma", spar="u", varht=1)
oldeta <- eta
eta <- fitv$rkpk$eta
err <- sqrt(mean(((eta-oldeta)/(1+abs(oldeta)))**2))
maxit <- maxit-1
}
return(list(fitf=fitf,fitv=fitv))
}
> mcycle.mv <- jemv(x, accel)
For estimation of the variance function, a new search range for log10(nλ) has to be set, [−5, −1], to avoid numerical problems. The backfitting algorithm converged in 13 iterations. The estimate of f2 is shown in Figure 6.2(c).
The smoothing parameters in the above procedures are selected by the iterative UBR method. For Chi-square distributed response variables with small degrees of freedom, nonconvergence of the iterative approach may become a serious problem (Liu et al. 2007). Note that the Chi-square random variables in the above procedures have one degree of freedom. Direct methods such as the DGML criterion guarantee convergence. We now show a simple implementation of the DGML method for gamma regression. Instead of the Newton–Raphson method, we use the Fisher scoring method, which leads to $u_i = 1 - y_i/\mu_i$ and $w_i = 1$. Consequently, $W_c = I$, and there is no need for the transformation. Furthermore, the term $\log|R|$ will be dropped since it is independent of the smoothing parameters. The following R function computes the DGML in (6.20) for gamma regression:
DGML <- function(th, nlaht, y, S, Q) {
if (length(th)==1) { nlaht <- th; Qt <- Q }
if (length(th)>1) {
theta <- 10^th; Qt <- 0
for (i in 1:dim(Q)[3]) Qt <- Qt + theta[i]*Q[,,i]
}
fit <- try(ssr(y~S-1, Qt, limnla=nlaht, family="gamma",
spar="u", varht=1))
if (class(fit)=="ssr") {
fht <- fit$rkpk$eta
tmp <- y*exp(-fht)
uc <- 1-tmp
yt <- fht-uc
qrq <- qr.Q(qr(S), complete=T)
q2 <- qrq[ ,(ncol(S)+1):nrow(S)]
V <- t(q2)%*%Qt%*%q2
l <- eigen((V + t(V))/2)
U <- l$vectors
e <- l$values
z <- t(U)%*%t(q2)%*%yt
delta <- 10^{-nlaht}
GML <- sum(tmp+fht)-sum(uc^2)/2+sum(log(delta*e+1))/2+
sum(z^2/(delta*e+1))/2
return(GML)
}
else return(1e10)
}
where fht, uc, yt, V, U, e, z, and delta correspond to $\hat f$, $u_c$, $\tilde y_c$, $Q_2^T \tilde\Sigma_\theta Q_2$, $U$, $(e_1, \ldots, e_{n-p})^T$, $z$, and $\delta$, respectively, in the definition of the DGML in (6.20). Note that $\sigma^2 = 1$. The R functions qr and eigen are used to compute the QR decomposition and the eigendecomposition. For an SS ANOVA model with $q = 1$, the input th corresponds to $\log_{10}(n\lambda)$, nlaht is not used, y are the observations, S corresponds to the matrix $T$, and Q corresponds to the matrix $\Sigma$. For an SS ANOVA model with $q > 1$, the input th corresponds to $\log_{10}\boldsymbol\theta$, nlaht corresponds to $\log_{10}(n\lambda)$, y are the observations, S corresponds to the matrix $T$, and Q is an $(n, n, q)$ array corresponding to the matrices $\Sigma_k$ for $k = 1, \ldots, q$.
We now apply the above DGML method to the second step in the two-step procedure. We compute the DGML criterion on a grid of log10(nλ) values, find the DGML estimate of the smoothing parameter as the minimizer of the DGML criterion, and fit again with the smoothing parameter fixed at the DGML estimate:
> S <- cbind(1,x); Q <- cubic(x)
> lgrid <- seq(-6,-2,by=.1)
> gml <- sapply(lgrid, DGML, nlaht=0, y=z1, S=S, Q=Q)
> nlaht <- lgrid[order(gml)[1]]
> mcycle.v.g4 <- ssr(z1~x, cubic(x), limnla=nlaht,
family="gamma", spar="u", varht=1)
FIGURE 6.3 Motorcycle data, plots of (a) the DGML function, where the minimum point is marked with a short bar at the bottom; (b) the logarithm of squared residuals (circles) based on model (3.53), the PL cubic spline estimate of f2 (line) to squared residuals with the DGML estimate of the smoothing parameter, and 95% approximate Bayesian confidence intervals (shaded region); and (c) observations (circles) and the PWLS partial spline estimate of f1 (line) with 95% Bayesian confidence intervals (shaded region).
The DGML function is shown in Figure 6.3(a). It reaches the minimum at log10(nλ) = −4.2. The PL cubic spline estimate of f2 with the DGML choice of the smoothing parameter is shown in Figure 6.3(b). The PWLS partial spline estimate of f1 and 95% Bayesian confidence intervals are shown in Figure 6.3(c). The effect of unequal variances is reflected in the widths of the confidence intervals.
6.5 Smoothing Spline Spectral Analysis
6.5.1 Spectrum Estimation of a Stationary Process
The spectrum is often used to describe the power distribution of a time series. Consider a zero-mean stationary time series $X_t$, $t = 0, \pm 1, \pm 2, \ldots$.
The spectrum is defined as
$$S(\omega) = \sum_{u=-\infty}^{\infty} \gamma(u) \exp(-i 2\pi \omega u), \qquad \omega \in [0, 1], \qquad (6.28)$$
where $\gamma(u) = E(X_t X_{t+u})$ is the covariance function and $i = \sqrt{-1}$. Let $X_0, X_1, \ldots, X_{T-1}$ be a finite sample of the stationary process and define the periodogram at frequency $\omega_k = k/T$ as
$$y_k = T^{-1} \Big| \sum_{t=0}^{T-1} X_t \exp(i 2\pi \omega_k t) \Big|^2, \qquad k = 0, \ldots, T-1. \qquad (6.29)$$
The periodogram is an asymptotically unbiased but inconsistent estimator of the underlying true spectrum. Many different smoothing techniques have been proposed to overcome this problem. We now show how to estimate the spectrum using a smoothing spline based on observations $\{(\omega_k, y_k),\; k = 0, \ldots, T-1\}$.
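In practice the periodogram (6.29) is computed with the FFT. A sketch (Python/NumPy; the tone frequency and series length are arbitrary choices for illustration) — note that np.fft.fft uses the kernel exp(−i2πkt/T), the conjugate of the one in (6.29), but the squared modulus is identical:

```python
import numpy as np

def periodogram(x):
    # y_k = T^{-1} | sum_t X_t exp(-i 2 pi k t / T) |^2, k = 0, ..., T-1
    x = np.asarray(x, dtype=float)
    return np.abs(np.fft.fft(x)) ** 2 / len(x)

T = 256
t = np.arange(T)
x = np.cos(2 * np.pi * 16 * t / T)     # pure tone at frequency index 16
y = periodogram(x)
```

For a pure tone at an exact Fourier frequency, all power lands in bins 16 and T−16, and by Parseval's identity sum(y) equals sum(x**2).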
Under standard mixing conditions, the periodograms are asymptotically independent and distributed as
$$y_k \sim \begin{cases} S(\omega_k)\chi^2_1, & \omega_k = 0, 1/2, \\ S(\omega_k)\chi^2_2/2, & \omega_k \ne 0, 1/2. \end{cases}$$
Regarding the Chi-square distribution as a special case of the gamma distribution, the method described in this chapter can be used to estimate the spectrum. Consider the logarithmic link function and let $f = \log(S)$ be the log spectrum. We model the function $f$ using the periodic spline space $W_2^m(\mathrm{per})$. Note that $f(\omega)$ is symmetric about $\omega = 0.5$. Therefore, it suffices to estimate $f(\omega)$ for $\omega \in [0, 0.5]$. Nevertheless, to estimate $f$ as a periodic function, we will use periodograms at all frequencies in Section 6.5.3.
6.5.2 Time-Varying Spectrum Estimation of a Locally Stationary Process
Many time series are nonstationary. Locally stationary processes have been proposed to approximate nonstationary time series. The time-varying spectrum of a locally stationary time series characterizes changes of stochastic variation over time.
A zero-mean stochastic process $\{X_t, t = 0, \ldots, T-1\}$ is a locally stationary process if
$$X_t = \int_0^1 A(\omega, t/T) \exp(i 2\pi \omega t)\, dZ(\omega), \qquad (6.30)$$
where $Z(\omega)$ is a zero-mean stochastic process on $[0, 1]$, and $A(\omega, u)$ denotes a transfer function with continuous second-order derivatives for $(\omega, u) \in [0, 1] \times [0, 1]$. Define the time-varying spectrum as
$$S(\omega, u) = \|A(\omega, u)\|^2. \qquad (6.31)$$
Consider the logarithmic link function and let $f(\omega, u) = \log S(\omega, u)$. Since it is a periodic function of $\omega$, we model the log spectrum $f$ using the tensor product space $W_2^2(\mathrm{per}) \otimes W_2^2[0, 1]$, where the SS ANOVA decomposition was derived in Section 4.4.5. The SS ANOVA model using the notation of this section can be represented as
$$f(\omega, u) = \mu + \beta \times (u - 0.5) + f_2^s(u) + f_1(\omega) + f_{12}^{sl}(\omega, u) + f_{12}^{ss}(\omega, u), \qquad (6.32)$$
where $\beta \times (u - 0.5)$ and $f_2^s(u)$ are the linear and smooth main effects of time, $f_1(\omega)$ is the smooth main effect of frequency, and $f_{12}^{sl}(\omega, u)$ and $f_{12}^{ss}(\omega, u)$ are the linear–smooth and smooth–smooth interactions between frequency and time.
To estimate the bivariate function $f$, we compute local periodograms on time–frequency grids. Specifically, divide the time domain into $J$ disjoint blocks $[b_j, b_{j+1})$, where $0 = b_1 < b_2 < \cdots < b_J < b_{J+1} = 1$. Let $u_j = (b_j + b_{j+1})/2$ be the middle points of these $J$ blocks. Let $\omega_k$ for $k = 1, \ldots, K$ be $K$ frequencies in $[0, 1]$. Define the local periodograms as
$$y_{(k-1)J+j} = \frac{\big| \sum_{t=b_j}^{b_{j+1}-1} X_t \exp(i 2\pi \omega_k t) \big|^2}{|b_{j+1} - b_j|}, \qquad k = 1, \ldots, K;\; j = 1, \ldots, J. \qquad (6.33)$$
Again, under regularity conditions, the local periodograms are asymptotically independent and distributed as (Guo, Dai, Ombao and von Sachs 2003)
$$y_{(k-1)J+j} \sim \begin{cases} \exp\{f(\omega_k, u_j)\}\chi^2_1, & \omega_k = 0, 1/2, \\ \exp\{f(\omega_k, u_j)\}\chi^2_2/2, & \omega_k \ne 0, 1/2. \end{cases}$$
Let $x = (\omega, u)$ and $x_{(k-1)J+j} = (\omega_k, u_j)$ for $k = 1, \ldots, K$ and $j = 1, \ldots, J$. Then the time-varying spectrum can be estimated based on observations $\{(x_i, y_i),\; i = 1, \ldots, KJ\}$.
The SS ANOVA decomposition (6.32) may also be used to test whether a locally stationary process is stationary. The locally stationary process $X_t$ is stationary if $f(\omega, u)$ is independent of $u$. Let $h(u) = \beta \times (u - 0.5) + f_2^s(u) + f_{12}^{sl}(\omega, u) + f_{12}^{ss}(\omega, u)$, which collects all terms involving time in (6.32). Then the hypotheses are
$$H_0: h(u) = 0 \text{ for all } u, \qquad H_1: h(u) \ne 0 \text{ for some } u.$$
The full SS ANOVA model (6.32) reduces to $f(\omega, u) = \mu + f_1(\omega)$ under $H_0$. Denote $\hat f^F$ and $\hat f^R$ as the estimates of $f$ under the full and reduced models, respectively. Let
$$D_F = \sum_{i=1}^{n} \{\hat f^F_i + y_i \exp(-\hat f^F_i) - \log y_i - 1\},$$
$$D_R = \sum_{i=1}^{n} \{\hat f^R_i + y_i \exp(-\hat f^R_i) - \log y_i - 1\}$$
be the deviances under the full and reduced models. We construct two test statistics
$$T_1 = D_R - D_F,$$
$$T_2 = \int_0^1 \int_0^1 \{\hat f^F(\omega, u) - \hat f^R(\omega, u)\}^2\, d\omega\, du,$$
where $T_1$ corresponds to the Chi-square statistics commonly used for generalized linear models, and $T_2$ computes the $L_2$ distance between $\hat f^F$ and $\hat f^R$. We reject $H_0$ when these statistics are large. It is difficult to derive the null distributions of these statistics. Note that $f$ does not depend on $u$ under $H_0$. Therefore, the null distribution can be approximated by permutation. Specifically, permutation samples are generated by shuffling the time grid, and the statistics $T_1$ and $T_2$ are computed for each permutation sample. Then the p-values are approximated as the proportion of statistics based on permutation samples that are larger than those based on the original data.
6.5.3 Epileptic EEG
We now illustrate how to estimate the spectrum of a stationary process and the time-varying spectrum of a locally stationary process using the seizure data. The data contain two 5-minute intracranial electroencephalogram (IEEG) segments from a patient: one at baseline, extracted at least 4 hours before the seizure's onset (labeled as baseline), and one right before a seizure's clinical onset (labeled as preseizure). Observations are shown in Figure 6.4.
First assume that both the baseline and preseizure series are stationary. Then the following statements compute the periodograms and fit a periodic spline model for the baseline series:
> data(seizure); attach(seizure)
> n <- nrow(seizure)
> x <- seq(1,n-1, by=120)/n
FIGURE 6.4 Seizure data, plots of the 5-minute baseline segment collected hours away from the seizure (above), and the 5-minute preseizure segment with the seizure's onset at the 5th minute (below). The sampling rate is 200 Hertz. The total number of time points is 60,000 for each segment.
> y <- (abs(fft(seizure$base))^2/n)[-1][seq(1,n-1, by=120)]
> seizure.s.base <- ssr(y~x, periodic(x), spar="u",
varht=1, family="gamma", limnla=c(-5,1))
> grid <- data.frame(x=seq(0,.5,len=100))
> seizure.s.base.p <- predict(seizure.s.base,grid)
where the function fft computes an unscaled discrete Fourier transform. A subset of the periodograms is used. The UBR criterion is used to select the smoothing parameter with the dispersion parameter set to 1. We limit the search range for log10(nλ) to avoid undersmoothing. Log periodograms, the periodic spline estimate of the log spectrum, and 95% Bayesian confidence intervals of the baseline series are shown in the left
panel of Figure 6.5. The log spectrum of the preseizure series is estimated similarly. Log periodograms, the periodic spline estimate of the log spectrum, and 95% Bayesian confidence intervals of the preseizure series are shown in the right panel of Figure 6.5. Because the sampling rate is 200 Hz and the spectrum is symmetric around 100 Hz, we only show the estimated spectra in the frequency band 0–100 Hz.
FIGURE 6.5 Seizure data, plots of log periodograms (dots), estimates of the log spectra (lines) based on the iterative UBR method, and 95% Bayesian confidence intervals (shaded regions) of the baseline series (left) and preseizure series (right).
The IEEG series may become nonstationary before seizure events. Assume that both the baseline and preseizure series are locally stationary. We create time–frequency grids with $\omega_k = k/(K+1)$ and $u_j = (j - 0.5)/J + 1/(2n)$ for $k = 1, \ldots, K$ and $j = 1, \ldots, J$, where $n = 60{,}000$ is the length of a time series and $K = J = 32$. We compute local periodograms for the baseline series:
> pgram <- function(x, freqs) {
y <- numeric(length(freqs))
tser <- seq(length(x))-1
for(i in seq(length(freqs))) y[i] <-
Mod(sum(x*complex(mod=1, arg=-2*pi*freqs[i]*tser)))^2
y/length(x)
}
> lpgram <- function(x,times,freqs) {
nsub <- floor(length(x)/length(times))
ymat <- matrix(0, length(freqs), length(times))
for (j in seq(length(times)))
ymat[,j] <- pgram(x[((j-1)*nsub+1):(j*nsub)], freqs)
as.vector(ymat)
}
> nf <- nt <- 32; freqs <- 1:nf/(nf+1)
> nsub <- floor(n/nt)
> times <- (seq(from=(1+nsub)/2, by=nsub, length=nt))/n
> y <- lpgram(seizure$base, times, freqs)
where the functions pgram and lpgram compute, respectively, periodograms for a stationary process and local periodograms for a locally stationary process. We now fit the SS ANOVA model (6.32) for the baseline series:
> ftgrid <- expand.grid(freqs,times)
> x1 <- ftgrid[,1]
> x2 <- ftgrid[,2]
> seizure.ls.base <- ssr(y~x2,
rk=list(periodic(x1), cubic(x2),
rk.prod(periodic(x1),kron(x2-.5)),
rk.prod(periodic(x1),cubic(x2))),
spar="u", varht=1, family="gamma")
> grid <- expand.grid(x1=seq(0,.5,len=50),
x2=seq(0,1,len=50))
> seizure.ls.base.p <- predict(seizure.ls.base,
newdata=grid)
The SS ANOVA model (6.32) for the preseizure series is fitted similarly. Figure 6.6 shows the estimates of the time-varying spectra of the baseline and preseizure series. The iterative UBR method was used in the above fits. The DGML method usually leads to a better estimate of the spectrum or the time-varying spectrum (Qin and Wang 2008). The following function spdest estimates a spectrum or a time-varying spectrum using the DGML method.
spdest <- function(y, freq, time, process, control=list(
    optfactr=1e10, limnla=c(-6,1), prec=1e-6, maxit=30))
{
  if (process=="S") {
    thhat <- rep(NA, 4)
    S <- as.matrix(rep(1,length(y)))
    Q <- periodic(freq)
    tmp <- try(optim(-2, DGML, y=y, S=S, Q=Q,
FIGURE 6.6 Seizure data, plots of estimates of the time-varying spectra of the baseline series (left) and preseizure series (right) based on the iterative UBR method.
               method="L-BFGS-B",
               lower=control$limnla[1], upper=control$limnla[2],
               control=list(factr=control$optfactr)))
    if (class(tmp)=="try-error") {
      info <- "optim failure"
    }
    else {
      if (tmp$convergence==0) {
        nlaht <- tmp$par
        fit <- ssr(y~S-1, Q, limnla=tmp$par, family="gamma",
                   spar="u", varht=1)
        info <- "success"
      }
      else info <- paste("optim failure",
                         as.character(tmp$convergence))
    }
  }
  if (process=="LS") {
    S <- cbind(1, time-.5)
    Q <- array(NA, c(length(freq),length(freq),4))
    Q[,,1] <- periodic(freq)
    Q[,,2] <- cubic(time)
    Q[,,3] <- rk.prod(periodic(freq), kron(time-.5))
    Q[,,4] <- rk.prod(periodic(freq), cubic(time))
    # compute initial values for optimization of theta
    zz <- log(y)+.57721
    tmp <- try(ssr(zz~S-1, Q, spar="m"), T)
    if (class(tmp)=="ssr") {
      thini <- tmp$rkpk.obj$theta
      nlaht <- tmp$rkpk.obj$nlaht
    }
    else {
      thini <- rep(1,4)
      nlaht <- 1.e-8
    }
    tmp <- try(optim(thini, DGML, nlaht=nlaht,
               y=y, S=S, Q=Q, method="L-BFGS-B",
               control=list(factr=control$optfactr)))
    if (class(tmp)=="try-error") {info <- "optim failure"}
    else {
      if (tmp$convergence==0) {
        thhat <- tmp$par
        thetahat <- 10**thhat
        Qt <- 0
        for (i in 1:dim(Q)[3]) Qt <- Qt + thetahat[i]*Q[,,i]
        fit <- ssr(y~S-1, Qt, limnla=nlaht, family="gamma",
                   spar="u", varht=1)
        info <- "success"
      }
      else { info <- paste("optim failure",
                           as.character(tmp$convergence)) }
    }
  }
  return(list(fit=fit, nlaht=nlaht, thhat=thhat, info=info))
}
The argument process specifies the type of process, with "S" and "LS" corresponding to stationary and locally stationary processes, respectively. For a stationary process, the argument y inputs periodograms at the frequencies specified by freq; the argument time is irrelevant in this case. For a locally stationary process, the argument y inputs local periodograms on the time–frequency grid specified by time and freq. For locally stationary processes, we fit the log-transformed periodograms to get initial values for the smoothing parameters (Wahba 1980, Qin and Wang 2008). The DGML function was presented in Section 6.4.

The following statements estimate the spectrum of the baseline series as a stationary process using the DGML method and compute posterior means and standard deviations:
> x <- seq(1,n-1,by=120)/n
> y <- pgram(seizure$base, x)
> tmp <- spdest(y, x, 0, "S")
> seizure.s.base.dgml <- ssr(y~x, periodic(x), spar="u",
    varht=1, family="gamma", limnla=tmp$nlaht)
> grid <- data.frame(x=seq(0,.5,len=100))
> seizure.s.base.dgml.p <- predict(seizure.s.base.dgml,
grid)
Note that, to use the predict function, we need to call ssr again using the DGML estimate of the smoothing parameter. The spectrum of the preseizure series as a stationary process can be estimated similarly. The estimates of the spectra are similar to those in Figure 6.5.
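The periodogram computations performed by pgram and lpgram above are language-independent. As a cross-check, here is a minimal NumPy sketch of the same computations (an illustration only, not the book's code; the function names simply mirror the R versions):

```python
import numpy as np

def pgram(x, freqs):
    # periodogram |sum_t x_t exp(-2*pi*i*f*t)|^2 / n at the given
    # frequencies (in cycles per sample), mirroring the R pgram
    t = np.arange(len(x))
    return np.array([abs(np.sum(x * np.exp(-2j * np.pi * f * t))) ** 2
                     for f in freqs]) / len(x)

def lpgram(x, ntime, freqs):
    # local periodograms: split x into ntime consecutive blocks and
    # compute a periodogram within each block (one column per block)
    nsub = len(x) // ntime
    return np.column_stack([pgram(x[j * nsub:(j + 1) * nsub], freqs)
                            for j in range(ntime)])
```

A pure sinusoid at a Fourier frequency k/n concentrates all of its power (n/4 for unit amplitude) in the kth frequency bin, which makes the sketch easy to verify.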
Next we estimate the time-varying spectrum of the baseline series as a locally stationary process using the DGML method and compute posterior means:
> y <- lpgram(seizure$base, times, freqs)
> tmp <- spdest(y, x1, x2, "LS")
> th <- 10**tmp$thhat
> S <- cbind(1, x2-.5)
> Q1 <- periodic(x1)
> Q2 <- cubic(x2)
> Q3 <- rk.prod(periodic(x1), kron(x2-.5))
> Q4 <- rk.prod(periodic(x1), cubic(x2))
> Qt <- th[1]*Q1+th[2]*Q2+th[3]*Q3+th[4]*Q4
> seizure.ls.base.dgml <- ssr(y~x2, rk=Qt, spar="u",
    varht=1, family="gamma", limnla=tmp$nlaht)
> grid <- expand.grid(x1=seq(0,.5,len=50),
x2=seq(0,1,len=50))
> Sg <- cbind(1,grid$x2-.5)
> Qg1 <- periodic(grid$x1,x1)
> Qg2 <- cubic(grid$x2,x2)
> Qg3 <- rk.prod(periodic(grid$x1,x1),
kron(grid$x2-.5,x2-.5))
> Qg4 <- rk.prod(periodic(grid$x1,x1), cubic(grid$x2,x2))
> Qgt <- th[1]*Qg1+th[2]*Qg2+th[3]*Qg3+th[4]*Qg4
> seizure.ls.base.dgml.p <-
Sg%*%seizure.ls.base.dgml$coef$d+
Qgt%*%seizure.ls.base.dgml$coef$c
where the spdest function was used to find the DGML estimates of the smoothing parameters, and the ssr function was used to calculate the coefficients c and d. The time-varying spectrum of the preseizure series as a locally stationary process can be estimated similarly. The estimates of the time-varying spectra are shown in Figure 6.7. The estimate for the preseizure series (right panel of Figure 6.7) is smoother than that based on the iterative UBR method (right panel of Figure 6.6).
FIGURE 6.7 Seizure data, plots of estimates of the time-varying spectra based on the DGML method of the baseline series (left) and preseizure series (right).
It appears that the baseline spectrum does not vary much over time, while the preseizure spectrum varies over time. Therefore, the baseline series may be stationary, while the preseizure series may be nonstationary. The following statements perform the permutation test for the baseline series with 100 permutations.
z <- seizure$base
n <- length(z); nt <- 32; nf <- 32
freqs <- 1:nf/(nf+1); nsub <- floor(n/nt)
times <- (seq(from=(1+nsub)/2, by=nsub, length=nt))/n
y <- lpgram(z,times,freqs)
ftgrid <- expand.grid(freqs,times)
x1 <- ftgrid[,1]
x2 <- ftgrid[,2]
full <- spdest(y, x1, x2, "LS")
reduced <- spdest(y, x1, x2, "S")
fitf <- full$fit$rkpk$eta
fitr <- reduced$fit$rkpk$eta
devf <- sum(-1-log(y)+y*exp(-fitf)+fitf)
devr <- sum(-1-log(y)+y*exp(-fitr)+fitr)
cstat <- devr-devf
l2d <- mean((fitf-fitr)**2)
nperm <- 100; totperm <- 0
cstatp <- l2dp <- NULL
while (totperm < nperm) {
x2p <- rep(sample(times),rep(nf,nt))
full <- spdest(y, x1, x2p, "LS")
if (full$info=="success") {
totperm <- totperm+1
fitf <- full$fit$rkpk$eta
devf <- sum(-1-log(y)+y*exp(-fitf)+fitf)
cstatp <- c(cstatp, devr-devf)
l2dp <- c(l2dp, mean((fitf-fitr)**2))
}
}
print(c(mean(cstatp>cstat),mean(l2dp>l2d)))
Note that there is no need to refit the reduced model for the permuted data since, under the null hypothesis, the log spectrum does not depend on time. The permutation test for the preseizure series is performed similarly. The p-values for testing stationarity based on the two statistics are 0.82 and 0.73 for the baseline series, and 0.03 and 0.06 for the preseizure series. Therefore, the processes far away from the seizure's clinical onset can be regarded as stationary, while the processes close to the seizure's clinical onset are nonstationary.
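The logic of the test above (shuffle the time labels, recompute the test statistic on each shuffled data set, and report the proportion of permuted statistics exceeding the observed one) is generic. A minimal Python sketch of that logic, with a hypothetical stand-in statistic in place of the spdest-based deviance:

```python
import numpy as np

def perm_pvalue(data, statistic, nperm=100, seed=0):
    # permutation p-value: proportion of label-shuffled replicates whose
    # statistic exceeds the one observed on the original data
    rng = np.random.default_rng(seed)
    observed = statistic(data)
    exceed = sum(statistic(rng.permutation(data)) > observed
                 for _ in range(nperm))
    return exceed / nperm

# hypothetical stand-in statistic: absolute difference between the
# means of the first and second halves of the series
def half_diff(x):
    n = len(x) // 2
    return abs(np.mean(x[:n]) - np.mean(x[n:]))
```

For a series whose level shifts halfway through, the observed statistic is extreme relative to the permutation distribution and the p-value is near zero.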
Chapter 7
Smoothing Spline Nonlinear Regression
7.1 Motivation
The general smoothing spline regression model (2.10) in Chapter 2 and the SS ANOVA model (4.31) in Chapter 4 assume that the unknown function is observed through some linear functionals. This chapter deals with situations in which some unknown functions are observed indirectly through nonlinear functionals. We discuss some potential applications in this section; more examples can be found in Section 7.6.
In some applications, the theoretical models depend on the unknown functions nonlinearly. For example, in remote sensing, the satellite upwelling radiance measurements Rv are related to the underlying atmospheric temperature distribution f through the equation

Rv(f) = Bv(f(xs))τv(xs) − ∫_{x0}^{xs} Bv(f(x))τ′v(x)dx,

where x is some monotone transformation of pressure p, for example, the kappa units x(p) = p^{5/8}; x0 and xs are the x values at the surface and the top of the atmosphere; τv(x) is the transmittance of the atmosphere above x at wavenumber v; and Bv(t) is Planck's function, Bv(t) = c1v³/{exp(c2v/t) − 1}, with known constants c1 and c2. The goal is to estimate f as a function of x based on noisy observations of Rv(f). Obviously, Rv(f) is nonlinear in f. Other examples involving reservoir modeling and three-dimensional atmospheric temperature distributions from satellite-observed radiances can be found in Wahba (1987) and O'Sullivan (1986).
Very often there are certain constraints, such as positivity and monotonicity, on the function of interest, and sometimes nonlinear transformations may be used to relax those constraints. For example, consider the following nonparametric regression model

yi = g(xi) + ǫi,  xi ∈ [0, 1],  i = 1, . . . , n. (7.1)
Suppose g is known to be positive. The transformation g = exp(f) replaces the original constrained estimation of g by the unconstrained estimation of f. The resulting transformed model,

yi = exp{f(xi)} + ǫi,  xi ∈ [0, 1],  i = 1, . . . , n, (7.2)
depends on f nonlinearly. Monotonicity is another common constraint. Consider model (7.1) again and suppose g is known to be strictly increasing with g′(x) > 0. Write g′ as g′(x) = exp{f(x)}. Reexpressing g as g(x) = g(0) + ∫_0^x exp{f(s)}ds leads to the following model

yi = β + ∫_0^{xi} exp{f(s)}ds + ǫi,  xi ∈ [0, 1],  i = 1, . . . , n. (7.3)
The function f is free of constraints and enters the model nonlinearly. Strictly speaking, model (7.3) is a semiparametric nonlinear regression model of the kind considered in Section 8.3 since it contains a parameter β.
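A quick numerical illustration of why the reparameterization works: g(x) = β + ∫₀ˣ exp{f(s)}ds is increasing no matter what values the unconstrained f takes, because the integrand is positive. A small sketch (in Python, with the integral approximated by the cumulative trapezoidal rule; an illustration, not the book's estimator):

```python
import numpy as np

def monotone_g(beta, f, xgrid):
    # g(x) = beta + int_0^x exp{f(s)} ds on a grid starting at 0,
    # via the cumulative trapezoidal rule; increasing for any f
    vals = np.exp(f(xgrid))
    dx = np.diff(xgrid)
    cum = np.concatenate([[0.0], np.cumsum(dx * (vals[1:] + vals[:-1]) / 2)])
    return beta + cum
```

Even a wildly oscillating f yields a strictly increasing g, while f = 0 recovers the straight line g(x) = β + x.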
Sometimes one may want to consider empirical models that depend on unknown functions nonlinearly. One such model, the multiplicative model, will be introduced in Section 7.6.4.
7.2 Nonparametric Nonlinear Regression Models
A general nonparametric nonlinear regression (NNR) model assumes that
yi = Ni(f ) + ǫi, i = 1, . . . , n, (7.4)
where f = (f1, . . . , fr) are r unknown functions, fk belongs to an RKHS Hk on an arbitrary domain Xk for k = 1, . . . , r, Ni are known nonlinear functionals on H1 × · · · × Hr, and ǫi are zero-mean independent random errors with a common variance σ². Note that the domains Xk for different functions fk may be the same or different. It is obvious that the NNR model (7.4) is an extension of both the SSR model (2.10) and the SS ANOVA model (4.31). It may also be considered an extension of the parametric nonlinear regression model with functions in infinite-dimensional spaces as parameters.
An interesting special case of the NNR model (7.4) arises when Ni depends on f through a nonlinear function and some linear functionals. Specifically,
yi = ψ(L1if1, . . . ,Lrifr) + ǫi, i = 1, . . . , n, (7.5)
Smoothing Spline Nonlinear Regression 197
where ψ is a known nonlinear function, and L1i, . . . , Lri are bounded linear functionals. Model (7.2) for the positivity constraint is a special case of (7.5) with r = 1, ψ(z) = exp(z), and Lif = f(xi). However, model (7.3) for the monotonicity constraint cannot be written in the form of (7.5).
7.3 Estimation with a Single Function
In this section we restrict our attention to the case r = 1 and drop the subscript k for simplicity of notation. Our goal is to estimate the nonparametric function f in H.
Assume that H = H0 ⊕ H1, where H0 = span{φ1, . . . , φp} is an RKHS with RK R0, and H1 is an RKHS with RK R1. We estimate f as the minimizer of the following PLS:
min_{f∈H} { (1/n) ∑_{i=1}^n (yi − Nif)² + λ‖P1f‖² }, (7.6)
where P1 is the projection operator from H onto H1, and λ is a smoothing parameter. We assume that the solution to (7.6) exists and is unique (conditions can be found in the supplemental document of Ke and Wang (2004)), and denote the solution by f̂.
7.3.1 Gauss–Newton and Newton–Raphson Methods
We first consider the special NNR model (7.5) in this section. Let

ηi(x) = Li(z)R(x, z) = Li(z)R0(x, z) + Li(z)R1(x, z) ≜ δi(x) + ξi(x).

Then ηi ∈ S, where S = H0 ⊕ span{ξ1, . . . , ξn}. Any f ∈ H can be decomposed as f = ς1 + ς2, where ς1 ∈ S and ς2 ∈ S^c. Furthermore, Lif = Liς1. Denote
LS(ψ(L1f), . . . , ψ(Lnf)) = (1/n) ∑_{i=1}^n (yi − ψ(Lif))²

as the least squares. Then the penalized least squares

PLS(ψ(L1f), . . . , ψ(Lnf))
  ≜ LS(ψ(L1f), . . . , ψ(Lnf)) + λ||P1f||²
  = LS(ψ(L1ς1), . . . , ψ(Lnς1)) + λ(||P1ς1||² + ||P1ς2||²)
  ≥ LS(ψ(L1ς1), . . . , ψ(Lnς1)) + λ||P1ς1||²
  = PLS(ψ(L1ς1), . . . , ψ(Lnς1)).
198 Smoothing Splines: Methods and Applications
Equality holds iff ||P1ς2|| = ||ς2|| = 0. Thus the minimizer of the PLS falls in the finite-dimensional space S and can be represented as
f(x) = ∑_{ν=1}^p dν φν(x) + ∑_{i=1}^n ci ξi(x). (7.7)
The representation (7.7) is an extension of the Kimeldorf–Wahba representer theorem. Let c = (c1, . . . , cn)^T and d = (d1, . . . , dp)^T. Based on (7.7), the PLS (7.6) becomes
(1/n) ∑_{i=1}^n (yi − ψ(Lif))² + λc^T Σc, (7.8)

where Lif = ∑_{ν=1}^p dν Liφν + ∑_{j=1}^n cj Liξj, and Σ = {LiLjR1}_{i,j=1}^n.
We need to find the minimizers c and d. Standard nonlinear optimization procedures can be employed to solve (7.8). We now describe the Gauss–Newton and Newton–Raphson methods.
We first describe the Gauss–Newton method. Let c−, d−, and f− be the current approximations of c, d, and f, respectively. Replacing ψ(Lif) by its first-order expansion at Lif−,
ψ(Lif) ≈ ψ(Lif−) − ψ′(Lif−)Lif− + ψ′(Lif−)Lif , (7.9)
the PLS (7.8) can be approximated by

(1/n)||ỹ − V(Td + Σc)||² + λc^T Σc, (7.10)

where ỹi = yi − ψ(Lif−) + ψ′(Lif−)Lif−, ỹ = (ỹ1, . . . , ỹn)^T, V = diag(ψ′(L1f−), . . . , ψ′(Lnf−)), and T = {Liφν}_{i=1,ν=1}^{n,p}. Assume that V is invertible, and let T̃ = VT, Σ̃ = VΣV, c̃ = V^{-1}c, and d̃ = d. Then the approximated PLS (7.10) reduces to

(1/n)||ỹ − T̃d̃ − Σ̃c̃||² + λc̃^T Σ̃c̃, (7.11)

which has the same form as (2.19). Thus the Gauss–Newton method updates c̃ and d̃ by solving

(Σ̃ + nλI)c̃ + T̃d̃ = ỹ,
T̃^T c̃ = 0. (7.12)
Equations in (7.12) have the same form as those in (2.21). Therefore, the same method as in Section 2.4 can be used to compute c̃ and d̃. New estimates of c and d can then be derived from c = Vc̃ and d = d̃. It is easy to see that the equations in (7.12) are equivalent to

(Σ + nλV^{-2})c + Td = V^{-1}ỹ,
T^T c = 0, (7.13)

which have the same form as those in (5.6). We next describe the Newton–Raphson method. Let
I(c, d) = (1/2) ∑_{i=1}^n {yi − ψ(∑_{ν=1}^p dν Liφν + ∑_{j=1}^n cj Liξj)}².
Let ψ = (ψ(L1f−), . . . , ψ(Lnf−))^T, u = −V(y − ψ), where V is defined above, O = diag((y1 − ψ(L1f−))ψ″(L1f−), . . . , (yn − ψ(Lnf−))ψ″(Lnf−)), and W = V² − O. Then ∂I/∂c = −ΣV(y − ψ) = Σu, ∂I/∂d = −T^T V(y − ψ) = T^T u, ∂²I/∂c∂c^T = ΣV²Σ − ΣOΣ = ΣWΣ, ∂²I/∂c∂d^T = ΣWT, and ∂²I/∂d∂d^T = T^T WT. The Newton–Raphson iteration satisfies the following equations:

[ ΣWΣ + nλΣ   ΣWT  ] [ c − c− ]   [ −Σu − nλΣc− ]
[ T^T WΣ     T^T WT ] [ d − d− ] = [ −T^T u      ].  (7.14)
Assume that W is positive definite. It is easy to see that solutions to

(Σ + nλW^{-1})c + Td = f− − W^{-1}u,
T^T c = 0, (7.15)

are also solutions to (7.14), where f− = (L1f−, . . . , Lnf−)^T. Again, the equations in (7.15) have the same form as those in (5.6). The methods in Section 5.2.1 can be used to solve (7.15).
Note that, when O is ignored, W = V² and V^{-1}ỹ = f− − W^{-1}u. Then the equations in (7.13) are the same as those in (7.15), and the Newton–Raphson method coincides with the Gauss–Newton method.
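As a concrete illustration of the Gauss–Newton iteration, consider ψ = exp with a kernel expansion f(x) = d + ∑ⱼ cⱼK(x, xⱼ). The sketch below (Python; a Gaussian kernel stands in for the spline reproducing kernel, so this illustrates the iteration, not the book's estimator) forms the working response ỹ = y − ψ(η−) + ψ′(η−)η− and solves the penalized linearized least squares at each step:

```python
import numpy as np

def gn_exp_fit(x, y, lam=1e-5, tol=1e-8, maxit=50):
    # penalized Gauss-Newton for y_i = exp{f(x_i)} + e_i with
    # f(x) = d + sum_j c_j K(x, x_j); K is a Gaussian kernel used
    # here only for illustration (not the cubic spline RK)
    n = len(x)
    K = np.exp(-((x[:, None] - x[None, :]) ** 2) / 0.1)
    B0 = np.hstack([np.ones((n, 1)), K])        # eta = B0 @ (d, c)
    P = np.zeros((n + 1, n + 1))
    P[1:, 1:] = K                               # penalty c' K c
    eta = np.zeros(n)
    for _ in range(maxit):
        psi, dpsi = np.exp(eta), np.exp(eta)    # psi and psi' for exp
        ytil = y - psi + dpsi * eta             # working response
        B = dpsi[:, None] * B0                  # V @ B0, V = diag(psi')
        lhs = B.T @ B + n * lam * P + 1e-10 * np.eye(n + 1)
        theta = np.linalg.solve(lhs, B.T @ ytil)
        eta_new = B0 @ theta
        if np.max(np.abs(eta_new - eta)) < tol:
            return np.exp(eta_new)
        eta = eta_new
    return np.exp(eta)                          # fitted positive mean
```

The fitted mean exp{f̂} is automatically positive, which is the point of the transformation in (7.2).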
7.3.2 Extended Gauss–Newton Method
We now consider estimation of the general NNR model (7.4). In general, the solution f̂ to the PLS (7.6) no longer falls in a finite-dimensional space. Therefore, some approximation is necessary.
Let f be a fixed element in H. A nonlinear functional N on H is called Frechet differentiable at f if there exists a bounded linear functional DN(f) such that (Debnath and Mikusinski 1999)

lim_{||h||→0} |N(f + h) − N(f) − DN(f)(h)| / ||h|| = 0.
Denote by f− the current approximation of f̂. Assume that the Ni are Frechet differentiable at f− and denote Di = DNi(f−). Note that the Di are known bounded linear functionals at the current approximation. The best linear approximation of Ni near f− is (Debnath and Mikusinski 1999)
Nif ≈ Nif− + Di(f − f−). (7.16)
Based on the linear approximation (7.16), the NNR model can be approximated by

ỹi = Dif + ǫi,  i = 1, . . . , n, (7.17)

where ỹi = yi − Nif− + Dif−. We minimize
(1/n) ∑_{i=1}^n (ỹi − Dif)² + λ||P1f||² (7.18)
to obtain a new approximation of f̂. Since the Di are bounded linear functionals, the solution to (7.18) has the form
f = ∑_{ν=1}^p dν φν(x) + ∑_{i=1}^n ci ξi(x), (7.19)
where ξi(x) = Di(z)R1(x, z). The coefficients ci and dν can be calculated by the same method as in Section 2.4. An iterative algorithm can then be formed, with the convergent solution taken as the final approximation of f̂. We refer to this algorithm as the extended Gauss–Newton (EGN) algorithm since it extends the Gauss–Newton method to infinite-dimensional spaces.
Note that the linear functionals Di depend on the current approximation f−. Thus the representers ξj change along iterations. This approach adaptively chooses representers to approximate the PLS estimate f̂. As in nonlinear regression, the performance of this algorithm depends largely on the curvature of the nonlinear functional. Simulations indicate that the EGN algorithm works well and converges quickly for commonly used nonlinear functionals. For the special NNR model (7.5), it can be shown that the EGN algorithm is equivalent to the standard Gauss–Newton method presented in Section 7.3.1 (Ke and Wang 2004).
An alternative approach is to approximate f by a finite series and solve for the coefficients using the Gauss–Newton or Newton–Raphson algorithm. When a finite series with good approximation properties is available, this approach may be preferable since it is relatively easy to implement.
However, the choice of such a finite series may become difficult in certain situations. The EGN approach is fully automatic and adaptive. On the other hand, it is more difficult to implement and may become computationally intensive.
7.3.3 Smoothing Parameter Selection and Inference
The smoothing parameter λ was held fixed in Sections 7.3.1 and 7.3.2. We now discuss methods for selecting λ. Similar to Section 3.4, we define the leaving-out-one cross-validation (CV) criterion as
CV(λ) = (1/n) ∑_{i=1}^n (Nif̂^{[i]} − yi)², (7.20)
where f̂^{[i]} is the minimizer of the following PLS:
(1/n) ∑_{j≠i} (yj − Njf)² + λ||P1f||². (7.21)
Again, the computation of f̂^{[i]} for each i = 1, . . . , n is costly. For fixed i and z, let h(i, z) be the minimizer of

(1/n){(z − Nif)² + ∑_{j≠i} (yj − Njf)²} + λ||P1f||².
That is, h(i, z) is the solution to (7.6) when the ith observation is replaced by z. It is not difficult to check that the arguments in Section 3.4 still hold for nonlinear functionals. Therefore, we have the following lemma.

Leaving-out-one Lemma For any fixed i, h(i, Nif̂^{[i]}) = f̂^{[i]}.
Note that h(i, yi) = f̂. Define

ai ≜ ∂Nih(i, yi)/∂yi = DNi(h(i, yi)) ∂h(i, yi)/∂yi, (7.22)
where DNi(h(i, yi)) is the Frechet differential of Ni at h(i, yi). Let y_i^{[i]} = Nif̂^{[i]}. Applying the leaving-out-one lemma, we have

∆i(λ) ≜ (Nif̂ − Nif̂^{[i]}) / (yi − Nif̂^{[i]}) = (Nih(i, yi) − Nih(i, y_i^{[i]})) / (yi − y_i^{[i]}) ≈ ∂Nih(i, yi)/∂yi = ai.
Then

yi − Nif̂^{[i]} = (yi − Nif̂) / (1 − ∆i(λ)) ≈ (yi − Nif̂) / (1 − ai).
Subsequently, the CV criterion (7.20) is approximated by

CV(λ) ≈ (1/n) ∑_{i=1}^n (yi − Nif̂)² / (1 − ai)². (7.23)
Replacing ai by the average ∑_{i=1}^n ai/n, we have the GCV criterion

GCV(λ) = [(1/n) ∑_{i=1}^n (yi − Nif̂)²] / [1 − (1/n) ∑_{i=1}^n ai]². (7.24)
Note that ai in (7.22) depends on f̂, and f̂ may depend on y nonlinearly. Therefore, unlike the linear case, in general there is no explicit formula for ai, and it is impossible to compute the CV or GCV estimate of λ by minimizing (7.24) directly. One approach is to replace f̂ in (7.22) by its current approximation f−. This suggests estimating λ at each iteration: specifically, at each iteration, select λ for the approximating SSR model (7.17) by the standard GCV criterion (3.4). This iterative approach is easy to implement. However, it does not guarantee convergence. Simulations indicate that convergence is achieved in most cases (Ke and Wang 2004).
We use the following connections between the NNR model (7.5) and a nonlinear mixed-effects (NLME) model to extend the GML method. Consider the following NLME model:

y = ψ(γ) + ǫ,  ǫ ∼ N(0, σ²I),
γ = Td + Σc,  c ∼ N(0, σ²Σ⁺/nλ), (7.25)
where γ = (γ1, . . . , γn)^T, ψ(γ) = (ψ(γ1), . . . , ψ(γn))^T, d are fixed effects, c are random effects, ǫ = (ǫ1, . . . , ǫn)^T are random errors independent of c, and Σ⁺ is the Moore–Penrose inverse of Σ. It is common practice to estimate c and d as minimizers of the following joint negative log-likelihood (Lindstrom and Bates 1990):
(1/n)||y − ψ(Td + Σc)||² + λc^T Σc. (7.26)
The joint negative log-likelihood (7.26) and the PLS (7.8) lead to the same estimates of c and d. In their two-step procedure, at the LME step, Lindstrom and Bates (1990) approximated the NLME model (7.25) by the following linear mixed-effects model:
w = Xd+ Zc+ ǫ, (7.27)
where

X = ∂ψ(Td + Σc)/∂d^T |_{c=c−, d=d−},
Z = ∂ψ(Td + Σc)/∂c^T |_{c=c−, d=d−},
w = y − ψ(Td− + Σc−) + Xd− + Zc−.
The subscript minus indicates quantities evaluated at the current iteration. It is not difficult to see that w = ỹ, X = VT, and Z = VΣ. Therefore, the REML estimate of λ based on the LME model (7.27) is the same as the GML estimate for the approximate SSR model corresponding to (7.12). This suggests an iterative approach in which λ is estimated for the approximate SSR model (7.17) at each iteration using the GML method.
The UBR method (3.3) may also be used to estimate λ at each iteration. The following algorithm summarizes the above procedures.
Linearization Algorithm
1. Initialize: f̂ = f0.

2. Linearize: Update f̂ by finding the PLS estimate of an approximate SSR model with linear functionals. The smoothing parameter λ is estimated using the GCV, GML, or UBR method.
3. Repeat Step 2 until convergence.
For the special model (7.5), the Gauss–Newton method (i.e., solve (7.12)) or the Newton–Raphson method (i.e., solve (7.15)) may be used at Step 2. For the general NNR model (7.4), the EGN method (i.e., fit model (7.17)) may be used at Step 2.
We estimate σ² by

σ̂² = ∑_{i=1}^n (yi − Nif̂)² / (n − ∑_{i=1}^n ai),
where the ai are defined in (7.22) and computed at convergence.

We now discuss how to construct Bayesian confidence intervals. At convergence, the extended Gauss–Newton method approximates the original model by

y*_i = D*_i f + ǫi,  i = 1, . . . , n, (7.28)

where D*_i = DNi(f̂) and y*_i = yi − Nif̂ + D*_i f̂. Assume a prior distribution for f of the form
F(x) = ∑_{ν=1}^p ζν φν(x) + δ^{1/2} U(x),
where the ζν are iid N(0, κ) and U(x) is a zero-mean Gaussian stochastic process with Cov(U(x), U(z)) = R1(x, z). Assume that the observations are generated from
y∗i = D∗i F + ǫi, i = 1, . . . , n. (7.29)
Since the D*_i are bounded linear functionals, the posterior mean of the Bayesian model (7.29) equals f̂. Posterior variances and Bayesian confidence intervals for model (7.29) can be calculated as in Section 3.8. Because of the first-order approximation (7.28), the performance of these approximate Bayesian confidence intervals depends largely on the accuracy of the linear approximation at convergence. The bootstrap method may also be used to construct confidence intervals.
7.4 Estimation with Multiple Functions
We now consider the case r > 1. The goal is to estimate the nonparametric functions f = (f1, . . . , fr) in model (7.4). Note that fk ∈ Hk for k = 1, . . . , r. Assume that Hk = Hk0 ⊕ Hk1, where Hk0 = span{φk1, . . . , φkpk} and Hk1 is an RKHS with RK Rk1. We estimate f as the minimizer of the following PLS:
} and Hk1 is an RKHS with RK Rk1. We estimate fas minimizers of the following PLS
(1/n) ∑_{i=1}^n (yi − Ni(f))² + ∑_{k=1}^r λk‖Pk1 fk‖², (7.30)
where Pk1 are projection operators from Hk onto Hk1, and λ1, . . . , λr are smoothing parameters.

The linearization procedures of Section 7.3 may be applied to all functions simultaneously. However, this is usually computationally intensive when n and/or r are large. We instead use a Gauss–Seidel-type algorithm to estimate the functions iteratively, one at a time.
Nonlinear Gauss–Seidel Algorithm
1. Initialize: fk = fk0, k = 1, . . . , r.
2. Cycle: For k = 1, . . . , r, 1, . . . , r, . . . , conditional on the current approximations of f1, . . . , fk−1, fk+1, . . . , fr, update fk using the linearization algorithm in Section 7.3.
3. Repeat Step 2 until convergence.
Step 2 involves an inner iteration of the linearization algorithm. Full convergence of this inner iteration is usually unnecessary; a small number of iterations is usually good enough. The nonlinear Gauss–Seidel procedure is an extension of the backfitting procedure.
Denote by f̂ the estimate of f at convergence. Note that here f̂ denotes the estimate of the vector of functions f rather than the fitted values of a single function. Assume that the Frechet differentials of Ni with respect to f evaluated at f̂, D*_i = DNi(f̂), exist. Then D*_i h = ∑_{k=1}^r D*_{ki} hk, where D*_{ki} is the partial Frechet differential of Ni with respect to fk evaluated at f̂, h = (h1, . . . , hr), and hk ∈ Hk (Flett 1980). Using the linear approximation, the NNR model (7.4) may be approximated by
y*_i = ∑_{k=1}^r D*_{ki} fk + ǫi,  i = 1, . . . , n, (7.31)
where y*_i = yi − Ni(f̂) + ∑_{k=1}^r D*_{ki} f̂k. The corresponding Bayes model for (7.31) is given in (8.12) and (8.13). Section 8.2.2 provides formulae for posterior means and covariances, and a discussion of how to construct Bayesian confidence intervals for the functions fk in model (7.31). These Bayesian confidence intervals provide approximate confidence intervals for fk in the NNR model. Again, the bootstrap method may also be used to construct confidence intervals.
7.5 The nnr Function
The function nnr in the assist package is designed to fit the special NNR model (7.5) when the Li are evaluational functionals. Sections 7.6.2 and 7.6.3 provide two example implementations of the EGN method for NNR models that cannot be fitted by the nnr function.
A typical call is
nnr(formula, func, start)
where formula is a two-sided formula specifying the response variable on the left side of a ~ operator and an expression for the function ψ on the right side, with the fk treated as parameters. The argument func inputs a list of formulae, each specifying bases φk1, . . . , φkpk for Hk0 and the RK Rk1 for Hk1. Each formula in this list has the form

f ~ list(~ φ1 + · · · + φp, rk).
For example, suppose f = (f1, f2), where f1 and f2 are functions of an independent variable x, and suppose both f1 and f2 are modeled using cubic splines. Then the bases and RKs of f1 and f2 can be specified using
func=list(f1(x)~list(~x,cubic(x)),f2(x)~list(~x,cubic(x)))
or simply
func=list(f1(x)+f2(x)~list(~x, cubic(x)))
The argument start inputs a vector or an expression that specifies the initial values of the unknown functions.
The method for selecting smoothing parameters is specified by the argument spar. The options spar="v", spar="m", and spar="u" correspond to the GCV, GML, and UBR methods, respectively, with GCV as the default. The option method in the argument control specifies the computational method, with method="GN" and method="NR" corresponding to the Gauss–Newton and Newton–Raphson methods, respectively. The default is the Newton–Raphson method.
An object of class nnr is returned. The generic function summary can be applied to extract further information. Approximate Bayesian confidence intervals can be constructed based on the output of the intervals function.
7.6 Examples
7.6.1 Nonparametric Regression Subject to Positive Constraint
The exponential transformation in (7.2) may be used to remove the positivity constraint for a univariate or multivariate regression function. In this section we use a simple simulation to illustrate how to fit model (7.2) and the advantage of the exponential transformation over a simple cubic spline fit.
We generate n = 50 samples from model (7.1) with g(x) = exp(−6x), xi equally spaced in [0, 1], and ǫi iid N(0, 0.1²). We fit g with a cubic spline and with the exponential transformation model (7.2), where f is modeled by a cubic spline:
> n <- 50
> x <- seq(0,1,len=n)
> y <- exp(-6*x)+.1*rnorm(n)
> ssrfit <- ssr(y~x, cubic(x))
> nnrfit <- nnr(y~exp(f(x)), func=f(u)~list(~u,cubic(u)),
start=list(f=log(abs(y)+0.001)))
where, for simplicity, we used log(|yi| + 0.001) as initial values. We compute the mean squared error (MSE) as
MSE = (1/n) ∑_{i=1}^n (ĝ(xi) − g(xi))²,
where ĝ is either the cubic spline fit or the fit based on the exponential transformation model (7.2). The simulation was repeated 100 times. Ignoring the positive constraint, the cubic spline fits have larger MSEs than those based on the exponential transformation (Figure 7.1(a)). Figure 7.1(b) shows the observations, the true function, and the fits for a typical replication. The cubic spline fit has portions that are negative; the exponential transformation leads to a better fit.
FIGURE 7.1 (a) Boxplots of MSEs on the logarithmic scale for the cubic spline fits and the fits based on the exponential transformation; and (b) observations (circles), the true function, and estimates for a typical replication.
7.6.2 Nonparametric Regression Subject to Monotone Constraint
Consider model (7.1) and suppose that g is known to be strictly increasing. We relax the constraint by considering the transformed model (7.3). It is clear that model (7.3) cannot be written in the form of (7.5); therefore, it cannot be fitted using the current version of the nnr function. We now illustrate how to apply the EGN method of Section 7.3.2 to estimate the function f in (7.3). A similar procedure may be derived for other situations; see Section 7.6.3 for another example.
For simplicity, we model f in (7.3) using the cubic spline model space W₂²[0, 1]. Let Nif = ∫_0^{xi} exp{f(s)}ds. Then it can be shown that the Frechet differential of Ni at f− exists, and Dif = DNi(f−)(f) = ∫_0^{xi} exp{f−(s)}f(s)ds (Debnath and Mikusinski 1999). Then model (7.3) can be approximated by

ỹi = β + Dif + ǫi,  i = 1, . . . , n, (7.32)

where ỹi = yi − Nif− + Dif−. Model (7.32) is a partial spline model. Suppose we use the construction of W₂²[0, 1] in Section 2.6, where φ1(x) = k0(x) = 1 and φ2(x) = k1(x) = x − 0.5 are basis functions for H0, and R1(x, z) = k2(x)k2(z) − k4(|x − z|) is the RK for H1. To fit model (7.32) we need to compute ỹi, T = {Diφν}_{i=1,ν=1}^{n,2}, and Σ = {Di(x)Dj(z)R1(x, z)}_{i,j=1}^n. It can be shown that

ỹi = yi − ∫_0^{xi} exp{f−(s)}{1 − f−(s)}ds,
Diφ1 = ∫_0^{xi} exp{f−(s)}ds,
Diφ2 = ∫_0^{xi} exp{f−(s)}(s − 0.5)ds,
Di(x)Dj(z)R1(x, z) = ∫_0^{xi} ∫_0^{xj} exp{f−(s) + f−(t)}R1(s, t)dsdt.
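These integrals have no closed form; the text goes on to approximate them by summing a three-point Gaussian quadrature rule over the subintervals between design points. A minimal sketch of that rule (Python; fun stands for an integrand such as exp{f−(·)}):

```python
import numpy as np

def gauss3(fun, a, b):
    # three-point Gauss-Legendre rule on [a, b]; exact for degree <= 5
    nodes = np.array([-np.sqrt(0.6), 0.0, np.sqrt(0.6)])
    wts = np.array([5 / 9, 8 / 9, 5 / 9])
    mid, half = (a + b) / 2, (b - a) / 2
    return half * np.sum(wts * fun(mid + half * nodes))

def integral_to(fun, x, i):
    # int_0^{x_i} fun(s) ds as a sum of 3-point rules over the
    # subintervals [x_{j-1}, x_j], with x_0 = 0
    pts = np.concatenate([[0.0], x[:i]])
    return sum(gauss3(fun, pts[j - 1], pts[j]) for j in range(1, len(pts)))
```

Because the rule is exact for polynomials of degree up to five, the piecewise sum is very accurate when the subintervals are short, which is the point made in the text.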
The above integrals do not have closed forms. We approximate them using the Gaussian quadrature method. For simplicity of notation, suppose that the x values are distinct and ordered such that x1 < x2 < · · · < xn. Let x0 = 0. Write

∫_0^{xi} exp{f−(s)}ds = ∑_{j=1}^i ∫_{xj−1}^{xj} exp{f−(s)}ds.

We then approximate each integral ∫_{xj−1}^{xj} exp{f−(s)}ds using Gaussian quadrature with three points. The approximation is quite accurate when xj − xj−1 is small; more points may be added for wider intervals. We use the same method to approximate the integrals ∫_0^{xi} exp{f−(s)}{1 − f−(s)}ds and ∫_0^{xi} exp{f−(s)}(s − 0.5)ds. Write the double integral
∫ xi
0
∫ xj
0
exp{f−(s) + f−(t)}R1(s, t)dsdt
=i∑
k=1
j∑
l=1
∫ xk
xk−1
∫ xl
xl−1
exp{f−(s) + f−(t)}R1(s, t)dsdt.
We then approximate each integral

    ∫_{x_{k−1}}^{x_k} ∫_{x_{l−1}}^{x_l} exp{f_−(s) + f_−(t)} R_1(s, t) ds dt

using the simple product rule (Evans and Swartz 2000).

The R function inc in Appendix B implements the EGN procedure for model (7.3). The function inc is available in the assist package.

We now use a small-scale simulation to show the advantage of the transformed model (7.3) over the simple cubic spline fit. We generate n = 50 samples from model (7.1) with g(x) = 1 − exp(−6x), x_i equally spaced in [0, 1], and ε_i iid ∼ N(0, 0.1²). We fit g with a cubic spline, and model (7.3) with f modeled by a cubic spline, both with the GML choice of the smoothing parameter:
> n <- 50; x <- seq(0,1,len=n)
> y <- 1-exp(-6*x)+.1*rnorm(n)
> ssrfit <- ssr(y~x, cubic(x), spar="m")
> incfit <- inc(y, x, spar="m")
MSEs are computed as in Section 7.6.1. The simulation was repeated 100 times. Ignoring the monotonicity constraint, cubic spline fits have larger MSEs than those based on model (7.3) (Figure 7.2(a)). Figure 7.2(b) shows observations, the true function, and fits for a typical replication. The cubic spline fit is not monotone. The fit based on model (7.3) is closer to the true function.
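The three-point Gaussian quadrature used in this section is the Gauss–Legendre rule, exact for polynomials up to degree five on each subinterval. A hedged Python sketch of the rule (the nodes and weights are the standard ones; f_− below is a hypothetical current estimate, not the book's R implementation):

```python
import numpy as np

# Nodes and weights of the 3-point Gauss-Legendre rule on [-1, 1].
z = np.array([-np.sqrt(3.0 / 5.0), 0.0, np.sqrt(3.0 / 5.0)])
w = np.array([5.0 / 9.0, 8.0 / 9.0, 5.0 / 9.0])

def gauss3(h, a, b):
    """Approximate int_a^b h(s) ds with 3-point Gauss-Legendre."""
    mid, half = (a + b) / 2.0, (b - a) / 2.0
    return half * np.sum(w * h(half * z + mid))

# Accurate on a short interval, as used for int exp{f_-(s)} ds.
f_minus = lambda s: np.sin(s)              # hypothetical current estimate
approx = gauss3(lambda s: np.exp(f_minus(s)), 0.2, 0.3)

# Sanity check against a dense trapezoid rule on the same interval.
s_ref = np.linspace(0.2, 0.3, 4001)
vals = np.exp(f_minus(s_ref))
ref = np.sum(0.5 * (vals[1:] + vals[:-1]) * np.diff(s_ref))
assert abs(approx - ref) < 1e-8
```

The double integrals are handled the same way by the product rule: apply the one-dimensional nodes and weights in each coordinate and sum over the 3 × 3 tensor grid.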
If the function g in model (7.1) is known to be both positive and strictly increasing, then we can consider model (7.3) with the additional constraint β > 0. Writing β = exp(α), model (7.3) becomes a semiparametric nonlinear regression model as in Section 8.3. A similar EGN procedure can be derived to fit such a model.
Now consider the child growth data consisting of height measurements of a child over one school year. Observations are shown in Figure 7.3(a). We first fit a cubic spline to model (7.1) with the GCV choice of the smoothing parameter. The cubic spline fit, shown in Figure 7.3(a) as the dashed line, is not monotone. It is reasonable to assume that the mean
FIGURE 7.2 (a) Boxplots of MSEs on the logarithmic scale of the cubic spline fits and the fits based on model (7.3); and (b) observations (circles), the true function, and estimates for a typical replication.
FIGURE 7.3 Child growth data, plots of (a) observations (circles), the cubic spline fit (dashed line), and the fit based on model (7.3) with 95% bootstrap confidence intervals (shaded region), and (b) estimate of the velocity of growth with 95% bootstrap confidence intervals (shaded region).
growth function g in (7.1) is a strictly increasing function. Therefore, we fit model (7.3) using the inc function and compute bootstrap confidence intervals with 10,000 bootstrap samples:
> library(fda); attach(onechild)
> x <- day/365; y <- height
> onechild.cub.fit <- ssr(y~x, cubic(x))
> onechild.inc.fit <- inc(y, x, spar="m")
> nboot <- 10000; set.seed(23057)
> yhat <- onechild.inc.fit$y.fit; resi <- y-yhat
> fb <- gb <- NULL; totboot <- 0
> while (totboot < nboot) {
    yb <- yhat + sample(resi, replace=TRUE)
    tmp <- try(inc(yb, x, spar="m"))
    if (class(tmp) != "try-error" & tmp$iter[2] < 1.e-6) {
      fb <- cbind(fb,
        (tmp$pred$f-onechild.inc.fit$pred$f)/tmp$sigma)
      gb <- cbind(gb,
        (tmp$pred$y-onechild.inc.fit$pred$y)/tmp$sigma)
      totboot <- totboot+1
    }
  }
> shat <- onechild.inc.fit$sigma
> gl <- onechild.inc.fit$pred$y-
shat*apply(gb,1,quantile,prob=.975)
> gu <- onechild.inc.fit$pred$y-
shat*apply(gb,1,quantile,prob=.025)
> hl <- exp(onechild.inc.fit$pred$f-
shat*apply(fb,1,quantile,prob=.975))/365
> hu <- exp(onechild.inc.fit$pred$f-
shat*apply(fb,1,quantile,prob=.025))/365
where gl and gu are lower and upper bounds for the mean function g, and hl and hu are lower and upper bounds for the velocity defined as h(x) = g′(x) = exp{f(x)}. Percentile-t bootstrap confidence intervals were constructed.
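The percentile-t construction studentizes each bootstrap deviation by the bootstrap estimate of σ before taking quantiles, and then rescales by the original σ̂; this is what the divisions by tmp$sigma in the loop above accomplish. A toy Python sketch of the final step (all numbers hypothetical; the actual computation is the R code above):

```python
import numpy as np

rng = np.random.default_rng(0)
ghat, sigma_hat = 2.0, 0.3          # original estimate and its sigma

# Hypothetical bootstrap replicates: estimate and sigma per sample.
g_star = ghat + rng.normal(0.0, sigma_hat, size=2000)
sigma_star = np.full(2000, sigma_hat)

# Studentized deviations, as stored in fb/gb by the R code.
t_star = (g_star - ghat) / sigma_star

# Percentile-t interval: note the "crossed" quantiles.
lower = ghat - sigma_hat * np.quantile(t_star, 0.975)
upper = ghat - sigma_hat * np.quantile(t_star, 0.025)
assert lower < ghat < upper
```

The upper 97.5% quantile of t* sets the lower bound and vice versa, matching the prob=.975 and prob=.025 lines in the R code.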
The constrained fit based on model (7.3) is shown in Figure 7.3(a) with 95% bootstrap confidence intervals. Figure 7.3(b) shows the estimate of g′(x) with 95% bootstrap confidence intervals. As in Ramsay (1998), the velocity shows three short bursts. Note that Ramsay (1998) used a different transformation, g(x) = β_1 + β_2 ∫_0^x exp{∫_0^s w(u) du} ds, in model (7.1) to relax the monotone constraint. It is easy to see that w(x) = f′(x), where f(x) = log g′(x). Therefore, with respect to the mean function g, the penalty in this section equals ∫_0^1 [{log g′(x)}″]² dx, and the penalty in Ramsay (1998) equals ∫_0^1 {g″(x)/g′(x)}² dx.
7.6.3 Term Structure of Interest Rates
In this section we use the bond data to investigate the term structure of interest rates, a concept central to economic and financial theory.
Consider a set of n coupon bonds from which the interest rate term structure is to be inferred. Denote the current time as zero. Let y_i be the current price of bond i and r_ij be the payment paid at a future time x_ij, 0 < x_i1 < ⋯ < x_im_i, i = 1, …, n, j = 1, …, m_i. The pricing model assumes that (Fisher, Nychka and Zervos 1995, Jarrow, Ruppert and Yu 2004)
    y_i = ∑_{j=1}^{m_i} r_ij g(x_ij) + ε_i,   i = 1, …, n,   (7.33)
where g is a discount function and g(x_ij) represents the price of a dollar delivered at time x_ij, and the ε_i are iid random errors with mean zero and variance σ². The goal is to estimate the discount function g from observations y_i. Assume that g ∈ W_2^2[0, b] for a fixed time b, and define L_i g = ∑_{j=1}^{m_i} r_ij g(x_ij). Then it is easy to see that the L_i are bounded linear functionals on W_2^2[0, b]. Therefore, model (7.33) is a special case of the general SSR model (2.10). Consider the cubic spline construction in Section 2.2. Two basis functions of the null space are φ_1(x) = 1 and φ_2(x) = x, and denote the RK of H_1 as R_1. To fit model (7.33) using the ssr function, we need to compute T = {L_i φ_ν}_{i=1}^{n}{}_{ν=1}^{2} and Σ = {L_i L_j R_1}_{i,j=1}^{n}. Define r_i = (r_i1, …, r_im_i) and X = diag(r_1, …, r_n). Then
    T = { ∑_{j=1}^{m_i} r_ij φ_ν(x_ij) }_{i=1}^{n}{}_{ν=1}^{2} = XS,

where S = (φ_1, φ_2) and φ_ν = (φ_ν(x_11), …, φ_ν(x_1m_1), …, φ_ν(x_n1), …, φ_ν(x_nm_n))^T for ν = 1, 2. Similarly, it can be shown that
    Σ = { ∑_{k=1}^{m_i} ∑_{l=1}^{m_j} r_ik r_jl R_1(x_ik, x_jl) }_{i,j=1}^{n} = X Λ X^T,

where Λ = {Λ_ij}_{i,j=1}^{n} and Λ_ij = {R_1(x_ik, x_jl)}_{k=1}^{m_i}{}_{l=1}^{m_j}. Note that R_1 can be calculated by the cubic2 function.
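The identity T = XS simply reorganizes the sums ∑_j r_ij φ_ν(x_ij): stack all payment times, evaluate the basis functions there to get S, and multiply by the block-diagonal payment matrix X. A small Python sketch with made-up payments (the book builds X in R via diagComp):

```python
import numpy as np

# Two hypothetical bonds: payments r_i at times x_i.
r = [np.array([5.0, 105.0]), np.array([4.0, 4.0, 104.0])]
x = [np.array([0.5, 1.0]), np.array([0.5, 1.0, 1.5])]

# Block-diagonal X = diag(r_1, r_2); basis phi_1 = 1, phi_2 = x.
times = np.concatenate(x)
m = [len(ri) for ri in r]
X = np.zeros((len(r), len(times)))
off = 0
for i, ri in enumerate(r):
    X[i, off:off + m[i]] = ri
    off += m[i]
S = np.column_stack([np.ones_like(times), times])

T = X @ S
# Row i of T is (sum_j r_ij, sum_j r_ij * x_ij), i.e., L_i phi_1, L_i phi_2.
direct = np.array([[ri.sum(), (ri * xi).sum()] for ri, xi in zip(r, x)])
assert np.allclose(T, direct)
```

The same reorganization gives Σ = XΛX^T, with Λ holding the reproducing kernel evaluated at all pairs of payment times.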
The bond data set contains 78 Treasury bonds and 144 GE (General Electric Company) bonds. We first fit model (7.33) to the Treasury bonds and compute estimates of g at grid points as follows:
> data(bond); attach(bond)
> group <- as.vector(table(name[type=="govt"]))
> y <- price[type=="govt"][cumsum(group)]
> x <- time[type=="govt"]
> r <- payment[type=="govt"]
> X <- assist:::diagComp(matrix(r,nrow=1), group)
> S <- cbind(1, x)
> T <- X%*%S
> Q <- X%*%cubic2(x)%*%t(X)
> bond.cub.fit.govt <- ssr(y~T-1, Q, spar="m",
    limnla=c(6,10))
> grid1 <- seq(min(x),max(x),len=100)
> g.cub.p.govt <- cbind(1,grid1)%*%bond.cub.fit.govt$coef$d +
    cubic2(grid1,x)%*%t(X)%*%bond.cub.fit.govt$coef$c
where diagComp is a function in the assist package that constructs the matrix X = diag(r_1, …, r_n). We compute bootstrap confidence intervals for g at grid points as follows:
> boot.bond.one <- function(x, y, yhat, X, T, Q,
    spar="m", limnla=c(-3,6), grid, nboot, seed=0) {
    set.seed(seed)
    resi <- y-yhat
    gb <- NULL
    for (i in 1:nboot) {
      yb <- yhat + sample(resi, replace=TRUE)
      tmp <- try(ssr(yb~T-1, Q, spar=spar, limnla=limnla))
      if (class(tmp) != "try-error") gb <- cbind(gb,
        cbind(1,grid)%*%tmp$coef$d +
        cubic2(grid,x)%*%t(X)%*%tmp$coef$c)
    }
    return(gb)
  }
> g.cub.b.govt <- boot.bond.one(x, y,
yhat=bond.cub.fit.govt$fit,
X, T, Q, grid=grid1, nboot=5000, seed=3498)
> gl <- apply(g.cub.b.govt, 1, quantile, prob=.025)
> gu <- apply(g.cub.b.govt, 1, quantile, prob=.975)
where gl and gu are lower and upper bounds for the discount function g, and the 95% percentile bootstrap confidence intervals were computed based on 5,000 bootstrap samples. Model (7.33) for the GE bonds can be fitted similarly. The estimates of the discount functions and bootstrap confidence intervals for the Treasury and GE bonds are shown in Figure 7.4(a).
FIGURE 7.4 Bond data, plots of (a) unconstrained estimates of the discount function for Treasury bond (solid line) and GE bond (dashed line) based on model (7.33) with 95% bootstrap confidence intervals (shaded regions), (b) constrained estimates of the discount function for Treasury bond (solid line) and GE bond (dashed line) based on model (7.35) with 95% bootstrap confidence intervals (shaded regions), (c) estimates of the forward rate for Treasury bond (solid line) and GE bond (dashed line) based on model (7.35) with 95% bootstrap confidence intervals (shaded regions), and (d) estimates of the credit spread based on model (7.35) with 95% bootstrap confidence intervals (shaded region).
The discount function g is required to be positive, decreasing, and to satisfy g(0) = 1. These constraints are ignored in the above direct estimation. One simple approach to dealing with these constraints is to represent g by

    g(x) = exp{ −∫_0^x f(s) ds },   (7.34)
where f(s) ≥ 0 is the so-called forward rate. Reparametrization (7.34) takes care of all constraints on g. The goal now is to estimate the forward rate f. Assuming f ∈ W_2^2[0, b] and replacing g in (7.33) by (7.34) leads to the following NNR model

    y_i = ∑_{j=1}^{m_i} r_ij exp{ −∫_0^{x_ij} f(s) ds } + ε_i,   i = 1, …, n.   (7.35)
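Under (7.34)–(7.35), a bond's model price is ∑_j r_ij exp{−∫_0^{x_ij} f(s) ds}; with a flat forward rate f ≡ f_0 the discount function reduces to the familiar g(x) = exp(−f_0 x). A hedged Python sketch (payments and rate are hypothetical; the book's fitting code is the R function one.bond):

```python
import numpy as np

def price(payments, times, f, ngrid=2001):
    """Model price sum_j r_j exp{-int_0^{x_j} f(s) ds} (trapezoid rule)."""
    total = 0.0
    for r_j, x_j in zip(payments, times):
        s = np.linspace(0.0, x_j, ngrid)
        fs = f(s)
        integral = np.sum(0.5 * (fs[1:] + fs[:-1]) * np.diff(s))
        total += r_j * np.exp(-integral)
    return total

# A flat forward rate f(s) = f0 gives the closed form g(x) = exp(-f0 x).
f0 = 0.05
payments, times = [4.0, 4.0, 104.0], [0.5, 1.0, 1.5]
p = price(payments, times, lambda s: np.full_like(s, f0))
closed_form = sum(r * np.exp(-f0 * x) for r, x in zip(payments, times))
assert abs(p - closed_form) < 1e-9
```

The nonlinearity of the price in f is what makes (7.35) an NNR rather than an SSR model.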
For simplicity, the nonnegative constraint on f is not enforced since its estimate is not close to zero. When necessary, the nonnegative constraint can be enforced with a further reparametrization of f such as f = exp(h). The R function one.bond in Appendix C implements the EGN algorithm to fit model (7.35) for a single bond. For example, we can fit model (7.35) to the Treasury bonds as follows:
bond.nnr.fit.govt <- one.bond(price=price[type=="govt"],
    payment=payment[type=="govt"],
    time=time[type=="govt"],
    name=name[type=="govt"])
In the following, we model Treasury and GE bonds jointly. Assume that

    y_ki = ∑_{j=1}^{m_ki} r_kij exp[ −∫_0^{x_kij} {f_1(s) + f_2(s) δ_{k,2}} ds ] + ε_ki,
           k = 1, 2;  i = 1, …, n_k,   (7.36)

where k represents bond type with k = 1 and k = 2 corresponding to Treasury and GE bonds, respectively, y_ki is the current price for bond i of type k, r_kij are future payments for bond i of type k, δ_{ν,µ} is the Kronecker delta, f_1 is the forward rate for Treasury bonds, and f_2 represents the difference between GE and Treasury bonds (also called the credit spread). We assume that f_1, f_2 ∈ W_2^2[0, b]. Model (7.36) is a general NNR model with two functions f_1 and f_2. It cannot be fitted directly by the nnr function. We now describe how to implement the nonlinear Gauss–Seidel algorithm to fit model (7.36).
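The nonlinear Gauss–Seidel strategy alternates between the two blocks: update f_1 with f_2 fixed, then f_2 with f_1 fixed, and iterate to convergence. The same block-relaxation idea can be illustrated on a toy linear least-squares problem in Python (an illustration of the principle only, not the book's algorithm):

```python
import numpy as np

# Toy objective: minimize ||y - A a - B b||^2 by alternating over a and b.
rng = np.random.default_rng(1)
A, B = rng.normal(size=(50, 3)), rng.normal(size=(50, 2))
y = rng.normal(size=50)

a, b = np.zeros(3), np.zeros(2)
for _ in range(200):                                   # Gauss-Seidel sweeps
    a = np.linalg.lstsq(A, y - B @ b, rcond=None)[0]   # update block 1
    b = np.linalg.lstsq(B, y - A @ a, rcond=None)[0]   # update block 2

# The alternating scheme agrees with solving the full problem jointly.
full = np.linalg.lstsq(np.hstack([A, B]), y, rcond=None)[0]
assert np.allclose(np.concatenate([a, b]), full, atol=1e-6)
```

In (7.36) each block update is itself a linearized SSR fit, so every sweep is two calls to the machinery of Chapter 2.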
Denote the current estimates of f_1 and f_2 as f_{1−} and f_{2−}, respectively. First consider updating f_1 for fixed f_2. The Fréchet differential of N_ki with respect to f_1 at the current estimates f_{1−} and f_{2−} is

    D_ki h = −∑_{j=1}^{m_ki} r_kij exp[ −∫_0^{x_kij} {f_{1−}(s) + f_{2−}(s) δ_{k,2}} ds ] ∫_0^{x_kij} h(s) ds
           = −∑_{j=1}^{m_ki} r_kij w_kij ∫_0^{x_kij} h(s) ds,
where w_kij = exp[ −∫_0^{x_kij} {f_{1−}(s) + f_{2−}(s) δ_{k,2}} ds ]. Let
    ỹ_ki,1 = ∑_{j=1}^{m_ki} r_kij w_kij (1 + f_kij,1) − y_ki,
    L_ki,1 f_1 = ∑_{j=1}^{m_ki} r_kij w_kij ∫_0^{x_kij} f_1(s) ds,

where f_kij,1 = ∫_0^{x_kij} f_{1−}(s) ds. We need to fit the following SSR model

    ỹ_ki,1 = L_ki,1 f_1 + ε_ki,   k = 1, 2;  i = 1, …, n_k,   (7.37)
to update f_1. Let T_k = {L_ki,1 φ_ν}_{i=1}^{n_k}{}_{ν=1}^{2} for k = 1, 2, T = (T_1^T, T_2^T)^T, Σ_uv = {L_ui,1 L_vj,1 R_1}_{i=1}^{n_u}{}_{j=1}^{n_v} for u, v = 1, 2, and Σ = {Σ_uv}_{u,v=1}^{2}. To fit model (7.37) using the ssr function, we need to compute the matrices T and Σ. Define b_ki = (r_ki1 w_ki1, …, r_kim_ki w_kim_ki) and X_k = diag(b_k1, …, b_kn_k) for k = 1, 2. Define
    ψ_kν = ( ∫_0^{x_k11} φ_ν(s) ds, …, ∫_0^{x_k1m_k1} φ_ν(s) ds, …,
             ∫_0^{x_kn_k1} φ_ν(s) ds, …, ∫_0^{x_kn_km_kn_k} φ_ν(s) ds )^T,   ν = 1, 2;  k = 1, 2,
    S_k = (ψ_k1, ψ_k2),   k = 1, 2,
    Λ_uv,ij = { ∫_0^{x_uik} ∫_0^{x_vjl} R_1(s, t) ds dt }_{k=1}^{m_ui}{}_{l=1}^{m_vj},
              u, v = 1, 2;  i = 1, …, n_u;  j = 1, …, n_v,
    Λ_uv = { Λ_uv,ij }_{i=1}^{n_u}{}_{j=1}^{n_v},   u, v = 1, 2.
Then it can be shown that

    T_k = { ∑_{j=1}^{m_ki} r_kij w_kij ∫_0^{x_kij} φ_ν(s) ds }_{i=1}^{n_k}{}_{ν=1}^{2} = X_k S_k,   k = 1, 2,
    Σ_uv = { ∑_{k=1}^{m_ui} ∑_{l=1}^{m_vj} r_uik w_uik r_vjl w_vjl ∫_0^{x_uik} ∫_0^{x_vjl} R_1(s, t) ds dt }_{i=1}^{n_u}{}_{j=1}^{n_v}
         = X_u Λ_uv X_v^T,   u, v = 1, 2.
As in Section 7.6.2, the integrals ∫_0^{x_kij} φ_ν(s) ds and ∫_0^{x_uik} ∫_0^{x_vjl} R_1(s, t) ds dt are approximated using the Gaussian quadrature method. Note that these integrals do not change along iterations. Therefore, they only need to be computed once.
Next we consider updating f_2 with f_1 fixed. The Fréchet differential of N_ki at the current estimates f_{1−} and f_{2−} is

    D_ki h = −∑_{j=1}^{m_ki} r_kij δ_{k,2} exp[ −∫_0^{x_kij} {f_{1−}(s) + f_{2−}(s) δ_{k,2}} ds ] ∫_0^{x_kij} h(s) ds.
Note that D_ki h = 0 when k = 1. Therefore, we use observations with k = 2 only at this step. Let

    ỹ_2i,2 = ∑_{j=1}^{m_2i} r_2ij w_2ij (1 + f_2ij,2) − y_2i,
    L_2i,2 f = ∑_{j=1}^{m_2i} r_2ij w_2ij ∫_0^{x_2ij} f(s) ds,

where f_2ij,2 = ∫_0^{x_2ij} f_{2−}(s) ds. We need to fit the following SSR model

    ỹ_2i,2 = L_2i,2 f_2 + ε_2i,   i = 1, …, n_2,   (7.38)
to update f_2. It can be shown that

    {L_2i,2 φ_ν}_{i=1}^{n_2}{}_{ν=1}^{2} = T_2,
    {L_2i,2 L_2j,2 R_1}_{i,j=1}^{n_2} = Σ_22.
The R function two.bond in Appendix C implements the nonlinear Gauss–Seidel algorithm to fit model (7.36) for two bonds. For example, we can fit model (7.36) to the bond data and compute 5,000 bootstrap samples as follows:
> bond.nnr.fit <- two.bond(price=price, payment=payment,
    time=time, name=name, type=type, spar="m")
> boot.bond.two <- function(y, yhat, price, payment, time,
    name, type, spar="m", limnla=c(-3,6), nboot, seed=0)
  {
    set.seed(seed)
    resi <- y-yhat
    group <- c(as.vector(table(name[type=="govt"])),
      as.vector(table(name[type=="ge"])))
    fb <- f2b <- gb <- NULL
    for (i in 1:nboot) {
      price.b <- rep(yhat+sample(resi,replace=TRUE), group)
      tmp <- try(two.bond(price=price.b, payment=payment,
        time=time, name=name, type=type))
      if (class(tmp) != "try-error" &
          tmp$iter[2] < 1.e-4 & tmp$iter[3] == 0) {
        fb <- cbind(fb, c(tmp$f.val$f1, tmp$f.val$f2))
        f2b <- cbind(f2b, tmp$f2.val)
        gb <- cbind(gb, c(tmp$dc[[1]], tmp$dc[[2]]))
      }
    }
    list(fb=fb, f2b=f2b, gb=gb)
  }
> gf.nnr.b <- boot.bond.two(y=bond.nnr.fit$y$y,
yhat=bond.nnr.fit$y$yhat, price=price, payment=payment,
time=time, name=name, type=type, nboot=5000,
seed=2394)
where the function boot.bond.two returns a list of estimates of the discount functions (gb), the forward rates (fb), and the credit spread (f2b). The 95% percentile bootstrap confidence intervals for the discount functions of the Treasury and GE bonds can be computed as follows:
> n1 <- nrow(bond[type=="govt",])
> gl1 <- apply(gf.nnr.b$gb[1:n1,],1,quantile,prob=.025)
> gu1 <- apply(gf.nnr.b$gb[1:n1,],1,quantile,prob=.975)
> gl2 <- apply(gf.nnr.b$gb[-(1:n1),],1,quantile,prob=.025)
> gu2 <- apply(gf.nnr.b$gb[-(1:n1),],1,quantile,prob=.975)
where gl1 and gu1 are lower and upper bounds for the discount function of the Treasury bond, and gl2 and gu2 are lower and upper bounds for the discount function of the GE bond. The 95% percentile bootstrap confidence intervals for the forward rates and credit spread can be computed similarly based on the objects gf.nnr.b$fb and gf.nnr.b$f2b.
Figures 7.4(b) and 7.4(c) show the estimates of the discount and forward rates, respectively. As expected, the GE discount rate is consistently smaller than that of Treasury bonds, representing the higher risk associated with corporate bonds. To assess the difference between the two forward rates, we plot the estimated credit spread and its 95% bootstrap confidence intervals in Figure 7.4(d). The credit spread is positive when time is smaller than 11 years.
7.6.4 A Multiplicative Model for Chickenpox Epidemic
Consider the chickenpox epidemic data consisting of the monthly numbers of chickenpox cases in New York City during 1931–1972. Denote y as the square root of reported cases in month x_1 of year x_2. Both x_1 and x_2 are transformed into the interval [0, 1]. Figure 7.5 shows the time series plot of the square root of the monthly case numbers.
FIGURE 7.5 Chickenpox data, plots of the square root of the number of cases (dotted line), and the fits from the multiplicative model (7.39) (solid line) and SS ANOVA model (7.40) (dashed line).
There is a clear seasonal variation within a year that has long been recognized (Yorke and London 1973, Earn, Rohani, Bolker and Grenfell 2000). The seasonal variation was mainly caused by the social behavior of children, who made close contacts when school was in session, and by temperature and humidity, which may affect the survival and transmission of dispersal stages. Thus the seasonal variations were similar over the years. The magnitude of this variation and the average number of cases may change over the years. Consequently, we consider the following multiplicative model
    y(x_1, x_2) = g_1(x_2) + exp{g_2(x_2)} × g_3(x_1) + ε(x_1, x_2),   (7.39)
where y(x_1, x_2) is the square root of reported cases in month x_1 of year x_2; and g_1, g_3, and exp(g_2) represent, respectively, the yearly mean, the seasonal trend in a year, and the magnitude of the seasonal variation for a particular year. The function exp(g_2) is referred to as the amplitude. A bigger amplitude corresponds to a bigger seasonal variation. For simplicity, we assume that random errors ε(x_1, x_2) are independent with a constant variance.

To make model (7.39) identifiable, we use the following two side conditions:
(a) ∫_0^1 g_2(x_2) dx_2 = 0. The exponential transformation of g_2 and this condition make g_2 identifiable with g_3: the exponential transformation makes exp(g_2) free of a sign change, and ∫_0^1 g_2(x_2) dx_2 = 0 makes exp(g_2) free of a positive multiplying constant. This condition can be fulfilled by removing the constant functions from the model space of g_2.

(b) ∫_0^1 g_3(x_1) dx_1 = 0. This condition eliminates the additive constant, making g_3 identifiable with g_1. This condition can be fulfilled by removing the constant functions from the model space of g_3.
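The need for the side conditions can be seen algebraically: for any constants c and d, g_1 + exp(g_2) g_3 = [g_1 + d exp(g_2)] + exp(g_2 − c) [exp(c)(g_3 − d)], so uncentered versions of g_2 and g_3 are not separately identifiable. A Python sketch with hypothetical components verifies that centering changes the components but not the mean surface:

```python
import numpy as np

def integral(h, x):
    """Trapezoid-rule approximation of int_0^1 h dx on the grid x."""
    return np.sum(0.5 * (h[1:] + h[:-1]) * np.diff(x))

x = np.linspace(0.0, 1.0, 501)
g1 = 20.0 + 5.0 * x                       # hypothetical yearly mean
g2 = np.sin(2 * np.pi * x) + 0.3          # hypothetical, violates (a)
g3 = np.cos(2 * np.pi * x) + 0.1          # hypothetical, violates (b)

# Center g2 and g3; g1 and the scale of g3 absorb the constants.
c, d = integral(g2, x), integral(g3, x)
g2c, g3c = g2 - c, g3 - d
g1c = g1 + d * np.exp(g2)

# Same mean surface y(x1, x2) = g1(x2) + exp{g2(x2)} g3(x1) either way.
lhs = g1[:, None] + np.exp(g2)[:, None] * g3[None, :]
rhs = g1c[:, None] + np.exp(g2c)[:, None] * (np.exp(c) * g3c)[None, :]
assert np.allclose(lhs, rhs)
assert abs(integral(g2c, x)) < 1e-9 and abs(integral(g3c, x)) < 1e-9
```

Removing constant functions from the model spaces of g_2 and g_3 enforces exactly this centering within the fit.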
We model g_1 and g_2 using cubic splines. Specifically, we assume that g_1 ∈ W_2^2[0, 1] and g_2 ∈ W_2^2[0, 1] ⊖ {1}, where constant functions are removed from the model space for g_2 to satisfy side condition (a). Since the seasonal trend g_3 is close to a sinusoidal model, we model g_3 using a trigonometric spline with L = D² + (2π)² (m = 2 in (2.70)). That is, g_3 ∈ W_2^2(per) ⊖ {1}, where constant functions are removed from the model space for g_3 to satisfy side condition (b). Obviously, the multiplicative model (7.39) is a special case of the NNR model with functions denoted as g_1, g_2, and g_3 instead of f_1, f_2, and f_3 (the notations f_i are saved for an SS ANOVA model later). The multiplicative model (7.39) is fitted as follows:
> data(chickenpox)
> y <- sqrt(chickenpox$count)
> x1 <- (chickenpox$month-0.5)/12
> x2 <- ident(chickenpox$year)
> tmp <- ssr(y~1, rk=periodic(x1), spar="m")
> g3.ini <- predict(tmp, term=c(0,1))$fit
> chick.nnr <- nnr(y~g1(x2)+exp(g2(x2))*g3(x1),
    func=list(g1(x)~list(~I(x-.5),cubic(x)),
      g2(x)~list(~I(x-.5)-1,cubic(x)),
      g3(x)~list(~sin(2*pi*x)+cos(2*pi*x)-1,
        lspline(x,type="sine0"))),
    start=list(g1=mean(y),g2=0,g3=g3.ini),
    control=list(converg="coef"), spar="m")
> grid <- data.frame(x1=seq(0,1,len=50),
x2=seq(0,1,len=50))
> chick.nnr.pred <- intervals(chick.nnr, newdata=grid,
    terms=list(g1=matrix(c(1,1,1,1,1,0,0,0,1),
        nrow=3,byrow=T),
      g2=matrix(c(1,1,1,0,0,1),nrow=3,byrow=T),
      g3=matrix(c(1,1,1,1,1,0,0,0,1),nrow=3,byrow=T)))
We first fitted a periodic spline with variable x1 only and used the fitted values of the smooth component as initial values for g_3. We used the average of y as the initial value for g_1 and constant zero as the initial value for g_2. The intervals function was used to compute approximate means and standard deviations for the functions g_1, g_2, and g_3 and their projections. The estimates of g_1 and g_2 are shown in Figure 7.6. We also superimpose yearly averages in Figure 7.6(a) and the logarithm of scaled ranges in Figure 7.6(b). The scaled range of a specific year was calculated as the difference between the maximum and the minimum monthly number of cases divided by the range of the estimated seasonal trend g_3. It is clear that g_1 captures the long-term trend in the mean, and g_2 captures the long-term trend in the range of the seasonal variation. From the estimate of g_1 in Figure 7.6(a), we can see that yearly averages peaked in the 1930s and 1950s, and gradually decreased in the 1960s after the introduction of mass vaccination. The amplitude reflects the seasonal variation in the transmission rate (Yorke and London 1973). From the estimate of g_2 in Figure 7.6(b), we can see that the magnitude of the seasonal variation peaked in the 1950s and then declined in the 1960s, possibly as a result of changing public health conditions including mass vaccination. Figure 7.7 shows the estimate of the seasonal trend g_3, its projection onto the null space H_30 = span{sin 2πx, cos 2πx} (the simple sinusoidal model), and its projection onto the orthogonal complement of the null space H_31 = W_2^2(per) ⊖ span{1, sin 2πx, cos 2πx}. Since the projection onto the complement space is significantly different from zero, a simple sinusoidal model does not provide an accurate approximation.
To check the multiplicative model (7.39), we use the SS ANOVA decomposition in (4.21) and consider the following SS ANOVA model

    y(x_1, x_2) = µ + f_1(x_1) + f_2(x_2) + f_12(x_1, x_2) + ε(x_1, x_2),   (7.40)
where µ is a constant, f_1 and f_2 are the overall main effects of month and year, and f_12 is the overall interaction between month and year. Note that the side conditions ∫_0^1 f_1 dx_1 = ∫_0^1 f_2 dx_2 = ∫_0^1 f_12 dx_1 = ∫_0^1 f_12 dx_2 = 0 are satisfied by the SS ANOVA decomposition. It is not difficult to check
FIGURE 7.6 Chickenpox data, plots of (a) yearly averages (circles), estimate of g1 (line), and 95% Bayesian confidence intervals (shaded region), and (b) yearly scaled ranges on the logarithm scale (circles), estimate of g2 (line), and 95% Bayesian confidence intervals (shaded region).
FIGURE 7.7 Chickenpox data, estimated g3 (overall), and its projections onto H30 (parametric) and H31 (smooth). Dotted lines are 95% Bayesian confidence intervals.
that model (7.40) reduces to model (7.39) iff

    µ = ∫_0^1 g_1 dx_2,
    f_1(x_1) = { ∫_0^1 exp(g_2) dx_2 } g_3(x_1),
    f_2(x_2) = g_1(x_2) − ∫_0^1 g_1 dx_2,
    f_12(x_1, x_2) = [ exp{g_2(x_2)} − ∫_0^1 exp(g_2) dx_2 ] g_3(x_1).
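These identities can be verified numerically: for any (g_1, g_2, g_3), computing µ, f_1, f_2, and f_12 from the formulas above reconstructs the multiplicative surface exactly. A Python sketch with hypothetical components (integrals taken as grid means on a midpoint grid):

```python
import numpy as np

n = 400
x1 = (np.arange(n) + 0.5) / n             # midpoint grid on [0, 1]
x2 = x1.copy()

g1 = 20.0 + 3.0 * np.sin(2 * np.pi * x2)  # hypothetical components
g2 = 0.5 * np.cos(2 * np.pi * x2)
g2 -= g2.mean()                           # side condition (a)
g3 = np.cos(2 * np.pi * x1)               # integrates to ~0, condition (b)

# Multiplicative surface y(x1, x2), rows indexing x2.
y = g1[:, None] + np.exp(g2)[:, None] * g3[None, :]

# SS ANOVA components from the displayed formulas.
mu = g1.mean()
f1 = np.exp(g2).mean() * g3
f2 = g1 - mu
f12 = (np.exp(g2) - np.exp(g2).mean())[:, None] * g3[None, :]

recon = mu + f1[None, :] + f2[:, None] + f12
assert np.allclose(y, recon)
```

The reconstruction is exact term by term, since µ + f_1 + f_2 + f_12 collapses algebraically back to g_1 + exp(g_2) g_3.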
Thus the multiplicative model assumes a multiplicative interaction. We fit the SS ANOVA model (7.40) and compare it with the multiplicative model (7.39) using the AIC, BIC, and GCV criteria:
> chick.ssanova <- ssr(y~x2,
rk=list(periodic(x1),cubic(x2),
rk.prod(periodic(x1),kron(x2-.5)),
rk.prod(periodic(x1),cubic(x2))))
> n <- length(y)
> rss <- c(sum(chick.ssanova$resi**2),
sum(chick.nnr$resi**2))/n
> df <- c(chick.ssanova$df, chick.nnr$df$f)
> gcv <- rss/(1-df/n)**2
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> print(round(rbind(gcv,aic,bic),2))
gcv 8.85 14.14
aic 826.75 1318.03
bic 1999.38 1422.83
The AIC and GCV criteria select the SS ANOVA model, while BIC selects the multiplicative model. The SS ANOVA model captures local trends, particularly the biennial pattern from 1945 to 1955, better than the multiplicative model (Figure 7.5).
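The three criteria are simple functions of the mean squared residual and the degrees of freedom, mirroring the R lines above: GCV = rss/(1 − df/n)², AIC = n log(rss) + 2 df, and BIC = n log(rss) + log(n) df. A Python sketch with hypothetical rss and df values:

```python
import numpy as np

n = 504                          # monthly observations, 1931-1972
# Hypothetical mean squared residuals (rss/n) and degrees of freedom
# for two competing fits; the actual values come from the R objects above.
rss = np.array([0.9, 1.4])
df = np.array([250.0, 25.0])

gcv = rss / (1.0 - df / n) ** 2
aic = n * np.log(rss) + 2.0 * df
bic = n * np.log(rss) + np.log(n) * df

# BIC penalizes degrees of freedom more heavily than AIC when log(n) > 2,
# which is why it can prefer the leaner multiplicative model.
assert np.allclose(bic - aic, (np.log(n) - 2.0) * df)
```

The disagreement between the criteria in the output above reflects exactly this difference in the complexity penalty.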
7.6.5 A Multiplicative Model for Texas Weather
The domains for variables x_1 and x_2 in the multiplicative model (7.39) are not limited to continuous intervals. As in the NNR model, these domains may be arbitrary sets. We now revisit the Texas weather data to which an SS ANOVA model has been fitted in Section 4.9.4. We have defined x_1 = (lat, long) as the geographical location of a station, and x_2 as the month variable scaled into [0, 1]. For illustration purposes only, assuming that temperature profiles from all stations have the same shape except for a vertical shift and scale transformation that may depend on geographical location, we consider the following multiplicative model
    y(x_1, x_2) = g_1(x_1) + exp{g_2(x_1)} × g_3(x_2) + ε(x_1, x_2),   (7.41)
where y(x_1, x_2) is the average temperature of month x_2 at location x_1, g_1 and exp(g_2) represent, respectively, the mean and magnitude of the seasonal variation at location x_1, and g_3(x_2) represents the seasonal trend in a year. For simplicity, we assume that random errors ε(x_1, x_2) are independent with a constant variance. We model the functions g_1 and g_2 using thin-plate splines, and g_3 using a periodic spline. Specifically, for identifiability, we assume that g_1 ∈ W_2^2(R²), g_2 ∈ W_2^2(R²) ⊖ {1}, and g_3 ∈ W_2^2(per) ⊖ {1}. Model (7.41) is fitted as follows:
> g3.ini <- predict(ssr(y~1, rk=periodic(x2), data=tx.dat,
    spar="m"), terms=c(0,1), pstd=F)$fit
> tx.nnr <- nnr(y~g1(x11,x12)+exp(g2(x11,x12))*g3(x2),
func=list(g1(x,z)~list(~x+z,tp(list(x,z))),
g2(x,z)~list(~x+z-1,tp(list(x,z))),
g3(x)~list(periodic(x))),
data=tx.dat, start=list(g1=mean(y),g2=0,g3=g3.ini))
The estimates of g_1 and g_2 are shown in Figure 7.8. From northwest to southeast, the temperature gets warmer and the variation during a year gets smaller.
FIGURE 7.8 Texas weather data, plots of estimates of g1 (left) and g2 (right).
In Section 4.9.4 we fitted the following SS ANOVA model

    y(x_1, x_2) = µ + f_1(x_1) + f_2(x_2) + f_12(x_1, x_2) + ε(x_1, x_2),   (7.42)

where f_1 and f_2 are the overall main effects of location and month, and f_12 is the overall interaction between location and month. The overall main effects and interaction are defined in Section 4.4.6. To compare the multiplicative model (7.41) with the SS ANOVA model fitted in Section 4.9.4, we compute the GCV, AIC, and BIC criteria:
> n <- length(y)
> rss <- c(sum(tx.ssanova$resi**2),sum(tx.nnr$resi**2))
> df <- c(tx.ssanova$df,tx.nnr$df$f)
> gcv <- rss/(1-df/n)**2
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> print(rbind(gcv,aic,bic))
gcv 131.5558 434.5253
aic 2661.9293 3481.2111
bic 3730.4315 3894.1836
All criteria choose the SS ANOVA model. One possible reason is that the assumption of a common temperature profile for all stations is too restrictive. To look at the variation among temperature profiles, we fit each station separately using a periodic spline and compute normalized temperature profiles such that all profiles integrate to zero and have vertical range equal to one. These normalized temperature profiles are shown in Figure 7.9(a). The variation among these temperature profiles may be nonignorable. To further check whether the variation among temperature profiles may be accounted for by a horizontal shift, we align all profiles such that all of them reach their maximum at the point 0.5 and plot the aligned profiles in Figure 7.9(b). The variation among these aligned temperature profiles is again nonignorable.
FIGURE 7.9 Texas weather data, plots of (a) normalized temperature profiles, and (b) aligned normalized temperature profiles.
It is easy to see that the SS ANOVA model (7.42) reduces to the multiplicative model (7.41) iff

    f_12(x_1, x_2) = [ exp{g_2(x_1)} / ∑_{j=1}^{J} w_j exp{g_2(u_j)} − 1 ] f_2(x_2),   (7.43)

where u_j are fixed points in R², and w_j are fixed positive weights such that ∑_{j=1}^{J} w_j = 1. Equation (7.43) suggests the following simple approach to checking the multiplicative model (7.41): for a fixed station (thus a fixed x_1), compute estimates of f_2 and f_12 from the SS ANOVA fit and plot f_12 against f_2 to see if the points fall on a straight line. Figure 7.10 shows plots of f_12 against f_2 for two selected stations. The patterns are quite different from straight lines, especially for the Liberty station. Again, this indicates that the multiplicative model may not be appropriate for this case.
FIGURE 7.10 Texas weather data, plots of f12 against f2 for two stations.
Chapter 8
Semiparametric Regression
8.1 Motivation
Postulating strict relationships between dependent and independent variables, parametric models are, in general, parsimonious. Parameters in these models often have meaningful interpretations. On the other hand, based on minimal assumptions about the relationship, nonparametric models are flexible. However, nonparametric models lose the advantage of having interpretable parameters and may suffer from the curse of dimensionality. Often, in practice, there is enough knowledge to model some components in the regression function parametrically. For other vague and/or nuisance components, one may want to leave them unspecified. Combining both parametric and nonparametric components, a semiparametric regression model can overcome the limitations of parametric and nonparametric models while maintaining the advantages of interpretable parameters and flexibility.
Many specific semiparametric models have been proposed in the literature. The partial spline model (2.43) in Section 2.10 is perhaps the simplest semiparametric regression model. Other semiparametric models include the projection pursuit, single index, varying coefficients, functional linear, and shape invariant models. A projection pursuit regression model (Friedman and Stuetzle 1981) assumes that
    y_i = β_0 + ∑_{k=1}^{r} f_k(β_k^T x_i) + ε_i,   i = 1, …, n,   (8.1)
where x are independent variables, β_0 and β_k are parameters, and f_k are nonparametric functions. A partially linear single index model (Carroll, Fan, Gijbels and Wand 1997, Yu and Ruppert 2002) assumes that

    y_i = β_1^T s_i + f(β_2^T t_i) + ε_i,   i = 1, …, n,   (8.2)
where s and t are independent variables, β_1 and β_2 are parameters, and f is a nonparametric function. A varying-coefficient model (Hastie and Tibshirani 1993) assumes that
    y_i = β_1 + ∑_{k=1}^{r} u_ik f_k(x_ik) + ε_i,   i = 1, …, n,   (8.3)
where x_k and u_k are independent variables, β_1 is a parameter, and f_k are nonparametric functions.

The semiparametric linear and nonlinear regression models in this chapter include all the foregoing models as special cases. The general form of these models provides a framework for unified estimation, inference, and software development.
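As a concrete special case, the varying-coefficient mean in (8.3) is β_1 + ∑_k u_ik f_k(x_ik). A Python sketch evaluating it with hypothetical coefficient functions f_k (the book's estimation machinery is, of course, quite different):

```python
import numpy as np

# Hypothetical smooth coefficient functions f_k for model (8.3).
fks = [np.sin, np.cos]
beta1 = 1.0

def vc_mean(u, x):
    """Varying-coefficient mean beta1 + sum_k u_ik f_k(x_ik) per subject.

    u, x: (n, r) arrays of modifier and index variables."""
    return beta1 + sum(u[:, k] * fks[k](x[:, k]) for k in range(len(fks)))

rng = np.random.default_rng(2)
u = rng.normal(size=(5, 2))
x = rng.uniform(size=(5, 2))
mu = vc_mean(u, x)
assert mu.shape == (5,)
# With u fixed at 1 and x at 0: beta1 + sin(0) + cos(0) = beta1 + 1.
assert np.isclose(vc_mean(np.ones((1, 2)), np.zeros((1, 2)))[0], 2.0)
```

Setting L_ki f_k = u_ik f_k(x_ik) recovers exactly this mean within the general model (8.4) below.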
8.2 Semiparametric Linear Regression Models
8.2.1 The Model
Define a semiparametric linear regression model as
    y_i = s_i^T β + ∑_{k=1}^{r} L_ki f_k + ε_i,   i = 1, …, n,   (8.4)
where s is a q-dimensional vector of independent variables, β is a vector of parameters, L_ki are bounded linear functionals, f_k are unknown functions, and ε = (ε_1, …, ε_n)^T ∼ N(0, σ²W^{−1}). We assume that W depends on an unknown vector of parameters τ. Let X_k be the domain of f_k and assume that f_k ∈ H_k, where H_k is an RKHS on X_k. Note that the domains X_k for different functions may be the same or different.
Model (8.4) is referred to as a semiparametric linear regression model since the mean response depends on β and f_k linearly. Extension to the nonlinear case will be introduced in Section 8.3. The semiparametric linear regression model (8.4) is an extension of the partial spline model (2.43) that allows more than one nonparametric function. It is an extension of the additive models obtained by including a parametric component and allowing general linear functionals. Note that some of the functions f_k may represent main effects and interactions in an SS ANOVA decomposition. Therefore, model (8.4) is also an extension of the SS ANOVA model (5.1), allowing different linear functionals for different components. It is easy to see that the varying-coefficient model (8.3) is a special case of the semiparametric linear regression model with q = 1, s_i = 1, and L_ki f_k = u_ik f_k(x_ik) for k = 1, ..., r. The functional linear models (FLM) are also a special case of the semiparametric linear regression model (see Section 8.4.1). In addition, the random errors are allowed to be correlated and/or have unequal variances.
8.2.2 Estimation and Inference
Assume that H_k = H_k0 ⊕ H_k1, where H_k0 = span{φ_k1, ..., φ_kp_k} and H_k1 is an RKHS with RK R_k1. Let y = (y_1, ..., y_n)^T, f = (f_1, ..., f_r), and

\[ \eta(\beta, f) = \Big( s_1^T \beta + \sum_{k=1}^{r} L_{k1} f_k, \; \ldots, \; s_n^T \beta + \sum_{k=1}^{r} L_{kn} f_k \Big)^T. \]
For fixed W, we estimate β and f as the minimizers of the following PWLS:

\[ \frac{1}{n} (y - \eta(\beta, f))^T W (y - \eta(\beta, f)) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \| P_{k1} f_k \|^2, \tag{8.5} \]

where P_k1 is the projection operator onto H_k1 in H_k. As in (4.32), different smoothing parameters λθ_k^{-1} allow different penalties for each function.
For k = 1, ..., r, let ξ_ki(x) = L_ki(z) R_k1(x, z) and S_k = H_k0 ⊕ span{ξ_k1, ..., ξ_kn}. Let

\[ \mathrm{WLS}(\beta, L_1 f, \ldots, L_n f) = \frac{1}{n} (y - \eta(\beta, f))^T W (y - \eta(\beta, f)) \]

be the weighted LS criterion, where L_i f = (L_{1i} f_1, ..., L_{ri} f_r). Then the PWLS (8.5) can be written as

\[ \mathrm{PWLS}(\beta, L_1 f, \ldots, L_n f) = \mathrm{WLS}(\beta, L_1 f, \ldots, L_n f) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \| P_{k1} f_k \|^2. \]
For any f_k ∈ H_k, write f_k = ς_k1 + ς_k2, where ς_k1 ∈ S_k and ς_k2 ∈ S_k^c. As shown in Section 6.2.1, we have L_ki f_k = L_ki ς_k1. Then for any f,

\[
\begin{aligned}
\mathrm{PWLS}(\beta, L_1 f, \ldots, L_n f)
&= \mathrm{WLS}(\beta, L_1 \varsigma_1, \ldots, L_n \varsigma_1) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \big( \|P_{k1}\varsigma_{k1}\|^2 + \|P_{k1}\varsigma_{k2}\|^2 \big) \\
&\geq \mathrm{WLS}(\beta, L_1 \varsigma_1, \ldots, L_n \varsigma_1) + \lambda \sum_{k=1}^{r} \theta_k^{-1} \|P_{k1}\varsigma_{k1}\|^2 \\
&= \mathrm{PWLS}(\beta, L_1 \varsigma_1, \ldots, L_n \varsigma_1),
\end{aligned}
\]

where L_i ς_1 = (L_{1i} ς_{11}, ..., L_{ri} ς_{r1}) for i = 1, ..., n. Equality holds iff ‖P_k1 ς_k2‖ = ‖ς_k2‖ = 0 for all k = 1, ..., r. Thus the minimizers f_k of the PWLS fall in the finite-dimensional spaces S_k and can be represented as

\[ f_k = \sum_{\nu=1}^{p_k} d_{k\nu} \phi_{k\nu} + \theta_k \sum_{i=1}^{n} c_{ki} \xi_{ki}, \quad k = 1, \ldots, r, \tag{8.6} \]
where the multiplying constants θ_k make the solution and notation similar to those for the SS ANOVA models. Note that we only used the fact that the WLS criterion depends on some bounded linear functionals L_ki in the foregoing arguments. Therefore, the Kimeldorf–Wahba representer theorem holds in general as long as the goodness-of-fit criterion depends on some bounded linear functionals.
Let S = (s_1, ..., s_n)^T, T_k = {L_ki φ_kν} with rows i = 1, ..., n and columns ν = 1, ..., p_k for k = 1, ..., r, T = (T_1, ..., T_r), and X = (S T). Let Σ_k = {L_ki L_kj R_k1}_{i,j=1}^n for k = 1, ..., r, and Σ_θ = Σ_{k=1}^r θ_k Σ_k, where θ = (θ_1, ..., θ_r). Let f_k = (L_k1 f_k, ..., L_kn f_k)^T, c_k = (c_k1, ..., c_kn)^T, and d_k = (d_k1, ..., d_kp_k)^T for k = 1, ..., r. Based on (8.6), f_k = T_k d_k + θ_k Σ_k c_k for k = 1, ..., r. Let d = (d_1^T, ..., d_r^T)^T. The overall fit is

\[ \eta = S\beta + \sum_{k=1}^{r} f_k = S\beta + T d + \sum_{k=1}^{r} \theta_k \Sigma_k c_k = X\alpha + \sum_{k=1}^{r} \theta_k \Sigma_k c_k, \]
where α = (β^T, d^T)^T. Plugging back into (8.5), we have

\[ \frac{1}{n} \Big( y - X\alpha - \sum_{k=1}^{r} \theta_k \Sigma_k c_k \Big)^T W \Big( y - X\alpha - \sum_{k=1}^{r} \theta_k \Sigma_k c_k \Big) + \lambda \sum_{k=1}^{r} \theta_k c_k^T \Sigma_k c_k. \tag{8.7} \]

Taking the first derivatives with respect to c_k and α, we have

\[
\begin{aligned}
\Sigma_k W \Big( y - X\alpha - \sum_{k=1}^{r} \theta_k \Sigma_k c_k \Big) - n\lambda \Sigma_k c_k &= 0, \quad k = 1, \ldots, r, \\
X^T W \Big( y - X\alpha - \sum_{k=1}^{r} \theta_k \Sigma_k c_k \Big) &= 0.
\end{aligned} \tag{8.8}
\]
When all Σ_k are nonsingular, from the first equation in (8.8) we must have c_1 = ··· = c_r. Setting c_1 = ··· = c_r = c, it is easy to see that solutions to

\[
\begin{aligned}
(\Sigma_\theta + n\lambda W^{-1}) c + X\alpha &= y, \\
X^T c &= 0,
\end{aligned} \tag{8.9}
\]

are also solutions to (8.8). The equations in (8.9) have the same form as those in (5.6). Therefore, a procedure similar to that in Section 5.2 can be used to compute the coefficients c and d. Let the QR decomposition of X be

\[ X = (Q_1 \; Q_2) \begin{pmatrix} R \\ 0 \end{pmatrix} \]

and let M = Σ_θ + nλW^{-1}. Then the solutions are

\[
\begin{aligned}
c &= Q_2 (Q_2^T M Q_2)^{-1} Q_2^T y, \\
d &= R^{-1} Q_1^T (y - M c).
\end{aligned} \tag{8.10}
\]

Furthermore,

\[ \hat{\eta} = H(\lambda, \theta, \tau) y, \]

where

\[ H(\lambda, \theta, \tau) = I - n\lambda W^{-1} Q_2 (Q_2^T M Q_2)^{-1} Q_2^T \tag{8.11} \]

is the hat matrix.

When W is known, the UBR, GCV, and GML criteria presented in Section 5.2.3 can be used to estimate the smoothing parameters. We now extend the GML method to estimate the smoothing and covariance parameters simultaneously when W is unknown.
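The computation in (8.9)–(8.11) can be sketched numerically. The snippet below builds a toy instance (the Gaussian RK gram matrix standing in for Σ_θ, the iid errors, and all dimensions are illustrative assumptions), solves for c and d via the full QR decomposition of X, and verifies that X^T c = 0 and that the fit equals Hy:

```python
import numpy as np

# Numerical sketch of (8.9)-(8.11).  The kernel used for Sigma_theta (a
# Gaussian RK) and all dimensions are illustrative assumptions.
rng = np.random.default_rng(1)
n, lam = 30, 1e-2
x = np.linspace(0, 1, n)
X = np.column_stack([np.ones(n), x - 0.5])             # X = (S T)
Sigma = np.exp(-np.subtract.outer(x, x) ** 2 / 0.1)    # Sigma_theta
W_inv = np.eye(n)                                      # iid errors for simplicity
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)

M = Sigma + n * lam * W_inv
Q, R = np.linalg.qr(X, mode="complete")                # X = (Q1 Q2)(R1; 0)
p = X.shape[1]
Q1, Q2, R1 = Q[:, :p], Q[:, p:], R[:p, :]

# (8.10): coefficients
c = Q2 @ np.linalg.solve(Q2.T @ M @ Q2, Q2.T @ y)
alpha = np.linalg.solve(R1, Q1.T @ (y - M @ c))

# (8.11): hat matrix, so that the fit eta equals H y
H = np.eye(n) - n * lam * W_inv @ Q2 @ np.linalg.inv(Q2.T @ M @ Q2) @ Q2.T
eta = X @ alpha + Sigma @ c
print(np.allclose(X.T @ c, 0), np.allclose(H @ y, eta))   # True True
```

In practice the ssr function carries out this computation with the smoothing parameters selected automatically; the sketch fixes λ and W for transparency.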
We first introduce a Bayes model for the semiparametric linear regression model (8.4). Since the parametric component s^T β can be absorbed into any one of the null spaces H_k0 in the setup of a Bayes model, for simplicity of notation we drop this term in the following discussion. Assume priors for f_k of the form

\[ F_k(x_k) = \sum_{\nu=1}^{p_k} \zeta_{k\nu} \phi_{k\nu}(x_k) + \sqrt{\delta \theta_k}\, U_k(x_k), \quad k = 1, \ldots, r, \tag{8.12} \]

where the ζ_kν are iid N(0, κ); the U_k(x_k) are independent zero-mean Gaussian stochastic processes with covariance function R_k1(x_k, z_k); the ζ_kν and U_k are mutually independent; and κ and δ are positive constants. Suppose that observations are generated by

\[ y_i = \sum_{k=1}^{r} L_{ki} F_k + \epsilon_i, \quad i = 1, \ldots, n. \tag{8.13} \]
Let L_k0 be a bounded linear functional on H_k. Let λ = σ²/nδ. The same arguments as in Section 3.6 hold when M = Σ + nλI is replaced by M = Σ_θ + nλW^{-1} in this chapter. Therefore,

\[ \lim_{\kappa \to \infty} E(L_{k0} F_k \mid y) = L_{k0} \hat{f}_k, \quad k = 1, \ldots, r, \]

and an extension of the GML criterion is

\[ \mathrm{GML}(\lambda, \theta, \tau) = \frac{y^T W (I - H) y}{\big[ \mathrm{det}^{+}(W(I - H)) \big]^{1/(n-p)}}, \tag{8.14} \]

where det⁺ denotes the product of the nonzero eigenvalues and p = Σ_{k=1}^r p_k. As in Section 5.2.4, we fit the corresponding LME model (5.27), with T replaced by X in this section, to compute the minimizers of the GML criterion (8.14).
For clustered data, the leaving-out-one-cluster approach presented in Section 5.2.2 may also be used to estimate the smoothing and covariance parameters.
We now discuss how to construct Bayesian confidence intervals. Any function f_k ∈ H_k can be represented as

\[ f_k = \sum_{\nu=1}^{p_k} f_{0k\nu} + f_{1k}, \tag{8.15} \]

where f_0kν ∈ span{φ_kν} for ν = 1, ..., p_k, and f_1k ∈ H_k1. Our goal is to construct Bayesian confidence intervals for

\[ L_{k0} f_{k,\gamma_k} = \sum_{\nu=1}^{p_k} \gamma_{k,\nu} L_{k0} f_{0k\nu} + \gamma_{k,p_k+1} L_{k0} f_{1k} \tag{8.16} \]

for any bounded linear functional L_k0 on H_k and any combination γ_k = (γ_{k,1}, ..., γ_{k,p_k+1})^T, where γ_{k,j} = 1 when the corresponding component in (8.15) is to be included and 0 otherwise.
Let F_0jν = ζ_jν φ_jν for ν = 1, ..., p_j and F_1j = √(δθ_j) U_j for j = 1, ..., r. Let L_0j and L_0j1 be bounded linear functionals on H_j, and let L_0k2 be a bounded linear functional on H_k.
Posterior means and covariances

For j = 1, ..., r and ν = 1, ..., p_j, the posterior means are

\[
\begin{aligned}
E(L_{0j} F_{0j\nu} \mid y) &= (L_{0j} \phi_{j\nu})\, e_{j,\nu}^T d, \\
E(L_{0j} F_{1j} \mid y) &= \theta_j (L_{0j} \xi_j)^T c.
\end{aligned} \tag{8.17}
\]

For j = 1, ..., r, k = 1, ..., r, ν = 1, ..., p_j, and µ = 1, ..., p_k, the posterior covariances are

\[
\begin{aligned}
\delta^{-1} \mathrm{Cov}(L_{0j1} F_{0j\nu}, L_{0k2} F_{0k\mu} \mid y) &= (L_{0j1}\phi_{j\nu})(L_{0k2}\phi_{k\mu})\, e_{j,\nu}^T A\, e_{k,\mu}, \\
\delta^{-1} \mathrm{Cov}(L_{0j1} F_{0j\nu}, L_{0k2} F_{1k} \mid y) &= -\theta_k (L_{0j1}\phi_{j\nu})\, e_{j,\nu}^T B (L_{0k2}\xi_k), \\
\delta^{-1} \mathrm{Cov}(L_{0j1} F_{1j}, L_{0k2} F_{1k} \mid y) &= \delta_{j,k} \theta_k L_{0j1} L_{0k2} R_{k1} - \theta_j \theta_k (L_{0j1}\xi_j)^T C (L_{0k2}\xi_k),
\end{aligned} \tag{8.18}
\]

where e_{j,µ} is a vector of dimension Σ_{l=1}^r p_l with the (Σ_{l=1}^{j-1} p_l + µ)th element equal to one and all other elements equal to zero; c and d are given in (8.10); L_{0j1}ξ_j = (L_{0j1}L_{j1}R_{j1}, ..., L_{0j1}L_{jn}R_{j1})^T; L_{0k2}ξ_k = (L_{0k2}L_{k1}R_{k1}, ..., L_{0k2}L_{kn}R_{k1})^T; M = Σ_θ + nλW^{-1}; A = (T^T M^{-1} T)^{-1}; B = A T^T M^{-1}; and C = M^{-1}(I − TB). For simplicity of notation, we define Σ_{l=1}^0 p_l = 0.
Derivation of the above results can be found in Wang and Ke (2009). The posterior mean and variance of L_j0 f_{j,γ_j} in (8.16) can be calculated using the above formulae. Bayesian confidence intervals for L_j0 f_{j,γ_j} can then be constructed. Bootstrap confidence intervals can also be constructed as in previous chapters.
The semiparametric linear regression model (8.4) can be fitted by the ssr function. The independent variables s and the null-space bases φ_k1, ..., φ_kp_k for k = 1, ..., r are specified on the right-hand side of the formula argument, and the RKs R_k1 for k = 1, ..., r are specified in the rk argument as a list. For non-iid random errors, variance and/or correlation structures are specified using the arguments weights and correlation. The argument spar specifies a method for selecting the smoothing parameter(s). The UBR, GCV, and GML methods are available when W is known, and the GML method is available when W needs to be estimated. The predict function can be used to compute posterior means and standard deviations. See Section 8.4.1 for examples.
8.2.3 Vector Spline
Suppose we have observations on r dependent variables z_1, ..., z_r. Assume the following partial spline models:

\[ z_{jk} = s_{jk}^T \beta_k + L_{kj} f_k + \varepsilon_{jk}, \quad k = 1, \ldots, r; \; j = 1, \ldots, n_k, \tag{8.19} \]

where z_jk is the jth observation on z_k; s_jk is the jth observation on a q_k-dimensional vector of independent variables s_k; f_k ∈ H_k is an unknown function; H_k is an RKHS on an arbitrary set X_k; L_kj is a bounded linear functional; and ε_jk is a random error. Model (8.19) is a semiparametric extension of the linear seemingly unrelated regression model. For simplicity, it is assumed that the regression model for each dependent variable involves only one nonparametric function. The following discussion holds when the partial spline models in (8.19) are replaced by the semiparametric linear regression model (8.4).
There are two possible approaches to estimating the parameters β = (β_1^T, ..., β_r^T)^T and the functions f = (f_1, ..., f_r). The first approach is to fit the partial spline models in (8.19) separately, once for each dependent variable. The second approach is to fit all partial spline models in (8.19) simultaneously, which can be more efficient when the random errors are correlated (Wang, Guo and Brown 2000, Smith and Kohn 2000). We now discuss how to accomplish the second approach using the methods in this section.
Let m_1 = 0 and m_k = Σ_{l=1}^{k-1} n_l for k = 2, ..., r. Let i = m_k + j for j = 1, ..., n_k and k = 1, ..., r. Then there is a one-to-one correspondence between i and (j, k). Define y_i = z_jk for i = 1, ..., n, where n = Σ_{l=1}^r n_l. Then the partial spline models in (8.19) can be written jointly as

\[
\begin{aligned}
y_i &= s_{jk}^T \beta_k + L_{kj} f_k + \varepsilon_{jk} \\
&= \sum_{l=1}^{r} \delta_{k,l} s_{jl}^T \beta_l + \sum_{l=1}^{r} \delta_{k,l} L_{lj} f_l + \varepsilon_{jk} \\
&= s_i^T \beta + \sum_{l=1}^{r} L_{li} f_l + \epsilon_i,
\end{aligned} \tag{8.20}
\]

where δ_{k,l} is the Kronecker delta, s_i^T = (δ_{k,1} s_{j1}^T, ..., δ_{k,r} s_{jr}^T), L_{li} = δ_{k,l} L_{lj}, and ε_i = ε_jk. Assume that ε = (ε_1, ..., ε_n)^T ∼ N(0, σ²W^{-1}). It is easy to see that the L_li are bounded linear functionals. Thus the model (8.20) for all dependent variables is a special case of the semiparametric linear regression model (8.4). Therefore, the estimation and inference methods described in Section 8.2.2 can be used. In particular, all parameters β and nonparametric functions f are estimated jointly based on the PWLS (8.5). In comparison, the first approach, which fits model (8.19) separately for each dependent variable, is equivalent to fitting model (8.20) based on the PLS.
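The index map that stacks the r partial spline models into the single model (8.20) can be sketched in a few lines (the sample sizes below are illustrative):

```python
# Sketch of the index map used to stack model (8.19) into model (8.20):
# i = m_k + j with m_1 = 0 and m_k = n_1 + ... + n_{k-1}.  The sample
# sizes are illustrative.
n_k = [3, 2, 4]                      # n_1, n_2, n_3 for r = 3 responses
m = [0]
for nk in n_k[:-1]:
    m.append(m[-1] + nk)             # m_k = sum_{l<k} n_l

def to_i(j, k):                      # (j, k) -> i, with j, k one-based
    return m[k - 1] + j

pairs = [(j, k) for k in range(1, len(n_k) + 1) for j in range(1, n_k[k - 1] + 1)]
idx = [to_i(j, k) for (j, k) in pairs]
print(idx)   # [1, 2, 3, 4, 5, 6, 7, 8, 9]: a one-to-one correspondence
```

Each pair (j, k) maps to a distinct i, so the stacked response y simply concatenates z_1, ..., z_r.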
As an interesting special case, consider the following SSR model for r = 2 dependent variables:

\[ z_{jk} = f_k(x_{jk}) + \varepsilon_{jk}, \quad k = 1, 2; \; j = 1, \ldots, n_k, \tag{8.21} \]

where the model space of f_k is an RKHS H_k on X_k. Assume that H_k = H_k0 ⊕ H_k1, where H_k0 = span{φ_k1, ..., φ_kp_k} and H_k1 is an RKHS with RK R_k1. Then it is easy to check that

\[ L_{ki}\phi_{k\nu} = \begin{cases} L_{1i}\phi_{1\nu}, & 1 \le i \le n_1, \\ L_{2i}\phi_{2\nu}, & n_1 < i \le n_1 + n_2, \end{cases} \]

and

\[ L_{ki}L_{kj}R_{k1} = \begin{cases} L_{1i}L_{1j}R_{11}, & 1 \le i, j \le n_1, \\ L_{2i}L_{2j}R_{21}, & n_1 < i, j \le n_1 + n_2. \end{cases} \]
For illustration, we generate a data set from model (8.21) with X_1 = X_2 = [0, 1], n_1 = n_2 = 100, x_i1 = x_i2 = i/n, f_1(x) = sin(2πx), f_2(x) = sin(2πx) + 2x, and paired random errors (ε_i1, ε_i2) that are iid bivariate normal with mean zero, Var(ε_i1) = 0.25, Var(ε_i2) = 1, and Cor(ε_i1, ε_i2) = 0.8.
> n <- 100; s1 <- .5; s2 <- 1; r <- .8
> A <- diag(c(s1,s2))%*%matrix(c(sqrt(1-r**2),0,r,1),2,2)
> e <- NULL; for (i in 1:n) e <- c(e,A%*%rnorm(2))
> x <- 1:n/n
> y1 <- sin(2*pi*x) + e[seq(1,2*n,by=2)]
> y2 <- sin(2*pi*x) + 2*x + e[seq(2,2*n,by=2)]
> bisp.dat <- data.frame(y=c(y1,y2),x=rep(x,2),
id=as.factor(rep(c(0,1),rep(n,2))), pair=rep(1:n,2))
We model both f1 and f2 using the cubic spline space W_2^2[0, 1] under the construction in Section 2.6. We first fit each SSR model in (8.21) separately and compute posterior means and standard deviations:
> bisp.fit1 <- ssr(y~I(x-.5), rk=cubic(x), spar="m",
                   data=bisp.dat[bisp.dat$id==0,])
> bisp.p1 <- predict(bisp.fit1)
> bisp.fit2 <- ssr(y~I(x-.5), rk=cubic(x), spar="m",
                   data=bisp.dat[bisp.dat$id==1,])
> bisp.p2 <- predict(bisp.fit2)
The true functions f1 and f2, their estimates, and confidence intervals based on the separate fits are shown in the top panel of Figure 8.1.

Next we fit the SSR models in (8.21) jointly, compute posterior means and standard deviations, and compare the posterior standard deviations with those based on the separate fits:
> bisp.fit3 <- ssr(y~id/I(x-.5)-1,
       rk=list(rk.prod(cubic(x),kron(id==0)),
               rk.prod(cubic(x),kron(id==1))), spar="m",
       weights=varIdent(form=~1|id),
       correlation=corSymm(form=~1|pair), data=bisp.dat)
> summary(bisp.fit3)
...
Coefficients (d):
id0 id1 id0:I(x - 0.5) id1:I(x - 0.5)
-0.002441981 1.059626687 -0.440873366 2.086878037
GML estimate(s) of smoothing parameter(s) :
8.358606e-06 5.381727e-06
Equivalent Degrees of Freedom (DF): 13.52413
Estimate of sigma: 0.483716
FIGURE 8.1 Plots of the true functions (dotted lines), cubic spline estimates (solid lines), and 95% Bayesian confidence intervals (shaded regions) of f1 (left) and f2 (right). Plots in the top panel are based on the separate fits, and plots in the bottom panel are based on the joint fit.
Correlation structure of class corSymm representing
Correlation:
1
2 0.777
Variance function structure of class varIdent representing
0 1
1.000000 1.981937
> bisp.p31 <- predict(bisp.fit3,
       newdata=bisp.dat[bisp.dat$id==0,],
       terms=c(1,0,1,0,1,0))
> bisp.p32 <- predict(bisp.fit3,
       newdata=bisp.dat[bisp.dat$id==1,],
       terms=c(0,1,0,1,0,1))
> mean((bisp.p1$pstd-bisp.p31$pstd)/bisp.p1$pstd)
0.08417699
> mean((bisp.p2$pstd-bisp.p32$pstd)/bisp.p2$pstd)
0.04500096
An arbitrary pairwise variance–covariance structure was assumed, and it was specified with the combination of the weights and correlation options. On average, the posterior standard deviations based on the joint fit are smaller than those based on separate fits. Estimates and confidence intervals based on the joint fit are shown in the bottom panel of Figure 8.1.
In many applications, the domains of f1 and f2 are the same; that is, X_1 = X_2 = X. Then we can rewrite f_j(x) as f(j, x) and regard it as a bivariate function of j and x defined on the product domain {1, 2} ⊗ X. The joint approach described above for fitting the SSR models in (8.21) is equivalent to representing the original functions as

\[ f(j, x) = \delta_{j,1} f_1(x) + \delta_{j,2} f_2(x). \tag{8.22} \]

Sometimes the main interest is the difference between f1 and f2: d(x) = f2(x) − f1(x). We may reparametrize f(j, x) as

\[ f(j, x) = f_1(x) + \delta_{j,2}\, d(x) \tag{8.23} \]

or

\[ f(j, x) = \frac{1}{2}\{f_1(x) + f_2(x)\} + \frac{1}{2}(\delta_{j,2} - \delta_{j,1})\, d(x). \tag{8.24} \]
Models (8.23) and (8.24) correspond to the SS ANOVA decompositions of f(j, x) with the set-to-zero and sum-to-zero side conditions, respectively. The following statements fit model (8.23) and compute posterior means and standard deviations of d(x):
> bisp.fit4 <- update(bisp.fit3,
y~I(x-.5)+I(id==1)+I((x-.5)*(id==1)),
rk=list(cubic(x),rk.prod(cubic(x),kron(id==1))))
> bisp.p41 <- predict(bisp.fit4,
newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,1,1,0,1))
where d(x) is modeled using the cubic spline space W_2^2[0, 1] under the construction in Section 2.6. The function d(x) and its estimate are shown in the left panel of Figure 8.2. Model (8.24) can be fitted similarly. Sometimes it is of interest to check whether f1 and f2 are parallel rather than whether d(x) = 0. Let d1(x) be the projection of d(x) onto W_2^2[0, 1] ⊖ {1}. Then f1 and f2 are parallel iff d1(x) = 0. Similarly, the projection of d(x) onto W_2^2[0, 1] ⊖ {1} ⊖ {x − .5}, denoted d2(x), can be used to check whether f1 and f2 differ by a linear function. We compute posterior means and standard deviations of d1(x) and d2(x) as follows:
> bisp.p42 <- predict(bisp.fit4,
newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,0,1,0,1))
> bisp.p43 <- predict(bisp.fit4,
newdata=bisp.dat[bisp.dat$id==1,], terms=c(0,0,0,0,0,1))
The functions d1(x) and d2(x), their estimates, and 95% Bayesian confidence intervals are shown in the middle and right panels of Figure 8.2. We can see that f1 and f2 are not parallel but differ by a linear function.
FIGURE 8.2 Plots of the functions d(x) (left), d1(x) (middle), and d2(x) (right) as dotted lines. Estimates of these functions are plotted as solid lines, and 95% Bayesian confidence intervals are marked as the shaded regions.
We now introduce a more sophisticated SS ANOVA decomposition that can be used to investigate various relationships between f1 and f2. Define two averaging operators A_1^{(1)} and A_1^{(2)} such that

\[
\begin{aligned}
A_1^{(1)} f &= w_1 f(1, x) + w_2 f(2, x), \\
A_1^{(2)} f &= \int_{\mathcal{X}} f(j, x)\, dP(x),
\end{aligned}
\]

where w_1 + w_2 = 1 and P is a probability measure on X. Then we have
the following SS ANOVA decomposition:

\[ f(j, x) = \mu + g_1(j) + g_2(x) + g_{12}(j, x), \tag{8.25} \]

where

\[
\begin{aligned}
\mu &= A_1^{(1)} A_1^{(2)} f = \int_{\mathcal{X}} \{w_1 f_1(x) + w_2 f_2(x)\}\, dP(x), \\
g_1(j) &= (I - A_1^{(1)}) A_1^{(2)} f = \int_{\mathcal{X}} f_j(x)\, dP(x) - \mu, \\
g_2(x) &= A_1^{(1)} (I - A_1^{(2)}) f = w_1 f_1(x) + w_2 f_2(x) - \mu, \\
g_{12}(j, x) &= (I - A_1^{(1)})(I - A_1^{(2)}) f = f_j(x) - \mu - g_1(j) - g_2(x).
\end{aligned}
\]
The constant µ is the overall mean, the marginal functions g1 and g2 are the main effects, and the bivariate function g12 is the interaction.
The SS ANOVA decomposition (8.25) makes certain hypotheses more transparent. For example, it is easy to check that the following hypotheses are equivalent:

\[
\begin{aligned}
H_0: f_1(x) = f_2(x) &\iff H_0: g_1(j) + g_{12}(j, x) = 0, \\
H_0: f_1(x) - f_2(x) = \text{constant} &\iff H_0: g_{12}(j, x) = 0, \\
H_0: \int_{\mathcal{X}} f_1(x)\, dP(x) = \int_{\mathcal{X}} f_2(x)\, dP(x) &\iff H_0: g_1(j) = 0, \\
H_0: w_1 f_1(x) + w_2 f_2(x) = \text{constant} &\iff H_0: g_2(x) = 0.
\end{aligned}
\]

Furthermore, if g_1(j) ≠ 0 and g_2(x) ≠ 0,

\[ H_0: a f_1(x) + b f_2(x) = c, \; |a| + |b| > 0 \iff H_0: g_{12}(j, x) = \beta\, g_1(j)\, g_2(x). \]
Therefore, the hypothesis that f1 and f2 are equal is equivalent to g1(j) + g12(j, x) = 0. The hypothesis that f1 and f2 are parallel is equivalent to the hypothesis that the interaction g12 = 0. The hypothesis that the integrals of f1 and f2 with respect to the probability measure P are equal is equivalent to the hypothesis that the main effect g1(j) = 0. The hypothesis that the weighted average of f1 and f2 is a constant is equivalent to the hypothesis that the main effect g2(x) = 0. Note that the probability measure P and the weights w_j are arbitrary and can be selected for specific hypotheses. Under the specified conditions, the hypothesis that there exists a linear relationship between the functions f1 and f2 is equivalent to the hypothesis that the interaction is multiplicative. Thus, for these hypotheses, we can fit the SS ANOVA model (8.25) and perform tests on the corresponding components.
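The decomposition (8.25) and its side conditions can be checked numerically. The sketch below uses w_1 = w_2 = 1/2 and takes P to be the uniform measure on a grid, with f_1 and f_2 as in the simulated example earlier in this section; all of these choices are illustrative.

```python
import numpy as np

# Numerical check of the SS ANOVA decomposition (8.25) with w1 = w2 = 1/2
# and P the uniform measure on a grid (both illustrative choices).
x = np.linspace(0, 1, 200)
f = np.vstack([np.sin(2 * np.pi * x),               # f_1
               np.sin(2 * np.pi * x) + 2 * x])      # f_2
w = np.array([0.5, 0.5])

mu = w @ f.mean(axis=1)                             # overall mean
g1 = f.mean(axis=1) - mu                            # main effect of j
g2 = w @ f - mu                                     # main effect of x
g12 = f - mu - g1[:, None] - g2[None, :]            # interaction

# The components reproduce f and satisfy the sum-to-zero side conditions
print(np.allclose(mu + g1[:, None] + g2[None, :] + g12, f))  # True
print(np.allclose(w @ g1, 0), np.allclose(w @ g12, 0))       # True True
```

Here f_2 − f_1 = 2x is not constant, so the interaction g12 is nonzero, matching the conclusion above that f1 and f2 are not parallel.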
8.3 Semiparametric Nonlinear Regression Models
8.3.1 The Model
A semiparametric nonlinear regression (SNR) model assumes that
\[ y_i = N_i(\beta, f) + \epsilon_i, \quad i = 1, \ldots, n, \tag{8.26} \]

where β = (β_1, ..., β_q)^T ∈ R^q is a vector of parameters, f = (f_1, ..., f_r) are unknown functions, f_k belongs to an RKHS H_k on an arbitrary domain X_k for k = 1, ..., r, N_i are known nonlinear functionals on R^q × H_1 × ··· × H_r, and ε = (ε_1, ..., ε_n)^T ∼ N(0, σ²W^{-1}).

It is obvious that the SNR model (8.26) is an extension of the NNR model (7.4) that allows parameters in the model. The mean function may depend on both the parameters β and the nonparametric functions f nonlinearly. The nonparametric functions f are regarded as parameters just like β. Certain constraints may be required to make an SNR model identifiable. Specific conditions depend on the form of a model and the purpose of an analysis. Often, identifiability can be achieved by adding constraints on parameters, absorbing some parameters into f, and/or adding constraints on f by removing certain components from the model spaces. Illustrations of how to make an SNR model identifiable can be found in Section 8.4.
The semiparametric linear regression model (8.4) is a special case of the SNR model when N_i are linear in β and f. When N_i are linear in f for fixed β, model (8.26) can be expressed as

\[ y_i = \alpha(\beta; x_i) + \sum_{k=1}^{r} L_{ki}(\beta) f_k + \epsilon_i, \quad i = 1, \ldots, n, \tag{8.27} \]

where α is a known linear or nonlinear function of independent variables x = (x_1, ..., x_d), x_i = (x_i1, ..., x_id), and L_ki(β) are bounded linear functionals that may depend on β. Model (8.27) will be referred to as a semiparametric conditional linear model.
One special case of model (8.27) is

\[ y_i = \alpha(\beta; x_i) + \sum_{k=1}^{r} \delta_k(\beta; x_i)\, f_k(\gamma_k(\beta; x_i)) + \epsilon_i, \quad i = 1, \ldots, n, \tag{8.28} \]

where α, δ_k, and γ_k are known functions. Containing many existing models as special cases, model (8.28) is interesting in its own right. It is obvious that both nonlinear regression and nonparametric regression models are special cases of model (8.28). The projection pursuit regression model (8.1) is a special case with α(β; x) = β_0, δ_k(β; x) ≡ 1, and γ_k(β; x) = β_k^T x, where β = (β_0, β_1^T, ..., β_r^T)^T. The partially linear single index model (8.2) is a special case with r = 1, x = (s^T, t^T)^T, α(β; x) = β_1^T s, δ_1(β; x) ≡ 1, γ_1(β; x) = β_2^T t, and β = (β_1^T, β_2^T)^T. Other special cases can be found in Section 8.4.

Sometimes one may want to investigate how the parameters β depend on other covariates. One approach is to build a second-stage linear model, β = Aϑ, where A is a known matrix. See Section 7.5 in Pinheiro and Bates (2000) for details. The general form of models (8.26), (8.27), and (8.28) remains the same when the second-stage model is plugged in. Therefore, the estimation procedures in Section 8.3.3 also apply to the SNR model combined with a second-stage model with ϑ as parameters.
Let y = (y_1, ..., y_n)^T, η(β, f) = (N_1(β, f), ..., N_n(β, f))^T, and ε = (ε_1, ..., ε_n)^T. Then model (8.26) can be written in vector form as

\[ y = \eta(\beta, f) + \epsilon. \tag{8.29} \]
8.3.2 SNR Models for Clustered Data

Clustered data such as repeated measures, longitudinal, and multilevel data are common in practice. The SNR model for clustered data assumes that

\[ y_{ij} = N_{ij}(\beta_i, f) + \epsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i, \tag{8.30} \]

where y_ij is the jth observation in cluster i; β_i = (β_i1, ..., β_iq)^T ∈ R^q is a vector of parameters for cluster i; f = (f_1, ..., f_r) are unknown functions; f_k belongs to an RKHS H_k on an arbitrary domain X_k for k = 1, ..., r; N_ij are known nonlinear functionals on R^q × H_1 × ··· × H_r; and ε_ij are random errors. Let ε_i = (ε_i1, ..., ε_in_i)^T and ε = (ε_1^T, ..., ε_m^T)^T. We assume that ε ∼ N(0, σ²W^{-1}). Usually, observations are correlated within a cluster and independent between clusters. In this case, W^{-1} is block diagonal.
Again, when N_ij are linear in f for fixed β_i, model (8.30) can be expressed as

\[ y_{ij} = \alpha(\beta_i; x_{ij}) + \sum_{k=1}^{r} L_{kij}(\beta_i) f_k + \epsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i, \tag{8.31} \]

where α is a known function of independent variables x = (x_1, ..., x_d), x_ij = (x_ij1, ..., x_ijd), and L_kij(β) are bounded linear functionals that may depend on β. Model (8.31) will be referred to as a semiparametric conditional linear model for clustered data.
Similar to model (8.28), one special case of model (8.31) is

\[ y_{ij} = \alpha(\beta_i; x_{ij}) + \sum_{k=1}^{r} \delta_k(\beta_i; x_{ij})\, f_k(\gamma_k(\beta_i; x_{ij})) + \epsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i, \tag{8.32} \]

where α, δ_k, and γ_k are known functions. Model (8.32) can be regarded as an extension of the self-modeling nonlinear regression (SEMOR) model proposed by Lawton, Sylvestre and Maggio (1972). In particular, the shape invariant model (SIM) (Lawton et al. 1972, Wang and Brown 1996),

\[ y_{ij} = \beta_{i1} + \beta_{i2}\, f\!\left( \frac{x_{ij} - \beta_{i3}}{\beta_{i4}} \right) + \epsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i, \tag{8.33} \]

is a special case of model (8.32) with d = 1, r = 1, q = 4, α(β_i; x_ij) = β_i1, δ_1(β_i; x_ij) = β_i2, and γ_1(β_i; x_ij) = (x_ij − β_i3)/β_i4. Again, a second-stage linear model may also be constructed for the parameters β_i, and the estimation procedures in Section 8.3.3 apply to the combined model.
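The structure of the SIM (8.33) — one common shape curve, shifted and scaled per cluster — is easy to see in a small numerical sketch. The shape function and parameter values below are illustrative assumptions.

```python
import numpy as np

# Sketch of the shape invariant model (8.33): every cluster's mean curve is
# a shifted/scaled copy of one common shape f.  The shape f(x) = sin(2*pi*x)
# and the cluster parameters below are illustrative assumptions.
f = lambda x: np.sin(2 * np.pi * x)
x = np.linspace(0, 1, 5)

# beta_i = (beta_i1, beta_i2, beta_i3, beta_i4): offset, amplitude, shift, scale
betas = [(0.0, 1.0, 0.0, 1.0),       # cluster 1: f itself
         (1.0, 2.0, 0.25, 1.0)]      # cluster 2: shifted and rescaled

def sim_mean(beta, x):
    b1, b2, b3, b4 = beta
    return b1 + b2 * f((x - b3) / b4)

# Cluster 1 reduces to the common shape; cluster 2 is an affine transform of it
print(np.allclose(sim_mean(betas[0], x), f(x)))                 # True
print(np.allclose(sim_mean(betas[1], x), 1 + 2 * f(x - 0.25)))  # True
```

Identifiability is the practical issue here: without normalizing constraints (e.g., fixing one cluster's offset and amplitude), the shape f and the parameters β_i can trade off against each other.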
Let n = Σ_{i=1}^m n_i, y_i = (y_i1, ..., y_in_i)^T, y = (y_1^T, ..., y_m^T)^T, β = (β_1^T, ..., β_m^T)^T, η_i(β_i, f) = (N_i1(β_i, f), ..., N_in_i(β_i, f))^T, and η(β, f) = (η_1^T(β_1, f), ..., η_m^T(β_m, f))^T. Then model (8.30) can be written in the vector form (8.29).
8.3.3 Estimation and Inference
For simplicity, we present the estimation and inference procedures only for the SNR models in Section 8.3.1. The same methods apply to the SNR models in Section 8.3.2 for clustered data with a slight modification of notation.

Consider the vector form (8.29) and assume that W depends on an unknown vector of parameters τ. Assume that f_k ∈ H_k and H_k = H_k0 ⊕ H_k1, where H_k0 = span{φ_k1, ..., φ_kp_k} and H_k1 is an RKHS with RK R_k1. Our goal is to estimate the parameters β, τ, σ², and the nonparametric functions f.
Let

\[ l(y; \beta, f, \tau, \sigma^2) = \log |\sigma^2 W^{-1}| + \frac{1}{\sigma^2} (y - \eta)^T W (y - \eta) \tag{8.34} \]

be twice the negative log-likelihood, where an additive constant is ignored for simplicity. We estimate β, τ, and f as minimizers of the penalized likelihood (PL)

\[ l(y; \beta, f, \tau, \sigma^2) + \frac{n\lambda}{\sigma^2} \sum_{k=1}^{r} \theta_k^{-1} \| P_{k1} f_k \|^2, \tag{8.35} \]

where P_k1 is the projection operator onto H_k1 in H_k, and λθ_k^{-1} are smoothing parameters. The multiplying constant n/σ² is introduced in the penalty term so that, ignoring an additive constant, the PL (8.35) has the same form as the PWLS (8.5) when N is linear in both β and f.
We will first develop a backfitting procedure for the semiparametric conditional linear model (8.27) and then develop an algorithm for the general SNR model (8.26).

Consider the semiparametric conditional linear model (8.27). We first consider the estimation of f with fixed β and τ. When β and τ are fixed, the PL (8.35) is equivalent to the PWLS (8.5). Therefore, the solutions of f to (8.35) can be represented as those in (8.6). We use the same notation as in Section 8.2.2. Note that both T and Σ_θ may depend on β, even though the dependence is not expressed explicitly for simplicity. Let α = (α(β; x_1), ..., α(β; x_n))^T. We need to solve the equations

\[
\begin{aligned}
(\Sigma_\theta + n\lambda W^{-1}) c + T d &= y - \alpha, \\
T^T c &= 0,
\end{aligned} \tag{8.36}
\]

for the coefficients c and d. Note that α and W are fixed since β and τ are fixed, and the equations in (8.36) have the same form as those in (5.6). Therefore, methods in Section 5.2.1 can be used to solve (8.36), and the UBR, GCV, and GML methods in Section 5.2.3 can be used to estimate the smoothing parameters λ and θ.
Next we consider the estimation of β and τ with fixed f. When f is fixed, the PL (8.35) is equivalent to

\[ \log |\sigma^2 W^{-1}| + \frac{1}{\sigma^2} \{y - \eta(\beta, f)\}^T W \{y - \eta(\beta, f)\}. \tag{8.37} \]

We use the backfitting and Gauss–Newton algorithms in Pinheiro and Bates (2000) to find minimizers of (8.37) by updating β and τ iteratively. Details about the backfitting and Gauss–Newton algorithms can be found in Section 7.5 of Pinheiro and Bates (2000).
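With f held fixed and W = I (an illustrative simplification), minimizing (8.37) over β is a standard nonlinear least-squares problem, and one Gauss–Newton step has the familiar form β ← β + (J^T J)^{-1} J^T (y − η(β)). A one-parameter toy version (the mean function and all values are assumptions for illustration):

```python
import numpy as np

# With f fixed and W = I, minimizing (8.37) over beta is nonlinear least
# squares; one Gauss-Newton step is beta <- beta + (J'J)^{-1} J'(y - eta).
# Toy mean: eta_i = exp(-beta * x_i) + f(x_i), with f the fixed part.
rng = np.random.default_rng(2)
x = np.linspace(0, 2, 40)
f_fixed = 0.5 * x                      # the "fixed" nonparametric part
beta_true = 1.3
y = np.exp(-beta_true * x) + f_fixed + 0.01 * rng.normal(size=x.size)

beta = 0.5                             # initial value
for _ in range(25):
    eta = np.exp(-beta * x) + f_fixed
    J = -x * np.exp(-beta * x)         # d eta / d beta
    beta = beta + (J @ (y - eta)) / (J @ J)

print(abs(beta - beta_true) < 0.05)    # True: converged near 1.3
```

This is the kind of update that step (b) of the algorithms below delegates to the Gauss–Newton machinery of Pinheiro and Bates (2000); the actual implementation also updates τ.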
Putting the pieces together, we have the following algorithm.

Algorithm for semiparametric conditional linear models

1. Initialize: Set initial values for β and τ.

2. Cycle: Alternate between (a) and (b) until convergence:

(a) Conditional on current estimates of β and τ, update f using the methods in Section 5.2.1 with smoothing parameters selected by the UBR, GCV, or GML method in Section 5.2.3.

(b) Conditional on current estimates of f, update β and τ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.
Note that the smoothing parameters are estimated iteratively with fixed τ at step 2(a). The parameters τ are estimated at step 2(b), which makes the algorithm relatively easy to implement. An alternative, computationally more expensive, approach is to estimate the smoothing parameters and τ jointly at step 2(a).
Finally, we consider estimation for the general SNR model (8.26). When η is nonlinear in f, the solutions of f to (8.35) usually do not fall in finite-dimensional spaces. Therefore, certain approximations are necessary. Again, first consider estimating f with fixed β and τ. We now extend the EGN procedure in Section 7.3 to multiple functions. Let f_− be the current estimate of f. For any fixed β, N_i is a functional on H_1 × ··· × H_r. We assume that N_i is Fréchet differentiable at f_− and write D_i = DN_i(f_−). Then D_i h = Σ_{k=1}^r D_ki h_k, where D_ki is the partial Fréchet differential of N_i with respect to f_k evaluated at f_−, h = (h_1, ..., h_r), and h_k ∈ H_k (Flett 1980). For k = 1, ..., r, D_ki is a bounded linear functional on H_k. Approximating N_i(β, f) by its linear approximation

\[ N_i(\beta, f) \approx N_i(\beta, f_-) + \sum_{k=1}^{r} D_{ki}(f_k - f_{k-}), \tag{8.38} \]

we have an approximate semiparametric conditional linear model

\[ \tilde{y}_i = \sum_{k=1}^{r} D_{ki} f_k + \epsilon_i, \quad i = 1, \ldots, n, \tag{8.39} \]

where ỹ_i = y_i − N_i(β, f_−) + Σ_{k=1}^r D_ki f_{k−}. The functions f in model (8.39) can be estimated using the method in Section 8.2.2. Consequently, we have the following algorithm for the general SNR model.
Algorithm for general SNR model

1. Initialize: Set initial values for β, τ, and f.

2. Cycle: Alternate between (a) and (b) until convergence:

(a) Conditional on current estimates of β, τ, and f, compute D_ki and ỹ_i, and update f by applying step 2(a) of the Algorithm for semiparametric conditional linear models to the approximate model (8.39). Repeat this step until convergence.

(b) Conditional on current estimates of f, update β and τ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.
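One linearization step (8.38)–(8.39) can be illustrated with a functional that is genuinely nonlinear in f. The choice N_i(f) = exp(f(x_i)) below is an illustrative assumption; its partial Fréchet differential at the current estimate f_− acts as D_i h = exp(f_−(x_i)) h(x_i).

```python
import numpy as np

# One step of the linearization (8.38)-(8.39) for the illustrative choice
# N_i(f) = exp(f(x_i)).  At the current estimate f_, the differential acts as
# D_i h = exp(f_(x_i)) h(x_i), and the working response becomes
# ytilde_i = y_i - exp(f_(x_i)) + exp(f_(x_i)) f_(x_i).
x = np.linspace(0, 1, 10)
f_true = np.sin(np.pi * x)
y = np.exp(f_true)                     # noise-free observations for clarity
f_cur = np.zeros_like(x)               # current estimate f_

D = np.exp(f_cur)                      # coefficients representing D_ki
ytilde = y - np.exp(f_cur) + D * f_cur

# First-order accuracy of (8.38): here the linearization error is smaller
# than the squared distance between f_true and f_cur.
lin = np.exp(f_cur) + D * (f_true - f_cur)
err = np.max(np.abs(y - lin))
print(err < np.max(np.abs(f_true - f_cur)) ** 2)   # True
```

Fitting the working model ỹ_i = Σ_k D_ki f_k + ε_i with the PWLS machinery of Section 8.2.2 and iterating gives step 2(a) of the algorithm above.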
Denote the final estimates of β, τ, and f as β̂, τ̂, and f̂. We estimate σ² by

\[ \hat{\sigma}^2 = \frac{(y - \hat{\eta})^T \hat{W} (y - \hat{\eta})}{n - d - \mathrm{tr}(H^*)}, \tag{8.40} \]

where η̂ = η(β̂, f̂); Ŵ is the estimate of W with τ replaced by τ̂; d is the degrees of freedom for parameters, usually taken as the total number of parameters; and H* is the hat matrix for model (8.39) computed at convergence.
Conditional on f, inference for β and τ can be made based on the approximate distributions of the maximum likelihood estimates. Conditional on β and τ, model (8.27) is a special case of the semiparametric linear regression model (8.4). Therefore, Bayesian confidence intervals can be constructed as in Section 8.2 for semiparametric conditional linear models. For the general SNR model, approximate Bayesian confidence intervals can be constructed based on the linear approximation (8.39) at convergence. The bootstrap approach may also be used to construct confidence intervals.
8.3.4 The snr Function

The function snr in the assist package is designed to fit the following special SNR models:

\[ y_i = \psi(\beta, L_{1i}(\beta) f_1, \ldots, L_{ri}(\beta) f_r) + \epsilon_i, \quad i = 1, \ldots, n, \tag{8.41} \]

for cross-sectional data and

\[ y_{ij} = \psi(\beta_i, L_{1ij}(\beta_i) f_1, \ldots, L_{rij}(\beta_i) f_r) + \epsilon_{ij}, \quad i = 1, \ldots, m; \; j = 1, \ldots, n_i, \tag{8.42} \]

for clustered data, where ψ is a known nonlinear function, and L_ki(β) and L_kij(β_i) are evaluational functionals on H_k. Obviously, (8.41) includes model (8.28) as a special case, and (8.42) includes model (8.32) as a special case.
A modified procedure is implemented in the snr function. Note that models (8.41) and (8.42) reduce to the NNR model (7.5) when β is fixed and the random errors are iid. For non-iid random errors, when both β and τ are fixed, transformations similar to those in Section 5.2.1 may be used. The nonlinear Gauss–Seidel algorithm in Section 7.4 can then be used to update the nonparametric functions f. Therefore, we implement the following procedure in the snr function.
Algorithm in the snr function
1. Initialize: Set initial values for β, τ , and f .
2. Cycle: Alternate between (a) and (b) until convergence:
(a) Conditional on current estimates of β, τ, and f, apply transformations as in Section 5.2.1 when random errors are non-iid, and use the nonlinear Gauss–Seidel algorithm to update f.
(b) Conditional on current estimates of f, update β and τ by solving (8.37) alternately using the backfitting and Gauss–Newton algorithms.
No initial values for fk are necessary if ψ depends on fk linearly in (8.41) or (8.42). The initial τ is set such that W equals the identity matrix. Step (b) is implemented using the gnls function in the nlme package.
A typical call is
snr(formula, func, params, start)
where formula is a two-sided formula specifying the response variable on the left side of a ~ operator and an expression for the function ψ in the model (8.41) or (8.42) on the right side, with β and fk treated as parameters. The argument func inputs a list of formulae, each specifying bases φk1, . . . , φkpk for Hk0 and the RK Rk1 for Hk1 in the same way as the func argument in the nnr function. The argument params inputs a list of two-sided linear formulae specifying second-stage models for β. When there is no second-stage model for a parameter, it is specified as ~1. The argument start inputs initial values for all parameters. When ψ depends on functions fk nonlinearly, initial values for those functions should also be provided in the start argument.
An object of snr class is returned. The generic function summary can be applied to extract further information. Predictions at covariate values can be computed using the predict function. Posterior means and standard deviations for f can be computed using the intervals function. See Sections 8.4.2–8.4.6 for examples.
8.4 Examples
8.4.1 Canadian Weather — Revisit
Consider the Canadian weather data again with annual temperature and precipitation profiles from all 35 stations as functional data. We now investigate how the monthly logarithm of rainfall depends on climate regions and temperature. Let y be the logarithm of prec, x1 be the region indicator, and x2 be the month variable scaled into [0, 1]. Consider the following FLM
yk,x1(x2) = f1(x1, x2) + wk,x1(x2)f2(x2) + εk,x1(x2),   (8.43)

where yk,x1(x2), wk,x1(x2), and εk,x1(x2) are profiles of log precipitation, residual temperature after removing the region effect, and random error of station k in climate region x1, respectively. Both annual log precipitation and temperature profiles can be regarded as functional data. Therefore, model (8.43) is an example of situation (iii) in Section 2.10, where both the independent and dependent variables involve functional data. We model the bivariate function f1(x1, x2) using the tensor product space ℝ⁴ ⊗ W₂²(per). Then, as in (4.22), f1 admits the SS ANOVA decomposition

f1(x1, x2) = µ + f1,1(x1) + f1,2(x2) + f1,12(x1, x2).
Model (8.43) is the same as model (14.1) in Ramsay and Silverman (2005) with µ(x2) = µ + f1,2(x2), αx1(x2) = f1,1(x1) + f1,12(x1, x2), and β(x2) = f2(x2). The function f2 is the varying coefficient function for the temperature effect. We model f2 using the periodic spline space W₂²(per). There are 12 monthly observations for each station. Collect all observations on y, x1, x2, and w for all 35 stations and denote them as {(yi, xi1, xi2, wi), i = 1, . . . , n}, where n = 420. Denote the collection of random errors as ε1, . . . , εn. Then model (8.43) can be rewritten as
yi = µ + f1,1(xi1) + f1,2(xi2) + f1,12(xi1, xi2) + L2if2 + εi,   (8.44)
where L2if2 = wif2(xi2). Model (8.44) is a special case of the semiparametric linear regression model (8.4). Model spaces for f1,1, f1,2, and f1,12 are ℋ1 = H1, ℋ2 = H2, and ℋ3 = H3, where H1, H2, and H3 are defined in Section 4.4.4. The RKs R1, R2, and R3 of H1, H2, and H3 can be calculated as products of the RKs of the involved marginal spaces. Define the n × n matrices Σ1 = {R1(xi1, xj1)}, Σ2 = {R2(xi2, xj2)}, and Σ3 = {R3((xi1, xi2), (xj1, xj2))}. For f2, the model space is ℋ4 = W₂²(per), where the construction of W₂²(per) is given in Section 2.7. Specifically, write W₂²(per) = H40 ⊕ H41, where H40 = {1} and H41 = W₂²(per) ⊖ {1}. Denote φ4(x2) = 1 as the basis of H40 and R4 as the RK of H41. Then T4 = (L21φ4, . . . , L2nφ4)ᵀ = (w1, . . . , wn)ᵀ ≜ w, and Σ4 = {L2iL2jR4} = {wiwjR4(xi2, xj2)} = wwᵀ ◦ Λ, where Λ = {R4(xi2, xj2)} and ◦ represents elementwise multiplication of two matrices. Therefore, model (8.44) can be fitted as follows:
> x1 <- rep(as.factor(region),rep(12,35))
> x2 <- (rep(1:12,35)-.5)/12
> y <- log(as.vector(monthlyPrecip))
> w <- canada.fit2$resi
> canada.fit3 <- ssr(y~w,
     rk=list(shrink1(x1), periodic(x2),
             rk.prod(shrink1(x1),periodic(x2)),
             rk.prod(kron(w),periodic(x2))))
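The last rk.prod term encodes Σ4 = wwᵀ ◦ Λ. The elementwise identity {wiwjR4(xi2, xj2)} = wwᵀ ◦ Λ is easy to verify numerically in base R (toy numbers; no assist functions assumed):

```r
# Toy check that {w_i * w_j * Lambda_ij} = (w w^T) ∘ Lambda,
# where ∘ is elementwise (Hadamard) multiplication.
w <- c(1, 2, 3)
Lam <- matrix(c(1, .5, .2, .5, 1, .5, .2, .5, 1), 3, 3)
Sigma4 <- outer(w, w) * Lam          # w w^T, then elementwise product
direct <- matrix(0, 3, 3)
for (i in 1:3) for (j in 1:3) direct[i, j] <- w[i] * w[j] * Lam[i, j]
stopifnot(all.equal(Sigma4, direct))
```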
Estimates of the region effects αx1(x2) and the coefficient function f2 evaluated at grid points are computed as follows:
> xgrid <- seq(0,1,len=50)
> zone <- c("Atlantic","Pacific",
            "Continental","Arctic")
> grid <- data.frame(x1=rep(zone,rep(50,4)),
x2=rep(xgrid,4), w=rep(0,200))
> alpha <- predict(canada.fit3, newdata=grid,
terms=c(0,0,1,0,1,0))
> grid <- data.frame(x1=rep(zone[1],50), x2=xgrid,
w=rep(1,50))
> f2 <- predict(canada.fit3, newdata=grid,
terms=c(0,0,0,0,0,1))
Those estimates of αx1(x2) and f2 are shown in Figures 8.3 and 8.4, respectively.

Ramsay and Silverman (2005) also considered the following FLM (model (16.1) in Ramsay and Silverman (2005)):
yk(x2) = f1(x2) + ∫₀¹ wk(u)f2(u, x2)du + εk(x2),   k = 1, . . . , 35,   (8.45)

where yk(x2) is the logarithm of the annual precipitation profile at station k, f1(x2) plays the part of an intercept as in standard regression, wk(u) is the temperature profile at station k, f2(u, x2) is an unknown
FIGURE 8.3 Canadian weather data, plots of estimated region effects to precipitation, and 95% Bayesian confidence intervals.
FIGURE 8.4 Canadian weather data, estimate of the coefficient function for temperature effect f2, and 95% Bayesian confidence intervals.
weight function at month x2, and εk(x2) are random error processes. Compared with model (8.44), the whole temperature profile is used to predict the current precipitation in (8.45). Note that wk(x2) in model (8.45) is the actual temperature, whereas wk(x2) in (8.44) is the residual temperature after removing the region effect.
The goal is to model and estimate the functions f1 and f2. It is reasonable to assume that f1 and f2 are smooth periodic functions. Specifically, we assume that f1 ∈ W₂²(per) and f2 ∈ W₂²(per) ⊗ W₂²(per). Let H0 = {1} and H1 = W₂²(per) ⊖ {1}. Then we have a one-way SS ANOVA decomposition for W₂²(per),

W₂²(per) = H0 ⊕ H1,

and a two-way SS ANOVA decomposition for W₂²(per) ⊗ W₂²(per),

W₂²(per) ⊗ W₂²(per) = {H0⁽¹⁾ ⊕ H1⁽¹⁾} ⊗ {H0⁽²⁾ ⊕ H1⁽²⁾}
= {H0⁽¹⁾ ⊗ H0⁽²⁾} ⊕ {H1⁽¹⁾ ⊗ H0⁽²⁾} ⊕ {H0⁽¹⁾ ⊗ H1⁽²⁾} ⊕ {H1⁽¹⁾ ⊗ H1⁽²⁾}
≜ H0 ⊕ H2 ⊕ H3 ⊕ H4,   (8.46)

where H0⁽¹⁾ = H0⁽²⁾ = {1} and H1⁽¹⁾ = H1⁽²⁾ = W₂²(per) ⊖ {1}. Equivalently, we have the following SS ANOVA decomposition for f1 and f2:
f1(x2) = µ1 + f1,1(x2),
f2(u, x2) = µ2 + f2,1(u) + f2,2(x2) + f2,12(u, x2),
where f1,1 ∈ H1, f2,1 ∈ H2, f2,2 ∈ H3, and f2,12 ∈ H4. Then model (8.45) can be rewritten as
yk(x2) = µ1 + µ2zk + f1,1(x2) + ∫₀¹ wk(u)f2,1(u)du + zkf2,2(x2) + ∫₀¹ wk(u)f2,12(u, x2)du + εk(x2),   (8.47)

where zk = ∫₀¹ wk(u)du. Let z = (z1, . . . , z35)ᵀ and s = (s1, . . . , sn)ᵀ ≜ z ⊗ 1₁₂, where 1k is a k-vector of all ones and ⊗ represents the Kronecker product. Denote {(yi, xi2), i = 1, . . . , n} as the collection of all observations on y and x2, and ε1, . . . , εn as the collection of random errors. Then model (8.47) can be rewritten as
yi = µ1 + µ2si + f1,1(xi2) + ∫₀¹ w[i](u)f2,1(u)du + sif2,2(xi2) + ∫₀¹ w[i](u)f2,12(u, xi2)du + εi,   i = 1, . . . , n,   (8.48)
where [i] represents the integer part of (11 + i)/12. Define a linear operator L1i as the evaluational functional on H1 = W₂²(per) ⊖ {1} such that L1if1,1 = f1,1(xi2). Define linear operators L2i, L3i, and L4i on the subspaces H2, H3, and H4 in (8.46) such that
L2if2,1 = ∫₀¹ w[i](u)f2,1(u)du,
L3if2,2 = sif2,2(xi2),
L4if2,12 = ∫₀¹ w[i](u)f2,12(u, xi2)du.
Assume that the functions wk(u) for k = 1, . . . , 35 are square integrable. Then L2i, L3i, and L4i are bounded linear functionals, and model (8.48) is a special case of the semiparametric linear regression model (8.4). Let ti = (i − 0.5)/12 for i = 1, . . . , 12 be the middle point of month i, x2 = (t1, . . . , t12)ᵀ, wk = (wk(x12), . . . , wk(xm2))ᵀ, where m = 12, and W = (w1, . . . , w35). Let R1 be the RK of H1 = W₂²(per) ⊖ {1}. Introduce the notation R1(u, v) = {R1(uk, vl)} for k = 1, . . . , K and l = 1, . . . , L, for any vectors u = (u1, . . . , uK)ᵀ and v = (v1, . . . , vL)ᵀ. It is easy to check that S = (1n, s), Σ1 = 1₃₅ ⊗ 1₃₅ᵀ ⊗ R1(x2, x2), and Σ3 = z ⊗ zᵀ ⊗ R1(x2, x2). A similar approximation as in Section 2.10 leads to Σ2 ≈ {WᵀR1(x2, x2)W} ⊗ 1₁₂ ⊗ 1₁₂ᵀ/144. Note that the RK of H4 in (8.46) equals R1(s, t)R1(x, z). Then the (i, j)th element of the matrix Σ4 is
Σ4(i, j) = L4iL4jR1(u, v)R1(x, z)
= R1(xi2, xj2) ∫₀¹∫₀¹ w[i](u)w[j](v)R1(u, v)dudv
≈ (1/144) R1(xi2, xj2) w[i]ᵀR1(x2, x2)w[j].
Thus, Σ4 ≈ Σ1 ◦ Σ2. We fit model (8.48) as follows:
> W <- monthlyTemp; z <- apply(W,2,mean)
> s <- rep(z,rep(12,35)); x <- seq(0.5,11.5,1)/12
> y <- log(as.vector(monthlyPrecip))
> Q1 <- kronecker(matrix(1,35,35),periodic(x))
> Q2 <- kronecker(t(W)%*%periodic(x)%*%W,
matrix(1,12,12))/144
> Q3 <- kronecker(z%*%t(z),periodic(x))
> Q4 <- rk.prod(Q1,Q2)
> canada.fit4 <- ssr(y~s, rk=list(Q1,Q2,Q3,Q4))
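The factor 1/144 in Q2 comes from approximating each integral over [0, 1] by the midpoint rule at the 12 monthly midpoints, ∫₀¹ g(u)du ≈ (1/12)Σⱼ g(tⱼ), so the double integral picks up 1/144. A quick check of that quadrature on a toy periodic integrand (an assumption for illustration, not data from the example):

```r
# Midpoint-rule quadrature at the 12 monthly midpoints t_j = (j - 0.5)/12:
# each integral over [0, 1] contributes a factor 1/12, hence 1/144 for the
# double integral used to form Q2.
tj <- (1:12 - 0.5) / 12
g <- function(u) sin(2 * pi * u)^2     # toy periodic integrand
quad <- mean(g(tj))                    # (1/12) * sum_j g(t_j)
exact <- 0.5                           # integral of sin^2(2*pi*u) over [0, 1]
stopifnot(abs(quad - exact) < 1e-8)
```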
We now show how to compute the estimated functions evaluated at a set of points. From (8.6), the estimated functions are represented by
f̂1(x2) = d1 + θ1 Σᵢ₌₁ⁿ ci R1(x2, xi2),

f̂2,1(u) = θ2 Σᵢ₌₁ⁿ ci ∫₀¹ w[i](v)R1(u, v)dv,

f̂2,2(x2) = θ3 Σᵢ₌₁ⁿ ci {∫₀¹ w[i](u)du} R1(x2, xi2),

f̂2,12(u, x2) = θ4 Σᵢ₌₁ⁿ ci {∫₀¹ w[i](v)R1(u, v)dv} R1(x2, xi2).
Let u0 and x0 be sets of points in [0, 1] for the variables u and x2, respectively. For simplicity, assume that both u0 and x0 have length n0. The following calculations can be extended to the case when u0 and x0 have different lengths. It is not difficult to check that
f̂1(x0) = d1 1ₙ₀ + θ1 Σᵢ₌₁ⁿ ci R1(x0, xi2) = d1 1ₙ₀ + θ1 S1c,

f̂2,1(u0) = θ2 Σᵢ₌₁ⁿ ci ∫₀¹ w[i](u)R1(u0, u)du
≈ (1/12) θ2 Σᵢ₌₁ⁿ ci Σⱼ₌₁¹² R1(u0, xj2)w[i](xj2)
= (1/12) θ2 Σᵢ₌₁ⁿ ci R1(u0, x2)w[i] = θ2 S2c,

f̂2,2(x0) = θ3 Σᵢ₌₁ⁿ ci {∫₀¹ w[i](u)du} R1(x0, xi2) = θ3 S3c,

where S1 = 1₃₅ᵀ ⊗ R1(x0, x2), S2 = {R1(u0, x2)W} ⊗ 1₁₂ᵀ/12, and S3 = zᵀ ⊗ R1(x0, x2). The interaction f2,12 is a bivariate function. Thus, we evaluate it at a bivariate grid {(u0k, x0l) : k, l = 1, . . . , n0}:
f̂2,12(u0k, x0l) = θ4 Σᵢ₌₁ⁿ ci {∫₀¹ w[i](v)R1(u0k, v)dv} R1(x0l, xi2)
≈ θ4 Σᵢ₌₁ⁿ ci S2[k, i]S1[l, i].
Then (f̂2,12(u01, x01), . . . , f̂2,12(u01, x0n0), . . . , f̂2,12(u0n0, x01), . . . , f̂2,12(u0n0, x0n0))ᵀ = θ4 S4c, where S4 is an n0² × n matrix with elements S4[(k − 1)n0 + l, i] = S2[k, i]S1[l, i] for k, l = 1, . . . , n0 and i = 1, . . . , n.
> ngrid <- 40; xgrid <- seq(0,1,len=ngrid)
> S1 <- kronecker(t(rep(1,35)), periodic(xgrid,x))
> S2 <- kronecker(periodic(xgrid,x)%*%W, t(rep(1,12)))/12
> S3 <- kronecker(t(z), periodic(xgrid,x))
> S4 <- NULL
> for (k in 1:ngrid) {
for (l in 1:ngrid) S4 <- rbind(S4, S1[l,]*S2[k,])}
> the <- 10^canada.fit4$rkpk.obj$theta
> f1 <- canada.fit4$coef$d[1] +
     the[1]*S1%*%canada.fit4$coef$c
> mu2 <- canada.fit4$coef$d[2]
> f21 <- the[2]*S2%*%canada.fit4$coef$c
> f22 <- the[3]*S3%*%canada.fit4$coef$c
> f212 <- the[4]*S4%*%canada.fit4$coef$c
> f2 <- mu2+rep(f21,rep(ngrid,ngrid))+rep(f22,ngrid)+f212
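The nested loop above builds S4 one row at a time, which is slow for a large grid. An equivalent one-shot construction using row indexing (a sketch; A1 and A2 below are small stand-ins for the n0 × n matrices S1 and S2):

```r
# Vectorized construction of S4: row (k - 1)*n0 + l equals S2[k, ] * S1[l, ],
# matching the nested rbind loop but built in a single indexing step.
build.S4 <- function(S1, S2) {
  n0 <- nrow(S1)
  S2[rep(1:n0, each = n0), , drop = FALSE] *
    S1[rep(1:n0, times = n0), , drop = FALSE]
}
# toy check against the loop version (A1, A2 stand in for S1, S2)
A1 <- matrix(1:6, 2, 3); A2 <- matrix(7:12, 2, 3)
S4.loop <- NULL
for (k in 1:2) for (l in 1:2) S4.loop <- rbind(S4.loop, A1[l, ] * A2[k, ])
stopifnot(all.equal(build.S4(A1, A2), S4.loop, check.attributes = FALSE))
```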
Figure 8.5 displays the estimates of f1 and f2.
FIGURE 8.5 Canadian weather data, estimates of f1 (left) and f2 (right).
8.4.2 Superconductivity Magnetization Modeling
In this section we use the superconductivity data to illustrate how to check a nonlinear regression model. Figure 8.6 displays magnetization versus time on a logarithmic scale.
FIGURE 8.6 Superconductivity data, observations (circles), and the fits by nonlinear regression (NLR), cubic spline, nonlinear partial spline (NPS), and L-spline.
It seems that a straight line (the Anderson and Kim model) can fit the data well (Yeshurun, Malozemoff and Shaulov 1996). Let y be the magnetization values and x be the logarithm of time scaled into the interval [0, 1]. To check the Anderson and Kim model, we fit a cubic spline with the GML choice of the smoothing parameter:
> library(NISTnls)
> a <- Bennett5; x <- ident(a$x); y <- a$y
> super.cub <- ssr(y~x, cubic(x), spar="m")
> anova(super.cub, simu.size=1000)
Testing H_0: f in the NULL space
test.value simu.size simu.p-value approximate.p-value
LMP 0.1512787 1000 0
GML 0.01622123 1000 0 0
Let f1 be the projection onto the space W₂²[0, 1] ⊖ {1, x − 0.5}, which represents the systematic departure from the straight line model. Both the LMP and GML tests for the hypothesis H0: f1(x) = 0 conclude that the departure from the straight line model is statistically significant. Figure 8.7(a) shows the estimate of f1 with 95% Bayesian confidence intervals. Those confidence intervals also indicate that, though small, the departure from a straight line is statistically significant. The deviation from the Anderson and Kim model has been noticed for high-temperature superconductors, and the following “interpolation formula” was proposed (Bennett, Swartzendruber, Blendell, Habib and Seyoum 1994, Yeshurun et al. 1996):
y = β1(β2 + x)^(−1/β3) + ε.   (8.49)
The nonlinear regression model (8.49) is fitted as follows:
> b10 <- -1500*(max(a$x)-min(a$x))**(-1/.85)
> b20 <- (45+min(a$x))/(max(a$x)-min(a$x))
> super.nls <- nls(y~b1*(b2+x)**(-1/b3),
start=list(b1=b10,b2=b20,b3=.85))
The initial values were computed based on one set of initial values provided in the help file of Bennett5. The fit of the above NLR model is shown in Figure 8.6. To check the “interpolation formula”, we can fit a nonlinear partial spline model
y = β1(β2 + x)^(−1/β3) + f2(x) + ε,   (8.50)

with f2 ∈ W₂²[0, 1]. Model (8.50) is a special case of the SNR model, which can be fitted as follows:
> bh <- coef(super.nls)
> super.snr <- snr(y~b1*(b2+x)**(-1/b3)+f(x),
params=list(b1+b2+b3~1),
func=f(u)~list(~I(u-.5), cubic(u)),
start=list(params=bh), spar="m")
> summary(super.snr)
...
Coefficients:
Value Std.Error t-value p-value
b1 -466.4197 32.05006 -14.55285 0
b2 11.2296 0.21877 51.33033 0
b3 0.9322 0.01719 54.22932 0
...
GML estimate(s) of smoothing spline parameter(s):
0.0001381760
Equivalent Degrees of Freedom (DF) for spline function:
10.96524
Residual standard error: 0.001703358
We used the estimates of the parameters in (8.49), bh, as initial values for β1, β2, and β3. The fit of model (8.50) is shown in Figure 8.6. To check if the departure from model (8.49), f2, is significant, we compute posterior means and standard deviations for f2:
> super.snr.pred <- intervals(super.snr)
The estimate of f2 and its 95% Bayesian confidence intervals are shown in Figure 8.7(b). The magnitude of f2 is very small. The zero line is outside the confidence intervals in some regions.
FIGURE 8.7 Superconductivity data, (a) estimate of the departure from the straight line model, (b) estimate of the departure from the “interpolation formula” based on the nonlinear partial spline, and (c) estimate of the departure from the “interpolation formula” based on the L-spline. Shaded regions are 95% Bayesian confidence intervals.
An alternative approach to checking the NLR model (8.49) is to use the L-spline introduced in Section 2.11. Consider the regression model (2.1) where f ∈ W₂²[0, 1]. For fixed β2 and β3, it is clear that the space corresponding to the “interpolation formula”,

H0 = span{(β2 + x)^(−1/β3)},

is the kernel of the differential operator

L = D + 1/{β3(β2 + x)}.
Let H1 = W₂²[0, 1] ⊖ H0. The Green function is

G(x, s) = {(β2 + s)/(β2 + x)}^(1/β3) for s ≤ x, and G(x, s) = 0 for s > x.
Therefore, the RK of H1 is

R1(x, z) = ∫₀¹ {(β2 + s)/(β2 + x)}^(1/β3) {(β2 + s)/(β2 + z)}^(1/β3) ds
= C(β2 + x)^(−1/β3)(β2 + z)^(−1/β3),   (8.51)
where C = ∫₀¹ (β2 + s)^(2/β3) ds. We fit the L-spline model as follows:
> ip.rk <- function(x,b2,b3) ((b2+x)%o%(b2+x))**(-1/b3)
> super.l <- ssr(y~I((bh[2]+x)**(-1/bh[3]))-1,
     rk=ip.rk(x,bh[2],bh[3]), spar="m")
The function ip.rk computes the RK R1 in (8.51) with the constant C ignored, since it can be absorbed by the smoothing parameter. The values of β2 and β3 are fixed at the estimates from the nonlinear regression model (8.49). Let f3 be the projection onto the space H1, which represents the systematic departure from the “interpolation formula”. Figure 8.7(c) shows the estimate of f3 with 95% Bayesian confidence intervals. The systematic departure from the “interpolation formula” is essentially zero.
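For reference, the absorbed constant has the closed form C = ∫₀¹(β2 + s)^(2/β3)ds = {(β2 + 1)^(1+2/β3) − β2^(1+2/β3)}/(1 + 2/β3), which is easy to confirm numerically (the values of β2 and β3 below are arbitrary illustrative choices, not the fitted estimates):

```r
# Closed form of C = integral over [0, 1] of (b2 + s)^(2/b3) ds, checked
# against numerical integration; b2, b3 are arbitrary illustrative values.
b2 <- 11.23; b3 <- 0.93
C.closed <- ((b2 + 1)^(1 + 2/b3) - b2^(1 + 2/b3)) / (1 + 2/b3)
C.num <- integrate(function(s) (b2 + s)^(2/b3), 0, 1)$value
stopifnot(all.equal(C.closed, C.num, tolerance = 1e-6))
```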
8.4.3 Oil-Bearing Rocks
The rock data set contains measurements on four cross sections of each of 12 oil-bearing rocks. The aim is to predict permeability (perm) from three other measurements: the total area (area), total perimeter (peri), and a measure of “roundness” of the pores in the rock cross section (shape). Let y = log(perm), x1 = area/10000, x2 = peri/10000, and x3 = shape. A full multivariate nonparametric model such as an SS ANOVA model is not desirable in this case since there are only 48 observations. Consider the projection pursuit regression model (8.1) with r = 2. Let βk = (βk1, βk2, βk3)ᵀ for k = 1, 2. For identifiability, we use spherical coordinates βk1 = sin(αk1) cos(αk2), βk2 = sin(αk1) sin(αk2), and βk3 = cos(αk1), which satisfy the side condition βk1² + βk2² + βk3² = 1. Then we have the following SNR model:
y = β0 + f1(sin(α11) cos(α12)x1 + sin(α11) sin(α12)x2 + cos(α11)x3)
  + f2(sin(α21) cos(α22)x1 + sin(α21) sin(α22)x2 + cos(α21)x3) + ε.   (8.52)
Note that the domains of f1 and f2 are not fixed intervals since they depend on unknown parameters. We model f1 and f2 using the one-dimensional thin-plate spline space W₂²(ℝ). To make f1 and f2 identifiable with the constant β0, we remove the constant functions from the model space. That is, we assume that f1, f2 ∈ W₂²(ℝ) ⊖ {1}. Random errors are assumed to be iid. Bounds on the parameters αkj are ignored for simplicity.
> attach(rock)
> y <- log(perm)
> x1 <- area/10000; x2 <- peri/10000; x3 <- shape
> rock.ppr <- ppr(y~x1+x2+x3, nterms=2, max.terms=5)
> b.ppr <- rock.ppr$alpha
> a11ini <- acos(b.ppr[3,1]/sqrt(sum(b.ppr[,1]**2)))
> a12ini <- atan(b.ppr[2,1]/b.ppr[1,1])
> a21ini <- acos(b.ppr[3,2]/sqrt(sum(b.ppr[,2]**2)))
> a22ini <- atan(b.ppr[2,2]/b.ppr[1,2])
> rock.snr <- snr(y~b0
+f1(sin(a11)*cos(a12)*x1+sin(a11)*sin(a12)*x2
+cos(a11)*x3)
+f2(sin(a21)*cos(a22)*x1+sin(a21)*sin(a22)*x2
+cos(a21)*x3),
func=list(f1(u)~list(~u-1,tp(u)),
f2(v)~list(~v-1,tp(v))),
params=list(b0+a11+a12+a21+a22~1), spar="m",
start=list(params=c(mean(y),a11ini,a12ini,
a21ini,a22ini)),
control=list(prec.out=1.e-3,maxit.out=50))
> summary(rock.snr)
...
Coefficients:
Value Std.Error t-value p-value
b0 5.341747 0.31181572 17.13110 0.0000
a11 1.574432 0.04597406 34.24610 0.0000
a12 -1.221826 0.01709925 -71.45495 0.0000
a21 0.836215 0.26074376 3.20704 0.0025
a22 -1.010785 0.02919986 -34.61607 0.0000
...
GML estimate(s) of smoothing spline parameter(s):
1.173571e-05 1.297070e-05
Equivalent Degrees of Freedom (DF) for spline function:
8.559332
Residual standard error: 0.7337079
> a <- rock.snr$coef[-1]
> u <- sin(a[1])*cos(a[2])*x1+sin(a[1])*sin(a[2])*x2+
     cos(a[1])*x3
> v <- sin(a[3])*cos(a[4])*x1+sin(a[3])*sin(a[4])*x2+
     cos(a[3])*x3
> ugrid <- seq(min(u),max(u),len=50)
> vgrid <- seq(min(v),max(v),len=50)
> rock.snr.ci <- intervals(rock.snr,
newdata=data.frame(u=ugrid,v=vgrid))
We fitted the projection pursuit regression model using the ppr function in R first (Venables and Ripley 2002) and used the estimates to compute initial values for the spherical coordinates. The estimates and posterior standard deviations of f1 and f2 were computed at grid points using the intervals function. Figure 8.8 shows the estimates of f1 and f2 with 95% Bayesian confidence intervals. It is interesting to note that the overall shapes of the estimated functions from ppr (Figure 8.9 in Venables and Ripley (2002)) and snr are comparable even though the estimation methods are quite different.
FIGURE 8.8 Rock data, estimates of f1 (left) and f2 (right). Shaded regions are 95% Bayesian confidence intervals.
8.4.4 Air Quality
The air quality data set contains daily measurements of the ozone concentration (Ozone) in parts per million and three meteorological variables: wind speed (Wind) in miles per hour, temperature (Temp) in degrees Fahrenheit, and solar radiation (Solar.R) in Langleys. The goal is to investigate how the air pollutant ozone concentration depends on the three meteorological variables. Let y = Ozone^(1/3), x1 = Wind, x2 = Temp, and x3 = Solar.R. Yu and Ruppert (2002) considered the following partially linear single index model
y = f1(β1x1 + β2x2 + √(1 − β1² − β2²) x3) + ε.   (8.53)
Note that, for identifiability, √(1 − β1² − β2²) is used to represent the coefficient for x3 so that it is positive and the sum of all squared coefficients equals 1. Random errors are assumed to be iid. We model f1 using the one-dimensional thin-plate spline space W₂²(ℝ). The single index model (8.53) is an SNR model that can be fitted as follows:
> air <- na.omit(airquality)
> attach(air)
> y <- Ozone^(1/3)
> x1 <- Wind; x2 <- Temp; x3 <- ident(Solar.R)
> air.snr.1 <- snr(y~f1(b1*x1+b2*x2+sqrt(1-b1^2-b2^2)*x3),
func=f1(u)~list(~u, rk=tp(u)),
params=list(b1+b2~1), spar="m",
start=list(params=c(-0.8,.5)),
control=list(maxit.out=50))
The condition β1² + β2² < 1 is ignored for simplicity. The estimate of f1 and its 95% Bayesian confidence intervals are shown in Figure 8.9(a). The estimate of f1 is similar to that in Yu and Ruppert (2002).
Yu and Ruppert (2002) also considered the following partially linear single index model
y = f2(β1x1 + √(1 − β1²) x2) + β2x3 + ε.   (8.54)
The following statement fits model (8.54) with f2 ∈ W₂²(ℝ):
> air.snr.2 <- snr(y~f2(b1*x1+sqrt(1-b1^2)*x2)+b2*x3,
func=f2(u)~list(~u, rk=tp(u)),
params=list(b1+b2~1), spar="m",
start=list(params=c(-0.8,1.3)))
The estimates of f2 and β2x3 are shown in Figure 8.9(b)(c). The effect of radiation may be nonlinear. So, we further fit the following SNR model
y = f3(β1x1 + √(1 − β1²) x2) + f4(x3) + ε.   (8.55)
Again, we model f3 using the space W₂²(ℝ). Since the domain of x3 is [0, 1], we model f4 using the cubic spline space W₂²[0, 1] ⊖ {1}, where constant functions are removed for identifiability.
FIGURE 8.9 Air quality data, (a) observations (dots), and the estimate of f1 (solid line) with 95% Bayesian confidence intervals; (b) partial residuals after removing the radiation effect (dots), the estimates of f2 (dashed line) and f3 (solid line), and 95% Bayesian confidence intervals (shaded region) for f3; and (c) partial residuals after removing the index effects based on wind speed and temperature (dots), the estimates of β2x3 in model (8.54) (dashed line) and f4 (solid line), and 95% Bayesian confidence intervals (shaded region) for f4.
> air.snr.3 <- snr(y~f3(a1*x1+sqrt(1-a1^2)*x2)+f4(x3),
func=list(f3(u)~list(~u, rk=tp(u)),
f4(v)~list(~v-1, rk=cubic(v))),
params=list(a1~1), spar="m",
start=list(params=c(-0.8)))
Estimates of f3 and f4 are shown in Figure 8.9(b)(c). The estimate of f4 increases with increasing radiation until a value of about 250 Langleys, after which it is flat. The difference between f4 and the linear estimate is not significant based on the confidence intervals, perhaps due to the small sample size. To compare the three models (8.53), (8.54), and (8.55), we compute the AIC, BIC, and GCV criteria:
> n <- 111
> rss <- c(sum((y-air.snr.1$fitted)**2),
sum((y-air.snr.2$fitted)**2),
sum((y-air.snr.3$fitted)**2))/n
> df <- c(air.snr.1$df$f+air.snr.1$df$para,
air.snr.2$df$f+air.snr.2$df$para,
air.snr.3$df$f+air.snr.3$df$para)
> gcv <- rss/(1-df/n)**2
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> print(round(rbind(aic, bic, gcv),4))
aic -156.9928 -176.5754 -177.3056
bic -139.6195 -157.8198 -155.7608
gcv 0.2439 0.2046 0.2035
AIC and GCV select model (8.55), while BIC selects model (8.54).
8.4.5 The Evolution of the Mira Variable R Hydrae
The star data set contains the magnitude (brightness) of the Mira variable R Hydrae during 1900–1950. Figure 8.10 displays the observations over time. The Mira variable R Hydrae is well known for its declining period and amplitude. We will consider three SNR models in this section to investigate the pattern of the decline.
FIGURE 8.10 Star data, plot of observations (points), and the fit based on model (8.58) (solid line).
Let y = magnitude and x = time. We first consider the following SNR model (Genton and Hall 2007)

y = a(x)f1(t(x)) + ε,   (8.56)

where y is the magnitude on day x, a(x) = 1 + β1x is the amplitude function, f1 is the common periodic shape function with unit period, t(x) = log(1 + β3x/β2)/β3 is a time transformation function, and ε is a random error. Random errors are assumed to be iid. The function 1/t′(x) can be regarded as the period function. Therefore, model (8.56) assumes that the amplitude and period evolve linearly. Since f1 is close to a sinusoidal function, we model f1 using the trigonometric spline with L = D{D² + (2π)²} (m = 2 in (2.67)) and f1 ∈ W₂³(per). Model (8.56) can be fitted as follows:
> data(star); attach(star)
> star.fit.1 <- snr(y~(1+b1*x)*f1(log(1+b3*x/b2)/b3),
func=list(f1(u)~list(~sin(2*pi*u)+cos(2*pi*u),
lspline(u,type="sine1"))),
params=list(b1+b2+b3~1), spar="m",
start=list(params=c(0.0000003694342,419.2645,
-0.00144125)))
> summary(star.fit.1)
...
Coefficients:
Value Std.Error t-value p-value
b1 0.0000 0.00000035 1.0163 0.3097
b2 419.2485 0.22553623 1858.8966 0.0000
b3 -0.0014 0.00003203 -44.8946 0.0000
...
GML estimate(s) of smoothing spline parameter(s):
0.9999997
Equivalent Degrees of Freedom (DF) for spline function:
3.000037
Residual standard error: 0.925996
> grid <- seq(0,1,len=100)
> star.p.1 <- intervals(star.fit.1,
newdata=data.frame(u=grid),
terms=list(f1=matrix(c(1,1,1,1,0,0,0,1),
ncol=4,byrow=T)))
> co <- star.fit.1$coef
> tx <- log(1+co[3]*x/co[2])/co[3]
> xfold <- tx-floor(tx)
> yfold <- y/(1+co[1]*x)
We computed posterior means and standard deviations of f1 and its projection P1f1 onto the subspace W₂³(per) ⊖ {1, sin 2πx, cos 2πx} using the intervals function. The estimate of f1 and its 95% Bayesian confidence intervals are shown in Figure 8.11(a). We computed folded observations at day x as ỹ = y/(1 + β̂1x) and x̃ = t̂(x) − ⌊t̂(x)⌋, where z − ⌊z⌋ represents the fractional part of z. The folded observations are shown in Figure 8.11(a). The projection P1f1 represents the departure from the sinusoidal model space span{1, sin 2πx, cos 2πx}. Figure 8.11(b) indicates that the function f1 is not significantly different from a sinusoidal model since P1f1 is not significantly different from zero.
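That span{1, sin 2πx, cos 2πx} is the right reference model follows because it is the null space of L = D{D² + (2π)²}: for f = sin 2πx, f‴ = −(2π)³ cos 2πx while (2π)²f′ = (2π)³ cos 2πx, so Lf = 0 (similarly for cos 2πx, and constants are trivially annihilated). A numerical confirmation:

```r
# Check that L = D(D^2 + (2*pi)^2), i.e., Lf = f''' + (2*pi)^2 * f',
# annihilates sin(2*pi*x) and cos(2*pi*x) at a few test points.
x <- seq(0.1, 0.9, len = 5)
Lf.sin <- -(2*pi)^3 * cos(2*pi*x) + (2*pi)^2 * (2*pi) * cos(2*pi*x)
Lf.cos <-  (2*pi)^3 * sin(2*pi*x) + (2*pi)^2 * (-(2*pi)) * sin(2*pi*x)
stopifnot(max(abs(Lf.sin)) < 1e-8, max(abs(Lf.cos)) < 1e-8)
```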
FIGURE 8.11 Star data, (a) folded observations (points), estimate of f1 (solid line), and 95% Bayesian confidence intervals (shaded region); and (b) estimate of P1f1 (solid line) and 95% Bayesian confidence intervals (shaded region).
Assuming that the periodic shape function can be well modeled by a sinusoidal function, we can investigate the evolving amplitude and period nonparametrically. We first consider the following SNR model
y = β1 + exp{f2(x)} sin[2π{β2 + log(1 + β4x/β3)/β4}] + ε,   (8.57)
where β1 is a parameter for the mean, exp{f2(x)} is the amplitude function, and β2 is a phase parameter. Note that the exponential transformation is used to enforce the positivity constraint on the amplitude. We model f2 using the cubic spline model space W₂²[0, b], where b equals the maximum of x. Model (8.57) can be fitted as follows:
> star.fit.2 <- snr(y~b1+
exp(f2(x))*sin(2*pi*(b2+log(1+b4*x/b3)/b4)),
func=list(f2(u)~list(~u,cubic2(u))),
params=list(b1+b2+b3+b4~1), spar="m",
start=list(params=c(7.1726,.4215,419.2645,-0.00144125),
f=list(f2=log(diff(range(star.p.1$f1$fit[,1]))))),
control=list(prec.out=.001,
rkpk.control=list(limnla=c(12,14))))
> summary(star.fit.2)
...
Coefficients:
Value Std.Error t-value p-value
b1 7.2004 0.0273582 263.1886 0
b2 0.3769 0.0143172 26.3278 0
b3 416.8002 0.5796450 719.0612 0
b4 -0.0011 0.0000617 -18.4201 0
GML estimate(s) of smoothing spline parameter(s):
1.000000e+12
Equivalent Degrees of Freedom (DF) for spline function:
9.829385
Residual standard error: 0.9093264
We used the logarithm of the range of f1 in model (8.56) as initial values for f2. We also limited the search range for log10(nλ), where n is the number of observations. To look at the general pattern of the amplitudes, we compute the amplitude for each period by fitting a sinusoidal model for each period of the folded data containing more than five observations. The logarithm of these estimated amplitudes and estimates of the amplitude functions based on models (8.56) and (8.57) are shown in Figure 8.12(a). Apparently, the amplitudes are underestimated based on these models.
Finally, we consider the following SNR model
y = β + exp{f3(x)} sin{2πf4(x)} + ε,   (8.58)
where both the amplitude and period functions are modeled nonparametrically. We model f3 and f4 using the cubic spline model space W₂²[0, b]. Model (8.58) can be fitted as follows:
> co <- star.fit.2$coef
> f4ini <- co[2]+log(1+co[4]*x/co[3])/co[4]
> star.fit.3 <- snr(y~b+exp(f3(x))*sin(2*pi*f4(x)),
func=list(f3(u)+f4(u)~list(~u,cubic2(u))),
params=list(b~1), spar="m",
start=list(params=c(7.1726),
f=list(f3=star.fit.2$funcFitted,f4=f4ini)),
control=list(prec.out=.001,
rkpk.control=list(limnla=c(12,14))))

FIGURE 8.12 Star data, (a) estimated amplitudes based on folded data (circles), estimate of the amplitude function a(x) in model (8.56) rescaled to match the amplitude function in model (8.57) (dashed line), estimate of the amplitude function exp(f2(x)) in model (8.57) (solid line), and 95% Bayesian confidence intervals (shaded region), all on logarithmic scale; (b) estimated amplitudes based on folded data (circles), estimate of the amplitude function exp(f3(x)) in model (8.58) (solid line), and 95% Bayesian confidence intervals (shaded region), all on logarithmic scale; (c) estimate of f4 based on model (8.58) (solid line) and 95% Bayesian confidence intervals (shaded region); and (d) estimated periods based on the CLUSTER method (circles) and estimate of the period function 1/f4′(x).
> n <- length(y)
> rss <- c(sum((y-star.fit.1$fitted)**2),
sum((y-star.fit.2$fitted)**2),
sum((y-star.fit.3$fitted)**2))/n
> df <- c(star.fit.1$df$f+star.fit.1$df$para,
star.fit.2$df$f+star.fit.2$df$para,
star.fit.3$df$f+star.fit.3$df$para)
> aic <- n*log(rss)+2*df
> bic <- n*log(rss)+log(n)*df
> gcv <- rss/(1-df/n)**2
> print(round(rbind(aic, bic, gcv),4))
aic -164.0240 -202.6215 -1520.0960
bic -134.0823 -133.6093 -1197.6866
gcv 0.8598 0.8299 0.2476
We used the fitted values of the corresponding components from models (8.57) and (8.56) as initial values for f3 and f4. We also computed the AIC, BIC, and GCV criteria for models (8.56), (8.57), and (8.58). Model (8.58) fits the data much better, and the overall fit is shown in Figure 8.10. AIC, BIC, and GCV all select model (8.58). Estimates of the functions f3 and f4 are shown in Figures 8.12(b) and (c). The confidence intervals for f4 are so narrow that they are indistinguishable from the estimate of f4. The amplitude function f3 fits the data much better. To look at the general pattern of the periods, we first identify peaks using the CLUSTER method (Yang, Liu and Wang 2005). Observed periods are estimated as the lengths between peaks, and they are shown in Figure 8.12(d). The estimate of the period function 1/f4′ in Figure 8.12(d) indicates that the evolution of the period may be nonlinear. By allowing a nonlinear period function, model (8.58) leads to a much improved overall fit with a less biased estimate of the amplitude function.
8.4.6 Circadian Rhythm
Many biochemical, physiological, and behavioral processes of living mat-ters follow a roughly 24-hour cycle known as the circadian rhythm. Weuse the hormone data to illustrate how to fit an SIM to investigate cir-cadian rhythms. The hormone data set contains cortisol concentrationmeasured every 2 hours for a period of 24 hours from multiple subjects.In this section we use observations from normal subjects only. Cortisolconcentrations on the log10 scale from nine normal subjects are shownin Figure 8.13.
It is usually assumed that there is a common shape function for all individuals. The time axis may be shifted and the magnitude of variation may differ between subjects; that is, there may be phase and amplitude differences between subjects. Therefore, we consider the following SIM
concij = βi1 + exp(βi2) f(timeij − alogit(βi3)) + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni,   (8.59)
FIGURE 8.13 Hormone data, plots of cortisol concentrations on the log scale (circles) and the fitted curves (solid lines) based on model (8.59). Subject IDs are shown in the strips.
where m is the total number of subjects, ni is the number of observations from subject i, concij is the cortisol concentration (on the log10 scale) of the ith subject at the jth time point timeij, βi1 is the 24-hour mean of the ith subject, exp(βi2) is the amplitude of the ith subject, alogit(βi3) = exp(βi3)/{1 + exp(βi3)} is the phase of the ith subject, and ǫij are random errors. Note that the variable time is transformed into the interval [0, 1], the exponential transformation is used to enforce a positivity constraint on the amplitude, and the inverse logistic transformation alogit is used so that the phase lies inside the interval [0, 1]. Compared with the SIM model (8.33), there is no scale parameter β4 in model (8.59) since the period is fixed to be 1. The function f is the common shape function. Since it is a periodic function with period 1, we model f using the trigonometric spline with L = D² + (2π)² (m = 2 in (2.70)) and f ∈ W2²(per) ⊖ {1}, where constant functions are removed from the model space to make f identifiable with βi1. In order to make βi2 and βi3 identifiable with f, we add the constraints β21 = β31 = 0. Model (8.59) is an SNR model for clustered data. Assuming random errors are iid, model (8.59) can be fitted as follows:
> data(horm.cort)
> nor <- horm.cort[horm.cort$type=="normal",]
> M <- model.matrix(~as.factor(ID), data=nor)
> nor.snr.fit1 <- snr(conc~b1+exp(b2)*f(time-alogit(b3)),
     func=f(u)~list(~sin(2*pi*u)+cos(2*pi*u)-1,
        lspline(u,type="sine0")),
     params=list(b1~M-1, b2+b3~M[,-1]-1),
     start=list(params=c(mean(nor$conc),rep(0,24))),
     data=nor, spar="m",
     control=list(prec.out=0.001,converg="PRSS"))
Note that the second-stage models for the parameters were specified by the params argument. We removed the first column of the design matrix M to satisfy the side condition β21 = β31 = 0. We used the option converg="PRSS" instead of the default converg="COEF" because this option usually requires fewer iterations. We compute fitted curves for all subjects evaluated at grid points:
> nor.grid <- data.frame(ID=rep(unique(nor$ID),rep(50,9)),
time=rep(seq(0,1,len=50),9))
> M <- model.matrix(~as.factor(ID), data=nor.grid)
> nor.snr.p <- predict(nor.snr.fit1,newdata=nor.grid)
Note that the matrix M needs to be generated again for the grid points. The fitted curves for all subjects are shown in Figure 8.13. We further compute the posterior means and standard deviations of f and its projection P1f onto W2²(per) ⊖ {1, sin 2πx, cos 2πx} as follows:
> grid <- seq(0,1,len=100)
> nor.snr.p.f <- intervals(nor.snr.fit1,
newdata=data.frame(u=grid),
terms=list(f=matrix(c(1,1,1,0,0,1),nrow=2,byrow=T)))
The estimates of f and its projection P1f are shown in Figure 8.14. It is obvious that P1f is significantly different from zero. Thus a simple sinusoidal model may not be appropriate for these data.
FIGURE 8.14 Hormone data, (a) estimate of f (solid line) and 95% Bayesian confidence intervals (shaded region); and (b) estimate of P1f (solid line) and 95% Bayesian confidence intervals (shaded region).
Random errors in model (8.59) may be correlated. In the following we fit with an AR(1) within-subject correlation structure:
> M <- model.matrix(~as.factor(subject))
> nor.snr.fit2 <- update(nor.snr.fit1,
cor=corAR1(form=~1|subject))
> summary(nor.snr.fit2)
...
Correlation Structure: AR(1)
Formula: ~1 | subject
Parameter estimate(s):
Phi
-0.1557776
...
The lag 1 autocorrelation coefficient is small. We will further discuss how to deal with possible correlation within each subject in Section 9.4.5.
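The two reparameterizations used in model (8.59) are easy to check directly: exp guarantees a positive amplitude, and the inverse logistic map guarantees a phase in (0, 1). A minimal base-R sketch (alogit is written out here for illustration; no fitted objects are needed):

```r
# Inverse logistic map used for the phase parameter in model (8.59)
alogit <- function(b) exp(b) / (1 + exp(b))

b2 <- -0.5               # unconstrained amplitude parameter
amplitude <- exp(b2)     # exp() guarantees amplitude > 0
b3 <- 2.2                # unconstrained phase parameter
phase <- alogit(b3)      # alogit() guarantees 0 < phase < 1
```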
Chapter 9
Semiparametric Mixed-Effects Models
9.1 Linear Mixed-Effects Models
Mixed-effects models include both fixed effects and random effects, where random effects are usually introduced to model correlation within a cluster and/or spatial correlations. They provide flexible tools to model both the mean and the covariance structures simultaneously.
The simplest mixed-effects model is perhaps the classical two-way mixed model. Suppose A is a fixed factor with a levels, B is a random factor with b levels, and the design is balanced. The two-way mixed model assumes that

yijk = µ + αi + βj + (αβ)ij + ǫijk,  i = 1, . . . , a; j = 1, . . . , b; k = 1, . . . ,m,   (9.1)

where yijk is the kth observation at level i of factor A and level j of factor B, µ is the overall mean, αi and βj are main effects, (αβ)ij is the interaction, and ǫijk are random errors. Since factor B is random, βj and (αβ)ij are random effects. It is usually assumed that βj are iid N(0, σb²), (αβ)ij are iid N(0, σab²), ǫijk are iid N(0, σ²), and that they are mutually independent.
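As a concrete illustration, the balanced two-way mixed model (9.1) can be simulated in a few lines of base R; the variance components and factor sizes below are arbitrary choices for the sketch:

```r
set.seed(1)
a <- 3; b <- 4; m <- 2                 # levels of A and B, replicates
mu    <- 10                            # overall mean
alpha <- c(-1, 0, 1)                   # fixed main effects of A (sum to 0)
beta  <- rnorm(b, sd = 0.5)            # random main effects of B
ab    <- matrix(rnorm(a * b, sd = 0.3), a, b)  # random interactions
y <- array(0, dim = c(a, b, m))
for (i in 1:a) for (j in 1:b) for (k in 1:m)
  y[i, j, k] <- mu + alpha[i] + beta[j] + ab[i, j] + rnorm(1, sd = 0.2)
```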
Another simple mixed-effects model is the linear growth curve model. Suppose a response variable is measured repeatedly over a period of time or a sequence of doses from multiple individuals. Assume that responses across time or doses for each individual can be described by a simple straight line, while the intercepts and slopes for different individuals may differ. Then one may consider the following linear growth curve model

yij = β1 + β2 xij + b1i + b2i xij + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni,   (9.2)

where yij is the observation from individual i at time (or dose) xij, β1 and β2 are the population intercept and slope, b1i and b2i are random effects representing individual i's departures in intercept and slope from the population parameters, and ǫij are random errors. Let bi = (b1i, b2i)^T. It is usually assumed that bi are iid N(0, σ²D) for a certain covariance matrix D, and that random effects and random errors are mutually independent.

A general linear mixed-effects (LME) model assumes that

y = Sβ + Zb + ǫ,   (9.3)
where y is an n-vector of observations on the response variable, S and Z are design matrices for fixed and random effects, respectively, β is a q1-vector of unknown fixed effects, b is a q2-vector of unobservable random effects, and ǫ is an n-vector of random errors. We assume that b ∼ N(0, σ²D), ǫ ∼ N(0, σ²Λ), and b and ǫ are independent. Note that y ∼ N(Sβ, σ²(ZDZ^T + Λ)). Therefore, the mean structure is modeled by the fixed effects, and the covariance structure is modeled by the random effects and random errors.
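The marginal covariance σ²(ZDZ^T + Λ) can be made concrete with a toy random-intercept example. Assuming two clusters of three observations each (an illustrative setup, not one from the book), ZDZ^T + Λ is block compound-symmetric:

```r
Z <- kronecker(diag(2), matrix(1, 3, 1))  # 6 x 2: random intercept, 2 clusters
D <- 2 * diag(2)                          # Cov(b)/sigma^2
Lambda <- diag(6)                         # iid random errors
V <- Z %*% D %*% t(Z) + Lambda            # Cov(y)/sigma^2
V[1, 1]  # variance 3 = 2 + 1
V[1, 2]  # within-cluster covariance 2
V[1, 4]  # zero across clusters
```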
As discussed in Sections 3.5, 4.7, 5.2.2, and 7.3.3, smoothing spline estimates can be regarded as the BLUP estimates of the corresponding LME and NLME (nonlinear mixed-effects) models. Furthermore, the GML (generalized maximum likelihood) estimates of smoothing parameters are the REML (restricted maximum likelihood) estimates of variance components in the corresponding LME and NLME models. These connections between smoothing spline models and mixed-effects models will be utilized again in this chapter. In particular, details for the connection between semiparametric linear mixed-effects models and LME models will be given in Section 9.2.2.
9.2 Semiparametric Linear Mixed-Effects Models
9.2.1 The Model
A semiparametric linear mixed-effects (SLM) model assumes that
yi = si^T β + ∑_{k=1}^r Lki fk + zi^T b + ǫi,  i = 1, . . . , n,   (9.4)

where s and z are independent variables for fixed and random effects, respectively, β is a q1-vector of parameters, b is a q2-vector of random effects, Lki are bounded linear functionals, fk are unknown functions, and ǫi are random errors. For k = 1, . . . , r, denote by Xk the domain of fk and assume that fk ∈ Hk, where Hk is an RKHS on Xk.
Let y = (y1, . . . , yn)^T, S = (s1, . . . , sn)^T, Z = (z1, . . . , zn)^T, f = (f1, . . . , fr), γ(f) = (∑_{k=1}^r Lk1 fk, . . . , ∑_{k=1}^r Lkn fk)^T, and ǫ = (ǫ1, . . . , ǫn)^T. Then model (9.4) can be written in the vector form

y = Sβ + γ(f) + Zb + ǫ.   (9.5)

We assume that b ∼ N(0, σ²D), ǫ ∼ N(0, σ²Λ), and they are mutually independent. It is clear that model (9.5) is an extension of the LME model with an additional term for nonparametric fixed effects. We note that the random effects are general. Stochastic processes, including those based on smoothing splines, can be used to construct models for random effects. Therefore, in a sense, the random effects may also be modeled nonparametrically. See Section 9.2.4 for an example.
For clustered data, an SLM model assumes that
yij = sij^T β + ∑_{k=1}^r Lkij fk + zij^T bi + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni,   (9.6)

where yij is the jth observation in cluster i, and bi are random effects for cluster i. Let n = ∑_{i=1}^m ni, yi = (yi1, . . . , yini)^T, y = (y1^T, . . . , ym^T)^T, Si = (si1, . . . , sini)^T, S = (S1^T, . . . , Sm^T)^T, Zi = (zi1, . . . , zini)^T, Z = diag(Z1, . . . , Zm), b = (b1^T, . . . , bm^T)^T, γi(f) = (∑_{k=1}^r Lki1 fk, . . . , ∑_{k=1}^r Lkini fk)^T, γ(f) = (γ1^T(f), . . . , γm^T(f))^T, ǫi = (ǫi1, . . . , ǫini)^T, and ǫ = (ǫ1^T, . . . , ǫm^T)^T. Then model (9.6) can be written in the same vector form as (9.5).

Similar SLM models may be constructed for multiple levels of grouping and other situations.
9.2.2 Estimation and Inference
For simplicity we present the estimation and inference procedures for the SLM model (9.4). The same methods apply to the SLM model (9.6) with a slight modification of notation.
We assume that the covariance matrices D and Λ depend on an unknown vector of covariance parameters τ. Our goal is to estimate β, f, b, τ, and σ². Assume that Hk = Hk0 ⊕ Hk1, where Hk0 = span{φk1, . . . , φkpk} and Hk1 is an RKHS with RK Rk1. Let η(β, f) = Sβ + γ(f) and W⁻¹ = ZDZ^T + Λ. The marginal distribution of y is y ∼ N(η(β, f), σ²W⁻¹). For fixed τ, we estimate β and f as minimizers of the PWLS (penalized weighted least squares)

(1/n)(y − η(β, f))^T W (y − η(β, f)) + λ ∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖²,   (9.7)
where Pk1 is the projection operator onto Hk1 in Hk, and λθk⁻¹ are smoothing parameters. Note that the PWLS (9.7) has the same form as (8.5). Therefore, the results in Section 8.2.2 hold for the SLM models. In particular,

fk = ∑_{ν=1}^{pk} dkν φkν + θk ∑_{i=1}^n ci ξki,  k = 1, . . . , r,   (9.8)
where ξki(x) = Lki(z) Rk1(x, z). For k = 1, . . . , r, let dk = (dk1, . . . , dkpk)^T, Tk = {Lki φkν}, i = 1, . . . , n, ν = 1, . . . , pk, and Σk = {Lki Lkj Rk1}, i, j = 1, . . . , n. Let c = (c1, . . . , cn)^T, d = (d1^T, . . . , dr^T)^T, α = (β^T, d^T)^T, T = (T1, . . . , Tr), X = (S T), θ = (θ1, . . . , θr), and Σθ = ∑_{k=1}^r θk Σk. We have Lki fk = ∑_{ν=1}^{pk} dkν Lki φkν + θk ∑_{j=1}^n cj Lki ξkj and fk = (Lk1 fk, . . . , Lkn fk)^T = Tk dk + θk Σk c. Consequently,

γ(f) = ∑_{k=1}^r (Tk dk + θk Σk c) = Td + Σθ c

and

η(β, f) = Sβ + Td + Σθ c = Xα + Σθ c.

Furthermore,

∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖² = ∑_{k=1}^r θk c^T Σk c = c^T Σθ c.
Then the PWLS (9.7) reduces to

(y − Xα − Σθ c)^T W (y − Xα − Σθ c) + nλ c^T Σθ c.   (9.9)

Differentiating (9.9) with respect to α and c, we have

X^T W X α + X^T W Σθ c = X^T W y,
Σθ W X α + (Σθ W Σθ + nλ Σθ) c = Σθ W y.   (9.10)
We estimate random effects by (Wang 1998a)
b = D Z^T W (y − Xα − Σθ c).   (9.11)
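The linear system (9.10) can be assembled and solved directly. The base-R sketch below is an illustrative stand-in, not the book's implementation: it uses a toy Brownian-motion kernel min(s, t) for Σθ, W = I, and simulated data; the first equation in (9.10) then forces X^T c = 0 at the solution.

```r
set.seed(2)
n <- 20
x <- (1:n) / n
X <- cbind(1, x)                                   # columns of (S T)
Sig <- outer(x, x, pmin)                           # toy reproducing kernel
W <- diag(n)                                       # weight matrix
nlambda <- 1                                       # n * lambda
y <- sin(2 * pi * x) + rnorm(n, sd = 0.1)
# assemble the block system (9.10) and solve for (alpha, c)
A <- rbind(cbind(t(X) %*% W %*% X, t(X) %*% W %*% Sig),
           cbind(Sig %*% W %*% X,  Sig %*% W %*% Sig + nlambda * Sig))
rhs <- c(t(X) %*% W %*% y, Sig %*% W %*% y)
sol <- solve(A, rhs)
alpha <- sol[1:2]
cc    <- sol[-(1:2)]
fit   <- X %*% alpha + Sig %*% cc                  # X alpha + Sigma_theta c
```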
Henderson (hierarchical) likelihood is often used to justify the estimation of fixed and random effects in an LME model (Robinson 1991, Lee, Nelder and Pawitan 2006). We now show that the above PWLS estimates of β, f, and b can also be derived from the following penalized Henderson (hierarchical) likelihood of y and b

(y − η(β, f) − Zb)^T Λ⁻¹ (y − η(β, f) − Zb) + b^T D⁻¹ b + nλ ∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖²,   (9.12)

where, ignoring some constant terms, the first two components in (9.12) equal twice the negative logarithm of the joint density function of y and b. Again, the solution of f in (9.12) can be represented by (9.8). Then the penalized Henderson (hierarchical) likelihood (9.12) reduces to

(y − Xα − Σθ c − Zb)^T Λ⁻¹ (y − Xα − Σθ c − Zb) + b^T D⁻¹ b + nλ c^T Σθ c.   (9.13)
Differentiating (9.13) with respect to α, c, and b, we have

X^T Λ⁻¹ X α + X^T Λ⁻¹ Σθ c + X^T Λ⁻¹ Z b = X^T Λ⁻¹ y,
Σθ Λ⁻¹ X α + (Σθ Λ⁻¹ Σθ + nλ Σθ) c + Σθ Λ⁻¹ Z b = Σθ Λ⁻¹ y,
Z^T Λ⁻¹ X α + Z^T Λ⁻¹ Σθ c + (D⁻¹ + Z^T Λ⁻¹ Z) b = Z^T Λ⁻¹ y.   (9.14)
It is easy to check that

W = Λ⁻¹ − Λ⁻¹ Z (Z^T Λ⁻¹ Z + D⁻¹)⁻¹ Z^T Λ⁻¹.   (9.15)

Since I = (ZDZ^T + Λ)W = ZDZ^T W + ΛW, we have

ZDZ^T W = I − Λ{Λ⁻¹ − Λ⁻¹ Z (D⁻¹ + Z^T Λ⁻¹ Z)⁻¹ Z^T Λ⁻¹} = Z (D⁻¹ + Z^T Λ⁻¹ Z)⁻¹ Z^T Λ⁻¹.   (9.16)
From (9.11) and (9.16), we have

Zb = Z (D⁻¹ + Z^T Λ⁻¹ Z)⁻¹ Z^T Λ⁻¹ (y − Xα − Σθ c).   (9.17)

From (9.15) and (9.17), the first equation in (9.10) is equivalent to

0 = X^T W X α + X^T W Σθ c − X^T W y
  = X^T Λ⁻¹ X α + X^T Λ⁻¹ Σθ c − X^T Λ⁻¹ y + X^T Λ⁻¹ Z (D⁻¹ + Z^T Λ⁻¹ Z)⁻¹ Z^T Λ⁻¹ (y − Xα − Σθ c)
  = X^T Λ⁻¹ X α + X^T Λ⁻¹ Σθ c − X^T Λ⁻¹ y + X^T Λ⁻¹ Z b,

which is the same as the first equation in (9.14). Similarly, the second equation in (9.10) is equivalent to the second equation in (9.14). Equation (9.11) is equivalent to

D⁻¹ b = Z^T Λ⁻¹ (y − Xα − Σθ c) − Z^T Λ⁻¹ Z (D⁻¹ + Z^T Λ⁻¹ Z)⁻¹ Z^T Λ⁻¹ (y − Xα − Σθ c)
      = Z^T Λ⁻¹ (y − Xα − Σθ c) − Z^T Λ⁻¹ Z b,
which is the same as the third equation in (9.14). Therefore, the PWLS estimates of β, f, and b based on (9.7) and (9.11) can be regarded as the penalized Henderson (hierarchical) likelihood estimates.
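Identity (9.15) is a Woodbury-type inversion of W⁻¹ = ZDZ^T + Λ and is easy to verify numerically with arbitrary positive definite D and Λ (random toy matrices below):

```r
set.seed(3)
Z <- matrix(rnorm(12), 6, 2)
D <- crossprod(matrix(rnorm(4), 2, 2)) + 0.1 * diag(2)  # positive definite
Lambda <- diag(6)
Winv <- Z %*% D %*% t(Z) + Lambda                       # W^{-1}
Li <- solve(Lambda)
# right-hand side of identity (9.15)
W <- Li - Li %*% Z %*% solve(t(Z) %*% Li %*% Z + solve(D)) %*% t(Z) %*% Li
max(abs(W %*% Winv - diag(6)))  # numerically zero
```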
Let Zs = (In, . . . , In), where In denotes the n × n identity matrix. Consider the following LME model

y = Sβ + Td + ∑_{k=1}^r uk + Zb + ǫ = Xα + Zs u + Zb + ǫ,   (9.18)

where β and d are fixed effects, α = (β^T, d^T)^T, uk are random effects with uk ∼ N(0, σ²θkΣk/nλ), u = (u1^T, . . . , ur^T)^T, b are random effects with b ∼ N(0, σ²D), ǫ are random errors with ǫ ∼ N(0, σ²Λ), and random effects and random errors are mutually independent. Let Ds = diag(θ1Σ1, . . . , θrΣr) and b̃ = (u^T, b^T)^T. Write

σ⁻² Cov(b̃) = diag((nλ)⁻¹ Ds, D) = {I_{nr+q2}} {diag(Ds, D)} {diag((nλ)⁻¹ I_{nr}, I_{q2})}.
Then equation (3.3) in Harville (1976) can be written as

X^T Λ⁻¹ X α + X^T Λ⁻¹ Zs Ds φ1 + X^T Λ⁻¹ Z D φ = X^T Λ⁻¹ y,
Ds Zs^T Λ⁻¹ X α + (nλ Ds + Ds Zs^T Λ⁻¹ Zs Ds) φ1 + Ds Zs^T Λ⁻¹ Z D φ = Ds Zs^T Λ⁻¹ y,
D Z^T Λ⁻¹ X α + D Z^T Λ⁻¹ Zs Ds φ1 + (D + D Z^T Λ⁻¹ Z D) φ = D Z^T Λ⁻¹ y.   (9.19)
Suppose α, c, and b are solutions to (9.14). Note that Zs Ds Zs^T = Σθ. When Σθ is invertible, multiplying both sides of the second equation in (9.14) by Ds Zs^T Σθ⁻¹, it is not difficult to see that α, φ1 = Zs^T c, and φ = D⁻¹ b are solutions to (9.19). From Theorem 2 in Harville (1976), the linear system (9.19) is consistent, and the BLUP estimate of u is u = Ds φ1 = Ds Zs^T c = (θ1 c^T Σ1, . . . , θr c^T Σr)^T. Therefore, the PWLS estimate of each component, uk = θk Σk c, is a BLUP. When Σθ is not invertible, consider Zs u = ∑_{k=1}^r uk instead of each individual uk. Letting b̃ = ((Zs u)^T, b^T)^T and following the same arguments, it can be shown that the overall fit Σθ c is a BLUP. Similarly, the estimate of b is also a BLUP.
We now discuss how to estimate the smoothing parameters λ and θ and the variance–covariance parameters σ² and τ. Let the QR decomposition of X be

X = (Q1 Q2) ( R
              0 ).

Consider the LME model (9.18) and the orthogonal contrast w1 = Q2^T y. Then w1 ∼ N(0, δ Q2^T M Q2), where δ = σ²/nλ and M = Σθ + nλ W⁻¹. The restricted likelihood based on w1 is given in (3.37) with a different M defined in this section. Following the same arguments as in Section 3.6, we have the GML criterion for λ, θ, and τ:

GML(λ, θ, τ) = w1^T (Q2^T M Q2)⁻¹ w1 / {det(Q2^T M Q2)⁻¹}^{1/(n−q1−p)},   (9.20)

where p = ∑_{k=1}^r pk. Similar to (3.41), the REML (GML) estimate of σ² is

σ̂² = nλ̂ w1^T (Q2^T M̂ Q2)⁻¹ w1 / (n − q1 − p),   (9.21)

where M̂ = Σθ̂ + nλ̂ Ŵ⁻¹, and Ŵ⁻¹ is the estimate of W⁻¹ with τ replaced by τ̂.

For clustered data, the leaving-out-one-cluster approach presented in Section 5.2.2 may also be used to estimate the smoothing and variance–covariance parameters.
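The GML criterion (9.20) can be evaluated directly once Q2 is extracted from the QR decomposition of X. The self-contained sketch below works on the log scale and makes two simplifying assumptions (W⁻¹ = I and a toy min(s, t) kernel for Σθ); it is an illustration of the computation, not the book's code:

```r
set.seed(4)
n <- 30
x <- (1:n) / n
X <- cbind(1, x)                               # q1 + p = 2 columns
Sig <- outer(x, x, pmin)                       # toy kernel for Sigma_theta
y <- sin(2 * pi * x) + rnorm(n, sd = 0.2)
Q2 <- qr.Q(qr(X), complete = TRUE)[, -(1:2)]   # orthogonal contrast
w1 <- t(Q2) %*% y
loggml <- function(nlambda) {                  # log of criterion (9.20)
  B <- t(Q2) %*% (Sig + nlambda * diag(n)) %*% Q2   # Q2' M Q2 with W^{-1} = I
  ld <- as.numeric(determinant(B, logarithm = TRUE)$modulus)
  log(drop(t(w1) %*% solve(B, w1))) + ld / (n - 2)
}
grid <- 10^seq(-6, 2, length.out = 30)
nlambda.hat <- grid[which.min(sapply(grid, loggml))]
```

Working with the log determinant avoids the numerical underflow that a direct call to det() would produce for fine grids of small nλ.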
The SLM model (9.4) reduces to the semiparametric linear regression model (8.4) when the random effects are combined with random errors. Therefore, the methods described in Section 8.2.2 can be used to draw inference about β and f. Specifically, posterior means and standard deviations can be calculated using formulae (8.17) and (8.18) with W⁻¹ = ZDZ^T + Λ fixed at its estimate. Bayesian confidence intervals for the overall functions and their components can then be constructed as in Section 8.2.2. Covariances for random effects can be computed using Theorem 1 in Wang (1998a). The bootstrap method may also be used to construct confidence intervals.
9.2.3 The slm Function
The connections between SLM models and LME models suggest a relatively simple approach to fitting SLM models using existing software for LME models. Specifically, for k = 1, . . . , r, let Σk = Zsk Zsk^T be the Cholesky decomposition, where Zsk is an n × mk matrix with mk = rank(Σk). Consider the following LME model

y = Sβ + Td + ∑_{k=1}^r Zsk bsk + Zb + ǫ,   (9.22)

where β and d are fixed effects, bsk are random effects with bsk ∼ N(0, σ²θk Imk/nλ), b are random effects with b ∼ N(0, σ²D), ǫ are random errors with ǫ ∼ N(0, σ²Λ), and random effects and random errors are mutually independent. Following the same arguments as in Section 9.2.2, it can be shown that the PWLS estimates of the parameters and nonparametric functions in the SLM model (9.5) correspond to the BLUP estimates in the LME model (9.22). Furthermore, the GML estimates of the smoothing and covariance parameters in (9.5) correspond to the REML estimates of the covariance parameters in (9.22). Therefore, the PWLS estimates of β and f, and the GML estimates of λ, θ, and τ in (9.5), can be calculated by fitting the LME model (9.22) with covariance parameters estimated by the REML method. This approach is implemented by the slm function in the assist package, where the LME model is fitted using the lme function in the nlme library.
A typical call to the slm function is
slm(formula, rk, random)
where formula and rk serve the same purposes as those in the ssr function. Combined, they specify the fixed effects. The random argument specifies the random effects in the same way as in lme. An object of slm class is returned. The generic function summary can be applied to extract further information. Predictions at different levels of random effects can be computed using the predict function, where the nonparametric functions are treated as part of the fixed effects. Posterior means and standard deviations for β and f can be computed using the intervals function. Examples can be found in Section 9.4.
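For concreteness, a hypothetical call fitting a growth-curve-type model of the kind discussed in Section 9.2.4 might look as follows. This is only a sketch: the data frame dat and its columns y, x, and id are assumed to exist, and the random-effects specification follows lme's named-list syntax.

```r
library(assist)   # provides slm(); assumed installed

# Fixed effects: parametric part y ~ x plus a cubic spline in x (rk);
# random effects: a random intercept and slope for each subject id,
# specified as in nlme::lme
fit <- slm(y ~ x, rk = cubic(x), random = list(id = ~x), data = dat)
summary(fit)       # smoothing and variance-covariance parameter estimates
p <- predict(fit)  # predictions incorporating the estimated random effects
```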
9.2.4 SS ANOVA Decomposition
We have shown how to build multiple regression models using SS ANOVA decompositions in Chapter 4. The resulting SS ANOVA models have certain modular structures that parallel the classical ANOVA decompositions. In this section we show how to construct similar SS ANOVA decompositions with modular structures that parallel the classical mixed models. We illustrate how to construct SS ANOVA decompositions through two examples. More examples of SS ANOVA decompositions involving random effects can be found in Wang (1998a), Wang and Wahba (1998), and Section 9.4.3. As a general approach, the SS ANOVA decomposition may be employed to build mixed-effects models for other situations.
It is instructive to see how the classical two-way mixed model (9.1) can be derived via an SS ANOVA decomposition. In general, the factor B is considered random since the levels of the factor are chosen at random from a well-defined population of all factor levels. It is of interest to draw an inference about the general population using information from these observed (chosen) levels. Let X1 = {1, . . . , a} be the domain of factor A, and X2 be the population from which the levels of the random factor B are drawn. Assume the following model

yiwk = f(i, w) + ǫiwk,  i ∈ X1; w ∈ X2; k = 1, . . . ,m,   (9.23)

where f(i, w) is a random variable since w is a random sample from X2. f(i, j) for j = 1, . . . , b are realizations of the true mean function defined on X1 × X2. Let P be the sampling distribution on X2. Define averaging operators A1^(1) on X1 and A1^(2) on X2 as
A1^(1) f = (1/a) ∑_{i=1}^a f(i, ·),
A1^(2) f = ∫_{X2} f(·, w) dP.
A1^(2) computes the population average with respect to the sampling distribution. Let A2^(1) = I − A1^(1) and A2^(2) = I − A1^(2). An SS ANOVA decomposition can be defined as

f = {A1^(1) + A2^(1)}{A1^(2) + A2^(2)} f
  = A1^(1) A1^(2) f + A2^(1) A1^(2) f + A1^(1) A2^(2) f + A2^(1) A2^(2) f
  ≜ µ + αi + βw + (αβ)iw.

Therefore, the SS ANOVA decomposition leads to the same structure as the classical two-way mixed model (9.1).
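For realized levels of both factors, the averaging operators reduce to row and column means, and the four components of the decomposition can be computed directly. In the base-R illustration below (simulated values on a 3 × 4 grid), the empirical average over the observed levels of B stands in for the population average ∫ f dP:

```r
set.seed(5)
f <- matrix(rnorm(12), 3, 4)       # f(i, j) on a 3 x 4 grid
mu    <- mean(f)                   # A1^(1) A1^(2) f: overall mean
alpha <- rowMeans(f) - mu          # A2^(1) A1^(2) f: main effects of A
beta  <- colMeans(f) - mu          # A1^(1) A2^(2) f: main effects of B
ab    <- sweep(sweep(f - mu, 1, alpha), 2, beta)  # A2^(1) A2^(2) f: interaction
```

The components are centered and sum back to f exactly, mirroring the orthogonality of the operator decomposition.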
Next we discuss how to derive the SS ANOVA decomposition for repeated measures data. As in Section 9.1, suppose we have repeated measurements on a response variable over a period of time or a sequence of doses from multiple individuals. Suppose that individuals are selected at random from a well-defined population X1. Without loss of generality, denote the time period or dose range as X2 = [0, 1]. The linear growth curve model (9.2) assumes that responses across time or doses for each individual can be well described by a simple straight line, which may be too restrictive for some applications. Assume the following model

ywj = f(w, xwj) + ǫwj,  w ∈ X1; xwj ∈ X2; j = 1, . . . , nw,   (9.24)

where f(w, xwj) are random variables since w are random samples from X1. f(i, xij), i = 1, . . . ,m, j = 1, . . . , ni, are realizations of the true mean functions defined on X1 × X2. Suppose we want to model the mean function nonparametrically using the cubic spline space W2²[0, 1] under the construction in Section 2.2. Let P be the sampling distribution on X1. Define averaging operators
A1^(1) f = ∫_{X1} f(w, ·) dP,
A1^(2) f = f(·, 0),
A2^(2) f = f′(·, 0) x.
Let A2^(1) = I − A1^(1) and A3^(2) = I − A1^(2) − A2^(2). Then

f = {A1^(1) + A2^(1)}{A1^(2) + A2^(2) + A3^(2)} f
  = A1^(1) A1^(2) f + A1^(1) A2^(2) f + A1^(1) A3^(2) f
  + A2^(1) A1^(2) f + A2^(1) A2^(2) f + A2^(1) A3^(2) f
  = β1 + β2 x + f2(x) + b1w + b2w x + f1,2(w, x),   (9.25)

where the first three terms are fixed effects representing the population mean function, and the last three terms are random effects representing the departure of individual w from the population mean function. Since both the first and the last three terms are orthogonal components in W2²[0, 1], both the population mean function and the individual departure are modeled by cubic splines.
Based on the SS ANOVA decomposition (9.25), we may consider the following model for observed data

yij = β1 + β2 xij + f2(xij) + b1i + b2i xij + f1,2(i, xij) + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni.

Let bi = (b1i, b2i)^T and assume that bi are iid N(0, σ²D). One possible model for the nonparametric random effects f1,2 is to assume that f1,2(i, x) is a stochastic process on [0, 1] with mean zero and covariance function σ1² R1(x, y), where R1(x, y) is the RK of W2²[0, 1] ⊖ {1, x} defined in (2.4). It is obvious that the linear growth curve model (9.2) is a special case of model (9.25) with f2 = 0 and σ1² = 0.
9.3 Semiparametric Nonlinear Mixed-Effects Models
9.3.1 The Model
Nonlinear mixed-effects (NLME) models extend LME models by allowing the regression function to depend on fixed and random effects through a nonlinear function. Consider the following NLME model proposed by Lindstrom and Bates (1990) for clustered data:

yij = ψ(φij; xij) + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni,
φij = Sij β + Zij bi,  bi iid N(0, σ²D),   (9.26)

where m is the number of clusters, ni is the number of observations from the ith cluster, yij is the jth observation in cluster i, ψ is a known function of a covariate x, φij is a q-vector of parameters, ǫij are random errors, Sij and Zij are design matrices for fixed and random effects, respectively, β is a q1-vector of population parameters (fixed effects), and bi is a q2-vector of random effects for cluster i. Let ǫi = (ǫi1, . . . , ǫini)^T. We assume that ǫi ∼ N(0, σ²Λi), bi and ǫi are mutually independent, and observations from different clusters are independent.
The first-stage model in (9.26) relates the conditional mean of the response variable to the covariate x and the parameters φij. The second-stage model relates the parameters φij to fixed and random effects. Covariate effects can be incorporated into the second-stage model.
As an extension of the NLME model, Ke and Wang (2001) proposed the following class of semiparametric nonlinear mixed-effects (SNM) models:

yij = Nij(φij, f) + ǫij,  i = 1, . . . ,m; j = 1, . . . , ni,
φij = Sij β + Zij bi,  bi iid N(0, σ²D),   (9.27)

where φij is a q-vector of parameters, β is a q1-vector of fixed effects, bi is a q2-vector of random effects for cluster i, f = (f1, . . . , fr) are unknown functions, fk belongs to an RKHS Hk on an arbitrary domain Xk for k = 1, . . . , r, and Nij are known nonlinear functionals on R^q × H1 × · · · × Hr. As in the NLME model, we assume that ǫi = (ǫi1, . . . , ǫini)^T ∼ N(0, σ²Λi), bi and ǫi are mutually independent, and observations from different clusters are independent.
It is clear that the SNM model (9.27) is an extension of the SNR model (8.30) with an additional mixed-effects second-stage model. Similar to the SNR model, certain constraints may be required to make an SNM model identifiable, and often these constraints can be achieved by removing certain components from the model spaces for parameters and f. The NLME model (9.26) is a special case when N is linear in both φ and f.
Let n = ∑_{i=1}^m ni, yi = (yi1, . . . , yini)^T, y = (y1^T, . . . , ym^T)^T, φi = (φi1^T, . . . , φini^T)^T, φ = (φ1^T, . . . , φm^T)^T, ηi(φi, f) = (Ni1(φi1, f), . . . , Nini(φini, f))^T, η(φ, f) = (η1^T(φ1, f), . . . , ηm^T(φm, f))^T, ǫ = (ǫ1^T, . . . , ǫm^T)^T, b = (b1^T, . . . , bm^T)^T, Λ = diag(Λ1, . . . , Λm), S = (S11^T, . . . , S1n1^T, . . . , Sm1^T, . . . , Smnm^T)^T, Zi = (Zi1^T, . . . , Zini^T)^T, Z = diag(Z1, . . . , Zm), and D = diag(D, . . . , D). Then model (9.27) can be written in the vector form

y = η(φ, f) + ǫ,  ǫ ∼ N(0, σ²Λ),
φ = Sβ + Zb,  b ∼ N(0, σ²D).   (9.28)

Note that model (9.28) is more general than (9.27) in the sense that other SNM models may also be written in this form. We will discuss estimation and inference procedures for the general model (9.28).
9.3.2 Estimation and Inference
Suppose Hk = Hk0 ⊕ Hk1, where Hk0 = span{φk1, . . . , φkpk} and Hk1 is an RKHS with RK Rk1. Assume that D and Λ depend on an unknown parameter vector τ. We need to estimate β, f, τ, σ², and b. The marginal likelihood based on model (9.28) is

L(β, f, τ, σ²) = (2πσ²)^{−(mq2+n)/2} |D|^{−1/2} |Λ|^{−1/2} ∫ exp{−g(b)/σ²} db,

where

g(b) = (1/2){(y − η(Sβ + Zb, f))^T Λ⁻¹ (y − η(Sβ + Zb, f)) + b^T D⁻¹ b}.
For fixed τ and σ², we estimate β and f as the minimizers of the following penalized likelihood (PL)

l(β, f, τ, σ²) + (nλ/σ²) ∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖²,   (9.29)
where l(β, f, τ, σ²) = −2 log L(β, f, τ, σ²), Pk1 is the projection operator onto Hk1 in Hk, and λθk⁻¹ are smoothing parameters.

The integral in the marginal likelihood is usually intractable because η may depend on b nonlinearly. We now derive an approximation to the log-likelihood using the Laplace method. Let G = ∂η(Sβ + Zb, f)/∂b^T and b̃ be the solution to

∂g(b)/∂b = −G^T Λ⁻¹ (y − η(Sβ + Zb, f)) + D⁻¹ b = 0.   (9.30)
Approximating the Hessian by ∂²g(b)/∂b∂b^T ≈ G^T Λ⁻¹ G + D⁻¹ and applying the Laplace method for integral approximation, the function l can be approximated by

l̃(β, f, τ, σ²) = n log 2πσ² + log |Λ| + log |I + G̃^T Λ⁻¹ G̃ D|
  + (1/σ²)(y − η(Sβ + Zb̃, f))^T Λ⁻¹ (y − η(Sβ + Zb̃, f)) + (1/σ²) b̃^T D⁻¹ b̃,   (9.31)

where G̃ = G evaluated at b = b̃. Replacing l by l̃, ignoring the dependence of G̃ on β and f, and dropping constant terms, the PL (9.29) reduces to
(y − η(Sβ + Zb̃, f))^T Λ⁻¹ (y − η(Sβ + Zb̃, f)) + b̃^T D⁻¹ b̃ + nλ ∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖².   (9.32)
It is not difficult to see that estimating b by equation (9.30), and β and f by minimizing (9.32), is equivalent to estimating b, β, and f jointly as minimizers of the following penalized Henderson (hierarchical) likelihood

(y − η(Sβ + Zb, f))^T Λ⁻¹ (y − η(Sβ + Zb, f)) + b^T D⁻¹ b + nλ ∑_{k=1}^r θk⁻¹ ‖Pk1 fk‖².   (9.33)
Denote the estimates of β and f as β̂ and f̂. We now discuss the estimation of τ and σ² with β, f, and b fixed at β̂, f̂, and b̃. Let W̃⁻¹ = Λ + G̃DG̃^T, η̃ = η(Sβ̂ + Zb̃, f̂), and U = (D⁻¹ + G̃^T Λ⁻¹ G̃)⁻¹. Since W̃ = Λ⁻¹ − Λ⁻¹ G̃ U G̃^T Λ⁻¹ and G̃^T Λ⁻¹ (y − η̃) = D⁻¹ b̃, then

(y − η̃ + G̃b̃)^T W̃ (y − η̃ + G̃b̃)
  = (y − η̃ + G̃b̃)^T Λ⁻¹ (y − η̃ + G̃b̃) − (y − η̃ + G̃b̃)^T Λ⁻¹ G̃ U G̃^T Λ⁻¹ (y − η̃ + G̃b̃)
  = (y − η̃ + G̃b̃)^T Λ⁻¹ (y − η̃ + G̃b̃) − (y − η̃ + G̃b̃)^T Λ⁻¹ G̃ U (D⁻¹ + G̃^T Λ⁻¹ G̃) b̃
  = (y − η̃ + G̃b̃)^T Λ⁻¹ (y − η̃ + G̃b̃) − (y − η̃ + G̃b̃)^T Λ⁻¹ G̃ b̃
  = (y − η̃)^T Λ⁻¹ (y − η̃) + b̃^T G̃^T Λ⁻¹ (y − η̃)
  = (y − η̃)^T Λ⁻¹ (y − η̃) + b̃^T D⁻¹ b̃.
Note that, based on Theorem 18.1.1 in Harville (1997), |W̃⁻¹| = |Λ + G̃DG̃^T| = |Λ| |I + G̃^T Λ⁻¹ G̃ D|. Then we can reexpress l̃ in (9.31) as

l̃(β, f, τ, σ²) = n log 2π + log |σ²W̃⁻¹| + (1/σ²) ẽ^T W̃ ẽ,   (9.34)

where ẽ = y − η(Sβ + Zb̃, f) + G̃b̃. Plugging the estimates of β and f into (9.34), profiling with respect to σ², and dropping constant terms, we estimate τ as the minimizer of the following approximate negative log-likelihood:

log |W̃⁻¹| + log(ê^T W̃ ê),   (9.35)
where ê = y − η(Sβ̂ + Zb̃, f̂) + G̃b̃. The resulting estimate of τ is denoted as τ̂. To account for the loss of degrees of freedom for estimating β and f, we estimate σ² by

σ̂² = ê^T W̃(τ̂) ê / (n − q1 − df(f̂)),   (9.36)

where q1 is the dimension of β, and df(f̂) is a properly defined degrees of freedom for estimating f. As in Section 8.3.3, we set df(f̂) = tr(H*), where H* is the hat matrix computed at convergence.
Assume priors (8.12) for fk, k = 1, . . . , r. Usually the posterior distri-bution does not have a closed form. Expanding η at b, we consider thefollowing approximate Bayes model
\[
y \approx \eta(S\beta + Z\hat b, f) - G\hat b + Gb + \epsilon,
\tag{9.37}
\]
where priors for f are given in (8.12), and priors for β are N(0, κI_{q₁}). Ignoring the dependence of G on β and f and combining random effects with random errors, model (9.37) reduces to an SNR model. Then methods discussed in Section 8.3.3 can be used to draw inference about β, τ, and f. In particular, approximate Bayesian confidence intervals can be constructed for f. The bootstrap approach may also be used to draw inference about β, τ, and f.
9.3.3 Implementation and the snm Function
Since f may interact with β and b in a complicated way, it is usually impossible to solve (9.33) and (9.35) directly. The following iterative procedure will be used.
Algorithm for SNM models
1. Initialize: Set initial values for β, f , b, and τ .
Semiparametric Mixed-Effects Models 287
2. Cycle: Alternate between (a), (b), and (c) until convergence:
(a) Conditional on current estimates of β, b, and τ, update f by solving (9.33).
(b) Conditional on current estimates of f and τ, update β and b by solving (9.33).
(c) Conditional on current estimates of f, β, and b, update τ by solving (9.35).
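The control flow of steps (a)–(c) can be sketched generically. The snippet below is only an illustration of the cycle-until-convergence pattern, not the SNM updates themselves: the update functions stand in for solving (9.33) and (9.35), and are demonstrated here on a toy quadratic objective whose coordinate-wise minimizers are known in closed form.

```python
def alternate(updates, state, tol=1e-8, max_iter=200):
    """Generic alternation mirroring steps (a)-(c): apply each update in
    turn and stop once no component of the state moves by more than tol."""
    for _ in range(max_iter):
        old = dict(state)
        for update in updates:
            state = update(state)
        if all(abs(state[k] - old[k]) < tol for k in state):
            break
    return state

# Toy illustration: minimize (2 - a - c)^2 + a^2 + c^2 by cycling through
# the exact coordinate-wise minimizers; the fixed point is a = c = 2/3.
step_a = lambda s: {**s, "a": (2.0 - s["c"]) / 2.0}
step_c = lambda s: {**s, "c": (2.0 - s["a"]) / 2.0}
result = alternate([step_a, step_c], {"a": 0.0, "c": 0.0})
```

In the actual algorithm each "update" is itself a model fit (a smoothing spline fit for step (a), an LME fit for steps (b) and (c)), so convergence is typically monitored on the fitted values rather than on raw parameters.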
Note that step (b) corresponds to the pseudo-data step, and step (c) corresponds to part of the LME step in the Lindstrom–Bates algorithm (Lindstrom and Bates 1990). Consequently, steps (b) and (c) can be implemented by the nlme function in the nlme library. We now discuss the implementation of step (a). Note that β, b, and τ are fixed as the current estimates, say, β⁻, b⁻, and τ⁻. To update f at step (a), we need to fit the first-stage model
\[
y = \eta(\phi^-, f) + \epsilon, \quad \epsilon \sim N(0, \sigma^2 \Lambda^-),
\tag{9.38}
\]
where φ⁻ = Sβ⁻ + Zb⁻, and Λ⁻ is the covariance matrix with τ fixed at τ⁻. First consider the special case when the Nij in (9.27) are linear in f and involve evaluational functionals. Then model (9.27) can be rewritten as
\[
y_{ij} = \alpha(\phi^-; x_{ij}) + \sum_{k=1}^{r} \delta_k(\phi^-; x_{ij})\, f_k(\gamma_k(\phi^-; x_{ij})) + \epsilon_{ij},
\tag{9.39}
\]
which has the same form as the general SEMOR (self-modeling nonlinear regression) model (8.32). Since φ⁻ and Λ⁻ are fixed, the procedure described in Section 8.2.2 can be used to update f. In particular, the smoothing parameters may be estimated by the UBR, GCV, or GML method at this step. When η is nonlinear in f, the EGN (extended Gauss–Newton) procedure in Section 8.3.3 can be used to update f.
The snm function in the assist library implements the algorithm when the Nij in (9.27) are linear in f and involve evaluational functionals. A typical call to the snm function is
snm(formula, func, fixed, random, start)
where the arguments formula and func serve the same purposes as those in the nnr function and are specified in the same manner. Following the syntax in nlme, the fixed and random arguments specify the fixed and random effects models in the second-stage model. The option start specifies initial values for all parameters in the fixed effects. An object of snm class is returned. The generic function summary can be applied to extract further information. Predictions at the population level can be computed using the predict function. At convergence, approximate Bayesian confidence intervals can be constructed as in Section 8.3.3. Posterior means and standard deviations for f can be computed using the intervals function. Examples can be found in Section 9.4.
9.4 Examples
9.4.1 Ozone in Arosa — Revisit
We have fitted SS ANOVA models in Section 4.9.2 and trigonometric spline models with heterogeneous and correlated errors in Section 5.4.2 to the Arosa data. An alternative approach is to consider the observations as a long time series. Define a time variable as t = (month − 0.5)/12 + year − 1 with domain [0, b], where b = 45.46 is the maximum value of t. Denote {(yi, ti), i = 1, . . . , 518} as observations on the response variable thick and time t. Consider the following semiparametric linear regression model
\[
y_i = \beta_1 + \beta_2 \sin(2\pi t_i) + \beta_3 \cos(2\pi t_i) + \beta_4 t_i + f(t_i) + \epsilon_i,
\tag{9.40}
\]
where sin(2πt) and cos(2πt) model the seasonal trend, β₄t + f(t) models the long-term trend, and the ǫi are random errors. Note that the space for the parametric component, H₀ = span{1, t, cos 2πt, sin 2πt}, is the same as the null space of the linear-periodic spline defined in Section 2.11.4 with τ = 2π. Therefore, we model the nonparametric function f using the model space W₂⁴[0, b] ⊖ H₀ (see Section 2.11.4 for more details).
Observations close in time are likely to be correlated. We consider the exponential correlation structure with a nugget effect. Specifically, write ǫi = ǫ(ti) and assume that ǫ(t) is a zero-mean stochastic process with correlation structure
\[
\mathrm{Cov}(\epsilon(s), \epsilon(t)) =
\begin{cases}
\sigma^2 (1 - c_0) \exp(-|s - t|/\rho), & s \neq t, \\
\sigma^2, & s = t,
\end{cases}
\tag{9.41}
\]
where ρ is the range parameter and c₀ is the nugget effect.
In the following we fit model (9.40) with correlation structure (9.41) and compute the overall fit as well as estimates of the seasonal and long-term trends:
> Arosa$t <- (Arosa$month-0.5)/12+Arosa$year-1
> arosa.ls.fit3 <- ssr(thick~t+sin(2*pi*t)+cos(2*pi*t),
rk=lspline(2*pi*t,type="linSinCos"), spar="m",
corr=corExp(form=~t,nugget=T), data=Arosa)
> summary(arosa.ls.fit3)
...
GML estimate(s) of smoothing parameter(s) : 19461.83
Equivalent Degrees of Freedom (DF): 4.529238
Estimate of sigma: 17.32503
Correlation structure of class corExp representing
range nugget
0.3529361 0.6366844
> tm <- matrix(c(1,1,1,1,1,1,0,1,1,0,0,1,0,0,1), ncol=5,
byrow=T)
> grid3 <- data.frame(t=seq(0,max(Arosa$t)+0.001,len=500))
> arosa.ls.fit3.p <- predict(arosa.ls.fit3, newdata=grid3,
terms=tm)
Figure 9.1 shows the overall fit as well as estimates of the seasonal and long-term trends. We can see that the long-term trend is not significantly different from zero.
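As a minimal sketch (not the internal corExp implementation), the correlation matrix implied by (9.41), with 1 on the diagonal and (1 − c₀)exp(−|s − t|/ρ) off it, can be formed directly. The parameter values below are illustrative only, chosen near the range and nugget estimates reported above.

```python
import math

def exp_corr_nugget(times, rho, c0):
    """Correlation matrix under an exponential structure with a nugget:
    1 on the diagonal, (1 - c0) * exp(-|s - t| / rho) off the diagonal."""
    n = len(times)
    return [[1.0 if i == j
             else (1.0 - c0) * math.exp(-abs(times[i] - times[j]) / rho)
             for j in range(n)]
            for i in range(n)]

# Illustrative values close to the estimates reported above
C = exp_corr_nugget([0.0, 0.5, 1.0], rho=0.353, c0=0.637)
```

Note the discontinuity at lag zero: the nugget c₀ caps the correlation of two distinct observations at 1 − c₀ no matter how close they are in time.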
Sometimes it is desirable to use a stochastic process to model the autocorrelation and regard this process as part of the signal. Then we need to separate this process from other errors and predict it at desired points. Specifically, consider the following SLM model
\[
y_i = \beta_1 + \beta_2 \sin(2\pi t_i) + \beta_3 \cos(2\pi t_i) + \beta_4 t_i + f(t_i) + u(t_i) + \epsilon_i,
\tag{9.42}
\]
where u(t) is a stochastic process, independent of the ǫi, with mean zero and Cov(u(s), u(t)) = σ₁² exp(−|s − t|/ρ) with range parameter ρ. Let t = (t₁, . . . , t₅₁₈)ᵀ be the vector of design points for the variable t, and let u = (u(t₁), . . . , u(t₅₁₈))ᵀ be the vector of the u process evaluated at the design points. Then u are random effects, and u ∼ N(0, σ₁²D), where D is a covariance matrix whose (i, j)th element equals exp(−|ti − tj|/ρ). The SLM model (9.42) cannot be fitted directly using the slm function since D depends on the unknown range parameter ρ nonlinearly. We fit model (9.42) in two steps. We first regard u as part of the random errors and estimate the range parameter; this is accomplished in arosa.ls.fit3. Then we calculate the estimate of D without the nugget effect and regard it as the true covariance matrix. We calculate the Cholesky decomposition of D as D = ZZᵀ and transform the random effects as u = Zb, where b ∼ N(0, σ₁²I). Then we fit the transformed SLM model and compute the overall fit, estimate of the seasonal trend, and estimate of the long-term trend as follows:
> tau <- coef(arosa.ls.fit3$cor.est, F)
FIGURE 9.1 Arosa data, plots of (a) observations and the overall fit, (b) estimate of the seasonal trend, and (c) estimate of the long-term trend with 95% Bayesian confidence intervals (shaded region). Estimates are based on the fit to model (9.40) with correlation structure (9.41).
> D <- corMatrix(Initialize(corExp(tau[1],form=~t),
data=Arosa))
> Z <- chol.new(D)
> arosa.ls.fit4 <- slm(thick~t+sin(2*pi*t)+cos(2*pi*t),
rk=lspline(2*pi*t,type="linSinCos"),
random=list(pdIdent(~Z-1)), data=Arosa)
> arosa.ls.fit4.p <- intervals(arosa.ls.fit4,
newdata=grid3, terms=tm)
where chol.new is a function in the assist library for Cholesky decomposition. Suppose we want to predict u on grid points and denote v as the vector of the u process evaluated at these grid points. Let R = Cov(v, u). Note that Z is a square invertible matrix since D is invertible. Then v = RD⁻¹u = RZ⁻ᵀb. We compute the prediction as follows:
> newdata <- data.frame(t=c(Arosa$t,grid3$t))
> RD <- corMatrix(Initialize(corExp(tau[1],form=~t),
data=newdata))
> R <- RD[(length(Arosa$t)+1):length(newdata$t),
1:length(Arosa$t)]
> b <- as.vector(arosa.ls.fit4$lme.obj$coef$random[[2]])
> u.new <- R%*%t(solve(Z))%*%b
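The identity v = RD⁻¹u = RZ⁻ᵀb behind this computation (with D = ZZᵀ and u = Zb) can be checked numerically. The pure-Python routines below are illustrative stand-ins for chol.new and solve, applied to a small hand-picked positive definite D.

```python
import math

def chol(A):
    """Lower-triangular L with A = L L^T (A symmetric positive definite)."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            L[i][j] = math.sqrt(A[i][i] - s) if i == j else (A[i][j] - s) / L[j][j]
    return L

def solve_lower(L, y):
    """Forward substitution: solve L x = y for lower-triangular L."""
    x = [0.0] * len(y)
    for i in range(len(y)):
        x[i] = (y[i] - sum(L[i][k] * x[k] for k in range(i))) / L[i][i]
    return x

def solve_upper_from_lower(L, y):
    """Back substitution: solve L^T x = y."""
    n = len(y)
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        x[i] = (y[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

# D = Z Z^T and u = Z b imply D^{-1} u = Z^{-T} b, so R D^{-1} u = R Z^{-T} b
D = [[1.0, 0.5, 0.25], [0.5, 1.0, 0.5], [0.25, 0.5, 1.0]]
b = [0.3, -0.7, 1.1]
Z = chol(D)
u = [sum(Z[i][k] * b[k] for k in range(3)) for i in range(3)]
lhs = solve_upper_from_lower(Z, solve_lower(Z, u))   # D^{-1} u via two solves
rhs = solve_upper_from_lower(Z, b)                   # Z^{-T} b
```

The practical gain is that Z⁻ᵀb needs one triangular solve, avoiding the explicit inversion of D.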
Figure 9.2 shows the overall fit, estimate of the seasonal trend, estimate of the long-term trend, and prediction of u(t) (local stochastic trend). The bump during 1940 in Figure 4.10 shows up in the prediction of the local stochastic trend.
9.4.2 Lake Acidity — Revisit
We have shown how to investigate the geological location effect using an SS ANOVA model in Section 5.4.4. We now describe an alternative approach using random effects. We use the same notation defined in Section 5.4.4. Consider the following SLM model:
\[
\mathrm{pH}(x_{i1}, x_{i2}) = f(x_{i1}) + \beta_1 x_{i21} + \beta_2 x_{i22} + u(x_{i2}) + \epsilon(x_{i1}, x_{i2}),
\tag{9.43}
\]
where f ∈ W₂²(ℝ), u(x₂) is a spatial process, and ǫ(x₁, x₂) are random errors independent of the spatial process. Model (9.43) separates the contribution of the spatial correlation from random errors and regards the spatial process as part of the signal. Assume that u(x₂) is a zero-mean process, independent of ǫ(x₁, x₂), with exponential correlation structure Cov(u(x₂), u(z₂)) = σ₁² exp{−d(x₂, z₂)/ρ}, where ρ is the range parameter and d(x₂, z₂) is the Euclidean distance.
FIGURE 9.2 Arosa data, plots of (a) observations and the overall fit, (b) estimate of the seasonal trend, (c) estimate of the long-term trend with 95% Bayesian confidence intervals (shaded region), and (d) prediction of the local stochastic trend. Estimates are based on the fit to model (9.42).
Let u be the vector of the u process evaluated at the design points. Then u are random effects and u ∼ N(0, σ₁²D), where the covariance matrix D = {exp(−d(xi2, xj2)/ρ)}, i, j = 1, . . . , n, depends on the unknown parameter ρ nonlinearly. Therefore, again, we fit model (9.43) in two steps. We first regard u as part of the random errors, estimate the range parameter, and calculate the estimated covariance matrix without the nugget effect:
> temp <- ssr(ph~x1+x21+x22, rk=tp(x1), data=acid,
corr=corExp(form=~x21+x22, nugget=T), spar="m")
> tau <- coef(temp$cor.est, F)
> D <- corMatrix(Initialize(corExp(tau[1],form=~x21+x22),
data=acid))
Consider the estimated D as the true covariance matrix. Then we calculate the Cholesky decomposition of D as D = ZZᵀ and transform the random effects as u = Zb, where b ∼ N(0, σ₁²I). Now we are ready to fit the transformed SLM model:
> Z <- chol.new(D)
> acid.slm.fit <- slm(ph~x1+x21+x22, rk=tp(x1), data=acid,
random=list(pdIdent(~Z-1)))
We then calculate the estimated effect of calcium:
> grid1 <- data.frame(
x1=seq(min(acid$x1),max(acid$x1),len=100),
x21=min(acid$x21), x22=min(acid$x22))
> acid.slm.x1.p <- intervals(acid.slm.fit, newdata=grid1,
terms=c(0,1,0,0,1))
Let v be the vector of the u process evaluated at the grid points at which we wish to predict the location effect. Let R = Cov(v, u). Then v = RD⁻¹u = RZ⁻ᵀb.
> grid2 <- expand.grid(
x21=seq(min(acid$x21)-.001,max(acid$x21)+.001,len=20),
x22=seq(min(acid$x22)-.001,max(acid$x22)+.001,len=20))
> newdata <- data.frame(z1=c(acid$x21,grid2$x21),
z2=c(acid$x22,grid2$x22))
> RD <- corMatrix(Initialize(corExp(tau[1], form=~z1+z2),
data=newdata))
> R <- RD[(length(acid$x21)+1):length(newdata$z1),
1:length(acid$x21)]
> b <- as.vector(acid.slm.fit$lme.obj$coef$random[[2]])
> u.new <- R%*%t(solve(Z))%*%b
> acid.slm.x2.p <- u.new+
acid.slm.fit$lme.obj$coef$fixed[3]*grid2$x21+
acid.slm.fit$lme.obj$coef$fixed[4]*grid2$x22
Figure 9.3 plots the estimated calcium (x1) effect and the prediction of the location (x2) effect.
FIGURE 9.3 Lake acidity data, the left panel includes observations and the estimate of the calcium effect f (the constant plus the main effect of x1), and the right panel includes the prediction of the location effect β₁x₂₁ + β₂x₂₂ + u(x₂).
9.4.3 Coronary Sinus Potassium in Dogs
The dog data set contains measurements of coronary sinus potassium concentration from dogs in four groups: control, extrinsic cardiac denervation three weeks prior to coronary occlusion, extrinsic cardiac denervation immediately prior to coronary occlusion, and bilateral thoracic sympathectomy and stellectomy three weeks prior to coronary occlusion. We are interested in (i) estimating the group (treatment) effects, (ii) estimating the group mean concentration as a function of time, and (iii) predicting the response over time for each dog. There are two categorical covariates, group and dog, and a continuous covariate, time. We code the group factor as 1 to 4, and the observed dog factor as 1 to 36. Coronary sinus potassium concentrations for all dogs are shown in Figure 9.4.
FIGURE 9.4 Dog data, coronary sinus potassium concentrations over time for each dog. Solid thick lines link within-group average concentrations at each time point.

Let t be the time variable transformed into [0, 1]. We treat group and time as fixed factors. From the design, the dog factor is nested within the group factor. We treat dog as a random factor. For group k, denote Bk as the population from which the dogs in group k were drawn, and Pk as the sampling distribution. Assume the following model
\[
y_{kwj} = f(k, w, t_j) + \epsilon_{kwj}, \quad k = 1, \ldots, 4; \; w \in B_k; \; t_j \in [0, 1],
\tag{9.44}
\]
where ykwj is the observed potassium concentration at time tj from dog w in the population Bk, f(k, w, tj) is the true concentration at time tj of dog w in the population Bk, and ǫkwj are random errors. The function f(k, w, t) is defined on {{1} ⊗ B₁, {2} ⊗ B₂, {3} ⊗ B₃, {4} ⊗ B₄} ⊗ [0, 1]. Note that f(k, w, t) is a random variable since w is a random sample from Bk. What we observe are realizations of this true mean function plus random errors. We use the label i to denote dogs we actually observe.
Suppose we want to model the time effect using the cubic spline model space W₂²[0, 1] under the construction in Section 2.6, and the group effect using the classical one-way ANOVA model space ℝ⁴ under the construction in Section 4.3.1. Define the following four averaging operators:
\begin{align*}
A_2 f &= \int_{B_k} f(k, w, t)\, dP_k(w), &
A_1 f &= \frac{1}{4} \sum_{k=1}^{4} A_2 f(k, t), \\
A_3 f &= \int_0^1 f(k, w, t)\, dt, &
A_4 f &= \left( \int_0^1 \frac{\partial f(k, w, t)}{\partial t}\, dt \right) (t - 0.5).
\end{align*}
Then we have the following SS ANOVA decomposition:
\begin{align*}
f &= \{A_1 + (A_2 - A_1) + (I - A_2)\}\{A_3 + A_4 + (I - A_3 - A_4)\} f \\
&= A_1 A_3 f + A_1 A_4 f + A_1 (I - A_3 - A_4) f \\
&\quad + (A_2 - A_1) A_3 f + (A_2 - A_1) A_4 f + (A_2 - A_1)(I - A_3 - A_4) f \\
&\quad + (I - A_2) A_3 f + (I - A_2) A_4 f + (I - A_2)(I - A_3 - A_4) f \\
&= \mu + \beta(t - 0.5) + s_1(t) + \xi_k + \delta_k(t - 0.5) + s_2(k, t) \\
&\quad + \alpha_{w(k)} + \gamma_{w(k)}(t - 0.5) + s_3(k, w, t),
\tag{9.45}
\end{align*}
where µ is a constant, β(t − 0.5) is the linear main effect of time, s1(t) is the smooth main effect of time, ξk is the main effect of group, δk(t − 0.5) is the smooth-linear interaction between time and group, s2(k, t) is the smooth-smooth interaction between time and group, αw(k) is the main effect of dog, γw(k)(t − 0.5) is the smooth-linear interaction between time and dog, and s3(k, w, t) is the smooth-smooth interaction between time and dog. The overall main effect of time equals β(t − 0.5) + s1(t), the overall interaction between time and group equals δk(t − 0.5) + s2(k, t), and the overall interaction between time and dog equals γw(k)(t − 0.5) + s3(k, w, t). The first six terms are fixed effects. The last three terms are random effects since they depend on the random variable w. Depending on time only, the first three terms represent the mean curve for all dogs. The middle three terms measure the departure of the mean curve for a particular group from the mean curve for all dogs. The last three terms measure the departure of a particular dog from the mean curve of the population from which the dog was chosen.
Based on the SS ANOVA decomposition (9.45), we will fit the following three models:

• Model 1 includes the first seven terms in (9.45). It has a different population mean curve for each group plus a random intercept for each dog. We assume that the αi are iid N(0, σ₁²), the ǫkij are iid N(0, σ²), and the random effects and random errors are mutually independent.

• Model 2 includes the first eight terms in (9.45). It has a different population mean curve for each group plus a random intercept and a random slope for each dog. We assume that the (αi, γi) are iid N(0, σ²D₁), where D₁ is an unstructured covariance matrix, the ǫkij are iid N(0, σ²), and the random effects and random errors are mutually independent.

• Model 3 includes all nine terms in (9.45). It has a different population mean curve for each group plus a random intercept, a random slope, and a smooth random effect for each dog. We assume that the (αi, γi) are iid N(0, σ²D₁), the s3(k, i, t) are stochastic processes, independent between dogs, with mean zero and covariance function σ₂²R₁(s, t), where R₁ is the cubic spline RK given in Table 2.2, the ǫkij are iid N(0, σ²), and the random effects and random errors are mutually independent.
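Table 2.2 is not reproduced in this chunk. Assuming the standard Bernoulli-polynomial construction of W₂²[0, 1] (the Craven–Wahba form, which the construction in Section 2.6 follows), the cubic spline RK R₁ used for the covariance of s3 can be sketched as below; treat this as one common form rather than necessarily the exact entry in Table 2.2.

```python
def k1(x):
    return x - 0.5

def k2(x):
    # scaled Bernoulli polynomial B_2(x) / 2!
    return (k1(x) ** 2 - 1.0 / 12.0) / 2.0

def k4(x):
    # scaled Bernoulli polynomial B_4(x) / 4!
    return (k1(x) ** 4 - k1(x) ** 2 / 2.0 + 7.0 / 240.0) / 24.0

def R1(s, t):
    """Cubic spline reproducing kernel on [0, 1], Bernoulli-polynomial form."""
    return k2(s) * k2(t) - k4(abs(s - t))
```

Evaluating R₁ at all pairs of design time points yields the matrix D₂ used below, exactly what a kernel function like assist's cubic produces in R.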
Model 1 and Model 2 can be fitted as follows:
> data(dog)
> dog.fit1 <- slm(y~time,
rk=list(cubic(time), shrink1(group),
rk.prod(kron(time-.5),shrink1(group)),
rk.prod(cubic(time),shrink1(group))),
random=list(dog=~1), data=dog)
> dog.fit1
Semi-parametric linear mixed-effects model fit by REML
Model: y ~ time
Data: dog
Log-restricted-likelihood: -180.4784
Fixed: y ~ time
(Intercept) time
3.8716210 0.4339335
Random effects:
Formula: ~1 | dog
(Intercept) Residual
StdDev: 0.4980355 0.3924478
GML estimate(s) of smoothing parameter(s) : 0.0002338082
0.0034079441 0.0038490075 0.0002048518
Equivalent Degrees of Freedom (DF): 13.00259
Estimate of sigma: 0.3924478
Number of Observations: 252
> dog.fit2 <- update(dog.fit1, random=list(dog=~time))
> dog.fit2
Semi-parametric linear mixed-effects model fit by REML
Model: y ~ time
Data: dog
Log-restricted-likelihood: -166.4478
Fixed: y ~ time
(Intercept) time
3.8767107 0.4196788
Random effects:
Formula: ~time | dog
Structure: General positive-definite,
Log-Cholesky parametrization
StdDev Corr
(Intercept) 0.4188186 (Intr)
time 0.5592910 0.025
Residual 0.3403256
GML estimate(s) of smoothing parameter(s) : 1.674916e-04
3.286885e-03 5.781563e-03 8.944897e-05
Equivalent Degrees of Freedom (DF): 13.83309
Estimate of sigma: 0.3403256
Number of Observations: 252
To fit Model 3, we need to find a way to specify the smooth (nonparametric) random effect s3. Let t = (t₁, . . . , t₇)ᵀ be the time points at which measurements were taken for each dog (note that the time points are the same for all dogs), ui = (s3(k, i, t₁), . . . , s3(k, i, t₇))ᵀ, and u = (u₁ᵀ, . . . , u₃₆ᵀ)ᵀ. Then the ui are iid N(0, σ₂²D₂), where D₂ is the RK of a cubic spline evaluated at the design points t. Let D₂ = HHᵀ be the Cholesky decomposition of D₂, D = diag(D₂, . . . , D₂), and Z = diag(H, . . . , H). Then ZZᵀ = D. We can write u = Zb₂, where b₂ ∼ N(0, σ₂²Iₙ) and n = 252. Then we can specify the random effects u using the matrix Z.
> D2 <- cubic(dog$time[1:7])
> H <- chol.new(D2)
> Z <- kronecker(diag(36), H)
> dog$all <- rep(1,36*7)
> dog.fit3 <- update(dog.fit2,
random=list(all=pdIdent(~Z-1),dog=~time))
> summary(dog.fit3)
Semi-parametric Linear Mixed Effects Model fit
Model: y ~ time
Data: dog
Linear mixed-effects model fit by REML
Data: dog
AIC BIC logLik
322.1771 360.9131 -150.0885
Random effects:
Formula: ~Z - 1 | all
Structure: Multiple of an Identity
Z1 Z2 Z3 Z4 Z5 Z6
StdDev: 3.90843 3.90843 3.90843 3.90843 3.90843 3.90843
...
Formula: ~time | dog %in% all
Structure: General positive-definite, Log-Cholesky
parametrization
StdDev Corr
(Intercept) 0.4671544 (Intr)
time 0.5716972 -0.083
Residual 0.2383448
Fixed effects: y ~ time
Value Std.Error DF t-value p-value
(Intercept) 3.885270 0.08408051 215 46.20892 0e+00
time 0.404652 0.10952837 215 3.69449 3e-04
Correlation:
(Intr)
time -0.224
...
Smoothing spline:
GML estimate(s) of smoothing parameter(s): 8.775858e-05
1.561206e-03 3.351111e-03 2.870198e-05
Equivalent Degrees of Freedom (DF): 15.37740
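The call kronecker(diag(36), H) above builds the block-diagonal matrix I₃₆ ⊗ H = diag(H, . . . , H). A minimal sketch of this Kronecker product with an identity factor (with a hypothetical 2 × 2 H in place of the 7 × 7 Cholesky factor):

```python
def kron_identity(m, H):
    """I_m (x) H: block-diagonal matrix with m copies of H on the diagonal."""
    r, c = len(H), len(H[0])
    Z = [[0.0] * (m * c) for _ in range(m * r)]
    for blk in range(m):
        for i in range(r):
            for j in range(c):
                Z[blk * r + i][blk * c + j] = H[i][j]
    return Z

# With D = diag(D2, ..., D2) and D2 = H H^T, it follows that Z Z^T = D.
Z = kron_identity(2, [[1.0, 0.0], [1.0, 1.0]])
```

Because each block multiplies only its own segment of b₂, this construction keeps the per-dog random effects independent across dogs, matching the assumption on s3.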
The dimension of b₂ associated with the smooth random effects s3 equals the total number of observations in the above construction. Therefore, the computation and/or memory required for larger sample sizes can be prohibitive. One approach to stabilize and speed up the computation is to use a low rank approximation (Wood 2003). For generality, suppose the time points for different dogs may be different. Let tij for j = 1, . . . , ni be the time points for dog i, ui = (s3(k, i, ti1), . . . , s3(k, i, tini))ᵀ, and u = (u₁ᵀ, . . . , umᵀ)ᵀ, where m is the number of dogs. Then ui ∼ N(0, σ₂²Πi), where Πi is the RK of a cubic spline evaluated at the design points ti = (ti1, . . . , tini)ᵀ. Let Πi = UiΓiUiᵀ be the eigendecomposition, where Ui is an ni × ni orthogonal matrix, Γi = diag(γi1, . . . , γini), and γi1 ≥ γi2 ≥ · · · ≥ γini are the eigenvalues. Usually some of the eigenvalues are much smaller than the others. Discard the ni − ki smallest eigenvalues and let Hi1 = UiΓi1, where Γi1 is an ni × ki matrix with diagonal elements √γi1, . . . , √γiki and all other elements equal to zero. Then Πi ≈ Hi1Hi1ᵀ and D = diag(Π₁, . . . , Πm) ≈ Z₁Z₁ᵀ, where Z₁ = diag(H₁₁, . . . , Hm1). We can approximate Model 3 using u ≈ Z₁b₂, where b₂ ∼ N(0, σ₂²I_K) and K = Σᵢ₌₁ᵐ ki. The dimension K can be much smaller than n. Low rank approximations for other situations can be constructed similarly. The above approximation procedure is applied to the dog data as follows:
> chol.new1 <- function(Q, cutoff) {
tmp <- eigen(Q)
num <- sum(tmp$values<cutoff)
k <- ncol(Q)-num
t(t(as.matrix(tmp$vector[,1:k]))*sqrt(tmp$values[1:k]))
}
> H1 <- chol.new1(D2, 1e-3)
> Z1 <- kronecker(diag(36), H1)
> dog$all1 <- rep(1,nrow(Z1))
> dog.fit3.1 <- update(dog.fit2,
random=list(all1=pdIdent(~Z1-1), dog=~time))
> summary(dog.fit3.1)
Semi-parametric Linear Mixed Effects Model fit
Model: y ~ time
Data: dog
Linear mixed-effects model fit by REML
Data: dog
AIC BIC logLik
323.4819 362.2180 -150.7409
Random effects:
Formula: ~Z1 - 1 | all1
Structure: Multiple of an Identity
Z11 Z12 Z13 Z14 Z15 Z16
StdDev: 3.65012 3.65012 3.65012 3.65012 3.65012 3.65012
...
Formula: ~time | dog %in% all1
Structure: General positive-definite, Log-Cholesky
parametrization
StdDev Corr
(Intercept) 0.4635580 (Intr)
time 0.5690967 -0.07
Residual 0.2527408
Fixed effects: y ~ time
Value Std.Error DF t-value p-value
(Intercept) 3.883828 0.08418382 215 46.13509 0e+00
time 0.407229 0.11063259 215 3.68092 3e-04
Correlation:
(Intr)
time -0.228
...
Smoothing spline:
GML estimate(s) of smoothing parameter(s): 9.964788e-05
1.764225e-03 3.716363e-03 3.200021e-05
Equivalent Degrees of Freedom (DF): 15.39633
where eigenvalues smaller than 10⁻³ are discarded and ki = 2. The R function chol.new1 computes the truncated Cholesky decomposition. The fitting criteria and parameter estimates are similar to those from the full model. Another fit based on a low rank approximation, with the cutoff value 10⁻³ replaced by 10⁻⁴, produces results almost identical to those from the full model.
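The truncation idea is to keep only the eigenvalues of the RK matrix above a cutoff, which chol.new1 implements via R's eigen. As a sketch of how a dominant eigenpair can be extracted without a full decomposition (this is not what eigen does internally), a pure-Python power iteration on a small symmetric matrix:

```python
import math

def top_eigpair(A, iters=200):
    """Power iteration: dominant eigenvalue and eigenvector of a
    symmetric matrix, via repeated multiplication and normalization."""
    n = len(A)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    # Rayleigh quotient v^T A v gives the eigenvalue estimate
    lam = sum(v[i] * sum(A[i][j] * v[j] for j in range(n)) for i in range(n))
    return lam, v

# Rank-1 truncation A ~ lam * v v^T keeps the dominant part of A
lam, v = top_eigpair([[2.0, 1.0], [1.0, 2.0]])
```

A rank-k truncation then stacks the k leading scaled eigenvectors into the factor Hi1, as in the construction above.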
As discussed in Section 9.2.2, Models 1, 2, and 3 are connected with three LME models, and these connections are used to fit the SLM models. We can compare the three corresponding LME models as follows:
> anova(dog.fit1$lme.obj, dog.fit2$lme.obj,
dog.fit3$lme.obj)
Model df AIC BIC logLik
dog.fit1$lme.obj 1 8 376.9568 405.1285 -180.4784
dog.fit2$lme.obj 2 10 352.8955 388.1101 -166.4478
dog.fit3$lme.obj 3 11 322.1771 360.9131 -150.0885
Test L.Ratio p-value
dog.fit1$lme.obj
dog.fit2$lme.obj 1 vs 2 28.06128 <.0001
dog.fit3$lme.obj 2 vs 3 32.71845 <.0001
Even though they do not compare the three SLM models directly, these comparison results seem to indicate that Model 3 is more favorable. More research on model selection and inference for SLM and SNM models is necessary.
We can calculate estimates of the population mean curves for the four groups based on Model 3 as follows:
> dog.grid <- data.frame(time=rep(seq(0,1,len=50),4),
group=as.factor(rep(1:4,rep(50,4))))
> e.dog.fit3 <- intervals(dog.fit3, newdata=dog.grid,
terms=rep(1,6))
Figure 9.5 shows the estimated mean curves and 95% Bayesian confidence intervals based on the fit dog.fit3.

FIGURE 9.5 Dog data, estimates of the group mean response curves (solid lines) with 95% Bayesian confidence intervals (dotted lines) based on the fit dog.fit3. Squares are within-group average concentrations.
We have shrunk the group mean curves toward the overall mean curve. That is, we have penalized the group main effect ξk and the smooth-linear group-time interaction δk(t − 0.5) in the SS ANOVA decomposition (9.45). From Figure 9.5 we can see that the estimated population mean curve for group 2 is biased upward, while the estimated population mean curve for group 1 is biased downward. This is because responses from group 2 are smaller, while responses from group 1 are larger, than those from groups 3 and 4. Thus, their estimates are pulled toward the overall mean. Shrinkage estimates in this case may not be advantageous since group only has four levels. One may want to leave the ξk and δk(t − 0.5) terms unpenalized to reduce biases. We can rewrite the fixed effects in (9.45) as
\begin{align*}
f_k(t) &\triangleq \mu + \beta(t - 0.5) + s_1(t) + \xi_k + \delta_k(t - 0.5) + s_2(k, t) \\
&= \{\mu + \xi_k\} + \{\beta(t - 0.5) + \delta_k(t - 0.5)\} + \{s_1(t) + s_2(k, t)\} \\
&= \xi_k + \delta_k(t - 0.5) + s_2(k, t),
\tag{9.46}
\end{align*}
where fk(t) is the mean curve for group k and, with a slight abuse of notation, ξk, δk(t − 0.5), and s2(k, t) in the last line denote the combined terms in braces. Assume that fk ∈ W₂²[0, 1]. Define the penalty as ∫₀¹ (f″k(t))² dt = ‖s2(k, t)‖². Then the constant term ξk and the linear term δk(t − 0.5) are not penalized. We can refit Model 1, Model 2, and Model 3 under this new form of penalty as follows:
> dog.fit4 <- slm(y~group*time,
rk=list(rk.prod(cubic(time),kron(group==1)),
rk.prod(cubic(time),kron(group==2)),
rk.prod(cubic(time),kron(group==3)),
rk.prod(cubic(time),kron(group==4))),
random=list(dog=~1), data=dog)
> dog.fit5 <- update(dog.fit4, random=list(dog=~time))
> dog.fit6 <- update(dog.fit5,
random=list(all=pdIdent(~Z-1),dog=~time))
> e.dog.fit6 <- intervals(dog.fit6, newdata=dog.grid,
terms=rep(1,12))
Figure 9.6 shows the estimated mean curves and 95% Bayesian confidence intervals based on the fit dog.fit6. The estimated mean functions are less biased.
FIGURE 9.6 Dog data, estimates of the group mean response curves (solid lines) with 95% Bayesian confidence intervals (dotted lines) based on the fit dog.fit6. Squares are within-group average responses.

We now show how to calculate predictions for all dogs based on the fit dog.fit6. Predictions based on other models may be derived similarly. For a particular dog i in group k, its prediction at time z can be computed as ξ̂k + δ̂k(z − 0.5) + ŝ2(k, z) + α̂i + γ̂i(z − 0.5) + ŝ3(k, i, z). Prediction of the fixed effects can be computed using the predict function. Predictions of the random effects αi and γi can be extracted from the fit. Therefore, we only need to compute ŝ3(k, i, z). Suppose we want to predict s3 for dog i in group k on a vector of points zi = (zi1, . . . , zigi)ᵀ. Let vi = (s3(k, i, zi1), . . . , s3(k, i, zigi))ᵀ and Ci = Cov(vi, ui) = {R₁(zil, tj)}, l = 1, . . . , gi, j = 1, . . . , 7, for i = 1, . . . , 36, where R₁(z, t) is the cubic spline RK given in Table 2.2. Let v = (v₁ᵀ, . . . , v₃₆ᵀ)ᵀ, R = diag(C₁, . . . , C₃₆), and û be the prediction of u. We then can compute the prediction for all dogs as v̂ = RD⁻¹û. The smallest eigenvalue of D is close to zero, and thus D⁻¹ cannot be calculated precisely. We will use an alternative approach that does not require inverting D. Note that u = Zb₂, and denote the estimate of b₂ as b̂₂. If we can find a vector r (it need not be unique) such that
\[
Z^T r = \hat b_2,
\tag{9.47}
\]
then
\[
\hat v = R D^{-1} \hat u = R D^{-1} Z \hat b_2 = R D^{-1} Z Z^T r = R r.
\]
So the task now is to solve (9.47). Let
\[
Z = (Q_1 \; Q_2) \begin{pmatrix} V \\ 0 \end{pmatrix}
\]
be the QR decomposition of Z. We consider r in the space spanned by Q₁: r = Q₁α. Then, from (9.47), α = V⁻ᵀb̂₂. Thus, r = Q₁V⁻ᵀb̂₂ is a solution to (9.47). This approach also applies to the situation when D is singular. In the following we calculate predictions for all 36 dogs on a set of grid points. Note that groups 1, 2, 3, and 4 have 9, 10, 8, and 9 dogs, respectively.
> dog.grid2 <- data.frame(time=rep(seq(0,1,len=50),36),
dog=rep(1:36,rep(50,36)))
> R <- kronecker(diag(36),
cubic(dog.grid2$time[1:50],dog$time[1:7]))
> b1 <- dog.fit6$lme.obj$coef$random$dog
> b2 <- as.vector(dog.fit6$lme.obj$coef$random$all)
> Z.qr <- qr(Z)
> r <- qr.Q(Z.qr)%*%solve(t(qr.R(Z.qr)))%*%b2
> tmp1 <- c(rep(e.dog.fit6$fit[dog.grid$group==1],9),
rep(e.dog.fit6$fit[dog.grid$group==2],10),
rep(e.dog.fit6$fit[dog.grid$group==3],8),
rep(e.dog.fit6$fit[dog.grid$group==4],9))
> tmp2 <- as.vector(rep(b1[,1],rep(50,36)))
> tmp3 <- as.vector(kronecker(b1[,2],dog.grid2$time[1:50]))
> u.new <- as.vector(R%*%r)
> p.dog.fit6 <- tmp1+tmp2+tmp3+u.new
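As an aside on the construction of R above: kronecker(diag(36), C) yields the block-diagonal matrix diag(C, ..., C), which is exactly the structure $R = \mathrm{diag}(C_1, \ldots, C_{36})$ requires when all $C_i$ are evaluated on the same grid. A pure-Python sketch of this identity (toy 2 x 2 block, illustrative only, not the actual RK matrix):

```python
# Kronecker product of an n x n identity with a block C gives
# diag(C, ..., C), mirroring kronecker(diag(36), cubic(...)) in R.
def kron_identity(n, C):
    p = len(C)
    out = [[0.0] * (n * p) for _ in range(n * p)]
    for b in range(n):            # block index along the diagonal
        for i in range(p):
            for j in range(p):
                out[b * p + i][b * p + j] = C[i][j]
    return out

C = [[1.0, 2.0], [3.0, 4.0]]
R = kron_identity(3, C)
print(R[2][2], R[2][3], R[0][2])  # 1.0 2.0 0.0 (second block; off-block zero)
```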
Predictions for dogs 1, 2, 26, and 27 are shown in Figure 9.7.
FIGURE 9.7 Dog data, predictions for dogs 1, 2, 26, and 27. Pluses are observations and solid lines are predictions.
9.4.4 Carbon Dioxide Uptake

The carbon dioxide data set contains five variables: Plant identifies each plant; Type has two levels, Quebec and Mississippi, indicating the origin of the plant; Treatment indicates two treatments, nonchilling and chilling; conc gives ambient carbon dioxide concentrations (mL/L); and uptake gives carbon dioxide uptake rates (umol/m^2 sec). Figure 9.8
shows the CO2 uptake rates for all plants.
The objective of the experiment was to evaluate the effect of plant type and chilling treatment on CO2 uptake. Pinheiro and Bates (2000) gave detailed analyses of this data set based on NLME models. They reached the following NLME model:
\[
\begin{aligned}
\text{uptake}_{ij} &= e^{\phi_{1i}}\{1 - e^{-e^{\phi_{2i}}(\text{conc}_j - \phi_{3i})}\} + \epsilon_{ij},\\
\phi_{1i} &= \beta_{11} + \beta_{12}\text{Type} + \beta_{13}\text{Treatment} + \beta_{14}\text{Treatment:Type} + b_i,\\
\phi_{2i} &= \beta_{21},\\
\phi_{3i} &= \beta_{31} + \beta_{32}\text{Type} + \beta_{33}\text{Treatment} + \beta_{34}\text{Treatment:Type},\\
&\quad i = 1, \ldots, 12;\; j = 1, \ldots, 7,
\end{aligned} \tag{9.48}
\]
where $\text{uptake}_{ij}$ denotes the CO2 uptake rate of plant $i$ at ambient CO2 concentration $\text{conc}_j$; Type equals 0 for plants from Quebec and 1 for plants from Mississippi; Treatment equals 0 for chilled plants and 1 for control plants; $e^{\phi_{1i}}$, $e^{\phi_{2i}}$, and $\phi_{3i}$ denote, respectively, the asymptotic uptake rate, the uptake growth rate, and the maximum ambient CO2 concentration at which no uptake is verified for plant $i$; the random effects $b_i \overset{iid}{\sim} N(0, \sigma_b^2)$; and the random errors $\epsilon_{ij} \overset{iid}{\sim} N(0, \sigma^2)$. Random effects and random errors are mutually independent. Note that we used exponential transformations to enforce the positivity constraints.
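To make the roles of the transformed parameters concrete, the following sketch (Python, with made-up parameter values rather than the fitted ones) evaluates the mean function of (9.48): uptake is zero at conc = phi3 and rises toward the asymptote exp(phi1).

```python
import math

# Mean CO2 uptake in model (9.48): exp(phi1) * (1 - exp(-exp(phi2)*(conc - phi3))).
# exp(phi1) is the asymptotic uptake rate, exp(phi2) the growth rate, and
# phi3 the concentration at which uptake is zero. Values below are illustrative.
def uptake_mean(conc, phi1, phi2, phi3):
    return math.exp(phi1) * (1.0 - math.exp(-math.exp(phi2) * (conc - phi3)))

phi1, phi2, phi3 = math.log(30.0), math.log(0.01), 50.0
ys = [uptake_mean(c, phi1, phi2, phi3) for c in (50, 250, 500, 1000)]
print(ys)  # rises from 0 toward the asymptote exp(phi1) = 30
```

Whatever real values phi1 and phi2 take, exp(phi1) and exp(phi2) are positive, which is exactly why the exponential transformation enforces the positivity constraints.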
> data(CO2)
> co2.nlme <-
nlme(uptake~exp(a1)*(1-exp(-exp(a2)*(conc-a3))),
fixed=list(a1+a2~Type*Treatment,a3~1),
random=a1~1, groups=~Plant, data=CO2,
start=c(log(30),0,0,0,log(0.01),0,0,0,50))
> summary(co2.nlme)
Nonlinear mixed-effects model fit by maximum likelihood
Model: uptake ~ exp(a1)*(1-exp(-exp(a2)*(conc - a3)))
Data: CO2
AIC BIC logLik
393.2869 420.0259 -185.6434
...
Fits of model (9.48) are shown in Figure 9.8 as dotted lines. Based on model (9.48), one may conclude that the CO2 uptake is higher for plants from Quebec, that chilling in general results in lower uptake, and that its effect on Mississippi plants is much larger than on Quebec plants.
We use this data set to demonstrate how to fit an SNM model and how to check whether an NLME model is appropriate. As an extension of (9.48),
FIGURE 9.8 Carbon dioxide data, plot of observations and fitted curves for each plant. Circles are CO2 uptake rates. Solid lines represent SNM model fits from co2.snm. Dotted lines represent NLME model fits from co2.nlme. Strip names represent IDs of plants, with "Q" indicating Quebec, "M" indicating Mississippi, "c" indicating chilled, "n" indicating nonchilled, and "1", "2", "3" indicating the replicate numbers.
we consider the following SNM model
\[
\begin{aligned}
\text{uptake}_{ij} &= e^{\phi_{1i}} f(e^{\phi_{2i}}(\text{conc}_j - \phi_{3i})) + \epsilon_{ij},\\
\phi_{1i} &= \beta_{12}\text{Type} + \beta_{13}\text{Treatment} + \beta_{14}\text{Treatment:Type} + b_i,\\
\phi_{2i} &= \beta_{21},\\
\phi_{3i} &= \beta_{31} + \beta_{32}\text{Type} + \beta_{33}\text{Treatment} + \beta_{34}\text{Treatment:Type},\\
&\quad i = 1, \ldots, 12;\; j = 1, \ldots, 7,
\end{aligned} \tag{9.49}
\]
where $f \in W_2^2[0, b]$ for some fixed $b > 0$, and the second-stage model is similar to that in (9.48). In order to test whether the parametric model (9.48) is appropriate, we use the exponential spline introduced in Section 2.11.2 with $\gamma = 1$. Then $\mathcal{H}_0 = \mathrm{span}\{1, \exp(-x)\}$, and $\mathcal{H}_1 = W_2^2[0, b] \ominus \mathcal{H}_0$ with RK given in (2.58). Note that $\beta_{11}$ in (9.48) is excluded from (9.49) to make $f$ free of constraint on the vertical scale. We need the side conditions that $f(0) = 0$ and $f(x) \neq 0$ for $x \neq 0$ to make $\beta_{31}$ identifiable with $f$. The first condition reduces $\mathcal{H}_0$ to $\mathcal{H}_0 = \mathrm{span}\{1 - \exp(-x)\}$ and is satisfied by all functions in $\mathcal{H}_1$. We do not enforce the second condition because it is satisfied by all reasonable estimates. Thus the model space for $f$ is $\mathcal{H}_0 \oplus \mathcal{H}_1$. It is clear that the NLME model is a special case of the SNM model with $f \in \mathcal{H}_0$. In the following we fit the SNM model (9.49) with initial values chosen from the NLME fit. The procedure converged after five iterations.
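That the NLME model is the $f \in \mathcal{H}_0$ special case can be verified directly: taking $f(u) = 1 - e^{-u}$ in the SNM mean function of (9.49) reproduces the NLME mean function of (9.48) (up to the vertical scale absorbed into $e^{\phi_{1i}}$). A small Python check with made-up parameter values:

```python
import math

# NLME mean from (9.48) and SNM mean from (9.49); with f in H0,
# i.e., f(u) = 1 - exp(-u), the two coincide.
def nlme_mean(conc, phi1, phi2, phi3):
    return math.exp(phi1) * (1.0 - math.exp(-math.exp(phi2) * (conc - phi3)))

def snm_mean(conc, phi1, phi2, phi3, f):
    return math.exp(phi1) * f(math.exp(phi2) * (conc - phi3))

f0 = lambda u: 1.0 - math.exp(-u)  # the single basis function of H0
phi = (math.log(30.0), math.log(0.01), 50.0)  # illustrative parameters

diffs = [abs(nlme_mean(c, *phi) - snm_mean(c, *phi, f0)) for c in (100, 400, 800)]
print(max(diffs))  # essentially zero
```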
> M <- model.matrix(~Type*Treatment, data=CO2)[,-1]
> co2.snm <- snm(uptake~exp(a1)*f(exp(a2)*(conc-a3)),
func=f(u)~list(~I(1-exp(-u))-1,lspline(u,type="exp")),
fixed=list(a1~M-1,a3~1,a2~Type*Treatment),
random=list(a1~1), groups=~Plant, verbose=T,
start=co2.nlme$coe$fixed[c(2:4,9,5:8)], data=CO2)
> summary(co2.snm)
Semi-parametric Nonlinear Mixed Effects Model fit
Model: uptake ~ exp(a1) * f(exp(a2) * (conc - a3))
Data: CO2
AIC BIC logLik
406.4865 441.625 -188.3760
...
GCV estimate(s) of smoothing parameter(s): 1.864814
Equivalent Degrees of Freedom (DF): 4.867183
Converged after 5 iterations
Fits of model (9.49) are shown in Figure 9.8 as solid lines. Since the data set is small, different initial values may lead to different estimates.
However, the overall fits are similar. We also fitted models with AR(1) within-subject correlations and covariate effects on $\phi_3$. None of these models improves the fit significantly. The estimates are comparable to the NLME fit, and the conclusions remain the same as those based on (9.48).
To check whether the parametric NLME model (9.48) is appropriate, we calculate the posterior means and standard deviations using the function intervals. Note that the intervals function returns an object of class "bCI", to which the generic function plot can be applied directly.
> co2.grid2 <- data.frame(u=seq(0.3, 11, len=50))
> co2.ci <- intervals(co2.snm, newdata=co2.grid2,
terms=matrix(c(1,1,1,0,0,1), ncol=2, byrow=T))
> plot(co2.ci,
type.name=c("overall","parametric","smooth"))
FIGURE 9.9 Carbon dioxide data, estimate of the overall function f (left) and its projections onto H0 (center) and H1 (right). Solid lines are fitted values. Dashed lines are approximate 95% Bayesian confidence intervals.
Figure 9.9 shows the estimate of $f$ and its projections onto $\mathcal{H}_0$ and $\mathcal{H}_1$. The zero line is inside the Bayesian confidence intervals for the projection onto $\mathcal{H}_1$ (the smooth component), which suggests that the parametric NLME model (9.48) is adequate.
9.4.5 Circadian Rhythm — Revisit
We fitted an SIM (8.59) for normal subjects in Section 8.4.6, where the parameters $\beta_{i1}$, $\beta_{i2}$, and $\beta_{i3}$ are deterministic. The fixed-effects SIM (8.59) has several drawbacks: (1) potential correlations among observations from the same subject are ignored; (2) the number of deterministic parameters is large; and (3) it is difficult to investigate covariate effects on the parameters and/or the common shape function. In this section we show how to fit the hormone data using SNM models. More details can be found in Wang, Ke and Brown (2003).
We first consider the following mixed-effects SIM for a single group:
\[
\text{conc}_{ij} = \mu + b_{1i} + \exp(b_{2i}) f(\text{time}_{ij} - \mathrm{alogit}(b_{3i})) + \epsilon_{ij}, \quad i = 1, \ldots, m;\; j = 1, \ldots, n_i, \tag{9.50}
\]
where the fixed effect $\mu$ represents the 24-hour mean of the population, and the random effects $b_{1i}$, $b_{2i}$, and $b_{3i}$ represent deviations in 24-hour mean, amplitude, and phase of subject $i$. We assume that $f \in W_2^2(per) \ominus \{1\}$ and $\boldsymbol{b}_i = (b_{1i}, b_{2i}, b_{3i})^T \overset{iid}{\sim} N(0, \sigma^2 D)$, where $D$ is an unstructured positive-definite matrix. The assumption of zero population mean for amplitude and phase, and the removal of constant functions from the periodic spline space, take care of potential confounding between amplitude, phase, and the nonparametric common shape function $f$. We fit model (9.50) to cortisol measurements from normal subjects as follows:
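Here alogit denotes the inverse logistic transform, which maps the unconstrained random effect $b_{3i}$ to a phase shift in $(0, 1)$, the scaled-time interval. The sketch below (Python, with a made-up sinusoid standing in for the periodic spline $f$) shows how the three random effects shift mean, amplitude, and phase:

```python
import math

def alogit(x):
    # inverse logit: maps the real line into (0, 1)
    return math.exp(x) / (1.0 + math.exp(x))

# Subject-level mean in model (9.50): mu + b1 + exp(b2) * f(t - alogit(b3)).
# f must be periodic with period 1; sin(2*pi*u) is used purely for
# illustration, not the fitted spline.
def subject_mean(t, mu, b1, b2, b3, f=lambda u: math.sin(2 * math.pi * u)):
    return mu + b1 + math.exp(b2) * f(t - alogit(b3))

# b1 shifts the 24-hour mean, b2 scales the amplitude on the log scale,
# and alogit(b3) shifts the phase; alogit(0) = 0.5 is half a period.
y = subject_mean(0.25, mu=1.66, b1=0.1, b2=0.0, b3=0.0)
print(round(y, 4))  # prints 0.76
```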
> nor <- horm.cort[horm.cort$type=="normal",]
> nor.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),
func=f(u)~list(periodic(u)),
data=nor, fixed=list(b1~1), random=list(b1+b2+b3~1),
start=c(mean(nor$conc)), groups=~ID, spar="m")
> summary(nor.snm.fit)
Semi-parametric Nonlinear Mixed Effects Model fit
Model: conc ~ b1 + exp(b2) * f(time - alogit(b3))
Data: nor
AIC BIC logLik
176.1212 224.1264 -70.07205
Random effects:
Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)
Level: ID
Structure: General positive-definite,
Log-Cholesky parametrization
StdDev Corr
Semiparametric Mixed-Effects Models 311
b1 0.2462385 b1 b2
b2 0.1803665 -0.628
b3 0.2486114 0.049 -0.521
Residual 0.3952836
Fixed effects: list(b1 ~ 1)
Value Std.Error DF t-value p-value
b1 1.661412 0.07692439 98 21.59799 0
GML estimate(s) of smoothing parameter(s): 0.0001200191
Equivalent Degrees of Freedom (DF): 9.988554
Converged after 10 iterations
We compute predictions for all subjects evaluated at grid points:
> nor.grid <- data.frame(ID=rep(unique(nor$ID),rep(50,9)),
time=rep(seq(0,1,len=50),9))
> nor.snm.p <- predict(nor.snm.fit, newdata=nor.grid)
The predictions are shown in Figure 9.10. In the following we also fit model (9.50) to the depression and Cushing groups. Observations and predictions based on model (9.50) for these two groups are shown in Figures 9.11 and 9.12, respectively.
> dep <- horm.cort[horm.cort$type=="depression",]
> dep.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),
func=f(u)~list(periodic(u)),
data=dep, fixed=list(b1~1), random=list(b1+b2+b3~1),
start=c(mean(dep$conc)), groups=~ID, spar="m")
> cush <- horm.cort[horm.cort$type=="cushing",]
> cush.snm.fit <- snm(conc~b1+exp(b2)*f(time-alogit(b3)),
func=f(u)~list(periodic(u)),
data=cush, fixed=list(b1~1), random=list(b1+b2+b3~1),
start=c(mean(cush$conc)), groups=~ID, spar="m")
We calculate the posterior means and standard deviations of the common shape functions for all three groups:
> ci.grid <- data.frame(time=seq(0,1,len=50))
> nor.ci <- intervals(nor.snm.fit, newdata=ci.grid)
> dep.ci <- intervals(dep.snm.fit, newdata=ci.grid)
> cush.ci <- intervals(cush.snm.fit, newdata=ci.grid)
The estimated common shape functions and 95% Bayesian confidence intervals for the three groups are shown in Figure 9.13.
FIGURE 9.10 Hormone data, normal subjects, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines) and model (9.53) (dashed lines). Subjects' IDs are shown in the strips.
FIGURE 9.11 Hormone data, depressed subjects, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines) and model (9.53) (dashed lines). Subjects' IDs are shown in the strips.
FIGURE 9.12 Hormone data, subjects with Cushing's disease, plots of cortisol concentrations (circles) and fitted curves based on model (9.50) (solid lines). Subjects' IDs are shown in the strips.
FIGURE 9.13 Hormone data, estimates of the common shape function f (lines) and 95% Bayesian confidence intervals (shaded regions). The left three panels are estimates based on model (9.50) for the normal, depression, and Cushing groups, respectively. The right panel is the estimate based on model (9.53) for combined data from the normal and depression groups.
It is obvious that the common shape function for the Cushing group is almost zero, which suggests that, in general, circadian rhythms are lost for Cushing patients. It seems that the shape functions for the normal and depression groups are similar. We now test the hypothesis that the shape functions for the normal and depression groups are the same by fitting data from these two groups jointly. Consider the following model
\[
\text{conc}_{ijk} = \mu_k + b_{1ik} + \exp(b_{2ik}) f(k, \text{time}_{ijk} - \mathrm{alogit}(b_{3ik})) + \epsilon_{ijk}, \quad i = 1, \ldots, m;\; j = 1, \ldots, n_i;\; k = 1, 2, \tag{9.51}
\]
where $k$ represents the group factor with $k = 1$ and $k = 2$ corresponding to the depression and normal groups, respectively; the fixed effect $\mu_k$ is the population 24-hour mean of group $k$; and the random effects $b_{1ik}$, $b_{2ik}$, and $b_{3ik}$ represent the $i$th subject's deviations in 24-hour mean, amplitude, and phase. Note that subjects are nested within groups. We allow different variances for the random effects in each group. That is, we assume that $\boldsymbol{b}_{ik} = (b_{1ik}, b_{2ik}, b_{3ik})^T \overset{iid}{\sim} N(0, \sigma_k^2 D)$, where $D$ is an unstructured positive-definite matrix. We assume different common shape functions for each group. Thus $f$ is a function of both group (denoted as $k$) and time. Since $f$ is periodic in time, we model $f$ using the tensor product space $\mathbb{R}^2 \otimes W_2^2(per)$. Specifically, consider the SS ANOVA decomposition (4.22). The constant and the main effect of group are removed for identifiability with $\mu_k$. Therefore, we assume the following model for $f$:
\[
f(k, \text{time}) = f_1(\text{time}) + f_{12}(k, \text{time}), \tag{9.52}
\]
where $f_1(\text{time})$ is the main effect of time, and $f_{12}(k, \text{time})$ is the interaction between group and time. The hypothesis $H_0: f(1, \text{time}) = f(2, \text{time})$ is equivalent to $H_0: f_{12}(k, \text{time}) = 0$ for all values of the time variable. Model (9.51) is fitted as follows:
> nordep <- horm.cort[horm.cort$type!="cushing",]
> nordep$type <- as.factor(as.vector(nordep$type))
> nordep.fit1 <- snm(conc~b1+exp(b2)*f(type,time-alogit(b3)),
func=f(g,u)~list(list(periodic(u),
rk.prod(shrink1(g),periodic(u)))),
data=nordep, fixed=list(b1~type), random=list(b1+b2+b3~1),
groups=~ID, weights=varIdent(form=~1|type),
spar="m", start=c(1.8,-.2))
> summary(nordep.fit1)
Semi-parametric Nonlinear Mixed Effects Model fit
Model: conc ~ b1 + exp(b2) * f(type, time - alogit(b3))
Data: nordep
AIC BIC logLik
441.4287 542.7464 -191.5463
Random effects:
Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)
Level: ID
Structure: General positive-definite,
Log-Cholesky parametrization
StdDev Corr
b1.(Intercept) 0.3403483 b1.(I) b2
b2 0.2936284 -0.781
b3 0.2962941 0.016 -0.159
Residual 0.4741110
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | type
Parameter estimates:
depression normal
1.000000 0.891695
Fixed effects: list(b1 ~ type)
Value Std.Error DF t-value p-value
b1.(Intercept) 1.8389689 0.08554649 218 21.496719 0.0000
b1.typenormal -0.1179035 0.12360018 218 -0.953911 0.3412
Correlation:
b1.(I)
b1.typenormal -0.692
GML estimate(s) of smoothing parameter(s): 4.022423e-04
2.256957e+02
Equivalent Degrees of Freedom (DF): 19.16807
Converged after 15 iterations
The smoothing parameter for the interaction term $f_{12}(k, \text{time})$ is large, indicating that the interaction is negligible. We compute the posterior means and standard deviations of the interaction term:
> u <- seq(0,1,len=50)
> nordep.inter <- intervals(nordep.fit1, terms=c(0,1),
newdata=data.frame(g=rep(c("normal","depression"),
c(50,50)),u=rep(u,2)))
> range(nordep.inter$fit)
-9.847084e-06 1.349702e-05
> range(nordep.inter$pstd)
0.001422883 0.001423004
The posterior means are of magnitude $10^{-5}$, while the posterior standard deviations are of magnitude $10^{-3}$. The estimate of $f_{12}$ is essentially zero. Therefore, it is appropriate to assume the same shape function for the normal and depression groups.
Under the assumption of one shape function for both normal and depression groups, we can now investigate differences in 24-hour mean, amplitude, and phase between the two groups. For this purpose, consider the following model
\[
\text{conc}_{ijk} = \mu_k + b_{1ik} + \exp(b_{2ik} + \delta_{k,2} d_1) f(\text{time}_{ijk} - \mathrm{alogit}(b_{3ik} + \delta_{k,2} d_2)) + \epsilon_{ijk}, \quad i = 1, \ldots, m;\; j = 1, \ldots, n_i;\; k = 1, 2, \tag{9.53}
\]
where $\delta_{k,2}$ is the Kronecker delta, and the parameters $d_1$ and $d_2$ account for the differences in amplitude and phase, respectively, between the normal and depression groups.
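The Kronecker delta $\delta_{k,2}$ simply switches $d_1$ and $d_2$ on for the normal group ($k = 2$) and off for the depression group ($k = 1$); with $d_2 = 0$ the two groups share a phase and differ in amplitude by the factor $e^{d_1}$. A Python sketch with a made-up shape function and illustrative parameter values:

```python
import math

def alogit(x):
    return math.exp(x) / (1.0 + math.exp(x))

def mean_953(t, k, mu, b1, b2, b3, d1, d2,
             f=lambda u: math.sin(2 * math.pi * u)):
    # delta = 1 only for the normal group (k = 2), so d1 and d2 adjust
    # amplitude and phase of normals relative to depressed subjects.
    delta = 1.0 if k == 2 else 0.0
    return mu + b1 + math.exp(b2 + delta * d1) * f(t - alogit(b3 + delta * d2))

# With d2 = 0 the groups share a phase; the normal-group amplitude is
# inflated by exp(d1). Values here are illustrative, not the fitted ones.
args = dict(t=0.3, mu=1.9, b1=0.0, b2=0.0, b3=0.0, d1=0.21, d2=0.0)
dep = mean_953(k=1, **args)
nor = mean_953(k=2, **args)
print(dep, nor)
```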
> nordep.fit2 <- snm(conc~b1+exp(b2+d1*I(type=="normal"))
*f(time-alogit(b3+d2*I(type=="normal"))),
func=f(u)~list(periodic(u)), data=nordep,
fixed=list(b1~type,d1+d2~1), random=list(b1+b2+b3~1),
groups=~ID, weights=varIdent(form=~1|type),
spar="m", start=c(1.9,-0.3,0,0))
> summary(nordep.fit2)
Semi-parametric Nonlinear Mixed Effects Model fit
Model: conc ~ b1 + exp(b2 + d1 * I(type == "normal")) *
f(time - alogit(b3 + d2 * I(type == "normal")))
Data: nordep
AIC BIC logLik
429.9391 503.4998 -193.7516
Random effects:
Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)
Level: ID
Structure: General positive-definite,
Log-Cholesky parametrization
StdDev Corr
b1.(Intercept) 0.3309993 b1.(I) b2
b2 0.2841053 -0.781
b3 0.2901979 0.030 -0.189
Residual 0.4655115
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | type
Parameter estimates:
depression normal
1.0000000 0.8908655
Fixed effects: list(b1 ~ type, d1 + d2 ~ 1)
Value Std.Error DF t-value p-value
b1.(Intercept) 1.8919482 0.08590594 216 22.023485 0.0000
b1.typenormal -0.2558220 0.14361590 216 -1.781293 0.0763
d1 0.2102017 0.10783207 216 1.949343 0.0525
d2 0.0281460 0.09878700 216 0.284916 0.7760
Correlation:
b1.(I) b1.typ d1
b1.typenormal -0.598
d1 0.000 -0.509
d2 0.000 0.023 -0.159
GML estimate(s) of smoothing parameter(s): 0.0004723142
Equivalent Degrees of Freedom (DF): 9.217902
Converged after 9 iterations
The differences in 24-hour mean and amplitude are borderline significant, while the difference in phase is not. We refit without the $d_2$ term:
> nordep.fit3 <- snm(conc~b1+
exp(b2+d1*I(type=="normal"))*f(time-alogit(b3)),
func=f(u)~list(periodic(u)), data=nordep,
fixed=list(b1~type,d1~1), random=list(b1+b2+b3~1),
groups=~ID, weights=varIdent(form=~1|type),
spar="m", start=c(1.9,-0.3,0))
> summary(nordep.fit3)
Semi-parametric Nonlinear Mixed Effects Model fit
Model: conc ~ b1 + exp(b2 + d1 * I(type == "normal")) *
f(time - alogit(b3))
Data: nordep
AIC BIC logLik
425.2350 495.3548 -192.4077
Random effects:
Formula: list(b1 ~ 1, b2 ~ 1, b3 ~ 1)
Level: ID
Structure: General positive-definite,
Log-Cholesky parametrization
StdDev Corr
b1.(Intercept) 0.3302233 b1.(I) b2
b2 0.2835421 -0.780
b3 0.2898165 0.033 -0.192
Residual 0.4647148
Variance function:
Structure: Different standard deviations per stratum
Formula: ~1 | type
Parameter estimates:
depression normal
1.0000000 0.8902948
Fixed effects: list(b1 ~ type, d1 ~ 1)
Value Std.Error DF t-value p-value
b1.(Intercept) 1.8919931 0.08574693 217 22.064849 0.0000
b1.typenormal -0.2567236 0.14327117 217 -1.791872 0.0745
d1 0.2148962 0.10620988 217 2.023316 0.0443
Correlation:
b1.(I) b1.typ
b1.typenormal -0.598
d1 0.000 -0.512
GML estimate(s) of smoothing parameter(s): 0.0004742399
Equivalent Degrees of Freedom (DF): 9.20981
Converged after 8 iterations
The predictions based on the final fit are shown in Figures 9.10 and 9.11. The estimate of the common shape function $f$ is shown in the right panel of Figure 9.13. Data from the two groups are pooled to estimate the common shape function, which leads to narrower confidence intervals. The final model suggests that the depressed subjects have elevated mean cortisol levels and less profound circadian rhythms than the normal subjects.
To take a closer look, we extract estimates of the 24-hour mean levels and amplitudes for all subjects in the normal and depression groups and perform binary recursive partitioning using these two variables:
> nor.mean <- nordep.fit3$coef$fixed[1]+
nordep.fit3$coef$fixed[2]+
nordep.fit3$coef$random$ID[12:20,1]
> dep.mean <- nordep.fit3$coef$fixed[1]+
nordep.fit3$coef$random$ID[1:11,1]
> nor.amp <- exp(nordep.fit3$coef$fixed[3]+
nordep.fit3$coef$random$ID[12:20,2])
> dep.amp <- exp(nordep.fit3$coef$random$ID[1:11,2])
> u <- c(nor.mean, dep.mean)
> v <- c(nor.amp, dep.amp)
> s <- c(rep("n",9),rep("d",11))
> library(rpart)
> prune(rpart(s~u+v), cp=.1)
n= 20
node), split, n, loss, yval, (yprob)
* denotes terminal node
1) root 20 9 d (0.5500000 0.4500000)
2) u>=1.639263 12 2 d (0.8333333 0.1666667) *
3) u< 1.639263 8 1 n (0.1250000 0.8750000) *
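The split rpart reports (u >= 1.639) is a single threshold on the 24-hour mean that best separates the two groups. The idea can be sketched by an exhaustive search minimizing misclassification error (Python, with made-up values rather than the fitted means; rpart itself uses the Gini index by default):

```python
# Exhaustive search for the best single split on one variable,
# minimizing misclassification -- the core of what a classification
# tree does at each node.
def best_split(x, labels):
    best = (None, len(labels) + 1)
    for c in sorted(set(x)):
        # predict "d" when x >= c and "n" otherwise, plus the mirror rule
        for hi, lo in (("d", "n"), ("n", "d")):
            pred = [hi if v >= c else lo for v in x]
            loss = sum(p != l for p, l in zip(pred, labels))
            if loss < best[1]:
                best = (c, loss)
    return best

# Hypothetical 24-hour means: depressed subjects tend to be higher,
# mimicking the pattern in Figure 9.14.
means = [2.1, 1.9, 1.8, 1.7, 2.0, 1.5, 1.4, 1.75, 1.6, 1.2]
labels = ["d", "d", "d", "d", "d", "n", "n", "n", "n", "n"]
cut, loss = best_split(means, labels)
print(cut, loss)  # threshold 1.7 separates the groups with one error
```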
Figure 9.14 shows the estimated 24-hour mean levels plotted against the estimated amplitudes. There is a negative relationship between the 24-hour mean and amplitude. The estimate of the correlation between $b_{1ik}$ and $b_{2ik}$ equals $-0.781$. The normal subjects and depressed patients can be well separated by the 24-hour mean level.
FIGURE 9.14 Hormone data, plot of the estimated 24-hour mean levels against amplitudes. Normal subjects and depressed patients are marked as "n" and "d", respectively. The dotted line represents the partition based on the tree method.
Appendix A
Data Sets
In this appendix we describe the data sets used for illustrations in this book. Table A.1 lists all data sets.
TABLE A.1 List of all data sets.

Air quality: New York air quality measurements
Arosa: Monthly ozone measurements from Arosa
Beveridge: Beveridge wheat price index
Bond: Treasury and GE bonds
Canadian weather: Monthly temperature and precipitation from 35 Canadian stations
Carbon dioxide: Carbon dioxide uptake in grass plants
Chickenpox: Monthly chickenpox cases in New York City
Child growth: Height of a child over one school year
Dog: Coronary sinus potassium in dogs
Geyser: Old Faithful geyser data
Hormone: Cortisol concentrations
Lake acidity: Acidity measurements from lakes
Melanoma: Melanoma incidence rates in Connecticut
Motorcycle: Simulated motorcycle accident data
Paramecium caudatum: Growth of paramecium caudatum population
Rock: Measurements on petroleum rock samples
Seizure: IEEG segments from a seizure patient
Star: Magnitude of the Mira variable R Hydrae
Stratford weather: Daily maximum temperatures in Stratford
Superconductivity: Superconductivity magnetization
Texas weather: Texas historical climate data
Ultrasound: Ultrasound imaging of the tongue shape
USA climate: Average winter temperatures in USA
Weight loss: Weight loss of an obese patient
WESDR: Wisconsin Epidemiological Study of Diabetic Retinopathy
World climate: Global average winter temperature
A.1 Air Quality Data
This data set contains daily air quality measurements in New York City, May to September of 1973. Four variables were measured: mean ozone in parts per billion from 1300 to 1500 hours at Roosevelt Island (denoted as Ozone), average wind speed in miles per hour at 0700 and 1000 hours at LaGuardia Airport (denoted as Wind), maximum daily temperature in degrees Fahrenheit at LaGuardia Airport (denoted as Temp), and solar radiation in Langleys in the frequency band 4000-7700 Angstroms from 0800 to 1200 hours at Central Park (denoted as Solar.R). The data set is available in R with the name airquality.
A.2 Arosa Ozone Data
This data set, from Andrews and Herzberg (1985), contains monthly mean ozone thickness (Dobson units) in Arosa, Switzerland, from 1926 to 1971. It consists of 518 observations on three variables: thick for ozone thickness, month, and year. The data set is available in library assist with the name Arosa.
A.3 Beveridge Wheat Price Index Data
The data set contains the Beveridge wheat price index averaged over many locations in western and central Europe from 1500 to 1869. The data set is available in library tseries with the name bev.
A.4 Bond Data
A total of 144 GE (General Electric Company) bonds and 78 Treasury bonds were collected from Bloomberg. The data set contains four variables: name of a bond, current price, payment at future times, and type of the bond. The number of payments till due ranges from 1 to 27 with a median of 3 for the GE bonds, and from 1 to 58 with a median of 2.5 for the Treasury bonds. The maximum time to maturity is 14.8 years for GE bonds and 28.77 years for Treasury bonds. The data set is available in library assist with the name bond.
A.5 Canadian Weather Data
The data set contains mean monthly temperature and precipitation at 35 Canadian weather stations (Ramsay and Silverman 2005). It consists of 420 observations on five variables: temp for temperature in Celsius, prec for precipitation in millimeters, station code number, geological zone, and month. The data set is available in library fda with the name CanadianWeather.
A.6 Carbon Dioxide Data
This data set comes from a study of cold tolerance of a C4 grass species, Echinochloa crus-galli. A total of 12 four-week-old plants were used in the study. There were two types of plants: six from Quebec and six from Mississippi. Two treatments, nonchilling and chilling, were assigned to three plants of each type. Nonchilled plants were kept at 26°C, and chilled plants were subject to 14 hours of chilling at 7°C. After 10 hours of recovery at 20°C, CO2 uptake rates (in umol/m^2 s) were measured for each plant at seven concentrations of ambient CO2 in increasing, consecutive order. More details can be found in Pinheiro and Bates (2000). The data set is available in library nlme with the name CO2.
A.7 Chickenpox Data
The data set, downloaded from http://robjhyndman.com/tsdldata/epi/chicknyc.dat, contains the monthly number of reported cases of chickenpox in New York City from 1931 to the first 6 months of 1972. It consists of 498 observations on three variables: count, month, and year. The data set is available in library assist with the name chickenpox.
A.8 Child Growth Data
Height measurements of a child were recorded over one school year (Ramsay 1998). The data set contains 83 observations on two variables: height in centimeters and day in a year. The data set is available in library fda with the name onechild.
A.9 Dog Data
A total of 36 dogs were assigned to four groups: control, extrinsic cardiac denervation 3 weeks prior to coronary occlusion, extrinsic cardiac denervation immediately prior to coronary occlusion, and bilateral thoracic sympathectomy and stellectomy 3 weeks prior to coronary occlusion. Coronary sinus potassium concentrations (milliequivalents per liter) were measured on each dog every 2 minutes from 1 to 13 minutes after occlusion. These data were originally presented by Grizzle and Allen (1969). The data set is available in library assist with the name dog.
A.10 Geyser Data
This data set contains 272 measurements from the Old Faithful geyser in Yellowstone National Park. Two variables were recorded: duration as the eruption time in minutes, and waiting as the waiting time in minutes to the next eruption. The data set is available in R with the name faithful.
A.11 Hormone Data
In an experiment to study immunological responses in humans, blood samples were collected every two hours for 24 hours from 9 healthy normal volunteers, 11 patients with major depression, and 16 patients with Cushing's syndrome. These blood samples were analyzed for parameters that measure immune functions and hormones of the Hypothalamic-Pituitary-Adrenal axis (Kronfol, Nair, Zhang, Hill and Brown 1997). We concentrate on the hormone cortisol. The data set contains four variables: ID for subject index, time for the time points when blood samples were taken, type as a group indicator for subjects, and conc for cortisol concentration on the log10 scale. The variable time is scaled into the interval [0, 1]. The data set is available in library assist with the name horm.cort.
A.12 Lake Acidity Data
This data set was derived by Douglas and Delampady (1990) from the Eastern Lakes Survey of 1984. The study involved measurements of 1789 lakes in three Eastern US regions: Northeast, Upper Midwest, and Southeast. We use a subset of 112 lakes in the southern Blue Ridge mountains area. The data set contains 112 observations on four variables: ph for water pH level, t1 for calcium concentration in log10 milligrams per liter, x1 for latitude, and x2 for longitude. The data set is available in library assist with the name acid.
A.13 Melanoma Data
This is a data set in Andrews and Herzberg (1985) that contains the number of melanoma cases per 100,000 people in the state of Connecticut during 1936-1972. It consists of 37 observations on two variables: cases for the number of melanoma cases per 100,000, and year. The data set is available in library fda with the name melanoma.
A.14 Motorcycle Data
These data come from a simulated motorcycle crash experiment on the efficacy of crash helmets. The data set contains 133 measurements on two variables: accel as the head acceleration (in g) of a subject, and time as the time after impact in milliseconds. The data set is available in library MASS with the name mcycle.
A.15 Paramecium caudatum Data
This is a data set in Gause (1934) that contains the growth of a paramecium caudatum population in the medium of Osterhout. It consists of 25 observations on two variables: days since the start of the experiment, and density representing the mean number of individuals in 0.5 milliliter of medium for four different cultures started simultaneously. The data set is available in library assist with the name paramecium.
A.16 Rock Data
This data set contains measurements on 48 rock samples collected from a petroleum reservoir. Four variables were measured: area of pore space in pixels out of 256 by 256, perimeter in pixels (denoted as peri), shape in perimeter/area^(1/2), and permeability in milli-Darcies (denoted as perm). The data set is available in R with the name rock.
A.17 Seizure Data
This data set, provided by Li Qin, contains two 5-minute intracranial electroencephalogram (IEEG) segments from a seizure patient: base includes the baseline segment extracted at least 4 hours before the seizure's onset, and preseizure includes the segment right before a seizure's clinical onset. The sampling rate is 200 Hertz. Therefore, there are 60,000 time points in each segment. The data set is available in library assist with the name seizure.
A.18 Star Data
This data set, provided by Marc G. Genton, contains the magnitude (brightness) of the Mira variable R Hydrae during 1900-1950. It consists of two variables: magnitude and time in days. The data set is available in library assist with the name star.
A.19 Stratford Weather Data
This is part of a climate data set downloaded from the Carbon Dioxide Information Analysis Center at http://cdiac.ornl.gov/ftp/ndp070. Daily maximum temperatures from the station in Stratford, Texas, in the year 1990 were extracted. The year was divided into 73 five-day periods, and the measurement on the third day in each period was selected as the observation. Therefore, the data set consists of 73 observations on two variables: y as the observed maximum temperature in Fahrenheit, and x as the time scaled into [0, 1]. The data set is available in library assist with the name Stratford.
A.20 Superconductivity Data
The data come from a study of superconductivity magnetization modeling conducted by the National Institute of Standards and Technology. The data set contains 154 observations on two variables: magnetization in ampere×meter²/kilogram, and log time in minutes. Temperature was fixed at 10 degrees Kelvin. The data set is available in library NISTnls with the name Bennett5.
A.21 Texas Weather Data
The data set contains average monthly temperatures during 1961–1990 from 48 weather stations in Texas. It also contains the geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name TXtemp.
A.22 Ultrasound Data
Ultrasound imaging of the tongue provides real-time information about the shape of the tongue body at different stages in articulation. This data set comes from an experiment conducted in the Phonetics and Experimental Phonology Lab of New York University led by Professor Lisa Davidson. Three Russian speakers produced the consonant sequence /gd/ in three different linguistic environments:
2words: the g was at the end of one word followed by d at the beginning of the next word. For example, the Russian phrase pabjeg damoj;

cluster: the g and d were both at the beginning of the same word. For example, the phrase xot gdamam;

schwa: the g and d were at the beginning of the same word but are separated by the short vowel schwa (indicated by [ə]). For example, the phrase prətajitatjgəda’voj.
Details about the ultrasound experiment can be found in Davidson (2006). We use a subset from a single subject, with three replications for each environment, and 15 points recorded from each of 9 slices of tongue curves separated by 30 ms (milliseconds). The data set contains four variables: height as tongue height in mm (millimeters), length as tongue length in mm, time as the time in ms, and env as the environment with three levels: 2words, cluster, and schwa. The data set is available in library assist with the name ultrasound.
A.23 USA Climate Data
The data set contains average winter (December, January, and February) temperatures (temp) in 1981 from 1214 stations in the United States. It also contains the geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name USAtemp.
A.24 Weight Loss Data
The data set contains 52 observations on two variables: Weight in kilograms of a male obese patient, and Days since the start of a weight rehabilitation program. The data set is available in library MASS with the name wtloss.
A.25 WESDR Data
The Wisconsin Epidemiological Study of Diabetic Retinopathy (WESDR) is an epidemiological study of a cohort of diabetic patients receiving their medical care in an 11-county area in southern Wisconsin. A number of medical, demographic, ocular, and other covariates were recorded at the baseline and later examinations, along with a retinopathy score for each eye. Detailed descriptions of the study were given in Klein, Klein, Moss, Davis and DeMets (1988) and Klein, Klein, Moss, Davis and DeMets (1989). This subset contains 669 observations on five variables: num for subject ID, dur for duration of diabetes at baseline, gly for glycosylated hemoglobin, bmi for body mass index (weight in kilograms/(height in meters)²), and prg for progression status of diabetic retinopathy at the first follow-up (1 for progression and 0 for nonprogression). The data set is available in library assist with the name wesdr.
A.26 World Climate Data
The data were obtained from the Carbon Dioxide Information Analysis Center at Oak Ridge National Laboratory. The data set contains average winter (December, January, and February) temperatures (temp) in 1981 from 725 stations around the globe, and the geographic locations of these stations in terms of longitude (long) and latitude (lat). The data set is available in library assist with the name climate.
Appendix B
Codes for Fitting Strictly Increasing Functions
B.1 C and R Codes for Computing Integrals
The following functions k2 and k4 compute the scaled Bernoulli polynomials k2(x) and k4(x) in (2.27), and the function rc computes the RK of the cubic spline in Table 2.2:
static double
k2(double x) {
    double value;

    x = fabs(x);
    value = x - 0.5;
    value *= value;
    value = (value - 1./12.)/2.;
    return(value);
}

static double
k4(double x) {
    double val;

    x = fabs(x);
    val = x - 0.5;
    val *= val;
    val = (val*val - val/2. + 7./240.)/24.;
    return(val);
}

static double
rc(double x, double y) {
    double value;

    value = k2(x)*k2(y) - k4(x - y);
    return(value);
}
The following functions integral_s, integral_f, and integral_1 compute three-point Gaussian quadrature approximations to the integrals ∫_0^x f(s)ds, ∫_0^x f(s)R1(s, y)ds, and ∫_0^x ∫_0^y f(s)f(t)R1(s, t)dsdt, respectively, where R1 is the RK of the cubic spline in Table 2.2:
void integral_s(double *f, double *x, long *n, double *res)
{
    long i;
    double sum = 0.0;

    for(i = 0; i < *n; i++){
        sum += (x[i+1]-x[i])*(0.2777778*(f[3*i]+f[3*i+2])+
               0.4444444*f[3*i+1]);
        res[i] = sum;
    }
}
void integral_f(double *x, double *y, double *f,
                long *nx, long *ny, double *res)
{
    long i, j;
    double x1, sum = 0.0;

    for(i = 0; i < *ny; i++){
        sum = 0.0;
        for(j = 0; j < *nx; j++){
            x1 = x[j+1]-x[j];
            sum += x1*(0.2777778*(f[3*j]*rc(x[j]+x1*0.1127017, y[i])+
                   f[3*j+2]*rc(x[j]+x1*0.8872983, y[i]))+
                   0.4444444*f[3*j+1]*rc(x[j]+x1*0.5, y[i]));
            res[i*(*nx)+j] = sum;
        }
    }
}
void integral_1(double *x, double *y, double *f,
                long *n1, long *n2, double *res)
{
    long i, j;
    double x1, y1, sum = 0.0, sum_tmp;

    for(i = 0; i < *n1; i++){
        x1 = x[i+1]-x[i];
        sum = 0.0;
        for(j = 0; j < *n2; j++){
            y1 = y[j+1]-y[j];
            sum_tmp = 0.2777778*0.2777778*(f[3*i]*f[3*j])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.1127017)+
                0.2777778*0.4444444*((f[3*i]*f[3*j+1])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.5)+
                (f[3*i+1]*f[3*j])*rc(x[i]+x1*0.5, y[j]+y1*0.1127017));
            sum_tmp += 0.4444444*0.4444444*((f[3*i+1]*f[3*j+1])*
                rc(x[i]+x1*0.5, y[j]+y1*0.5))+
                0.2777778*0.2777778*((f[3*i+2]*f[3*j+2])*
                rc(x[i]+x1*0.8872983, y[j]+y1*0.8872983));
            sum_tmp += 0.2777778*0.2777778*((f[3*i]*f[3*j+2])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.8872983)+
                (f[3*i+2]*f[3*j])*
                rc(x[i]+x1*0.8872983, y[j]+y1*0.1127017))+
                0.4444444*0.2777778*((f[3*i+1]*f[3*j+2])*
                rc(x[i]+x1*0.5, y[j]+y1*0.8872983)+
                (f[3*i+2]*f[3*j+1])*rc(x[i]+x1*0.8872983, y[j]+y1*0.5));
            sum += sum_tmp*x1*y1;
            res[i*(*n2)+j] = sum;
        }
    }
}
The following R functions provide an interface to the C functions integral_s, integral_f, and integral_1:
int.s <- function(f, x, low=0) {
    n <- length(x)
    x <- c(low, x)
    .C("integral_s", as.double(f), as.double(x),
       as.integer(n), val = double(n))$val
}
int.f <- function(x, y, f, low=0) {
    nx <- length(x)
    ny <- length(y)
    x <- c(low, x)
    res <- .C("integral_f", as.double(x),
              as.double(y), as.double(f), as.integer(nx),
              as.integer(ny), val=double(nx*ny))$val
    matrix(res, ncol=ny, byrow=FALSE)
}
int1 <- function(x, f.val, low=0) {
    n <- length(x)
    x <- c(low, x)
    if(length(f.val) != 3 * n) stop("input lengths do not match")
    res <- matrix(.C("integral_1", as.double(x),
                     as.double(x), as.double(f.val), as.integer(n),
                     as.integer(n), val = double(n * n))$val, ncol = n)
    apply(res, 1, cumsum)
}
B.2 R Function inc
The following R function implements the EGN procedure for model (7.3).
inc <- function(y, x, spar="v", grid=x, limnla=c(-6,0),
                prec=1.e-6, maxit=50, verbose=FALSE)
{
    n <- length(x)
    org.ord <- match(1:n, (1:n)[order(x)])
    s.x <- sort(x)
    s.y <- y[order(x)]
    x1 <- c(0, s.x[-n])
    x2 <- s.x-x1
    q.x <- as.vector(rep(1,3)%o%x1+
                     c(0.1127017,0.5,0.8872983)%o%x2)
    # functions for computing derivatives
    k1 <- function(x) x-.5
    k2 <- function(x) ((x-.5)^2-1/12)/2
    dk2 <- function(x) x-.5
    dk4 <- function(x)
        sign(x)*((abs(x)-.5)^3/6-(abs(x)-.5)/24)
    drkcub <- function(x,z) dk2(x)%o%k2(z)-
        dk4(x%o%rep(1,length(z))-rep(1,length(x))%o%z)
    # compute starting value
    ini.fit <- ssr(s.y~I(s.x-.5), cubic(s.x))
    g.der <- ini.fit$coef$d[2]+drkcub(q.x,x)%*%ini.fit$coef$c
    h.new <- abs(g.der)+0.005
    # begin iteration
    iter <- cover <- 1
    h.old <- h.new
    repeat {
        if(verbose) cat("\n Iteration: ", iter)
        yhat <- s.y-int.s(h.new*(1-log(h.new)),s.x)
        smat <- cbind(int.s(h.new, s.x), int.s(h.new*q.x,s.x))
        qmat <- int1(s.x,h.new)
        fit <- ssr(yhat~smat, qmat, spar=spar, limnla=limnla)
        if(verbose)
            cat("\nSmoothing parameter: ", fit$rkpk$nlaht)
        dd <- fit$coef$d
        cc <- as.vector(fit$coef$c)
        h.new <- as.vector(exp(cc%*%int.f(s.x,q.x,h.new)+
                               dd[2]+dd[3]*q.x))
        cover <- mean((h.new-h.old)^2)
        h.old <- h.new
        if(verbose)
            cat("\nConvergence criterion: ", cover, "\n")
        if(cover<prec || iter>(maxit-1)) break
        iter <- iter + 1
    }
    if(iter>=maxit) print("convergence not achieved!")
    y.fit <- (smat[,1]+fit$rkpk$d[1])[org.ord]
    f.fit <- as.vector(cc%*%int.f(s.x,grid,h.new)+
                       dd[2]+dd[3]*grid)
    x1 <- c(0, grid[-length(grid)])
    x2 <- grid-x1
    q.x <- as.vector(rep(1,3)%o%x1+
                     c(0.1127017,0.5,0.8872983)%o%x2)
    h.new <- as.vector(exp(cc%*%int.f(s.x,q.x,h.new)+
                           dd[2]+dd[3]*q.x))
    y.pre <- int.s(h.new,grid)+fit$rkpk$d[1]
    sigma <- sqrt(sum((y-y.fit)^2)/(length(y)-fit$df))
    list(fit=fit, iter=c(iter, cover),
         pred=list(x=grid,y=y.pre,f=f.fit),
         y.fit=y.fit, sigma=sigma)
}
where x and y are vectors of the independent and dependent variables; grid is a vector of grid points of the x variable used for assessing convergence and prediction; and the options spar and limnla are similar to those in the ssr function. Let h(x) = g′(x) = exp{f(x)}. To get the initial value for the function f, we first fit a cubic spline to model (7.1). Denote the fitted function as g0. We then use log(|g′0(x)| + δ) as the initial value for f, where δ = 0.005 is a small positive number added for numerical stability. Since g0(x) = d1φ1(x) + d2φ2(x) + ∑_{i=1}^n c_i R1(x_i, x), we have g′0(x) = d2 + ∑_{i=1}^n c_i ∂R1(x_i, x)/∂x, where ∂R1(x_i, x)/∂x is computed by the function drkcub. Functions k1 and k2 compute the scaled Bernoulli polynomials defined in (2.27), and functions dk2 and dk4 compute k′2(x) and k′4(x), respectively. As a by-product, the line starting with g.der shows how to compute the first derivative of a cubic spline fit.
Appendix C
Codes for Term Structure of Interest Rates
C.1 C and R Codes for Computing Integrals
The following rc function computes the RK of the cubic spline in Table 2.1:
static double
rc(double x, double y) {
    double val, tmp;

    tmp = (x+y-fabs(x-y))/2.0;
    val = tmp*tmp*(3.0*(x+y-tmp)-tmp)/6.0;
    return(val);
}
In addition to the functions integral_s, integral_f, and integral_1 presented in Appendix B, we need the following function integral_2 for computing three-point Gaussian quadrature approximations to the integral ∫_0^x ∫_0^y f1(s)f2(t)R1(s, t)dsdt, where R1 is the RK of the cubic spline in Table 2.1:
void integral_2(double *x, double *y, double *fx,
                double *fy, long *n1, long *n2, double *res)
{
    long i, j;
    double x1, y1, sum = 0.0, sum_tmp;

    for(i = 0; i < *n1; i++){
        x1 = x[i+1]-x[i];
        sum = 0.0;
        for(j = 0; j < *n2; j++){
            y1 = y[j+1]-y[j];
            sum_tmp = 0.2777778*0.2777778*(fx[3*i]*fy[3*j])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.1127017)+
                0.2777778*0.4444444*((fx[3*i]*fy[3*j+1])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.5)+
                (fx[3*i+1]*fy[3*j])*rc(x[i]+x1*0.5, y[j]+y1*0.1127017));
            sum_tmp += 0.4444444*0.4444444*((fx[3*i+1]*fy[3*j+1])*
                rc(x[i]+x1*0.5, y[j]+y1*0.5))+
                0.2777778*0.2777778*((fx[3*i+2]*fy[3*j+2])*
                rc(x[i]+x1*0.8872983, y[j]+y1*0.8872983));
            sum_tmp += 0.2777778*0.2777778*((fx[3*i]*fy[3*j+2])*
                rc(x[i]+x1*0.1127017, y[j]+y1*0.8872983)+
                (fx[3*i+2]*fy[3*j])*
                rc(x[i]+x1*0.8872983, y[j]+y1*0.1127017))+
                0.4444444*0.2777778*((fx[3*i+1]*fy[3*j+2])*
                rc(x[i]+x1*0.5, y[j]+y1*0.8872983)+
                (fx[3*i+2]*fy[3*j+1])*rc(x[i]+x1*0.8872983, y[j]+y1*0.5));
            sum += sum_tmp*x1*y1;
            res[i*(*n2)+j] = sum;
        }
    }
}
Note that the rc function called inside integral_f, integral_1, and integral_2 in this appendix computes the RK R1 of the cubic spline in Table 2.1.
The following R function provides an interface to the C function integral_2:
int2 <- function(x, y, fx, fy, low.x=0, low.y=0) {
    nx <- length(x)
    ny <- length(y)
    if((length(fx) != 3 * nx) || (length(fy) != 3 * ny))
        stop("input lengths do not match")
    x <- c(low.x, x)
    y <- c(low.y, y)
    res <- matrix(.C("integral_2", as.double(x),
                     as.double(y), as.double(fx), as.double(fy),
                     as.integer(nx), as.integer(ny),
                     val=double(nx*ny))$val,
                  ncol=ny, byrow=TRUE)
    apply(res, 2, cumsum)
}
C.2 R Function for One Bond
The following one.bond function implements the EGN algorithm to fit model (7.35):
one.bond <- function(price, payment, time, name,
                     spar="m", limnla=c(-3,6)) {
    # preprocess the data
    # the data must be sorted by name
    group <- as.vector(table(name))
    y <- price[cumsum(group)]
    n.time <- length(time)
    # create variables for 3-point Gaussian quadrature
    s.time <- sort(time)
    x1.y <- c(0, s.time[-n.time])
    x2.y <- s.time-x1.y
    org.ord <- match(1:n.time, (1:n.time)[order(time)])
    q.time <- as.vector(rep(1,3)%o%x1.y+
                        c(0.1127017,0.5,0.8872983)%o%x2.y)
    # initial values for f
    f0 <- function(x) rep(0.04, length(x))
    f.old <- f.val <- f0(q.time)
    # create s and q matrices
    S <- cbind(time,time*time/2.0)
    Lambda <- int1(s.time,
                   rep(1,3*length(s.time)))[org.ord,org.ord]
    Lint <- int.f(s.time,q.time,
                  rep(1,3*length(s.time)))[org.ord,]
    # begin iteration
    iter <- cover <- 1
    repeat {
        fint <- int.s(f.val,s.time)[org.ord]
        X <- assist:::diagComp(matrix(payment*exp(-fint),nrow=1),
                               group)
        ytilde <- X%*%(1+fint)-y
        T <- X%*%S; Q <- X%*%Lambda%*%t(X)
        fit <- ssr(ytilde~T-1, Q, spar=spar, limnla=limnla)
        dd <- fit$coef$d; cc <- fit$coef$c
        f.val <- as.vector((cc%*%X)%*%Lint+dd[1]+dd[2]*q.time)
        cover <- mean((f.val-f.old)^2)
        if(cover<1.e-6 || iter>20) break
        iter <- iter+1; f.old <- f.val
    }
    tmp <- -int.s(f.val,s.time)[org.ord]
    yhat <- apply(assist:::diagComp(matrix(payment*exp(tmp),
                                           nrow=1),group),1,sum)
    sigma <- sqrt(sum((y-yhat)^2)/(length(y)-fit$df))
    list(fit=fit, iter=c(iter, cover), call=match.call(),
         f.val=f.val, q.time=q.time, dc=exp(tmp),
         y=list(y=y,yhat=yhat), sigma=sigma)
}
where variable names are self-explanatory.
C.3 R Function for Two Bonds
The following two.bond function implements the nonlinear Gauss–Seidel algorithm to fit model (7.36):
two.bond <- function(price, payment, time, name, type,
                     spar="m", limnla=c(-3,6), prec=1.e-6, maxit=20) {
    # preprocess the data
    # the data in each group must be sorted by name
    group1 <- as.vector(table(name[type=="govt"]))
    y1 <- price[type=="govt"][cumsum(group1)]
    time1 <- time[type=="govt"]
    n1.time <- length(time1)
    payment1 <- payment[type=="govt"]
    group2 <- as.vector(table(name[type=="ge"]))
    y2 <- price[type=="ge"][cumsum(group2)]
    time2 <- time[type=="ge"]
    n2.time <- length(time2)
    payment2 <- payment[type=="ge"]
    y <- c(y1, y2)
    group <- c(group1, group2)
    payment <- c(payment1, payment2)
    error <- 0
    # create variables for 3-point Gaussian quadrature
    s.time1 <- sort(time1)
    x1.y1 <- c(0, s.time1[-n1.time])
    x2.y1 <- s.time1-x1.y1
    org.ord1 <- match(1:n1.time, (1:n1.time)[order(time1)])
    q.time1 <- as.vector(rep(1,3)%o%x1.y1+
                         c(0.1127017,0.5,0.8872983)%o%x2.y1)
    s.time2 <- sort(time2)
    x1.y2 <- c(0, s.time2[-n2.time])
    x2.y2 <- s.time2-x1.y2
    org.ord2 <- match(1:n2.time, (1:n2.time)[order(time2)])
    q.time2 <- as.vector(rep(1,3)%o%x1.y2+
                         c(0.1127017,0.5,0.8872983)%o%x2.y2)
    # initial values for f
    f10 <- function(x) rep(0.04, length(x))
    f20 <- function(x) rep(0.01, length(x))
    f1.val1 <- f10(q.time1)
    f1.val2 <- f10(q.time2)
    f1.old <- c(f1.val1,f1.val2)
    f2.val2 <- f20(q.time2)
    f2.old <- f2.val2
    # create s and q matrices
    S1 <- cbind(time1,time1*time1/2.0)
    S2 <- cbind(time2,time2*time2/2.0)
    L1 <- int1(s.time1,
               rep(1,3*length(s.time1)))[org.ord1,org.ord1]
    L2 <- int1(s.time2,
               rep(1,3*length(s.time2)))[org.ord2,org.ord2]
    L12 <- int2(s.time1,s.time2,rep(1,3*length(s.time1)),
                rep(1,3*length(s.time2)))
    L12 <- L12[org.ord1,org.ord2]
    Lambda <- rbind(cbind(L1,L12),cbind(t(L12),L2))
    L1int <- int.f(s.time1,c(q.time1,q.time2),
                   rep(1,3*length(s.time1)))[org.ord1,]
    L2int <- int.f(s.time2,c(q.time1,q.time2),
                   rep(1,3*length(s.time2)))[org.ord2,]
    Lint <- rbind(L1int,L2int)
    L2int2 <- int.f(s.time2,q.time2,
                    rep(1,3*length(s.time2)))[org.ord2,]
    # begin iteration
    iter <- cover <- 1
    repeat {
        # update f1
        f1int1 <- int.s(f1.val1,s.time1)[org.ord1]
        f1int2 <- int.s(f1.val2,s.time2)[org.ord2]
        f2int2 <- int.s(f2.val2,s.time2)[org.ord2]
        X <- assist:::diagComp(matrix(payment*
                 exp(-c(f1int1,f1int2+f2int2)),nrow=1),group)
        ytilde1 <- X%*%(1+c(f1int1,f1int2))-y
        T <- X%*%rbind(S1,S2)
        Q <- X%*%Lambda%*%t(X)
        fit1 <- try(ssr(ytilde1~T-1,Q,spar=spar,limnla=limnla))
        if (inherits(fit1, "try-error")) {error <- 1; break}
        dd <- fit1$coef$d; cc <- fit1$coef$c
        f1.val <- as.vector((cc%*%X)%*%Lint+dd[1]+
                            dd[2]*c(q.time1,q.time2))
        f1.val1 <- f1.val[1:(3*n1.time)]
        f1.val2 <- f1.val[-(1:(3*n1.time))]
        # update f2
        f1int2 <- int.s(f1.val2,s.time2)[org.ord2]
        X2 <- assist:::diagComp(matrix(payment2*
                  exp(-f1int2-f2int2),nrow=1),group2)
        ytilde2 <- X2%*%(1+f2int2)-y2
        T2 <- X2%*%S2
        Q22 <- X2%*%L2%*%t(X2)
        fit2 <- try(ssr(ytilde2~T2-1,Q22,spar=spar,
                        limnla=limnla))
        if (inherits(fit2, "try-error")) {error <- 1; break}
        dd <- fit2$coef$d; cc <- fit2$coef$c
        f2.val2 <- as.vector((cc%*%X2)%*%L2int2+
                             dd[1]+dd[2]*q.time2)
        cover <- mean((c(f1.val1,f1.val2,f2.val2)-
                       c(f1.old, f2.old))^2)
        if(cover<prec || iter>maxit) break
        iter <- iter + 1
        f1.old <- c(f1.val1,f1.val2)
        f2.old <- f2.val2
    }
    tmp1 <- -int.s(f1.val1,s.time1)[org.ord1]
    tmp2 <- -int.s(f1.val2+f2.val2,s.time2)[org.ord2]
    yhat <- apply(assist:::diagComp(matrix(payment*
                      exp(c(tmp1,tmp2)),nrow=1),group),1,sum)
    sigma <- NA
    if (error==0) sigma <- sqrt(sum((y-yhat)^2)/
                                (length(y)-fit1$df-fit2$df))
    list(fit1=fit1, fit2=fit2, iter=c(iter, cover, error),
         call=match.call(),
         f.val=list(f1=f1.val1,f2=f1.val2+f2.val2),
         f2.val=f2.val2,
         q.time=list(q.time1=q.time1,q.time2=q.time2),
         dc=list(dc1=exp(tmp1),dc2=exp(tmp2)),
         y=list(y=y,yhat=yhat), sigma=sigma)
}
The matrices S1, S2, T, T2, L1, L2, L12, and Q represent S1, S2, T, T2, Λ1, Λ2, Λ12, and Σ, respectively, in the description of the Gauss–Seidel algorithm for model (7.36) in Section 7.6.3. In the output, f1 and f2 in the list f.val contain estimated forward rates for the two bonds evaluated at the time points q.time1 and q.time2 in the list q.time; f2.val contains the estimated credit spread evaluated at the time points q.time2; and dc1 and dc2 in the list dc contain estimated discount rates for the two bonds evaluated at the observed time points.
References
Abramowitz, M. and Stegun, I. A. (1964). Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, National Bureau of Standards, Washington, DC.

Andrews, D. F. and Herzberg, A. M. (1985). Data: A Collection of Problems From Many Fields for the Student and Research Worker, Springer, Berlin.

Aronszajn, N. (1950). Theory of reproducing kernels, Transactions of the American Mathematical Society 68: 337–404.

Bennett, L. H., Swartzendruber, L. J., Turchinskaya, M. J., Blendell, J. E., Habib, J. M. and Seyoum, H. M. (1994). Long-time magnetic relaxation measurements on a quench melt growth YBCO superconductor, Journal of Applied Physics 76: 6950–6952.

Berlinet, A. and Thomas-Agnan, C. (2004). Reproducing Kernel Hilbert Spaces in Probability and Statistics, Kluwer Academic, Norwell, MA.

Breslow, N. E. and Clayton, D. G. (1993). Approximate inference in generalized linear mixed models, Journal of the American Statistical Association 88: 9–25.

Carroll, R. J., Fan, J., Gijbels, I. and Wand, M. P. (1997). Generalized partially linear single-index models, Journal of the American Statistical Association 92: 477–489.

Coddington, E. A. (1961). An Introduction to Ordinary Differential Equations, Prentice-Hall, NJ.

Cox, D. D., Koh, E., Wahba, G. and Yandell, B. (1988). Testing the (parametric) null model hypothesis in (semiparametric) partial and generalized spline models, Annals of Statistics 16: 113–119.

Cox, D. R. and Hinkley, D. V. (1974). Theoretical Statistics, Chapman and Hall, London.

Craven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions, Numerische Mathematik 31: 377–403.
Dalzell, C. J. and Ramsay, J. O. (1993). Computing reproducing kernels with arbitrary boundary constraints, SIAM Journal on Scientific Computing 14: 511–518.

Davidson, L. (2006). Comparing tongue shapes from ultrasound imaging using smoothing spline analysis of variance, Journal of the Acoustical Society of America 120: 407–415.

Davies, R. B. (1980). The distribution of a linear combination of χ² random variables, Applied Statistics 29: 323–333.

Debnath, L. and Mikusinski, P. (1999). Introduction to Hilbert Spaces with Applications, Academic Press, London.

Douglas, A. and Delampady, M. (1990). Eastern Lake Survey — Phase I: documentation for the data base and the derived data sets, SIMS Technical Report 160, Department of Statistics, University of British Columbia, Vancouver.

Duchon, J. (1977). Splines minimizing rotation-invariant semi-norms in Sobolev spaces, pp. 85–100. In Constructive Theory of Functions of Several Variables, W. Schempp and K. Zeller, eds., Springer, Berlin.

Earn, D. J. D., Rohani, P., Bolker, B. M. and Grenfell, B. T. (2000). A simple model for complex dynamical transitions in epidemics, Science 287: 667–670.

Efron, B. (2001). Selection criteria for scatterplot smoothers, Annals of Statistics 29: 470–504.

Efron, B. (2004). The estimation of prediction error: covariance penalties and cross-validation (with discussion), Journal of the American Statistical Association 99: 619–632.

Eubank, R. (1988). Spline Smoothing and Nonparametric Regression, Dekker, New York.

Eubank, R. (1999). Nonparametric Regression and Spline Smoothing, 2nd ed., Dekker, New York.

Evans, M. and Swartz, T. (2000). Approximating Integrals via Monte Carlo and Deterministic Methods, Oxford University Press, Oxford, UK.

Fisher, M. D., Nychka, D. and Zervos, D. (1995). Fitting the term structure of interest rates with smoothing splines, Working Paper 95-1, Finance and Economics Discussion Series, Federal Reserve Board.

Flett, T. M. (1980). Differential Analysis, Cambridge University Press, London.
Friedman, J. H. and Stuetzle, W. (1981). Projection pursuit regression, Journal of the American Statistical Association 76: 817–823.

Gause, G. F. (1934). The Struggle for Existence, Williams & Wilkins, Baltimore, MD.

Genton, M. G. and Hall, P. (2007). Statistical inference for evolving periodic functions, Journal of the Royal Statistical Society B 69: 643–657.

Green, P. J. and Silverman, B. W. (1994). Nonparametric Regression and Generalized Linear Models: A Roughness Penalty Approach, Chapman and Hall, London.

Grizzle, J. E. and Allen, D. M. (1969). Analysis of growth and dose response curves, Biometrics 25: 357–381.

Gu, C. (1992). Penalized likelihood regression: A Bayesian analysis, Statistica Sinica 2: 255–264.
Gu, C. (2002). Smoothing Spline ANOVA Models, Springer, New York.
Guo, W., Dai, M., Ombao, H. C. and von Sachs, R. (2003). Smoothing spline ANOVA for time-dependent spectral analysis, Journal of the American Statistical Association 98: 643–652.

Hall, P., Kay, J. W. and Titterington, D. M. (1990). Asymptotically optimal difference-based estimation of variance in nonparametric regression, Biometrika 77: 521–528.

Harville, D. (1976). Extension of the Gauss–Markov theorem to include the estimation of random effects, Annals of Statistics 4: 384–395.

Harville, D. A. (1997). Matrix Algebra From a Statistician's Perspective, Springer, New York.

Hastie, T. and Tibshirani, R. (1990). Generalized Additive Models, Chapman and Hall, London.

Hastie, T. and Tibshirani, R. (1993). Varying-coefficient models, Journal of the Royal Statistical Society B 55: 757–796.

Heckman, N. (1997). The theory and application of penalized least squares methods or reproducing kernel Hilbert spaces made easy, University of British Columbia Statistics Department Technical Report 216.

Heckman, N. and Ramsay, J. O. (2000). Penalized regression with model-based penalties, Canadian Journal of Statistics 28: 241–258.
Jarrow, R., Ruppert, D. and Yu, Y. (2004). Estimating the term structure of corporate debt with a semiparametric penalized spline model, Journal of the American Statistical Association 99: 57–66.

Ke, C. and Wang, Y. (2001). Semiparametric nonlinear mixed-effects models and their applications (with discussion), Journal of the American Statistical Association 96: 1272–1298.

Ke, C. and Wang, Y. (2004). Nonparametric nonlinear regression models, Journal of the American Statistical Association 99: 1166–1175.

Kimeldorf, G. S. and Wahba, G. (1971). Some results on Tchebycheffian spline functions, Journal of Mathematical Analysis and Applications 33: 82–94.

Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D. and DeMets, D. L. (1988). Glycosylated hemoglobin predicts the incidence and progression of diabetic retinopathy, Journal of the American Medical Association 260: 2864–2871.

Klein, R., Klein, B. E. K., Moss, S. E., Davis, M. D. and DeMets, D. L. (1989). Is blood pressure a predictor of the incidence or progression of diabetic retinopathy?, Archives of Internal Medicine 149: 2427–2432.

Kronfol, Z., Nair, M., Zhang, Q., Hill, E. and Brown, M. (1997). Circadian immune measures in healthy volunteers: Relationship to hypothalamic-pituitary-adrenal axis hormones and sympathetic neurotransmitters, Psychosomatic Medicine 59: 42–50.

Lawton, W. H., Sylvestre, E. A. and Maggio, M. S. (1972). Self-modeling nonlinear regression, Technometrics 13: 513–532.

Lee, Y., Nelder, J. A. and Pawitan, Y. (2006). Generalized Linear Models with Random Effects: Unified Analysis via H-likelihood, Chapman and Hall, London.

Li, K. C. (1986). Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing, Annals of Statistics 14: 1101–1112.

Lindstrom, M. J. and Bates, D. M. (1990). Nonlinear mixed effects models for repeated measures data, Biometrics 46: 673–687.

Liu, A. and Wang, Y. (2004). Hypothesis testing in smoothing spline models, Journal of Statistical Computation and Simulation 74: 581–597.
Liu, A., Meiring, W. and Wang, Y. (2005). Testing generalized linear models using smoothing spline methods, Statistica Sinica 15: 235–256.

Liu, A., Tong, T. and Wang, Y. (2007). Smoothing spline estimation of variance functions, Journal of Computational and Graphical Statistics 16: 312–329.

Ma, X., Dai, B., Klein, R., Klein, B. E. K., Lee, K. and Wahba, G. (2010). Penalized likelihood regression in reproducing kernel Hilbert spaces with randomized covariate data, University of Wisconsin Statistics Department Technical Report 1158.

McCullagh, P. and Nelder, J. (1989). Generalized Linear Models, Chapman and Hall, London.

Meinguet, J. (1979). Multivariate interpolation at arbitrary points made simple, Journal of Applied Mathematics and Physics (ZAMP) 30: 292–304.

Neal, D. (2004). Introduction to Population Biology, Cambridge University Press, Cambridge, UK.

Nychka, D. (1988). Bayesian confidence intervals for smoothing splines, Journal of the American Statistical Association 83: 1134–1143.

Opsomer, J. D., Wang, Y. and Yang, Y. (2001). Nonparametric regression with correlated errors, Statistical Science 16: 134–153.

O'Sullivan, F. (1986). A statistical perspective on ill-posed inverse problems (with discussion), Statistical Science 4: 502–527.

Parzen, E. (1961). An approach to time series analysis, Annals of Mathematical Statistics 32: 951–989.

Pinheiro, J. and Bates, D. M. (2000). Mixed-Effects Models in S and S-PLUS, Springer, New York.

Qin, L. and Wang, Y. (2008). Nonparametric spectral analysis with applications to seizure characterization using EEG time series, Annals of Applied Statistics 2: 1432–1451.

Ramsay, J. O. (1998). Estimating smooth monotone functions, Journal of the Royal Statistical Society B 60: 365–375.

Ramsay, J. O. and Silverman, B. W. (2005). Functional Data Analysis, 2nd ed., Springer, New York.

Rice, J. A. (1984). Bandwidth choice for nonparametric regression, Annals of Statistics 12: 1215–1230.
Robinson, G. K. (1991). That BLUP is a good thing: The estimation of random effects (with discussion), Statistical Science 6: 15–51.

Ruppert, D., Wand, M. P. and Carroll, R. J. (2003). Semiparametric Regression, Cambridge University Press, New York.

Schumaker, L. L. (2007). Spline Functions: Basic Theory, 3rd ed., Cambridge University Press, Cambridge, UK.

Smith, M. and Kohn, R. (2000). Nonparametric seemingly unrelated regression, Journal of Econometrics 98: 257–281.

Speckman, P. (1995). Fitting curves with features: semiparametric change-point methods, Computing Science and Statistics 26: 257–264.

Stein, M. (1990). A comparison of generalized cross-validation and modified maximum likelihood for estimating the parameters of a stochastic process, Annals of Statistics 18: 1139–1157.

Tibshirani, R. and Knight, K. (1999). The covariance inflation criterion for adaptive model selection, Journal of the Royal Statistical Society B 61: 529–546.

Tong, T. and Wang, Y. (2005). Estimating residual variance in nonparametric regression using least squares, Biometrika 92: 821–830.

Venables, W. N. and Ripley, B. D. (2002). Modern Applied Statistics with S, 4th ed., Springer, New York.

Wahba, G. (1980). Automatic smoothing of the log periodogram, Journal of the American Statistical Association 75: 122–132.

Wahba, G. (1981). Spline interpolation and smoothing on the sphere, SIAM Journal on Scientific and Statistical Computing 2: 5–16.

Wahba, G. (1983). Bayesian confidence intervals for the cross-validated smoothing spline, Journal of the Royal Statistical Society B 45: 133–150.

Wahba, G. (1985). A comparison of GCV and GML for choosing the smoothing parameter in the generalized spline smoothing problem, Annals of Statistics 13: 1378–1402.

Wahba, G. (1987). Three topics in ill posed inverse problems, pp. 37–51. In Inverse and Ill-Posed Problems, M. Engl and G. Groetsch, eds., Academic Press, New York.

Wahba, G. (1990). Spline Models for Observational Data, CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 59, SIAM, Philadelphia, PA.
Wahba, G. and Wang, Y. (1995). Behavior near zero of the distribution of GCV smoothing parameter estimates for splines, Statistics and Probability Letters 25: 105–111.

Wahba, G., Wang, Y., Gu, C., Klein, R. and Klein, B. E. K. (1995). Smoothing spline ANOVA for exponential families, with application to the Wisconsin Epidemiological Study of Diabetic Retinopathy, Annals of Statistics 23: 1865–1895.

Wang, Y. (1994). Smoothing Spline Analysis of Variance of Data From Exponential Families, Ph.D. Thesis, University of Wisconsin-Madison, Department of Statistics.

Wang, Y. (1997). GRKPACK: fitting smoothing spline analysis of variance models to data from exponential families, Communications in Statistics: Simulation and Computation 26: 765–782.

Wang, Y. (1998a). Mixed-effects smoothing spline ANOVA, Journal of the Royal Statistical Society B 60: 159–174.

Wang, Y. (1998b). Smoothing spline models with correlated random errors, Journal of the American Statistical Association 93: 341–348.

Wang, Y. and Brown, M. B. (1996). A flexible model for human circadian rhythms, Biometrics 52: 588–596.

Wang, Y. and Ke, C. (2009). Smoothing spline semiparametric nonlinear regression models, Journal of Computational and Graphical Statistics 18: 165–183.

Wang, Y. and Wahba, G. (1995). Bootstrap confidence intervals for smoothing splines and their comparison to Bayesian confidence intervals, Journal of Statistical Computation and Simulation 51: 263–279.

Wang, Y. and Wahba, G. (1998). Discussion of "Smoothing Spline Models for the Analysis of Nested and Crossed Samples of Curves" by Brumback and Rice, Journal of the American Statistical Association 93: 976–980.

Wang, Y., Guo, W. and Brown, M. B. (2000). Spline smoothing for bivariate data with applications to association between hormones, Statistica Sinica 10: 377–397.

Wang, Y., Ke, C. and Brown, M. B. (2003). Shape invariant modelling of circadian rhythms with random effects and smoothing spline ANOVA decomposition, Biometrics 59: 804–812.
354 References
Wang, Y., Wahba, G., Chappell, R. and Gu, C. (1995). Simulation stud-ies of smoothing parameter estimates and Bayesian confidence in-tervals in Bernoulli SS ANOVA models, Communications in Statis-tics: Simulation and Computation 24: 1037–1059.
Wong, W. (2006). Estimation of the loss of an estimate. In Frontiersin Statistics, J. Fan and H. L. Koul eds. Imperial College Press,London.
Wood, S. N. (2003). Thin plate regression splines, Journal of the RoyalStatistical Society B 65: 95–114.
Xiang, D. and Wahba, G. (1996). A generalized approximate cross val-idation for smoothing splines with non-Gaussian data, StatisticaSinica 6: 675–692.
Yang, Y., Liu, A. and Wang, Y. (2005). Detecting pulsatile hormonesecretions using nonlinear mixed effects partial spline models, Bio-metrics pp. 230–238.
Ye, J. M. (1998). On measuring and correcting the effects of data miningand model selection, Journal of the American Statistical Associa-tion 93: 120–131.
Yeshurun, Y., Malozemoff, A. P. and Shaulov, A. (1996). Magnetic re-laxation in high-temperature superconductors, Reviews of ModernPhysics 68: 911–949.
Yorke, J. A. and London, W. P. (1973). Recurrent outbreaks ofmeasles, chickenpox and mumps, American Journal of Epidemi-ology 98: 453–482.
Yu, Y. and Ruppert, D. (2002). Penalized spline estimation for partiallylinear single index models, Journal of the American Statistical As-sociation 97: 1042–1054.
Yuan, M. and Wahba, G. (2004). Doubly penalized likelihood estima-tor in heteroscedastic regression, Statistics and Probability Letters69: 11–20.