Basics of Modern Parametric Statistics
Vladimir Spokoiny
Weierstrass-Institute,
Mohrenstr. 39, 10117 Berlin, Germany
February 13, 2012
2 parametric statistics: modern view
Contents
Preface 9
I Basics 13
1 Basic notions 15
1.1 Example of a Bernoulli experiment . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Least squares estimation in a linear model . . . . . . . . . . . . . . . . . . 18
1.3 General parametric model . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Statistical decision problem . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Parameter estimation for an i.i.d. model 25
2.1 Empirical distribution. Glivenko-Cantelli Theorem . . . . . . . . . . . . . 25
2.2 Substitution principle. Method of moments . . . . . . . . . . . . . . . . . 29
2.2.1 Method of moments. Univariate parameter . . . . . . . . . . . . . 30
2.2.2 Method of moments. Multivariate parameter . . . . . . . . . . . . 31
2.2.3 Method of moments. Examples . . . . . . . . . . . . . . . . . . . . 31
2.3 Unbiased estimates, bias, and quadratic risk . . . . . . . . . . . . . . . . . 36
2.3.1 Univariate parameter . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Multivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Root-n normality. Univariate parameter . . . . . . . . . . . . . . 38
2.4.2 Root-n normality. Multivariate parameter . . . . . . . . . . . . . 40
2.5 Some geometric properties of a parametric family . . . . . . . . . . . . . . 43
2.5.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 Hellinger distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.3 Regularity and the Fisher Information. Univariate parameter . . . 46
2.6 Cramer-Rao Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.1 Univariate parameter . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.2 Exponential families and R-efficiency . . . . . . . . . . . . . . . . . 51
2.7 Cramer-Rao inequality. Multivariate parameter . . . . . . . . . . . . . . . 53
2.7.1 Regularity and Fisher Information. Multivariate parameter . . . . 53
2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger
distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7.3 Multivariate Cramer-Rao Inequality . . . . . . . . . . . . . . . . . 56
2.7.4 Exponential families and R-efficiency . . . . . . . . . . . . . . . . . 57
2.8 Maximum likelihood and other estimation methods . . . . . . . . . . . . . 59
2.8.1 Minimum distance estimation . . . . . . . . . . . . . . . . . . . . . 59
2.8.2 M -estimation and Maximum likelihood estimation . . . . . . . . . 59
2.9 Maximum Likelihood for some parametric families . . . . . . . . . . . . . 63
2.9.1 Gaussian shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9.2 Variance estimation for the normal law . . . . . . . . . . . . . . . 65
2.9.3 Univariate normal distribution . . . . . . . . . . . . . . . . . . . . 66
2.9.4 Uniform distribution on [0, θ] . . . . . . . . . . . . . . . . . . . . . 66
2.9.5 Bernoulli or binomial model . . . . . . . . . . . . . . . . . . . . . . 66
2.9.6 Multinomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.9.7 Exponential model . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.9.8 Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.9.9 Shift of a Laplace (double exponential) law . . . . . . . . . . . . . 68
2.10 Quasi Maximum Likelihood approach . . . . . . . . . . . . . . . . . . . . 69
2.10.1 LSE as quasi likelihood estimation . . . . . . . . . . . . . . . . . . 69
2.10.2 LAD and robust estimation as quasi likelihood estimation . . . . . 71
2.11 Univariate exponential families . . . . . . . . . . . . . . . . . . . . . . . . 72
2.11.1 Natural parametrization . . . . . . . . . . . . . . . . . . . . . . . . 72
2.11.2 Canonical parametrization . . . . . . . . . . . . . . . . . . . . . . . 75
2.11.3 Deviation probabilities for the maximum likelihood . . . . . . . . . 78
3 Regression Estimation 85
3.1 Regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.1.3 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.1.4 Regression function . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.2 Method of substitution and M-estimation . . . . . . . . . . . . . . . . . . 89
3.2.1 Mean regression. Least squares estimate . . . . . . . . . . . . . . . 89
3.2.2 Median regression. Least absolute deviation estimate . . . . . . . . 90
3.2.3 Maximum likelihood regression . . . . . . . . . . . . . . . . . . . . 91
3.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3.1 Projection estimation . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.2 Piecewise linear estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3 Spline estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.4 Wavelet estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.5 Kernel estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4 Density function estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.1 Linear projection estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2 Wavelet density estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.4.3 Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.4 Estimation based on Fourier transformation . . . . . . . . . . . . . 94
3.5 Generalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.6.1 Logit regression for binary data . . . . . . . . . . . . . . . . . . . . 97
3.6.2 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.7 Quasi Maximum Likelihood estimation . . . . . . . . . . . . . . . . . . . . 98
4 Estimation in linear models 101
4.1 Modeling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Quasi maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . 102
4.2.1 Estimation under the homogeneous noise assumption . . . . . . . . 104
4.2.2 Linear basis transformation . . . . . . . . . . . . . . . . . . . . . . 104
4.2.3 Orthogonal and orthonormal design . . . . . . . . . . . . . . . . . 106
4.2.4 Spectral representation . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Properties of the response estimate f . . . . . . . . . . . . . . . . . . . . 108
4.3.1 Decomposition into a deterministic and a stochastic component . . 109
4.3.2 Properties of the operator Π . . . . . . . . . . . . . . . . . . . . . 109
4.3.3 Quadratic loss and risk of the response estimation . . . . . . . . . 110
4.3.4 Misspecified “colored noise” . . . . . . . . . . . . . . . . . . . . . . 111
4.4 Properties of the MLE θ . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4.1 Properties of the stochastic component . . . . . . . . . . . . . . . . 113
4.4.2 Properties of the deterministic component . . . . . . . . . . . . . . 114
4.4.3 Risk of estimation. R-efficiency . . . . . . . . . . . . . . . . . . . . 115
4.4.4 The case of a misspecified noise . . . . . . . . . . . . . . . . . . . . 118
4.5 Linear models and quadratic log-likelihood . . . . . . . . . . . . . . . . . . 119
4.6 Inference based on the maximum likelihood . . . . . . . . . . . . . . . . . 121
4.6.1 A misspecified LPA . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.6.2 A misspecified noise structure . . . . . . . . . . . . . . . . . . . . . 124
4.7 Ridge regression, projection, and shrinkage . . . . . . . . . . . . . . . . . 125
4.7.1 Regularization and ridge regression . . . . . . . . . . . . . . . . . . 126
4.7.2 Penalized likelihood. Bias and variance . . . . . . . . . . . . . . . 127
4.7.3 Inference for the penalized MLE . . . . . . . . . . . . . . . . . . . 130
4.7.4 Projection and shrinkage estimates . . . . . . . . . . . . . . . . . . 131
4.7.5 Smoothness constraints and roughness penalty approach . . . . . . 134
4.8 Shrinkage in a linear inverse problem . . . . . . . . . . . . . . . . . . . . . 134
4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates . . . 135
4.8.2 Galerkin method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.9 Semiparametric estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.9.1 (θ, η)- and υ-setup . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.9.2 Orthogonality and product structure . . . . . . . . . . . . . . . . . 139
4.9.3 Partial estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.9.4 Profile estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.9.5 Semiparametric efficiency bound . . . . . . . . . . . . . . . . . . . 145
4.9.6 Inference for the profile likelihood approach . . . . . . . . . . . . . 146
4.9.7 Plug-in method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.9.8 Two step procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.9.9 Alternating method . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5 Bayes estimation 153
5.1 Bayes formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.2 Conjugated priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.2 Exponential families and conjugated priors . . . . . . . . . . . . . 157
5.3 Linear Gaussian model and Gaussian priors . . . . . . . . . . . . . . . . . 157
5.3.1 Univariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.3.2 Linear Gaussian model and Gaussian prior . . . . . . . . . . . . . 158
5.3.3 Homogeneous errors, orthogonal design . . . . . . . . . . . . . . . 161
5.4 Non-informative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.5 Bayes estimate and posterior mean . . . . . . . . . . . . . . . . . . . . . . 163
5.5.1 Posterior mean and ridge regression . . . . . . . . . . . . . . . . . 165
6 Testing a statistical hypothesis 167
6.1 Testing problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1.1 Simple hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1.2 Composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.1.3 A test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.1.4 Errors of the first kind, test level . . . . . . . . . . . . . . . . . . . 169
6.1.5 A randomized test . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.1.6 An alternative, error of the second kind, power of the test . . . . 170
6.2 Neyman-Pearson test for two simple hypotheses . . . . . . . . . . . . . . . 171
6.2.1 Neyman-Pearson test for an i.i.d. sample . . . . . . . . . . . . . . 173
6.3 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.3.1 Gaussian shift model . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.3.2 One-sided test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.3.3 Testing the mean when the variance is unknown . . . . . . . . . . 177
6.3.4 LR-tests. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4 Testing problem for a univariate exponential family . . . . . . . . . . . . . 178
6.4.1 Two-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4.2 One-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4.3 Interval hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7 Testing in linear models 185
7.1 Likelihood ratio test for a simple null . . . . . . . . . . . . . . . . . . . . . 185
7.1.1 General errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.1.2 I.i.d. errors, known variance . . . . . . . . . . . . . . . . . . . . . . 186
7.1.3 Smooth Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.4 I.i.d. errors with unknown variance . . . . . . . . . . . . . . . . . . 190
7.2 Likelihood ratio test for a linear hypothesis . . . . . . . . . . . . . . . . . 192
8 Some other testing methods 197
8.1 Method of moments for an i.i.d. sample . . . . . . . . . . . . . . . . . . . 197
8.1.1 Series expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.1.2 Chi-squared test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.1.3 Testing a parametric hypothesis . . . . . . . . . . . . . . . . . . . 200
8.2 Minimum distance method for an i.i.d. sample . . . . . . . . . . . . . . . 201
8.2.1 Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . 202
8.2.2 ω2 test (Cramer-Smirnov-von Mises) . . . . . . . . . . . . . . . . 204
8.3 Partially Bayes tests and Bayes testing . . . . . . . . . . . . . . . . . . . . 204
8.3.1 Quasi LR approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.3.2 Partial Bayes approach and Bayes tests . . . . . . . . . . . . . . . 205
8.3.3 Bayes approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9 Deviation probability for quadratic forms 207
9.1 Gaussian case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.2 A bound for the ℓ2-norm . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.3 A bound for a quadratic form . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.4 Rescaling and regularity condition . . . . . . . . . . . . . . . . . . . . . . 217
9.5 A chi-squared bound with norm-constraints . . . . . . . . . . . . . . . . . 218
9.6 A bound for the ℓ2-norm under Bernstein conditions . . . . . . . . . . . . 221
Preface
This book was written on the basis of a graduate course on mathematical statistics given
at the mathematical faculty of the Humboldt University Berlin.
The classical theory of parametric estimation, since the seminal works by Fisher,
Wald and Le Cam, among many others, has now reached maturity and an elegant form.
It can be considered as more or less complete, at least for the so-called “regular case”.
The question of the optimality and efficiency of the classical methods has been rigorously
studied and typical results state the asymptotic normality and efficiency of the maximum
likelihood and/or Bayes estimates; see an excellent monograph by ? for a comprehensive
study.
Around 1984, when I started my own PhD at Lomonosov University, a popular joke in our statistical community in Moscow was that all the problems of parametric statistical theory had been solved and described in a complete way in ?, and that nothing was left for mathematical statisticians to do; at most, a few nonparametric problems remained open. After finishing my PhD I also moved to nonparametric statistics for a while, with a focus on local adaptive estimation. In 2005 I started to write a monograph on nonparametric estimation using local parametric methods, which was supposed to systematize my previous experience in this area. The very first draft of that book was available already in the autumn of 2005, and it included only a few sections about the basics of parametric estimation. However, attempts to prepare a more systematic and more general presentation of the nonparametric theory led me back to the very basic parametric concepts. In 2007 I significantly extended the part about parametric methods. In the spring of 2009 I taught a graduate course on parametric statistics at the mathematical faculty of the Humboldt University Berlin. My intention was to present a “modern” version of the theory which mainly addresses the following questions:
- “what do you need to know from parametric statistics to work on modern parametric and nonparametric methods?”
- “what kind of results can be established about the quality of general parametric methods if the underlying parametric model is misspecified and if the sample size does not tend to infinity?”
- “where is the borderline between parametric and nonparametric statistics?”
The classical viewpoint is that parametric statistics deals with a fixed finite-dimensional parameter space, while nonparametric statistics considers either an infinite-dimensional (functional) parameter space or a parameter space whose dimensionality grows with the sample size. Unfortunately, this distinction is not very useful or informative once we allow for model misspecification and finite samples. The book offers a slightly different vision. In particular, many problems usually treated within the nonparametric setup are included here as parametric ones. Examples are given by high-dimensional linear estimation, roughness penalties, posteriors for high-dimensional Gaussian priors, etc. The starting point of the “modern parametric” view can be stated as follows:
- any model is parametric;
- any parametric model is wrong;
- even a wrong model can be useful.
The model mentioned in the first item can be understood as a set of assumptions describing the unknown distribution of the underlying data. This description is usually given in terms of some parameters. The parameter space can be large or infinite-dimensional; however, the model is uniquely specified by the parameter value. In this sense “any model is parametric”.
The second statement, “any parametric model is wrong”, means that any model is only an idealization (approximation) of reality. It is unrealistic to assume that the data exactly follow the parametric model, even if this model is flexible and involves many parameters. Model misspecification naturally leads to the notion of the
modeling bias measuring the distance between the underlying model and the selected
parametric family. It also indicates a borderline between parametric and nonparametric
approaches. The parametric approach focuses on “estimation within the model” ignoring
the modeling bias. The nonparametric approach attempts to account for the modeling
bias and to optimize the joint impact of two kinds of errors: estimation error within
the model and the modeling bias. This volume is limited to parametric estimation for
some special models like exponential families or linear models. However, it prepares some
important tools for doing the general parametric theory presented in the second volume.
The last statement, “even a wrong model can be useful”, introduces the notion of a “useful” parametric specification. In some sense it indicates a paradigm change in parametric statistics: trying to find the true model is hopeless anyway. Instead, one aims at a potentially wrong parametric model which, however, possesses some useful properties. Among others, one can single out the following “useful” features:
- a nice geometric structure of the likelihood leading to a numerically efficient
estimation procedure;
- parameter identifiability.
Lack of identifiability is just an indication that the parametric model is poorly selected. A proper parametrization should involve a reasonable regularization ensuring both features: numerical efficiency/stability and proper parameter identification. The present volume presents some examples of “useful models”, like linear models or exponential families. The second volume will extend such models to
a quite general regular case involving some smoothness and moment conditions on the
log-likelihood process of the considered parametric family.
This book does not pretend to systematically cover the scope of the classical parametric theory. Some very important and even fundamental issues are not considered at all. One characteristic example is the notion of sufficiency, which can hardly be combined with model misspecification. At the same time, much more attention is paid to the questions of nonasymptotic inference under model misspecification, including concentration and confidence sets and the dimensionality of the parameter space.
The first volume of the book presents some basic issues and concepts of the statistical theory and illustrates them in detail for exponential families and linear models. A special focus on linear models can be explained by their role in the general theory, in which a linear model naturally arises from the local approximation of a general regular model. This volume can be used as a textbook for a graduate course in mathematical statistics. It assumes that the reader is familiar with the basic notions of probability theory including the Lebesgue measure, the Radon-Nikodym derivative, etc. Knowledge of basic statistics is not required. I tried to be as self-contained as possible; most of the presented results are proved in a rigorous way. Sometimes the details are left to the reader as exercises; in those cases some hints are given. The volume is structured as follows. The first chapter starts with a couple of examples illustrating the basic notions of the statistical estimation theory. Then it introduces some important notions like statistical experiment, regression model, i.i.d. sample.
Chapter 2 is very important for understanding the whole book. It starts with very
classical stuff: Glivenko-Cantelli results for the empirical measure that motivate the
famous substitution principle. Then the method of moments is studied in more detail, including the risk analysis and asymptotic properties. Some other classical estimation procedures are briefly discussed, including the method of minimum distance and M-estimation with its special cases: least squares, least absolute deviations and maximum likelihood estimates. The concept of efficiency is discussed in the context of the Cramer-Rao risk bound, which is given in the univariate and multivariate cases. The last sections of Chapter 2 start a kind of smooth transition from classical to “modern” parametric statistics and reveal the approach of the book. The presentation is focused on the (quasi) likelihood-based concentration and confidence sets. The basic concentration result is first introduced for the simplest Gaussian shift model and then extended to the case of a univariate exponential family in Section 2.11.
Chapter 3 extends the notions and approaches introduced for the i.i.d. case to more general regression models. Chapter 4 systematically studies the estimation problem for a linear model. The first four sections are fairly classical, and the presented results are based on a direct analysis of the linear estimation procedures. Section 4.6 reproduces the same results in a very short form, but now based on the likelihood analysis. The presentation rests on the celebrated chi-squared phenomenon, which appears to be the fundamental fact yielding the exact likelihood-based concentration and confidence properties. The further sections are complementary and can be recommended for more profound reading. Issues like regularization, shrinkage, smoothness and roughness are usually studied within the nonparametric theory; here I try to fit them into the classical linear parametric setup. A special focus is on semiparametric estimation in Section 4.9. In particular, efficient estimation and the chi-squared result are extended to the semiparametric framework. Chapter 5 briefly discusses the Bayes approach to the problem of parameter estimation.
The remaining chapters of the volume are devoted to the testing problem. Chapter 6 presents classical results like the Neyman-Pearson Lemma and properties of the likelihood ratio test for exponential families. Chapter 7 focuses on the testing problem for the linear Gaussian model. Finally, Chapter 8 presents an overview of some nonparametric testing procedures like the minimum distance, Kolmogorov-Smirnov, ω2 and χ2 tests. A brief look at the testing problem from the Bayes viewpoint is given at the end.
Part I
Basics
Chapter 1
Basic notions
The starting point of any statistical analysis is data, also called observations or a sample.
A statistical model is used to explain the nature of the data. A standard approach
assumes that the data is random and utilizes some probabilistic framework. In contrast to probability theory, the distribution of the data is not known precisely, and the goal of the analysis is to make inference about this unknown distribution.
The parametric approach assumes that the distribution of the data is known up to the
value of a parameter $\theta$ from some subset $\Theta$ of a finite-dimensional space $IR^p$. In this
case the statistical analysis is naturally reduced to the estimation of the parameter θ :
as soon as θ is known, we know the whole distribution of the data. Before introducing
the general notion of a statistical model, we discuss some popular examples.
1.1 Example of a Bernoulli experiment
Let $Y = (Y_1, \ldots, Y_n)^{\top}$ be a sequence of binary digits zero or one. We distinguish
between deterministic and random sequences. Deterministic sequences appear e.g. from
the binary representation of a real number, or from digitally coded images, etc. Random
binary sequences appear e.g. from coin throws, games, etc. In many situations incomplete
information can be treated as random data: the classification of healthy and sick patients,
individual vote results, the bankruptcy of a firm or credit default, etc.
Basic assumptions behind a Bernoulli experiment are:
• the observed data Yi are independent and identically distributed.
• each Yi assumes the value one with probability θ ∈ [0, 1] .
The parameter θ completely identifies the distribution of the data Y . Indeed, for every
$i \le n$ and $y \in \{0,1\}$,
$$ IP(Y_i = y) = \theta^{y} (1-\theta)^{1-y}, $$
and the independence of the $Y_i$'s implies for every sequence $y = (y_1, \ldots, y_n)$ that
$$ IP(Y = y) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}. \qquad (1.1) $$
To indicate this fact, we write IPθ in place of IP .
The equation (1.1) can be rewritten as
$$ IP_\theta(Y = y) = \theta^{s_n} (1-\theta)^{n - s_n}, $$
where
$$ s_n = \sum_{i=1}^{n} y_i . $$
The value $s_n$ is often interpreted as the number of successes in the sequence $y$.
Probabilistic theory focuses on the probabilistic properties of the data Y under the
given measure IPθ . The aim of the statistical analysis is to infer on the measure IPθ for
an unknown θ based on the available data Y . Typical examples of statistical problems
are:
1. Estimate the parameter $\theta$, i.e. build an estimate $\tilde\theta$ as a function of the data $Y$ with values in $[0,1]$ which approximates the unknown value $\theta$ as well as possible;

2. Build a confidence set for $\theta$, i.e. a random (data-based) set (usually an interval) containing $\theta$ with a prescribed probability;

3. Test a simple hypothesis that $\theta$ coincides with a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$;

4. Test a composite hypothesis that $\theta$ belongs to a prescribed subset $\Theta_0$ of the interval $[0,1]$.
Usually any statistical method is based on a preliminary probabilistic analysis of the
model under the given θ .
Theorem 1.1.1. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then the mean and the variance of the sum $S_n = Y_1 + \ldots + Y_n$ satisfy
$$ IE_\theta S_n = n\theta, \qquad Var_\theta S_n \stackrel{def}{=} IE_\theta \bigl( S_n - IE_\theta S_n \bigr)^2 = n\theta(1-\theta). $$
Exercise 1.1.1. Prove this theorem.
This result suggests that the empirical mean $\tilde\theta = S_n/n$ is a reasonable estimate of $\theta$. Indeed, the result of the theorem implies
$$ IE_\theta \tilde\theta = \theta, \qquad IE_\theta \bigl( \tilde\theta - \theta \bigr)^2 = \theta(1-\theta)/n. $$
The first equation means that $\tilde\theta$ is an unbiased estimate of $\theta$, that is, $IE_\theta \tilde\theta = \theta$ for all $\theta$. The second equation yields a kind of concentration (consistency) property of $\tilde\theta$: with $n$ growing, the estimate $\tilde\theta$ concentrates in a small neighborhood of the point $\theta$. By the Chebyshev inequality
$$ IP_\theta \bigl( |\tilde\theta - \theta| > \delta \bigr) \le \theta(1-\theta)/(n\delta^2). $$
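The unbiasedness and the variance formula can be checked by a quick simulation. The following Python sketch is my own illustration, not part of the text; all names and parameter values are ad hoc. It draws many Bernoulli samples and compares the Monte Carlo mean and variance of $\tilde\theta$ with $\theta$ and $\theta(1-\theta)/n$:

```python
import random

random.seed(1)

theta = 0.3    # true parameter (arbitrary illustrative value)
n = 100        # sample size
M = 20000      # Monte Carlo replications

estimates = []
for _ in range(M):
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)   # empirical mean: theta-tilde = S_n / n

mc_mean = sum(estimates) / M
mc_var = sum((e - mc_mean) ** 2 for e in estimates) / M

print(mc_mean)   # close to theta = 0.3 (unbiasedness)
print(mc_var)    # close to theta*(1-theta)/n = 0.0021
```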
This result is refined by the famous de Moivre-Laplace theorem.
Theorem 1.1.2. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then for every $k \le n$
$$ IP_\theta(S_n = k) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \approx \frac{1}{\sqrt{2\pi n \theta(1-\theta)}} \exp\Bigl\{ -\frac{(k - n\theta)^2}{2 n \theta(1-\theta)} \Bigr\}, $$
where $a_n \approx b_n$ means $a_n / b_n \to 1$ as $n \to \infty$. Moreover, for any fixed $z > 0$,
$$ IP_\theta \Bigl( \Bigl| \frac{S_n}{n} - \theta \Bigr| > z \sqrt{\theta(1-\theta)/n} \Bigr) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2} \, dt . $$
This concentration result yields that the estimate $\tilde\theta$ deviates from the root-n neighborhood $A(z,\theta) \stackrel{def}{=} \{ u : |u - \theta| \le z \sqrt{\theta(1-\theta)/n} \}$ with probability of order $e^{-z^2/2}$.
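The local approximation in the de Moivre-Laplace theorem is easy to inspect numerically. The sketch below is illustrative (standard library only; the parameter values are my choice) and compares the exact binomial probability with its Gaussian approximation near $k = n\theta$:

```python
import math

theta = 0.4
n = 200

def binom_pmf(k):
    # exact probability IP_theta(S_n = k)
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def gauss_approx(k):
    # de Moivre-Laplace approximation of the same probability
    v = n * theta * (1 - theta)
    return math.exp(-((k - n * theta) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

# near the mean n*theta = 80 the ratio of the two expressions is close to one
for k in (70, 80, 90):
    print(k, binom_pmf(k) / gauss_approx(k))
```

For moderate $n$ the ratio already deviates from one by only a few percent, in line with the $a_n/b_n \to 1$ statement of the theorem.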
This result bounding the difference $|\tilde\theta - \theta|$ can also be used to build random confidence intervals around the point $\tilde\theta$. Indeed, by the result of the theorem, the random interval $E^*(z) = \{ u : |\tilde\theta - u| \le z \sqrt{\theta(1-\theta)/n} \}$ fails to cover the true point $\theta$ with approximately the same probability:
$$ IP_\theta \bigl( E^*(z) \not\ni \theta \bigr) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2} \, dt . \qquad (1.2) $$
Unfortunately, the construction of this interval $E^*(z)$ is not entirely data-based: its width involves the true unknown value $\theta$. A data-based confidence set can be obtained by replacing the population variance $\sigma^2 \stackrel{def}{=} IE_\theta (Y_1 - \theta)^2 = \theta(1-\theta)$ with its empirical counterpart
$$ \tilde\sigma^2 \stackrel{def}{=} \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i - \tilde\theta \bigr)^2 . $$
The resulting confidence set $E(z)$ reads as
$$ E(z) \stackrel{def}{=} \bigl\{ u : |\tilde\theta - u| \le z \sqrt{\tilde\sigma^2 / n} \bigr\} . $$
It possesses the same asymptotic properties as $E^*(z)$, including (1.2).
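The claim that the data-based interval $E(z)$ keeps the asymptotic coverage can be checked by simulation. This Python sketch (mine, not from the book) estimates the non-coverage frequency for $z = 1.96$, for which the nominal non-coverage $2(1-\Phi(z))$ is about $0.05$:

```python
import math
import random

random.seed(2)

theta, n, z, M = 0.3, 500, 1.96, 4000   # illustrative values

misses = 0
for _ in range(M):
    y = [1 if random.random() < theta else 0 for _ in range(n)]
    tt = sum(y) / n                                # theta-tilde
    s2 = sum((yi - tt) ** 2 for yi in y) / n       # empirical variance sigma-tilde^2
    if abs(tt - theta) > z * math.sqrt(s2 / n):    # E(z) does not cover theta
        misses += 1

print(misses / M)   # close to the nominal non-coverage 0.05
```

Note that for Bernoulli data the empirical variance simplifies to $\tilde\sigma^2 = \tilde\theta(1-\tilde\theta)$, so the loop computes exactly the interval $E(z)$ of the text.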
The hypothesis that the value $\theta$ is equal to a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$, can be checked by examining the difference $|\tilde\theta - 1/2|$. If this value is too large compared with $\sigma n^{-1/2}$ or with $\tilde\sigma n^{-1/2}$, then the data contradict the hypothesis with high probability. Similarly one can consider a composite hypothesis that $\theta$ belongs to some interval $[\theta_1, \theta_2] \subset [0,1]$. If $\tilde\theta$ deviates from this interval by at least the value $z \tilde\sigma n^{-1/2}$ with a large $z$, then the data significantly contradict this hypothesis.
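As a small illustration (again my own sketch; the function names are ad hoc), a test of the simple hypothesis $\theta = \theta_0$ rejects when the studentized difference exceeds a threshold $z$. Its rejection rate is close to the nominal level under the hypothesis and close to one for a distant alternative:

```python
import math
import random

random.seed(3)

def reject(y, theta0, z=1.96):
    """Reject theta = theta0 when |theta-tilde - theta0| > z * sigma-tilde * n^{-1/2}."""
    n = len(y)
    tt = sum(y) / n
    sigma = math.sqrt(tt * (1 - tt))   # empirical sigma-tilde for Bernoulli data
    return abs(tt - theta0) > z * sigma / math.sqrt(n)

def rejection_rate(theta, theta0, n=1000, M=500):
    hits = 0
    for _ in range(M):
        y = [1 if random.random() < theta else 0 for _ in range(n)]
        hits += reject(y, theta0)
    return hits / M

print(rejection_rate(0.5, 0.5))   # close to the nominal level 0.05
print(rejection_rate(0.6, 0.5))   # close to one: 0.1 is many sigma*n^{-1/2} units
```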
1.2 Least squares estimation in a linear model
A linear model assumes a linear systematic dependence of the output (also called response or explained variable) $Y$ on the input (also called regressor or explanatory variable) $\Psi$, which in general can be multidimensional. The linear model is usually written in the form
$$ IE(Y) = \Psi^{\top} \theta^* $$
with an unknown vector of coefficients $\theta^* = (\theta_1^*, \ldots, \theta_p^*)^{\top}$. Equivalently one writes
$$ Y = \Psi^{\top} \theta^* + \varepsilon \qquad (1.3) $$
where $\varepsilon$ stands for the individual error with zero mean: $IE\varepsilon = 0$. Such a linear model is often used to describe the influence of the regressor $\Psi$ on the response $Y$ from a collection of data in the form of a sample $(Y_i, \Psi_i)$ for $i = 1, \ldots, n$.
Let $\theta$ be a vector of coefficients considered as a candidate for $\theta^*$. Then each observation $Y_i$ is approximated by $\Psi_i^{\top} \theta$. One often measures the quality of approximation by the sum of squared errors $\sum_i |Y_i - \Psi_i^{\top} \theta|^2$. Under the model assumption (1.3), the expected value of this sum is
$$ IE \sum_i |Y_i - \Psi_i^{\top} \theta|^2 = IE \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) + \varepsilon_i \bigr|^2 = \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) \bigr|^2 + \sum_i IE \varepsilon_i^2 . $$
The cross term cancels in view of $IE\varepsilon_i = 0$. Note that minimizing this expression w.r.t. $\theta$ is equivalent to minimizing the first sum, because the second sum does not depend on $\theta$. Therefore,
$$ \operatorname{argmin}_{\theta} IE \sum_i |Y_i - \Psi_i^{\top} \theta|^2 = \operatorname{argmin}_{\theta} \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) \bigr|^2 = \theta^* . $$
In words, the true parameter vector $\theta^*$ minimizes the expected quadratic error of fitting the data with a linear combination of the $\Psi_i$'s. The least squares estimate of the parameter vector $\theta^*$ is defined by minimizing in $\theta$ its empirical counterpart, that is, the sum of the squared errors $|Y_i - \Psi_i^{\top} \theta|^2$ over all $i$:
$$ \tilde\theta \stackrel{def}{=} \operatorname{argmin}_{\theta} \sum_{i=1}^{n} \bigl| Y_i - \Psi_i^{\top} \theta \bigr|^2 . $$
This minimization problem can be solved explicitly under some condition on the $\Psi_i$'s. Define the $p \times n$ design matrix $\Psi = (\Psi_1, \ldots, \Psi_n)$. The aforementioned condition means that this matrix has rank $p$.
Theorem 1.2.1. Let $Y_i = \Psi_i^{\top} \theta^* + \varepsilon_i$ for $i = 1, \ldots, n$, where the $\varepsilon_i$ are independent and satisfy $IE\varepsilon_i = 0$, $IE\varepsilon_i^2 = \sigma^2$. Suppose that the matrix $\Psi$ has rank $p$. Then
$$ \tilde\theta = \bigl( \Psi \Psi^{\top} \bigr)^{-1} \Psi Y $$
where $Y = (Y_1, \ldots, Y_n)^{\top}$. Moreover, $\tilde\theta$ is unbiased in the sense that
$$ IE_{\theta^*} \tilde\theta = \theta^* $$
and its variance satisfies $Var(\tilde\theta) = \sigma^2 \bigl( \Psi \Psi^{\top} \bigr)^{-1}$.
For each vector $h \in IR^p$, the random value $\tilde a = \langle h, \tilde\theta \rangle = h^{\top} \tilde\theta$ is an unbiased estimate of $a^* = h^{\top} \theta^*$:
$$ IE_{\theta^*}(\tilde a) = a^* \qquad (1.4) $$
with the variance
$$ Var(\tilde a) = \sigma^2 h^{\top} \bigl( \Psi \Psi^{\top} \bigr)^{-1} h . $$
Proof. Define

L(θ) def= ∑_{i=1}^n |Yi − Ψ>i θ|² = ‖Y − Ψ>θ‖²,

where ‖y‖² def= ∑_i yi² . The normal equation dL(θ)/dθ = 0 can be written as ΨΨ>θ =
ΨY , yielding the representation of θ̃ . Now the model equation yields IEθ∗Y = Ψ>θ∗
and thus

IEθ∗ θ̃ = (ΨΨ>)⁻¹ Ψ IEθ∗Y = (ΨΨ>)⁻¹ ΨΨ>θ∗ = θ∗

as required.
Exercise 1.2.1. Check that Var(θ̃) = σ²(ΨΨ>)⁻¹ .

Similarly one obtains IEθ∗ ã = IEθ∗(h>θ̃) = h>θ∗ = a∗ , that is, ã is an unbiased
estimate of a∗ . Also

Var(ã) = Var(h>θ̃) = h> Var(θ̃) h = σ² h>(ΨΨ>)⁻¹ h,

which completes the proof.
The next result states that the proposed estimate ã is in some sense the best possible
one. Namely, we consider the class of all linear unbiased estimates of a∗ satisfying the
identity (1.4). It appears that the variance σ² h>(ΨΨ>)⁻¹ h of ã is the smallest possible
in this class.
Theorem 1.2.2 (Gauss-Markov). Let Yi = Ψ>i θ∗ + εi for i = 1, . . . , n with uncorrelated
εi satisfying IEεi = 0 and IEεi² = σ² . Let rank(Ψ) = p . Suppose that the value
a∗ def= ⟨h, θ∗⟩ = h>θ∗ is to be estimated for a given vector h ∈ IRp . Then ã = ⟨h, θ̃⟩ = h>θ̃
is an unbiased estimate of a∗ . Moreover, ã has the minimal possible variance over the
class of all linear unbiased estimates of a∗ .
This result was historically one of the first optimality results in statistics. It presents
a lower efficiency bound for any statistical procedure in this class. Under the imposed
restrictions it is impossible to do better than the LSE does. This and more general results
will be proved later in Chapter 4.
Define also the vector of residuals

ε̃ def= Y − Ψ>θ̃ .

If θ̃ is a good estimate of the vector θ∗ , then, due to the model equation, ε̃ is a good
estimate of the vector ε of individual errors. Many statistical procedures utilize this
observation by checking the quality of estimation via the analysis of the estimated vector
ε̃ . In the case when this vector still shows a nonzero systematic component, there is
evidence that the assumed linear model is incorrect. This vector can also be used to
estimate the noise variance σ² .
Theorem 1.2.3. Consider the linear model Yi = Ψ>i θ∗ + εi with independent homoge-
neous errors εi . Then the variance σ² = IEεi² can be estimated by

σ̃² = ‖ε̃‖²/(n − p) = ‖Y − Ψ>θ̃‖²/(n − p)

and σ̃² is an unbiased estimate of σ² , that is, IEθ∗ σ̃² = σ² for all θ∗ and σ .
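As a quick numerical illustration of Theorems 1.2.1 and 1.2.3, the sketch below works out the special case p = 2 with design vectors Ψi = (1, xi)> (simple linear regression). All data-generating values are illustrative, and the normal equations ΨΨ>θ = ΨY are solved by explicit 2 × 2 inversion.

```python
import random

def lse(x, y):
    """Solve the normal equations (Psi Psi^T) theta = Psi Y for the
    p = 2 design Psi_i = (1, x_i)^T by explicit 2x2 inversion."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(v * w for v, w in zip(x, y))
    det = n * sxx - sx * sx  # det(Psi Psi^T); nonzero iff rank(Psi) = 2
    th1 = (sxx * sy - sx * sxy) / det
    th2 = (n * sxy - sx * sy) / det
    return th1, th2

def sigma2_tilde(x, y, th1, th2):
    """Unbiased variance estimate ||Y - Psi^T theta|^2 / (n - p), p = 2."""
    rss = sum((yi - th1 - th2 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - 2)

# Illustrative simulated data: theta* = (1, 2), sigma = 0.5
random.seed(0)
x = [i / 10 for i in range(100)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]
th1, th2 = lse(x, y)
s2 = sigma2_tilde(x, y, th1, th2)
```

Both coefficient estimates land close to the true values, and the residual-based variance estimate is close to σ² = 0.25.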
Theorems 1.2.2 and 1.2.3 can be used to describe the concentration properties of the
estimate ã and to build confidence sets based on ã and σ̃ , especially if the errors εi
are normally distributed.
Theorem 1.2.4. Let Yi = Ψ>i θ∗ + εi for i = 1, . . . , n with εi ∼ N(0, σ²) . Let
rank(Ψ) = p . Then it holds for the estimate ã = h>θ̃ of a∗ = h>θ∗ :

ã − a∗ ∼ N(0, s²) with s² = σ² h>(ΨΨ>)⁻¹ h .

Corollary 1.2.5 (Concentration). If for some α > 0 , zα is the (1 − α/2) -quantile of the
standard normal law (i.e. Φ(zα) = 1 − α/2 ), then

IPθ∗( |ã − a∗| > zα s ) = α.
Exercise 1.2.2. Check Corollary 1.2.5.
The next result describes the confidence set for a∗ . The unknown variance s² is
replaced by its estimate

s̃² def= σ̃² h>(ΨΨ>)⁻¹ h .

Corollary 1.2.6 (Confidence set). If E(zα) def= {a : |a − ã| ≤ s̃ zα} , then

IPθ∗( E(zα) ∌ a∗ ) ≈ α.
1.3 General parametric model
Let Y denote the observed data with values in the observation space Y . In most cases,
Y ∈ IRn , that is, Y = (Y1, . . . , Yn)> . Here n denotes the sample size (number of
observations). The basic assumption about these data is that the vector Y is a random
variable on a probability space (Y,B(Y), IPθ∗) , where B(Y) is the Borel σ -algebra on
Y . The probabilistic approach assumes that the probability measure IPθ∗ is known and
studies the distributional (population) properties of the vector Y . On the contrary,
the statistical approach assumes that the data Y are given and tries to recover the
distribution IP on the basis of the available data Y . One can say that the statistical
problem is inverse to the probabilistic one.
The statistical analysis is usually based on the notion of statistical experiment. This
notion assumes that a family P = {IP} of probability measures IP on (Y,B(Y)) is fixed
and the unknown underlying measure IPθ∗ belongs to this family. Often this family is
parameterized by the value θ from some parameter set Θ : P = (IPθ,θ ∈ Θ) . The
corresponding statistical experiment can be written as
(Y,B(Y), (IPθ,θ ∈ Θ)
).
The value θ∗ denotes the “true” parameter value, that is, IP = IPθ∗ .
The statistical experiment is dominated if there exists a dominating σ -finite measure
µ0 such that all the IPθ are absolutely continuous w.r.t. µ0 . In what follows we assume
without further mention that the considered statistical models are dominated. Usually
the choice of a dominating measure is unimportant and any one can be used.
The parametric approach assumes that Θ is a subset of a finite-dimensional Euclidean
space IRp . In this case, the unknown data distribution is specified by the value of a finite-
dimensional parameter θ from Θ ⊆ IRp . Since in this case the parameter θ completely
identifies the distribution of the observations Y , the statistical estimation problem is
reduced to recovering (estimating) this parameter from the data. The nice feature of the
parametric theory is that the estimation problem can be solved in a rather general way.
1.4 Statistical decision problem. Loss and Risk
The statistical decision problem is usually formulated in terms of game theory, the statis-
tician playing as it were against nature. Let D denote the decision space that is assumed
to be a topological space. Next, let ℘(·, ·) be a loss function given on the product D×Θ .
The value ℘(d,θ) denotes the loss associated with the decision d ∈ D when the true
parameter value is θ ∈ Θ . The statistical decision problem is composed of a statistical
experiment (Y,B(Y),P) , a decision space D and a loss function ℘(·, ·) .
A statistical decision ρ = ρ(Y ) is a measurable function of the observed data Y
with values in the decision space D . Clearly, ρ(Y ) can be considered as a random
D -valued element on the space (Y,B(Y)) . The corresponding loss under the true model
(Y,B(Y), IPθ∗) reads as ℘(ρ(Y ),θ∗) . Finally, the risk is defined as the expected value
of the loss:
R(ρ,θ∗)def= IEθ∗℘(ρ(Y ),θ∗).
Below we present a list of typical statistical decision problems.
Example 1.4.1. [Point estimation problem] Let the target of analysis be the true pa-
rameter θ∗ itself, that is, let D coincide with Θ . Let ℘(·, ·) be a kind of distance on
Θ , that is, ℘(θ, θ∗) denotes the loss of estimation when the selected value is θ while
the true parameter is θ∗ . Typical examples of the loss function are the quadratic loss
℘(θ, θ∗) = ‖θ − θ∗‖² , the l1 -loss ℘(θ, θ∗) = ‖θ − θ∗‖1 , and the sup-loss ℘(θ, θ∗) =
‖θ − θ∗‖∞ = max_{j=1,...,p} |θj − θ∗j| .
If θ̃ is an estimate of θ∗ , that is, θ̃ is a Θ -valued function of the data Y , then the
corresponding risk is

R(θ̃, θ∗) def= IEθ∗ ℘(θ̃, θ∗).

In particular, the quadratic risk reads as IEθ∗ ‖θ̃ − θ∗‖² .
Example 1.4.2. [Testing problem] Let Θ0 and Θ1 be two complementary subsets of Θ ,
that is, Θ0 ∩Θ1 = ∅ , Θ0 ∪Θ1 = Θ . Our target is to check whether the true parameter
θ∗ belongs to the subset Θ0 . The decision space consists of two points {0, 1} for which
d = 0 means the acceptance of the hypothesis H0 : θ∗ ∈ Θ0 while d = 1 rejects H0 in
favor of the alternative H1 : θ∗ ∈ Θ1 . Define the loss
℘(d,θ) = 1(d = 1,θ ∈ Θ0) + 1(d = 0,θ ∈ Θ1).
A test φ is a binary valued function of the data, φ = φ(Y ) ∈ {0, 1} . The corresponding
risk R(φ,θ∗) = IEθ∗φ(Y ) can be interpreted as the probability of selecting the wrong
subset.
Example 1.4.3. [Confidence estimation] Let the target of analysis again be the pa-
rameter θ∗ . However, we aim to identify a subset A of Θ , as small as possible, that
covers with a prescribed probability the true value θ∗ . Our decision space D is now
the set of all measurable subsets in Θ . For any A ∈ D , the loss function is defined as
℘(A, θ∗) = 1(A ∌ θ∗) . A confidence set is a random set E selected from the data Y ,
E = E(Y ) . The corresponding risk R(E,θ∗) = IEθ∗℘(E,θ∗) is just the probability that
E does not cover θ∗ .
Example 1.4.4. [Estimation of a functional] Let the target of estimation be a given
function f(θ∗) of the parameter θ∗ with values in another space F . A typical example is
given by a single component of the vector θ∗ . An estimate ρ of f(θ∗) is a function of the
data Y into F : ρ = ρ(Y ) ∈ F . The loss function ℘ is defined on the product F ×F ,
yielding the loss ℘(ρ(Y ), f(θ∗)) and the risk R(ρ(Y ), f(θ∗)) = IEθ∗℘(ρ(Y ), f(θ∗)) .
Exercise 1.4.1. Define the statistical decision problem for testing a simple hypothesis
θ∗ = θ0 for a given point θ0 .
1.5 Efficiency
After the statistical decision problem is stated, one can ask for its optimal solution.
Equivalently one can say that the aim of statistical analysis is to build a decision with
the minimal possible risk. However, a comparison of any two decisions on the basis of
risk can be a nontrivial problem. Indeed, the risk R(ρ,θ∗) of a decision ρ depends on
the true parameter value θ∗ . It may happen that one decision performs better for some
points θ∗ ∈ Θ but worse at other points θ∗ . An extreme example of such an estimate is
the trivial deterministic decision θ = θ0 which sets the estimate equal to the value θ0
whatever the data is. This is, of course, a very strange and poor estimate, but it clearly
outperforms all other methods if the true parameter θ∗ is indeed θ0 .
Two approaches are typically used to compare different statistical decisions: the
minimax approach considers the maximum R(ρ) of the risks R(ρ,θ) over the parameter
set Θ while the Bayes approach is based on the weighted sum (integral) Rπ(ρ) of such
risks with respect to some measure π on the parameter set Θ which is called the prior
distribution:
R(ρ) = sup_{θ∈Θ} R(ρ, θ), Rπ(ρ) = ∫ R(ρ, θ) π(dθ).
The decision ρ∗ is called minimax if

R(ρ∗) = inf_ρ R(ρ) = inf_ρ sup_{θ∈Θ} R(ρ, θ),
where the infimum is taken over the set of all possible decisions ρ . The value R∗ = R(ρ∗)
is called the minimax risk.
Similarly, the decision ρπ is called Bayes for the prior π if

Rπ(ρπ) = inf_ρ Rπ(ρ).
The corresponding value Rπ(ρπ) is called the Bayes risk.
Exercise 1.5.1. Show that the minimax risk is greater than or equal to the Bayes risk
whatever the prior measure π is.
Hint: show that for any decision ρ , it holds R(ρ) ≥ Rπ(ρ) .
Usually the problem of finding a minimax or Bayes estimate is quite hard and a
closed form solution is available only in very few special cases. A standard way out of
this problem is to switch to an asymptotic set-up in which the sample size grows to
infinity.
Chapter 2

Parameter estimation for an i.i.d. model
In the present chapter we consider the estimation problem for a sample of independent
identically distributed (i.i.d.) observations. Throughout the chapter the data Y are
assumed to be given in the form of a sample (Y1, . . . , Yn) . We assume that the obser-
vations Y1, . . . , Yn are independent identically distributed; each Yi is from an unknown
distribution P also called a marginal measure. The joint data distribution IP is the
n -fold product of P : IP = P⊗n . Thus, the measure IP is uniquely identified by P and
the statistical problem can be reduced to recovering P .
The further step in model specification is based on a parametric assumption (PA):
the measure P belongs to a given parametric family.
2.1 Empirical distribution. Glivenko-Cantelli Theorem
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample. For simplicity we assume that the Yi ’s are
univariate with values in IR . Let P denote the distribution of each Yi :
P (B) = IP (Yi ∈ B), B ∈ B(IR).
One often says that Y is an i.i.d. sample from P . Let also F be the corresponding
distribution function (cdf):
F (y) = IP (Y1 ≤ y) = P ((−∞, y]).
The assumption that the Yi ’s are i.i.d. implies that the joint distribution IP of the data
Y is given by the n -fold product of the marginal measure P :
IP = P⊗n.
Let also Pn (resp. Fn ) be the empirical measure (resp. empirical distribution function
(edf))

Pn(B) = (1/n) ∑ 1(Yi ∈ B), Fn(y) = (1/n) ∑ 1(Yi ≤ y).

Here and everywhere in this chapter the symbol ∑ stands for ∑_{i=1}^n . One can consider
Fn as the distribution function of the empirical measure Pn defined as the atomic
measure at the Yi 's:

Pn(A) def= (1/n) ∑_{i=1}^n 1(Yi ∈ A).
So, Pn(A) is the empirical frequency of the event A , that is, the fraction of observations
Yi belonging to A . By the law of large numbers one can expect that this empirical
frequency is close to the true probability P (A) if the number of observations is sufficiently
large.
An equivalent definition of the empirical measure and empirical distribution function
can be given in terms of the empirical mean IEn g for a measurable function g :

IEn g def= ∫ g(y) Pn(dy) = ∫ g(y) dFn(y) = (1/n) ∑_{i=1}^n g(Yi).
The first result claims that, indeed, for every Borel set B on the real line, the empirical
mass Pn(B) (which is random) is close in probability to the population counterpart
P (B) .
Theorem 2.1.1. For any Borel set B , it holds:

1. IE Pn(B) = P (B) .

2. Var{Pn(B)} = n⁻¹ σB² with σB² = P (B){1 − P (B)} .

3. Pn(B) → P (B) in probability as n → ∞ .

4. √n {Pn(B) − P (B)} w−→ N(0, σB²) .
Proof. Denote ξi = 1(Yi ∈ B) . This is a Bernoulli r.v. with parameter P (B) = IEξi .
The first statement holds by definition of Pn(B) = n⁻¹ ∑_i ξi . Next, for each i ≤ n ,

Var ξi def= IEξi² − (IEξi)² = P (B){1 − P (B)}

in view of ξi² = ξi . Independence of the ξi 's yields

Var{Pn(B)} = Var( n⁻¹ ∑_{i=1}^n ξi ) = n⁻² ∑_{i=1}^n Var ξi = n⁻¹ σB².
The third statement follows by the law of large numbers for the i.i.d. r.v.'s ξi :

(1/n) ∑_{i=1}^n ξi IP−→ IEξ1 .

Finally, the last statement follows by the Central Limit Theorem for the ξi :

(1/√n) ∑_{i=1}^n ( ξi − IEξi ) w−→ N(0, σB²).
The next important result shows that the edf Fn is a good approximation of the cdf
F in the uniform norm.
Theorem 2.1.2 (Glivenko-Cantelli). It holds

sup_y |Fn(y) − F (y)| → 0, n → ∞.

Proof. Consider first the case when the function F is continuous in y . Fix any integer
N and define, with ε = 1/N , the points t1 < t2 < . . . < tN = +∞ such that F (tj) −
F (tj−1) = ε for j = 2, . . . , N . For every j , by (3) of Theorem 2.1.1, it holds Fn(tj) →
F (tj) . This implies that for some n(N) , it holds for all n ≥ n(N)

|Fn(tj) − F (tj)| ≤ ε, j = 1, . . . , N. (2.1)

Now for every t ∈ [tj−1, tj ] , it holds by definition

F (tj−1) ≤ F (t) ≤ F (tj), Fn(tj−1) ≤ Fn(t) ≤ Fn(tj).

This together with (2.1) implies |Fn(t) − F (t)| ≤ 2ε .
If the function F (·) is not continuous, then for every positive ε , there exists a finite set
Sε of points of discontinuity sm with F (sm) − F (sm − 0) ≥ ε . One can proceed as in
the continuous case by adding the points from Sε to the discrete set {tj} .
Exercise 2.1.1. Check the details of the proof of Theorem 2.1.2.
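The uniform convergence in Theorem 2.1.2 is easy to observe numerically. The sketch below (sample sizes and seed are illustrative) draws U[0, 1] samples, for which F(y) = y, and evaluates sup_y |Fn(y) − F(y)|; since Fn is piecewise constant with jumps at the order statistics, the supremum needs to be checked only just before and at those points.

```python
import random

def sup_deviation(n, seed=0):
    """sup_y |F_n(y) - F(y)| for an i.i.d. U[0,1] sample (so F(y) = y).
    F_n jumps by 1/n at each order statistic Y_(i): just below Y_(i) its
    value is i/n, at Y_(i) it is (i+1)/n, so the supremum is a maximum
    over these finitely many candidate values."""
    rng = random.Random(seed)
    ys = sorted(rng.random() for _ in range(n))
    return max(max(abs(i / n - y), abs((i + 1) / n - y))
               for i, y in enumerate(ys))

d_small, d_large = sup_deviation(100), sup_deviation(10000)
```

The deviation shrinks roughly at the rate n^{-1/2}, in line with the Kolmogorov-Smirnov scaling.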
The results of Theorems 2.1.1 and 2.1.2 can be extended to certain functionals of the
distribution P . Let g(y) be a function on the real line. Consider its expectation
s0 def= IEg(Y1) = ∫ g(y) dF (y).

Its empirical counterpart is defined by

Sn def= ∫ g(y) dFn(y) = (1/n) ∑_{i=1}^n g(Yi).
It appears that Sn indeed well estimates s0 , at least for large n .
Theorem 2.1.3. Let g(y) be a function on the real line such that

∫ g²(y) dF (y) < ∞.

Then

Sn IP−→ s0, √n (Sn − s0) w−→ N(0, σg²), n → ∞,

where

σg² def= ∫ g²(y) dF (y) − s0² = ∫ [g(y) − s0]² dF (y).

Moreover, if h(z) is a twice continuously differentiable function on the real line and
h′(s0) ≠ 0 , then

h(Sn) IP−→ h(s0), √n {h(Sn) − h(s0)} w−→ N(0, σh²), n → ∞,

where σh² def= |h′(s0)|² σg² .
Proof. The first statement is again the law of large numbers and the CLT for the i.i.d.
random variables ξi = g(Yi) having mean value s0 and variance σg² . It also implies the
second statement in view of the Taylor expansion h(Sn) − h(s0) ≈ h′(s0)(Sn − s0) .
Exercise 2.1.2. Complete the proof.
Hint: use the first result to show that Sn belongs with high probability to a small
neighborhood U of the point s0 . Then apply the second-order Taylor expansion to
h(Sn) − h(s0) = h(s0 + n^{-1/2} ξn) − h(s0) with ξn = √n (Sn − s0) :

|n^{1/2}[h(Sn) − h(s0)] − h′(s0) ξn| ≤ n^{-1/2} h∗ ξn²/2,

where h∗ = max_U |h′′(y)| . Show that n^{-1/2} ξn² IP−→ 0 because ξn is stochastically
bounded by the first statement of the theorem.
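The second statement of Theorem 2.1.3 can also be checked by a small Monte Carlo experiment. The setup below is illustrative: Yi ∼ U[0, 1] with g(y) = y, so s0 = 1/2 and σg² = 1/12, and h(z) = z², so that the limit variance is σh² = |h′(s0)|² σg² = 1 · 1/12.

```python
import math
import random

def delta_method_check(n=200, reps=2000, seed=1):
    """Sample the statistic sqrt(n)*(h(S_n) - h(s0)) for h(z) = z^2 with
    Y_i ~ U[0,1]; its mean should be near 0 and its variance near
    |h'(s0)|^2 * sigma_g^2 = 1/12."""
    rng = random.Random(seed)
    s0 = 0.5
    zs = []
    for _ in range(reps):
        sn = sum(rng.random() for _ in range(n)) / n
        zs.append(math.sqrt(n) * (sn * sn - s0 * s0))
    mean = sum(zs) / reps
    var = sum((z - mean) ** 2 for z in zs) / reps
    return mean, var

mean_z, var_z = delta_method_check()  # theory: mean near 0, var near 1/12
```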
The results of Theorems 2.1.2 and 2.1.3 can be extended to the case of a vectorial
function g(·) : IR¹ → IRm , that is, g(y) = (g1(y), . . . , gm(y))> for y ∈ IR¹ . Then
s0 = (s0,1, . . . , s0,m)> and its empirical counterpart Sn = (Sn,1, . . . , Sn,m)> are vectors
in IRm as well:

s0,j def= ∫ gj(y) dF (y), Sn,j def= ∫ gj(y) dFn(y), j = 1, . . . , m.
Theorem 2.1.4. Let g(y) be an IRm -valued function on the real line with a bounded
covariance matrix Σ = (Σjk)_{j,k=1,...,m} :

Σjk def= ∫ [gj(y) − s0,j][gk(y) − s0,k] dF (y) < ∞, j, k ≤ m.

Then

Sn IP−→ s0, √n (Sn − s0) w−→ N(0, Σ), n → ∞.

Moreover, if H(z) is a twice continuously differentiable function on IRm and ΣH′(s0) ≠ 0 ,
where H′(z) stands for the gradient of H at z , then

H(Sn) IP−→ H(s0), √n {H(Sn) − H(s0)} w−→ N(0, σH²), n → ∞,

where σH² def= H′(s0)> Σ H′(s0) .
Exercise 2.1.3. Prove Theorem 2.1.4.
Hint: consider for every h ∈ IRm the scalar products h>g(y) , h>s0 , h>Sn . For
the first statement, it suffices to show that

h>Sn IP−→ h>s0, √n h>(Sn − s0) w−→ N(0, h>Σh), n → ∞.

For the second statement, consider the expansion

|n^{1/2}[H(Sn) − H(s0)] − ξn>H′(s0)| ≤ n^{-1/2} H∗ ‖ξn‖²/2 IP−→ 0,

with ξn = n^{1/2}(Sn − s0) and H∗ = max_{y∈U} ‖H′′(y)‖ for a neighborhood U of s0 .
2.2 Substitution principle. Method of moments
By the Glivenko-Cantelli theorem the empirical measure Pn (resp. edf Fn ) is a good
approximation of the true measure P (resp. cdf F ), at least if n is sufficiently large.
This leads to the important substitution method of statistical estimation: represent the
target of estimation as a function of the distribution P , then replace P by Pn .
Suppose that there exists some functional g of a measure Pθ from the family P =
(Pθ, θ ∈ Θ) such that the following identity holds:

θ = g(Pθ), θ ∈ Θ.

This particularly implies θ∗ = g(Pθ∗) = g(P ) . The substitution estimate is defined by
substituting Pn for P :

θ̃ = g(Pn).

Sometimes the obtained value θ̃ can lie outside the parameter set Θ . Then one can
redefine the estimate θ̃ as the value providing the best fit of g(Pn) :

θ̃ = argmin_θ ‖g(Pθ) − g(Pn)‖.
Here ‖ · ‖ denotes some norm on the parameter set Θ , e.g. the Euclidean norm.
2.2.1 Method of moments. Univariate parameter
The method of moments is a special but at the same time the most frequently used
case of the substitution method. For illustration, we start with the univariate case. Let
Θ ⊆ IR , that is, θ is a univariate parameter. Let g(y) be a function on IR such that
the first moment

m(θ) def= IEθ g(Y1) = ∫ g(y) dPθ(y)

is continuous and monotonic. Then the parameter θ can be uniquely identified by the
value m(θ) , that is, there exists an inverse function m⁻¹ satisfying

θ = m⁻¹( ∫ g(y) dPθ(y) ).

The substitution method leads to the estimate

θ̃ = m⁻¹( ∫ g(y) dPn(y) ) = m⁻¹( (1/n) ∑ g(Yi) ).
Usually g(x) = x or g(x) = x2 , which explains the name of the method. This method
was proposed by Pearson and is historically the first regular method of constructing a
statistical estimate.
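The inversion step can be made concrete with a small sketch. The parametrization below is illustrative, not from the text: an exponential law with rate θ, so that with g(y) = y the moment map is m(θ) = IEθ Y1 = 1/θ and hence m⁻¹(s) = 1/s; all numerical values are arbitrary.

```python
import random

def mom_estimate(sample):
    """Method of moments with g(y) = y for an exponential law with
    rate theta: m(theta) = 1/theta, so the estimate applies the
    inverse map m^{-1}(s) = 1/s to the empirical mean."""
    s = sum(sample) / len(sample)  # empirical moment (1/n) sum g(Y_i)
    return 1.0 / s                 # apply the inverse map m^{-1}

random.seed(2)
theta_star = 2.0
sample = [random.expovariate(theta_star) for _ in range(5000)]
theta_tilde = mom_estimate(sample)
```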
2.2.2 Method of moments. Multivariate parameter
The method of moments can be easily extended to the multivariate case. Let Θ ⊆ IRp ,
and let g(y) = (g1(y), . . . , gp(y))> be a function with values in IRp . Define the moments
m(θ) = (m1(θ), . . . , mp(θ))> by

mj(θ) = IEθ gj(Y1) = ∫ gj(y) dPθ(y).

The main requirement on the choice of the vector function g is that the function m is
invertible, that is, the system of equations

mj(θ) = tj , j = 1, . . . , p,

has a unique solution for any t in the range of m . The empirical counterpart Mn of the
true moments m(θ∗) is given by

Mn def= ∫ g(y) dPn(y) = ( (1/n) ∑ g1(Yi), . . . , (1/n) ∑ gp(Yi) )>.

Then the estimate θ̃ can be defined as

θ̃ def= m⁻¹(Mn) = m⁻¹( (1/n) ∑ g1(Yi), . . . , (1/n) ∑ gp(Yi) ).
2.2.3 Method of moments. Examples
This section lists some widely used parametric families and discusses the problem of
constructing the parameter estimates by different methods. In all the examples we assume
that an i.i.d. sample from a distribution P is observed, and this measure P belongs to a
given parametric family (Pθ, θ ∈ Θ) , that is, P = Pθ∗ for some θ∗ ∈ Θ .
Gaussian shift
Let Pθ be the normal distribution on the real line with mean θ and the known variance
σ2 . The corresponding density w.r.t. the Lebesgue measure reads as
p(y, θ) = (2πσ²)^{-1/2} exp{ −(y − θ)²/(2σ²) }.

It holds IEθY1 = θ and Varθ(Y1) = σ² , leading to the moment estimate

θ̃ = ∫ y dPn(y) = (1/n) ∑ Yi

with mean IEθ θ̃ = θ and variance Varθ(θ̃) = σ²/n .
Univariate normal distribution
Let Yi ∼ N(α, σ²) as in the previous example, but now both the mean α and the
variance σ² are unknown. This leads to the problem of estimating the vector θ =
(θ1, θ2) = (α, σ²) from the i.i.d. sample Y .
The method of moments suggests to estimate the parameters from the first two em-
pirical moments of the Yi 's using the equations m1(θ) = IEθY1 = α , m2(θ) = IEθY1² =
α² + σ² . Inverting these equalities leads to

α = m1(θ), σ² = m2(θ) − m1²(θ).

Substituting the empirical measure Pn yields the expressions for θ̃ = (α̃, σ̃²)> :

α̃ = (1/n) ∑ Yi, σ̃² = (1/n) ∑ Yi² − ( (1/n) ∑ Yi )² = (1/n) ∑ (Yi − α̃)². (2.2)

As previously for the case of a known variance, it holds under IP = IPθ :

IEθ α̃ = α, Varθ(α̃) = σ²/n.

However, for the estimate σ̃² of σ² , the result is slightly different and it is described in
the next theorem.
Theorem 2.2.1. It holds

IEθ σ̃² = ((n − 1)/n) σ², Varθ(σ̃²) = (2(n − 1)/n²) σ⁴.
Proof. We use vector notation. Consider the unit vector e = n^{-1/2}(1, . . . , 1)> ∈ IRn
and denote by Π1 the projector on e :

Π1h = (e>h) e.

Then by definition α̃ = n^{-1/2} e>Π1Y and σ̃² = n⁻¹ ‖Y − Π1Y‖² . Moreover, the model
equation Y = n^{1/2} α e + ε implies, in view of Π1e = e , that

Π1Y = n^{1/2} α e + Π1ε.

Now

n σ̃² = ‖Y − Π1Y‖² = ‖ε − Π1ε‖² = ‖(In − Π1)ε‖²,

where In is the identity operator in IRn and In − Π1 is the projector on the hyperplane
in IRn orthogonal to the vector e . Obviously (In − Π1)ε is a Gaussian vector with zero
mean and the covariance matrix V defined by

V = IE[(In − Π1) εε> (In − Π1)] = (In − Π1) IE(εε>) (In − Π1) = σ²(In − Π1)² = σ²(In − Π1).

It remains to note that for any Gaussian vector ξ ∼ N(0, V ) it holds

IE‖ξ‖² = tr V, Var(‖ξ‖²) = 2 tr(V²).
Exercise 2.2.1. Check the details of the proof.
Hint: reduce to the case of diagonal V .
Exercise 2.2.2. Compute the covariance IE(α̃ − α)(σ̃² − σ²) . Show that α̃ and σ̃² are
independent.
Hint: represent α̃ − α = n^{-1/2} e>Π1ε and σ̃² = n⁻¹ ‖(In − Π1)ε‖² . Use that Π1ε and
(In − Π1)ε are independent if Π1 is a projector and ε is a Gaussian vector.
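Theorem 2.2.1 in particular says that σ̃² is slightly biased downward: IEθ σ̃² = (n − 1)σ²/n. A quick Monte Carlo sketch (sample size, σ and seed are illustrative):

```python
import random

def mean_sigma2_tilde(n=5, reps=20000, sigma=1.0, seed=3):
    """Average the moment estimate sigma2~ = (1/n) sum (Y_i - alpha~)^2
    over many normal samples; the average should be close to the
    theoretical expectation (n-1)/n * sigma^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ys = [rng.gauss(0.0, sigma) for _ in range(n)]
        a = sum(ys) / n                          # alpha~, the sample mean
        total += sum((y - a) ** 2 for y in ys) / n
    return total / reps

avg = mean_sigma2_tilde()  # theory: (5 - 1)/5 * 1.0 = 0.8
```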
Uniform distribution on [0, θ]
Let Yi be uniformly distributed on the interval [0, θ] of the real line where the right
end point θ is unknown. The density p(y, θ) of Pθ w.r.t. the Lebesgue measure is
θ⁻¹ 1(0 ≤ y ≤ θ) . It is easy to compute that for an integer k

IEθ(Y1^k) = θ⁻¹ ∫_0^θ y^k dy = θ^k/(k + 1),

or θ = { (k + 1) IEθ(Y1^k) }^{1/k} . This leads to the family of estimates

θ̃k = ( ((k + 1)/n) ∑ Yi^k )^{1/k}.

Letting k tend to infinity leads to the estimate

θ̃∞ = max{Y1, . . . , Yn}.
This estimate is quite natural in the context of the uniform distribution. Later it will
appear once again as the maximum likelihood estimate. However, it is not a moment
estimate.
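The two kinds of estimates can be compared on simulated data (θ, n and the seed below are illustrative): the k = 1 moment estimate θ̃1 = (2/n)∑Yi fluctuates on the n^{-1/2} scale, while the maximum θ̃∞ sits just below θ and is much more accurate here.

```python
import random

def uniform_estimates(theta=3.0, n=1000, seed=4):
    """Moment estimate (k = 1) and maximum estimate of the right end
    point theta of the uniform law on [0, theta]."""
    rng = random.Random(seed)
    ys = [theta * rng.random() for _ in range(n)]
    theta_1 = 2.0 * sum(ys) / n   # k = 1 moment estimate
    theta_inf = max(ys)           # maximum of the sample, always below theta
    return theta_1, theta_inf

t1, t_inf = uniform_estimates()
```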
Bernoulli or binomial model
Let Pθ be a Bernoulli law for θ ∈ [0, 1] . Then every Yi is binary with

IEθYi = θ.

This leads to the moment estimate

θ̃ = ∫ y dPn(y) = (1/n) ∑ Yi .
Exercise 2.2.3. Compute the moment estimate for g(y) = y^k , k ≥ 1 .
Multinomial model
The multinomial distribution Bmθ describes the number of successes in m experiments
when each success has the probability θ ∈ [0, 1] . This distribution can be viewed as the
sum of m independent Bernoulli r.v.'s with the same parameter θ . Observed is the
sample Y where each Yi is the number of successes in the i th experiment. One has

Pθ(Y1 = k) = C(m, k) θ^k (1 − θ)^{m−k}, k = 0, . . . , m,

where C(m, k) is the binomial coefficient.

Exercise 2.2.4. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/(mn)) ∑ Yi .

Compute Varθ(θ̃) .
Hint: reduce the multinomial model to the sum of m Bernoulli experiments.
Exponential model
Let Pθ be an exponential distribution on the positive semiaxis with the parameter θ .
This means

IPθ(Y1 > y) = e^{−y/θ}, y ≥ 0.

Exercise 2.2.5. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Poisson model
Let Pθ be the Poisson distribution with the parameter θ . The Poisson random variable
Y1 is integer-valued with

Pθ(Y1 = k) = (θ^k/k!) e^{−θ}, k = 0, 1, 2, . . . .

Exercise 2.2.6. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Compute Varθ(θ) .
Shift of a Laplace (double exponential) law
Let P0 be a symmetric distribution defined by the equations

P0(|Y1| > y) = e^{−y/σ}, y ≥ 0,

for some given σ > 0 . Equivalently one can say that the absolute value of Y1 is
exponential with parameter σ under P0 . Now define Pθ by shifting P0 by the value
θ . This means that

Pθ(|Y1 − θ| > y) = e^{−y/σ}, y ≥ 0.

It is obvious that IE0Y1 = 0 and IEθY1 = θ .

Exercise 2.2.7. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Shift of a symmetric density
Let the observations Yi be defined by the equation
Yi = θ∗ + εi
where θ∗ is an unknown parameter and the errors εi are i.i.d. with a density symmetric
around zero and finite second moment σ2 = IEε21 . This particularly yields that IEεi = 0
and IEYi = θ∗ . The method of moments immediately yields the empirical mean estimate

θ̃ = (1/n) ∑ Yi

with Varθ(θ̃) = σ²/n .
2.3 Unbiased estimates, bias, and quadratic risk
Consider a parametric i.i.d. experiment corresponding to a sample Y = (Y1, . . . , Yn)>
from a distribution Pθ∗ ∈ (Pθ, θ ∈ Θ ⊆ IRp) . By θ∗ we denote the true parameter from
Θ . Let θ̃ be an estimate of θ∗ , that is, a function of the available data Y with values
in Θ : θ̃ = θ̃(Y ) .
An estimate θ̃ of the parameter θ∗ is called unbiased if

IEθ∗ θ̃ = θ∗.

This property seems to be rather natural and desirable. However, it is often just a matter
of parametrization. Indeed, if g : Θ → Θ is a linear transformation of the parameter
set Θ , that is, g(θ) = Aθ + b , then the estimate ϑ̃ def= Aθ̃ + b of the new parameter
ϑ = Aθ + b is again unbiased. However, if m(·) is a nonlinear transformation, then the
identity IEθ∗ m(θ̃) = m(θ∗) is generally not preserved.
Example 2.3.1. Consider the Gaussian shift experiment for Yi i.i.d. N(θ∗, σ²) with
known variance σ² but unknown shift parameter θ∗ . Then θ̃ = n⁻¹(Y1 + . . . + Yn)
is an unbiased estimate of θ∗ . However, for m(θ) = θ² , it holds

IEθ∗ |θ̃|² = |θ∗|² + σ²/n,

that is, the estimate |θ̃|² of |θ∗|² is slightly biased.
The property of “no bias” is especially important in connection with the quadratic
risk of the estimate θ̃ . To illustrate this point, we first consider the case of a univariate
parameter.
2.3.1 Univariate parameter
Let θ ∈ Θ ⊆ IR¹ . Denote by Varθ∗(θ̃) the variance of the estimate θ̃ :

Varθ∗(θ̃) = IEθ∗( θ̃ − IEθ∗ θ̃ )².

The quadratic risk of θ̃ is defined by

R(θ̃, θ∗) def= IEθ∗ |θ̃ − θ∗|².

It is obvious that R(θ̃, θ∗) = Varθ∗(θ̃) if θ̃ is unbiased. It turns out that the quadratic
risk of θ̃ is larger than the variance when this property is not fulfilled. Define the bias
of θ̃ as

b(θ̃, θ∗) def= IEθ∗ θ̃ − θ∗.
Theorem 2.3.1. It holds for any estimate θ̃ of the univariate parameter θ∗ :

R(θ̃, θ∗) = Varθ∗(θ̃) + b²(θ̃, θ∗).
Due to this result, the bias b(θ̃, θ∗) contributes the value b²(θ̃, θ∗) to the quadratic
risk. This particularly explains why one is interested in considering unbiased or at least
nearly unbiased estimates.
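The decomposition of Theorem 2.3.1 can be observed directly in simulation. The sketch below reuses Example 2.3.1 with illustrative parameters: the biased estimate θ̃² of θ∗² is sampled many times, and the empirical risk splits into empirical variance plus squared empirical bias (the decomposition holds exactly for the empirical quantities as well).

```python
import random

def risk_decomposition(theta=1.0, sigma=1.0, n=10, reps=40000, seed=5):
    """Empirical risk, variance and bias of the estimate theta~^2 of
    theta^2, where theta~ is the sample mean of N(theta, sigma^2) data."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        tbar = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        ests.append(tbar * tbar)
    mean = sum(ests) / reps
    var = sum((e - mean) ** 2 for e in ests) / reps
    risk = sum((e - theta ** 2) ** 2 for e in ests) / reps
    bias = mean - theta ** 2   # theory: sigma^2 / n
    return risk, var, bias

risk, var, bias = risk_decomposition()
```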
2.3.2 Multivariate case
Now we extend the result to the multivariate case with θ ∈ Θ ⊆ IRp . Then θ̃ is a vector
in IRp . The corresponding variance-covariance matrix Varθ∗(θ̃) is defined as

Varθ∗(θ̃) def= IEθ∗[ ( θ̃ − IEθ∗ θ̃ )( θ̃ − IEθ∗ θ̃ )> ].

As previously, θ̃ is unbiased if IEθ∗ θ̃ = θ∗ , and the bias of θ̃ is b(θ̃, θ∗) def= IEθ∗ θ̃ − θ∗ .
The quadratic risk of the estimate θ̃ in the multivariate case is usually defined via
the Euclidean norm of the difference θ̃ − θ∗ :

R(θ̃, θ∗) def= IEθ∗ ‖θ̃ − θ∗‖².

Theorem 2.3.2. It holds

R(θ̃, θ∗) = tr[ Varθ∗(θ̃) ] + ‖b(θ̃, θ∗)‖².

Proof. The result follows similarly to the univariate case using the identity ‖v‖² =
tr(vv>) for any vector v ∈ IRp .
Exercise 2.3.1. Complete the proof of Theorem 2.3.2.
2.4 Asymptotic properties
The properties of the previously introduced estimate θ heavily depend on the sample size
n . We therefore, use the notation θn to highlight this dependence. A natural extension
of the condition that θ is unbiased is the requirement that the bias b(θ,θ∗) becomes
negligible as the sample size n increases. This leads to the notion of consistency.
Definition 2.4.1. A sequence of estimates θn is consistent if
θnIP−→ θ∗ n→∞.
θn is mean consistent if
IEθ∗‖θn − θ∗‖ → 0, n→∞.
38
Clearly mean consistency implies consistency and also asymptotic unbiasedness:
b(θn,θ∗) = IEθn − θ∗
IP−→ 0, n→∞.
The property of consistency means that the difference θ̃n − θ∗ is small for n large. The
next natural question to address is how fast this difference tends to zero with n . The
results of Section 2.1 suggest that √n( θ̃n − θ∗ ) is asymptotically normal.

Definition 2.4.2. A sequence of estimates θ̃n is root-n normal if

√n( θ̃n − θ∗ ) w−→ N(0, V )

for some fixed matrix V .
We aim to show that the moment estimates are consistent and asymptotically root-n
normal under very general conditions. We start again with the univariate case.
2.4.1 Root-n normality. Univariate parameter
Our first result describes the simplest situation when the parameter of interest θ∗ can
be represented as an integral ∫ g(y) dPθ∗(y) for some function g(·) .

Theorem 2.4.3. Suppose that Θ ⊆ IR and a function g(·) : IR → IR satisfies for every
θ ∈ Θ

∫ g(y) p(y, θ) dµ0(y) = θ,
∫ [g(y) − θ]² p(y, θ) dµ0(y) = σ²(θ) < ∞.

Then the moment estimates θ̃n = n⁻¹ ∑ g(Yi) satisfy the following conditions:

1. Each θ̃n is unbiased, that is, IEθ∗ θ̃n = θ∗ .

2. The normalized quadratic risk fulfills

n IEθ∗( θ̃n − θ∗ )² = σ²(θ∗).

3. θ̃n is asymptotically root-n normal:

√n( θ̃n − θ∗ ) w−→ N(0, σ²(θ∗)).
This result has already been proved, see Theorem 2.1.3. Next we extend this result
to the more general situation when θ∗ is defined only implicitly via the moment of g :
there exists a function m(·) such that m(θ∗) = ∫ g(y) dPθ∗(y) .
Theorem 2.4.4. Suppose that Θ ⊆ IR and the functions g(y) : IR → IR and m(θ) :
Θ → IR satisfy

∫ g(y) p(y, θ∗) dµ0(y) = m(θ∗),
∫ {g(y) − m(θ∗)}² p(y, θ∗) dµ0(y) = σg²(θ∗) < ∞.

We also assume that m(·) is monotonic and twice continuously differentiable with
m′(θ∗) ≠ 0 . Then the moment estimates θ̃n = m⁻¹( n⁻¹ ∑ g(Yi) ) satisfy the following
conditions:

1. θ̃n is consistent, that is, θ̃n IP−→ θ∗ .

2. θ̃n is asymptotically root-n normal:

√n( θ̃n − θ∗ ) w−→ N(0, σ²(θ∗)), (2.3)

where σ²(θ∗) = |m′(θ∗)|⁻² σg²(θ∗) .
This result also follows directly from Theorem 2.1.3 with h(s) = m−1(s) .
The property of asymptotic normality allows us to study the asymptotic concentration
of θn and to build asymptotic confidence sets.
Corollary 2.4.5. Let θ̃n be asymptotically root-n normal; see (2.3). Then for any
z > 0

lim_{n→∞} IPθ∗( √n |θ̃n − θ∗| > z σ(θ∗) ) = 2Φ(−z),

where Φ(z) is the cdf of the standard normal law.

In particular, this result implies that the estimate θ̃n lies outside the root-n
neighborhood

A(z) def= [θ∗ − n^{-1/2} σ(θ∗) z, θ∗ + n^{-1/2} σ(θ∗) z]

with probability about 2Φ(−z) , which is small provided that z is sufficiently large.
Next we briefly discuss the problem of interval (or confidence) estimation of the
parameter θ∗ . This problem differs from the problem of point estimation: the target is
to build an interval (a set) Eα on the basis of the observations Y such that
IP( Eα ∋ θ∗ ) ≈ 1 − α for a given α ∈ (0, 1) . This problem can be attacked similarly to the problem
of concentration by considering the interval of width 2σ(θ∗)z centered at the estimate
θ̃ . However, the major difficulty arises from the fact that this construction involves the
true parameter value θ∗ via the variance σ²(θ∗) . In some situations this variance does
not depend on θ∗ : σ²(θ∗) ≡ σ² with a known value σ² . In this case the construction is
immediate.
Corollary 2.4.6. Let θ̃_n be asymptotically root-n normal: see (2.3). Then for any α ∈ (0,1), the set

E°(z_α) := [θ̃_n − n^{−1/2} σ(θ∗) z_α, θ̃_n + n^{−1/2} σ(θ∗) z_α],

where z_α is defined by 2Φ(−z_α) = α, satisfies

lim_{n→∞} IP_{θ∗}( E°(z_α) ∋ θ∗ ) = 1 − α.   (2.4)
Exercise 2.4.1. Check Corollaries 2.4.5 and 2.4.6.
Next we consider the case when the variance σ²(θ∗) is unknown. Instead we assume that a consistent variance estimate σ̃² is available. Then we plug this estimate into the construction of the confidence set in place of the unknown true variance σ²(θ∗), leading to the following confidence set:

E(z_α) := [θ̃_n − n^{−1/2} σ̃ z_α, θ̃_n + n^{−1/2} σ̃ z_α].   (2.5)
Theorem 2.4.7. Let θ̃_n be asymptotically root-n normal: see (2.3). Let σ(θ∗) > 0 and let σ̃² be a consistent estimate of σ²(θ∗) in the sense that σ̃² →IP σ²(θ∗). Then for any α ∈ (0,1), the set E(z_α) is asymptotically α-confident in the sense of (2.4).

One natural estimate of the variance σ²(θ∗) can be obtained by plugging in the estimate θ̃_n in place of θ∗, leading to σ̃ = σ(θ̃_n). If σ(θ) is a continuous function of θ in a neighborhood of θ∗, then consistency of θ̃_n implies consistency of σ̃.

Corollary 2.4.8. Let θ̃_n be asymptotically root-n normal and let the variance σ²(θ) be a continuous function of θ at θ∗. Then σ̃ := σ(θ̃_n) is a consistent estimate of σ(θ∗), and the set E(z_α) from (2.5) is asymptotically α-confident.
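A simulation can illustrate the plug-in construction of Theorem 2.4.7. The sketch below is an assumed example (not from the text): it uses the Poisson family, for which θ̃_n = mean(Y) and σ²(θ) = θ, so the plug-in estimate is σ̃ = √θ̃_n; sample size, number of repetitions, and α are arbitrary.

```python
import numpy as np

# Assumed Poisson example: coverage of the plug-in interval (2.5).
rng = np.random.default_rng(1)
theta_star, n, reps = 3.0, 200, 2000
z_alpha = 1.96                        # 2*Phi(-1.96) is approximately 0.05
covered = 0
for _ in range(reps):
    Y = rng.poisson(theta_star, size=n)
    theta_hat = Y.mean()              # moment estimate of theta
    half = z_alpha * np.sqrt(theta_hat) / np.sqrt(n)   # plug-in half-width
    covered += (theta_hat - half <= theta_star <= theta_hat + half)
coverage = covered / reps
print(coverage)                       # should be close to 1 - alpha = 0.95
```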
2.4.2 Root-n normality. Multivariate parameter
Let now Θ ⊆ IR^p and let θ∗ be the true parameter vector. The method of moments requires at least p different moment functions for identifying p parameters. Let g(y): IR → IR^p be a vector of moment functions, g(y) = (g₁(y), …, g_p(y))^T. Suppose first that the true parameter can be obtained just by integration: θ∗ = ∫ g(y) dP_{θ∗}(y). This yields the moment estimate θ̃_n = n⁻¹ Σ g(Y_i).

Theorem 2.4.9. Suppose that a vector-function g(y): IR → IR^p satisfies the following conditions:

∫ g(y) p(y, θ∗) dμ₀(y) = θ∗,
∫ {g(y) − θ∗} {g(y) − θ∗}^T p(y, θ∗) dμ₀(y) = Σ(θ∗).
Then the moment estimate θ̃_n = n⁻¹ Σ g(Y_i) satisfies:

1. θ̃_n is unbiased, that is, IE_{θ∗} θ̃_n = θ∗.

2. θ̃_n is asymptotically root-n normal:

√n (θ̃_n − θ∗) →w N(0, Σ(θ∗)).   (2.6)

3. The normalized quadratic risk fulfills

n IE_{θ∗} ‖θ̃_n − θ∗‖² = tr Σ(θ∗).
Similarly to the univariate case, this result yields corollaries about concentration and confidence sets, with intervals replaced by ellipsoids. Indeed, due to the second statement, the vector

ξ_n := √n {Σ(θ∗)}^{−1/2} (θ̃_n − θ∗)

is asymptotically standard normal: ξ_n →w ξ ~ N(0, I_p). This also implies that the squared norm of ξ_n is asymptotically χ²_p-distributed, where χ²_p is the law of ‖ξ‖² = ξ₁² + … + ξ_p². Define the value z_α via the quantiles of χ²_p by the relation

IP(‖ξ‖ > z_α) = α.   (2.7)
Corollary 2.4.10. Suppose that θ̃_n is root-n normal, see (2.6). Define for a given z the ellipsoid

A(z) := { θ : (θ − θ∗)^T {Σ(θ∗)}⁻¹ (θ − θ∗) ≤ z²/n }.

Then A(z_α) is an asymptotic (1 − α)-concentration set for θ̃_n in the sense that

lim_{n→∞} IP( θ̃_n ∉ A(z_α) ) = α.
The weak convergence ξ_n →w ξ suggests to build confidence sets also in the form of ellipsoids with the axes defined by the covariance matrix Σ(θ∗). Define for α > 0

E°(z_α) := { θ : √n ‖{Σ(θ∗)}^{−1/2} (θ − θ̃_n)‖ ≤ z_α }.

The result of Theorem 2.4.9 implies that this set covers the true value θ∗ with probability approaching 1 − α.
Unfortunately, in typical situations the matrix Σ(θ∗) is unknown because it depends on the unknown parameter θ∗. It is natural to replace it with the matrix Σ(θ̃_n), replacing the true value θ∗ with its consistent estimate θ̃_n. If Σ(θ) is a continuous function of θ, then Σ(θ̃_n) provides a consistent estimate of Σ(θ∗). This leads to the data-driven confidence set:

E(z_α) := { θ : √n ‖{Σ(θ̃_n)}^{−1/2} (θ − θ̃_n)‖ ≤ z_α }.

Corollary 2.4.11. Suppose that θ̃_n is root-n normal, see (2.6), with a non-degenerate matrix Σ(θ∗). Let the matrix function Σ(θ) be continuous at θ∗, and let z_α be defined by (2.7). Then E°(z_α) and E(z_α) are asymptotically (1 − α)-confidence sets for θ∗:

lim_{n→∞} IP( E°(z_α) ∋ θ∗ ) = lim_{n→∞} IP( E(z_α) ∋ θ∗ ) = 1 − α.
Exercise 2.4.2. Check Corollaries 2.4.10 and 2.4.11 about the set E◦(zα) .
Exercise 2.4.3. Check Corollary 2.4.11 about the set E(zα) .
Hint: θ̃_n is consistent and Σ(θ) is continuous and invertible at θ∗. This implies

Σ(θ̃_n) − Σ(θ∗) →IP 0,   {Σ(θ̃_n)}⁻¹ − {Σ(θ∗)}⁻¹ →IP 0,

and hence the sets E°(z_α) and E(z_α) are nearly the same.
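The ellipsoid construction can be illustrated numerically. The sketch below is an assumed example (not from the text): bivariate Gaussian data with known per-observation covariance Σ, p = 2, and the hard-coded value z_α² = 5.991, the 0.95 quantile of χ²₂; all sizes and the seed are arbitrary.

```python
import numpy as np

# Assumed example: coverage of the ellipsoid E°(z_alpha) for a 2-d mean.
rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # covariance of one observation
Sigma_inv = np.linalg.inv(Sigma)
n, reps = 500, 2000
z2 = 5.991                                   # 0.95 quantile of chi^2 with p=2
covered = 0
for _ in range(reps):
    Y = rng.multivariate_normal(theta_star, Sigma, size=n)
    d = Y.mean(axis=0) - theta_star
    covered += (n * d @ Sigma_inv @ d <= z2) # inside the ellipsoid?
coverage = covered / reps
print(coverage)                              # should be close to 0.95
```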
Finally we discuss the general situation when the target parameter is a function of the moments. This means the relations

m(θ) = ∫ g(y) dP_θ(y),   θ = m⁻¹( m(θ) ).

Of course, these relations assume that the vector function m(·) is invertible. The substitution principle leads to the estimate

θ̃ := m⁻¹(M_n),

where M_n is the vector of empirical moments:

M_n := ∫ g(y) dP_n(y) = n⁻¹ Σ g(Y_i).

The central limit theorem implies (see Theorem 2.1.4) that M_n is a consistent estimate of m(θ∗) and that the vector √n [M_n − m(θ∗)] is asymptotically normal with some covariance matrix Σ_g(θ∗). Moreover, if m⁻¹ is differentiable at the point m(θ∗), then √n (θ̃ − θ∗) is asymptotically normal as well:

√n (θ̃ − θ∗) →w N(0, Σ(θ∗)),

where Σ(θ∗) = H^T Σ_g(θ∗) H and H is the p×p Jacobi matrix of m⁻¹ at m(θ∗):

H := (d/ds) m⁻¹(s) |_{s = m(θ∗)}.
2.5 Some geometric properties of a parametric family
The parametric situation means that the true marginal distribution P belongs to some
given parametric family (Pθ,θ ∈ Θ ⊆ IRp) . By θ∗ we denote the true value, that is,
P = Pθ∗ ∈ (Pθ) . The natural target of estimation in this situation is the parameter
θ∗ itself. Below we assume that the family (P_θ) is dominated, that is, there exists a dominating measure μ₀. The corresponding density is denoted by

p(y, θ) = (dP_θ/dμ₀)(y).

We also use the notation

ℓ(y, θ) := log p(y, θ)

for the log-density.
The following two important characteristics of the parametric family (Pθ) will be
frequently used in the sequel: the Kullback-Leibler divergence and Fisher information.
2.5.1 Kullback-Leibler divergence
Definition 2.5.1. For any two parameters θ, θ′, the value

K(P_θ, P_{θ′}) = ∫ log( p(y, θ) / p(y, θ′) ) p(y, θ) dμ₀(y) = ∫ [ℓ(y, θ) − ℓ(y, θ′)] p(y, θ) dμ₀(y)

is called the Kullback-Leibler divergence (KL-divergence) between P_θ and P_{θ′}.
We also write K(θ, θ′) instead of K(P_θ, P_{θ′}) if there is no risk of confusion. Equivalently one can represent the KL-divergence as

K(θ, θ′) = E_θ log( p(Y, θ) / p(Y, θ′) ) = E_θ [ℓ(Y, θ) − ℓ(Y, θ′)],
where Y ∼ Pθ . An important feature of the Kullback-Leibler divergence is that it is
always non-negative and it is equal to zero iff the measures Pθ and Pθ′ coincide.
Lemma 2.5.2. For any θ,θ′ , it holds
K(θ,θ′) ≥ 0.
Moreover, K(θ,θ′) = 0 implies that the densities p(y,θ) and p(y,θ′) coincide µ0 -a.s.
Proof. Define Z(y) = p(y, θ′)/p(y, θ). Then

∫ Z(y) p(y, θ) dμ₀(y) = ∫ p(y, θ′) dμ₀(y) = 1

because p(y, θ′) is the density of P_{θ′} w.r.t. μ₀. Next, (d²/dt²) log t = −t⁻² < 0, thus the log-function is strictly concave. The Jensen inequality implies

K(θ, θ′) = −∫ log(Z(y)) p(y, θ) dμ₀(y) ≥ −log( ∫ Z(y) p(y, θ) dμ₀(y) ) = −log(1) = 0.

Moreover, the strict concavity of the log-function implies that equality in this relation is only possible if Z(y) ≡ 1 P_θ-a.s. This implies the last statement of the lemma.
The two mentioned features of the Kullback-Leibler divergence suggest considering it as a kind of distance on the parameter space. In some sense, it measures how far P_{θ′} is from P_θ. Unfortunately, it is not a metric because it is not symmetric:

K(θ, θ′) ≠ K(θ′, θ),

with very few exceptions in some special situations.
Exercise 2.5.1. Compute KL-divergence for the Gaussian shift, Bernoulli, Poisson,
volatility and exponential families. Check in which cases it is symmetric.
Exercise 2.5.2. Consider the shift experiment given by the equation Y = θ + ε where
ε is an error with the given density function p(·) on IR . Compute the KL-divergence
and check for symmetry.
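A numeric sanity check in the spirit of Exercise 2.5.1 can be run for the Gaussian shift. The sketch below assumes the family N(θ, 1), for which the closed form K(θ, θ′) = (θ − θ′)²/2 holds; it compares that value with a Monte Carlo average of ℓ(Y, θ) − ℓ(Y, θ′). Parameter values, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed example: KL-divergence for the Gaussian shift N(theta, 1).
rng = np.random.default_rng(3)
theta, theta_p = 0.7, 1.5
Y = rng.normal(theta, 1.0, size=1_000_000)
# l(y, theta) = -(y - theta)^2/2 + const, so the log-ratio is:
log_ratio = 0.5 * (Y - theta_p) ** 2 - 0.5 * (Y - theta) ** 2
print(log_ratio.mean(), (theta - theta_p) ** 2 / 2)   # both close to 0.32
```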
One more important feature of the KL-divergence is its additivity.
Lemma 2.5.3. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ, θ′ ∈ Θ

K(P_θ, P_{θ′}) = K(P^{(1)}_θ, P^{(1)}_{θ′}) + K(P^{(2)}_θ, P^{(2)}_{θ′}).

Exercise 2.5.3. Prove Lemma 2.5.3. Extend the result to the case of the m-fold product of measures.

Hint: use that the log-density ℓ(y₁, y₂, θ) of the product measure P_θ fulfills

ℓ(y₁, y₂, θ) = ℓ^{(1)}(y₁, θ) + ℓ^{(2)}(y₂, θ).
The additivity of the KL-divergence helps to easily compute the KL quantity for two measures IP_θ and IP_{θ′} describing the i.i.d. sample Y = (Y₁, …, Y_n)^T. The log-density of the measure IP_θ w.r.t. μ₀ = μ₀^{⊗n} at the point y = (y₁, …, y_n)^T is given by

L(y, θ) = Σ ℓ(y_i, θ).

An extension of the result of Lemma 2.5.3 yields

K(IP_θ, IP_{θ′}) := IE_θ { L(Y, θ) − L(Y, θ′) } = n K(θ, θ′).
2.5.2 Hellinger distance
Another useful characteristic of a parametric family (P_θ) is the so-called Hellinger distance. For a fixed μ ∈ [0, 1] and any θ, θ′ ∈ Θ, define

h(μ, P_θ, P_{θ′}) = E_θ ( dP_{θ′}/dP_θ (Y) )^μ = ∫ ( p(y, θ′)/p(y, θ) )^μ dP_θ(y) = ∫ p^μ(y, θ′) p^{1−μ}(y, θ) dμ₀(y).

Note that this function can be represented as an exponential moment of the log-likelihood ratio ℓ(Y, θ′, θ) := ℓ(Y, θ′) − ℓ(Y, θ):

h(μ, P_θ, P_{θ′}) = E_θ exp{ μ ℓ(Y, θ′, θ) } = E_θ ( dP_{θ′}/dP_θ (Y) )^μ.

It is obvious that h(μ, P_θ, P_{θ′}) ≥ 0. Moreover, h(μ, P_θ, P_{θ′}) ≤ 1. Indeed, the function x^μ for μ ∈ [0, 1] is concave, and by the Jensen inequality

E_θ ( dP_{θ′}/dP_θ (Y) )^μ ≤ ( IE_θ dP_{θ′}/dP_θ (Y) )^μ = 1.

Similarly to the Kullback-Leibler divergence, we often write h(μ, θ, θ′) in place of h(μ, P_θ, P_{θ′}). Typically the Hellinger distance is considered for μ = 1/2. Then

h(1/2, θ, θ′) = ∫ p^{1/2}(y, θ′) p^{1/2}(y, θ) dμ₀(y).
In contrast to the Kullback-Leibler divergence, this quantity is symmetric and can be
used to define a metric on the parameter set Θ .
Introduce

m(μ, θ, θ′) := −log h(μ, θ, θ′) = −log E_θ exp{ μ ℓ(Y, θ′, θ) }.

The property h(μ, θ, θ′) ≤ 1 implies m(μ, θ, θ′) ≥ 0. This rate function will play an important role in the concentration properties of the maximum likelihood estimate; see Section ??.
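The quantities h(1/2, ·, ·) and m(1/2, ·, ·) are easy to evaluate in closed form for discrete families. The sketch below assumes the Bernoulli family, for which h(1/2, p, q) = √(pq) + √((1−p)(1−q)); the parameter values are arbitrary.

```python
import math

# Assumed Bernoulli example: h(1/2, p, q) and the rate function
# m(1/2, p, q) = -log h(1/2, p, q), which is non-negative and symmetric.
def hellinger_affinity(p, q):
    # h(1/2, p, q) = sum over y in {0, 1} of sqrt(p(y) * q(y))
    return math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))

p, q = 0.3, 0.6
m_half = -math.log(hellinger_affinity(p, q))
print(m_half)                                   # positive since p != q
print(hellinger_affinity(p, q) == hellinger_affinity(q, p))
```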
The rate function, like the KL-divergence, is additive.
Lemma 2.5.4. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ, θ′ ∈ Θ and any μ ∈ [0, 1]

m(μ, P_θ, P_{θ′}) = m(μ, P^{(1)}_θ, P^{(1)}_{θ′}) + m(μ, P^{(2)}_θ, P^{(2)}_{θ′}).

Exercise 2.5.4. Prove Lemma 2.5.4. Extend the result to the case of an m-fold product of measures.

Hint: use that the log-density ℓ(y₁, y₂, θ) of the product measure P_θ fulfills

ℓ(y₁, y₂, θ) = ℓ^{(1)}(y₁, θ) + ℓ^{(2)}(y₂, θ).

Application of this lemma to the i.i.d. product family yields

M(μ, θ′, θ) := −log IE_θ exp{ μ L(Y, θ′, θ) } = n m(μ, θ′, θ).
2.5.3 Regularity and the Fisher Information. Univariate parameter
An important assumption on the considered parametric family (P_θ) is that the corresponding density function p(y, θ) is absolutely continuous w.r.t. the parameter θ for almost all y. Then the log-density ℓ(y, θ) is differentiable as well with

∇ℓ(y, θ) := ∂ℓ(y, θ)/∂θ = (1/p(y, θ)) ∂p(y, θ)/∂θ,

with the convention 0/0 = 0. In the case of a univariate parameter θ ∈ IR, we also write ℓ′(y, θ) instead of ∇ℓ(y, θ).
Moreover, we usually assume some regularity conditions on the density p(y,θ) . The
next definition presents one possible set of such conditions for the case of a univariate
parameter θ .
Definition 2.5.5. The family (P_θ, θ ∈ Θ ⊂ IR) is regular if the following conditions are fulfilled:

1. The sets A(θ) := {y : p(y, θ) = 0} are the same for all θ ∈ Θ.

2. Differentiability under the integration sign: for any function s(y) satisfying

∫ s²(y) p(y, θ) dμ₀(y) ≤ C,   θ ∈ Θ,

it holds

(∂/∂θ) ∫ s(y) dP_θ(y) = (∂/∂θ) ∫ s(y) p(y, θ) dμ₀(y) = ∫ s(y) (∂p(y, θ)/∂θ) dμ₀(y).

3. Finite Fisher information: the log-density function ℓ(y, θ) is differentiable in θ and its derivative is square integrable w.r.t. P_θ:

∫ |ℓ′(y, θ)|² dP_θ(y) = ∫ ( |p′(y, θ)|² / p(y, θ) ) dμ₀(y) < ∞.   (2.8)
The quantity in the condition (2.8) plays an important role in asymptotic statistics.

Definition 2.5.6. Let (P_θ, θ ∈ Θ ⊂ IR) be a regular parametric family with a univariate parameter. Then the quantity

I(θ) := ∫ |ℓ′(y, θ)|² p(y, θ) dμ₀(y) = ∫ ( |p′(y, θ)|² / p(y, θ) ) dμ₀(y)

is called the Fisher information of (P_θ) at θ ∈ Θ.

The definition of I(θ) can be rewritten as

I(θ) = IE_θ |ℓ′(Y, θ)|²

with Y ~ P_θ.
A simple sufficient condition for regularity of a family (Pθ) is given by the next
lemma.
Lemma 2.5.7. Let the log-density `(y, θ) = log p(y, θ) of a dominated family (Pθ) be
differentiable in θ and let the Fisher information I(θ) be a continuous function on Θ .
Then (Pθ) is regular.
The proof is technical and can be found e.g. in Borovkov (1998). Some useful prop-
erties of the regular families are listed in the next lemma.
Lemma 2.5.8. Let (P_θ) be a regular family. Then for any θ ∈ Θ and Y ~ P_θ:

1. E_θ ℓ′(Y, θ) = ∫ ℓ′(y, θ) p(y, θ) dμ₀(y) = 0 and I(θ) = Var_θ[ ℓ′(Y, θ) ].

2. I(θ) = −E_θ ℓ″(Y, θ) = −∫ ℓ″(y, θ) p(y, θ) dμ₀(y).

Proof. Differentiating the identity

∫ p(y, θ) dμ₀(y) = ∫ exp{ℓ(y, θ)} dμ₀(y) ≡ 1

implies under the regularity conditions the first statement of the lemma. Differentiating once more yields the second statement with another representation of the Fisher information.
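Both representations of I(θ) from Lemma 2.5.8 can be checked by simulation. The sketch below assumes the Poisson family, for which ℓ′(y, θ) = y/θ − 1, ℓ″(y, θ) = −y/θ², and I(θ) = 1/θ; parameter value, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed Poisson example: I(theta) = E[l'(Y)^2] = -E[l''(Y)] = 1/theta.
rng = np.random.default_rng(4)
theta = 2.5
Y = rng.poisson(theta, size=1_000_000)
score = Y / theta - 1.0                 # l'(Y, theta)
print(np.mean(score))                   # ~ 0   (E l' = 0)
print(np.mean(score ** 2))              # ~ 0.4 (= 1/theta, first form)
print(np.mean(Y) / theta ** 2)          # ~ 0.4 (= -E l'', second form)
```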
Like the KL-divergence, the Fisher information possesses the important additivity
property.
Lemma 2.5.9. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ ∈ Θ, the Fisher information I(θ) satisfies

I(θ) = I^{(1)}(θ) + I^{(2)}(θ),

where I^{(1)}(θ) (resp. I^{(2)}(θ)) is the Fisher information for (P^{(1)}_θ) (resp. for (P^{(2)}_θ)).

Exercise 2.5.5. Prove Lemma 2.5.9.

Hint: use that the log-density of the product experiment can be represented as ℓ(y₁, y₂, θ) = ℓ₁(y₁, θ) + ℓ₂(y₂, θ). The independence of Y₁ and Y₂ implies

I(θ) = Var_θ[ ℓ′(Y₁, Y₂, θ) ] = Var_θ[ ℓ′₁(Y₁, θ) + ℓ′₂(Y₂, θ) ] = Var_θ[ ℓ′₁(Y₁, θ) ] + Var_θ[ ℓ′₂(Y₂, θ) ].
Exercise 2.5.6. Compute the Fisher information for the Gaussian shift, Bernoulli, Pois-
son, volatility and exponential families. Check in which cases it is constant.
Exercise 2.5.7. Consider the shift experiment given by the equation Y = θ+ε where ε
is an error with the given density function p(·) on IR . Compute the Fisher information
and check whether it is constant.
Exercise 2.5.8. Check that the i.i.d. experiment from the uniform distribution on the
interval [0, θ] with unknown θ is not regular.
Now we consider the properties of the i.i.d. experiment from a given regular family (P_θ). The distribution of the whole i.i.d. sample Y is described by the product measure IP_θ = P_θ^{⊗n}, which is dominated by the measure μ₀ = μ₀^{⊗n}. The corresponding log-density L(y, θ) is given by

L(y, θ) := log (dIP_θ/dμ₀)(y) = Σ ℓ(y_i, θ).

The function exp L(y, θ) is the density of IP_θ w.r.t. μ₀, and hence, for any r.v. ξ,

IE_θ ξ = IE₀[ ξ exp L(Y, θ) ].

In particular, for ξ ≡ 1, this formula leads to the identity

IE₀[ exp L(Y, θ) ] = ∫ exp{ L(y, θ) } μ₀(dy) ≡ 1.   (2.9)
The next lemma claims that the product family (IPθ) for an i.i.d. sample from a
regular family is also regular.
Lemma 2.5.10. Let (P_θ) be a regular family and IP_θ = P_θ^{⊗n}. Then:

1. The set A_n := { y = (y₁, …, y_n)^T : Π p(y_i, θ) = 0 } is the same for all θ ∈ Θ.

2. For any r.v. S = S(Y) with IE_θ S² ≤ C, θ ∈ Θ, it holds

(∂/∂θ) IE_θ S = (∂/∂θ) IE₀[ S exp L(Y, θ) ] = IE₀[ S L′(Y, θ) exp L(Y, θ) ],

where L′(Y, θ) := (∂/∂θ) L(Y, θ).

3. The derivative L′(Y, θ) is square integrable and

IE_θ |L′(Y, θ)|² = n I(θ).
Local properties of the Kullback-Leibler divergence and Hellinger distance
Here we show that the quantities introduced so far are closely related to each other. We
start with the Kullback-Leibler divergence.
Lemma 2.5.11. Let (P_θ) be a regular family. Then the KL-divergence K(θ, θ′) satisfies:

K(θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) K(θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) K(θ, θ′) |_{θ′=θ} = I(θ).

In a small neighborhood of θ, the KL-divergence can be approximated by

K(θ, θ′) ≈ I(θ) |θ′ − θ|² / 2.
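The quality of this local approximation is easy to inspect numerically. The sketch below assumes the Poisson family, where the exact divergence is K(θ, θ′) = θ log(θ/θ′) + θ′ − θ and I(θ) = 1/θ; the chosen parameter values are arbitrary.

```python
import math

# Assumed Poisson example: exact KL vs. the local quadratic approximation.
theta, theta_p = 2.0, 2.1
K_exact = theta * math.log(theta / theta_p) + theta_p - theta
K_approx = (theta_p - theta) ** 2 / (2 * theta)   # I(theta) = 1/theta
print(K_exact, K_approx)   # the two values agree to a few percent
```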
Similar properties can be established for the rate function m(μ, θ, θ′).

Lemma 2.5.12. Let (P_θ) be a regular family. Then the rate function m(μ, θ, θ′) satisfies:

m(μ, θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) m(μ, θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) m(μ, θ, θ′) |_{θ′=θ} = μ(1 − μ) I(θ).

In a small neighborhood of θ, the rate function m(μ, θ, θ′) can be approximated by

m(μ, θ, θ′) ≈ μ(1 − μ) I(θ) |θ′ − θ|² / 2.

Moreover, for any θ, θ′ ∈ Θ,

m(μ, θ, θ′) |_{μ=0} = 0,
(d/dμ) m(μ, θ, θ′) |_{μ=0} = E_θ ℓ(Y, θ, θ′) = K(θ, θ′),
(d²/dμ²) m(μ, θ, θ′) |_{μ=0} = −Var_θ[ ℓ(Y, θ, θ′) ].

This implies an approximation for small μ:

m(μ, θ, θ′) ≈ μ K(θ, θ′) − (μ²/2) Var_θ[ ℓ(Y, θ, θ′) ].
Exercise 2.5.9. Check the statements of Lemmas 2.5.11 and 2.5.12.
2.6 Cramer-Rao Inequality
Let θ̃ be an estimate of the parameter θ∗. We are interested in establishing a lower bound for the risk of this estimate. This bound indicates that under some conditions the quadratic risk of an estimate can never be below a specific value.

2.6.1 Univariate parameter

We again start with the univariate case and first consider an unbiased estimate θ̃. Suppose that the family (P_θ, θ ∈ Θ) is dominated by a σ-finite measure μ₀ on the real line, and denote by p(y, θ) the density of P_θ w.r.t. μ₀:

p(y, θ) := (dP_θ/dμ₀)(y).

Theorem 2.6.1 (Cramer-Rao Inequality). Let θ̃ = θ̃(Y) be an unbiased estimate of θ for an i.i.d. sample from a regular family (P_θ). Then

IE_θ |θ̃ − θ|² = Var_θ(θ̃) ≥ 1 / ( n I(θ) ).

Moreover, if θ̃ is not unbiased and τ(θ) = IE_θ θ̃, then with τ′(θ) := (d/dθ) τ(θ), it holds

Var_θ(θ̃) ≥ |τ′(θ)|² / ( n I(θ) )

and

IE_θ |θ̃ − θ|² = Var_θ(θ̃) + |τ(θ) − θ|² ≥ |τ′(θ)|² / ( n I(θ) ) + |τ(θ) − θ|².
Proof. Consider first the case of an unbiased estimate θ̃ with IE_θ θ̃ ≡ θ. Differentiating the identity (2.9), IE₀ exp L(Y, θ) ≡ 1, w.r.t. θ yields

0 ≡ ∫ L′(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ L′(Y, θ).   (2.10)

Similarly, the identity IE_θ θ̃ = θ implies

1 ≡ ∫ θ̃(y) L′(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ[ θ̃ L′(Y, θ) ].

Together with (2.10), this gives

IE_θ[ (θ̃ − θ) L′(Y, θ) ] ≡ 1.

By the Cauchy-Schwarz inequality

1 = IE²_θ[ (θ̃ − θ) L′(Y, θ) ] ≤ IE_θ(θ̃ − θ)² · IE_θ |L′(Y, θ)|² = Var_θ(θ̃) n I(θ).   (2.11)

This implies the first assertion.

Now we consider the general case. The proof is similar. The property (2.10) continues to hold. Next, the identity IE_θ θ̃ = θ is replaced with IE_θ θ̃ = τ(θ), yielding

IE_θ[ θ̃ L′(Y, θ) ] ≡ τ′(θ)

and

IE_θ[ {θ̃ − τ(θ)} L′(Y, θ) ] ≡ τ′(θ).

Again by the Cauchy-Schwarz inequality

|τ′(θ)|² = IE²_θ[ {θ̃ − τ(θ)} L′(Y, θ) ] ≤ IE_θ {θ̃ − τ(θ)}² · IE_θ |L′(Y, θ)|² = Var_θ(θ̃) n I(θ),

and the second assertion follows. The last statement is the usual decomposition of the quadratic risk into the squared bias and the variance of the estimate.
2.6.2 Exponential families and R-efficiency
An interesting question is how good (precise) the Cramer-Rao lower bound is, in particular, when it becomes an equality. Indeed, if we restrict ourselves to unbiased estimates, no estimate can have quadratic risk smaller than [n I(θ)]⁻¹. If an estimate has exactly the risk [n I(θ)]⁻¹, then this estimate is automatically efficient in the sense that it is the best in the class in terms of the quadratic risk.

Definition 2.6.2. An unbiased estimate θ̃ is R-efficient if

Var_θ(θ̃) = [n I(θ)]⁻¹.

Theorem 2.6.3. An unbiased estimate θ̃ is R-efficient if and only if

θ̃ = n⁻¹ Σ U(Y_i),

where the function U(·) on IR satisfies ∫ U(y) dP_θ(y) ≡ θ, and the log-density ℓ(y, θ) of P_θ can be represented as

ℓ(y, θ) = C(θ) U(y) − B(θ) + ℓ(y),   (2.12)

for some functions C(·) and B(·) on Θ and a function ℓ(·) on IR.
Proof. Suppose first that the representation (2.12) for the log-density is correct. Then ℓ′(y, θ) = C′(θ) U(y) − B′(θ), and the identity E_θ ℓ′(Y, θ) = 0 implies the relation between the functions B(·) and C(·):

θ C′(θ) = B′(θ).   (2.13)

Next, differentiating the equality

0 ≡ ∫ {U(y) − θ} dP_θ(y) = ∫ {U(y) − θ} p(y, θ) dμ₀(y)

w.r.t. θ implies, in view of (2.13),

1 ≡ IE_θ[ {U(Y) − θ} { C′(θ) U(Y) − B′(θ) } ] = C′(θ) IE_θ {U(Y) − θ}².

This yields Var_θ{ U(Y) } = 1/C′(θ). This leads to the following representation for the Fisher information:

I(θ) = Var_θ{ ℓ′(Y, θ) } = Var_θ{ C′(θ) U(Y) − B′(θ) } = {C′(θ)}² Var_θ{ U(Y) } = C′(θ).

The estimate θ̃ = n⁻¹ Σ U(Y_i) satisfies

IE_θ θ̃ = θ,

that is, it is unbiased. Moreover,

Var_θ(θ̃) = Var_θ{ n⁻¹ Σ U(Y_i) } = n⁻² Σ Var{ U(Y_i) } = 1/( n C′(θ) ) = 1/( n I(θ) ),

and θ̃ is R-efficient.

Now we show the reverse statement. Due to the proof of the Cramer-Rao inequality, the only possibility of getting the equality in this inequality is if (2.11) holds as an equality. It is well known that the Cauchy-Schwarz inequality IE ξη ≤ √(IE ξ² IE η²) is an equality iff ξ and η are linearly dependent. This leads to the relation

L′(Y, θ) = c(θ) θ̃ − b(θ)

for some coefficients c(θ), b(θ). This implies for some fixed θ₀ and any θ

L(Y, θ) − L(Y, θ₀) = ∫_{θ₀}^{θ} L′(Y, u) du = θ̃ ∫_{θ₀}^{θ} c(u) du − ∫_{θ₀}^{θ} b(u) du = θ̃ C(θ) − B(θ)

with C(θ) = ∫_{θ₀}^{θ} c(u) du and B(θ) = ∫_{θ₀}^{θ} b(u) du. Applying this equality to a sample with n = 1 yields U(Y₁) = θ̃(Y₁), and

ℓ(Y₁, θ) = ℓ(Y₁, θ₀) + C(θ) U(Y₁) − B(θ).

The desired representation follows.
Exercise 2.6.1. Apply the Cramer-Rao inequality to the empirical mean estimate θ̃ = n⁻¹ Σ Y_i and check its R-efficiency for the Gaussian shift, Bernoulli, Poisson, exponential, and volatility families.
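One of these cases can be checked by simulation. The sketch below assumes the Poisson family: there I(θ) = 1/θ, so the Cramer-Rao bound is 1/(n I(θ)) = θ/n, which is exactly the variance of the empirical mean; sample sizes and seed are arbitrary.

```python
import numpy as np

# Assumed Poisson example: the empirical mean attains the Cramer-Rao
# bound 1/(n*I(theta)) = theta/n, i.e. it is R-efficient.
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 50, 20_000
means = rng.poisson(theta, size=(reps, n)).mean(axis=1)
emp_var = means.var()
print(emp_var, theta / n)   # both should be close to 0.04
```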
2.7 Cramer-Rao inequality. Multivariate parameter
This section extends the notions and results of the previous sections from the case of a
univariate parameter to the case of a multivariate parameter with θ ∈ Θ ⊂ IRp .
2.7.1 Regularity and Fisher Information. Multivariate parameter
The definition of regularity naturally extends to the case of a multivariate parameter
θ = (θ1, . . . , θp)> . It suffices to check the same conditions as in the univariate case for
every partial derivative ∂p(y,θ)/∂θj of the density p(y,θ) for j = 1, . . . , p .
Definition 2.7.1. The family (P_θ, θ ∈ Θ ⊂ IR^p) is regular if the following conditions are fulfilled:

1. The sets A(θ) := {y : p(y, θ) = 0} are the same for all θ ∈ Θ.

2. Differentiability under the integration sign: for any function s(y) satisfying

∫ s²(y) p(y, θ) dμ₀(y) ≤ C,   θ ∈ Θ,

it holds

(∂/∂θ) ∫ s(y) dP_θ(y) = (∂/∂θ) ∫ s(y) p(y, θ) dμ₀(y) = ∫ s(y) (∂p(y, θ)/∂θ) dμ₀(y).

3. Finite Fisher information: the log-density function ℓ(y, θ) is differentiable in θ and its gradient ∇ℓ(y, θ) = ∂ℓ(y, θ)/∂θ is square integrable w.r.t. P_θ:

∫ ‖∇ℓ(y, θ)‖² dP_θ(y) = ∫ ( ‖∇p(y, θ)‖² / p(y, θ) ) dμ₀(y) < ∞.
In the case of a multivariate parameter, the notion of the Fisher information leads to the Fisher information matrix.

Definition 2.7.2. Let (P_θ, θ ∈ Θ ⊂ IR^p) be a parametric family. The matrix

I(θ) := ∫ ∇ℓ(y, θ) {∇ℓ(y, θ)}^T p(y, θ) dμ₀(y) = ∫ ∇p(y, θ) {∇p(y, θ)}^T (1/p(y, θ)) dμ₀(y)

is called the Fisher information matrix of (P_θ) at θ ∈ Θ.

This definition can be rewritten as

I(θ) = IE_θ[ ∇ℓ(Y₁, θ) {∇ℓ(Y₁, θ)}^T ].
The additivity property of the Fisher information extends to the multivariate case as
well.
Lemma 2.7.3. Let (P_θ, θ ∈ Θ) be a regular family. Then the n-fold product family (IP_θ) with IP_θ = P_θ^{⊗n} is also regular, and its Fisher information matrix satisfies

IE_θ[ ∇L(Y, θ) {∇L(Y, θ)}^T ] = n I(θ).   (2.14)
Exercise 2.7.1. Compute the Fisher information matrix for the i.i.d. experiment Yi =
θ + σεi with unknown θ and σ and εi i.i.d. standard normal.
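A Monte Carlo estimate of the score covariance can serve as a check for this exercise. The sketch below assumes the answer I(θ, σ) = diag(1/σ², 2/σ²) for the N(θ, σ²) family and estimates the covariance of the score vector (∂ℓ/∂θ, ∂ℓ/∂σ); the parameter values, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed example for Exercise 2.7.1: score covariance for N(theta, sigma^2).
rng = np.random.default_rng(6)
theta, sigma, N = 1.0, 2.0, 1_000_000
Y = rng.normal(theta, sigma, size=N)
s_theta = (Y - theta) / sigma ** 2                     # d l / d theta
s_sigma = -1.0 / sigma + (Y - theta) ** 2 / sigma ** 3 # d l / d sigma
I_hat = np.cov(np.vstack([s_theta, s_sigma]))
print(I_hat)   # close to [[1/sigma^2, 0], [0, 2/sigma^2]] = [[0.25, 0], [0, 0.5]]
```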
2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger
distance
The local relations between the Kullback-Leibler divergence, rate function and Fisher
information naturally extend to the case of a multivariate parameter. We start with the
Kullback-Leibler divergence.
Lemma 2.7.4. Let (P_θ) be a regular family. Then the KL-divergence K(θ, θ′) satisfies:

K(θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) K(θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) K(θ, θ′) |_{θ′=θ} = I(θ).

In a small neighborhood of θ, the KL-divergence can be approximated by

K(θ, θ′) ≈ (θ′ − θ)^T I(θ) (θ′ − θ) / 2.
Similar properties can be established for the rate function m(μ, θ, θ′).

Lemma 2.7.5. Let (P_θ) be a regular family. Then the rate function m(μ, θ, θ′) satisfies:

m(μ, θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) m(μ, θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) m(μ, θ, θ′) |_{θ′=θ} = μ(1 − μ) I(θ).

In a small neighborhood of θ, the rate function can be approximated by

m(μ, θ, θ′) ≈ μ(1 − μ) (θ′ − θ)^T I(θ) (θ′ − θ) / 2.

Moreover, for any θ, θ′ ∈ Θ,

m(μ, θ, θ′) |_{μ=0} = 0,
(d/dμ) m(μ, θ, θ′) |_{μ=0} = E_θ ℓ(Y, θ, θ′) = K(θ, θ′),
(d²/dμ²) m(μ, θ, θ′) |_{μ=0} = −Var_θ[ ℓ(Y, θ, θ′) ].

This implies an approximation for small μ:

m(μ, θ, θ′) ≈ μ K(θ, θ′) − (μ²/2) Var_θ[ ℓ(Y, θ, θ′) ].
Exercise 2.7.2. Check the statements of Lemmas 2.7.4 and 2.7.5.
2.7.3 Multivariate Cramer-Rao Inequality
Let θ̃ = θ̃(Y) be an estimate of the unknown parameter vector. This estimate is called unbiased if

IE_θ θ̃ ≡ θ.

Theorem 2.7.6 (Multivariate Cramer-Rao Inequality). Let θ̃ = θ̃(Y) be an unbiased estimate of θ for an i.i.d. sample from a regular family (P_θ). Then

Var_θ(θ̃) ≥ { n I(θ) }⁻¹,
IE_θ ‖θ̃ − θ‖² = tr{ Var_θ(θ̃) } ≥ tr[ { n I(θ) }⁻¹ ].

Moreover, if θ̃ is not unbiased and τ(θ) = IE_θ θ̃, then with ∇τ(θ) := (d/dθ) τ(θ), it holds

Var_θ(θ̃) ≥ ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T,

and

IE_θ ‖θ̃ − θ‖² = tr[ Var_θ(θ̃) ] + ‖τ(θ) − θ‖² ≥ tr[ ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T ] + ‖τ(θ) − θ‖².
Proof. Consider first the case of an unbiased estimate θ̃ with IE_θ θ̃ ≡ θ. Differentiating the identity (2.9), IE₀ exp L(Y, θ) ≡ 1, w.r.t. θ yields

0 ≡ ∫ ∇L(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ[ ∇L(Y, θ) ].   (2.15)

Similarly, the identity IE_θ θ̃ = θ implies

I ≡ ∫ θ̃(y) {∇L(y, θ)}^T exp{ L(y, θ) } μ₀(dy) = IE_θ[ θ̃ {∇L(Y, θ)}^T ].

Together with (2.15), this gives

IE_θ[ (θ̃ − θ) {∇L(Y, θ)}^T ] ≡ I.   (2.16)

Consider the random vector

h := { n I(θ) }⁻¹ ∇L(Y, θ).

By (2.15), IE_θ h = 0, and by (2.14),

Var_θ(h) = IE_θ( h h^T ) = n⁻² I⁻¹(θ) IE_θ[ ∇L(Y, θ) {∇L(Y, θ)}^T ] I⁻¹(θ) = { n I(θ) }⁻¹,

and the identities (2.15) and (2.16) imply that

IE_θ[ (θ̃ − θ − h) h^T ] = 0.   (2.17)

The "no bias" property yields IE_θ(θ̃ − θ) = 0 and IE_θ[ (θ̃ − θ)(θ̃ − θ)^T ] = Var_θ(θ̃). Finally, by the orthogonality (2.17),

Var_θ(θ̃) = Var_θ(h) + Var_θ( θ̃ − θ − h ) = { n I(θ) }⁻¹ + Var_θ( θ̃ − θ − h ),

and the variance of θ̃ is not smaller than { n I(θ) }⁻¹. Moreover, the equality is only possible if θ̃ − θ − h is equal to zero almost surely.

Now we consider the general case. The proof is similar. The property (2.15) continues to hold. Next, the identity IE_θ θ̃ = θ is replaced with IE_θ θ̃ = τ(θ), yielding

IE_θ[ θ̃ {∇L(Y, θ)}^T ] ≡ ∇τ(θ)

and

IE_θ[ {θ̃ − τ(θ)} {∇L(Y, θ)}^T ] ≡ ∇τ(θ).

Define

h := ∇τ(θ) { n I(θ) }⁻¹ ∇L(Y, θ).

Then, similarly to the above,

IE_θ[ h h^T ] = ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T,
IE_θ[ {θ̃ − τ(θ) − h} h^T ] = 0,

and the second assertion follows. The statements about the quadratic risk follow from its usual decomposition into squared bias and the variance of the estimate.
2.7.4 Exponential families and R-efficiency
The notion of R-efficiency naturally extends to the case of a multivariate parameter.
Definition 2.7.7. An unbiased estimate θ̃ is R-efficient if

Var_θ(θ̃) = { n I(θ) }⁻¹.

Theorem 2.7.8. An unbiased estimate θ̃ is R-efficient if and only if

θ̃ = n⁻¹ Σ U(Y_i),

where the vector function U(·) on IR satisfies ∫ U(y) dP_θ(y) ≡ θ, and the log-density ℓ(y, θ) of P_θ can be represented as

ℓ(y, θ) = C(θ)^T U(y) − B(θ) + ℓ(y),   (2.18)

for some functions C(·) and B(·) on Θ and a function ℓ(·) on IR.
Proof. Suppose first that the representation (2.18) for the log-density is correct. Denote by C′(θ) the p×p Jacobi matrix of the vector function C: C′(θ) := (d/dθ) C(θ). Then ∇ℓ(y, θ) = C′(θ) U(y) − ∇B(θ), and the identity E_θ ∇ℓ(Y, θ) = 0 implies the relation between the functions B(·) and C(·):

C′(θ) θ = ∇B(θ).   (2.19)

Next, differentiating the equality

0 ≡ ∫ [U(y) − θ] dP_θ(y) = ∫ [U(y) − θ] p(y, θ) dμ₀(y)

w.r.t. θ implies, in view of (2.19),

I ≡ IE_θ[ { C′(θ) U(Y) − ∇B(θ) } {U(Y) − θ}^T ] = C′(θ) IE_θ[ {U(Y) − θ} {U(Y) − θ}^T ].

This yields Var_θ[ U(Y) ] = {C′(θ)}⁻¹; note that C′(θ) is symmetric because its inverse is a covariance matrix. This leads to the following representation for the Fisher information matrix:

I(θ) = Var_θ[ ∇ℓ(Y, θ) ] = Var_θ[ C′(θ) U(Y) − ∇B(θ) ] = C′(θ) Var_θ[ U(Y) ] {C′(θ)}^T = C′(θ).

The estimate θ̃ = n⁻¹ Σ U(Y_i) satisfies

IE_θ θ̃ = θ,

that is, it is unbiased. Moreover,

Var_θ(θ̃) = Var_θ( n⁻¹ Σ U(Y_i) ) = n⁻² Σ Var[ U(Y_i) ] = n⁻¹ {C′(θ)}⁻¹ = { n I(θ) }⁻¹,

and θ̃ is R-efficient.

As in the univariate case, one can show that equality in the Cramer-Rao bound is only possible if ∇L(Y, θ) and θ̃ − θ are linearly dependent. This leads again to the exponential family structure of the likelihood function.
Exercise 2.7.3. Complete the proof of Theorem 2.7.8.
2.8 Maximum likelihood and other estimation methods
This section presents some other popular methods of estimating the unknown parameter
including minimum distance and M-estimation, maximum likelihood procedure, etc.
2.8.1 Minimum distance estimation
Let ρ(P, P′) denote some functional (distance) defined for measures P, P′ on the real line. We assume that ρ satisfies the following conditions: ρ(P_{θ₁}, P_{θ₂}) ≥ 0, and ρ(P_{θ₁}, P_{θ₂}) = 0 iff θ₁ = θ₂. This implies for every θ∗ ∈ Θ that

argmin_{θ∈Θ} ρ(P_θ, P_{θ∗}) = θ∗.

The Glivenko-Cantelli theorem states that P_n converges weakly to the true distribution P_{θ∗}. Therefore, it is natural to define an estimate θ̃ of θ∗ by replacing in this formula the true measure P_{θ∗} by its empirical counterpart P_n, that is, by minimizing the distance ρ between the measures P_θ and P_n over the set (P_θ). This leads to the minimum distance estimate

θ̃ = argmin_{θ∈Θ} ρ(P_θ, P_n).
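A concrete instance of this recipe can be sketched in a few lines. The example below is an assumption, not from the text: it takes ρ to be the Kolmogorov (sup) distance between the edf and the N(θ, 1) cdf and minimizes it over a crude grid of candidate values; the family, grid, sizes, and seed are all arbitrary choices.

```python
import math
import numpy as np

# Assumed example: minimum distance estimation with the Kolmogorov distance
# rho(P_theta, P_n) = sup_t |F_n(t) - F_theta(t)| for the N(theta, 1) family.
rng = np.random.default_rng(7)
theta_star, n = 0.5, 2000
Y = np.sort(rng.normal(theta_star, 1.0, size=n))
F_n = np.arange(1, n + 1) / n          # edf evaluated at the order statistics

def kolmogorov_dist(theta):
    # N(theta, 1) cdf evaluated at the sample points
    F_theta = 0.5 * (1.0 + np.array([math.erf((y - theta) / math.sqrt(2.0))
                                     for y in Y]))
    return float(np.max(np.abs(F_n - F_theta)))

grid = np.linspace(-1.0, 2.0, 301)
theta_hat = grid[int(np.argmin([kolmogorov_dist(t) for t in grid]))]
print(theta_hat)   # close to theta_star = 0.5
```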
2.8.2 M -estimation and Maximum likelihood estimation
Another general method of building an estimate of θ∗, so-called M-estimation, is defined via a contrast function ψ(y, θ) given for every y ∈ IR and θ ∈ Θ. The principal condition on ψ is that the integral IE_θ ψ(Y₁, θ′) is minimized at θ′ = θ:

θ = argmin_{θ′} ∫ ψ(y, θ′) dP_θ(y),   θ ∈ Θ.   (2.20)

In particular,

θ∗ = argmin_{θ∈Θ} ∫ ψ(y, θ) dP_{θ∗}(y),

and the M-estimate is again obtained by substitution, that is, by replacing the true measure P_{θ∗} with its empirical counterpart P_n:

θ̃ = argmin_{θ∈Θ} ∫ ψ(y, θ) dP_n(y) = argmin_{θ∈Θ} n⁻¹ Σ ψ(Y_i, θ).
Exercise 2.8.1. Let Y be an i.i.d. sample from P ∈ (P_θ, θ ∈ Θ ⊂ IR).

(i) Let g(y) satisfy ∫ g(y) dP_θ(y) ≡ θ, leading to the moment estimate

θ̃ = n⁻¹ Σ g(Y_i).

Show that this estimate can be obtained as the M-estimate for a properly selected function ψ(·).

(ii) Let ∫ g(y) dP_θ(y) ≡ m(θ) for given functions g(·) and m(·), where m(·) is monotone. Show that the moment estimate θ̃ = m⁻¹(M_n) with M_n = n⁻¹ Σ g(Y_i) can be obtained as the M-estimate for a properly selected function ψ(·).
We mention three prominent examples of the contrast function ψ and the resulting
estimates: least squares, least absolute deviation and maximum likelihood.
Least squares estimation
The least squares estimate (LSE) corresponds to the quadratic contrast ‖ψ(y) − θ‖², where ψ(y) is a p-dimensional function of the observation y satisfying

∫ ψ(y) dP_θ(y) ≡ θ,   θ ∈ Θ.

Then the true parameter θ∗ fulfills the relation

θ∗ = argmin_{θ∈Θ} ∫ ‖ψ(y) − θ‖² dP_{θ∗}(y)

because

∫ ‖ψ(y) − θ‖² dP_{θ∗}(y) = ‖θ∗ − θ‖² + ∫ ‖ψ(y) − θ∗‖² dP_{θ∗}(y).

The substitution method leads to the estimate θ̃ of θ∗ defined by minimizing the empirical version of the integral ∫ ‖ψ(y) − θ‖² dP_{θ∗}(y):

θ̃ := argmin_{θ∈Θ} ∫ ‖ψ(y) − θ‖² dP_n(y) = argmin_{θ∈Θ} Σ ‖ψ(Y_i) − θ‖².

This is again a quadratic optimization problem with a closed form solution called the least squares or ordinary least squares estimate.
Lemma 2.8.1. It holds

θ̃ = argmin_{θ∈Θ} Σ ‖ψ(Y_i) − θ‖² = n⁻¹ Σ ψ(Y_i).

One can see that the LSE θ̃ coincides with the moment estimate based on the function g(·) = ψ(·). Indeed, the equality ∫ g(y) dP_{θ∗}(y) = θ∗ leads directly to the LSE θ̃ = n⁻¹ Σ g(Y_i).
Least absolute deviation (median) estimation
The next example of an M-estimate is given by the absolute deviation contrast fit. For
simplicity of presentation, we consider here only the case of a univariate parameter. The
contrast function ψ(y, θ) is given by ψ(y, θ)def= |ψ(y) − θ| . The solution of the related
optimization problem (2.20) is given by the median med(Pθ) of the distribution Pθ .
Definition 2.8.2. The value t is called the median of a distribution function F if
F (t) ≥ 1/2, F (t−) < 1/2.
If F (·) is a continuous function then the median t = med(F ) satisfies F (t) = 1/2 .
Theorem 2.8.3. For any cdf F , the median med(F ) satisfies
infθ∈IR
∫|y − θ| dF (y) =
∫|y −med(F )| dF (y).
Proof. Consider for simplicity the case of a continuous distribution function F . One has
|y− θ| = (θ− y)1(y < θ) + (y− θ)1(y ≥ θ) . Differentiating w.r.t. θ yields the following
equation for any extreme point of∫|y − θ| dF (y) :
−∫ θ
−∞dF (y) +
∫ ∞θ
dF (y) = 0.
The median is the only solution of this equation.
Let the family $(P_\theta)$ be such that $\theta = \operatorname{med}(P_\theta)$ for all $\theta \in I\!R$. Then the M-estimation approach leads to the least absolute deviation (LAD) estimate
\[ \tilde\theta \stackrel{\rm def}{=} \operatorname{argmin}_{\theta \in I\!R} \int |y - \theta|\, dF_n(y) = \operatorname{argmin}_{\theta \in I\!R} \sum |Y_i - \theta|. \]
Due to Theorem 2.8.3, the solution of this problem is given by the median of the edf Fn .
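The statement of Theorem 2.8.3 can also be checked numerically. The following sketch (our own illustration, using simulated Gaussian data) verifies on a grid that the empirical median minimizes the LAD contrast $\sum |Y_i - \theta|$:

```python
import random
import statistics

# Numeric sketch (not from the text): the empirical median minimizes the
# least absolute deviation criterion sum_i |Y_i - theta| over theta.
random.seed(0)
Y = [random.gauss(0.0, 1.0) for _ in range(201)]  # odd n -> unique sample median

def lad(theta, sample):
    """LAD contrast: sum of absolute deviations from theta."""
    return sum(abs(y - theta) for y in sample)

med = statistics.median(Y)
# Compare the criterion at the median against a grid of competing values.
grid = [min(Y) + k * (max(Y) - min(Y)) / 400 for k in range(401)]
assert all(lad(med, Y) <= lad(t, Y) + 1e-12 for t in grid)
print("median minimizes the LAD criterion on the grid:", True)
```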
Maximum likelihood estimation
Let now $\psi(y,\theta) = -\ell(y,\theta) = -\log p(y,\theta)$, where $p(y,\theta)$ is the density of the measure $P_\theta$ at $y$ w.r.t. some dominating measure $\mu_0$. This choice leads to the maximum likelihood estimate (MLE):
\[ \tilde\theta = \operatorname{argmax}_{\theta \in \Theta} n^{-1} \sum \log p(Y_i, \theta). \]
The condition (2.20) is fulfilled because
\[ \operatorname{argmin}_{\theta'} \int \psi(y,\theta')\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} \int \big\{\psi(y,\theta') - \psi(y,\theta)\big\}\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} \int \log\frac{p(y,\theta)}{p(y,\theta')}\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} K(\theta,\theta') = \theta. \]
Here we used that the Kullback-Leibler divergence $K(\theta,\theta')$ attains its minimum, equal to zero, at the point $\theta' = \theta$, which in turn follows from the concavity of the log-function by the Jensen inequality.
Note that the definition of the MLE does not depend on the choice of the dominating
measure µ0 .
Exercise 2.8.2. Show that the MLE θ does not change if another dominating measure
is used.
Computing an M-estimate or MLE leads to solving an optimization problem for the empirical quantity $\sum \psi(Y_i,\theta)$ w.r.t. the parameter $\theta$. If the function $\psi$ is differentiable w.r.t. $\theta$, then the solution can be found from the estimating equation
\[ \frac{\partial}{\partial\theta} \sum \psi(Y_i,\theta) = 0. \]
Exercise 2.8.3. Show that any M-estimate, and particularly the MLE, can be represented as a minimum distance estimate with a properly defined distance $\rho$.
Hint: define $\rho(P_\theta, P_{\theta^*})$ as $\int \big[\psi(y,\theta) - \psi(y,\theta^*)\big]\, dP_{\theta^*}(y)$.
Recall that the MLE $\tilde\theta$ is defined by maximizing the expression $L(\theta) = \sum \ell(Y_i,\theta)$ w.r.t. $\theta$. Below we use the notation $L(\theta,\theta') \stackrel{\rm def}{=} L(\theta) - L(\theta')$, often called the log-likelihood ratio.
In our study we will focus on the value of the maximum $L(\tilde\theta) = \max_\theta L(\theta)$.
Definition 2.8.4. Let $L(\theta) = \sum \ell(Y_i,\theta)$ be the likelihood function. The value
\[ L(\tilde\theta) \stackrel{\rm def}{=} \max_\theta L(\theta) \]
is called the maximum log-likelihood or fitted log-likelihood. The excess $L(\tilde\theta) - L(\theta^*)$ is the difference between the maximum of the likelihood function $L(\theta)$ over $\theta$ and its particular value at the true parameter $\theta^*$:
\[ L(\tilde\theta,\theta^*) \stackrel{\rm def}{=} \max_\theta L(\theta) - L(\theta^*). \]
The next section collects some examples of computing the MLE θ and the corre-
sponding maximum log-likelihood.
2.9 Maximum Likelihood for some parametric families
The examples of this section focus on the structure of the log-likelihood and the corresponding MLE $\tilde\theta$ and maximum log-likelihood $L(\tilde\theta)$.
2.9.1 Gaussian shift
Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and known variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads
\[ p(y,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(y-\theta)^2}{2\sigma^2}\Big\}. \]
The log-likelihood $L(\theta)$ is
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (Y_i - \theta)^2. \]
The corresponding normal equation $L'(\theta) = 0$ yields
\[ -\frac{1}{\sigma^2}\sum (Y_i - \theta) = 0 \tag{2.21} \]
leading to the empirical mean solution $\tilde\theta = n^{-1}\sum Y_i$.
The computation of the fitted likelihood is a bit more involved.
Theorem 2.9.1. Let $Y_i = \theta^* + \varepsilon_i$ with $\varepsilon_i \sim N(0,\sigma^2)$. For any $\theta$
\[ L(\tilde\theta, \theta) = n\sigma^{-2}(\tilde\theta - \theta)^2/2. \tag{2.22} \]
Moreover,
\[ L(\tilde\theta, \theta^*) = n\sigma^{-2}(\tilde\theta - \theta^*)^2/2 = \xi^2/2, \]
where $\xi$ is a standard normal r.v., so that $2L(\tilde\theta,\theta^*)$ has the fixed $\chi^2_1$ distribution with one degree of freedom. If $z_\alpha$ is the quantile of $\chi^2_1/2$ with $P(\xi^2/2 > z_\alpha) = \alpha$, then
\[ \mathcal{E}(z_\alpha) = \{u : L(\tilde\theta, u) \le z_\alpha\} \tag{2.23} \]
is an $\alpha$-confidence set: $IP_{\theta^*}\big(\mathcal{E}(z_\alpha) \not\ni \theta^*\big) = \alpha$.
For every $r > 0$,
\[ IE_{\theta^*}\big|2L(\tilde\theta, \theta^*)\big|^r = c_r, \]
where $c_r = IE|\xi|^{2r}$ with $\xi \sim N(0,1)$.
Proof 1. Consider $L(\tilde\theta,\theta) \stackrel{\rm def}{=} L(\tilde\theta) - L(\theta)$ as a function of the parameter $\theta$. Obviously
\[ L(\tilde\theta, \theta) = -\frac{1}{2\sigma^2}\sum\big[(Y_i - \tilde\theta)^2 - (Y_i - \theta)^2\big], \]
so that $L(\tilde\theta,\theta)$ is a quadratic function of $\theta$. Next, it holds $L(\tilde\theta,\theta)\big|_{\theta=\tilde\theta} = 0$ and
\[ \frac{d}{d\theta}L(\tilde\theta,\theta)\Big|_{\theta=\tilde\theta} = -\frac{d}{d\theta}L(\theta)\Big|_{\theta=\tilde\theta} = 0 \]
due to the normal equation (2.21). Finally,
\[ \frac{d^2}{d\theta^2}L(\tilde\theta,\theta)\Big|_{\theta=\tilde\theta} = -\frac{d^2}{d\theta^2}L(\theta)\Big|_{\theta=\tilde\theta} = n/\sigma^2. \]
This implies by the Taylor expansion of the quadratic function $L(\tilde\theta,\theta)$ at $\theta = \tilde\theta$:
\[ L(\tilde\theta, \theta) = \frac{n}{2\sigma^2}(\tilde\theta - \theta)^2. \]
Proof 2. First observe that for any two points $\theta', \theta$, the log-likelihood ratio $L(\theta',\theta) = \log(dIP_{\theta'}/dIP_\theta) = L(\theta') - L(\theta)$ can be represented in the form
\[ L(\theta', \theta) = \sigma^{-2}(S - n\theta)(\theta' - \theta) - n\sigma^{-2}(\theta' - \theta)^2/2, \qquad S = \sum Y_i. \]
Substituting the MLE $\tilde\theta = S/n$ in place of $\theta'$ implies
\[ L(\tilde\theta, \theta) = n\sigma^{-2}(\tilde\theta - \theta)^2/2. \]
Now we consider the second statement about the distribution of $L(\tilde\theta,\theta^*)$. The substitution $\theta = \theta^*$ in (2.22) and the model equation $Y_i = \theta^* + \varepsilon_i$ imply $\tilde\theta - \theta^* = n^{-1/2}\sigma\xi$, where
\[ \xi \stackrel{\rm def}{=} \frac{1}{\sigma\sqrt{n}}\sum \varepsilon_i \]
is standard normal. Therefore,
\[ L(\tilde\theta, \theta^*) = \xi^2/2. \]
This easily implies the result of the theorem.
We see that under $IP_{\theta^*}$ the variable $2L(\tilde\theta,\theta^*)$ is $\chi^2_1$-distributed, and this distribution depends neither on the sample size $n$ nor on the scale parameter $\sigma$. This fact is known in a more general form as the chi-squared theorem.
Exercise 2.9.1. Check that the confidence sets
\[ \mathcal{E}^\circ(z_\alpha) \stackrel{\rm def}{=} \big[\tilde\theta - n^{-1/2}\sigma z_\alpha,\; \tilde\theta + n^{-1/2}\sigma z_\alpha\big], \]
where $z_\alpha$ is defined by $2\Phi(-z_\alpha) = \alpha$, and $\mathcal{E}(z_\alpha)$ from (2.23) coincide.
Exercise 2.9.2. Compute the constant cr from Theorem 2.9.1 for r = 0.5, 1, 1.5, 2 .
Already now we point out an interesting feature of the fitted log-likelihood $L(\tilde\theta,\theta^*)$. It can be viewed as the normalized squared loss of the estimate $\tilde\theta$ because $L(\tilde\theta,\theta^*) = n\sigma^{-2}|\tilde\theta - \theta^*|^2/2$. The last statement of Theorem 2.9.1 yields that
\[ IE_{\theta^*}|\tilde\theta - \theta^*|^{2r} = c_r\sigma^{2r}n^{-r}. \]
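The distribution-free nature of $2L(\tilde\theta,\theta^*)$ can be checked by simulation. The following sketch (our own example; the values of $n$, $\sigma$, and $\theta^*$ are arbitrary) compares the empirical tail of $2L(\tilde\theta,\theta^*)$ with the $\chi^2_1$ tail:

```python
import random
import math

# Monte Carlo sketch (assumed setup, not from the text): for the Gaussian shift
# model Y_i = theta* + eps_i, the quantity 2 L(theta_hat, theta*)
# = n sigma^{-2} (theta_hat - theta*)^2 equals xi^2 with xi standard normal,
# so its distribution is chi^2_1 for every n and sigma.
random.seed(1)
n, sigma, theta_star = 50, 2.0, 1.5
draws = []
for _ in range(20000):
    Y = [theta_star + random.gauss(0.0, sigma) for _ in range(n)]
    theta_hat = sum(Y) / n
    draws.append(n * (theta_hat - theta_star) ** 2 / sigma ** 2)  # = 2 L

def chi2_1_tail(z):
    # P(xi^2 > z) = 2 (1 - Phi(sqrt(z))) for xi ~ N(0, 1)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(math.sqrt(z) / math.sqrt(2.0))))

for z in (0.5, 1.0, 2.0):
    emp = sum(d > z for d in draws) / len(draws)
    assert abs(emp - chi2_1_tail(z)) < 0.02
print("empirical tails match chi^2_1:", True)
```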
2.9.2 Variance estimation for the normal law
Let $Y_i$ be i.i.d. normal with mean zero and unknown variance $\theta^*$:
\[ Y_i \sim N(0, \theta^*), \qquad \theta^* \in I\!R_+. \]
The likelihood function reads
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\theta) - \frac{1}{2\theta}\sum Y_i^2. \]
The normal equation $L'(\theta) = 0$ yields
\[ L'(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum Y_i^2 = 0, \]
leading to
\[ \tilde\theta = \frac{1}{n}S_n \quad \text{with } S_n = \sum Y_i^2. \]
Moreover, for any $\theta$,
\[ L(\tilde\theta, \theta) = -\frac{n}{2}\log(\tilde\theta/\theta) - \frac{S_n}{2}\big(1/\tilde\theta - 1/\theta\big) = nK(\tilde\theta, \theta), \]
where
\[ K(\theta, \theta') = \frac{1}{2}\big[\log(\theta'/\theta) + \theta/\theta' - 1\big] \]
is the Kullback-Leibler divergence between the two Gaussian measures $N(0,\theta)$ and $N(0,\theta')$.
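The identity $L(\tilde\theta,\theta) = nK(\tilde\theta,\theta)$ can be verified numerically; the following sketch (our own example, with arbitrary true variance $\theta^*$ and comparison point $\theta$) checks it for simulated data:

```python
import random
import math

# Numeric sketch (assumed setup): for the variance model Y_i ~ N(0, theta*),
# check the identity L(theta_hat, theta) = n K(theta_hat, theta) with
# K(theta, theta') = (1/2)[log(theta'/theta) + theta/theta' - 1].
random.seed(2)
n, theta_star = 100, 4.0
Y = [random.gauss(0.0, math.sqrt(theta_star)) for _ in range(n)]
S_n = sum(y * y for y in Y)
theta_hat = S_n / n

def loglik(theta):
    return -0.5 * n * math.log(2 * math.pi * theta) - S_n / (2 * theta)

def K(theta, theta_p):
    return 0.5 * (math.log(theta_p / theta) + theta / theta_p - 1.0)

theta = 2.5  # an arbitrary comparison point
lhs = loglik(theta_hat) - loglik(theta)
rhs = n * K(theta_hat, theta)
assert abs(lhs - rhs) < 1e-9
print("L(theta_hat, theta) == n K(theta_hat, theta):", True)
```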
2.9.3 Univariate normal distribution
Let $Y_i$ be normal, $N(\alpha, \sigma^2)$, as in the previous examples, but now neither the mean $\alpha$ nor the variance $\sigma^2$ is known. This leads to estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$ from the i.i.d. sample $Y$.
The maximum likelihood approach leads to maximizing the log-likelihood w.r.t. the vector $\theta = (\alpha, \sigma^2)^\top$:
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\theta_2) - \frac{1}{2\theta_2}\sum (Y_i - \theta_1)^2. \]
Exercise 2.9.3. Check that the ML approach leads to the same estimates (2.2) as the
method of moments.
2.9.4 Uniform distribution on [0, θ]
Let $Y_i$ be uniformly distributed on the interval $[0,\theta]$ of the real line, where the right end point $\theta$ is unknown. The density $p(y,\theta)$ of $P_\theta$ w.r.t. the Lebesgue measure is $\theta^{-1}\mathbf{1}(y \le \theta)$. The likelihood reads
\[ Z(\theta) = \theta^{-n}\,\mathbf{1}\big(\max_i Y_i \le \theta\big). \]
This likelihood is positive iff $\theta \ge \max_i Y_i$, and it is maximized exactly at $\theta = \max_i Y_i$, so $\tilde\theta = \max_i Y_i$. One can see that the MLE $\tilde\theta$ is the limiting case of the moment estimate $\tilde\theta_k$ as $k$ grows to infinity.
2.9.5 Bernoulli or binomial model
Let $P_\theta$ be a Bernoulli law for $\theta \in [0,1]$. The density of $Y_i$ under $P_\theta$ can be written as
\[ p(y,\theta) = \theta^y(1-\theta)^{1-y}. \]
The corresponding log-likelihood reads
\[ L(\theta) = \sum\big\{Y_i\log\theta + (1-Y_i)\log(1-\theta)\big\} = S_n\log\frac{\theta}{1-\theta} + n\log(1-\theta) \]
with $S_n = \sum Y_i$. Maximizing this expression w.r.t. $\theta$ results again in the empirical mean
\[ \tilde\theta = S_n/n. \]
This implies
\[ L(\tilde\theta, \theta) = n\tilde\theta\log\frac{\tilde\theta}{\theta} + n(1-\tilde\theta)\log\frac{1-\tilde\theta}{1-\theta} = nK(\tilde\theta, \theta), \]
where $K(\theta,\theta') = \theta\log(\theta/\theta') + (1-\theta)\log\{(1-\theta)/(1-\theta')\}$ is the Kullback-Leibler divergence for the Bernoulli law.
2.9.6 Multinomial model
The multinomial distribution $B^m_\theta$ describes the number of successes in $m$ experiments when one success has the probability $\theta \in [0,1]$. This distribution can be viewed as the sum of $m$ Bernoulli experiments with the same parameter $\theta$.
One has
\[ P_\theta(Y_1 = k) = \binom{m}{k}\theta^k(1-\theta)^{m-k}, \qquad k = 0,\ldots,m. \]
Exercise 2.9.4. Check that the ML approach leads to the estimate
\[ \tilde\theta = \frac{1}{mn}\sum Y_i. \]
Compute $L(\tilde\theta,\theta)$.
2.9.7 Exponential model
Let $Y_1,\ldots,Y_n$ be i.i.d. exponential random variables with parameter $\theta^* > 0$. This means that the $Y_i$ are nonnegative and satisfy $IP(Y_i > t) = e^{-t/\theta^*}$. The density of the exponential law w.r.t. the Lebesgue measure is $p(y,\theta^*) = e^{-y/\theta^*}/\theta^*$. The corresponding log-likelihood can be written as
\[ L(\theta) = -n\log\theta - \sum_{i=1}^{n}Y_i/\theta = -S/\theta - n\log\theta, \]
where $S = Y_1 + \ldots + Y_n$.
The ML estimating equation yields $S/\theta^2 = n/\theta$, or
\[ \tilde\theta = S/n. \]
For the fitted log-likelihood $L(\tilde\theta,\theta)$ this gives
\[ L(\tilde\theta, \theta) = n(\tilde\theta/\theta - 1) - n\log(\tilde\theta/\theta) = nK(\tilde\theta, \theta). \]
Here once again $K(\theta,\theta') = \theta/\theta' - 1 - \log(\theta/\theta')$ is the Kullback-Leibler divergence for the exponential law.
2.9.8 Poisson model
Let $Y_1,\ldots,Y_n$ be i.i.d. Poisson random variables with parameter $\theta^* > 0$, that is, $IP(Y_i = m) = (\theta^*)^m e^{-\theta^*}/m!$ for $m = 0,1,2,\ldots$. The corresponding log-likelihood can be written as
\[ L(\theta) = \sum_{i=1}^{n}\log\big(\theta^{Y_i}e^{-\theta}/Y_i!\big) = \log\theta\sum_{i=1}^{n}Y_i - n\theta - \sum_{i=1}^{n}\log(Y_i!) = S\log\theta - n\theta + R, \]
where $S = Y_1 + \ldots + Y_n$ and $R = -\sum_{i=1}^{n}\log(Y_i!)$. Here we use that $0! = 1$.
The ML estimating equation immediately yields $S/\theta = n$, or
\[ \tilde\theta = S/n. \]
For the fitted log-likelihood $L(\tilde\theta,\theta)$ this gives
\[ L(\tilde\theta, \theta) = n\tilde\theta\log(\tilde\theta/\theta) - n(\tilde\theta - \theta) = nK(\tilde\theta, \theta). \]
Here again $K(\theta,\theta') = \theta\log(\theta/\theta') - (\theta - \theta')$ is the Kullback-Leibler divergence for the Poisson law.
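The same identity can be checked numerically for the Poisson family; the sketch below (our own example, using a simple cdf-inversion sampler) verifies $L(\tilde\theta,\theta) = nK(\tilde\theta,\theta)$ on simulated data:

```python
import random
import math

# Numeric sketch (assumed setup): for i.i.d. Poisson observations check the
# identity L(theta_hat, theta) = n K(theta_hat, theta) with the Poisson
# divergence K(theta, theta') = theta log(theta/theta') - (theta - theta').
def rpois(lam, rng):
    """Draw one Poisson variate by inversion of the cdf."""
    u, k, p, cdf = rng.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(3)
n, theta_star = 200, 3.0
Y = [rpois(theta_star, rng) for _ in range(n)]
S = sum(Y)
theta_hat = S / n

def loglik(theta):
    # the additive term -sum(log Y_i!) cancels in likelihood ratios; dropped
    return S * math.log(theta) - n * theta

def K(theta, theta_p):
    return theta * math.log(theta / theta_p) - (theta - theta_p)

theta = 2.0
assert abs((loglik(theta_hat) - loglik(theta)) - n * K(theta_hat, theta)) < 1e-9
print("Poisson fitted log-likelihood equals n*K:", True)
```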
2.9.9 Shift of a Laplace (double exponential) law
Let $P_0$ be the symmetric distribution defined by the equations
\[ P_0(|Y_1| > y) = e^{-y/\sigma}, \qquad y \ge 0, \]
for some given $\sigma > 0$. Equivalently, one can say that the absolute value of $Y_1$ is exponential with parameter $\sigma$ under $P_0$. Now define $P_\theta$ by shifting $P_0$ by the value $\theta$. This means that
\[ P_\theta(|Y_1 - \theta| > y) = e^{-y/\sigma}, \qquad y \ge 0. \]
The density of $Y_1 - \theta$ under $P_\theta$ is $p(y) = (2\sigma)^{-1}e^{-|y|/\sigma}$. The maximum likelihood approach leads to maximizing the sum
\[ L(\theta) = -n\log(2\sigma) - \sum |Y_i - \theta|/\sigma, \]
or, equivalently, to minimizing the sum $\sum |Y_i - \theta|$:
\[ \tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta|. \tag{2.24} \]
This is just the least absolute deviation estimate given by the median of the edf:
\[ \tilde\theta = \operatorname{med}(F_n). \]
Exercise 2.9.5. Show that the median solves the problem (2.24).
Hint: suppose that $n$ is odd. Consider the ordered observations $Y_{(1)} \le Y_{(2)} \le \ldots \le Y_{(n)}$. Show that the median of $P_n$ is given by $Y_{((n+1)/2)}$. Show that this point solves (2.24).
2.10 Quasi Maximum Likelihood approach
Let $Y = (Y_1,\ldots,Y_n)^\top$ be a sample from a marginal distribution $P$. Let also $(P_\theta, \theta \in \Theta)$ be a given parametric family with the log-likelihood $\ell(y,\theta)$. The parametric approach is based on the assumption that the underlying distribution $P$ belongs to this family. The quasi maximum likelihood method applies the maximum likelihood approach to the family $(P_\theta)$ even if the underlying distribution $P$ does not belong to this family. This leads again to the estimate $\tilde\theta$ maximizing the expression $L(\theta) = \sum \ell(Y_i,\theta)$, which is called the quasi MLE. It might happen that the true distribution belongs to some other parametric family for which one can also construct the MLE. However, there can be serious reasons for applying the quasi maximum likelihood approach even in this misspecified case. One of them is that the properties of the estimate $\tilde\theta$ are essentially determined by the geometric structure of the log-likelihood. The use of a parametric family with a nice geometric structure (e.g. one whose log-likelihood is a quadratic or convex function of the parameter) can substantially reduce the algorithmic burden and improve the behavior of the method.
2.10.1 LSE as quasi likelihood estimation
Consider the model
\[ Y_i = \theta^* + \varepsilon_i \tag{2.25} \]
where $\theta^* \in I\!R$ is the parameter of interest and the $\varepsilon_i$ are random errors satisfying $IE\varepsilon_i = 0$. The assumption that the $\varepsilon_i$ are i.i.d. normal $N(0,\sigma^2)$ leads to the quasi log-likelihood
\[ L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (Y_i - \theta)^2. \]
Maximizing the expression $L(\theta)$ leads to minimizing the sum of the squared residuals $(Y_i - \theta)^2$:
\[ \tilde\theta = \operatorname{argmin}_\theta \sum (Y_i - \theta)^2 = \frac{1}{n}\sum Y_i. \]
This estimate is called a least squares estimate (LSE) or ordinary least squares estimate (oLSE).
Example 2.10.1. Consider the model (2.25) with heterogeneous errors, that is, the $\varepsilon_i$ are independent normal with zero mean and variances $\sigma_i^2$. The corresponding log-likelihood reads
\[ L^\circ(\theta) = -\frac{1}{2}\sum\Big\{\log(2\pi\sigma_i^2) + \frac{(Y_i - \theta)^2}{\sigma_i^2}\Big\}. \]
The MLE $\tilde\theta^\circ$ is
\[ \tilde\theta^\circ \stackrel{\rm def}{=} \operatorname{argmax}_\theta L^\circ(\theta) = N^{-1}\sum Y_i/\sigma_i^2, \qquad N = \sum \sigma_i^{-2}. \]
We now compare the estimates $\tilde\theta$ and $\tilde\theta^\circ$.
Lemma 2.10.1. The following assertions hold for the estimate $\tilde\theta$:
1. $\tilde\theta$ is unbiased: $IE_{\theta^*}\tilde\theta = \theta^*$.
2. The quadratic risk of $\tilde\theta$ is equal to the variance $\operatorname{Var}(\tilde\theta)$ given by
\[ \mathcal{R}(\tilde\theta,\theta^*) \stackrel{\rm def}{=} IE_{\theta^*}|\tilde\theta - \theta^*|^2 = \operatorname{Var}(\tilde\theta) = n^{-2}\sum \sigma_i^2. \]
3. $\tilde\theta$ is not R-efficient unless all the $\sigma_i^2$ are equal.
Now we consider the MLE $\tilde\theta^\circ$.
Lemma 2.10.2. The following assertions hold for the estimate $\tilde\theta^\circ$:
1. $\tilde\theta^\circ$ is unbiased: $IE_{\theta^*}\tilde\theta^\circ = \theta^*$.
2. The quadratic risk of $\tilde\theta^\circ$ is equal to the variance $\operatorname{Var}(\tilde\theta^\circ)$ given by
\[ \mathcal{R}(\tilde\theta^\circ,\theta^*) \stackrel{\rm def}{=} IE_{\theta^*}|\tilde\theta^\circ - \theta^*|^2 = \operatorname{Var}(\tilde\theta^\circ) = N^{-2}\sum \sigma_i^{-2} = N^{-1}. \]
3. $\tilde\theta^\circ$ is R-efficient.
Exercise 2.10.1. Check the statements of Lemmas 2.10.1 and 2.10.2.
Hint: compute the Fisher information for the model (2.25) using the additivity property:
\[ I(\theta) = \sum I^{(i)}(\theta) = \sum \sigma_i^{-2} = N, \]
where $I^{(i)}(\theta)$ is the Fisher information in the marginal model $Y_i = \theta + \varepsilon_i$ with just one observation $Y_i$. Apply the Cramer-Rao inequality to one observation of the vector $Y$.
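The gain of the weighted estimate $\tilde\theta^\circ$ over the plain mean $\tilde\theta$ can be illustrated by simulation; the sketch below (our own example with an arbitrary pattern of variances $\sigma_i^2$) compares the empirical variances with the formulas of Lemmas 2.10.1 and 2.10.2:

```python
import random

# Simulation sketch (assumed setup): under heterogeneous errors the weighted
# quasi MLE theta_circ = (sum Y_i/sigma_i^2)/N with N = sum 1/sigma_i^2 has
# variance 1/N, smaller than Var(mean) = n^{-2} sum sigma_i^2.
random.seed(4)
theta_star = 1.0
sigmas = [0.5, 0.5, 3.0, 0.5, 3.0, 0.5, 0.5, 3.0, 0.5, 0.5]
n = len(sigmas)
N = sum(s ** -2 for s in sigmas)

mean_est, wls_est = [], []
for _ in range(20000):
    Y = [theta_star + random.gauss(0.0, s) for s in sigmas]
    mean_est.append(sum(Y) / n)
    wls_est.append(sum(y / s ** 2 for y, s in zip(Y, sigmas)) / N)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# theoretical values: n^{-2} sum sigma_i^2 for the mean, 1/N for the MLE
assert abs(var(mean_est) - sum(s ** 2 for s in sigmas) / n ** 2) < 0.01
assert abs(var(wls_est) - 1.0 / N) < 0.01
assert var(wls_est) < var(mean_est)
print("weighted estimate has smaller variance:", True)
```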
2.10.2 LAD and robust estimation as quasi likelihood estimation
Consider again the model (2.25). The classical least squares approach faces serious problems if the available data $Y$ are contaminated with outliers. The reasons for contamination could be missing data, typing errors, etc. Unfortunately, even a single outlier can significantly disturb the sum $L(\theta)$ and thus the estimate $\tilde\theta$. A typical approach, proposed and developed by Huber, is to apply another "influence function" $\psi(Y_i - \theta)$ in the sum $L(\theta)$ in place of the squared residual $|Y_i - \theta|^2$, leading to the M-estimate
\[ \tilde\theta = \operatorname{argmin}_\theta \sum \psi(Y_i - \theta). \tag{2.26} \]
A popular $\psi$-function for robust estimation is the absolute value $|Y_i - \theta|$. The resulting estimate
\[ \tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta| \]
is called the least absolute deviation estimate, and the solution is the median of the empirical distribution $P_n$. Another proposal is called the Huber function: it is quadratic in a vicinity of zero and linear outside:
\[ \psi(x) = \begin{cases} x^2 & \text{if } |x| \le t, \\ a|x| + b & \text{otherwise.} \end{cases} \]
Exercise 2.10.2. Show that for each t > 0 , the coefficients a = a(t) and b = b(t) can
be selected to provide that ψ(x) and its derivative are continuous.
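One candidate choice (our own computation, not given in the text, obtained by matching the value and the derivative of $\psi$ at $|x| = t$) is $a(t) = 2t$ and $b(t) = -t^2$; the sketch below checks the continuity numerically:

```python
# Numeric sketch for Exercise 2.10.2 (a = 2t, b = -t^2 is one candidate
# choice obtained by matching psi and psi' at |x| = t; verify it here).
def huber(x, t, a, b):
    return x * x if abs(x) <= t else a * abs(x) + b

def dhuber(x, t, a, b):
    # derivative away from the switch point
    return 2 * x if abs(x) <= t else a * (1 if x > 0 else -1)

t = 1.345
a, b = 2 * t, -t * t
eps = 1e-7
# continuity of psi and psi' across the switch point x = t
assert abs(huber(t - eps, t, a, b) - huber(t + eps, t, a, b)) < 1e-5
assert abs(dhuber(t - eps, t, a, b) - dhuber(t + eps, t, a, b)) < 1e-5
print("psi and psi' continuous at |x| = t:", True)
```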
A remarkable fact about this approach is that every such estimate can be viewed as a quasi MLE for the model (2.25). Indeed, for a given function $\psi$, define the measure $P_\theta$ with the log-density $\ell(y,\theta) = -\psi(y - \theta)$. Then the log-likelihood is $L(\theta) = -\sum \psi(Y_i - \theta)$, and the corresponding (quasi) MLE coincides with (2.26).
Exercise 2.10.3. Suggest a $\sigma$-finite measure $\mu$ such that $\exp\{-\psi(y-\theta)\}$ is the density of $Y_i$ for the model (2.25) w.r.t. the measure $\mu$.
Hint: suppose for simplicity that
\[ C_\psi \stackrel{\rm def}{=} \int \exp\{-\psi(x)\}\, dx < \infty. \]
Show that $C_\psi^{-1}\exp\{-\psi(y-\theta)\}$ is a density w.r.t. the Lebesgue measure for any $\theta$.
Exercise 2.10.4. Show that the LAD estimate $\tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta|$ is the quasi MLE for the model (2.25) when the errors $\varepsilon_i$ are assumed Laplacian (double exponential) with density $p(x) = (1/2)e^{-|x|}$.
2.11 Univariate exponential families
Most parametric families considered in the previous sections are particular cases of exponential family (EF) distributions. This includes the Gaussian shift, Bernoulli, Poisson, exponential, and volatility models. The notion of an EF already appeared in the context of the Cramer-Rao inequality. Now we study such families in further detail.
We say that $\mathcal{P}$ is an EF if all measures $P_\theta \in \mathcal{P}$ are dominated by a $\sigma$-finite measure $\mu_0$ on $\mathcal{Y}$ and the density functions $p(y,\theta) = dP_\theta/d\mu_0(y)$ are of the form
\[ p(y,\theta) \stackrel{\rm def}{=} \frac{dP_\theta}{d\mu_0}(y) = p(y)\, e^{yC(\theta) - B(\theta)}. \]
Here $C(\theta)$ and $B(\theta)$ are given nondecreasing functions on $\Theta$ and $p(y)$ is a nonnegative function on $\mathcal{Y}$.
Usually one assumes some regularity conditions on the family $\mathcal{P}$. One possibility was already given when we discussed the Cramer-Rao inequality; see Definition 2.5.5. Below we assume that this condition is always fulfilled. It basically means that we can differentiate w.r.t. $\theta$ under the integral sign.
For an EF, the log-likelihood admits an especially simple representation, nearly linear in $y$:
\[ \ell(y,\theta) \stackrel{\rm def}{=} \log p(y,\theta) = yC(\theta) - B(\theta) + \log p(y), \]
so that the log-likelihood ratio for $\theta, \theta' \in \Theta$ reads
\[ \ell(y,\theta,\theta') \stackrel{\rm def}{=} \ell(y,\theta) - \ell(y,\theta') = y\big[C(\theta) - C(\theta')\big] - \big[B(\theta) - B(\theta')\big]. \]
2.11.1 Natural parametrization
Let $\mathcal{P} = (P_\theta)$ be an EF. By $Y$ we denote one observation from the distribution $P_\theta \in \mathcal{P}$. In addition to the regularity conditions, one often assumes the natural parametrization for the family $\mathcal{P}$, which means the relation $E_\theta Y = \theta$. Note that this relation is fulfilled for all the examples of EFs considered so far in the previous section. Obviously the natural parametrization is only possible under the following identifiability condition: any two different measures from the considered parametric family have different mean values. Under this condition the natural parametrization is always possible: just define $\theta$ as the expectation of $Y$. Below we use the abbreviation EFn for an exponential family with natural parametrization.
Some properties of an EFn. The natural parametrization implies an important property for the functions $B(\theta)$ and $C(\theta)$.
Lemma 2.11.1. Let $(P_\theta)$ be a naturally parameterized EF. Then
\[ B'(\theta) = \theta C'(\theta). \]
Proof. Differentiating both sides of the equation $\int p(y,\theta)\mu_0(dy) = 1$ w.r.t. $\theta$ yields
\[ 0 = \int\big\{yC'(\theta) - B'(\theta)\big\}p(y,\theta)\mu_0(dy) = \int\big\{yC'(\theta) - B'(\theta)\big\}P_\theta(dy) = \theta C'(\theta) - B'(\theta), \]
and the result follows.
The next lemma computes the important characteristics of a natural EF: the Kullback-Leibler divergence $K(\theta,\theta') = E_\theta\log\big(p(Y,\theta)/p(Y,\theta')\big)$, the Fisher information $I(\theta) \stackrel{\rm def}{=} E_\theta|\ell'(Y,\theta)|^2$, and the rate function $m(\mu,\theta,\theta') = -\log E_\theta\exp\{\mu\ell(Y,\theta,\theta')\}$.
Lemma 2.11.2. Let $(P_\theta)$ be an EFn. Then with $\theta, \theta' \in \Theta$ fixed, it holds for
• the Kullback-Leibler divergence $K(\theta,\theta') = E_\theta\log\big(p(Y,\theta)/p(Y,\theta')\big)$:
\[ K(\theta,\theta') = \int\log\frac{p(y,\theta)}{p(y,\theta')}P_\theta(dy) = \big\{C(\theta) - C(\theta')\big\}\int y\,P_\theta(dy) - \big\{B(\theta) - B(\theta')\big\} = \theta\big\{C(\theta) - C(\theta')\big\} - \big\{B(\theta) - B(\theta')\big\}; \tag{2.27} \]
• the Fisher information $I(\theta) \stackrel{\rm def}{=} E_\theta|\ell'(Y,\theta)|^2$:
\[ I(\theta) = C'(\theta); \]
• the rate function $m(\mu,\theta,\theta') = -\log E_\theta\exp\{\mu\ell(Y,\theta,\theta')\}$:
\[ m(\mu,\theta,\theta') = K\big(\theta, \theta + \mu(\theta' - \theta)\big); \]
• the variance $\operatorname{Var}_\theta(Y)$:
\[ \operatorname{Var}_\theta(Y) = 1/I(\theta) = 1/C'(\theta). \tag{2.28} \]
Proof. Differentiating the equality
\[ 0 \equiv \int (y - \theta)\,P_\theta(dy) = \int (y - \theta)\,e^{\ell(y,\theta)}\mu_0(dy) \]
w.r.t. $\theta$ implies, in view of Lemma 2.11.1,
\[ 1 \equiv IE_\theta\big[(Y - \theta)\big\{C'(\theta)Y - B'(\theta)\big\}\big] = C'(\theta)\,IE_\theta(Y - \theta)^2. \]
This yields $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$ and leads to the following representation of the Fisher information:
\[ I(\theta) = \operatorname{Var}_\theta\big[\ell'(Y,\theta)\big] = \operatorname{Var}_\theta\big[C'(\theta)Y - B'(\theta)\big] = \big[C'(\theta)\big]^2\operatorname{Var}_\theta(Y) = C'(\theta). \]
Exercise 2.11.1. Check the equations for the Kullback-Leibler divergence and Fisher
information from Lemma 2.11.2.
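A quick numerical sanity check of Lemma 2.11.2 (our own example, using the Bernoulli family, for which $C(\theta) = \log\{\theta/(1-\theta)\}$) verifies the relation $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$:

```python
# Numeric sketch (Bernoulli as EFn, an example we chose): with
# C(theta) = log(theta/(1-theta)), Lemma 2.11.2 gives
# Var_theta(Y) = 1/C'(theta) = theta(1-theta) and I(theta) = C'(theta).
def C_prime(theta):
    # derivative of C(theta) = log(theta) - log(1 - theta)
    return 1.0 / theta + 1.0 / (1.0 - theta)  # = 1/(theta(1-theta))

for theta in (0.1, 0.3, 0.5, 0.8):
    var_bernoulli = theta * (1.0 - theta)
    assert abs(var_bernoulli - 1.0 / C_prime(theta)) < 1e-12
print("Var_theta(Y) = 1/C'(theta) for the Bernoulli family:", True)
```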
MLE and maximum likelihood for an EFn. Now we discuss maximum likelihood estimation for a sample from an EFn. The log-likelihood can be represented in the form
\[ L(\theta) = \sum_{i=1}^{n}\log p(Y_i,\theta) = C(\theta)\sum_{i=1}^{n}Y_i - B(\theta)\sum_{i=1}^{n}1 + \sum_{i=1}^{n}\log p(Y_i) = SC(\theta) - nB(\theta) + R, \tag{2.29} \]
where
\[ S = \sum_{i=1}^{n}Y_i, \qquad R = \sum_{i=1}^{n}\log p(Y_i). \]
The remainder term $R$ is unimportant because it does not depend on $\theta$ and thus does not enter the likelihood ratio. The maximum likelihood estimate $\tilde\theta$ is defined by maximizing $L(\theta)$ w.r.t. $\theta$, that is,
\[ \tilde\theta = \operatorname{argmax}_{\theta\in\Theta}L(\theta) = \operatorname{argmax}_{\theta\in\Theta}\big\{SC(\theta) - nB(\theta)\big\}. \]
In the case of an EF with the natural parametrization, this optimization problem admits a closed form solution given by the next theorem.
Theorem 2.11.3. Let $(P_\theta)$ be an EFn. Then the MLE $\tilde\theta$ fulfills
\[ \tilde\theta = S/n = n^{-1}\sum_{i=1}^{n}Y_i. \]
It holds
\[ IE_\theta\tilde\theta = \theta, \qquad \operatorname{Var}_\theta(\tilde\theta) = [nI(\theta)]^{-1} = [nC'(\theta)]^{-1}, \]
so that $\tilde\theta$ is R-efficient. Moreover, the fitted log-likelihood $L(\tilde\theta,\theta) \stackrel{\rm def}{=} L(\tilde\theta) - L(\theta)$ satisfies for any $\theta \in \Theta$:
\[ L(\tilde\theta, \theta) = nK(\tilde\theta, \theta). \tag{2.30} \]
Proof. Maximization of $L(\theta)$ w.r.t. $\theta$ leads to the estimating equation $nB'(\theta) - SC'(\theta) = 0$. This and the identity $B'(\theta) = \theta C'(\theta)$ yield the MLE
\[ \tilde\theta = S/n. \]
The variance $\operatorname{Var}_\theta(\tilde\theta)$ is computed using (2.28) from Lemma 2.11.2. The formula (2.27) for the Kullback-Leibler divergence and (2.29) yield the representation (2.30) for the fitted log-likelihood $L(\tilde\theta,\theta)$ for any $\theta \in \Theta$.
One can see that the estimate θ is the mean of the Yi ’s. As for the Gaussian
shift model, this estimate can be motivated by the fact that the expectation of every
observation Yi under Pθ is just θ and by the law of large numbers the empirical mean
converges to its expectation as the sample size n grows.
2.11.2 Canonical parametrization
Another useful representation of an EF is given by the so-called canonical parametrization. We say that $\upsilon$ is the canonical parameter for this EF if the density of each measure $P_\upsilon$ w.r.t. the dominating measure $\mu_0$ is of the form
\[ p(y,\upsilon) \stackrel{\rm def}{=} \frac{dP_\upsilon}{d\mu_0}(y) = p(y)\exp\big\{y\upsilon - d(\upsilon)\big\}. \]
Here $d(\upsilon)$ is a given convex function on $\Theta$ and $p(y)$ is a nonnegative function on $\mathcal{Y}$. The abbreviation EFc will indicate an EF with the canonical parametrization.
Some properties of an EFc. The next relation is an obvious corollary of the definition:
Lemma 2.11.4. An EFn $(P_\theta)$ always permits a unique canonical representation. The canonical parameter $\upsilon$ is related to the natural parameter $\theta$ by $\upsilon = C(\theta)$, $d(\upsilon) = B(\theta)$, and $\theta = d'(\upsilon)$.
Proof. The first two relations follow from the definition. They imply $B'(\theta) = d'(\upsilon)\cdot d\upsilon/d\theta = d'(\upsilon)\cdot C'(\theta)$, and the last statement follows from $B'(\theta) = \theta C'(\theta)$.
The log-likelihood ratio $\ell(y,\upsilon,\upsilon_1)$ for an EFc reads as
\[ \ell(Y,\upsilon,\upsilon_1) = Y(\upsilon - \upsilon_1) - d(\upsilon) + d(\upsilon_1). \]
The next lemma collects some useful facts about an EFc.
Lemma 2.11.5. Let $\mathcal{P} = (P_\upsilon, \upsilon \in U)$ be an EFc and let the function $d(\cdot)$ be two times continuously differentiable. Then it holds for any $\upsilon, \upsilon_1 \in U$:
(i). The mean $E_\upsilon Y$ and the variance $\operatorname{Var}_\upsilon(Y)$ fulfill
\[ E_\upsilon Y = d'(\upsilon), \qquad \operatorname{Var}_\upsilon(Y) = E_\upsilon(Y - E_\upsilon Y)^2 = d''(\upsilon). \]
(ii). The Fisher information $I(\upsilon) \stackrel{\rm def}{=} E_\upsilon|\ell'(Y,\upsilon)|^2$ satisfies
\[ I(\upsilon) = d''(\upsilon). \]
(iii). The Kullback-Leibler divergence $K_c(\upsilon,\upsilon_1) = E_\upsilon\ell(Y,\upsilon,\upsilon_1)$ satisfies
\[ K_c(\upsilon,\upsilon_1) = \int\log\frac{p(y,\upsilon)}{p(y,\upsilon_1)}P_\upsilon(dy) = d'(\upsilon)(\upsilon - \upsilon_1) - \big\{d(\upsilon) - d(\upsilon_1)\big\} = d''(\breve\upsilon)\,(\upsilon_1 - \upsilon)^2/2, \]
where $\breve\upsilon$ is a point between $\upsilon$ and $\upsilon_1$. Moreover, for $\upsilon \le \upsilon_1 \in U$
\[ K_c(\upsilon,\upsilon_1) = \int_{\upsilon}^{\upsilon_1}(\upsilon_1 - u)\,d''(u)\,du. \]
(iv). The rate function $m(\mu,\upsilon_1,\upsilon) \stackrel{\rm def}{=} -\log IE_\upsilon\exp\{\mu\ell(Y,\upsilon_1,\upsilon)\}$ fulfills
\[ m(\mu,\upsilon_1,\upsilon) = \mu K_c(\upsilon,\upsilon_1) - K_c\big(\upsilon, \upsilon + \mu(\upsilon_1 - \upsilon)\big). \]
Proof. Differentiating the equation $\int p(y,\upsilon)\mu_0(dy) = 1$ w.r.t. $\upsilon$ yields
\[ \int\big\{y - d'(\upsilon)\big\}p(y,\upsilon)\mu_0(dy) = 0, \]
that is, $E_\upsilon Y = d'(\upsilon)$. The expression for the variance can be proved by differentiating this equation once more. Similarly one can check (ii). The item (iii) can be checked by simple algebra, and (iv) follows from (i).
Further, for any $\upsilon, \upsilon_1 \in U$, it holds
\[ \ell(Y,\upsilon_1,\upsilon) - E_\upsilon\ell(Y,\upsilon_1,\upsilon) = (\upsilon_1 - \upsilon)\big\{Y - d'(\upsilon)\big\}, \]
and with $u = \mu(\upsilon_1 - \upsilon)$
\[ \log E_\upsilon\exp\big\{u\big(Y - d'(\upsilon)\big)\big\} = -ud'(\upsilon) + d(\upsilon + u) - d(\upsilon) + \log E_\upsilon\exp\big\{uY - d(\upsilon + u) + d(\upsilon)\big\} = d(\upsilon + u) - d(\upsilon) - ud'(\upsilon) = K_c(\upsilon, \upsilon + u), \]
because
\[ E_\upsilon\exp\big\{uY - d(\upsilon + u) + d(\upsilon)\big\} = E_\upsilon\frac{dP_{\upsilon+u}}{dP_\upsilon} = 1, \]
and (iv) follows by (iii).
Table 2.1 presents the canonical parameter and the Fisher information for the exam-
ples of exponential families from Section 2.9.
Table 2.1: $\upsilon(\theta)$, $d(\upsilon)$, $I(\upsilon) = d''(\upsilon)$, and $\theta = \theta(\upsilon)$ for the examples from Section 2.9.

Model                 | υ               | d(υ)             | I(υ)            | θ(υ)
Gaussian regression   | θ/σ²            | υ²σ²/2           | σ²              | σ²υ
Bernoulli model       | log(θ/(1−θ))    | log(1 + e^υ)     | e^υ/(1 + e^υ)²  | e^υ/(1 + e^υ)
Poisson model         | log θ           | e^υ              | e^υ             | e^υ
Exponential model     | 1/θ             | −log υ           | 1/υ²            | 1/υ
Volatility model      | −1/(2θ)         | −(1/2) log(−2υ)  | 1/(2υ²)         | −1/(2υ)
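The entries of Table 2.1 can be checked numerically via finite differences; the sketch below (our own check of the Bernoulli row) verifies $\theta(\upsilon) = d'(\upsilon)$ and $I(\upsilon) = d''(\upsilon)$:

```python
import math

# Numeric sketch: check the Bernoulli row of Table 2.1 by finite differences,
# i.e. theta(upsilon) = d'(upsilon) and I(upsilon) = d''(upsilon) for
# d(upsilon) = log(1 + e^upsilon).
def d(u):
    return math.log(1.0 + math.exp(u))

h = 1e-5
for u in (-1.0, 0.0, 0.7):
    d1 = (d(u + h) - d(u - h)) / (2 * h)            # ~ d'(u)
    d2 = (d(u + h) - 2 * d(u) + d(u - h)) / h ** 2  # ~ d''(u)
    theta = math.exp(u) / (1.0 + math.exp(u))       # table entry theta(upsilon)
    info = math.exp(u) / (1.0 + math.exp(u)) ** 2   # table entry I(upsilon)
    assert abs(d1 - theta) < 1e-8
    assert abs(d2 - info) < 1e-4
print("Bernoulli row of Table 2.1 verified numerically:", True)
```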
Exercise 2.11.2. Check (iii) and (iv) in Lemma 2.11.5.
Exercise 2.11.3. Check the entries of Table 2.1.
Exercise 2.11.4. Check that $K_c(\upsilon,\upsilon') = K\big(\theta(\upsilon), \theta(\upsilon')\big)$.
Exercise 2.11.5. Plot $K_c(\upsilon^*,\upsilon)$ as a function of $\upsilon$ for the families from Table 2.1.
Maximum likelihood estimation for an EFc. The structure of the log-likelihood in the case of the canonical parametrization is particularly simple:
\[ L(\upsilon) = \sum_{i=1}^{n}\log p(Y_i,\upsilon) = \upsilon\sum_{i=1}^{n}Y_i - d(\upsilon)\sum_{i=1}^{n}1 + \sum_{i=1}^{n}\log p(Y_i) = S\upsilon - nd(\upsilon) + R, \]
where
\[ S = \sum_{i=1}^{n}Y_i, \qquad R = \sum_{i=1}^{n}\log p(Y_i). \]
Again, as in the case of an EFn, we can ignore the remainder term $R$. The estimating equation $dL(\upsilon)/d\upsilon = 0$ for the maximum likelihood estimate $\tilde\upsilon$ reads as
\[ d'(\tilde\upsilon) = S/n. \]
This and the relation $\theta = d'(\upsilon)$ lead to the following result.
Theorem 2.11.6. The maximum likelihood estimates $\tilde\theta$ and $\tilde\upsilon$ for the natural and canonical parametrizations are related by the equations
\[ \tilde\theta = d'(\tilde\upsilon), \qquad \tilde\upsilon = C(\tilde\theta). \]
The next result describes the structure of the fitted log-likelihood and basically repeats the result of Theorem 2.11.3.
Theorem 2.11.7. Let $(P_\upsilon)$ be an EF with canonical parametrization. Then for any $\upsilon \in U$ the fitted log-likelihood $L(\tilde\upsilon,\upsilon) \stackrel{\rm def}{=} \max_{\upsilon'}L(\upsilon',\upsilon)$ satisfies
\[ L(\tilde\upsilon, \upsilon) = nK_c(\tilde\upsilon, \upsilon). \]
Exercise 2.11.6. Check the statement of Theorem 2.11.7.
2.11.3 Deviation probabilities for the maximum likelihood
Let $Y_1,\ldots,Y_n$ be i.i.d. observations from an EF $\mathcal{P}$. This section presents a probability bound for the fitted likelihood. To be more specific, we assume that $\mathcal{P}$ is canonically parameterized, $\mathcal{P} = (P_\upsilon)$. However, the bound applies to the natural and any other parametrization because the value of the maximum of the likelihood process $L(\theta)$ does not depend on the choice of parametrization. The log-likelihood ratio $L(\upsilon',\upsilon)$ is given by the expression (2.29), and its maximum over $\upsilon'$ leads to the fitted log-likelihood $L(\tilde\upsilon,\upsilon) = nK_c(\tilde\upsilon,\upsilon)$.
Our first result concerns a deviation bound for $L(\tilde\upsilon,\upsilon)$. It utilizes the representation for the fitted log-likelihood given by Theorem 2.11.3. As usual, we assume that the family $\mathcal{P}$ is regular. In addition, we require the following condition.
$(\mathcal{P}c)$ $\mathcal{P} = (P_\upsilon, \upsilon \in U \subseteq I\!R)$ is a regular EF. The parameter set $U$ is convex. The function $d(\upsilon)$ is two times continuously differentiable, and the Fisher information $I(\upsilon) = d''(\upsilon)$ satisfies $I(\upsilon) > 0$ for all $\upsilon$.
The condition $(\mathcal{P}c)$ implies that for any compact set $U_0$ there is a constant $a = a(U_0) > 0$ such that
\[ \big|I(\upsilon_1)/I(\upsilon_2)\big|^{1/2} \le a, \qquad \upsilon_1, \upsilon_2 \in U_0. \]
Theorem 2.11.8. Let the $Y_i$ be i.i.d. from a distribution $P_{\upsilon^*}$ which belongs to an EFc satisfying $(\mathcal{P}c)$. Then for any $z > 0$
\[ IP_{\upsilon^*}\big(L(\tilde\upsilon,\upsilon^*) > z\big) = IP_{\upsilon^*}\big(nK_c(\tilde\upsilon,\upsilon^*) > z\big) \le 2e^{-z}. \]
Proof. The proof is based on two properties of the log-likelihood. The first one is that the expectation of the likelihood ratio is just one: $IE_{\upsilon^*}\exp L(\upsilon,\upsilon^*) = 1$. This and the exponential Markov inequality imply for $z \ge 0$
\[ IP_{\upsilon^*}\big(L(\upsilon,\upsilon^*) \ge z\big) \le e^{-z}. \tag{2.31} \]
The second property is specific to the considered univariate EF and is based on geometric properties of the log-likelihood function: linearity in the observations $Y_i$ and convexity in the parameter $\upsilon$. We formulate this important fact in a separate lemma.
Lemma 2.11.9. Let the EFc $\mathcal{P}$ fulfill $(\mathcal{P}c)$. For given $z$ and any $\upsilon_0 \in U$, there exist two values $\upsilon^+ > \upsilon_0$ and $\upsilon^- < \upsilon_0$ satisfying $K_c(\upsilon^{\pm},\upsilon_0) = z/n$ such that
\[ \{L(\tilde\upsilon,\upsilon_0) > z\} \subseteq \{L(\upsilon^+,\upsilon_0) > z\} \cup \{L(\upsilon^-,\upsilon_0) > z\}. \]
Proof. It holds
\[ \{L(\tilde\upsilon,\upsilon_0) > z\} = \Big\{\sup_\upsilon\big[S(\upsilon - \upsilon_0) - n\big\{d(\upsilon) - d(\upsilon_0)\big\}\big] > z\Big\} \subseteq \Big\{S > \inf_{\upsilon>\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon - \upsilon_0}\Big\} \cup \Big\{-S > \inf_{\upsilon<\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon_0 - \upsilon}\Big\}. \]
Define for every $u > 0$
\[ f(u) = \frac{z + n\big\{d(\upsilon_0 + u) - d(\upsilon_0)\big\}}{u}. \]
This function attains its minimum at a point $u$ satisfying the equation
\[ z/n + d(\upsilon_0 + u) - d(\upsilon_0) - d'(\upsilon_0 + u)\,u = 0 \]
or, equivalently,
\[ K_c(\upsilon_0 + u, \upsilon_0) = z/n. \]
The condition $(\mathcal{P}c)$ provides that there is only one solution $u \ge 0$ of this equation.
Exercise 2.11.7. Check that the equation $K_c(\upsilon_0 + u, \upsilon_0) = z/n$ has only one positive solution for any $z > 0$.
Hint: use that $K_c(\upsilon_0 + u, \upsilon_0)$ is a convex function of $u$ with minimum at $u = 0$.
Now, with $\upsilon^+ = \upsilon_0 + u$, it holds
\[ \Big\{S > \inf_{\upsilon>\upsilon_0}\frac{z + n\big[d(\upsilon) - d(\upsilon_0)\big]}{\upsilon - \upsilon_0}\Big\} = \Big\{S > \frac{z + n\big[d(\upsilon^+) - d(\upsilon_0)\big]}{\upsilon^+ - \upsilon_0}\Big\} \subseteq \{L(\upsilon^+,\upsilon_0) > z\}. \]
Similarly,
\[ \Big\{-S > \inf_{\upsilon<\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon_0 - \upsilon}\Big\} = \Big\{-S > \frac{z + n\big[d(\upsilon^-) - d(\upsilon_0)\big]}{\upsilon_0 - \upsilon^-}\Big\} \subseteq \{L(\upsilon^-,\upsilon_0) > z\} \]
for some $\upsilon^- < \upsilon_0$.
The assertion of the theorem is now easy to obtain. Indeed,
\[ IP_{\upsilon^*}\big(L(\tilde\upsilon,\upsilon^*) \ge z\big) \le IP_{\upsilon^*}\big(L(\upsilon^+,\upsilon^*) \ge z\big) + IP_{\upsilon^*}\big(L(\upsilon^-,\upsilon^*) \ge z\big) \le 2e^{-z}, \]
yielding the result.
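The bound of Theorem 2.11.8 can also be illustrated by simulation. The sketch below (our own example, using the Bernoulli family; the bound holds in any parametrization, see Lemma 2.11.12 below) compares the empirical tail of $nK(\tilde\theta,\theta^*)$ with $2e^{-z}$:

```python
import random
import math

# Monte Carlo sketch (assumed Bernoulli example): check the deviation bound
# P(n K(theta_hat, theta*) > z) <= 2 exp(-z), where K is the Bernoulli
# Kullback-Leibler divergence.
def K(p, q):
    def xlog(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return xlog(p, q) + xlog(1.0 - p, 1.0 - q)

random.seed(5)
n, theta_star, M = 40, 0.3, 20000
excess = []
for _ in range(M):
    theta_hat = sum(random.random() < theta_star for _ in range(n)) / n
    excess.append(n * K(theta_hat, theta_star))  # fitted log-likelihood

for z in (0.5, 1.0, 2.0, 3.0):
    emp_tail = sum(e > z for e in excess) / M
    assert emp_tail <= 2.0 * math.exp(-z)
print("deviation bound 2e^{-z} holds empirically:", True)
```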
Exercise 2.11.8. Let $(P_\upsilon)$ be a Gaussian shift experiment, that is, $P_\upsilon = N(\upsilon, 1)$.
• Check that $L(\tilde\upsilon,\upsilon) = n|\tilde\upsilon - \upsilon|^2/2$;
• Given $z \ge 0$, find the points $\upsilon^+$ and $\upsilon^-$ such that
\[ \{L(\tilde\upsilon,\upsilon^*) > z\} \subseteq \{L(\upsilon^+,\upsilon^*) > z\} \cup \{L(\upsilon^-,\upsilon^*) > z\}. \]
• Plot the mentioned sets $\{\upsilon : L(\tilde\upsilon,\upsilon) > z\}$, $\{\upsilon : L(\upsilon^+,\upsilon) > z\}$, and $\{\upsilon : L(\upsilon^-,\upsilon) > z\}$ as functions of $\upsilon$ for a fixed $S = \sum Y_i$.
Remark 2.11.1. Note that the mentioned result only utilizes the geometric structure of the univariate EFc. The most important feature of the log-likelihood ratio $L(\upsilon,\upsilon^*) = S(\upsilon - \upsilon^*) - n\big\{d(\upsilon) - d(\upsilon^*)\big\}$ is its linearity w.r.t. the stochastic term $S$. This allows us to replace the maximum over the whole set $U$ by the maximum over the set consisting of the two points $\upsilon^{\pm}$. Note that the proof does not rely on the distribution of the observations $Y_i$. In particular, Lemma 2.11.9 continues to hold even within the quasi likelihood approach when $L(\upsilon)$ is not the true log-likelihood. However, the bound (2.31) relies on the nature of $L(\upsilon,\upsilon^*)$. Namely, it utilizes that $IE_{\upsilon^*}\exp\{L(\upsilon^{\pm},\upsilon^*)\} = 1$, which is generally false in the quasi likelihood setup. Nevertheless, the exponential bound can be extended to the quasi likelihood approach under the condition of bounded exponential moments for $L(\upsilon,\upsilon^*)$: for some $\mu > 0$, it should hold $IE\exp\{\mu L(\upsilon,\upsilon^*)\} = C(\mu) < \infty$.
Theorem 2.11.8 yields a simple construction of a confidence interval for the parameter $\upsilon^*$ and the concentration property of the MLE $\tilde\upsilon$.
Theorem 2.11.10. Let the $Y_i$ be i.i.d. from $P_{\upsilon^*} \in \mathcal{P}$ with $\mathcal{P}$ satisfying $(\mathcal{P}c)$.
1. If $z_\alpha$ satisfies $e^{-z_\alpha} \le \alpha/2$, then
\[ \mathcal{E}(z_\alpha) = \big\{\upsilon : nK_c(\tilde\upsilon, \upsilon) \le z_\alpha\big\} \]
is an $\alpha$-confidence set for the parameter $\upsilon^*$.
2. Define for any $z > 0$ the set $\mathcal{A}(z,\upsilon^*) = \{\upsilon : K_c(\upsilon,\upsilon^*) \le z/n\}$. Then
\[ IP_{\upsilon^*}\big(\tilde\upsilon \notin \mathcal{A}(z,\upsilon^*)\big) \le 2e^{-z}. \]
The second assertion of the theorem claims that the estimate $\tilde\upsilon$ belongs with high probability to the vicinity $\mathcal{A}(z,\upsilon^*)$ of the central point $\upsilon^*$ defined via the Kullback-Leibler divergence. Due to Lemma 2.11.5(iii), $K_c(\upsilon,\upsilon^*) \approx I(\upsilon^*)(\upsilon - \upsilon^*)^2/2$, where $I(\upsilon^*)$ is the Fisher information at $\upsilon^*$. Hence this vicinity is an interval around $\upsilon^*$ of length of order $n^{-1/2}$. In other words, this result implies the root-n consistency of $\tilde\upsilon$.
The deviation bound for the fitted log-likelihood from Theorem 2.11.8 can be viewed as a bound for the normalized loss of the estimate $\tilde\upsilon$. Indeed, define the loss function $\wp(\upsilon',\upsilon) = K_c^{1/2}(\upsilon',\upsilon)$. Then Theorem 2.11.8 yields that the loss is with high probability bounded by $\sqrt{z/n}$, provided that $z$ is sufficiently large. Similarly one can establish a bound for the risk.
Theorem 2.11.11. Let the $Y_i$ be i.i.d. from the distribution $P_{\upsilon^*}$ which belongs to a canonically parameterized EF satisfying $(\mathcal{P}c)$. The following properties hold:
(i). For any $r > 0$ there is a constant $\mathfrak{r}_r$ such that
\[ IE_{\upsilon^*}L^r(\tilde\upsilon,\upsilon^*) = n^r IE_{\upsilon^*}K_c^r(\tilde\upsilon,\upsilon^*) \le \mathfrak{r}_r. \]
(ii). For every $\lambda < 1$
\[ IE_{\upsilon^*}\exp\big\{\lambda L(\tilde\upsilon,\upsilon^*)\big\} = IE_{\upsilon^*}\exp\big\{\lambda nK_c(\tilde\upsilon,\upsilon^*)\big\} \le (1+\lambda)/(1-\lambda). \]
Proof. By Theorem 2.11.8
\[ IE_{\upsilon^*}L^r(\tilde\upsilon,\upsilon^*) = -\int_{z\ge0}z^r\,dIP_{\upsilon^*}\big\{L(\tilde\upsilon,\upsilon^*) > z\big\} = r\int_{z\ge0}z^{r-1}IP_{\upsilon^*}\big\{L(\tilde\upsilon,\upsilon^*) > z\big\}\,dz \le r\int_{z\ge0}2z^{r-1}e^{-z}\,dz, \]
and the first assertion is fulfilled with $\mathfrak{r}_r = 2r\int_{z\ge0}z^{r-1}e^{-z}\,dz$. The assertion (ii) is proved similarly.
Deviation bounds for other parametrizations. The results for the maximum likelihood and their corollaries have been stated for an EFc. An immediate question that arises in this respect is whether the use of the canonical parametrization is essential. The answer is "no": a similar result can be stated for any EF, whatever parametrization is used. This fact is based on the simple observation that the maximum likelihood is the value of the maximum of the likelihood process; this value does not depend on the parametrization.
Lemma 2.11.12. Let $(P_\theta)$ be an EF. Then for any $\theta$
\[ L(\tilde\theta,\theta) = nK(P_{\tilde\theta}, P_\theta). \tag{2.32} \]
Exercise 2.11.9. Check the result of Lemma 2.11.12.
Hint: use that both sides of (2.32) depend only on the measures $P_{\tilde\theta}, P_\theta$ and not on the parametrization.
Below we write, as before, $K(\tilde\theta,\theta)$ instead of $K(P_{\tilde\theta}, P_\theta)$. The property (2.32) and the exponential bound of Theorem 2.11.8 imply the bound for a general EF:
Theorem 2.11.13. Let $(P_\theta)$ be a univariate EF. Then for any $z > 0$
\[ IP_{\theta^*}\big(L(\tilde\theta,\theta^*) > z\big) = IP_{\theta^*}\big(nK(\tilde\theta,\theta^*) > z\big) \le 2e^{-z}. \]
This result allows us to build confidence sets for the parameter $\theta^*$ and concentration sets for the MLE $\tilde\theta$ in terms of the Kullback-Leibler divergence:
\[ \mathcal{A}(z,\theta^*) = \{\theta : K(\theta,\theta^*) \le z/n\}, \qquad \mathcal{E}(z) = \{\theta : K(\tilde\theta,\theta) \le z/n\}. \]
Corollary 2.11.14. Let $(P_\theta)$ be an EF. If $e^{-z_\alpha} = \alpha/2$, then
\[ IP_{\theta^*}\big(\tilde\theta \notin \mathcal{A}(z_\alpha,\theta^*)\big) \le \alpha, \qquad IP_{\theta^*}\big(\mathcal{E}(z_\alpha) \not\ni \theta^*\big) \le \alpha. \]
Moreover, for any $r > 0$
\[ IE_{\theta^*}L^r(\tilde\theta,\theta^*) = n^r IE_{\theta^*}K^r(\tilde\theta,\theta^*) \le \mathfrak{r}_r. \]
Asymptotic against likelihood-based approach The asymptotic approach recommends
applying symmetric confidence and concentration sets with width of order
[nI(θ∗)]^{−1/2} :

An(z, θ∗) = {θ : I(θ∗) (θ − θ∗)² ≤ 2z/n},
En(z) = {θ : I(θ∗) (θ − θ̃)² ≤ 2z/n},
E′n(z) = {θ : I(θ̃) (θ − θ̃)² ≤ 2z/n}.

Asymptotically, i.e. for large n , these sets do approximately the same job as the
non-asymptotic sets A(z, θ∗) and E(z) . However, the difference for finite samples can
be quite significant. In particular, in some cases, e.g. the Bernoulli or Poisson families,
the sets An(z, θ∗) and E′n(z) may extend beyond the parameter set Θ .
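This boundary effect is easy to see numerically. The following sketch (θ∗, n, and z are illustrative values) computes both sets for a Bernoulli family with θ∗ close to zero: the asymptotic set An(z, θ∗) sticks out below zero, whereas the KL-based set stays inside Θ = (0, 1) by construction:

```python
import numpy as np

# Compare the KL-based set A(z, theta*) with its asymptotic counterpart
# A_n(z, theta*) for a Bernoulli family.  theta*, n and z are illustrative.
theta_star, n, z = 0.05, 20, 3.0

def kl_bernoulli(t, s):
    """K(P_t, P_s) for Bernoulli laws with success probabilities t and s."""
    return t * np.log(t / s) + (1 - t) * np.log((1 - t) / (1 - s))

# Asymptotic set: I(theta*) (theta - theta*)^2 <= 2z/n with I(t) = 1/(t(1-t)).
half_width = np.sqrt(2 * z * theta_star * (1 - theta_star) / n)
asym_lo, asym_hi = theta_star - half_width, theta_star + half_width

# KL-based set {theta : K(theta, theta*) <= z/n}, located on a grid in (0,1).
grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
inside = grid[kl_bernoulli(grid, theta_star) <= z / n]
kl_lo, kl_hi = inside.min(), inside.max()

print(f"asymptotic set: [{asym_lo:.3f}, {asym_hi:.3f}]  (extends below 0)")
print(f"KL-based set:   [{kl_lo:.3f}, {kl_hi:.3f}]  (stays inside (0,1))")
```

Note also that the two sets are asymmetric around θ∗ in quite different ways, which is exactly the finite-sample discrepancy discussed above.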
Chapter 3
Regression Estimation
This chapter discusses the estimation problem for the regression model. First the linear
regression model is considered, then generalized linear modeling is discussed. We also
mention median and quantile regression.
3.1 Regression model
The (mean) regression model can be written in the form IE(Y |X) = f(X) , or equiva-
lently,
Y = f(X) + ε, (3.1)
where Y is the dependent (explained) variable and X is the explanatory variable (regressor),
which can be multidimensional. The target of analysis is the systematic dependence
of the explained variable Y on the explanatory variable X . The regression function
f describes the mean of Y as a function of X . The value ε can be
treated as an individual deviation (error). It is usually assumed to be random with zero
mean. Below we discuss the components of the regression model (3.1) in more detail.
3.1.1 Observations
In almost all practical situations, regression analysis is performed on the basis of available
data (observations) given in the form of a sample of pairs (Xi, Yi) for i = 1, . . . , n , where
n is the sample size. Here Y1, . . . , Yn are observed values of the regression variable Y
and X1, . . . , Xn are the corresponding values of the explanatory variable X . For each
observation Yi , the regression model reads as:
Yi = f(Xi) + εi
where εi is the individual error of the i th observation.
3.1.2 Design
The set X1, . . . , Xn of the regressor’s values is called a design. The set X of all possible
values of the regressor X is called the design space. If this set X is compact, then one
speaks of a compactly supported design.
The nature of the design can be different for different statistical models. However,
it is important to mention that the design is always observable. Two kinds of design
assumptions are usually used in statistical modeling. A deterministic design assumes
that the points X1, . . . , Xn are nonrandom and given in advance. Here are typical
examples:
Example 3.1.1. [Time series] Let Yt0 , Yt0+1, . . . , YT be a time series. The time points
t0, t0 + 1, . . . , T build a regular deterministic design. The regression function f explains
the trend of the time series Yt as a function of time.
Example 3.1.2. [Imaging] Let Yij be the observed grey value at the pixel (i, j) of an
image. The coordinate Xij of this pixel is the corresponding design value. The regression
function f(Xij) gives the true image value at Xij which is to be recovered from the
noisy observations Yij .
If the design is supported on a cube in IRd and the design points Xi form a grid in
this cube, then the design is called equidistant. An important feature of such a design
is that the number NA of design points in any “massive” subset A of the unit cube is
nearly the volume VA of this subset multiplied by the sample size n : NA ≈ nVA . Design
regularity means that the value NA is nearly proportional to nVA , that is, NA ≈ cnVA
for some positive constant c which may depend on the set A .
In some applications, it is natural to assume that the design values Xi are randomly
drawn from some design distribution. Typical examples are given by sociological studies.
In this case one speaks of a random design. The design values X1, . . . , Xn are assumed
to be independent and identically distributed from a law PX on the design space X
which is a subset of the Euclidean space IRd . The design variables X are also assumed
to be independent of the observations Y .
One special case of random design is the uniform design when the design distribution
is uniform on the unit cube in IRd . The uniform design possesses an important property
similar to that of an equidistant design: the number of design points in a “massive” subset of
the unit cube is on average close to the volume of this set multiplied by n . The random
design is called regular on X if the design distribution is absolutely continuous with
respect to the Lebesgue measure and the design density p(x) = dPX(x)/dλ is positive
and continuous on X . This again ensures with a probability close to one the regularity
property NA ≈ cnVA with c = p(x) for some x ∈ A .
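The regularity property NA ≈ nVA for a uniform design is easy to check empirically; in the sketch below the box A and the sample size are arbitrary choices:

```python
import numpy as np

# Empirical check of the regularity property N_A ~ n * Vol(A) for a uniform
# random design on the unit square.  The subset A below is an arbitrary box.
rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(size=(n, 2))            # uniform design on [0,1]^2

a_lo, a_hi = np.array([0.2, 0.3]), np.array([0.5, 0.7])
vol_A = np.prod(a_hi - a_lo)            # Vol(A) = 0.3 * 0.4 = 0.12
N_A = np.sum(np.all((X >= a_lo) & (X <= a_hi), axis=1))

print(f"N_A = {N_A}, n * Vol(A) = {n * vol_A:.0f}")
```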
It is worth mentioning that the case of a random design can be reduced to the case of
a deterministic design by considering the conditional distribution of the data given the
design variables X1, . . . , Xn .
3.1.3 Errors
The decomposition of the observed response variable Y into the systematic component
f(x) and the error ε in the model equation (3.1) is not uniquely defined and cannot be
done without some assumptions on the errors εi . The standard approach is to assume
that the mean value of every εi is zero. Equivalently this means that the expected
value of the observation Yi is just the regression function f(Xi) . This case is called
mean regression or simply regression. It is usually assumed that the errors εi have finite
second moments. The homogeneous errors case means that all the errors εi have the same
variance σ² = Var εi . The variance of heterogeneous errors εi may vary with i . In
many applications not only the systematic component f(Xi) = IEYi but also the error
variance VarYi = Var εi depend on the regressor (location) Xi . Such models are often
written in the form
Yi = f(Xi) + σ(Xi)εi .
The observation (noise) variance σ2(x) can be the target of analysis similarly to the
mean regression function.
The assumption of zero mean noise, IEεi = 0 , is very natural and has a clear in-
terpretation. However, in some applications, it can cause trouble, especially if data are
contaminated by outliers. In this case, the assumption of a zero mean can be replaced by
a more robust assumption of a zero median. This leads to the median regression model
which assumes IP(εi ≤ 0) = 1/2 , or, equivalently,

IP( Yi − f(Xi) ≤ 0 ) = 1/2.
A further important assumption concerns the joint distribution of the errors εi . In the
majority of applications the errors are assumed to be independent. However, in some
situations, the dependence of the errors is quite natural. One example can be given by
time series analysis. The errors εi are defined as the difference between the observed
values Yi and the trend function fi at the i th time moment. These errors are often
serially correlated and indicate short or long range dependence. Another example comes
from imaging. The neighboring observations in an image are often correlated due to the
imaging technique used for recording the images. The correlation particularly results from
the automatic movement correction.
For theoretical study one often assumes that the errors εi are not only independent
but also identically distributed. This, of course, yields a homogeneous noise. The theo-
retical study can be simplified even further if the error distribution is normal. This case
is called Gaussian regression and is denoted as εi ∼ N(0, σ2) . This assumption is very
useful and greatly simplifies the theoretical study. The main advantage of Gaussian noise
is that the observations and their linear combinations are also normally distributed. This
is an exclusive property of the normal law which helps to simplify the exposition and
avoid technicalities.
Under the given distribution of the errors, the joint distribution of the observations
Yi is determined by the regression function f(·) .
3.1.4 Regression function
By the equation (3.1), the regression variable Y can be decomposed into a systematic
component and a (random) error ε . The systematic component is a deterministic func-
tion f of the explanatory variable X called the regression function. Classical regression
theory considers the case of linear dependence, that is, one fits a linear relation between
Y and X :
f(x) = a+ bx
leading to the model equation
Yi = θ1 + θ2Xi + εi .
Here θ1 and θ2 are the parameters of the linear model. If the regressor x is multidimensional,
then θ2 is a vector from IRd and θ2x is understood as the scalar product of the two vectors.
In many practical examples the assumption of linear dependence is too restrictive. It can
be extended in several ways. One can try a more sophisticated functional dependence
of Y on X , for instance polynomial. More generally, one can assume that the regression
function f is known up to the finite-dimensional parameter θ = (θ1, . . . , θp)> ∈ IRp .
This situation is called parametric regression and denoted by f(·) = fθ(·) . If the func-
tion fθ depends on θ linearly, that is, fθ(x) = θ1ψ1(x) + . . .+ θpψp(x) for some given
functions ψ1, . . . , ψp , then the model is called linear regression. An important special
case is given by polynomial regression when f(x) is a polynomial function of degree
p− 1 : f(x) = θ1 + θ2x+ . . .+ θpxp−1 .
In many applications a parametric form of the regression function cannot be justified.
Then one speaks of nonparametric regression.
3.2 Method of substitution and M-estimation
Observe that the parametric regression equation can be rewritten as
εi = Yi − f(Xi,θ∗).
If θ̃ is an estimate of the parameter θ∗ , then the residuals ε̃i = Yi − f(Xi, θ̃) are
estimates of the individual errors εi . So, the idea of the method is to select the parameter
estimate θ̃ in such a way that the empirical distribution Pn of the residuals ε̃i mimics as well
as possible certain prescribed features of the error distribution. We consider one approach
called minimum contrast or M-estimation. Let ψ(y) be an influence or contrast function.
The main condition on the choice of this function is that

IE ψ(εi + z) ≥ IE ψ(εi)

for all i = 1, . . . , n and all z . Then the true value θ∗ clearly minimizes the expectation
of the sum ∑i ψ( Yi − f(Xi, θ) ):

θ∗ = argminθ IE ∑i ψ( Yi − f(Xi, θ) ).

This leads to the M-estimate

θ̃ = argminθ ∑i ψ( Yi − f(Xi, θ) ).
This estimation method can be treated as replacing the true error distribution by
the empirical distribution of the residuals.
We specify this approach for regression estimation by the classical examples of least
squares, least absolute deviation and maximum likelihood estimation corresponding to
ψ(x) = x² , ψ(x) = |x| and ψ(x) = −log p(x) , where p(x) is the error density. All these
examples belong within the framework of M-estimation and the quasi maximum likelihood
approach.
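The following sketch illustrates the M-estimation principle in the simplest location model f(Xi, θ) ≡ θ on simulated, outlier-contaminated data: the contrast ψ(x) = x² reproduces the sample mean, while ψ(x) = |x| reproduces the more robust sample median. The grid minimization is a crude stand-in for a proper optimizer:

```python
import numpy as np

# M-estimation in the location model f(X_i, theta) = theta: the contrast
# psi(x) = x^2 yields the sample mean, psi(x) = |x| the sample median.
# Illustrative sketch on simulated, outlier-contaminated data.
rng = np.random.default_rng(2)
Y = np.concatenate([rng.normal(1.0, 0.5, size=99), [50.0]])  # one gross outlier

def m_estimate(Y, psi):
    """Minimize sum_i psi(Y_i - theta) over a fine grid of theta values."""
    grid = np.linspace(Y.min(), Y.max(), 50_001)
    contrast = psi(Y[None, :] - grid[:, None]).sum(axis=1)
    return grid[np.argmin(contrast)]

lse = m_estimate(Y, lambda x: x ** 2)   # least squares contrast -> mean
lad = m_estimate(Y, np.abs)             # least absolute deviation -> median

print(f"LSE {lse:.3f} vs mean {Y.mean():.3f}")
print(f"LAD {lad:.3f} vs median {np.median(Y):.3f}")
```

The outlier drags the least squares solution away from the bulk of the data, while the absolute deviation contrast is barely affected.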
3.2.1 Mean regression. Least squares estimate
The observations Yi are assumed to follow the model
Yi = f(Xi,θ∗) + εi , IEεi = 0 (3.2)
with an unknown target θ∗ . Suppose in addition that σi² = IEεi² < ∞ . Then for every
θ ∈ Θ and every i ≤ n , due to (3.2),

IEθ∗{ Yi − f(Xi, θ) }² = IEθ∗{ εi + f(Xi, θ∗) − f(Xi, θ) }² = σi² + | f(Xi, θ∗) − f(Xi, θ) |².

This yields for the whole sample

IEθ∗ ∑{ Yi − f(Xi, θ) }² = ∑{ σi² + | f(Xi, θ∗) − f(Xi, θ) |² }.

This expression is clearly minimized at θ = θ∗ . This leads to the idea of estimating the
parameter θ∗ by minimizing its empirical counterpart. The resulting estimate is called
the (ordinary) least squares estimate (LSE):

θ̃LSE = argminθ ∑{ Yi − f(Xi, θ) }².

This estimate is very natural and requires minimal information about the errors εi : just
IEεi = 0 and IEεi² < ∞ .
3.2.2 Median regression. Least absolute deviation estimate
Consider the same regression model as in (3.2), but now the errors εi are not assumed
to have zero mean. Instead we assume that their median is zero:

Yi = f(Xi, θ∗) + εi , med(εi) = 0.

As previously, the target of estimation is the parameter θ∗ . Observe that εi = Yi − f(Xi, θ∗)
and hence, the latter r.v. has median zero. We now use the following simple
fact: if med(ε) = 0 , then for any z ≠ 0

IE|ε + z| ≥ IE|ε|. (3.3)

Exercise 3.2.1. Prove (3.3).

The property (3.3) implies for every θ

IEθ∗ ∑| Yi − f(Xi, θ) | ≥ IEθ∗ ∑| Yi − f(Xi, θ∗) |,

that is, θ∗ minimizes over θ the expectation under the true measure of the sum
∑| Yi − f(Xi, θ) | . This leads to the empirical counterpart of θ∗ given by

θ̃ = argminθ∈Θ ∑| Yi − f(Xi, θ) |.
3.2.3 Maximum likelihood regression
Let the density function p(·) of the errors εi be known. The regression equation (3.2)
implies εi = Yi − f(Xi, θ∗) . Therefore, every Yi has the density p(y − f(Xi, θ∗)) . Independence
of the Yi ’s implies the product structure of the density of the joint distribution:
∏ p(yi − f(Xi, θ)) , yielding the log-likelihood

L(θ) = ∑ ℓ( Yi − f(Xi, θ) )

with ℓ(t) = log p(t) . The MLE is the point of maximum of L(θ) :

θ̃ = argmaxθ L(θ) = argmaxθ ∑ ℓ( Yi − f(Xi, θ) ).
A closed form solution for this equation exists only in some special cases like linear
Gaussian regression. Otherwise this equation has to be solved numerically.
Consider an important special case corresponding to i.i.d. Gaussian errors, when
p(y) is the density of the normal law with mean zero and variance σ² . Then

L(θ) = −(n/2) log(2πσ²) − (2σ²)^{−1} ∑| Yi − f(Xi, θ) |².

The corresponding MLE maximizes L(θ) or, equivalently, minimizes the sum
∑| Yi − f(Xi, θ) |² :

θ̃ = argmaxθ∈Θ L(θ) = argminθ∈Θ ∑| Yi − f(Xi, θ) |².
This estimate has already been introduced as the ordinary least squares estimate (oLSE).
A small extension of the previous example is given by inhomogeneous Gaussian regression,
when the errors εi are independent Gaussian zero-mean but the variances depend
on i : IEεi² = σi² . Then the log-likelihood L(θ) is given by the sum

L(θ) = ∑{ −| Yi − f(Xi, θ) |² / (2σi²) − (1/2) log(2πσi²) }.

Maximizing this expression w.r.t. θ is equivalent to minimizing the weighted sum
∑ σi^{−2} | Yi − f(Xi, θ) |² :

θ̃ = argminθ ∑ σi^{−2} | Yi − f(Xi, θ) |².

Such an estimate is also called the weighted least squares estimate (wLSE).
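Stacking the feature vectors into a p × n matrix Ψ (the notation of Section 3.3), the wLSE admits the closed-form solution ( ΨWΨ> )^{−1} ΨWY with W = diag(σ1^{−2}, . . . , σn^{−2}). A minimal numerical sketch on simulated heteroscedastic data (the model and noise specification are illustrative):

```python
import numpy as np

# Weighted least squares for an inhomogeneous Gaussian linear model
# Y_i = Psi_i^T theta* + sigma_i eps_i (simulated data; a sketch only).
rng = np.random.default_rng(3)
n, theta_true = 200, np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=n)
Psi = np.vstack([np.ones(n), X])          # p x n matrix of feature vectors
sigma = 0.1 + np.abs(X)                   # error standard deviation depends on X_i
Y = Psi.T @ theta_true + sigma * rng.normal(size=n)

# The wLSE minimizes sum_i sigma_i^{-2} |Y_i - Psi_i^T theta|^2; closed-form
# solution (Psi W Psi^T)^{-1} Psi W Y with W = diag(sigma_i^{-2}).
W = np.diag(sigma ** -2)
theta_wlse = np.linalg.solve(Psi @ W @ Psi.T, Psi @ W @ Y)
print("wLSE:", theta_wlse)
```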
Another example corresponds to the case when the errors εi are i.i.d. double exponential
(Laplace), so that IP(±ε1 > t) = e^{−t/σ}/2 for t ≥ 0 and some given σ > 0 . Then
p(y) = (2σ)^{−1} e^{−|y|/σ} and

L(θ) = −n log(2σ) − σ^{−1} ∑| Yi − f(Xi, θ) |.

The MLE θ̃ maximizes L(θ) or, equivalently, minimizes the sum ∑| Yi − f(Xi, θ) | :

θ̃ = argmaxθ∈Θ L(θ) = argminθ∈Θ ∑| Yi − f(Xi, θ) |.

So the maximum likelihood regression with Laplacian errors leads back to the least
absolute deviation (LAD) estimate.
3.3 Linear regression
One standard way of modeling the regression relationship is based on a linear expansion
of the regression function. This approach is based on the assumption that the unknown
regression function f(·) can be represented as a linear combination of given basis func-
tions ψ1(·), . . . , ψp(·) :
f(x) = θ1ψ1(x) + . . .+ θpψp(x).
Typical examples are:
Example 3.3.1. [Multivariate linear regression] Let the regressor x = (x1, . . . , xd) be
d -dimensional. The linear regression function f(x) can be written as
f(x) = a+ b1x1 + . . .+ bdxd.
Here we have p = d + 1 and the basis functions are ψ1(x) ≡ 1 and ψm(x) = x_{m−1} for
m = 2, . . . , p . The coefficient a is often called the intercept and b1, . . . , bd are the slope
coefficients. The vector of coefficients θ = (a, b1, . . . , bd)> uniquely describes the linear
relation.
Example 3.3.2. [Polynomial regression] Let x be univariate and f(·) be a polynomial
function of degree p − 1 , that is,

f(x) = θ1 + θ2 x + . . . + θp x^{p−1}.

Then the basis functions are ψ1(x) ≡ 1 , ψ2(x) ≡ x , . . . , ψp(x) ≡ x^{p−1} , while θ =
(θ1, . . . , θp)> is the corresponding vector of coefficients.
Example 3.3.3. [Series expansion] Let ψ1(x), . . . , ψp(x), . . . be a given system of func-
tions. Specific examples are trigonometric (Fourier, cosine), orthogonal polynomial
(Chebyshev, Legendre, Jacobi), and wavelet systems among many others. The com-
pleteness of this system means that a given function f under mild regularity conditions
can be uniquely expanded in the form

f(x) = ∑_{m=1}^{∞} θm ψm(x).

However, such an expansion is intractable because it involves infinitely many coefficients
θm . A standard procedure is to truncate this expansion after the first p terms, leading
to the approximation

f(x) ≈ ∑_{m=1}^{p} θm ψm(x). (3.4)

Such an approximation becomes better and better as the number p of terms grows, but
then one has to estimate more and more coefficients. The choice of a proper truncation value
p is one of the central problems in nonparametric function estimation. The parametric
approach simply assumes that the value p is fixed and the approximation (3.4) is treated
as an exact equality: f(x) ≡ θ1ψ1(x) + . . . + θpψp(x) .
Exercise 3.3.1. Let the regressor x be d -dimensional, x = (x1, . . . , xd)> . Describe
the basis system and the corresponding vector of coefficients for the case when f is a
quadratic function of x .
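As an illustration of how such basis systems look in the multivariate case: for d = 2 a quadratic f is linear in the p = 6 basis functions {1, x1, x2, x1², x1x2, x2²}. A sketch of the corresponding feature map (the function name is ours):

```python
import numpy as np

# A quadratic regression function of a d-dimensional regressor is linear in
# the basis {1, x_1,...,x_d, x_j x_k for j <= k}; sketch for d = 2, where
# p = 1 + d + d(d+1)/2 = 6.
def quadratic_basis(x):
    """Return the basis vector (1, x1, x2, x1^2, x1*x2, x2^2) at x in IR^2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

p = quadratic_basis(np.array([0.5, -1.0])).size
print("number of coefficients p =", p)
```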
Linear regression is often described using vector-matrix notation. Let Ψi be the
vector in IRp whose entries are the values ψm(Xi) of the basis functions at the design
point Xi , m = 1, . . . , p . Then f(Xi) = Ψi>θ∗ , and the linear regression model can be
written as

Yi = Ψi>θ∗ + εi , i = 1, . . . , n.

Denote by Y = (Y1, . . . , Yn)> the vector of observations (responses), and by ε = (ε1, . . . , εn)>
the vector of errors. Let finally Ψ be the p × n matrix with columns Ψ1, . . . , Ψn , that
is, Ψ = ( ψm(Xi) )_{m=1,...,p; i=1,...,n} . Note that each row of Ψ is composed of the values of the
corresponding basis function ψm at the design points Xi . Now the regression equation
reads as

Y = Ψ>θ∗ + ε.
The estimation problem for this linear model will be discussed in detail in Chapter 4.
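A short numerical sketch of this notation: build Ψ for the polynomial basis ψm(x) = x^{m−1} and fit Y = Ψ>θ + ε by least squares. The fit anticipates Chapter 4, and the data are simulated:

```python
import numpy as np

# Build the p x n matrix Psi for the polynomial basis psi_m(x) = x^{m-1}
# and fit Y = Psi^T theta + eps by least squares (simulated data; a sketch
# of the estimation problem treated in Chapter 4).
rng = np.random.default_rng(4)
n, p = 100, 3
X = np.sort(rng.uniform(0, 1, size=n))
theta_true = np.array([1.0, -2.0, 3.0])

Psi = np.vstack([X ** m for m in range(p)])      # row m holds psi_{m+1} at the design points
Y = Psi.T @ theta_true + 0.05 * rng.normal(size=n)

theta_hat, *_ = np.linalg.lstsq(Psi.T, Y, rcond=None)
print("estimated coefficients:", theta_hat)
```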
3.3.1 Projection estimation
3.3.2 Piecewise linear estimation
3.3.3 Spline estimation
3.3.4 Wavelet estimation
3.3.5 Kernel estimation
3.4 Density function estimation
3.4.1 Linear projection estimation
3.4.2 Wavelet density estimation
3.4.3 Kernel density estimation
3.4.4 Estimation based on Fourier transformation
3.5 Generalized regression
Let the response Yi be observed at the design point Xi ∈ IRd , i = 1, . . . , n . A (mean)
regression model assumes that the observed values Yi are independent and can be de-
composed into the systematic component f(Xi) and the individual centered stochastic
error εi . In some cases such a decomposition is questionable. This especially concerns
the case when the data Yi are categorical, e.g. binary or discrete. Another striking
example is given by nonnegative observations Yi . In such cases one usually assumes
that the distribution of Yi belongs to some given parametric family (Pυ, υ ∈ U) and
only the parameter of this distribution depends on the design point Xi . We denote this
parameter value as f(Xi) ∈ U and write the model in the form
Yi ∼ Pf(Xi) .
As previously, f(·) is called a regression function and its values at the design points Xi
completely specify the joint data distribution:
Y ∼ ∏i P_{f(Xi)} .
Below we assume that (Pυ) is a univariate exponential family with the log-density
`(y, υ) .
The parametric modeling approach assumes that the regression function f can be
specified by a finite-dimensional parameter θ ∈ Θ ⊂ IRp : f(x) = f(x,θ) . As usual, by
θ∗ we denote the true parameter value. The log-likelihood function for this model reads
as

L(θ) = ∑i ℓ( Yi, f(Xi, θ) ).

The corresponding MLE θ̃ maximizes L(θ) :

θ̃ = argmaxθ ∑i ℓ( Yi, f(Xi, θ) ).

The estimating equation ∇L(θ) = 0 reads as

∑i ℓ′( Yi, f(Xi, θ) ) ∇f(Xi, θ) = 0,

where ℓ′(y, υ) := ∂ℓ(y, υ)/∂υ .
The approach essentially depends on the parametrization of the considered EF. Usually
one applies either the natural or canonical parametrization. In the case of the
natural parametrization, ℓ(y, υ) = C(υ)y − B(υ) , where the functions C(·), B(·) satisfy
B′(υ) = υC′(υ) . This implies ℓ′(y, υ) = yC′(υ) − B′(υ) = (y − υ)C′(υ) , and the
estimating equation reads as

∑i ( Yi − f(Xi, θ) ) C′( f(Xi, θ) ) ∇f(Xi, θ) = 0.
Unfortunately, a closed form solution for this equation exists only in very special cases.
Even the questions of existence and uniqueness of the solution cannot be studied in
full generality. Some numerical algorithms are usually applied to solve the estimating
equation.
Exercise 3.5.1. Specify the estimating equation for generalized EFn regression and find
the solution for the case of the constant regression function f(Xi, θ) ≡ θ .
Hint: If f(Xi, θ) ≡ θ , then the Yi are i.i.d. from Pθ .
The canonical parametrization is often applied in combination with linear modeling
of the regression function. If (Pυ) is an EFc with the log-density ℓ(y, υ) = yυ − d(υ) ,
then the log-likelihood L(θ) can be represented in the form

L(θ) = ∑i { Yi f(Xi, θ) − d( f(Xi, θ) ) }.

The corresponding estimating equation is

∑i { Yi − d′( f(Xi, θ) ) } ∇f(Xi, θ) = 0.
Exercise 3.5.2. Specify the estimating equation for generalized EFc regression and find
the solution for the case of constant regression with f(Xi, υ) ≡ υ . Relate the natural
and canonical representation.
3.6 Generalized linear models
Consider the generalized regression model
Yi ∼ Pf(Xi) ∈ P.
In addition we assume a linear (in parameters) structure of the regression function f(X) .
Such modeling is particularly useful to combine with the canonical parametrization of
the considered EF with the log-density `(y, υ) = yυ − d(υ) . The reason is that the
stochastic part in the log-likelihood of an EFc linearly depends on the parameter. So,
below we assume that P = (Pυ, υ ∈ U) is an EFc.
Linear regression f(Xi) = Ψi>θ with given feature vectors Ψi ∈ IRp leads to the
model with the log-likelihood

L(θ) = ∑i { Yi Ψi>θ − d( Ψi>θ ) }.

Such a setup is called a generalized linear model (GLM). Note that the log-likelihood can
be represented as

L(θ) = S>θ − A(θ),

where

S = ∑i Yi Ψi , A(θ) = ∑i d( Ψi>θ ).
The corresponding MLE θ maximizes L(θ) . Again, a closed form solution only exists
in special cases. However, an important advantage of the GLM approach is that the
solution always exists and is unique. The reason is that the log-likelihood function L(θ)
is concave in θ .
Lemma 3.6.1. The MLE θ̃ solves the following estimating equation:

∇L(θ) = S − ∇A(θ) = ∑_{i=1}^{n} ( Yi − d′(Ψi>θ) ) Ψi = 0. (3.5)

The solution exists and is unique.
Proof. Define the matrix

B(θ) = ∑_{i=1}^{n} d′′(Ψi>θ) Ψi Ψi>.

Since d′′(υ) is strictly positive for all υ , the matrix B(θ) is positive definite as well.
It holds

∇²L(θ) = −∇²A(θ) = −∑_{i=1}^{n} d′′(Ψi>θ) Ψi Ψi> = −B(θ).

Thus, the function L(θ) is strictly concave w.r.t. θ and the estimating equation
∇L(θ) = S − ∇A(θ) = 0 has a unique solution θ̃ .
The solution of (3.5) can easily be obtained numerically by the Newton-Raphson
algorithm: select an initial estimate θ^{(0)} ; then for every k ≥ 0 apply

θ^{(k+1)} = θ^{(k)} + { B(θ^{(k)}) }^{−1} { S − ∇A(θ^{(k)}) }

until convergence.
3.6.1 Logit regression for binary data
Suppose that the observed data Yi are independent and binary, that is, each Yi is
either zero or one, i = 1, . . . , n . Such models are often used e.g. in sociological and
medical studies, two-class classification, and binary imaging, among many other fields. We
treat each Yi as a Bernoulli r.v. with the corresponding parameter fi = f(Xi) . This is
a special case of generalized regression, also called a binary response model. The parametric
modeling assumption means that the regression function f(·) can be represented in the
form f(Xi) = f(Xi, θ) for a given class of functions {f(·, θ), θ ∈ Θ ⊂ IRp} . Then the
log-likelihood L(θ) reads as

L(θ) = ∑i ℓ( Yi, f(Xi, θ) ),
where `(y, υ) is the log-density of the Bernoulli law. For linear modeling, it is more
useful to work with the canonical parametrization. Then `(y, υ) = yυ− log(1 + eυ) , and
the log-likelihood reads

L(θ) = ∑i [ Yi f(Xi, θ) − log( 1 + e^{f(Xi,θ)} ) ].

In particular, if the regression function f(·, θ) is linear, that is, f(Xi, θ) = Ψi>θ , then

L(θ) = ∑i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ].
The corresponding estimate reads as

θ̃ = argmaxθ L(θ) = argmaxθ ∑i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ].

This modeling is usually referred to as logit regression.
Exercise 3.6.1. Specify the estimating equation for the case of logit regression.
3.6.2 Poisson regression
Suppose that the observations Yi are nonnegative integer numbers. The Poisson distri-
bution is a natural candidate for modeling such data. It is supposed that the underlying
Poisson parameter depends on the regressor Xi . Typical examples arise in different
types of imaging including medical positron emission and magnet resonance tomography,
satellite and low-luminosity imaging, queueing theory, high frequency trading, etc. (to
be continued).
3.7 Quasi Maximum Likelihood estimation
This section very briefly discusses an extension of the maximum likelihood approach. A
more detailed discussion will be given in the context of linear modeling in Chapter 4. To be
specific, consider a regression model
Yi = f(Xi) + εi.
The maximum likelihood approach requires specifying the two main ingredients of this
model: a parametric class {f(x, θ), θ ∈ Θ} of regression functions and the distribution
of the errors εi . Sometimes such information is lacking. One or even both modeling
assumptions can be misspecified. In such situations one speaks of a quasi maximum
likelihood approach, where the estimate θ̃ is defined via maximizing over θ the random
function L(θ) even though it is not necessarily the real log-likelihood. Some examples
of this approach have already been given.
Below we distinguish between misspecification of the first and second kind. The first
kind corresponds to the parametric assumption about the regression function: one assumes
the equality f(Xi) = f(Xi, θ∗) for some θ∗ ∈ Θ . In reality one can only expect
a reasonable quality of approximation of f(·) by f(·, θ∗) . A typical example is given by
linear (polynomial) regression. The linear structure of the regression function is useful
and tractable but it can only be a rough approximation of the real relation between Y
and X . The quasi maximum likelihood approach suggests to ignore this misspecification
and proceed as if the parametric assumption is fulfilled. This approach raises a number
of questions: what is the target of estimation and what is really estimated by such
a quasi ML procedure? In Chapter 4 we show in the context of linear modeling that
the target of estimation can be naturally defined as the parameter θ† providing the best
approximation of the true regression function f(·) by its parametric counterpart f(·,θ) .
The second kind of misspecification concerns the assumption about the errors εi . In
most applications, the distribution of the errors is unknown. Moreover, the errors can
be dependent or non-identically distributed. The assumption of a specific i.i.d. structure
leads to a model misspecification and thus to the quasi maximum likelihood approach.
We illustrate this situation by a few examples.
Consider the regression model Yi = f(Xi, θ∗) + εi and suppose for a moment that
the errors εi are i.i.d. normal. Then the principal term of the corresponding log-likelihood
is given by the negative sum of the squared residuals: −∑| Yi − f(Xi, θ) |² , and
its maximization leads to the least squares method. So, one can say that the LSE method
is the quasi MLE when the errors are assumed to be i.i.d. normal. That is, the LSE can
be obtained as the MLE for the imaginary Gaussian regression model even when the errors
εi are not necessarily i.i.d. Gaussian.
If the data are contaminated or the errors have heavy tails, it could be unwise to
apply the LSE method. The LAD method is known to be more robust against outliers
and data contamination. At the same time, it has already been shown in Section 3.2.3
that the LAD estimate is the MLE when the errors are Laplacian (double exponential).
In other words, LAD is the quasi MLE for the model with Laplacian errors.
Inference for the quasi ML approach is discussed in detail in Chapter 4 in the context
of linear modeling.
Chapter 4
Estimation in linear models
This chapter discusses the estimation problem for a linear model by a quasi maximum
likelihood method. We especially focus on the validity of the presented results under
possible model misspecification. Another important issue is the way of measuring the
estimation loss and risk. We distinguish below between response estimation or predic-
tion and the parameter estimation. The most advanced results like chi-squared result
in Section 4.6 are established under the assumption of a Gaussian noise. However, a
misspecification of noise structure is allowed and addressed.
4.1 Modeling assumptions
A linear model assumes that the observations Yi follow the equation:
Yi = Ψ>i θ∗ + εi (4.1)
for i = 1, . . . , n , where θ∗ = (θ∗1, . . . , θ∗p)> ∈ IRp is an unknown parameter vector, Ψi
are given vectors in IRp and the εi ’s are individual errors with zero mean. A typical
example is given by linear regression (see Section 3.3), when the vectors Ψi are the values
of a set of basis functions (e.g. polynomial or trigonometric) at the design points Xi .
A linear Gaussian model assumes in addition that the vector of errors ε = (ε1, . . . , εn)>
is normally distributed with zero mean and a covariance matrix Σ :
ε ∼ N(0, Σ).
In this chapter we suppose that Σ is given in advance. We will distinguish between
three cases:
1. the errors εi are i.i.d. N(0, σ2) , or equivalently, the matrix Σ is equal to σ2IIn
with IIn being the unit matrix in IRn .
2. the errors are independent but not homogeneous, that is, IEεi² = σi² . Then the
matrix Σ is diagonal: Σ = diag(σ1², . . . , σn²) .
3. the errors εi are dependent with a covariance matrix Σ .
In practical applications one mostly starts with the white Gaussian noise assumption
and more general cases 2 and 3 are only considered if there are clear indications of the
noise inhomogeneity or correlation. The second situation is typical e.g. for the eigenvector
decomposition in an inverse problem. The last case is the most general and includes the
first two.
4.2 Quasi maximum likelihood estimation
Denote by Y = (Y1, . . . , Yn)> (resp. ε = (ε1, . . . , εn)> ) the vector of observations (resp.
of errors) in IRn and by Ψ the p× n matrix with columns Ψi . Let also Ψ> denote its
transpose. Then the model equation can be rewritten as:
Y = Ψ>θ∗ + ε, ε ∼ N(0, Σ).
An equivalent formulation is that Σ^{−1/2}(Y − Ψ>θ∗) is a standard normal vector in IRn .
The log-density of the distribution of the vector Y = (Y1, . . . , Yn)> w.r.t. the Lebesgue
measure in IRn is therefore of the form
L(θ) = −(n/2) log(2π) − (1/2) log( det Σ ) − (1/2) ‖Σ^{−1/2}(Y − Ψ>θ)‖²
= −(n/2) log(2π) − (1/2) log( det Σ ) − (1/2) (Y − Ψ>θ)>Σ^{−1}(Y − Ψ>θ).

In case 1 this expression can be rewritten as

L(θ) = −(n/2) log(2πσ²) − (2σ²)^{−1} ∑_{i=1}^{n} (Yi − Ψi>θ)².
In case 2 the expression is similar:

L(θ) = −∑_{i=1}^{n} { (1/2) log(2πσi²) + (Yi − Ψi>θ)² / (2σi²) }.
The maximum likelihood estimate (MLE) θ̃ of θ∗ is defined by maximizing the log-likelihood
L(θ) :

θ̃ = argmaxθ∈IRp L(θ) = argminθ∈IRp (Y − Ψ>θ)>Σ^{−1}(Y − Ψ>θ). (4.2)
We omit the other terms in the expression of L(θ) because they do not depend on θ .
This estimate is the least squares estimate (LSE) because it minimizes the sum of squared
distances between the observations Yi and the linear responses Ψi>θ . Note that (4.2) is
a quadratic optimization problem which has a closed form solution. Differentiating the
right-hand side of (4.2) w.r.t. θ yields the normal equation

ΨΣ^{−1}Ψ>θ = ΨΣ^{−1}Y .
If the p × p matrix ΨΣ^{−1}Ψ> is non-degenerate then the normal equation has the unique
solution

θ̃ = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = SY , (4.3)

where

S = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}

is a p × n matrix. We denote by θ̃m the entries of the vector θ̃ , m = 1, . . . , p .
If the matrix ΨΣ^{−1}Ψ> is degenerate, then the normal equation has infinitely many
solutions. However, one can still apply the formula (4.3), where ( ΨΣ^{−1}Ψ> )^{−1} is a
pseudo-inverse of the matrix ΨΣ^{−1}Ψ> .
The ML approach leads to the parameter estimate θ̃ . Note that due to the model
(4.1), the product f̃ = Ψ>θ̃ is an estimate of the mean f := IEY of the vector of
observations Y :

f̃ = Ψ>θ̃ = Ψ>( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = ΠY ,

where

Π = Ψ>( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}

is an n × n matrix (linear operator) in IRn . The vector f̃ is called a prediction or
response regression estimate.
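The algebra behind these formulas can be verified numerically: SΨ> = IIp, so Π is idempotent and reproduces every response vector of the form f = Ψ>θ. A sketch with an arbitrary random design and a diagonal Σ:

```python
import numpy as np

# Properties of the estimator matrix S and the operator Pi from (4.3):
# S Psi^T = I_p, so Pi is idempotent and reproduces any response of the
# form f = Psi^T theta.  Small random design; Sigma is an arbitrary diagonal.
rng = np.random.default_rng(6)
n, p = 30, 3
Psi = rng.normal(size=(p, n))
Sigma_inv = np.diag(1 / rng.uniform(0.5, 2.0, size=n))    # Sigma^{-1}

S = np.linalg.solve(Psi @ Sigma_inv @ Psi.T, Psi @ Sigma_inv)
Pi = Psi.T @ S                                            # n x n operator

theta = np.array([1.0, 2.0, -1.0])
f = Psi.T @ theta
print("Pi idempotent:", np.allclose(Pi @ Pi, Pi))
print("Pi f = f for f in the span:", np.allclose(Pi @ f, f))
print("S Psi^T = identity:", np.allclose(S @ Psi.T, np.eye(p)))
```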
Below we study the properties of the estimates $\tilde{\theta}$ and $\tilde{f}$. In this study we try to address both types of possible model misspecification: a wrong assumption about the error distribution and a possibly wrong linear parametric structure. Namely, we consider the model
\[
Y_i = f_i + \varepsilon_i, \qquad \varepsilon \sim N(0, \Sigma_0). \tag{4.4}
\]
The response values $f_i$ are usually treated as the values of the regression function $f(\cdot)$ at the design points $X_i$. The parametric model (4.1) can be viewed as an approximation of
(4.4), while $\Sigma$ is an approximation of the true covariance matrix $\Sigma_0$. If $f$ is indeed equal to $\Psi^{\top}\theta^{*}$ and $\Sigma = \Sigma_0$, then $\tilde{\theta}$ and $\tilde{f}$ are MLEs, otherwise quasi MLEs. In our study we mostly restrict ourselves to the case 1 assumption about the noise $\varepsilon$: $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. The general case can be reduced to this one by a simple data transformation, namely, by multiplying the equation (4.4) $Y = f + \varepsilon$ by the matrix $\Sigma^{-1/2}$; see Section 4.6 for more detail.
4.2.1 Estimation under the homogeneous noise assumption
If a homogeneous noise is assumed, that is, $\Sigma = \sigma^{2}\mathbb{I}_n$ and $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then the formulae for the MLEs $\tilde{\theta}, \tilde{f}$ slightly simplify. In particular, the variance $\sigma^{2}$ cancels and the resulting estimate is the ordinary least squares estimate (oLSE):
\[
\tilde{\theta} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y = SY
\]
with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. Also
\[
\tilde{f} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y = \Pi Y
\]
with $\Pi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$.
Exercise 4.2.1. Derive the formulae for θ, f directly from the log-likelihood L(θ) for
homogeneous noise.
If the assumption ε ∼ N(0, σ2IIn) about the errors is not precisely fulfilled, then the
oLSE can be viewed as a quasi MLE.
4.2.2 Linear basis transformation
Denote by $\psi_1^{\top},\dots,\psi_p^{\top}$ the rows of the matrix $\Psi$. Then the $\psi_i$'s are vectors in $\mathbb{R}^{n}$, and we call them the basis vectors. In the linear regression case the $\psi_i$'s are obtained as the values of the basis functions at the design points. Our linear parametric assumption simply means that the underlying vector $f$ can be represented as a linear combination of the vectors $\psi_1,\dots,\psi_p$:
\[
f = \theta_1^{*}\psi_1 + \dots + \theta_p^{*}\psi_p.
\]
In other words, $f$ belongs to the linear subspace in $\mathbb{R}^{n}$ spanned by the vectors $\psi_1,\dots,\psi_p$.
It is clear that this assumption still holds if we select another basis in this subspace.
Let $U$ be any linear orthogonal transformation in $\mathbb{R}^{p}$ with $UU^{\top} = \mathbb{I}_p$. Then the linear relation $f = \Psi^{\top}\theta^{*}$ can be rewritten as
\[
f = \Psi^{\top} U U^{\top} \theta^{*} = \bar{\Psi}^{\top} u^{*}
\]
with $\bar{\Psi} = U^{\top}\Psi$ and $u^{*} = U^{\top}\theta^{*}$. Here the rows of $\bar{\Psi}$ are the new basis vectors $\bar{\psi}_m$ in the same subspace, while $u^{*}$ is the vector of coefficients describing the decomposition of the vector $f$ w.r.t. this new basis:
\[
f = u_1^{*}\bar{\psi}_1 + \dots + u_p^{*}\bar{\psi}_p.
\]
The natural question is how the expressions for the MLEs $\tilde{\theta}$ and $\tilde{f}$ change with the change of the basis. The answer is straightforward. For notational simplicity, we only consider the case with $\Sigma = \sigma^{2}\mathbb{I}_n$. The model can be rewritten as
\[
Y = \bar{\Psi}^{\top}u^{*} + \varepsilon,
\]
yielding the solutions
\[
\tilde{u} = \bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi}Y = \bar{S}Y, \qquad
\tilde{f} = \bar{\Psi}^{\top}\bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi}Y = \bar{\Pi}Y,
\]
where $\bar{\Psi} = U^{\top}\Psi$ implies
\[
\bar{S} = \bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi} = U^{\top}S, \qquad
\bar{\Pi} = \bar{\Psi}^{\top}\bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi} = \Pi.
\]
This yields
\[
\tilde{u} = U^{\top}\tilde{\theta},
\]
and moreover, the estimate $\tilde{f}$ is not changed by any linear orthogonal transformation of the basis.
The first statement can be expected in view of θ∗ = Uu∗ , while the second one will be
explained in the next section: Π is the linear projector on the subspace spanned by the
basis vectors and this projector is invariant w.r.t. basis transformations.
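Both statements are easy to confirm numerically. A minimal numpy sketch (ours, assuming homogeneous noise; the helper `lse` is our own name):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
Psi = rng.standard_normal((p, n))   # rows = basis vectors
Y = rng.standard_normal(n)

def lse(Psi, Y):
    # ordinary least squares estimate and fitted response
    theta = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
    return theta, Psi.T @ theta

# a random orthogonal p x p matrix U via the QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
theta_hat, f_hat = lse(Psi, Y)
u_hat, f_hat_rotated = lse(U.T @ Psi, Y)   # same subspace, rotated basis
```

The fitted response is identical in both bases, while the coefficients transform as $\tilde{u} = U^{\top}\tilde{\theta}$.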
Exercise 4.2.2. Consider univariate polynomial regression of degree $p-1$. This means that $f$ is a polynomial function of degree $p-1$ observed at the points $X_i$ with errors $\varepsilon_i$ that are assumed to be i.i.d. normal. The function $f$ can be represented as
\[
f(x) = \theta_1^{*} + \theta_2^{*}x + \dots + \theta_p^{*}x^{p-1}
\]
using the basis functions $\psi_m(x) = x^{m-1}$ for $m = 1,\dots,p$. At the same time, for any point $x_0$, this function can also be written as
\[
f(x) = u_1^{*} + u_2^{*}(x - x_0) + \dots + u_p^{*}(x - x_0)^{p-1}
\]
using the basis functions $\bar{\psi}_m(x) = (x - x_0)^{m-1}$.

• Write the matrices $\Psi$ and $\Psi\Psi^{\top}$ and similarly $\bar{\Psi}$ and $\bar{\Psi}\bar{\Psi}^{\top}$.

• Describe the linear transformation $A$ such that $u = A\theta$ for $p = 1$.

• Describe the transformation $A$ such that $u = A\theta$ for $p > 1$.

Hint: use the formula
\[
u_m^{*} = \frac{1}{(m-1)!}\, f^{(m-1)}(x_0), \qquad m = 1,\dots,p,
\]
to identify the coefficient $u_m^{*}$ via $\theta_m^{*},\dots,\theta_p^{*}$.
4.2.3 Orthogonal and orthonormal design
Orthogonality of the design matrix $\Psi$ means that the basis vectors $\psi_1,\dots,\psi_p$ are orthogonal in the sense
\[
\psi_m^{\top}\psi_{m'} = \sum_{i=1}^{n} \psi_{m,i}\psi_{m',i} =
\begin{cases}
0 & \text{if } m \ne m', \\
\lambda_m & \text{if } m = m',
\end{cases}
\]
for some positive values $\lambda_1,\dots,\lambda_p$. Equivalently one can write
\[
\Psi\Psi^{\top} = \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p).
\]
This feature of the design is very useful: it essentially simplifies the computation and the analysis of the properties of $\tilde{\theta}$. Indeed, $\Psi\Psi^{\top} = \Lambda$ implies
\[
\tilde{\theta} = \Lambda^{-1}\Psi Y, \qquad \tilde{f} = \Psi^{\top}\tilde{\theta} = \Psi^{\top}\Lambda^{-1}\Psi Y
\]
with $\Lambda^{-1} = \operatorname{diag}(\lambda_1^{-1},\dots,\lambda_p^{-1})$. In particular, the first relation means
\[
\tilde{\theta}_m = \lambda_m^{-1} \sum_{i=1}^{n} Y_i \psi_{m,i},
\]
that is, $\tilde{\theta}_m$ is, up to the factor $\lambda_m^{-1}$, the scalar product of the data and the basis vector $\psi_m$, $m = 1,\dots,p$. The estimate of the response $f$ reads as
\[
\tilde{f} = \tilde{\theta}_1\psi_1 + \dots + \tilde{\theta}_p\psi_p.
\]
Theorem 4.2.1. Consider the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with homogeneous errors $\varepsilon$: $\mathbb{E}\varepsilon\varepsilon^{\top} = \sigma^{2}\mathbb{I}_n$. If the design $\Psi$ is orthogonal, that is, if $\Psi\Psi^{\top} = \Lambda$ for a diagonal matrix $\Lambda$, then the estimated coefficients $\tilde{\theta}_m$ are uncorrelated: $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\Lambda^{-1}$. Moreover, if $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then $\tilde{\theta} \sim N(\theta^{*}, \sigma^{2}\Lambda^{-1})$.
An important message of this result is that an orthogonal design allows for splitting the original multivariate problem into a collection of independent univariate problems: each coefficient $\theta_m^{*}$ is estimated by $\tilde{\theta}_m$ independently of the remaining coefficients.
The calculus can be further simplified in the case of an orthogonal design with $\Psi\Psi^{\top} = \mathbb{I}_p$. Then one speaks about an orthonormal design. This also implies that every basis function (vector) $\psi_m$ is standardized: $\|\psi_m\|^{2} = \sum_{i=1}^{n} \psi_{m,i}^{2} = 1$. In the case of an orthonormal design, the estimate $\tilde{\theta}$ is particularly simple: $\tilde{\theta} = \Psi Y$. Correspondingly, the target of estimation $\theta^{*}$ satisfies $\theta^{*} = \Psi f$. In other words, the target is the collection $(\theta_m^{*})$ of the Fourier coefficients of the underlying function (vector) $f$ w.r.t. the basis $\Psi$, while the estimate $\tilde{\theta}$ is the collection of empirical Fourier coefficients $\tilde{\theta}_m$:
\[
\theta_m^{*} = \sum_{i=1}^{n} f_i \psi_{m,i}, \qquad \tilde{\theta}_m = \sum_{i=1}^{n} Y_i \psi_{m,i}.
\]
An important feature of the orthonormal design is that it preserves the noise homogeneity: $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\mathbb{I}_p$.
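A small numerical illustration of the orthonormal case (our own sketch, not from the text): the rows of $\Psi$ are made orthonormal via a QR decomposition, and the estimate reduces to the empirical Fourier coefficients $\Psi Y$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 64, 5
# orthonormal design: rows of Psi are orthonormal vectors in R^n
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
Psi = Q.T                              # p x n with Psi @ Psi.T = I_p
f = rng.standard_normal(n)             # arbitrary underlying response vector
sigma = 0.1
Y = f + sigma * rng.standard_normal(n)

theta_target = Psi @ f                 # Fourier coefficients of f, the target
theta_hat = Psi @ Y                    # empirical Fourier coefficients
```

Each $\tilde{\theta}_m$ deviates from $\theta_m^{*}$ by an independent $N(0, \sigma^{2})$ noise, in line with $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\mathbb{I}_p$.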
4.2.4 Spectral representation
Consider a linear model
\[
Y = \Psi^{\top}\theta + \varepsilon \tag{4.5}
\]
with homogeneous errors $\varepsilon$: $\operatorname{Var}(\varepsilon) = \sigma^{2}\mathbb{I}_n$. The rows of the matrix $\Psi$ can be viewed as basis vectors in $\mathbb{R}^{n}$, and the product $\Psi^{\top}\theta$ is a linear combination of these vectors with the coefficients $(\theta_1,\dots,\theta_p)$. Effectively, linear least squares estimation projects the data onto the subspace generated by the basis functions. This projection
is of course invariant w.r.t. a basis transformation within this linear subspace. This
fact can be used to reduce the model to the case of an orthogonal design considered in
the previous section. Namely, one can always find a linear orthogonal transformation $U: \mathbb{R}^{p} \to \mathbb{R}^{p}$ ensuring the orthogonality of the transformed basis. This means that the rows of the matrix $\bar{\Psi} = U\Psi$ are orthogonal and the matrix $\bar{\Psi}\bar{\Psi}^{\top}$ is diagonal:
\[
\bar{\Psi}\bar{\Psi}^{\top} = U\Psi\Psi^{\top}U^{\top} = \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p).
\]
After this transformation the original model reads
\[
Y = \bar{\Psi}^{\top}u + \varepsilon, \qquad \bar{\Psi}\bar{\Psi}^{\top} = \Lambda,
\]
where $u = U\theta \in \mathbb{R}^{p}$. Within this model, the transformed parameter $u$ can be estimated using the empirical Fourier coefficients $Z_m = \bar{\psi}_m^{\top}Y$, where $\bar{\psi}_m$ is the $m$th row of $\bar{\Psi}$, $m = 1,\dots,p$. The original parameter vector $\theta$ can be recovered via the equation $\theta = U^{\top}u$. This set of equations can be written in the form
\[
Z = \Lambda u + \Lambda^{1/2}\xi, \tag{4.6}
\]
where $Z = \bar{\Psi}Y = U\Psi Y$ is a vector in $\mathbb{R}^{p}$ and $\xi = \Lambda^{-1/2}\bar{\Psi}\varepsilon = \Lambda^{-1/2}U\Psi\varepsilon \in \mathbb{R}^{p}$. The equation (4.6) is called the spectral representation of the linear model (4.5). The reason is that the basis transformation $U$ can be built by a singular value decomposition of $\Psi$. This representation is widely used in the context of linear inverse problems; see Section 4.8.
Theorem 4.2.2. Consider the model (4.5) with homogeneous errors $\varepsilon$: $\mathbb{E}\varepsilon\varepsilon^{\top} = \sigma^{2}\mathbb{I}_n$. Then there exists an orthogonal transform $U: \mathbb{R}^{p} \to \mathbb{R}^{p}$ leading to the spectral representation (4.6) with homogeneous uncorrelated errors $\xi$: $\mathbb{E}\xi\xi^{\top} = \sigma^{2}\mathbb{I}_p$. If $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then the vector $\xi$ is normal as well: $\xi \sim N(0, \sigma^{2}\mathbb{I}_p)$.

Exercise 4.2.3. Prove the result of Theorem 4.2.2.

Hint: select any $U$ ensuring $U\Psi\Psi^{\top}U^{\top} = \Lambda$. Then
\[
\mathbb{E}\xi\xi^{\top} = \Lambda^{-1/2}U\Psi \,\mathbb{E}\varepsilon\varepsilon^{\top}\, \Psi^{\top}U^{\top}\Lambda^{-1/2} = \sigma^{2}\Lambda^{-1/2}U\Psi\Psi^{\top}U^{\top}\Lambda^{-1/2} = \sigma^{2}\mathbb{I}_p.
\]
A special case of the spectral representation corresponds to the orthonormal design
with ΨΨ> = IIp . In this situation, the spectral model reads as Z = u + ξ , that is, we
simply observe the target u corrupted with a homogeneous noise ξ . Such an equation
is often called the sequence space model and it is intensively used in the literature for the
theoretical study; cf. Section 4.7 below.
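The construction of $U$ via the singular value decomposition, as mentioned above, can be sketched as follows (an illustration of ours; variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
Psi = rng.standard_normal((p, n))
theta = np.array([0.5, 1.0, -1.0])
eps = 0.2 * rng.standard_normal(n)
Y = Psi.T @ theta + eps

# U from the singular value decomposition Psi = W diag(s) Vt:
# the rows of U @ Psi are then orthogonal with squared norms s**2
W, s, Vt = np.linalg.svd(Psi, full_matrices=False)
U = W.T
Psi_bar = U @ Psi
Lam = Psi_bar @ Psi_bar.T              # = diag(lambda_1, ..., lambda_p)
lam = np.diag(Lam)
u = U @ theta                          # transformed parameter
Z = Psi_bar @ Y                        # empirical Fourier coefficients
xi = (Psi_bar @ eps) / np.sqrt(lam)    # spectral noise
```

One can check that `Lam` is (numerically) diagonal and that `Z` satisfies the spectral equation (4.6).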
4.3 Properties of the response estimate f
This section discusses some properties of the estimate $\tilde{f} = \Psi^{\top}\tilde{\theta} = \Pi Y$ of the response vector $f$. It is worth noting that the first and essential part of the analysis does not rely on the underlying model distribution, only on our parametric assumptions that $f = \Psi^{\top}\theta^{*}$ and $\operatorname{Cov}(\varepsilon) = \Sigma = \sigma^{2}\mathbb{I}_n$. The real model only appears when studying the risk of estimation. We will comment on the cases of misspecified $f$ and $\Sigma$.
When $\Sigma = \sigma^{2}\mathbb{I}_n$, the operator $\Pi$ in the representation $\tilde{f} = \Pi Y$ of the estimate $\tilde{f}$ reads as
\[
\Pi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi. \tag{4.7}
\]
First we make use of the linear structure of the model (4.1) and of the estimate f to
derive a number of its simple but important properties.
4.3.1 Decomposition into a deterministic and a stochastic component
The model equation $Y = f + \varepsilon$ yields
\[
\tilde{f} = \Pi Y = \Pi(f + \varepsilon) = \Pi f + \Pi\varepsilon. \tag{4.8}
\]
The first element of this sum, $\Pi f$, is purely deterministic, but it depends on the unknown response vector $f$. Moreover, it will be shown in the next lemma that $\Pi f = f$ if the parametric assumption holds and the vector $f$ indeed can be represented as $\Psi^{\top}\theta^{*}$. The second element is stochastic as a linear transformation of the stochastic vector $\varepsilon$, but it is independent of the model response $f$. The properties of the estimate $\tilde{f}$ heavily rely on the properties of the linear operator $\Pi$ from (4.7), which we collect in the next section.
4.3.2 Properties of the operator Π
Let ψ1, . . . ,ψp be the columns of the matrix Ψ> . These are the vectors in IRn also
called the basis vectors.
Lemma 4.3.1. Let the matrix ΨΨ> be non-degenerate. Then the operator Π fulfills
the following conditions:
(i) Π is symmetric (self-adjoint), that is, Π> = Π .
(ii) Π is a projector in IRn , i.e. Π>Π = Π2 = Π and Π(1n −Π) = 0 , where 1n
means the unity operator in IRn .
(iii) For an arbitrary vector v from IRn , it holds ‖v‖2 = ‖Πv‖2 + ‖v −Πv‖2 .
(iv) The trace of Π is equal to the dimension of its image, tr Π = p .
(v) $\Pi$ projects the linear space $\mathbb{R}^{n}$ onto the linear subspace $\mathcal{L}_p = \langle \psi_1,\dots,\psi_p \rangle$ spanned by the basis vectors $\psi_1,\dots,\psi_p$, that is,
\[
\|f - \Pi f\| = \inf_{g \in \mathcal{L}_p} \|f - g\|.
\]
(vi) The matrix $\Pi$ can be represented in the form
\[
\Pi = U^{\top}\Lambda_p U,
\]
where $U$ is an orthonormal matrix and $\Lambda_p$ is a diagonal matrix with the first $p$ diagonal elements equal to 1 and the others equal to zero:
\[
\Lambda_p = \operatorname{diag}\{\underbrace{1,\dots,1}_{p}, \underbrace{0,\dots,0}_{n-p}\}.
\]
Proof. It holds
\[
\bigl\{\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\bigr\}^{\top} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi
\]
and
\[
\Pi^{2} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi = \Pi,
\]
which proves the first two statements of the lemma. The third one follows directly from the first two. Next,
\[
\operatorname{tr} \Pi = \operatorname{tr}\bigl\{\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\bigr\} = \operatorname{tr}\bigl\{\Psi\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\bigr\} = \operatorname{tr} \mathbb{I}_p = p.
\]
The second property means that $\Pi$ is a projector in $\mathbb{R}^{n}$, and the fourth one means that the dimension of its image space is equal to $p$. The basis vectors $\psi_1,\dots,\psi_p$ are the rows of the matrix $\Psi$. It is clear that
\[
\Pi\Psi^{\top} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top} = \Psi^{\top}.
\]
Therefore, the vectors $\psi_m$ are invariant under the operator $\Pi$; in particular, all these vectors belong to the image space of this operator. If now $g$ is a vector in $\mathcal{L}_p$, then it can be represented as $g = c_1\psi_1 + \dots + c_p\psi_p$, and therefore $\Pi g = g$ and $\Pi\mathcal{L}_p = \mathcal{L}_p$. Finally, the non-singularity of the matrix $\Psi\Psi^{\top}$ means that the vectors $\psi_1,\dots,\psi_p$ forming the rows of $\Psi$ are linearly independent. Therefore, the space $\mathcal{L}_p$ spanned by these vectors is of dimension $p$, and hence it coincides with the image space of the operator $\Pi$.
The last property is the usual diagonal decomposition of a projector.
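The properties (i)–(iv) of the lemma are easy to confirm numerically for a random design. A small sketch of ours (not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 6
Psi = rng.standard_normal((p, n))
# the projector from (4.7)
Pi = Psi.T @ np.linalg.inv(Psi @ Psi.T) @ Psi
v = rng.standard_normal(n)   # an arbitrary test vector for property (iii)
```

The assertions below check symmetry, idempotence, the trace identity, the invariance of the basis vectors, and the Pythagorean decomposition of $\|v\|^{2}$.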
Exercise 4.3.1. Consider the case of an orthonormal design with $\Psi\Psi^{\top} = \mathbb{I}_p$. Specify the projector $\Pi$ of Lemma 4.3.1 for this situation, particularly its decomposition from (vi).
4.3.3 Quadratic loss and risk of the response estimation
In this section we study the quadratic risk of estimating the response $f$. The reason for focusing on the response rather than on the parameter will become clear when we discuss the properties of the fitted likelihood in the next section.

The loss $\wp(\tilde{f}, f)$ of the estimate $\tilde{f}$ can be naturally defined as the squared norm of the difference $\tilde{f} - f$:
\[
\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2} = \sum_{i=1}^{n} |\tilde{f}_i - f_i|^{2}.
\]
Correspondingly, the quadratic risk of the estimate $\tilde{f}$ is the mean of this loss:
\[
R(\tilde{f}) = \mathbb{E}\wp(\tilde{f}, f) = \mathbb{E}\bigl[(\tilde{f} - f)^{\top}(\tilde{f} - f)\bigr]. \tag{4.9}
\]
The next result describes the loss and risk decomposition for two cases: when the
parametric assumption f = Ψ>θ∗ is correct and in the general case.
Theorem 4.3.2. Suppose that the errors $\varepsilon_i$ from (4.1) are independent with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^{2} = \sigma^{2}$, i.e. $\Sigma = \sigma^{2}\mathbb{I}_n$. Then the loss $\wp(\tilde{f}, f) = \|\Pi Y - f\|^{2}$ and the risk $R(\tilde{f})$ of the LSE $\tilde{f}$ fulfill
\[
\wp(\tilde{f}, f) = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \|f - \Pi f\|^{2} + p\sigma^{2}.
\]
Moreover, if $f = \Psi^{\top}\theta^{*}$, then
\[
\wp(\tilde{f}, f) = \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = p\sigma^{2}.
\]
Proof. We apply (4.9) and the decomposition (4.8) of the estimate $\tilde{f}$. It follows that
\[
\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2} = \|f - \Pi f - \Pi\varepsilon\|^{2}
= \|f - \Pi f\|^{2} - 2(f - \Pi f)^{\top}\Pi\varepsilon + \|\Pi\varepsilon\|^{2}.
\]
The cross term vanishes because $\Pi(f - \Pi f) = 0$ by Lemma 4.3.1, (ii); this implies the decomposition for the loss of $\tilde{f}$. Next we compute the mean of $\|\Pi\varepsilon\|^{2}$, applying again Lemma 4.3.1. Indeed,
\[
\mathbb{E}\|\Pi\varepsilon\|^{2} = \mathbb{E}(\Pi\varepsilon)^{\top}\Pi\varepsilon = \mathbb{E}\operatorname{tr}\bigl\{\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr\} = \mathbb{E}\operatorname{tr}\bigl(\Pi\varepsilon\varepsilon^{\top}\Pi^{\top}\bigr) = \operatorname{tr}\bigl\{\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr\} = \sigma^{2}\operatorname{tr}(\Pi^{2}) = p\sigma^{2}.
\]
Now consider the case when $f = \Psi^{\top}\theta^{*}$. By Lemma 4.3.1, $f = \Pi f$, and the last two statements of the theorem clearly follow.
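A Monte Carlo check of the risk decomposition of Theorem 4.3.2 (our own sketch; the response $f$ is deliberately chosen outside the span of the basis so that the bias term $\|f - \Pi f\|^{2}$ is non-zero):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 40, 3, 0.5
Psi = rng.standard_normal((p, n))
Pi = Psi.T @ np.linalg.inv(Psi @ Psi.T) @ Psi
f = np.sin(np.linspace(0, 3, n))          # response not linear in the basis
bias2 = np.sum((f - Pi @ f) ** 2)         # squared bias ||f - Pi f||^2

losses = []
for _ in range(2000):
    eps = sigma * rng.standard_normal(n)
    f_hat = Pi @ (f + eps)                # the LSE of the response
    losses.append(np.sum((f_hat - f) ** 2))
risk_mc = np.mean(losses)                 # Monte Carlo risk
risk_theory = bias2 + p * sigma ** 2      # decomposition from Theorem 4.3.2
```

The simulated risk agrees with $\|f - \Pi f\|^{2} + p\sigma^{2}$ up to Monte Carlo error.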
4.3.4 Misspecified “colored noise”
Here we briefly comment on the case when ε is not a white noise. So, our assumption
about the errors εi is that they are uncorrelated and homogeneous, that is, Σ = σ2IIn
while the true covariance matrix is given by Σ0 . Many properties of the estimate f =
ΠY which are simply based on the linearity of the model (4.1) and of the estimate
$\tilde{f}$ itself continue to apply. In particular, the loss $\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2}$ can again be decomposed as
\[
\|\tilde{f} - f\|^{2} = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}.
\]
Theorem 4.3.3. Suppose that $\mathbb{E}\varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. Then the loss $\wp(\tilde{f}, f)$ and the risk $R(\tilde{f})$ of the LSE $\tilde{f}$ fulfill
\[
\wp(\tilde{f}, f) = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \|f - \Pi f\|^{2} + \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]
Moreover, if $f = \Psi^{\top}\theta^{*}$, then
\[
\wp(\tilde{f}, f) = \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]
Proof. The decomposition of the loss from Theorem 4.3.2 only relies on the geometric properties of the projector $\Pi$ and does not use the covariance structure of the noise. Hence, it only remains to check the expectation of $\|\Pi\varepsilon\|^{2}$. Observe that
\[
\mathbb{E}\|\Pi\varepsilon\|^{2} = \mathbb{E}\operatorname{tr}\bigl[\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr] = \operatorname{tr}\bigl[\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr] = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr),
\]
as required.
4.4 Properties of the MLE θ
In this section we focus on the properties of the quasi MLE $\tilde{\theta}$ built for the idealized linear Gaussian model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. As in the previous section, we do not assume the parametric structure of the underlying model and consider the more general model $Y = f + \varepsilon$ with an unknown vector $f$ and errors $\varepsilon$ with zero mean and covariance matrix $\Sigma_0$. Due to (4.3), it holds $\tilde{\theta} = SY$ with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. An important feature of this estimate is its linear dependence on the data. The linear model equation $Y = f + \varepsilon$ and the linear structure of the estimate $\tilde{\theta} = SY$ allow us to decompose the vector $\tilde{\theta}$ into a deterministic and a stochastic term:
\[
\tilde{\theta} = SY = S(f + \varepsilon) = Sf + S\varepsilon. \tag{4.10}
\]
The first term $Sf$ is deterministic but depends on the unknown vector $f$, while the second term $S\varepsilon$ is stochastic but does not involve the model response $f$. Below we study the properties of each component separately.
4.4.1 Properties of the stochastic component
The next result describes the distributional properties of the stochastic component $\zeta = S\varepsilon$ for $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$ and thus of the estimate $\tilde{\theta}$.

Theorem 4.4.1. Assume $Y = f + \varepsilon$ with $\mathbb{E}\varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. The stochastic component $\zeta = S\varepsilon$ in (4.10) fulfills
\[
\mathbb{E}\zeta = 0, \qquad V^{2} \stackrel{\mathrm{def}}{=} \operatorname{Var}(\zeta) = S\Sigma_0 S^{\top}, \qquad \mathbb{E}\|\zeta\|^{2} = \operatorname{tr} V^{2} = \operatorname{tr}\bigl(S\Sigma_0 S^{\top}\bigr).
\]
Moreover, if $\Sigma = \Sigma_0 = \sigma^{2}\mathbb{I}_n$, then
\[
V^{2} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}, \qquad \mathbb{E}\|\zeta\|^{2} = \operatorname{tr}(V^{2}) = \sigma^{2}\operatorname{tr}\bigl[\bigl(\Psi\Psi^{\top}\bigr)^{-1}\bigr]. \tag{4.11}
\]
Similarly, for the estimate $\tilde{\theta}$ it holds
\[
\mathbb{E}\tilde{\theta} = Sf, \qquad \operatorname{Var}\bigl(\tilde{\theta}\bigr) = V^{2}.
\]
If the errors $\varepsilon$ are Gaussian, then both $\zeta$ and $\tilde{\theta}$ are Gaussian as well:
\[
\zeta \sim N(0, V^{2}), \qquad \tilde{\theta} \sim N(Sf, V^{2}).
\]
Proof. For the variance $V^{2}$ of $\zeta$ it holds
\[
\operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^{\top} = \mathbb{E} S\varepsilon\varepsilon^{\top}S^{\top} = S\Sigma_0 S^{\top}.
\]
Next we use that $\mathbb{E}\|\zeta\|^{2} = \mathbb{E}\zeta^{\top}\zeta = \mathbb{E}\operatorname{tr}(\zeta\zeta^{\top}) = \operatorname{tr} V^{2}$. If $\Sigma = \Sigma_0 = \sigma^{2}\mathbb{I}_n$, then (4.11) follows by simple algebra.

If $\varepsilon$ is a Gaussian vector, then $\zeta$, as its linear transformation, is Gaussian as well. The properties of $\tilde{\theta}$ follow directly from the decomposition (4.10).
With $\Sigma_0 \ne \sigma^{2}\mathbb{I}_n$, the variance $V^{2}$ can be represented as
\[
V^{2} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]

Exercise 4.4.1. Let $\zeta$ be the stochastic component of $\tilde{\theta}$ built for the misspecified linear model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$, and let the true noise variance be $\Sigma_0$. Then $\operatorname{Var}(\tilde{\theta}) = V^{2}$ with
\[
V^{2} = \bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}\Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^{\top}\bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}.
\]
The main finding in the presented study is that the stochastic part ζ = Sε of the
estimate θ is completely independent of the structure of the vector f . In other words,
the behavior of the stochastic component ζ does not change even if the linear parametric
assumption is misspecified.
4.4.2 Properties of the deterministic component
Now we study the deterministic term starting with the parametric situation f = Ψ>θ∗ .
Here we only specify the results for the case 1 with Σ = σ2IIn .
Theorem 4.4.2. Let $f = \Psi^{\top}\theta^{*}$. Then $\tilde{\theta} = SY$ with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$ is unbiased, that is, $\mathbb{E}\tilde{\theta} = Sf = \theta^{*}$.

Proof. For the proof, just observe that $Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top}\theta^{*} = \theta^{*}$.
Now we briefly discuss what happens when the linear parametric assumption is not fulfilled, that is, $f$ cannot be represented as $\Psi^{\top}\theta^{*}$. In this case it is not yet clear what $\tilde{\theta}$ really estimates. The answer is given in the context of the general theory of minimum contrast estimation. Namely, define $\theta^{\dagger}$ as the point which maximizes the expectation of the (quasi) log-likelihood $L(\theta)$:
\[
\theta^{\dagger} = \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta). \tag{4.12}
\]
Theorem 4.4.3. The solution $\theta^{\dagger}$ of the optimization problem (4.12) is given by
\[
\theta^{\dagger} = Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f.
\]
Moreover,
\[
\Psi^{\top}\theta^{\dagger} = \Pi f = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f.
\]
In particular, if $f = \Psi^{\top}\theta^{*}$, then $\theta^{\dagger} = \theta^{*}$ and $\Psi^{\top}\theta^{\dagger} = f$.
Proof. The model equation $Y = f + \varepsilon$ and the properties of the stochastic component $\zeta$ yield by simple algebra
\[
\operatorname*{argmax}_{\theta} \mathbb{E}L(\theta)
= \operatorname*{argmin}_{\theta} \mathbb{E}\bigl(f - \Psi^{\top}\theta + \varepsilon\bigr)^{\top}\bigl(f - \Psi^{\top}\theta + \varepsilon\bigr)
= \operatorname*{argmin}_{\theta} \bigl\{(f - \Psi^{\top}\theta)^{\top}(f - \Psi^{\top}\theta) + \mathbb{E}\bigl(\varepsilon^{\top}\varepsilon\bigr)\bigr\}
= \operatorname*{argmin}_{\theta} (f - \Psi^{\top}\theta)^{\top}(f - \Psi^{\top}\theta),
\]
where the cross term disappears because $\mathbb{E}\varepsilon = 0$. Differentiating w.r.t. $\theta$ leads to the equation
\[
\Psi(f - \Psi^{\top}\theta) = 0
\]
and the solution $\theta^{\dagger} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f$, which is exactly the expected value of $\tilde{\theta}$ by Theorem 4.4.1.
Exercise 4.4.2. State the result of Theorems 4.4.2 and 4.4.3 for the MLE $\tilde{\theta}$ built in the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$.

Hint: check that the statements continue to apply with $S = \bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}\Psi\Sigma^{-1}$.
The last results and the decomposition (4.10) explain the behavior of the estimate $\tilde{\theta}$ in a very general situation. The considered model is $Y = f + \varepsilon$. We assume a linear parametric structure and independent homogeneous noise. The estimation procedure is in fact a kind of projection of the data $Y$ onto a $p$-dimensional linear subspace of $\mathbb{R}^{n}$ spanned by the given basis vectors $\psi_1,\dots,\psi_p$. This projection, as a linear operator, can be decomposed into a projection of the deterministic vector $f$ and a projection of the random noise $\varepsilon$. If the linear parametric assumption $f \in \langle \psi_1,\dots,\psi_p \rangle$ is correct, that is, $f = \theta_1^{*}\psi_1 + \dots + \theta_p^{*}\psi_p$, then this projection keeps $f$ unchanged and only the random noise is reduced. If $f$ cannot be exactly expanded using the basis $\psi_1,\dots,\psi_p$, then the procedure recovers the projection of $f$ onto this subspace. The latter projection can be written as $\Psi^{\top}\theta^{\dagger}$, and the vector $\theta^{\dagger}$ can be viewed as the target of estimation.
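The following sketch (ours, not from the text) illustrates that under a misspecified linear parametric assumption the estimate $\tilde{\theta}$ is centered at the projection target $\theta^{\dagger} = (\Psi\Psi^{\top})^{-1}\Psi f$ of Theorem 4.4.3:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 60, 2, 0.3
x = np.linspace(0, 1, n)
Psi = np.vstack([np.ones(n), x])     # linear basis (intercept and slope)
f = np.cos(2 * x)                    # true response, NOT linear in x
S = np.linalg.inv(Psi @ Psi.T) @ Psi
theta_dagger = S @ f                 # target of estimation

# repeated estimates over fresh noise realizations
est = np.array([S @ (f + sigma * rng.standard_normal(n)) for _ in range(3000)])
```

Averaging the estimates over many noise realizations reproduces $\theta^{\dagger}$, not any "true" parameter, since none exists here.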
4.4.3 Risk of estimation. R-efficiency
This section briefly discusses how the obtained properties of the estimate $\tilde{\theta}$ can be used to evaluate the risk of estimation. A particularly important question is the optimality of the MLE $\tilde{\theta}$. The main result of the section claims that $\tilde{\theta}$ is R-efficient if the model is correctly specified, and is not if there is a misspecification.

We start with the case of a correct parametric specification $Y = \Psi^{\top}\theta^{*} + \varepsilon$, that is, the linear parametric assumption (LPA) $f = \Psi^{\top}\theta^{*}$ is exactly fulfilled and the noise $\varepsilon$ is homogeneous: $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. Later we extend the result to the case when the LPA is not fulfilled and to the case when the noise is not homogeneous but still correctly specified. Finally we discuss the case when the noise structure is misspecified.
Under the LPA $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, the estimate $\tilde{\theta}$ is also normal with mean $\theta^{*}$ and variance $V^{2} = \sigma^{2}SS^{\top} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}$. Define a $p\times p$ symmetric matrix $D$ by the equation
\[
D^{2} = \frac{1}{\sigma^{2}} \sum_{i=1}^{n} \Psi_i\Psi_i^{\top} = \frac{1}{\sigma^{2}}\Psi\Psi^{\top}.
\]
Clearly $V^{2} = D^{-2}$.
Now we show that $\tilde{\theta}$ is R-efficient. Actually this fact can be derived from the Cramer-Rao Theorem because the Gaussian model is a special case of an exponential family. However, we check this statement directly by computing the Cramer-Rao efficiency bound. Recall that the Fisher information matrix $I(\theta)$ for the log-likelihood
L(θ) is defined as the variance of ∇L(θ) under IPθ .
Theorem 4.4.4 (Gauss-Markov). Let $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. Then $\tilde{\theta}$ is an R-efficient estimate of $\theta^{*}$: $\mathbb{E}\tilde{\theta} = \theta^{*}$,
\[
\mathbb{E}\bigl[\bigl(\tilde{\theta} - \theta^{*}\bigr)\bigl(\tilde{\theta} - \theta^{*}\bigr)^{\top}\bigr] = \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2},
\]
and for any linear unbiased estimate $\bar{\theta}$ satisfying $\mathbb{E}\bar{\theta} = \theta^{*}$, it holds
\[
\operatorname{Var}\bigl(\bar{\theta}\bigr) \ge \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}.
\]
Proof. Theorems 4.4.1 and 4.4.2 imply that $\tilde{\theta} \sim N(\theta^{*}, V^{2})$ with $V^{2} = \sigma^{2}(\Psi\Psi^{\top})^{-1} = D^{-2}$. Next we show that for any $\theta$
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2},
\]
that is, the Fisher information does not depend on the model function $f$. The log-likelihood $L(\theta)$ for the model $Y \sim N(\Psi^{\top}\theta^{*}, \sigma^{2}\mathbb{I}_n)$ reads as
\[
L(\theta) = -\frac{1}{2\sigma^{2}}(Y - \Psi^{\top}\theta)^{\top}(Y - \Psi^{\top}\theta) - \frac{n}{2}\log(2\pi\sigma^{2}).
\]
This yields for its gradient $\nabla L(\theta)$:
\[
\nabla L(\theta) = \sigma^{-2}\Psi(Y - \Psi^{\top}\theta),
\]
and in view of $\operatorname{Var}(Y) = \Sigma = \sigma^{2}\mathbb{I}_n$, it holds
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = \sigma^{-4}\Psi \operatorname{Var}(Y)\Psi^{\top} = \sigma^{-2}\Psi\Psi^{\top},
\]
as required.
The R-efficiency of $\tilde{\theta}$ follows from the Cramer-Rao efficiency bound because $\bigl\{\operatorname{Var}(\tilde{\theta})\bigr\}^{-1} = \operatorname{Var}\bigl\{\nabla L(\theta)\bigr\}$. However, we present an independent proof of this fact. Actually we prove a sharper result: the variance of a linear unbiased estimate $\bar{\theta}$ coincides with the variance of $\tilde{\theta}$ only if $\bar{\theta}$ coincides almost surely with $\tilde{\theta}$; otherwise it is larger. The idea of the proof is quite simple. Consider the difference $\bar{\theta} - \tilde{\theta}$ and show that the condition $\mathbb{E}\bar{\theta} = \mathbb{E}\tilde{\theta} = \theta^{*}$ implies the orthogonality $\mathbb{E}\bigl\{\tilde{\theta}(\bar{\theta} - \tilde{\theta})^{\top}\bigr\} = 0$. This, in turn, implies $\operatorname{Var}(\bar{\theta}) = \operatorname{Var}(\tilde{\theta}) + \operatorname{Var}(\bar{\theta} - \tilde{\theta}) \ge \operatorname{Var}(\tilde{\theta})$. So, it remains to check the orthogonality of $\tilde{\theta}$ and $\bar{\theta} - \tilde{\theta}$. Let $\bar{\theta} = AY$ for a $p\times n$ matrix $A$ with $\mathbb{E}\bar{\theta} \equiv \theta^{*}$ for all $\theta^{*}$. This identity and $\mathbb{E}Y = \Psi^{\top}\theta^{*}$ imply that $A\Psi^{\top}\theta^{*} \equiv \theta^{*}$, i.e. $A\Psi^{\top}$ is the identity $p\times p$ matrix. The same is true for $\tilde{\theta} = SY$, yielding $S\Psi^{\top} = \mathbb{I}_p$. Next, in view of $\mathbb{E}\bar{\theta} = \mathbb{E}\tilde{\theta} = \theta^{*}$,
\[
\mathbb{E}\bigl\{(\bar{\theta} - \tilde{\theta})\tilde{\theta}^{\top}\bigr\} = \mathbb{E}(A - S)\varepsilon\varepsilon^{\top}S^{\top} = \sigma^{2}(A - S)\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1} = 0,
\]
and the assertion follows.
Exercise 4.4.3. Check the details of the proof of the theorem. Show that the statement $\operatorname{Var}(\bar{\theta}) \ge \operatorname{Var}(\tilde{\theta})$ only uses that $\bar{\theta}$ is unbiased and that $\mathbb{E}Y = \Psi^{\top}\theta^{*}$ and $\operatorname{Var}(Y) = \sigma^{2}\mathbb{I}_n$.
Exercise 4.4.4. Compute $\nabla^{2}L(\theta)$. Check that it is non-random, does not depend on $\theta$, and fulfills for every $\theta$ the identity
\[
\nabla^{2}L(\theta) \equiv -\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = -D^{2}.
\]
A colored noise
The majority of the presented results continue to apply in the case of heterogeneous and even dependent noise with $\operatorname{Var}(\varepsilon) = \Sigma_0$. The key facts behind this extension are the decomposition (4.10) and the properties of the stochastic component $\zeta$ from Section 4.4.1: $\zeta \sim N(0, V^{2})$. In the case of a colored noise, the definition of $V$ and $D$ is changed to
\[
D^{2} \stackrel{\mathrm{def}}{=} V^{-2} = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]

Exercise 4.4.5. State and prove the analog of Theorem 4.4.4 for the colored noise $\varepsilon \sim N(0, \Sigma_0)$.
A misspecified LPA
An interesting feature of our results so far is that they apply equally to the correct linear specification $f = \Psi^{\top}\theta^{*}$ and to the case when the identity $f = \Psi^{\top}\theta$ is not precisely fulfilled whatever $\theta$ is taken. In this situation the target of analysis is the vector $\theta^{\dagger}$ describing the best linear approximation of $f$ by $\Psi^{\top}\theta$. We already know from the results of Sections 4.4.1 and 4.4.2 that the estimate $\tilde{\theta}$ is also normal with mean $\theta^{\dagger} = Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f$ and variance $V^{2} = \sigma^{2}SS^{\top} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}$.
Theorem 4.4.5. Assume $Y = f + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, and let $\theta^{\dagger} = Sf$. Then $\tilde{\theta}$ is an R-efficient estimate of $\theta^{\dagger}$: $\mathbb{E}\tilde{\theta} = \theta^{\dagger}$,
\[
\mathbb{E}\bigl[\bigl(\tilde{\theta} - \theta^{\dagger}\bigr)\bigl(\tilde{\theta} - \theta^{\dagger}\bigr)^{\top}\bigr] = \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2},
\]
and for any linear unbiased estimate $\bar{\theta}$ satisfying $\mathbb{E}\bar{\theta} = \theta^{\dagger}$, it holds
\[
\operatorname{Var}\bigl(\bar{\theta}\bigr) \ge \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}.
\]
Proof. The proof only utilizes that $\tilde{\theta} \sim N(\theta^{\dagger}, V^{2})$ with $V^{2} = D^{-2}$. The only small remark concerns the equality $\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2}$ from Theorem 4.4.4.

Exercise 4.4.6. Check the identity $\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2}$ from Theorem 4.4.4 for $\varepsilon \sim N(0, \Sigma_0)$.
4.4.4 The case of a misspecified noise
Here we again consider the linear parametric assumption $Y = \Psi^{\top}\theta^{*} + \varepsilon$. However, contrary to the previous section, we admit that the noise $\varepsilon$ is not homogeneous normal: $\varepsilon \sim N(0, \Sigma_0)$, while our estimation procedure is the quasi MLE based on the assumption of noise homogeneity $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. We already know that the estimate $\tilde{\theta}$ is unbiased with mean $\theta^{*}$ and variance $V^{2} = S\Sigma_0 S^{\top}$, where $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. This gives
\[
V^{2} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]
The question is whether the estimate $\tilde{\theta}$ based on the misspecified distributional assumption is efficient. The Cramer-Rao result delivers the lower bound for the quadratic risk in the form $\operatorname{Var}(\bar{\theta}) \ge \bigl[\operatorname{Var}\bigl(\nabla L(\theta)\bigr)\bigr]^{-1}$. We already know that the use of the correctly specified covariance matrix of the errors leads to an R-efficient estimate $\tilde{\theta}$. The next result shows that the use of a misspecified matrix $\Sigma$ results in an estimate which is unbiased but not R-efficient, that is, the best estimation risk is achieved if we apply the correct model assumptions.
Theorem 4.4.6. Let $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma_0)$. Then
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]
The estimate $\tilde{\theta} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y$ is unbiased, that is, $\mathbb{E}\tilde{\theta} = \theta^{*}$, but it is not R-efficient unless $\Sigma_0 = \Sigma$.
Proof. Let $\tilde{\theta}_0$ be the MLE for the correct model specification with the noise $\varepsilon \sim N(0, \Sigma_0)$. As $\tilde{\theta}$ is unbiased, the difference $\tilde{\theta} - \tilde{\theta}_0$ is orthogonal to $\tilde{\theta}_0$, and it holds for the variance of $\tilde{\theta}$
\[
\operatorname{Var}(\tilde{\theta}) = \operatorname{Var}(\tilde{\theta}_0) + \operatorname{Var}(\tilde{\theta} - \tilde{\theta}_0);
\]
cf. the proof of the Gauss-Markov Theorem 4.4.4.

Exercise 4.4.7. Compare directly the variances of $\tilde{\theta}$ and of $\tilde{\theta}_0$.
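In the spirit of Exercise 4.4.7, the two variance matrices can be compared directly in a numerical sketch (ours, not from the text): the gap between the variance of the quasi MLE and that of the correctly specified MLE should be positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 3
Psi = rng.standard_normal((p, n))
Sigma0 = np.diag(rng.uniform(0.2, 3.0, size=n))   # true heterogeneous covariance

G = np.linalg.inv(Psi @ Psi.T)
# variance of the quasi MLE built under the (wrong) assumption Sigma = sigma^2 I
V2_quasi = G @ Psi @ Sigma0 @ Psi.T @ G
# variance of the MLE using the correct covariance Sigma0
V2_true = np.linalg.inv(Psi @ np.linalg.inv(Sigma0) @ Psi.T)

gap = V2_quasi - V2_true
eigs = np.linalg.eigvalsh(gap)
```

All eigenvalues of the gap are (numerically) nonnegative, confirming that using the wrong covariance never helps.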
4.5 Linear models and quadratic log-likelihood
Linear Gaussian modeling leads to a specific log-likelihood structure; see Section 4.2. Namely, the log-likelihood function $L(\theta)$ is quadratic in $\theta$, the coefficients of the quadratic terms are deterministic, and the cross term is linear both in $\theta$ and in the observations $Y_i$. Here we show that this geometric structure of the log-likelihood characterizes linear models. We say that $L(\theta)$ is quadratic if it is a quadratic function of $\theta$ and there is a deterministic symmetric matrix $D^{2}$ such that for any $\theta^{\circ}, \theta$
\[
L(\theta) - L(\theta^{\circ}) = (\theta - \theta^{\circ})^{\top}\nabla L(\theta^{\circ}) - (\theta - \theta^{\circ})^{\top}D^{2}(\theta - \theta^{\circ})/2. \tag{4.13}
\]
Here $\nabla L(\theta) \stackrel{\mathrm{def}}{=} \frac{dL(\theta)}{d\theta}$. As usual we define
\[
\tilde{\theta} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} L(\theta), \qquad
\theta^{*} = \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta).
\]
The next result describes some properties of the estimate $\tilde{\theta}$ which are entirely based on the geometric (quadratic) structure of the function $L(\theta)$. All the results are stated using the matrix $D^{2}$ and the vector $\zeta = \nabla L(\theta^{*})$.
Theorem 4.5.1. Let $L(\theta)$ be quadratic with a non-degenerate matrix $D^{2}$. Then
\[
\tilde{\theta} - \theta^{*} = D^{-2}\zeta \tag{4.14}
\]
with $\zeta \stackrel{\mathrm{def}}{=} \nabla L(\theta^{*})$. Moreover, $\mathbb{E}\zeta = 0$, and it holds with $V^{2} = \operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^{\top}$
\[
\mathbb{E}\tilde{\theta} = \theta^{*}, \qquad \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}V^{2}D^{-2}.
\]
Further, for any $\theta$,
\[
L(\tilde{\theta}) - L(\theta) = (\tilde{\theta} - \theta)^{\top}D^{2}(\tilde{\theta} - \theta)/2 = \|D(\tilde{\theta} - \theta)\|^{2}/2. \tag{4.15}
\]
Finally, it holds for the excess $L(\tilde{\theta}, \theta^{*}) \stackrel{\mathrm{def}}{=} L(\tilde{\theta}) - L(\theta^{*})$
\[
2L(\tilde{\theta}, \theta^{*}) = (\tilde{\theta} - \theta^{*})^{\top}D^{2}(\tilde{\theta} - \theta^{*}) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^{2} \tag{4.16}
\]
with $\xi = D^{-1}\zeta$.
Proof. The equation (4.13) with $\theta^{\circ} = \theta^{*}$ implies for any $\theta$
\[
\nabla L(\theta) = \nabla L(\theta^{\circ}) - D^{2}(\theta - \theta^{\circ}) = \zeta - D^{2}(\theta - \theta^{*}). \tag{4.17}
\]
In particular, $\nabla L(\tilde{\theta}) = 0$ yields (4.14). Therefore, it holds for the expectation $\mathbb{E}L(\theta)$
\[
\nabla \mathbb{E}L(\theta) = \mathbb{E}\zeta - D^{2}(\theta - \theta^{*}),
\]
and the equation $\nabla \mathbb{E}L(\theta^{*}) = 0$ implies $\mathbb{E}\zeta = 0$.

To show (4.15), apply again the property (4.13) with $\theta^{\circ} = \tilde{\theta}$:
\[
L(\theta) - L(\tilde{\theta}) = (\theta - \tilde{\theta})^{\top}\nabla L(\tilde{\theta}) - (\theta - \tilde{\theta})^{\top}D^{2}(\theta - \tilde{\theta})/2
= -(\theta - \tilde{\theta})^{\top}D^{2}(\theta - \tilde{\theta})/2.
\]
Here we used that $\nabla L(\tilde{\theta}) = 0$ because $\tilde{\theta}$ is an extreme point of $L(\theta)$. The last result (4.16) is the special case $\theta = \theta^{*}$, in view of (4.14).
This theorem delivers an important message: the main properties of the MLE $\tilde{\theta}$ can be explained via the geometric (quadratic) structure of the log-likelihood. An interesting question to clarify is whether a quadratic log-likelihood structure is specific to linear Gaussian models. The answer is positive: there is a one-to-one correspondence between linear Gaussian models and quadratic log-likelihood functions. Indeed, the identity (4.17) with $\theta^{\circ} = \theta^{*}$ can be rewritten as
\[
\nabla L(\theta) + D^{2}\theta \equiv \zeta + D^{2}\theta^{*},
\]
so the left-hand side does not depend on $\theta$. If we fix any $\theta$ and define $Y = \nabla L(\theta) + D^{2}\theta$, this yields
\[
Y = D^{2}\theta^{*} + \zeta.
\]
Similarly, $Y \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^{2}\theta\bigr\}$ yields the equation
\[
Y = D\theta^{*} + \xi, \tag{4.18}
\]
where $\xi = D^{-1}\zeta$. We can summarize as follows.
Theorem 4.5.2. Let $L(\theta)$ be quadratic with a non-degenerate matrix $D^{2}$. Then $Y \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^{2}\theta\bigr\}$ does not depend on $\theta$, and $L(\theta) - L(\theta^{*})$ is the quasi log-likelihood ratio for the linear Gaussian model (4.18) with $\xi$ standard normal. It is the true log-likelihood if and only if $\zeta \sim N(0, D^{2})$.

Proof. The model (4.18) with $\xi \sim N(0, \mathbb{I}_p)$ leads to the log-likelihood ratio
\[
(\theta - \theta^{*})^{\top}D(Y - D\theta^{*}) - \|D(\theta - \theta^{*})\|^{2}/2 = (\theta - \theta^{*})^{\top}\zeta - \|D(\theta - \theta^{*})\|^{2}/2,
\]
which coincides with $L(\theta) - L(\theta^{*})$ from (4.13). Also $\zeta \sim N(0, D^{2})$ if and only if $\xi = D^{-1}\zeta$ is standard normal.
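The identities (4.14) and (4.16) can be verified exactly (up to rounding) for a homogeneous Gaussian linear model, where $\zeta = \sigma^{-2}\Psi(Y - \Psi^{\top}\theta^{*})$ and $D^{2} = \sigma^{-2}\Psi\Psi^{\top}$. A sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 20, 4, 0.7
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p)
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

def L(th):
    # Gaussian log-likelihood up to a constant not depending on th
    return -0.5 / sigma ** 2 * np.sum((Y - Psi.T @ th) ** 2)

D2 = Psi @ Psi.T / sigma ** 2
zeta = Psi @ (Y - Psi.T @ theta_star) / sigma ** 2   # score nabla L(theta*)
theta_hat = np.linalg.solve(Psi @ Psi.T, Psi @ Y)    # the MLE
```

Both $\tilde{\theta} - \theta^{*} = D^{-2}\zeta$ and $2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta$ hold as exact algebraic identities, independently of the noise realization.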
4.6 Inference based on the maximum likelihood
All the results presented above for linear models were based on the explicit representation of the (quasi) MLE $\tilde{\theta}$. Here we present an approach based on the analysis of the maximum likelihood itself. This approach does not require any analytic expression for the point of maximum of the (quasi) likelihood process $L(\theta)$; instead we work directly with the maximum of this process. We establish exponential inequalities for the "fitted likelihood" $L(\tilde{\theta}, \theta^{*})$. We also show how these results can be used to study the accuracy of the MLE $\tilde{\theta}$, in particular, for building confidence sets.

One more benefit of the ML-based approach is that it applies equally to a homogeneous and to a heterogeneous noise, provided that the noise structure is not misspecified. The celebrated chi-squared result about the maximum likelihood $L(\tilde{\theta}, \theta^{*})$ claims that the distribution of $2L(\tilde{\theta}, \theta^{*})$ is chi-squared with $p$ degrees of freedom, $\chi^{2}_p$, and it does not depend on the noise covariance; see below in this section.
Now we specify the setup. The starting point of the ML-approach is the linear Gaussian model assumption $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. The corresponding log-likelihood ratio $L(\theta)$ can be written as
\[
L(\theta) = -\frac{1}{2}(Y - \Psi^{\top}\theta)^{\top}\Sigma^{-1}(Y - \Psi^{\top}\theta) + R, \tag{4.19}
\]
where the remainder term $R$ does not depend on $\theta$. Now one can see that $L(\theta)$ is a quadratic function of $\theta$. Moreover, $\nabla^{2}L(\theta) = -\Psi\Sigma^{-1}\Psi^{\top}$, so that $L(\theta)$ is quadratic with $D^{2} = \Psi\Sigma^{-1}\Psi^{\top}$. This enables us to apply the general results of Section 4.5, which are only based on the geometric (quadratic) structure of the log-likelihood $L(\theta)$: the true data distribution can be arbitrary.
Theorem 4.6.1. Consider $L(\theta)$ from (4.19). For any $\theta$, it holds with $D^{2} = \Psi\Sigma^{-1}\Psi^{\top}$
\[
L(\tilde{\theta}, \theta) = (\tilde{\theta} - \theta)^{\top}D^{2}(\tilde{\theta} - \theta)/2. \tag{4.20}
\]
In particular, if $\Sigma = \sigma^{2}\mathbb{I}_n$, then the fitted log-likelihood is proportional to the quadratic loss $\|\tilde{f} - f_{\theta}\|^{2}$ for $\tilde{f} = \Psi^{\top}\tilde{\theta}$ and $f_{\theta} = \Psi^{\top}\theta$:
\[
L(\tilde{\theta}, \theta) = \frac{1}{2\sigma^{2}}\bigl\|\Psi^{\top}(\tilde{\theta} - \theta)\bigr\|^{2} = \frac{1}{2\sigma^{2}}\bigl\|\tilde{f} - f_{\theta}\bigr\|^{2}.
\]
If $\theta^{*} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta) = D^{-2}\Psi\Sigma^{-1}f$ for $f = \mathbb{E}Y$, then
\[
2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^{2} \tag{4.21}
\]
with $\zeta = \nabla L(\theta^{*})$ and $\xi \stackrel{\mathrm{def}}{=} D^{-1}\zeta$. Moreover, if the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$ is correct, then $\xi \sim N(0, \mathbb{I}_p)$ and $2L(\tilde{\theta}, \theta^{*}) \sim \chi^{2}_p$ is chi-squared with $p$ degrees of freedom.
Proof. The results (4.20) and (4.21) follow from Theorem 4.5.1; see (4.15) and (4.16). Further,
\[
\zeta = \nabla L(\theta^{*}) = \Psi\Sigma^{-1}(Y - \Psi^{\top}\theta^{*}) = \Psi\Sigma^{-1}\varepsilon.
\]
So, if $Y$ is Gaussian, then $\zeta$ is Gaussian as well, as a linear transformation of a Gaussian vector. By Theorem 4.5.1, $\mathbb{E}\zeta = 0$. Moreover, $\operatorname{Var}(\varepsilon) = \Sigma$ implies
\[
\operatorname{Var}(\zeta) = \Psi\Sigma^{-1}\,\mathbb{E}\varepsilon\varepsilon^{\top}\,\Sigma^{-1}\Psi^{\top} = \Psi\Sigma^{-1}\Psi^{\top} = D^{2},
\]
yielding that $\xi = D^{-1}\zeta$ is standard normal.
The last result $2L(\tilde{\theta}, \theta^{*}) \sim \chi^{2}_p$ is sometimes called the "chi-squared phenomenon": the distribution of the maximum likelihood only depends on the number of parameters to be estimated and is independent of the design $\Psi$, of the noise covariance matrix $\Sigma$, etc. This explains the use of the word "phenomenon" in the name of the result.
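The chi-squared phenomenon is easy to observe by simulation (our own sketch, not from the text): for an arbitrary design and an arbitrary known covariance $\Sigma = \Sigma_0$, the excess $2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta$ behaves like a $\chi^{2}_p$ variable, with mean $p$ and variance $2p$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 35, 4
Psi = rng.standard_normal((p, n))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))   # arbitrary known covariance
Si = np.linalg.inv(Sigma)
D2 = Psi @ Si @ Psi.T
root = np.linalg.cholesky(Sigma)                 # to sample eps ~ N(0, Sigma)

vals = []
for _ in range(4000):
    eps = root @ rng.standard_normal(n)
    zeta = Psi @ Si @ eps                        # score at theta*
    vals.append(zeta @ np.linalg.solve(D2, zeta))  # 2 L(theta_hat, theta*)
vals = np.array(vals)
```

Neither the design nor the covariance enters the limiting distribution; only $p$ does.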
Exercise 4.6.1. Check that the linear transformation Y′ = Σ−1/2Y of the data does
not change the value of the log-likelihood ratio L(θ,θ∗) and hence of the maximum
likelihood L(θ̃,θ∗) .
Hint: use the representation

L(θ) = −(1/2)(Y − Ψ>θ)>Σ−1(Y − Ψ>θ) + R
     = −(1/2)(Y′ − Ψ′>θ)>(Y′ − Ψ′>θ) + R

and check that the transformed data Y′ follow the model Y′ = Ψ′>θ∗ + ε′ with
Ψ′ = ΨΣ−1/2 and ε′ = Σ−1/2ε ∼ N(0, IIn) , yielding the same log-likelihood ratio as in
the original model.
Exercise 4.6.2. Assume homogeneous noise in (4.19) with Σ = σ2 IIn . Then it holds

2L(θ̃,θ∗) = σ−2‖Πε‖2 ,

where Π = Ψ>(ΨΨ>)−1Ψ is the projector in IRn onto the subspace spanned by the
vectors ψ1, . . . ,ψp .
Hint: use that ζ = σ−2Ψε , D2 = σ−2ΨΨ> , and

σ−2‖Πε‖2 = σ−2 ε>Π>Πε = σ−2 ε>Πε = ζ>D−2ζ.
We write the result of Theorem 4.6.1 in the form 2L(θ̃,θ∗) ∼ χ2p , where χ2p stands
for the chi-squared distribution with p degrees of freedom. This result can be used to
build likelihood-based confidence ellipsoids for the parameter θ∗ . Given z > 0 , define

E(z) = {θ : L(θ̃,θ) ≤ z} = {θ : supθ′ L(θ′) − L(θ) ≤ z}. (4.22)
Theorem 4.6.2. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) and consider the MLE θ̃ .
Define zα by IP(χ2p > 2zα) = α . Then E(zα) from (4.22) is an α -confidence set for θ∗ .
Exercise 4.6.3. Let D2 = ΨΣ−1Ψ> . Check that the likelihood-based CS E(zα) and
the estimate-based CS {θ : ‖D(θ̃ − θ)‖ ≤ z′α} with (z′α)2 = 2zα coincide in the case of
linear modeling:

E(zα) = {θ : ‖D(θ̃ − θ)‖2 ≤ 2zα}.
Another corollary of the chi-squared result is a concentration bound for the maximum
likelihood. A similar result was stated for the univariate exponential family model: the
value L(θ̃,θ∗) is stochastically bounded with exponential moments, and the bound does
not depend on the particular family, parameter value, sample size, etc. Now we extend
this result to the case of a linear Gaussian model. Indeed, Theorem 4.6.1 states
that the distribution of 2L(θ̃,θ∗) is chi-squared and only depends on the number of
parameters to be estimated. The latter distribution concentrates on a ball of radius of
order p1/2 , and the deviation probability is exponentially small.
Theorem 4.6.3. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) . Then for every x > 0 and
any κ ≥ 6.6 , it holds

IP(2L(θ̃,θ∗) > p + √(κxp) ∨ (κx)) = IP(‖D(θ̃ − θ∗)‖2 > p + √(κxp) ∨ (κx)) ≤ exp(−x). (4.23)
Proof. Define ξ def= D(θ̃ − θ∗) . By Theorem 4.4.4, ξ is a standard normal vector in IRp ,
and by Theorem 4.6.1, 2L(θ̃,θ∗) = ‖ξ‖2 . Now the statement (4.23) follows from the
general deviation bound for Gaussian quadratic forms; see Theorem 9.1.1.
The main message of this result can be explained as follows: the probability that the
estimate θ̃ deviates from the elliptic set {θ : ‖D(θ − θ∗)‖ ≤ z} starts to vanish
when z2 exceeds the dimensionality p of the parameter space. Similarly, the probability
that the true parameter θ∗ is not covered by the confidence set E(z) starts to vanish
when 2z exceeds p .
Corollary 4.6.4. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) . Then for every x > 0 and
κ ≥ 6.6 , it holds with 2z = p + √(κxp) ∨ (κx) that

IP(E(z) ∌ θ∗) ≤ exp(−x).
Exercise 4.6.4. Compute z ensuring a coverage of 95% in dimensions p = 1, 2, 10, 20 .
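One possible numerical answer to this exercise: since 2zα is the (1 − α) -quantile of χ2p , the values z follow from the chi-squared quantiles. The sketch below estimates them by Monte Carlo with NumPy alone (the simulation size 400 000 is an arbitrary choice; tabulated quantiles would do equally well).

```python
import numpy as np

rng = np.random.default_rng(1)
zs = {}
for p in (1, 2, 10, 20):
    # 2*z_alpha is the (1 - alpha)-quantile of chi^2_p; here alpha = 0.05
    two_z = np.quantile(rng.chisquare(p, size=400_000), 0.95)
    zs[p] = two_z / 2
    print(p, zs[p])
```

The printed values should be close to 1.92, 3.00, 9.15 and 15.71 for p = 1, 2, 10, 20 respectively.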
4.6.1 A misspecified LPA
Now we discuss the behavior of the fitted log-likelihood under a misspecified linear
parametric assumption IEY = Ψ>θ∗ . Let the response function f not be representable
as f = Ψ>θ∗ . Following Theorem 4.4.3, define θ† = Sf with S = (ΨΣ−1Ψ>)−1ΨΣ−1 .
This point provides the best approximation of the nonlinear response f by a linear
parametric fit Ψ>θ .
Theorem 4.6.5. Assume Y = f + ε with ε ∼ N(0, Σ) . Let θ† = Sf . Then

2L(θ̃,θ†) = ζ>D−2ζ = ‖ξ‖2 ∼ χ2p ,

where D2 = ΨΣ−1Ψ> , ζ = ∇L(θ†) = ΨΣ−1ε , ξ = D−1ζ is a standard normal vector in
IRp , and χ2p is a chi-squared random variable with p degrees of freedom. In particular,
E(zα) is an α -CS for the vector θ† and the bound of Corollary 4.6.4 applies.
Exercise 4.6.5. Prove the result of Theorem 4.6.5.
4.6.2 A misspecified noise structure
This section addresses the features of the maximum likelihood in the case when the
likelihood is built under a wrong assumption about the noise structure. To be specific,
we consider the likelihood for homogeneous noise Σ = σ2 IIn , while the true noise
covariance is only assumed to be non-degenerate. As one can expect, the chi-squared
result is no longer valid in this situation, and the distribution of the maximum likelihood
depends on the true noise covariance. However, the nice geometric structure of the
maximum likelihood manifested by Theorems 4.6.1 and 4.6.2 does not rely on the true
data distribution; it is only based on our structural assumptions on the considered model.
This helps to get rigorous results about the behavior of the maximum likelihood and
particularly about its concentration properties.
Recall the notation D2 = σ−2ΨΨ> . It is a symmetric p × p matrix describing the
covariance structure of the estimate θ̃ under the homogeneous noise.
Theorem 4.6.6. Let θ̃ be built for the model Y = Ψ>θ∗ + ε with ε ∼ N(0, σ2 IIn) ,
while the true noise covariance is Σ0 : IEε = 0 and Var(ε) = Σ0 . Then

2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2, (4.24)

where ξ is a random vector in IRp with IEξ = 0 and

Var(ξ) = B def= DSΣ0S>D = σ−4D−1ΨΣ0Ψ>D−1 for S = (ΨΨ>)−1Ψ .
Moreover, if ε ∼ N(0, Σ0) , then ξ ∼ N(0, B) .
Proof. The equality 2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2 has already been proved in
Theorem 4.6.1. Moreover, by Theorem 4.4.1, θ̃ − θ∗ = Sε with S = (ΨΨ>)−1Ψ , so that

Var(θ̃) = S Var(ε)S> = SΣ0S> = σ−4D−2ΨΣ0Ψ>D−2 .

This implies

Var(ξ) = IEξξ> = D Var(θ̃)D = DSΣ0S>D = σ−4D−1ΨΣ0Ψ>D−1 .

It remains to note that if ε is a Gaussian vector, then ξ = DSε is Gaussian as well.
One can see that the chi-squared result is no longer valid if the noise structure is
misspecified. An interesting question is whether the CS E(z) can still be applied in the
case of misspecified noise under a proper adjustment of the value z . Surprisingly, the
answer is not entirely negative. The reason is that the vector ξ from (4.24) is zero-mean
and its norm behaves similarly to the case of correct noise specification: the probability
IP(‖ξ‖ > z) starts to degenerate when z2 exceeds IE‖ξ‖2 . A general bound from
Theorem 9.1.2 in Section 9 implies the following bound for the coverage probability.
Corollary 4.6.7. Under the conditions of Theorem 4.6.6, for every x > 0 , it holds with
pB = tr(B) , v2 = 2 tr(B2) , and a∗ = ‖B‖∞ that

IP(2L(θ̃,θ∗) > pB + (2vx1/2) ∨ (6a∗x)) ≤ exp(−x).
Exercise 4.6.6. Show that an overestimation of the noise in the sense Σ ≥ Σ0 preserves
the coverage probability for the CS E(zα) , that is, if 2zα is the (1 − α) -quantile of χ2p ,
then IP(E(zα) ∌ θ∗) ≤ α .
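The identity IE 2L(θ̃,θ∗) = tr(B) behind Corollary 4.6.7 can be checked by simulation using the representation Var(ξ) = DSΣ0S>D from the proof of Theorem 4.6.6. The NumPy sketch below uses an illustrative diagonal Σ0 and arbitrary sizes; it is not part of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 4, 1.0
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p)
Sigma0 = np.diag(rng.uniform(0.5, 2.0, size=n))  # true heteroscedastic covariance

S = np.linalg.solve(Psi @ Psi.T, Psi)            # S = (Psi Psi^T)^{-1} Psi
D2 = Psi @ Psi.T / sigma**2                      # D^2 assumed by the (wrong) likelihood
w, V = np.linalg.eigh(D2)
D = V @ np.diag(np.sqrt(w)) @ V.T                # symmetric square root of D^2
B = D @ S @ Sigma0 @ S.T @ D                     # Var(xi) from Theorem 4.6.6

sd = np.sqrt(np.diag(Sigma0))
vals = np.empty(4000)
for i in range(4000):
    eps = sd * rng.standard_normal(n)
    theta_hat = S @ (Psi.T @ theta_star + eps)   # qMLE built for homogeneous noise
    d = D @ (theta_hat - theta_star)
    vals[i] = d @ d                              # = 2 L(theta_hat, theta_star)

print(vals.mean(), np.trace(B))                  # the two values should be close
```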
4.7 Ridge regression, projection, and shrinkage
This section discusses the important situation when the number of predictors ψj and
hence the number of parameters p in the linear model Y = Ψ>θ∗ + ε is not small
relative to the sample size. Then the application of the least squares or maximum
likelihood approach meets serious problems. The first is numerical: the definition of
the LSE θ̃ involves the inversion of the p × p matrix ΨΨ> , and such an inversion
becomes a delicate task for large p . The other problem concerns the inference for the
estimated parameter θ∗ . The risk bound and the width of the confidence set are
proportional to the parameter dimension p , and thus, with large p , the inference
statements become almost uninformative. In particular, if p is of the order of the sample
size n , even consistency is not achievable. One faces a really critical situation. We already
know that the MLE is the efficient estimate in the class of all unbiased estimates. At
the same time, it is highly inefficient in overparametrized models. The only way out
of this situation is to sacrifice the unbiasedness property in favor of reducing the model
complexity: some procedures can be more efficient than the MLE even if they are biased. This
section discusses one way of resolving these problems by regularization or shrinkage. To
be more specific, for the rest of the section we consider the following setup. The observed
vector Y follows the model
Y = f + ε (4.25)
with a homogeneous error vector ε : IEε = 0 , Var(ε) = σ2IIn . Noise misspecification is
not considered in this section.
Furthermore, we assume that a basis, or a collection of basis vectors ψ1, . . . ,ψp , is given
with p large. This allows for approximating the response vector f = IEY in the form
f = Ψ>θ∗ , or, equivalently,
f = θ∗1ψ1 + . . .+ θ∗pψp .
In many cases we will assume that the basis is already orthogonalized: ΨΨ> = IIp . The
model (4.25) can be rewritten as
Y = Ψ>θ∗ + ε, Var(ε) = σ2IIn .
The MLE or LSE of the parameter vector θ∗ for this model reads as

θ̃ = (ΨΨ>)−1ΨY , f̃ = Ψ>θ̃ = Ψ>(ΨΨ>)−1ΨY .
If the matrix ΨΨ> is degenerate or ill-conditioned, computing the MLE θ̃ meets serious
problems. Below we discuss how these problems can be treated.
4.7.1 Regularization and ridge regression
Let R be a positive symmetric p × p matrix. Then the sum ΨΨ> + R is positive
symmetric as well and can be inverted whatever the matrix Ψ is. This suggests to
replace (ΨΨ>)−1 by (ΨΨ> + R)−1 , leading to the regularized least squares estimate θ̃R
of the parameter vector θ and the corresponding response estimate f̃R :

θ̃R def= (ΨΨ> + R)−1ΨY , f̃R def= Ψ>(ΨΨ> + R)−1ΨY . (4.26)
Such a method is also called ridge regression. An example of choosing R is a multiple
of the identity matrix: R = α IIp , where α > 0 and IIp stands for the identity matrix
in IRp . This method is also called Tikhonov regularization, and it results in the
parameter estimate θ̃α and the response estimate f̃α :

θ̃α def= (ΨΨ> + α IIp)−1ΨY , f̃α def= Ψ>(ΨΨ> + α IIp)−1ΨY . (4.27)
A proper choice of the matrix R for the ridge regression method (4.26) or of the
parameter α for the Tikhonov regularization (4.27) is an important issue. Below we
discuss several approaches which lead to the estimate (4.26) with a specific choice of the
matrix R . The properties of the estimates θ̃R and f̃R will be studied in the context of
penalized likelihood estimation in the next section.
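For concreteness, the ridge estimate (4.26) and its Tikhonov special case (4.27) take one line each in NumPy. The sketch below uses arbitrary illustrative sizes and α , with decaying true coefficients as an assumption; it also shows that ridge shrinks the norm of the plain LSE.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, alpha, sigma = 30, 25, 1.0, 0.5
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p) / np.arange(1, p + 1)  # decaying coefficients
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

# ridge with penalty matrix R = alpha * I, i.e. Tikhonov regularization, eq. (4.27)
R = alpha * np.eye(p)
theta_R = np.linalg.solve(Psi @ Psi.T + R, Psi @ Y)
# plain LSE for comparison; may be numerically fragile when p is close to n
theta_ls = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
print(np.linalg.norm(theta_R), np.linalg.norm(theta_ls))   # ridge shrinks the norm
```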
4.7.2 Penalized likelihood. Bias and variance
The estimate (4.26) can be obtained in a natural way within the (quasi) ML approach
using penalized least squares. The classical unpenalized method is based on minimizing
the sum of squared residuals:

θ̃ = argmaxθ L(θ) = arginfθ ‖Y − Ψ>θ‖2

with L(θ) = −σ−2‖Y − Ψ>θ‖2/2 . (Here we omit the terms which do not depend on θ .)
Now we introduce an additional penalty into the objective function which penalizes
the complexity of the candidate vector θ , expressed by the value ‖Gθ‖2/2 for a
given symmetric matrix G . This choice of complexity measure implicitly assumes that
the vector θ ≡ 0 has the smallest complexity, equal to zero, and that the complexity
increases with the norm of Gθ . Define the penalized log-likelihood

LG(θ) def= L(θ) − ‖Gθ‖2/2
       = −(2σ2)−1‖Y − Ψ>θ‖2 − ‖Gθ‖2/2 − (n/2) log(2πσ2). (4.28)
The penalized ML problem reads as

θ̃G = argmaxθ LG(θ) = argminθ {(2σ2)−1‖Y − Ψ>θ‖2 + ‖Gθ‖2/2}.

A straightforward calculation leads to the expression (4.26) for θ̃G with R = σ2G2 :

θ̃G def= (ΨΨ> + σ2G2)−1ΨY . (4.29)
We see that θ̃G is again a linear estimate: θ̃G = SGY with SG = (ΨΨ> + σ2G2)−1Ψ .
The results of Section 4.4 explain that θ̃G in fact estimates the value θG defined by

θG = argmaxθ IELG(θ) = arginfθ IE{‖Y − Ψ>θ‖2 + σ2‖Gθ‖2} = (ΨΨ> + σ2G2)−1Ψf = SGf . (4.30)
In particular, if f = Ψ>θ∗ , then

θG = (ΨΨ> + σ2G2)−1ΨΨ>θ∗ (4.31)

and θG ≠ θ∗ unless G = 0 . In other words, the penalized MLE θ̃G is biased.
Exercise 4.7.1. Check that IEθ̃α = θα for θα = (ΨΨ> + α IIp)−1ΨΨ>θ∗ , and that the
bias ‖θα − θ∗‖ grows with the regularization parameter α .
The penalized MLE θ̃G leads to the response estimate f̃G = Ψ>θ̃G .
Exercise 4.7.2. Check that the penalized ML approach leads to the response estimate

f̃G = Ψ>θ̃G = Ψ>(ΨΨ> + σ2G2)−1ΨY = ΠGY

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ . Show that ΠG is a sub-projector in the sense that
‖ΠGu‖ ≤ ‖u‖ for any u ∈ IRn .
Exercise 4.7.3. Let Ψ be orthonormal: ΨΨ> = IIp . Then the penalized MLE θ̃G can
be represented as

θ̃G = (IIp + σ2G2)−1Z,

where Z = ΨY is the vector of empirical Fourier coefficients. Specify the result for the
case of a diagonal matrix G = diag(g1, . . . , gp) and describe the corresponding response
estimate f̃G .
The previous results indicate that introducing the penalization leads to some estimation
bias. One can ask about the benefit of using a penalized procedure. The next result
shows that penalization decreases the variance of estimation and thus makes the
procedure more stable.
Theorem 4.7.1. Let θ̃G be the penalized MLE from (4.29). Under noise homogeneity
Var(ε) = σ2 IIn , it holds IEθ̃G = θG , see (4.31), and

Var(θ̃G) = σ2SGS>G = σ2(ΨΨ> + σ2G2)−1ΨΨ>(ΨΨ> + σ2G2)−1.

In particular, Var(θ̃G) ≤ Var(θ̃) and Var(θ̃G) ≤ (σ−2ΨΨ> + G2)−1 . Moreover, the bias
‖θG − θ∗‖ monotonously increases in G2 , while the variance monotonously decreases
with the penalization G .
If ε ∼ N(0, σ2 IIn) , then θ̃G is also normal with mean θG and variance σ2SGS>G .
Proof. The first two moments of θ̃G are computed from θ̃G = SGY . Monotonicity of
the bias and variance of θ̃G is proved below in Exercise 4.7.6.
Exercise 4.7.4. Let Ψ be orthonormal: ΨΨ> = IIp . Describe Var(θ̃G) . Show that the
variance decreases with the penalization G in the sense that G1 ≥ G implies
Var(θ̃G1) ≤ Var(θ̃G) .
Exercise 4.7.5. Let ΨΨ> = IIp and let G = diag(g1, . . . , gp) be a diagonal matrix.
Compute the squared bias ‖θG − θ∗‖2 and show that it monotonously increases in each
gj for j = 1, . . . , p .
Exercise 4.7.6. Let G be a symmetric matrix and θ̃G the corresponding penalized
MLE. Show that the variance Var(θ̃G) decreases while the bias ‖θG − θ∗‖ increases in
G2 .
Hint: first reduce the situation to the case of an orthogonal design matrix Ψ with
ΨΨ> = Λ = diag(λ1, . . . , λp) by an orthonormal basis transformation. For ΨΨ> = Λ ,
show that for any vector w ∈ IRp and u = Λ1/2w , it holds

w>Var(θ̃G)w = σ2 u>(IIp + σ2Λ−1/2G2Λ−1/2)−2u ,

and this value decreases with G2 because IIp + σ2Λ−1/2G2Λ−1/2 increases. Show in a
similar way that

‖θG − θ∗‖2 = σ4‖(Λ + σ2G2)−1G2θ∗‖2 = σ2 u>B(IIp + B)−1Bu

with u = Λ1/2θ∗ and B = Λ−1/2G2Λ−1/2 . Show that the matrix B(IIp + B)−1B is
monotonously increasing in B , and thus in G2 , using diagonalization arguments and
the monotonicity of the function x2/(1 + x) in x ≥ 0 .
Putting together the results about the bias and the variance of θ̃G yields the statement
about the quadratic risk.

Theorem 4.7.2. Assume the model Y = Ψ>θ∗ + ε with Var(ε) = σ2 IIn . Then the
estimate θ̃G fulfills

IE‖θ̃G − θ∗‖2 = ‖θG − θ∗‖2 + σ2 tr(SGS>G).
This result is called the bias-variance decomposition. The choice of a proper regular-
ization is usually based on this decomposition: one selects a regularization from a given
class to provide the minimal possible risk. This approach is referred to as bias-variance
trade-off.
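The bias-variance decomposition of Theorem 4.7.2 can be verified by Monte Carlo. The NumPy sketch below uses an illustrative diagonal penalty G and arbitrary sizes; all constants are assumptions for the sake of the demonstration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 5, 1.0
Psi = rng.standard_normal((p, n))
G = np.diag(np.arange(1.0, p + 1))           # an illustrative diagonal penalty
theta_star = rng.standard_normal(p)
f = Psi.T @ theta_star

S_G = np.linalg.solve(Psi @ Psi.T + sigma**2 * G @ G, Psi)  # S_G from (4.29)
theta_G = S_G @ f                            # biased target theta_G, eq. (4.30)
bias2 = np.sum((theta_G - theta_star) ** 2)
var = sigma**2 * np.trace(S_G @ S_G.T)

# empirical quadratic risk over repeated noise draws
emp = np.mean([np.sum((S_G @ (f + sigma * rng.standard_normal(n)) - theta_star) ** 2)
               for _ in range(4000)])
print(emp, bias2 + var)                      # both sides of Theorem 4.7.2
```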
4.7.3 Inference for the penalized MLE
Here we discuss some properties of the penalized MLE θ̃G . In particular, we focus on the
construction of confidence and concentration sets based on the penalized log-likelihood.
We know that the regularized estimate θ̃G is the empirical counterpart of the value θG ,
which solves the regularized deterministic problem (4.30). We also know that the key
results are expressed via the excess LG(θ̃G,θG) = supθ LG(θ) − LG(θG) . The next
result extends Theorem 4.6.1 to the penalized likelihood.
Theorem 4.7.3. Let LG(θ) be the penalized log-likelihood from (4.28). Then

2LG(θ̃G,θG) = (θ̃G − θG)>(σ−2ΨΨ> + G2)(θ̃G − θG) (4.32)
            = σ−2 ε>ΠG ε (4.33)

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ .
In general, the matrix ΠG is not a projector; hence σ−2 ε>ΠG ε is not χ2 -distributed
and the chi-squared result does not apply.
Exercise 4.7.7. Prove (4.32).
Hint: apply the Taylor expansion to LG(θ) at θ̃G . Use that ∇LG(θ̃G) = 0 and
−∇2LG(θ) ≡ σ−2ΨΨ> + G2 .
Exercise 4.7.8. Prove (4.33).
Hint: show that θ̃G − θG = SGε with SG = (ΨΨ> + σ2G2)−1Ψ .
Straightforward corollaries of Theorem 4.7.3 are the concentration and confidence
probabilities. Define the confidence set EG(z) for θG as

EG(z) def= {θ : LG(θ̃G,θ) ≤ z}.

The definition implies the following bound for the coverage probability:
IP(EG(z) ∌ θG) ≤ IP(LG(θ̃G,θG) > z) . Now the representation (4.33) for LG(θ̃G,θG)
reduces the problem to a deviation bound for a quadratic form. We apply the general
result of Section 9.
Theorem 4.7.4. Let LG(θ) be the penalized log-likelihood from (4.28), and let
ε ∼ N(0, σ2 IIn) . Then it holds with pG = tr(ΠG) and v2G = 2 tr(Π2G) that

IP(2LG(θ̃G,θG) > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).
Similarly one can state the concentration result. Define D2G = σ−2ΨΨ> + G2 . Then

2LG(θ̃G,θG) = ‖DG(θ̃G − θG)‖2 ,

and the result of Theorem 4.7.4 can be restated as the concentration bound

IP(‖DG(θ̃G − θG)‖2 > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).

In other words, θ̃G concentrates on the set A(z,θG) = {θ : ‖DG(θ − θG)‖2 ≤ 2z} for
2z > pG .
4.7.4 Projection and shrinkage estimates
Consider a linear model Y = Ψ>θ∗ + ε in which the matrix Ψ is orthonormal in the
sense ΨΨ> = IIp . Then multiplication by Ψ maps this model into the sequence space
model Z = θ∗ + ξ , where Z = ΨY = (z1, . . . , zp)> is the vector of empirical Fourier
coefficients zj = ψ>j Y . The noise ξ = Ψε inherits the features of the original noise ε :
if ε is zero-mean and homogeneous, the same applies to ξ . The number of coefficients
p can be large or even infinite. To get a sensible estimate, one has to apply some
regularization method. The simplest one is called projection: one just keeps the first m
empirical coefficients z1, . . . , zm and drops the others. The corresponding parameter
estimate θ̃m reads as

θ̃m,j = zj for j ≤ m , and θ̃m,j = 0 otherwise.
The response vector f = IEY is estimated by Ψ>θ̃m , leading to the representation

f̃m = z1ψ1 + . . . + zmψm

with zj = ψ>j Y . In other words, f̃m is just the projection of the observed vector Y
onto the subspace Lm spanned by the first m basis vectors: Lm = ⟨ψ1, . . . ,ψm⟩ .
This explains the name of the method. Clearly one can study the properties of θ̃m
or f̃m using the methods of the previous sections. However, one question for this
approach is still open: a proper choice of m . The standard way of addressing this issue
is based on the analysis of the quadratic risk.
Consider first the prediction risk defined as R(f̃m) = IE‖f̃m − f‖2 . Below we focus
on the case of homogeneous noise with Var(ε) = σ2 IIn . An extension to colored
noise is possible. Recall that f̃m effectively estimates the vector fm = Πmf , where
Πm is the projector onto Lm ; see Section 4.3.3. Moreover, the quadratic risk R(f̃m)
can be decomposed as

R(f̃m) = ‖f − Πmf‖2 + σ2m = σ2m + ∑_{j=m+1}^p |θ∗j|2 .
Obviously, the squared bias ‖f − Πmf‖2 decreases with m , while the variance σ2m
grows linearly in m . Risk minimization leads to the so-called bias-variance trade-off:
one selects the m that minimizes the risk R(f̃m) over all possible m :

m∗ def= argmin_m R(f̃m) = argmin_m {‖f − Πmf‖2 + σ2m}.
Unfortunately this choice requires some information about the bias ‖f − Πmf‖ , which
depends on the unknown vector f . As this information is not available in typical
situations, the value m∗ is also called an oracle choice. A data-driven choice of m is
one of the central issues in nonparametric statistics.
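For illustration, the oracle m∗ can be computed when θ∗ is known. The deterministic NumPy sketch below assumes a polynomial decay of the coefficients and an illustrative noise level; both are arbitrary choices, not taken from the text.

```python
import numpy as np

p, sigma2 = 100, 0.01                       # sigma2 plays the role of sigma^2
theta_star = 1.0 / np.arange(1, p + 1) ** 1.5   # assumed smooth signal: decaying coefficients
s = theta_star ** 2
# tail[m-1] = sum_{j > m} theta*_j^2 , the squared bias of the projection estimate
tail = np.append(np.cumsum(s[::-1])[::-1][1:], 0.0)
m_grid = np.arange(1, p + 1)
risk = sigma2 * m_grid + tail               # R(f_m) = sigma^2 m + ||f - Pi_m f||^2
m_oracle = int(m_grid[np.argmin(risk)])
print(m_oracle, risk.min())
```

Increasing the noise level shifts the oracle towards smaller m , as the bias-variance trade-off suggests.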
The situation does not change if we consider the estimation risk IE‖θ̃m − θ∗‖2 . Indeed,
the basis orthogonality ΨΨ> = IIp implies for f = Ψ>θ∗

‖f̃m − f‖2 = ‖Ψ>θ̃m − Ψ>θ∗‖2 = ‖θ̃m − θ∗‖2 ,

and minimization of the estimation risk coincides with minimization of the prediction
risk.
A disadvantage of the projection method is that it either keeps an empirical coefficient
zj completely or discards it entirely. An extension of the projection method is called
shrinkage: one multiplies every empirical coefficient zj by a factor αj ∈ (0, 1) . This
leads to the shrinkage estimate θ̃α with

θ̃α,j = αj zj .

Here α stands for the vector of coefficients αj for j = 1, . . . , p . The projection method
is a special case of this shrinkage with αj equal to one or zero. Another popular choice
of the coefficients αj is given by

αj = (1 − j/m)β 1(j ≤ m) (4.34)
for some β > 0 and m ≤ p . This choice ensures that the coefficients αj smoothly
approach zero as j approaches the value m , and vanish for j > m . In this case,
the vector α is completely specified by the two parameters m and β . The projection
method corresponds to β = 0 . The design orthogonality ΨΨ> = IIp yields again that
the estimation risk IE‖θ̃α − θ∗‖2 coincides with the prediction risk IE‖f̃α − f‖2 .
Exercise 4.7.9. Let Var(ε) = σ2 IIn . Show that the risk R(f̃α) of the shrinkage
estimate f̃α fulfills

R(f̃α) def= IE‖f̃α − f‖2 = ∑_{j=1}^p |θ∗j|2 (1 − αj)2 + σ2 ∑_{j=1}^p α2j .

Specify the case of α = α(m,β) from (4.34). Evaluate the variance term σ2 ∑_j α2j .
Hint: approximate the sum over j by the integral ∫ (1 − x/m)^{2β}_+ dx .
The oracle choice is again defined by risk minimization:

α∗ def= argmin_α R(f̃α),

where the minimization is taken over the class of all considered coefficient vectors α .
One way of obtaining a shrinkage estimate in the sequence space model Z = θ∗ + ξ
is by using a roughness penalization. Let G be a symmetric matrix, and consider the
regularized estimate θ̃G from (4.29). The next result claims that if G is a diagonal
matrix, then θ̃G is a shrinkage estimate. Moreover, a general penalized MLE can be
represented as a shrinkage estimate after an orthogonal basis transformation.
Theorem 4.7.5. Let G be a diagonal matrix, G = diag(g1, . . . , gp) . The penalized MLE
θ̃G in the sequence space model Z = θ∗ + ξ with ξ ∼ N(0, σ2 IIp) coincides with the
shrinkage estimate θ̃α for αj = (1 + σ2g2j )−1 ≤ 1 . Moreover, a penalized MLE θ̃G for
a general matrix G can be reduced to a shrinkage estimate by a basis transformation in
the sequence space model.

Proof. The first statement for a diagonal matrix G follows from the representation
θ̃G = (IIp + σ2G2)−1Z . Next, let U be an orthogonal transform leading to the diagonal
representation G2 = U>D2U with D2 = diag(d21, . . . , d2p) . Then

U θ̃G = (IIp + σ2D2)−1UZ,

that is, U θ̃G is a shrinkage estimate in the transformed model UZ = Uθ∗ + Uξ .
In other words, roughness penalization results in some kind of shrinkage. Interestingly,
the inverse statement holds as well.
Exercise 4.7.10. Let θ̃α be a shrinkage estimate for a vector α = (αj) with αj ∈ (0, 1] .
Then there is a diagonal penalty matrix G such that θ̃α = θ̃G .
Hint: define the j th diagonal entry gj by the equation αj = (1 + σ2g2j )−1 .
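The equivalence between diagonal penalization and shrinkage stated in Theorem 4.7.5 and Exercise 4.7.10 can be verified directly. A minimal sketch with arbitrary values g1, . . . , gp and an arbitrary coefficient vector Z :

```python
import numpy as np

rng = np.random.default_rng(6)
p, sigma = 6, 0.3
g = np.arange(1.0, p + 1)                 # diagonal penalty G = diag(g1, ..., gp)
Z = rng.standard_normal(p)                # empirical Fourier coefficients Z = Psi Y

# penalized MLE in the sequence space model: (I + sigma^2 G^2)^{-1} Z
theta_pen = np.linalg.solve(np.eye(p) + sigma**2 * np.diag(g**2), Z)
# equivalent shrinkage weights alpha_j = (1 + sigma^2 g_j^2)^{-1}
alpha = 1.0 / (1.0 + sigma**2 * g**2)
theta_shr = alpha * Z
print(np.allclose(theta_pen, theta_shr))  # True
```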
4.7.5 Smoothness constraints and roughness penalty approach
Another way of reducing the complexity of the estimation procedure is based on
smoothness constraints. The notion of smoothness originates from regression estimation.
A non-linear regression function f is expanded using a Fourier or some other functional
basis, and θ∗ is the corresponding vector of coefficients. Smoothness properties of the
regression function imply a certain rate of decay of the corresponding Fourier coefficients:
the higher the frequency, the less information about the regression function is contained
in the related coefficient. This leads to the natural idea of replacing the original
optimization problem over the whole parameter space with a constrained optimization
over a subset of “smooth” parameter vectors. Here we consider one popular example,
Sobolev smoothness constraints, which effectively means that the s th derivative of the
function f has a bounded L2 -norm. A general Sobolev ball can be defined using a
diagonal matrix G :

BG(R) def= {θ : ‖Gθ‖ ≤ R}.

Now we consider the constrained ML problem:

θ̃G,R = argmaxθ∈BG(R) L(θ) = argminθ∈Θ: ‖Gθ‖≤R ‖Y − Ψ>θ‖2. (4.35)
The Lagrange multiplier method leads to the unconstrained problem

θ̃G,λ = argminθ {‖Y − Ψ>θ‖2 + λ‖Gθ‖2}.

A proper choice of λ ensures that the solution θ̃G,λ belongs to BG(R) and also solves
the problem (4.35). So the approach based on a Sobolev smoothness assumption leads
back to regularization and shrinkage.
4.8 Shrinkage in a linear inverse problem
This section extends the previous approaches to the situation with indirect observations.
More precisely, we focus on the model

Y = Af + ε, (4.36)

where A is a given linear operator (matrix) and f is the target of analysis. With an
obvious change of notation this problem can be put back into the general linear setup
Y = Ψ>θ + ε . The special focus is due to the facts that the target can be high-
dimensional or even functional and that the product A>A is usually ill-posed, so that
its inversion is a hard task. Below we consider separately the case when the spectral
representation for this problem is available and the general case.
4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates
Suppose that the eigenvectors of the matrix A>A are available. This allows for reducing
the model to the spectral representation by an orthogonal change of the coordinate
system: Z = Λu + Λ1/2ξ with a diagonal matrix Λ = diag{λ1, . . . , λp} and a
homogeneous noise Var(ξ) = σ2 IIp ; see Section 4.2.4. Below we assume without loss
of generality that the eigenvalues λj are ordered and decrease with j . This spectral
representation means that one observes empirical Fourier coefficients zj described by
the equations zj = λj uj + λ1/2j ξj for j = 1, . . . , p . The LSE or qMLE estimate of the
spectral parameter u is given by

ũ = Λ−1Z = (λ−11 z1, . . . , λ−1p zp)> .

Exercise 4.8.1. Consider the spectral representation Z = Λu + Λ1/2ξ . Check that the
LSE ũ indeed reads as ũ = Λ−1Z .

If the dimension p of the model is high or, specifically, if the spectral values λj
rapidly go to zero, it might be useful to track only the first few coefficients u1, . . . , um
and to set all the remaining ones to zero. The corresponding estimate
ũm = (ũm,1, . . . , ũm,p)> reads as

ũm,j def= λ−1j zj for j ≤ m , and ũm,j def= 0 otherwise.
It is usually referred to as a spectral cut-off estimate.
Exercise 4.8.2. Consider the linear model Y = Af + ε . Let U be an orthogonal
transform in IRp providing UA>AU> = Λ with a diagonal matrix Λ , leading to the
spectral representation for Z = UA>Y . Write the corresponding spectral cut-off
estimate f̃m for the original vector f . Show that computing this estimate only requires
knowing the first m eigenvalues and eigenvectors of the matrix A>A .
Similarly to the direct case, the spectral cut-off can be extended to spectral shrinkage:
one multiplies every empirical coefficient zj by a factor αj ∈ (0, 1) . This leads to the
spectral shrinkage estimate ũα with ũα,j = αj λ−1j zj . Here α stands for the vector
of coefficients αj for j = 1, . . . , p . The spectral cut-off method is a special case of this
shrinkage with αj equal to one or zero.
Exercise 4.8.3. Specify the spectral shrinkage ũα with a given vector α for the
situation of Exercise 4.8.2.
The spectral cut-off method can be described as follows. Let ψ1,ψ2, . . . be the
intrinsic orthonormal basis of the problem, composed of the standardized eigenvectors
of A>A and leading to the spectral representation Z = Λu + Λ1/2ξ with the target
vector u . In terms of the original target f , one is looking for a solution or an estimate
of the form f = ∑_j uj ψj . The design orthogonality allows to estimate every coefficient
uj independently of the others using the empirical Fourier coefficient zj = ψ>j Y .
Namely, ũj = λ−1j ψ>j Y = λ−1j zj . The LSE procedure tries to recover f as the full sum
f̃ = ∑_j ũj ψj . The projection method suggests to cut this sum at the index m :
f̃m = ∑_{j≤m} ũj ψj , while the shrinkage procedure is based on downweighting the
empirical coefficients ũj : f̃α = ∑_j αj ũj ψj .
Next we study the risk of the shrinkage method. Orthonormality of the basis {ψj}
allows to represent the loss as ‖ũα − u∗‖2 = ‖f̃α − f‖2 . Under noise homogeneity one
obtains the following result.

Theorem 4.8.1. Let Z = Λu∗ + Λ1/2ξ with Var(ξ) = σ2 IIp . It holds for the shrinkage
estimate ũα

R(ũα) def= IE‖ũα − u∗‖2 = ∑_{j=1}^p |αj − 1|2 |u∗j|2 + σ2 ∑_{j=1}^p α2j λ−1j .

Proof. The empirical Fourier coefficients zj are uncorrelated, with IEzj = λj u∗j and
Var(zj) = σ2λj . This implies

IE‖ũα − u∗‖2 = ∑_{j=1}^p IE|αj λ−1j zj − u∗j|2 = ∑_{j=1}^p {|αj − 1|2 |u∗j|2 + α2j σ2 λ−1j}

as required.
Risk minimization leads to the oracle choice of the vector α :

α∗ = argmin_α R(ũα) ,

where the minimum is taken over the set of all admissible vectors α .
A similar analysis can be done for the spectral cut-off method.

Exercise 4.8.4. Show that the risk of the spectral cut-off estimate ũm fulfills

R(ũm) = σ2 ∑_{j=1}^m λ−1j + ∑_{j=m+1}^p |u∗j|2 .

Specify the choice of the oracle cut-off index m∗ .
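A sketch of the oracle cut-off computation for this risk under assumed polynomial decay of the eigenvalues λj and of the true spectral coefficients u∗j (all numerical values are illustrative only). Note how, unlike in the direct case, the variance term grows with the inverse eigenvalues λ−1j , so the oracle cuts off much earlier when the spectrum decays fast.

```python
import numpy as np

p, sigma = 50, 0.05
lam = 1.0 / np.arange(1, p + 1) ** 2             # assumed decaying eigenvalues of A^T A
u_star = 1.0 / np.arange(1, p + 1)               # assumed true spectral coefficients
var_term = sigma**2 * np.cumsum(1.0 / lam)       # sigma^2 * sum_{j<=m} 1/lambda_j
s = u_star ** 2
bias_term = np.append(np.cumsum(s[::-1])[::-1][1:], 0.0)  # sum_{j>m} |u*_j|^2
risk = var_term + bias_term                      # R(u_m) from Exercise 4.8.4
m_oracle = int(np.argmin(risk)) + 1
print(m_oracle, risk.min())
```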
4.8.2 Galerkin method
A general problem with the spectral shrinkage approach is that it requires precise
knowledge of the intrinsic basis ψ1,ψ2, . . . , or equivalently of the eigenvalue
decomposition of A leading to the spectral representation. After this basis is fixed, one
can apply the projection or shrinkage method using the corresponding Fourier
coefficients. In some situations this basis is hardly available or difficult to compute. A
possible way out of this problem is to take some other orthogonal basis φ1,φ2, . . . which
is tractable and convenient but does not lead to the spectral representation of the model.
The Galerkin method is based on projecting the original high-dimensional problem to a
lower dimensional problem in terms of the new basis {φj} . Namely, without loss of
generality suppose that the target function f can be decomposed as

f = ∑_j uj φj .
This can be achieved, e.g., if f belongs to some Hilbert space and {φj} is an
orthonormal basis in this space. Now we cut this sum and replace the exact
decomposition by a finite approximation

f ≈ fm = ∑_{j≤m} uj φj = Φ>m um ,

where um = (u1, . . . , um)> and Φ>m = (φ1, . . . ,φm) is the matrix with columns
φ1, . . . ,φm . Now we plug this decomposition into the original equation Y = Af + ε .
This leads to the linear model Y = AΦ>m um + ε = Ψ>m um + ε with Ψm = ΦmA> .
The corresponding (quasi) MLE reads as

ũm = (ΨmΨ>m)−1ΨmY .
Note that computing this estimate only requires evaluating the action of the operator
A on the basis functions φ1, . . . ,φm and on the data Y . With this estimate ũm of the
vector um , one obtains the response estimate f̃m of the form

f̃m = Φ>m ũm = ũ1φ1 + . . . + ũmφm .

The properties of this estimate can be studied in the same way as for a general qMLE in
a linear model: the true data distribution follows (4.36), while we use the approximating
model Y = AΦ>m um + ε with ε ∼ N(0, σ2 IIn) for building the quasi likelihood.
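The Galerkin recipe can be sketched numerically. In the NumPy example below, A is a toy integration-type operator, the cosine vectors {φj} are a convenient basis that is deliberately not the eigenbasis of A>A , and the estimate is ũm = (ΨmΨ>m)−1ΨmY with Ψm = ΦmA> ; every specific choice (the operator, the basis, the sizes, the signal sin(2πt) ) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, sigma = 60, 6, 0.01
t = np.arange(1, n + 1) / n
A = np.tril(np.ones((n, n))) / n                 # toy integration-type operator
f = np.sin(2 * np.pi * t)                        # unknown response to recover
Y = A @ f + sigma * rng.standard_normal(n)

# convenient cosine basis phi_0, ..., phi_{m-1} (not the eigenbasis of A^T A)
Phi = np.array([np.cos(np.pi * k * t) for k in range(m)]) * np.sqrt(2.0 / n)
Phi[0] /= np.sqrt(2.0)
Psi_m = Phi @ A.T                                # Psi_m = Phi_m A^T
u_hat = np.linalg.solve(Psi_m @ Psi_m.T, Psi_m @ Y)
f_hat = Phi.T @ u_hat                            # Galerkin estimate of f
rel_err = np.linalg.norm(f_hat - f) / np.linalg.norm(f)
print(rel_err)
```

Only the action of A on the m basis vectors is needed; no eigendecomposition of A>A is computed.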
A further extension of the qMLE approach concerns the case when the operator A is
not precisely known. Instead, an approximation or an estimate Â is available. The
pragmatic way of tackling this problem is to use the model Y = Âfm + ε for building
the quasi likelihood. The use of the Galerkin method is quite natural in this situation
because the spectral representation for Â will not necessarily result in a similar
representation for the true operator A .
4.9 Semiparametric estimation
This section discusses the situation when the target of estimation does not coincide with
the parameter vector. This problem is usually referred to as semiparametric estimation.
One typical example is the problem of estimating a part of the parameter vector. More
generally one can try to estimate a given function/functional of the unknown parameter.
We focus here on linear modeling, that is, both the considered model and the considered
mapping of the parameter space to the target space are linear. For ease of presentation
we assume everywhere homogeneous noise with Var(ε) = σ2 IIn .
4.9.1 (θ,η) - and υ -setup
This section presents two equivalent descriptions of the semiparametric problem. The
first one assumes that the total parameter vector can be decomposed into the target
parameter θ and the nuisance parameter η . The second one operates with the total
parameter υ and the target θ is a linear mapping of υ .
We start with the (θ,η) -setup. Let the response Y be modeled in dependence of two
sets of factors: {ψj , j = 1, . . . , p} and {φm,m = 1, . . . , p1} . We are mostly interested
in understanding the impact of the first set {ψj} but we cannot ignore the influence of
the {φm} ’s. Otherwise the model would be incomplete. This situation can be described
by the linear model
Y = Ψ>θ∗ + Φ>η∗ + ε, (4.37)
where Ψ is the p× n matrix with the columns ψj , while Φ is the p1 × n -matrix with
the columns φm . We primarily aim at recovering the vector θ∗ , while the coefficients
η∗ are of secondary importance. The corresponding (quasi) log-likelihood reads as
L(θ,η) = −(2σ2)−1‖Y − Ψ>θ − Φ>η‖2 +R,
where R denotes the remainder term which does not depend on the parameters θ,η .
The more general υ -setup considers a general linear model
Y = Υ>υ∗ + ε, (4.38)
where Υ is the p∗ × n matrix of p∗ factors, and the target of estimation is a linear mapping θ∗ = Pυ∗ for a given operator P from IRp∗ to IRp . Obviously the (θ,η) -setup is a special case of the υ -setup. Conversely, a general υ -setup can be reduced back to the (θ,η) -setup by a change of variables.
Exercise 4.9.1. Consider the sequence space model Y = υ∗ + ξ in IRp and let the
target of estimation be the sum of the coefficients υ∗1 + . . . + υ∗p . Describe the υ -setup
for the problem. Reduce to (θ,η) -setup by an orthogonal change of the basis.
In the υ -setup, the (quasi) log-likelihood reads as
L(υ) = −(2σ2)−1‖Y − Υ>υ‖2 +R,
where R is the remainder which does not depend on υ . It implies quadraticity of the log-likelihood L(υ) with the full-dimensional matrix 𝒟2 = −∇2L(υ) and the gradient ∇L(υ∗) given by

𝒟2 = σ−2ΥΥ>,    ∇L(υ∗) = σ−2Υε.
Exercise 4.9.2. Show that for the model (4.37) with the stacked design Υ = (Ψ> Φ>)> it holds

𝒟2 = σ−2ΥΥ> = σ−2 ( ΨΨ>   ΨΦ>
                      ΦΨ>   ΦΦ> ),    ∇L(υ∗) = σ−2Υε = σ−2 ( Ψε
                                                              Φε ).
4.9.2 Orthogonality and product structure
Consider the model (4.37) under the orthogonality condition ΨΦ> = 0 . This condition
effectively means that the factors of interest {ψj} are orthogonal to the nuisance factors
{φm} . An important feature of this orthogonal case is that the model has the product
structure leading to the additive form of the log-likelihood. Consider the partial θ -model
Y = Ψ>θ + ε with the (quasi) log-likelihood

L(θ) = −(2σ2)−1‖Y − Ψ>θ‖2 +R.

Similarly, L1(η) = −(2σ2)−1‖Y − Φ>η‖2 +R1 denotes the log-likelihood in the partial η -model Y = Φ>η + ε .
Theorem 4.9.1. Assume the condition ΨΦ> = 0 . Then

L(θ,η) = L(θ) + L1(η) +R(Y ) (4.39)

where R(Y ) is independent of θ and η . This implies the block diagonal structure of the full-dimensional matrix 𝒟2 = σ−2ΥΥ> :

𝒟2 = σ−2 ( ΨΨ>    0
             0    ΦΦ> ) = ( D2   0
                             0   H2 ),

with D2 = σ−2ΨΨ> , H2 = σ−2ΦΦ> . Moreover, for any υ = (θ,η) ,

∇L(υ) = ( ∇L(θ)
           ∇L1(η) ).
Now we demonstrate how the general case can be reduced to the orthogonal one by a linear transformation of the nuisance parameter. Let C be a p × p1 matrix. Define η̆ = η + C>θ . Then the model equation Y = Ψ>θ + Φ>η + ε can be rewritten as

Y = Ψ>θ + Φ>(η̆ − C>θ) + ε = (Ψ − CΦ)>θ + Φ>η̆ + ε.

Now we select C to ensure the orthogonality. This leads to the equation

(Ψ − CΦ)Φ> = 0,

that is, C = ΨΦ>(ΦΦ>)−1 . So the original model can be rewritten as

Y = Ψ̆>θ + Φ>η̆ + ε,    Ψ̆ = Ψ − CΦ = Ψ(IIn −Πη), (4.40)

where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the nuisance factors {φm} . This construction has a natural interpretation: correcting the θ -factors ψ1, . . . ,ψp by removing their interaction with the nuisance factors φ1, . . . ,φp1 reduces the general case to the orthogonal one. We summarize:
Theorem 4.9.2. The linear model (4.37) can be represented in the orthogonal form

Y = Ψ̆>θ + Φ>η̆ + ε,

where Ψ̆ from (4.40) satisfies Ψ̆Φ> = 0 and η̆ = η + C>θ for C = ΨΦ>(ΦΦ>)−1 . Moreover, it holds for υ = (θ,η)

L(υ) = L(θ) + L1(η̆) +R(Y ) (4.41)

with

L(θ) = −(2σ2)−1‖Y − Ψ̆>θ‖2 +R,
L1(η̆) = −(2σ2)−1‖Y − Φ>η̆‖2 +R1.

Exercise 4.9.3. Show that for C = ΨΦ>(ΦΦ>)−1

∇L(θ) = ∇θL(υ)− C∇ηL(υ).
Exercise 4.9.4. Show that the remainder term R(Y ) in the decomposition (4.41) is the same as in the orthogonal case (4.39).

Exercise 4.9.5. Show that Ψ̆Ψ̆> ≤ ΨΨ> , with Ψ̆Ψ̆> ≠ ΨΨ> if ΨΦ> ≠ 0 .
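The orthogonalization (4.40) is easy to check numerically. The following sketch (not from the book; it uses a randomly simulated design and numpy, with illustrative variable names) verifies the orthogonality Ψ̆Φ> = 0 and the matrix ordering of Exercise 4.9.5:

```python
# Sketch (not from the book): numerical check of the orthogonalization
# construction (4.40) on a small simulated design. As in the text,
# Psi is the p x n matrix of target factors, Phi the p1 x n nuisance factors.
import numpy as np

rng = np.random.default_rng(0)
p, p1, n = 3, 4, 50
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))

# Projector on the span of the nuisance factors: Pi_eta = Phi' (Phi Phi')^{-1} Phi
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)

# Corrected factors Psi_breve = Psi (I - Pi_eta); cf. (4.40)
Psi_b = Psi @ (np.eye(n) - Pi_eta)

# Orthogonality: Psi_breve Phi' = 0 up to rounding
print(np.abs(Psi_b @ Phi.T).max())   # ~ 0

# Psi_breve Psi_breve' <= Psi Psi': the difference is positive semi-definite
diff = Psi @ Psi.T - Psi_b @ Psi_b.T
print(np.linalg.eigvalsh(diff).min() >= -1e-8)   # True
```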
4.9.3 Partial estimation
This section explains the important notion of partial estimation which is quite natural
and transparent in the (θ,η) -setup. Let some value η◦ of the nuisance parameter be
fixed. A particular case of this sort is just ignoring the factors {φm} corresponding to
the nuisance component, that is, one uses η◦ ≡ 0 . This approach is reasonable in certain situations, e.g. in the context of the projection method or a spectral cut-off.
Define the estimate θ(η◦) by partial optimization of the joint log-likelihood L(θ,η◦) w.r.t. the first parameter θ :

θ(η◦) = argmaxθ L(θ,η◦).

Obviously θ(η◦) is the MLE in the residual model Y − Φ>η◦ = Ψ>θ∗ + ε :

θ(η◦) = (ΨΨ>)−1Ψ(Y − Φ>η◦).
This allows for describing the properties of the partial estimate θ(η◦) similarly to the
usual parametric situation.
Theorem 4.9.3. Consider the model (4.37). Then the partial estimate θ(η◦) fulfills

IEθ(η◦) = θ∗ + (ΨΨ>)−1ΨΦ>(η∗ − η◦),    Var{θ(η◦)} = σ2(ΨΨ>)−1.

In words, θ(η◦) has the same variance as the MLE in the partial model Y = Ψ>θ∗ + ε but it is biased if ΨΦ>(η∗ − η◦) ≠ 0 . The ideal situation corresponds to the case when η◦ = η∗ . Then θ(η∗) is the MLE in the correctly specified θ -model: with Y (η∗) def= Y − Φ>η∗ ,

Y (η∗) = Ψ>θ∗ + ε.
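The bias formula of Theorem 4.9.3 can be checked on noise-free data, where the estimate equals its expectation. A minimal sketch with a randomly simulated design (illustrative only, not from the book):

```python
# Sketch (not from the book): check the bias formula of Theorem 4.9.3
# on noise-free data, where IE theta(eta°) can be read off exactly.
import numpy as np

rng = np.random.default_rng(1)
p, p1, n = 2, 3, 40
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
theta_s = rng.standard_normal(p)     # theta*
eta_s = rng.standard_normal(p1)      # eta*

Y_mean = Psi.T @ theta_s + Phi.T @ eta_s   # IE Y (epsilon set to zero)
eta0 = np.zeros(p1)                        # fixed nuisance value eta°

# Partial estimate theta(eta°) = (Psi Psi')^{-1} Psi (Y - Phi' eta°)
G = Psi @ Psi.T
theta_part = np.linalg.solve(G, Psi @ (Y_mean - Phi.T @ eta0))

# Theorem 4.9.3: IE theta(eta°) = theta* + (Psi Psi')^{-1} Psi Phi' (eta* - eta°)
bias = np.linalg.solve(G, Psi @ Phi.T @ (eta_s - eta0))
print(np.allclose(theta_part, theta_s + bias))   # True
```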
An interesting and natural question concerns the legitimacy of the partial estimation method: under which conditions is it justified and produces no estimation bias? The answer is given by Theorem 4.9.1: the orthogonality condition ΨΦ> = 0 ensures the desired feature because of the decomposition (4.39).
Theorem 4.9.4. Assume orthogonality ΨΦ> = 0 . Then the partial estimate θ(η◦) does not depend on the nuisance value η◦ used:

θ = θ(η◦) = θ(η∗) = (ΨΨ>)−1ΨY .

In particular, one can ignore the nuisance parameter and estimate θ∗ from the partial incomplete model Y = Ψ>θ∗ + ε .
Exercise 4.9.6. Check that the partial derivative ∂L(θ,η)/∂θ does not depend on η under the orthogonality condition.
The partial estimation can also be applied to the nuisance parameter η by inverting the roles of θ and η . Namely, given a fixed value θ◦ , one can optimize the joint log-likelihood L(θ◦,η) w.r.t. the second argument η , leading to the estimate

η(θ◦) def= argmaxη L(θ◦,η).

In the orthogonal situation the initial point θ◦ is not important and one can use the partial incomplete model Y = Φ>η∗ + ε .
4.9.4 Profile estimation
This section discusses one general profile likelihood method of estimating the target pa-
rameter θ in the semiparametric situation. Later we show its optimality and R-efficiency.
The method suggests to first estimate the entire parameter vector υ by using the (quasi)
ML method. Then the operator P is applied to the obtained estimate υ to produce the
estimate θ . One can describe this method as
υ = argmaxυ L(υ),    θ = P υ. (4.42)

The first step here is the usual LS estimation of υ∗ in the linear model (4.38):

υ = argminυ ‖Y − Υ>υ‖2 = (ΥΥ>)−1ΥY .

The estimate θ is obtained by applying P to υ :

θ = P υ = P (ΥΥ>)−1ΥY = SY (4.43)

with S = P (ΥΥ>)−1Υ . The properties of this estimate can be studied using the decomposition Y = f + ε with f = IEY ; cf. Section 4.4. In particular, it holds

IEθ = Sf ,    Var(θ) = S Var(ε)S>. (4.44)
If the noise ε is homogeneous with Var(ε) = σ2IIn , then

Var(θ) = σ2SS> = σ2P (ΥΥ>)−1P>. (4.45)

The next theorem summarizes our findings.

Theorem 4.9.5. Consider the model (4.38) with homogeneous errors Var(ε) = σ2IIn . The profile MLE θ follows (4.43). Its mean and variance are given by (4.44) and (4.45).
The profile MLE is usually written in the (θ,η) -setup. Let υ = (θ,η) . Then the target estimate θ is obtained by projecting the MLE (θ,η) onto the θ -coordinates. This procedure can be formalized as

θ = argmaxθ maxη L(θ,η).

Another way of describing the profile MLE is based on the partial optimization considered in the previous section. Define for each θ the value L(θ) by optimizing the log-likelihood L(υ) under the condition Pυ = θ :

L(θ) def= supυ: Pυ=θ L(υ) = supη L(θ,η). (4.46)

Then θ is defined by maximizing the profile fit L(θ) :

θ def= argmaxθ L(θ). (4.47)
Exercise 4.9.7. Check that (4.42) and (4.47) lead to the same estimate θ .
We use for the function L(θ) obtained by partial optimization (4.46) the same notation as for the function obtained by the orthogonal decomposition (4.41) in Section 4.9.2. Later we show that these two functions indeed coincide. This helps in understanding the structure of the profile estimate θ .
Consider first the orthogonal case ΨΦ> = 0 . This assumption greatly simplifies the study. In particular, the result of Theorem 4.9.4 for partial estimation obviously extends to the profile method in view of the product structure (4.39): when estimating the parameter θ , one can ignore the nuisance parameter η and proceed as if the partial model Y = Ψ>θ∗ + ε were correct. Theorem 4.9.1 implies:
Theorem 4.9.6. Assume that ΨΦ> = 0 in the model (4.37). Then the profile MLE θ from (4.47) coincides with the MLE in the partial model Y = Ψ>θ∗ + ε :

θ = argmaxθ L(θ) = argminθ ‖Y − Ψ>θ‖2 = (ΨΨ>)−1ΨY .

It holds IEθ = θ∗ and

θ − θ∗ = D−2ζ = D−1ξ

with D2 = σ−2ΨΨ> , ζ = σ−2Ψε , and ξ = D−1ζ . Finally, L(θ) from (4.46) fulfills

2{L(θ)− L(θ∗)} = ‖D(θ − θ∗)‖2 = ζ>D−2ζ = ‖ξ‖2. (4.48)
The general case can be reduced to the orthogonal one by the construction from Theorem 4.9.2. Let

Ψ̆ = Ψ − ΨΠη = Ψ − ΨΦ>(ΦΦ>)−1Φ

be the corrected Ψ -factors after removing their interactions with the Φ -factors.

Theorem 4.9.7. Consider the model (4.37), and let the matrix D̆2 = σ−2Ψ̆Ψ̆> be non-degenerate. Then the profile MLE θ reads as

θ = argminθ ‖Y − Ψ̆>θ‖2 = (Ψ̆Ψ̆>)−1Ψ̆Y . (4.49)

It holds IEθ = θ∗ and

θ − θ∗ = (Ψ̆Ψ̆>)−1Ψ̆ε = D̆−2ζ̆ = D̆−1ξ̆ (4.50)

with D̆2 = σ−2Ψ̆Ψ̆> , ζ̆ = σ−2Ψ̆ε , and ξ̆ = D̆−1ζ̆ . Finally, L(θ) from (4.46) fulfills

2{L(θ)− L(θ∗)} = ‖D̆(θ − θ∗)‖2 = ζ̆>D̆−2ζ̆ = ‖ξ̆‖2. (4.51)
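The identity behind (4.49), namely that the θ -block of the full least squares solution equals the estimate built from the corrected factors, can be verified numerically. A sketch with simulated data (not part of the original text):

```python
# Sketch (not from the book): check that the profile MLE, i.e. the
# theta-block of the full least squares estimate, coincides with the
# corrected-design formula (4.49).
import numpy as np

rng = np.random.default_rng(2)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Y = rng.standard_normal(n)           # arbitrary data vector

# Full LSE in the upsilon-setup with the stacked design Upsilon = (Psi; Phi)
Ups = np.vstack([Psi, Phi])          # p* x n
ups_hat = np.linalg.solve(Ups @ Ups.T, Ups @ Y)
theta_profile = ups_hat[:p]          # P projects on the theta-coordinates

# Corrected factors and formula (4.49)
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_breve = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)

print(np.allclose(theta_profile, theta_breve))   # True
```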
Finally we present the same result in terms of the original log-likelihood L(υ) .

Theorem 4.9.8. Write 𝒟2 = −∇2L(υ) for the model (4.37) in the block form

𝒟2 = ( D2   A
        A>   H2 ). (4.52)

Let D2 and H2 be invertible. Then D̆2 and ζ̆ in (4.50) can be represented as

D̆2 = D2 −AH−2A>,
ζ̆ = ∇θL(υ∗)−AH−2∇ηL(υ∗).

Proof. In view of Theorem 4.9.7, it suffices to check the formulas for D̆2 and ζ̆ . One has for Ψ̆ = Ψ(IIn −Πη) and A = σ−2ΨΦ>

D̆2 = σ−2Ψ̆Ψ̆> = σ−2Ψ(IIn −Πη)Ψ> = σ−2ΨΨ> − σ−2ΨΦ>(ΦΦ>)−1ΦΨ> = D2 −AH−2A>.

Similarly, in view of AH−2 = ΨΦ>(ΦΦ>)−1 , ∇θL(υ∗) = σ−2Ψε , and ∇ηL(υ∗) = σ−2Φε ,

ζ̆ = σ−2Ψ̆ε = σ−2Ψε− σ−2ΨΦ>(ΦΦ>)−1Φε = ∇θL(υ∗)−AH−2∇ηL(υ∗),

as required.
It is worth stressing again that the result of Theorems 4.9.6 through 4.9.8 is purely
geometrical. We only used the condition IEε = 0 in the model (4.37) and the quadratic
structure of the log-likelihood function L(υ) . The distribution of the vector ε does not
enter in the results and proofs. However, the representation (4.50) allows for straightfor-
ward analysis of the probabilistic properties of the estimate θ .
Theorem 4.9.9. Consider the model (4.37) and let Var(Y ) = Var(ε) = Σ0 . Then

Var(θ) = σ−4D̆−2Ψ̆Σ0Ψ̆>D̆−2,    Var(ξ̆) = σ−4D̆−1Ψ̆Σ0Ψ̆>D̆−1.

In particular, if Var(Y ) = σ2IIn , this implies that

Var(θ) = D̆−2,    Var(ξ̆) = IIp.

Exercise 4.9.8. Check the result of Theorem 4.9.9. Specialize this result to the orthogonal case ΨΦ> = 0 .
4.9.5 Semiparametric efficiency bound
The main goal of this section is to show that the profile method in semiparametric estimation leads to R-efficient procedures. Recall that the target of estimation is θ∗ = Pυ∗ for a given linear mapping P . The profile MLE θ is one natural candidate. The next result claims its optimality.

Theorem 4.9.10 (Gauss-Markov). Let Y follow Y = Υ>υ∗ + ε with homogeneous errors ε . Then the estimate θ of θ∗ = Pυ∗ from (4.43) is unbiased and

Var(θ) = σ2P (ΥΥ>)−1P>,

yielding

IE‖θ − θ∗‖2 = σ2 tr{P (ΥΥ>)−1P>}.

Moreover, this risk is minimal in the class of all unbiased linear estimates of θ∗ .
Proof. The statements about the properties of θ have already been proved. The lower bound can be proved by the same arguments as in the case of the MLE estimation in Section 4.4.3. We only outline the main steps. Let θ′ be any unbiased linear estimate of θ∗ . The idea is to show that the difference θ′ − θ is orthogonal to θ in the sense IE{(θ′ − θ)θ>} = 0 . This implies that the variance of θ′ is the sum of Var(θ) and Var(θ′ − θ) and is therefore larger than Var(θ) .

Let θ′ = BY for some matrix B . Then IEθ′ = BIEY = BΥ>υ∗ . The no-bias property yields the identity IEθ′ = θ∗ = Pυ∗ and thus

BΥ> − P = 0. (4.53)

Next, IEθ′ = IEθ = θ∗ and thus

IEθ′θ′> = θ∗θ∗> + Var(θ′),
IEθθ> = θ∗θ∗> + Var(θ).

Obviously θ′ − IEθ′ = Bε and θ − IEθ = Sε , yielding Var(θ) = σ2SS> and IE{Bε(Sε)>} = σ2BS> . So

IE{(θ′ − θ)θ>} = σ2(B − S)S>.

The identity (4.53) implies

(B − S)S> = {B − P (ΥΥ>)−1Υ}Υ>(ΥΥ>)−1P> = (BΥ> − P )(ΥΥ>)−1P> = 0,

and the result follows.
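The Gauss-Markov statement can be illustrated numerically by comparing the estimate SY with a competing unbiased linear estimate. The weighted least squares competitor below is an arbitrary illustrative choice, not taken from the book:

```python
# Sketch (not from the book): compare the estimate theta = S Y with another
# unbiased linear estimate theta' = B Y and check that the excess variance
# Var(theta') - Var(theta) is positive semi-definite, as Theorem 4.9.10 asserts.
import numpy as np

rng = np.random.default_rng(3)
p, p_star, n = 2, 5, 30
Ups = rng.standard_normal((p_star, n))
P = np.hstack([np.eye(p), np.zeros((p, p_star - p))])  # maps upsilon to theta

# Efficient weights: S = P (Ups Ups')^{-1} Ups
S = P @ np.linalg.solve(Ups @ Ups.T, Ups)

# A competing estimate: weighted LSE with arbitrary positive weights W.
# It is still unbiased, since B Ups' = P, i.e. condition (4.53) holds.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
B = P @ np.linalg.solve(Ups @ W @ Ups.T, Ups @ W)
print(np.allclose(B @ Ups.T, P))     # True: the no-bias condition (4.53)

# For homogeneous noise Var(eps) = sigma^2 I, the variances are sigma^2 S S'
# and sigma^2 B B'; their difference must be positive semi-definite.
excess = B @ B.T - S @ S.T
print(np.linalg.eigvalsh(excess).min() >= -1e-10)   # True
```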
Now we specify the efficiency bound for the (θ,η) -setup (4.37). In this case P is
just the projector onto the θ -coordinates.
4.9.6 Inference for the profile likelihood approach
This section discusses the construction of confidence and concentration sets for the profile ML estimation. The key fact behind this construction is the chi-squared result, which extends without any change from the parametric to the semiparametric framework.
The definition of θ from (4.47) suggests to define a CS for θ∗ as a level set of the profile likelihood L(θ) = supυ: Pυ=θ L(υ) :

E(z) def= {θ : supθ′ L(θ′)− L(θ) ≤ z}.

This definition can be rewritten as

E(z) def= {θ : supυ L(υ)− supυ: Pυ=θ L(υ) ≤ z}.
It is obvious that the unconstrained maximum of the log-likelihood L(υ) w.r.t. υ is not smaller than the maximum under the constraint Pυ = θ . The point θ belongs to E(z) if the difference between these two values does not exceed z . As usual, the main question is the choice of a value z which ensures the prescribed coverage probability for θ∗ . This naturally leads to studying the deviation probability

IP( supυ L(υ)− supυ: Pυ=θ∗ L(υ) > z ).

The study of this quantity is especially simple in the orthogonal case, and the answer can be anticipated: the expression and the value are exactly the same as in the case without any nuisance parameter η ; the nuisance simply has no impact. In particular, the chi-squared result still holds.
In this section we follow the line and the notation of Section 4.9.4. In particular, we use the block notation (4.52) for the matrix 𝒟2 = −∇2L(υ) .

Theorem 4.9.11. Consider the model (4.37). Let the matrix D̆2 be non-degenerate. If ε ∼ N(0, σ2IIn) , then

2{L(θ)− L(θ∗)} ∼ χ2p , (4.54)

that is, 2{L(θ)− L(θ∗)} is chi-squared with p degrees of freedom.

Proof. The result is based on the representation (4.51) 2{L(θ)− L(θ∗)} = ‖ξ̆‖2 from Theorem 4.9.7. It remains to note that normality of ε implies normality of ξ̆ , and the moment conditions IEξ̆ = 0 , Var(ξ̆) = IIp imply (4.54).
This result means that the chi-squared result continues to hold in the general semiparametric framework as well. One possible explanation is as follows: it applies in the orthogonal case, and the general situation can be reduced to the orthogonal case by a change of coordinates which preserves the value of the maximum likelihood.

The statement (4.54) of Theorem 4.9.11 has an interesting geometric interpretation which is often used in analysis of variance. Consider the expansion

L(θ)− L(θ∗) = {L(θ)− L(θ∗,η∗)} − {L(θ∗)− L(θ∗,η∗)}.

The quantity L1 def= L(θ)− L(υ∗) coincides with the maximum likelihood excess of the full estimate υ from (4.42). Thus, 2L1 is chi-squared with p∗ degrees of freedom by the chi-squared result. Moreover, 2σ2L1 = ‖Πυε‖2 , where Πυ = Υ>(ΥΥ>)−1Υ is the projector on the linear subspace spanned by the joint collection of factors {ψj} and {φm} . Similarly, the quantity L2 def= L(θ∗)− L(θ∗,η∗) = supη L(θ∗,η)− L(θ∗,η∗) is the maximum likelihood excess in the partial η -model. Therefore, 2L2 is chi-squared with p1 degrees of freedom, and 2σ2L2 = ‖Πηε‖2 , where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the η -factors {φm} . Now we use the decomposition Πυ = Πη + (Πυ −Πη) , in which Πυ −Πη is again a projector, onto a subspace of dimension p . This explains the result (4.54): the difference of the two quantities is chi-squared with p = p∗ − p1 degrees of freedom. The above consideration leads to the following result.
Theorem 4.9.12. It holds for the model (4.37) with Πθ = Πυ −Πη

2L(θ)− 2L(θ∗) = σ−2(‖Πυε‖2 − ‖Πηε‖2) = σ−2‖Πθε‖2 = σ−2ε>Πθε. (4.55)

Exercise 4.9.9. Check the formula (4.55). Show that it implies (4.54).
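The projector identity (4.55) and the rank count behind (4.54) can be checked directly. A sketch with a simulated design (illustrative only, not from the book):

```python
# Sketch (not from the book): numerical check of the projector identity (4.55)
# behind the chi-squared result. Pi_theta = Pi_ups - Pi_eta is itself a
# projector of rank p = p* - p1, so eps' Pi_theta eps / sigma^2 is chi^2_p
# for Gaussian eps.
import numpy as np

rng = np.random.default_rng(4)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Ups = np.vstack([Psi, Phi])

Pi_ups = Ups.T @ np.linalg.solve(Ups @ Ups.T, Ups)
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Pi_theta = Pi_ups - Pi_eta

# Pi_theta is a projector ...
print(np.allclose(Pi_theta @ Pi_theta, Pi_theta))   # True
# ... of rank p = p* - p1 = 2
print(int(round(np.trace(Pi_theta))))               # 2

# For any eps, |Pi_ups eps|^2 - |Pi_eta eps|^2 = eps' Pi_theta eps
eps = rng.standard_normal(n)
lhs = eps @ Pi_ups @ eps - eps @ Pi_eta @ eps
print(np.isclose(lhs, eps @ Pi_theta @ eps))        # True
```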
4.9.7 Plug-in method
Although the profile MLE can be represented in closed form, its computation can be a hard task if the dimensionality p1 of the nuisance parameter is high. Here we discuss an approach which simplifies the computations but leads to a suboptimal solution.
We start with the approach called plug-in. It is based on the assumption that a pilot estimate η of the nuisance parameter η∗ is available. Then one obtains the estimate θ of the target θ∗ from the residuals Y − Φ>η .

This means that the residual vector Y̆ = Y − Φ>η is used as observations, and the estimate θ is defined as the best fit to these observations in the θ -model:

θ = argminθ ‖Y̆ − Ψ>θ‖2 = (ΨΨ>)−1ΨY̆ . (4.56)
A very particular case of the plug-in method is partial estimation from Section 4.9.3 with
η ≡ η◦ .
The plug-in method can be naturally described in the context of partial estimation via the representation θ = θ(η) , i.e. the partial estimate with the pilot η plugged in for the nuisance parameter.

Exercise 4.9.10. Check the identity θ = θ(η) for the plug-in method. Describe the plug-in estimate for η ≡ 0 .

The behavior of θ heavily depends upon the quality of the pilot η . A detailed study is complicated, and a closed form solution is only available in the special case of a linear pilot estimate. Let η = AY . Then (4.56) implies

θ = (ΨΨ>)−1Ψ(Y − Φ>AY ) = SY

with S = (ΨΨ>)−1Ψ(IIn − Φ>A) . This is a linear estimate whose properties can be studied in the usual way.
4.9.8 Two step procedure
The ideas of partial and plug-in estimation can be combined, yielding the so-called two step procedures. One starts with an initial guess θ◦ for the target θ∗ ; a very special choice is θ◦ ≡ 0 . This leads to the partial η -model Y (θ◦) = Φ>η + ε for the residuals Y (θ◦) = Y − Ψ>θ◦ . Next compute the partial MLE η(θ◦) = (ΦΦ>)−1ΦY (θ◦) in this model and use it as a pilot for the plug-in method: compute the residuals

Y̆ (θ◦) = Y − Φ>η(θ◦) = Y −ΠηY (θ◦)

with Πη = Φ>(ΦΦ>)−1Φ , and then estimate the target parameter θ by fitting Ψ>θ to the residuals Y̆ (θ◦) . This method results in the estimate

θ(θ◦) = (ΨΨ>)−1ΨY̆ (θ◦). (4.57)
A simple comparison with the formula (4.49) reveals that the pragmatic two step approach is sub-optimal: the resulting estimate does not coincide with the profile MLE θ unless we are in the orthogonal situation with ΨΠη = 0 . In particular, the estimate θ(θ◦) from (4.57) is biased.
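The sub-optimality of the two step approach is easy to see on noise-free data, where the profile MLE recovers θ∗ exactly while the two step estimate retains its bias. A sketch with simulated designs (not from the book):

```python
# Sketch (not from the book): on noise-free data the profile MLE recovers
# theta* exactly, while the two step estimate (4.57) started at theta° = 0
# is off unless the designs are orthogonal.
import numpy as np

rng = np.random.default_rng(5)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
theta_s = rng.standard_normal(p)
eta_s = rng.standard_normal(p1)
Y = Psi.T @ theta_s + Phi.T @ eta_s     # eps = 0: estimates equal their means

# Two step estimate with theta° = 0
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Y_resid = Y - Pi_eta @ Y                # residuals after the eta-step
theta_two = np.linalg.solve(Psi @ Psi.T, Psi @ Y_resid)

# Profile MLE via the corrected design, formula (4.49)
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_prof = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)

print(np.allclose(theta_prof, theta_s))   # True: the profile MLE is unbiased
print(np.allclose(theta_two, theta_s))    # False: the two step estimate is biased
```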
Exercise 4.9.11. Consider the orthogonal case with ΨΦ> = 0 . Show that the two step estimate θ(θ◦) coincides with the partial MLE θ = (ΨΨ>)−1ΨY .

Exercise 4.9.12. Compute the mean of θ(θ◦) . Show that there exists some θ∗ such that IE{θ(θ◦)} ≠ θ∗ unless the orthogonality condition ΨΦ> = 0 is fulfilled.

Exercise 4.9.13. Compute the variance of θ(θ◦) .
Hint: use that Var{Y (θ◦)} = Var(Y ) = σ2IIn . Derive that Var{Y̆ (θ◦)} = σ2(IIn −Πη) .

Exercise 4.9.14. Let Ψ be orthonormal, i.e. ΨΨ> = IIp . Show that Var{θ(θ◦)} = σ2(IIp − ΨΠηΨ>) .
4.9.9 Alternating method
The ideas of partial and two step estimation can be applied in an iterative way. One starts with some initial value θ◦ and sequentially performs the two steps of partial estimation. Set

η0 = η(θ◦) = argminη ‖Y − Ψ>θ◦ − Φ>η‖2 = (ΦΦ>)−1Φ(Y − Ψ>θ◦).

With this estimate fixed, compute θ1 = θ(η0) and continue in this way. Generically, with θk and ηk computed, one recomputes

θk+1 = θ(ηk) = (ΨΨ>)−1Ψ(Y − Φ>ηk), (4.58)
ηk+1 = η(θk+1) = (ΦΦ>)−1Φ(Y − Ψ>θk+1). (4.59)

The procedure is especially transparent if the partial design matrices Ψ and Φ are orthonormal: ΨΨ> = IIp , ΦΦ> = IIp1 . Then

θk+1 = Ψ(Y − Φ>ηk),
ηk+1 = Φ(Y − Ψ>θk+1).
In words, having an estimate θ of the parameter θ∗ , one computes the residuals Y̆ = Y − Ψ>θ and then builds the estimate η of the nuisance η∗ via the empirical coefficients ΦY̆ . Then this estimate η is used in a similar way to recompute the estimate of θ∗ , and so on.

It is worth noting that every doubled step of alternation improves the current value L(θk,ηk) . Indeed, θk+1 is defined by maximizing L(θ,ηk) , that is, L(θk+1,ηk) ≥ L(θk,ηk) . Similarly, L(θk+1,ηk+1) ≥ L(θk+1,ηk) , yielding

L(θk+1,ηk+1) ≥ L(θk,ηk). (4.60)
A very interesting question is whether the procedure (4.58), (4.59) converges and
whether it converges to the maximum likelihood solution. The answer is positive and in
the simplest orthogonal case the result is straightforward.
Exercise 4.9.15. Consider the orthogonal situation with ΨΦ> = 0 . Show that the above procedure stabilizes in one step with the solution from Theorem 4.9.4.
In the non-orthogonal case the situation is much more complicated. The idea is to show that the alternating procedure can be represented as a sequence of applications of a contracting linear operator to the data. The key observation behind the result is the following recurrent formula for Ψ>θk and Φ>ηk :

Ψ>θk+1 = Πθ(Y − Φ>ηk) = (Πθ −ΠθΠη)Y +ΠθΠηΨ>θk, (4.61)
Φ>ηk+1 = Πη(Y − Ψ>θk+1) = (Πη −ΠηΠθ)Y +ΠηΠθΦ>ηk, (4.62)

with Πθ = Ψ>(ΨΨ>)−1Ψ and Πη = Φ>(ΦΦ>)−1Φ .
Exercise 4.9.16. Show (4.61) and (4.62).
This representation explains necessary and sufficient conditions for convergence of
the alternating procedure. Namely, the spectral norm ‖ΠηΠθ‖∞ (the largest singular
value) of the product operator ΠηΠθ should be strictly less than one, and similarly for
ΠθΠη .
Exercise 4.9.17. Show that ‖ΠθΠη‖∞ = ‖ΠηΠθ‖∞ .
Theorem 4.9.13. Suppose that ‖ΠηΠθ‖∞ = λ < 1 . Then the alternating procedure converges geometrically, the limiting values θ and η are unique and fulfill

Ψ>θ = (IIn −ΠθΠη)−1(Πθ −ΠθΠη)Y ,
Φ>η = (IIn −ΠηΠθ)−1(Πη −ΠηΠθ)Y , (4.63)

and the limiting θ coincides with the profile MLE from (4.47).
Proof. The convergence will be discussed below. Here we comment on the identity between the limiting value θ and the profile MLE. A direct comparison of the formulas for these two estimates can be a hard task. Instead we use the monotonicity property (4.60). By definition, the profile solution (θ,η) maximizes L(θ,η) globally. If we start the procedure at this solution, the value L(θ,η) cannot decrease at any step. By uniqueness of the limit, the procedure stabilizes with θk and ηk equal to the profile solution for every k .
Exercise 4.9.18. 1. Show by induction that

Φ>ηk+1 = Ak+1Y + (ΠηΠθ)kΦ>η1,

where the linear operator Ak fulfills A1 = 0 and

Ak+1 = Πη −ΠηΠθ +ΠηΠθAk = ∑i=0,...,k−1 (ΠηΠθ)i(Πη −ΠηΠθ).

2. Show that Ak converges to A = (IIn −ΠηΠθ)−1(Πη −ΠηΠθ) and evaluate ‖A−Ak‖∞ and ‖Φ>(ηk − η)‖ .
Hint: use that ‖Πη −ΠηΠθ‖∞ ≤ 1 and ‖(ΠηΠθ)i‖∞ ≤ ‖ΠηΠθ‖i∞ ≤ λi .
3. Prove (4.63) by inserting the limiting η in place of ηk and ηk+1 in (4.62).
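The iterations (4.58)-(4.59) and their convergence to the profile MLE can be checked numerically. A sketch with a simulated design (the iteration count is an arbitrary illustrative choice):

```python
# Sketch (not from the book): run the alternating iterations (4.58)-(4.59)
# and check convergence to the profile MLE under the contraction condition
# of Theorem 4.9.13.
import numpy as np

rng = np.random.default_rng(6)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Y = rng.standard_normal(n)

G_psi = Psi @ Psi.T
G_phi = Phi @ Phi.T
Pi_theta = Psi.T @ np.linalg.solve(G_psi, Psi)
Pi_eta = Phi.T @ np.linalg.solve(G_phi, Phi)
lam = np.linalg.norm(Pi_eta @ Pi_theta, 2)   # spectral norm, must be < 1
print(lam < 1)                               # True

theta = np.zeros(p)                          # initial guess theta° = 0
for _ in range(500):
    eta = np.linalg.solve(G_phi, Phi @ (Y - Psi.T @ theta))    # (4.59)
    theta = np.linalg.solve(G_psi, Psi @ (Y - Phi.T @ eta))    # (4.58)

# Profile MLE via the corrected design (4.49), for comparison
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_prof = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)
print(np.allclose(theta, theta_prof))        # True
```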
Chapter 5
Bayes estimation
This chapter discusses the Bayes approach to parameter estimation. This approach
differs essentially from classical parametric modeling also called the frequentist approach.
Classical frequentist modeling assumes that the observed data Y follow a distribution
law IP from a given parametric family (IPθ,θ ∈ Θ ⊂ IRp) , that is,
IP = IPθ∗ ∈ (IPθ).
Suppose that the family (IPθ) is dominated by a measure µ0 and denote by p(y |θ) the corresponding density:

p(y |θ) = (dIPθ/dµ0)(y).
The likelihood is defined as the density at the observed point and the maximum likelihood
approach tries to recover the true parameter θ∗ by maximizing this likelihood over
θ ∈ Θ .
In the Bayes approach, the paradigm changes: the true data distribution is not assumed to be specified by a single parameter value θ∗ . Instead, the unknown parameter is considered to be a random variable ϑ with a distribution π on the parameter space Θ called the prior. The measure IPθ can be viewed as the data distribution conditioned on the event that the randomly selected parameter is exactly θ . The target of analysis is not a single value θ∗ ; such a value is no longer defined. Instead one is interested in the posterior distribution of the random parameter ϑ given the observed data:

what is the distribution of ϑ given the prior π and the data Y ?

In other words, one aims at inferring on the distribution of ϑ on the basis of the observed data Y and the prior knowledge π . Below we distinguish between the random variable ϑ and its particular values θ . However, one often uses the same symbol θ for both objects.
5.1 Bayes formula
The Bayes modeling assumptions can be put together in the form
Y | θ ∼ p(· |θ),
ϑ ∼ π(·).
The first line has to be understood as the conditional distribution of Y given the par-
ticular value θ of the random parameter ϑ : Y | θ means Y | ϑ = θ . This section
formalizes and states the Bayes approach in a formal mathematical way. The answer is
given by the Bayes formula for the conditional distribution of ϑ given Y . First consider
the joint distribution IP of Y and ϑ . If B is a Borel set in the observation space and A is a measurable subset of Θ , then

IP (B ×A) = ∫A ( ∫B IPθ(dy) ) π(dθ).

The marginal or unconditional distribution of Y is given by averaging the joint probability w.r.t. the distribution of ϑ :

IP (B) = ∫Θ ∫B IPθ(dy)π(dθ) = ∫Θ IPθ(B)π(dθ).

The posterior (conditional) distribution of ϑ given the event {Y ∈ B} is defined as the ratio of the joint and marginal probabilities:

IP (ϑ ∈ A | Y ∈ B) = IP (B ×A) / IP (B).
Equivalently one can write this formula in terms of the related densities. In what follows we denote by the same letter π both the prior measure π and its density w.r.t. some dominating measure λ , e.g. the Lebesgue or the uniform measure on Θ . Then the joint measure IP has the density

p(y,θ) = p(y |θ)π(θ),

while the marginal density p(y) is the integral of the joint density w.r.t. the prior π :

p(y) = ∫Θ p(y,θ)λ(dθ) = ∫Θ p(y |θ)π(θ)λ(dθ).

Finally, the posterior (conditional) density p(θ |y) of ϑ given y is defined as the ratio of the joint density p(y,θ) and the marginal density p(y) :

p(θ |y) = p(y,θ) / p(y) = p(y |θ)π(θ) / ∫Θ p(y |θ)π(θ)λ(dθ).
Our definitions are summarized in the next lines:

Y | θ ∼ p(y |θ),
ϑ ∼ π(θ),
Y ∼ p(y) = ∫Θ p(y |θ)π(θ)λ(dθ),
ϑ | Y ∼ p(θ |Y ) = p(Y ,θ) / p(Y ) = p(Y |θ)π(θ) / ∫Θ p(Y |θ)π(θ)λ(dθ). (5.1)
Note that given the prior π and the observations Y , the posterior density p(θ |Y ) is
uniquely defined and can be viewed as the solution or target of analysis within the Bayes
approach. The expression (5.1) for the posterior density is called the Bayes formula.
The value p(y) of the marginal density of Y at y does not depend on the parameter
θ . Given the data Y , it is just a numeric normalizing factor. Often one skips this factor
writing
ϑ | Y ∝ p(Y |θ)π(θ).
Below we consider a couple of examples.
Example 5.1.1. Let Y = (Y1, . . . , Yn)> be a sequence of zeros and ones considered to be a realization of a Bernoulli experiment with n = 10 . Let also the underlying parameter θ be random and let it take the values 1/2 or 1 , each with probability 1/2 , that is,

π(1/2) = π(1) = 1/2.

Then the probability of observing y = (1, . . . , 1)> , i.e. ten ones, is

IP (y) = (1/2) IP (y | ϑ = 1/2) + (1/2) IP (y | ϑ = 1).

The first conditional probability is quite small, namely 2−10 , while the second one equals one. Therefore, IP (y) = (2−10 + 1)/2 . If we observed y = (1, . . . , 1)> , then the posterior probability of ϑ = 1 is

IP (ϑ = 1 | y) = IP (y | ϑ = 1)IP (ϑ = 1) / IP (y) = 1 / (2−10 + 1),

that is, it is quite close to one.
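The computation of Example 5.1.1 can be reproduced with the Bayes formula (5.1) applied to the two-point prior. An illustrative sketch, not from the book:

```python
# Sketch (not from the book): the two-point posterior of Example 5.1.1
# computed directly from the Bayes formula (5.1).
import numpy as np

n = 10
y = np.ones(n)                        # observed: ten ones
thetas = np.array([0.5, 1.0])         # support of the prior
prior = np.array([0.5, 0.5])

# Bernoulli likelihood p(y | theta) = prod_i theta^{y_i} (1 - theta)^{1 - y_i}
lik = np.array([(t ** y * (1 - t) ** (1 - y)).prod() for t in thetas])

# Posterior via the Bayes formula: likelihood times prior, renormalized
posterior = lik * prior / (lik * prior).sum()
print(np.isclose(posterior[1], 1 / (2 ** -10 + 1)))   # True: about 0.999
```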
Exercise 5.1.1. Consider the Bernoulli experiment Y = (Y1, . . . , Yn)> with n = 10 and let

π(1/2) = π(0.9) = 1/2.

Compute the posterior distribution of ϑ if we observe y = (y1, . . . , yn)> with

• y = (1, . . . , 1)> ;
• the number of successes S = y1 + . . .+ yn equal to 5 .

Show that the posterior density p(θ |y) depends on y only through the number of successes S .
5.2 Conjugated priors
Let (IPθ) be a dominated parametric family with the density function p(y |θ) . For a
prior π with the density π(θ) , the posterior density is proportional to p(y |θ)π(θ) . Now
consider the case when the prior π belongs to some other parametric family indexed by a
parameter α , that is, π(θ) = π(θ,α) . A very desirable situation is when the posterior density also belongs to this family. Then computing the posterior is equivalent to fixing the related parameter value α = α(Y ) . Such priors are usually called conjugated.
5.2.1 Examples
To illustrate this notion, we present some examples.
Example 5.2.1. [Gaussian Shift] Let Y ∼ N(θ, σ2) with σ known. Consider the prior ϑ ∼ N(τ, g2) , i.e. α = (τ, g2) . Then

p(y | θ)π(θ,α) ∝ exp{−(y − θ)2/(2σ2)− (θ − τ)2/(2g2)}.

The expression in the exponent is a quadratic form in θ , and expanding it around θ = τ yields

π(θ | y) ∝ exp{−(y − τ)2/(2σ2) + (y − τ)(θ − τ)/σ2 − 0.5(σ−2 + g−2)(θ − τ)2}.

This representation indicates that the conditional distribution of ϑ given y is normal. The parameters of the posterior will be computed in the next section.
Example 5.2.2. [Bernoulli] Let Y be a Bernoulli r.v. with IP (Y = 1) = θ . Then p(y | θ) = θ^y (1− θ)^{1−y} . Consider the family of priors of Beta type: π(θ,α) ∝ θ^a (1− θ)^b for α = (a, b) .
Example 5.2.3. [Exponential] Let
Example 5.2.4. [Poisson] Let
Example 5.2.5. [Volatility] Let
5.2.2 Exponential families and conjugated priors
All the previous examples can be systematically treated as special cases of an exponential family model equipped with a conjugated prior.
5.3 Linear Gaussian model and Gaussian priors
An interesting and important class of prior distributions is given by Gaussian priors.
A very nice and desirable feature of this class is that for a Gaussian model with a Gaussian prior the posterior distribution is also Gaussian.
5.3.1 Univariate case
We start with the case of a univariate parameter and one observation Y ∼ N(θ, σ2) ,
where the variance σ2 is known and only the mean θ is unknown. The Bayes approach
suggests to treat θ as a random variable. Suppose that the prior π is also normal with
mean τ and variance r2 .
Theorem 5.3.1. Let Y ∼ N(θ, σ2) , and let the prior π be the normal distribution
N(τ, r2) :
Y | θ ∼ N(θ, σ2),
ϑ ∼ N(τ, r2).
Then the joint, marginal, and posterior distributions are normal as well. Moreover, it
holds
Y ∼ N(τ, σ2 + r2),
ϑ | Y ∼ N( (τσ2 + Y r2)/(σ2 + r2), σ2r2/(σ2 + r2) ).
Proof. It holds Y = ϑ + ε with ϑ ∼ N(τ, r2) and ε ∼ N(0, σ2) independent of ϑ .
Therefore, Y is normal with mean IEY = IEϑ+ IEε = τ and the variance is
Var(Y ) = IE(Y − τ)2 = r2 + σ2.
This implies the formula for the marginal density p(Y ) . Next, for ρ = σ2/(r2 + σ2) ,
IE[(ϑ− τ)(Y − τ)
]= IE(ϑ− τ)2 = r2 = (1− ρ) Var(Y ).
Thus, the random variables Y − τ and ζ with
ζ = ϑ− τ − (1− ρ)(Y − τ) = ρ(ϑ− τ)− (1− ρ)ε
are Gaussian and uncorrelated and therefore independent. The conditional distribution of ζ given Y coincides with the unconditional distribution and hence it is normal with mean zero and variance

Var(ζ) = ρ2 Var(ϑ) + (1− ρ)2 Var(ε) = ρ2r2 + (1− ρ)2σ2 = σ2r2/(σ2 + r2).
This yields the result because ϑ = ζ + ρτ + (1− ρ)Y .
Exercise 5.3.1. Check the result of Theorem 5.3.1 by direct calculation using Bayes
formula (5.1).
So the posterior mean of ϑ is a weighted average of the prior mean τ and the
sample estimate Y ; the sample estimate is pulled back (or shrunk) toward the prior
mean. Moreover, the weight ρ on the prior mean is close to one if σ2 is large relative
to r2 (i.e. our prior knowledge is more precise than the data information), producing
substantial shrinkage. If σ2 is small (i.e., our prior knowledge is imprecise relative to
the data information), ρ is close to zero and the direct estimate Y is moved very little
towards the prior mean.
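The formulas of Theorem 5.3.1 can be cross-checked against a brute-force evaluation of the Bayes formula on a grid. A sketch with arbitrary illustrative numbers (not from the book):

```python
# Sketch (not from the book): check the posterior of Theorem 5.3.1 against
# a numerical evaluation of the Bayes formula on a fine grid.
import numpy as np

sigma2, tau, r2 = 2.0, 1.0, 0.5      # sigma^2, prior mean tau, prior variance r^2
Y = 3.0                              # one observation

# Closed form: weighted average with rho = sigma^2 / (sigma^2 + r^2)
rho = sigma2 / (sigma2 + r2)
post_mean = rho * tau + (1 - rho) * Y            # = (tau sigma^2 + Y r^2)/(sigma^2 + r^2)
post_var = sigma2 * r2 / (sigma2 + r2)

# Numerical posterior: likelihood times prior, renormalized on a grid
t = np.linspace(-10, 10, 400001)
unnorm = np.exp(-(Y - t) ** 2 / (2 * sigma2) - (t - tau) ** 2 / (2 * r2))
w = unnorm / unnorm.sum()
num_mean = (w * t).sum()
num_var = (w * (t - num_mean) ** 2).sum()

print(np.isclose(post_mean, num_mean, atol=1e-6))   # True
print(np.isclose(post_var, num_var, atol=1e-6))     # True
```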
Now consider the i.i.d. model from N(θ, σ2) where the variance σ2 is known.
Theorem 5.3.2. Let Y = (Y1, . . . , Yn)> be i.i.d. and for each Yi
Yi | θ ∼ N(θ, σ2), (5.2)
ϑ ∼ N(τ, r2). (5.3)
Then for the sample mean Ȳ = (Y1 + . . .+ Yn)/n

ϑ | Y ∼ N( (τσ2/n+ Ȳ r2)/(r2 + σ2/n), (r2σ2/n)/(r2 + σ2/n) ).
Exercise 5.3.2. Prove Theorem 5.3.2 using the technique of the proof of Theorem 5.3.1.
Hint: consider Yi = ϑ+ εi , Ȳ = S/n with S = Y1 + . . .+ Yn , and define ζ = ϑ− τ − (1− ρ)(Ȳ − τ) for ρ = (σ2/n)/(r2 + σ2/n) . Check that ζ and Ȳ are uncorrelated and hence independent.

The result of Theorem 5.3.2 can formally be derived from Theorem 5.3.1 by replacing the n i.i.d. observations Y1, . . . , Yn with the single observation Ȳ with conditional mean θ and variance σ2/n .
5.3.2 Linear Gaussian model and Gaussian prior
Now we consider the general case when both Y and ϑ are vectors. Namely we consider
the linear model Y = Ψ>ϑ+ ε with Gaussian errors ε in which the random parameter
vector ϑ is multivariate normal as well:
ϑ ∼ N(τ ,R), Y | θ ∼ N(Ψ>θ, Σ). (5.4)
Here Ψ is a given p×n design matrix, and Σ is a given error covariance matrix. Below
we assume that both Σ and R are non-degenerate. The model (5.4) can be represented
in the form
ϑ = τ + ξ, ξ ∼ N(0,R), (5.5)
Y = Ψ>τ + Ψ>ξ + ε, ε ∼ N(0, Σ), ε ⊥ ξ, (5.6)
where ξ ⊥ ε means independence of the error vectors ξ and ε . This representation
makes clear that the vectors ϑ,Y are jointly normal. Now we state the result about the
conditional distribution of ϑ given Y .
Theorem 5.3.3. Assume (5.4). Then the joint distribution of ϑ,Y is normal with
IE [ϑ; Y ] = [τ ; Ψ>τ ],   Var [ϑ; Y ] = [ R , RΨ ; Ψ>R , Ψ>RΨ + Σ ],
where the semicolons separate the ϑ -block from the Y -block of the mean vector and of the covariance matrix.
Moreover, the posterior ϑ | Y is also normal. With B = R−1 + ΨΣ−1Ψ> ,

IE(ϑ | Y ) = τ + RΨ(Ψ>RΨ + Σ)−1(Y − Ψ>τ ) = B−1R−1τ + B−1ΨΣ−1Y , (5.7)
Var(ϑ | Y ) = B−1. (5.8)
Proof. The following technical lemma explains a very important property of the normal
law: normal conditioned on a normal is again a normal.
Lemma 5.3.4. Let ξ and η be jointly normal. Denote U = Var(ξ) , W = Var(η) ,
C = Cov(ξ,η) = IE(ξ − IEξ)(η − IEη)> . Then the conditional distribution of ξ given
η is also normal with
IE[ξ | η] = IEξ + CW−1(η − IEη),
Var[ξ | η] = U − CW−1C>.
Proof. First consider the case when ξ and η are zero-mean. Then the vector
ζdef= ξ − CW−1η
is also normal zero-mean and fulfills
IE(ζη>) = IE[(ξ − CW−1η)η>] = IE(ξη>) − CW−1IE(ηη>) = 0,
Var(ζ) = IE[(ξ − CW−1η)(ξ − CW−1η)>] = U − CW−1C>.
The vectors ζ and η are jointly normal and uncorrelated, thus, independent. This
means that the conditional distribution of ζ given η coincides with the unconditional
one. It remains to note that ξ = ζ + CW−1η , and conditioned on η , the vector ξ is just a shift of the normal vector ζ by the fixed vector CW−1η . Therefore, the conditional distribution of ξ given η is normal with mean CW−1η and variance Var(ζ) = U − CW−1C> .
Exercise 5.3.3. Extend the proof of Lemma 5.3.4 to the case when the vectors ξ and
η are not zero mean.
It remains to deduce the desired result about posterior distribution from this lemma.
The formulas for the first two moments of ϑ and Y follow directly from (5.5) and (5.6).
Now we apply Lemma 5.3.4 with U = R , C = RΨ , W = Ψ>RΨ + Σ . It follows that
the vector ϑ conditioned on Y is normal with
IE(ϑ | Y ) = τ + RΨW−1(Y − Ψ>τ ), (5.9)
Var(ϑ | Y ) = R − RΨW−1Ψ>R.
Straightforward calculus implies {R − RΨW−1Ψ>R} B = Ip with B = R−1 + ΨΣ−1Ψ> , so that Var(ϑ | Y ) = B−1 , which is (5.8). Moreover, using Σ = W − Ψ>RΨ ,
RΨW−1 = RΨW−1(W − Ψ>RΨ)Σ−1 = RΨΣ−1 − RΨW−1Ψ>RΨΣ−1 = (R − RΨW−1Ψ>R)ΨΣ−1 = B−1ΨΣ−1.
This implies (5.7) by (5.9).
Exercise 5.3.4. Check the details of the proof of Theorem 5.3.3.
Exercise 5.3.5. Derive the result of Theorem 5.3.3 by direct computation of the density
of ϑ given Y .
Hint: use that ϑ and Y are jointly normal vectors. Consider their joint density p(θ,Y ) for Y fixed and obtain the conditional density by analyzing its linear and quadratic terms w.r.t. θ .
Exercise 5.3.6. Show that Var(ϑ | Y ) < Var(ϑ) = R .
Hint: use that Var(ϑ | Y ) = B−1 and B def= R−1 + ΨΣ−1Ψ> > R−1 .
The last exercise delivers an important message: the variance of the posterior is smaller than the variance of the prior. This is intuitively clear because the posterior utilizes both sources of information: those contained in the prior and those we get from the data Y . However, even in the simple Gaussian case, the proof is quite involved. Another interpretation of this fact will be given later: the Bayes approach effectively performs a kind of regularization and thus leads to a reduction of the variance; cf. Section 4.7.
Another conclusion from the formulas (5.7), (5.8) is that the moments of the posterior distribution approach the moments of the MLE θ = (ΨΣ−1Ψ>)−1ΨΣ−1Y as R grows.
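The two equivalent forms of the posterior mean in (5.7), and the two forms of the posterior variance, can be verified numerically. The following Python/NumPy sketch, with a randomly generated design Ψ , prior (τ , R) , and data Y of our own choosing, evaluates both sides of each identity:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 8
Psi = rng.normal(size=(p, n))            # p x n design matrix Psi
R = 0.5 * np.eye(p)                      # prior covariance of theta
Sigma = 2.0 * np.eye(n)                  # error covariance
tau = rng.normal(size=p)                 # prior mean
Y = rng.normal(size=n)                   # observed data

W = Psi.T @ R @ Psi + Sigma              # Var(Y)
B = np.linalg.inv(R) + Psi @ np.linalg.inv(Sigma) @ Psi.T

# posterior mean: both forms of (5.7)
mean_a = tau + R @ Psi @ np.linalg.solve(W, Y - Psi.T @ tau)
mean_b = np.linalg.solve(B, np.linalg.solve(R, tau) + Psi @ np.linalg.solve(Sigma, Y))

# posterior variance: the (5.9)-form and (5.8)
var_a = R - R @ Psi @ np.linalg.solve(W, Psi.T @ R)
var_b = np.linalg.inv(B)
```

Both pairs agree up to numerical precision, which is exactly the matrix identity established in the proof of Theorem 5.3.3.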
5.3.3 Homogeneous errors, orthogonal design
Consider a linear model Yi = Ψ>i ϑ+ εi for i = 1, . . . , n , where Ψi are given vectors in
IRp and εi are i.i.d. normal N(0, σ2) . This model is a special case of the model (5.4)
with Ψ = (Ψ1, . . . , Ψn) and uncorrelated homogeneous errors ε yielding Σ = σ2In .
Then Σ−1 = σ−2In , B = R−1 + σ−2ΨΨ> , and

IE(ϑ | Y ) = B−1R−1τ + σ−2B−1ΨY , (5.10)
Var(ϑ | Y ) = B−1,
where ΨΨ> = ∑i ΨiΨ>i . If the prior variance is also homogeneous, that is, R = r2Ip , then the formulas can be further simplified. In particular,
Var(ϑ | Y ) = (r−2Ip + σ−2ΨΨ>)−1.
The most transparent case corresponds to the orthogonal design with ΨΨ> = η2Ip for
some η2 > 0 . Then
IE(ϑ | Y ) = (σ2/r2)/(η2 + σ2/r2) τ + 1/(η2 + σ2/r2) ΨY , (5.11)
Var(ϑ | Y ) = σ2/(η2 + σ2/r2) Ip. (5.12)
Exercise 5.3.7. Derive (5.11) and (5.12) from Theorem 5.3.3 with Σ = σ2In , R = r2Ip , and ΨΨ> = η2Ip .
Exercise 5.3.8. Show that the posterior mean is a convex combination of the MLE θ = η−2ΨY and the prior mean τ :
IE(ϑ | Y ) = ρτ + (1 − ρ)θ
with ρ = (σ2/r2)/(η2 + σ2/r2) . Moreover, ρ → 0 as η → ∞ , that is, the posterior mean approaches the MLE θ .
5.4 Non-informative priors
The Bayes approach requires fixing a prior distribution on the values of the parameter ϑ . What happens if no such information is available? Is the Bayes approach still applicable? An immediate answer is “no”, but it would be a bit hasty. Actually one can still apply the Bayes approach with priors that do not give any preference to one point over the others. Such priors are called non-informative. Consider first the case when the set Θ is
finite: Θ = {θ1, . . . ,θM} . Then the non-informative prior is just the uniform measure
on Θ giving to every point θm the equal probability 1/M . Then the joint probability
of Y and ϑ is the average of the measures IPθm and the same holds for the marginal
distribution of the data:
p(y) = (1/M) ∑_{m=1}^{M} p(y | θm).
The posterior distribution is already “informative” and it differs from the uniform prior:
p(θk | y) = p(y | θk)π(θk)/p(y) = p(y | θk) / ∑_{m=1}^{M} p(y | θm), k = 1, . . . ,M.
Exercise 5.4.1. Check that the posterior measure is non-informative iff all the measures
IPθm coincide.
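For a finite Θ the posterior under the uniform prior is just the vector of likelihoods normalized to sum to one, as in the display above. A small Python/NumPy sketch (the Gaussian likelihood and the helper name are our own illustration) computes it in a numerically stable way:

```python
import numpy as np

def posterior_finite(y, thetas, sigma):
    """Posterior weights over a finite set Theta = {theta_1, ..., theta_M}
    under the uniform prior, for one observation y ~ N(theta, sigma^2).
    The constant prior 1/M cancels from numerator and denominator."""
    loglik = -0.5 * ((y - thetas) / sigma) ** 2   # log p(y | theta_m) up to a constant
    w = np.exp(loglik - loglik.max())             # subtract max for stability
    return w / w.sum()                            # normalize by sum_m p(y | theta_m)

post = posterior_finite(0.4, np.array([-1.0, 0.0, 1.0]), 1.0)
```

The resulting weights sum to one and concentrate on the candidate value closest to the observation, illustrating that the posterior is “informative” even though the prior is not.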
A similar situation arises if the set Θ is a non-discrete bounded subset in IRp . A
typical example is given by the case of a univariate parameter restricted to a finite interval
[a, b] . Define π(θ) = 1/π(Θ) , where π(Θ) def= ∫Θ dθ.
Then
p(y) = (1/π(Θ)) ∫Θ p(y | θ) dθ,
p(θ | y) = p(y | θ)π(θ)/p(y) = p(y | θ) / ∫Θ p(y | θ) dθ. (5.13)
In some cases the non-informative uniform prior can be used even for unbounded parameter sets. Indeed, what we really need is that the integral in the denominator of the last formula is finite:
∫Θ p(y | θ) dθ < ∞ for all y.
Then we can apply (5.13) even if Θ is unbounded.
Exercise 5.4.2. Consider the Gaussian shift model (5.2) with the non-informative prior.
(i) Check that for n = 1 , the value ∫_{−∞}^{∞} p(y | θ) dθ is finite for every y and the posterior distribution of ϑ given Y is normal with mean Y and variance σ2 .
(ii) Compute the posterior for n > 1 .
Exercise 5.4.3. Consider the Gaussian regression model Y = Ψ>ϑ+ ε , ε ∼ N(0, Σ) ,
and the non-informative prior π which is the Lebesgue measure on the space IRp . Show
that the posterior for ϑ is normal with mean θ = (ΨΣ−1Ψ>)−1ΨΣ−1Y and variance
(ΨΣ−1Ψ>)−1 . Compare with the result of Theorem 5.3.3.
Note that the result of this exercise can be formally derived from Theorem 5.3.3 by
replacing R−1 with 0.
Another way of tackling the case of an unbounded parameter set is to consider a
sequence of priors that approaches the uniform distribution on the whole parameter set.
In the case of linear Gaussian models and normal priors, a natural way is to let the
prior variance tend to infinity. Consider first the univariate case; see Section 5.3.1. A
non-informative prior can be approximated by the normal distribution with mean zero
and variance r2 tending to infinity. Then
ϑ | Y ∼ N( Y r2/(σ2 + r2), σ2r2/(σ2 + r2) ) w−→ N(Y, σ2) as r → ∞.
It is interesting to note that the case of an i.i.d. sample in fact reduces the situation to that of a non-informative prior. Indeed, the result of Theorem 5.3.2 can be rewritten with r2n = nr2 as
ϑ | Y ∼ N( (τσ2 + Y r2n)/(σ2 + r2n), σ2r2/(σ2 + r2n) ).
One says that the prior information “washes out” from the posterior distribution as the
sample size n tends to infinity.
5.5 Bayes estimate and posterior mean
Given a loss function ℘(θ,θ′) on Θ × Θ , the Bayes risk of an estimate θ = θ(Y ) is defined as
Rπ(θ) def= IE℘(θ,ϑ) = ∫Θ ( ∫Y ℘(θ(y),θ) p(y | θ)µ0(dy) ) π(θ)λ(dθ).
Note that ϑ in this formula is treated as a random variable that follows the prior
distribution π . One can represent this formula symbolically in the form
Rπ(θ) = IE[IE(℘(θ,ϑ) | ϑ)] = IE R(θ,ϑ).
Here the external integration averages the pointwise risk R(θ,ϑ) over all possible values
of ϑ due to the prior distribution.
The Bayes formula p(y |θ)π(θ) = p(θ |y)p(y) and change of order of integration can
be used to represent the Bayes risk via the posterior density:
Rπ(θ) = ∫Y ( ∫Θ ℘(θ(y),θ) p(θ | y)λ(dθ) ) p(y)µ0(dy) = IE[IE{℘(θ,ϑ) | Y }].
The estimate θπ is called Bayes or π -Bayes if it minimizes the corresponding risk:
θπ = argminθ Rπ(θ),
where the infimum is taken over the class of all feasible estimates. The most widespread
choice of the loss function is the quadratic one:
℘(θ,θ′)def= ‖θ − θ′‖2.
The great advantage of this choice is that the Bayes solution can be given explicitly: it
is the posterior mean:
θπ def= IE(ϑ | Y ) = ∫Θ θ p(θ | Y )λ(dθ).
Note that due to Bayes’ formula, this value can be rewritten as
θπ = (1/p(Y )) ∫Θ θ p(Y | θ)π(θ)λ(dθ), p(Y ) = ∫Θ p(Y | θ)π(θ)λ(dθ).
Theorem 5.5.1. It holds for any estimate θ
Rπ(θ) ≥ Rπ(θπ).
Proof. The main feature of the posterior mean is that it provides a kind of projection of
the data. This property can be formalized as follows:
IE(θπ − ϑ | Y ) = ∫Θ (θπ − θ) p(θ | Y )λ(dθ) = 0,
yielding for any estimate θ = θ(Y )
IE(‖θ − ϑ‖2 | Y ) = IE(‖θπ − ϑ‖2 | Y ) + IE(‖θπ − θ‖2 | Y ) + 2(θ − θπ)>IE(θπ − ϑ | Y )
= IE(‖θπ − ϑ‖2 | Y ) + IE(‖θπ − θ‖2 | Y )
≥ IE(‖θπ − ϑ‖2 | Y ).
Here we have used that both θ and θπ are functions of Y and can be considered as
constant when taking the conditional expectation w.r.t. Y . Now
Rπ(θ) = IE‖θ − ϑ‖2 = IE[IE(‖θ − ϑ‖2 | Y )] ≥ IE[IE(‖θπ − ϑ‖2 | Y )] = Rπ(θπ)
and the result follows.
Exercise 5.5.1. Consider the univariate case with the loss function |θ−θ′| . Check that
the posterior median minimizes the Bayes risk.
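Theorem 5.5.1 can be illustrated by simulation: drawing ϑ from the prior and then Y given ϑ , the Monte Carlo Bayes risk of the posterior mean should fall below that of the naive estimate Y and match the posterior variance σ2r2/(σ2 + r2) from Theorem 5.3.1. A Python/NumPy sketch, with parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
tau, r2, sigma2, n_rep = 0.0, 1.0, 4.0, 20000

# draw theta from the prior, then one observation Y | theta ~ N(theta, sigma2)
theta = rng.normal(tau, np.sqrt(r2), size=n_rep)
Y = theta + rng.normal(0.0, np.sqrt(sigma2), size=n_rep)

rho = sigma2 / (r2 + sigma2)
bayes = rho * tau + (1 - rho) * Y        # posterior mean, Theorem 5.3.1

risk_bayes = np.mean((bayes - theta) ** 2)   # Monte Carlo Bayes risk
risk_mle = np.mean((Y - theta) ** 2)         # risk of the naive estimate Y
```

Here `risk_bayes` is close to σ2r2/(σ2 + r2) = 0.8 and clearly below `risk_mle` ≈ σ2 = 4, in line with the theorem.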
5.5.1 Posterior mean and ridge regression
Here we again consider the case of a linear Gaussian model
Y = Ψ>ϑ+ ε, ε ∼ N(0, σ2In).
(To simplify the presentation, we focus here on the case of homogeneous errors with
Σ = σ2In .) Recall that the maximum likelihood estimate θ for this model reads as
θ = (ΨΨ>)−1ΨY .
The regularized MLE is defined as
θR = (ΨΨ> + R2)−1ΨY ,
where R is a regularizing matrix; cf. Section 4.7.1. It turns out that a similar estimate appears in quite a natural way within the Bayes approach. Consider the normal prior distribution ϑ ∼ N(0, R2) . The posterior will be normal as well with the posterior mean
θπ = σ−2B−1ΨY = (ΨΨ> + σ2R−2)−1ΨY ;
see (5.10). It follows that θπ = θR for the normal prior π = N(0, σ2R−2) .
One can say that the Bayes approach leads to a regularization of the least squares
method. The degree of regularization is inversely proportional to the variance of the
prior. The larger the variance, the closer the prior is to the non-informative one and the
posterior mean θπ to the MLE θ .
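The coincidence of the ridge-type estimate and the posterior mean is easy to confirm numerically. In the Python/NumPy sketch below (dimensions and matrices are our own choice), the prior covariance σ2R−2 is the one matching the regularizing matrix R2 , as discussed above:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 20
Psi = rng.normal(size=(p, n))            # p x n design matrix
Y = rng.normal(size=n)
sigma2 = 1.5
R2 = 0.7 * np.eye(p)                     # regularizing matrix R^2

# regularized (ridge-type) MLE
theta_ridge = np.linalg.solve(Psi @ Psi.T + R2, Psi @ Y)

# posterior mean under the prior N(0, G) with G = sigma2 * inv(R^2)
G = sigma2 * np.linalg.inv(R2)
B = np.linalg.inv(G) + (Psi @ Psi.T) / sigma2
theta_bayes = np.linalg.solve(B, Psi @ Y) / sigma2   # sigma^{-2} B^{-1} Psi Y, cf. (5.10)
```

The two estimates agree up to numerical precision, since B = σ−2(R2 + ΨΨ>) implies σ−2B−1ΨY = (ΨΨ> + R2)−1ΨY .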
Chapter 6
Testing a statistical hypothesis
Let Y be the observed sample. The hypothesis testing problem assumes that there is
some external information (hypothesis) about the distribution of this sample and the
target is to check this hypothesis on the basis of the available data.
6.1 Testing problem
This section specifies the main notions of the theory of hypothesis testing. We start
with a simple hypothesis. Afterwards a composite hypothesis will be discussed. We also
introduce the notions of the testing error, level, power, etc.
6.1.1 Simple hypothesis
The classical testing problem is to check, on the basis of the available data, the hypothesis that the data follow a precisely known distribution. We illustrate this notion by several examples.
Example 6.1.1. [Simple game] Let Y = (Y1, . . . , Yn)> be a Bernoulli sequence of zeros and ones. This sequence can be viewed as a record of successes and failures, e.g. the results of tossing a coin. The hypothesis about this sequence is that wins (associated with one) and
losses (associated with zero) are equally frequent in the long run. This hypothesis can
be formalized as follows: IP = IPθ∗ with θ∗ = 1/2 , where IPθ describes the Bernoulli
experiment with parameter θ .
Example 6.1.2. [No effect treatment] Let (Yi, Ψi) be experimental results, i = 1, . . . , n .
The linear regression model assumes certain dependence of the form Yi = Ψ>i θ+ εi with
errors εi having zero mean. The “no effect” hypothesis means that there is no systematic
dependence of Yi on the factors Ψi , i.e. θ = θ∗ = 0 and the observations Yi are just
noise.
Example 6.1.3. [Quality control] Let Yi be the results of a production process which
can be represented in the form Yi = θ∗ + εi , where θ∗ is a nominal value and εi is
a measurement error. The hypothesis is that the observed process indeed follows this
model.
The general problem of testing a simple hypothesis is stated as follows: to check on
the basis of the available observations Y that their distribution is described by a given
measure IP . The hypothesis is often called a null hypothesis or just null.
6.1.2 Composite hypothesis
More generally, one can speak about the problem of testing a composite hypothesis. Let
(IPθ,θ ∈ Θ ⊂ IRp) be a given parametric family, and let Θ0 ⊆ Θ be a subset in Θ . The
hypothesis is that the data distribution IP belongs to the set (IPθ,θ ∈ Θ0) .
We give some typical examples where such a formulation is natural.
Example 6.1.4. [Testing a subvector] Let the vector θ ∈ Θ be decomposed into two parts: θ = (γ,η) . The subvector γ is the target of analysis, while the subvector η matters for the distribution of the data but is not the target of analysis; it is often called the nuisance parameter. The hypothesis we want to test is γ = γ∗ for some fixed value γ∗ . A typical situation where such problems arise is factor analysis: one checks for “no effect” of one particular factor in the presence of many other factors.
Example 6.1.5. [Interval testing] Let Θ be the real line and Θ0 be an interval. The
hypothesis is that IP = IPθ∗ for θ∗ ∈ Θ0 . Such problems are typical for quality control or
warning (monitoring) systems when the controlled parameter should be in the prescribed
range.
Example 6.1.6. [Testing a hypothesis about error distribution] Consider the regression
model Yi = Ψ>i θ + εi . The typical assumption about the errors εi is that they are
zero-mean normal. One can test this assumption having in mind the cases with discrete,
or heavy-tailed, or heteroscedastic errors.
6.1.3 A test
A test is a statistical decision on the basis of the available data whether the hypothesis is
accepted or rejected. So the decision space consists of only two points, which we denote
by zero and one. A decision φ is a mapping of the data Y to this space and is called a
test :
φ : Y→ {0, 1}.
The event φ = 1 means that the hypothesis is rejected and the opposite event means
the acceptance of the null. Usually the testing results are qualified in the following way:
rejection of the hypothesis means that the data are not consistent with the null, or,
equivalently, the data contain some evidence against the null hypothesis. Acceptance
simply means that the data do not contradict the null.
The region of acceptance is a subset of the observation space Y on which φ = 0 .
One also says that this region is the set of values for which we fail to reject the null
hypothesis. The region of rejection or critical region is on the other hand the subset of
Y on which φ = 1 .
6.1.4 Errors of the first kind, test level
In the hypothesis testing framework one distinguishes between error of the first and
second kind. The error of the first kind means that the hypothesis is wrongly rejected
when it was correct. We formalize this notion first for the case of a simple hypothesis
and then extend it to the general case.
Let H0 : Y ∼ IPθ∗ be a null hypothesis. The error of the first kind is the situation
when the data indeed follow the null, but the decision of the test is to reject this hypoth-
esis: φ = 1 . Clearly the probability of such an error is IPθ∗(φ = 1) . One says that φ is
a test of level α for some α ∈ (0, 1) if
IPθ∗(φ = 1) = α.
The value α is called level (size) of the test or significance level.
If the hypothesis is composite, then the level of the test is the maximum rejection
probability over the null subset. A test φ is of level α if
supθ∈Θ0 IPθ(φ = 1) ≤ α.
6.1.5 A randomized test
In some situations it is difficult to decide about acceptance or rejection of the hypothesis.
A randomized test can be viewed as a weighted decision: with a certain probability the
hypothesis is rejected, otherwise accepted. The decision space for a randomized test φ
is an interval [0, 1] , that is, φ(Y ) is a number between zero and one. The hypothesis
H0 is rejected with probability φ(Y ) on the basis of the observed data Y . If φ(Y )
only admits the binary values 0 and 1 for every Y , then we come back to the usual
non-randomized test. The probability of the first-kind error is naturally given by the
value IEφ(Y ) . For a simple hypothesis H0 : IP = IPθ∗ , a test φ is of level α if
IEφ(Y ) = α.
In the case of a composite hypothesis H0 : IP ∈ (IPθ,θ ∈ Θ0) , the level condition reads
as
supθ∈Θ0 IEφ(Y ) ≤ α.
In what follows we mostly consider non-randomized tests and only comment on whether
a randomization can be useful. Note that any randomized test can be reduced to a
non-randomized test by extending the probability space.
Exercise 6.1.1. Construct for any randomized test φ its non-randomized version using
a random data generator.
6.1.6 An alternative, error of the second kind, power of the test
The set-up of hypothesis testing focuses on the null hypothesis. However, for a complete
analysis, one has to specify the data distribution when the hypothesis is wrong. Within
the parametric framework, one usually makes the assumption that the unknown data
distribution belongs to some parametric family (IPθ,θ ∈ Θ ⊆ IRp) . This assumption has
to be fulfilled independently of whether the hypothesis is true or false. In other words,
we assume that IP ∈ (IPθ,θ ∈ Θ) and there is a subset Θ0 ⊂ Θ corresponding to the
null hypothesis. The measure IP = IPθ for θ 6∈ Θ0 is called an alternative.
Now we can consider the performance of a test φ when the hypothesis H0 is wrong.
The decision to accept the hypothesis when it is wrong is called the error of the second
kind. The probability of such error is equal to IP (φ = 0) . This value certainly depends
on the alternative IP = IPθ for θ 6∈ Θ0 . The value β(θ) = 1− IPθ(φ = 0) is often called
the test power at θ 6∈ Θ0 . The function β(θ) of θ ∈ Θ \Θ0 given by
β(θ)def= 1− IPθ(φ = 0)
is called a power function. Ideally one would wish to build a test which simultaneously minimizes the level and maximizes the power. These two wishes are contradictory: a decrease of the level usually results in a decrease of the power and vice versa. Usually one imposes the level- α constraint on the test and tries to optimize its power.
Definition 6.1.1. A test φ∗ is called uniformly most powerful (UMP) of level α if it is of level α and for any other test φ of level α , it holds
1 − IPθ(φ∗ = 0) ≥ 1 − IPθ(φ = 0), θ 6∈ Θ0.
Unfortunately, such UMP tests exist only in very few special models; otherwise,
optimization of the power given the level is a complicated task.
In the case of a univariate parameter θ ∈ Θ ⊂ IR1 and a simple hypothesis θ = θ∗ ,
one often considers one-sided alternatives
H1 : θ ≥ θ∗ or H1 : θ ≤ θ∗
or a two-sided alternative
H1 : θ 6= θ∗ .
6.2 Neyman-Pearson test for two simple hypotheses
This section discusses one very special case of hypothesis testing when both the hypothesis
and alternative are simple one-point sets. This special situation by itself can be viewed
as a toy problem, but it is very important from the methodological point of view. In
particular, it introduces and justifies the so-called likelihood ratio test and demonstrates
its efficiency.
For simplicity we write IP0 for the null hypothesis and IP1 for the alternative measure.
A test φ is a measurable function of the observations with values in the two-point set
{0, 1} . The event φ = 0 is treated as acceptance of the null hypothesis H0 while φ = 1
means rejection of the null hypothesis against H1 .
For ease of presentation we assume that the measure IP1 is absolutely continuous
w.r.t. the measure IP0 and denote by Z(Y ) the corresponding derivative at the obser-
vation point:
Z(Y ) def= (dIP1/dIP0)(Y ).
Similarly L(Y ) means the log-density:
L(Y ) def= logZ(Y ) = log (dIP1/dIP0)(Y ).
The solution of the testing problem in the case of two simple hypotheses is known as the Neyman-Pearson test: reject the hypothesis H0 if the likelihood ratio Z(Y ) exceeds a specific critical value t :
φ∗t def= 1(Z(Y ) > t).
The Neyman-Pearson test is known as the one minimizing the weighted sum of the errors
of the first and second kind. For a non-randomized test this sum is equal to
℘0IP0(φ = 1) + ℘1IP1(φ = 0),
while the weighted error of a randomized test φ is
℘0IE0φ+ ℘1IE1(1− φ). (6.1)
Theorem 6.2.1. For every two positive values ℘0 and ℘1 , the test φ∗t with t = ℘0/℘1 minimizes (6.1) over all possible (randomized) tests φ :
φ∗t def= 1(Z(Y ) ≥ t) = argminφ {℘0IE0φ + ℘1IE1(1 − φ)}.
Proof. We use the formula for a change of measure:
IE1ξ = IE0[ξZ(Y )]
for any r.v. ξ . It holds for any test φ with t = ℘0/℘1
℘0IE0φ + ℘1IE1(1 − φ) = IE0[℘0φ − ℘1Z(Y )φ] + ℘1 = −℘1IE0[Z(Y ) − t]φ + ℘1 ≥ −℘1IE0[Z(Y ) − t]+ + ℘1
with the equality for φ = 1(Z(Y ) ≥ t) .
The Neyman-Pearson test belongs to a large class of tests of the form
φ = 1(T ≥ t),
where T is a function of the observations Y . This random variable is usually called a
test statistic while the threshold t is called a critical value. The hypothesis is rejected
if the test statistic exceeds the critical value. For the Neyman-Pearson test, the test
statistic is the likelihood ratio Z(Y ) and the critical value is selected as its quantile.
The next result shows that the Neyman-Pearson test φ∗t with a proper critical value
t can be constructed to maximize the power IE1φ under the level constraint IE0φ ≤ α .
Theorem 6.2.2. Given α ∈ (0, 1) , let tα be such that
IP0(Z(Y ) ≥ tα) = α. (6.2)
Then it holds
φ∗tα def= 1(Z(Y ) ≥ tα) = argmax_{φ : IE0φ≤α} IE1φ.
Proof. Let φ satisfy IE0φ ≤ α . Then
IE1φ − αtα ≤ IE0{Z(Y )φ} − tαIE0φ = IE0{(Z(Y ) − tα)φ} ≤ IE0[Z(Y ) − tα]+
with the equality for φ = 1(Z(Y ) ≥ tα) .
The previous result assumes that for a given α there is a critical value tα such that
(6.2) is fulfilled. However, this is not always the case.
Exercise 6.2.1. Let Z(Y ) = (dIP1/dIP0)(Y ) .
• Show that the relation (6.2) can always be fulfilled with a proper choice of tα if the distribution function of Z(Y ) under IP0 is continuous.
• Suppose that the distribution function of Z(Y ) under IP0 has a jump at tα , so that
IP0(Z(Y ) ≥ tα) > α, IP0(Z(Y ) > tα) < α.
Construct a randomized test φ that fulfills IE0φ = α and maximizes the test power IE1φ among all such tests.
The Neyman-Pearson test can be viewed as a special case of the general likelihood
ratio test. Indeed, it decides in favor of the null or the alternative by looking at the
likelihood ratio. Informally one can say: we select the null if it is more likely at the point
of observation Y .
An interesting question that arises in relation to the Neyman-Pearson result is how to interpret it when the true distribution IP coincides neither with IP0 nor with IP1 and is possibly not even within the considered parametric family (IPθ) . Wald called this situation the third-kind error. It is worth mentioning that the test φ∗t remains meaningful: it decides which of the two given measures IP0 and IP1 better describes the given data. However, it is no longer a likelihood ratio test. In analogy with estimation theory, one can call it a quasi likelihood ratio test.
6.2.1 Neyman-Pearson test for an i.i.d. sample
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample from a measure P . Suppose that P belongs
to some parametric family (Pθ,θ ∈ Θ ⊂ IRp) , that is, P = Pθ∗ for θ∗ ∈ Θ . Let also a
special point θ0 (a null) be fixed. The null hypothesis can be formulated as θ∗ = θ0 .
Similarly, a simple alternative is θ∗ = θ1 for some other point θ1 ∈ Θ . The Neyman-
Pearson test situation is a bit artificial: one reduces the whole parameter set Θ to just
these two points θ0 and θ1 and tests θ0 against θ1 .
As usual, the distribution of the data Y is described by the product measure IPθ = P⊗nθ . If µ0 is a dominating measure for (Pθ) and `(y,θ) def= log[dPθ(y)/dµ0] , then the log-likelihood L(Y ,θ) is
L(Y ,θ) def= log (dIPθ/dµ0)(Y ) = ∑i `(Yi,θ),
where µ0 = µ⊗n0 . The log-likelihood ratio of IPθ1 w.r.t. IPθ0 can be defined as
L(Y ,θ1,θ0) def= L(Y ,θ1) − L(Y ,θ0).
The related Neyman-Pearson test can be written as
φ∗t def= 1(L(Y ,θ1,θ0) > z) with z = log t .
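The Neyman-Pearson construction can be sketched in code for the Gaussian shift model treated in Section 6.3.1; the function name, the simulated data, and the hard-coded constant 1.6449 (the upper 5% quantile of the standard normal law) are our own additions, not part of the text.

```python
import numpy as np

Q95 = 1.6449  # upper 5% quantile of the standard normal law

def np_test(y, theta0, theta1, sigma, q=Q95):
    """Neyman-Pearson test of H0: theta = theta0 against H1: theta = theta1
    (theta1 > theta0) for i.i.d. N(theta, sigma^2) data."""
    n = len(y)
    S = y.sum()
    delta = theta1 - theta0
    # log-likelihood ratio L(Y, theta1, theta0)
    lr = ((S - n * theta0) * delta - n * delta**2 / 2) / sigma**2
    # level-alpha critical value; it depends on (theta0, theta1) only via delta
    z = (q * delta * sigma * np.sqrt(n) - n * delta**2 / 2) / sigma**2
    return lr > z

# Monte Carlo check of the level under the null
rng = np.random.default_rng(4)
level = np.mean([np_test(rng.normal(0.0, 1.0, 50), 0.0, 0.5, 1.0)
                 for _ in range(4000)])
```

The empirical rejection frequency under the null stays close to α = 0.05 , confirming the level calibration of the critical value.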
6.3 Likelihood ratio test
This section introduces a general likelihood ratio test in the framework of parametric
testing theory. Let, as usual, Y be the observed data, and IP be their distribution. The
parametric assumption is that IP ∈ (IPθ,θ ∈ Θ) , that is, IP = IPθ∗ for θ∗ ∈ Θ . Let
now two subsets Θ0 and Θ1 of the set Θ be given. The hypothesis H0 that we would
like to test is that IP ∈ (IPθ,θ ∈ Θ0) , or equivalently, θ∗ ∈ Θ0 . The alternative is that
θ∗ ∈ Θ1 .
The general likelihood approach leads to comparing the likelihood values L(Y ,θ) on
the hypothesis and alternative sets. Namely, the hypothesis is rejected if there is one
alternative point θ1 ∈ Θ1 such that the value L(Y ,θ) exceeds all similar values for
θ ∈ Θ0 . In other words, observing the sample Y under alternative IPθ1 is more likely
than under any measure IPθ from the null. Formally this relation can be written as:
supθ∈Θ0 L(Y ,θ) < supθ∈Θ1 L(Y ,θ).
In particular, a simple hypothesis means that the set Θ0 consists of one point θ0 and
this relation becomes of the form
L(Y ,θ0) < supθ∈Θ1 L(Y ,θ).
In general, the likelihood ratio (LR) test corresponds to the test statistic
T def= supθ∈Θ1 L(Y ,θ) − supθ∈Θ0 L(Y ,θ). (6.3)
The hypothesis is rejected if this test statistic exceeds some critical value z . Usually this critical value is selected to ensure the level condition:
IP(T > zα) ≤ α
for a given level α .
We have already seen that the LR test is optimal for testing two simple hypotheses. Later we show that this optimality property can be extended to some more general situations. Now we consider further examples of the LR test.
6.3.1 Gaussian shift model
For all examples considered in this section, we assume that the data Y in form of an
i.i.d. sample (Y1, . . . , Yn)> follow the model Yi = θ∗ + εi with εi ∼ N(0, σ2) for σ2
known. Equivalently Yi ∼ N(θ∗, σ2) . The log-likelihood L(Y , θ) (which we also denote
by L(θ) ) reads as
L(θ) = −(n/2) log(2πσ2) − (1/(2σ2)) ∑i (Yi − θ)2 (6.4)
and the log-likelihood ratio L(θ, θ0) = L(θ) − L(θ0) is given by
L(θ, θ0) = σ−2[(S − nθ0)(θ − θ0) − n(θ − θ0)2/2] (6.5)
with S def= Y1 + . . . + Yn . Moreover, under the measure IPθ0 , the variable S − nθ0 is zero-mean normal with variance nσ2 . This particularly implies that (S − nθ0)/(σ√n) is standard normal under IPθ0 :
L( (S − nθ0)/(σ√n) | IPθ0 ) = N(0, 1).
We start with the simplest case of a simple null and simple alternative.
Simple null and simple alternative Let the null H0 : θ∗ = θ0 be tested against the
alternative H1 : θ∗ = θ1 for some fixed θ1 6= θ0 . The log-likelihood ratio L(θ1, θ0) is given by (6.5), leading to the test statistic
T = σ−2[(S − nθ0)(θ1 − θ0) − n(θ1 − θ0)2/2].
The proper critical value zα can be selected from the α -level condition IPθ0(T > zα) = α . We use that the sum S − nθ0 is under the null zero-mean normal with variance nσ2 . With ξ = (S − nθ0)/(σ√n) ∼ N(0, 1) , the level condition can be rewritten as
IP( ξ > [σ2zα + n(θ1 − θ0)2/2] / (|θ1 − θ0|σ√n) ) = α.
As ξ is standard normal, the proper zα can be computed via a quantile of the standard normal law: if qα is defined by IP (ξ > qα) = α , then
[σ2zα + n(θ1 − θ0)2/2] / (|θ1 − θ0|σ√n) = qα
or
zα = σ−2[qα|θ1 − θ0|σ√n − n|θ1 − θ0|2/2].
It is worth noting that this value actually does not depend on θ0 . It only depends on the
difference |θ1 − θ0| between the null and the alternative. This is a very important and
useful property of the normal family and it is called pivotality. Another way of selecting
the critical value z is given by minimizing the sum of the first and second-kind error
probabilities. Theorem 6.2.1 leads to the choice z = 0 , or equivalently (for θ1 > θ0 ), to the test
φ = 1(S/n > (θ0 + θ1)/2) = 1(θ > (θ0 + θ1)/2).
This test is also called the Fisher discrimination. It naturally appears in classification
problems.
Two-sided test Now we consider a more general situation when the simple null θ∗ =
θ0 is tested against the alternative θ∗ 6= θ0 . Then the LR test compares the likelihood at
θ0 with the maximum likelihood over Θ\{θ0} which clearly coincides with the maximum
over the whole parameter set. This leads to the test statistic:
T = maxθ L(θ, θ0) = n|θ − θ0|2/(2σ2)
(see Section 2.9), where θ = S/n is the MLE. Now for a critical value z , the LR test
rejects the null if T ≥ z . The value z can be selected from the level condition:
IPθ0(T > z) = IPθ0(nσ−2|θ − θ0|2 > 2z) = α.
Now we use that nσ−2|θ− θ0|2 is χ21 -distributed. If zα is defined by IP (ξ2 ≥ 2zα) = α
for standard normal ξ , then the test φ = 1(T > zα) is of level α . Again, this value
does not depend on the null point θ0 , and the LR test is pivotal.
Exercise 6.3.1. Compute the power function of the test φ = 1(T > zα) .
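The two-sided LR test above can be sketched in a few lines; the function name, the simulated data, and the constant 3.8415 (the 95% quantile of the χ21 distribution, i.e. 2zα ) are our own additions.

```python
import numpy as np

CHI2_1_95 = 3.8415  # upper 5% quantile of the chi^2 distribution with 1 df

def two_sided_lr_test(y, theta0, sigma):
    """Two-sided LR test of H0: theta = theta0 in the Gaussian shift model
    with known sigma: T = n |mle - theta0|^2 / (2 sigma^2), reject if
    2 T exceeds the chi^2_1 quantile."""
    n = len(y)
    mle = y.mean()                        # MLE: the sample mean S/n
    T = n * (mle - theta0) ** 2 / (2 * sigma**2)
    return 2 * T > CHI2_1_95

# Monte Carlo check of the level under the null; note the test is pivotal:
# the rejection rule does not involve theta0 beyond centering
rng = np.random.default_rng(5)
level = np.mean([two_sided_lr_test(rng.normal(0.0, 2.0, 40), 0.0, 2.0)
                 for _ in range(4000)])
```

The empirical level stays close to α = 0.05 , as the χ21 calibration predicts.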
6.3.2 One-sided test
Now we consider the problem of testing the null θ∗ = θ0 against the one-sided alternative
H1 : θ > θ0 . To apply the LR test we have to compute the maximum of the log-likelihood
ratio L(θ, θ0) over the set Θ1 = {θ > θ0} .
Exercise 6.3.2. Check that
supθ>θ0 L(θ, θ0) = nσ−2|θ − θ0|2/2 if θ ≥ θ0 , and 0 otherwise.
Hint: if θ ≥ θ0 , then the maximum over Θ1 coincides with the global maximum; otherwise it is attained at the edge θ0 .
Now the LR test rejects the null if θ > θ0 and nσ−2|θ − θ0|2 > 2z for a critical value z . That is,
φ = 1(θ − θ0 > σ√(2z/n)).
The critical value z can again be chosen by the level condition. As ξ = √n(θ − θ0)/σ is standard normal under IPθ0 , one has to select z to ensure IP(ξ > √(2z)) = α .
6.3.3 Testing the mean when the variance is unknown
This section discusses the two-sided testing problem H0 : θ∗ = θ0 against H1 : θ∗ 6= θ0 for the Gaussian shift model Yi = θ∗ + σ∗εi with standard normal errors εi and unknown variance σ∗2 . Here the null hypothesis is composite because it involves the unknown variance σ∗2 .
The log-likelihood function is still given by (6.4) but now σ∗2 is a part of the parameter vector. Maximizing the log-likelihood L(θ, σ2) under the null leads to the value L(θ0, σ20) with
σ20 def= argmaxσ2 L(θ0, σ2) = n−1 ∑i (Yi − θ0)2.
As in Section 2.9.2 for the problem of variance estimation, it holds for any σ
L(θ0, σ20) − L(θ0, σ2) = nK(σ20, σ2).
At the same time, maximizing L(θ, σ2) over the alternative is equivalent to the global maximization leading to the value L(θ, σ2) with
θ = S/n, σ2 = n−1 ∑i (Yi − θ)2.
The LR test statistic reads as
T = L(θ, σ2)− L(θ0, σ20).
This expression can be decomposed in the following way:
T = L(θ, σ2) − L(θ0, σ2) + L(θ0, σ2) − L(θ0, σ20) = (n/(2σ2))(θ − θ0)2 − nK(σ20, σ2).
Often one considers another test in which the variance is only estimated under the alter-
native, that is, σ is used in place of σ0 . This is quite natural because the null can be
viewed as a particular case of the alternative. This leads to the test statistic
T ∗ = L(θ, σ2) − L(θ0, σ2) = (n/(2σ2))(θ − θ0)2.
An advantage of this expression is that its distribution under the measure IPθ0,σ² does not depend on θ0 or on σ². After proper rescaling, this statistic follows a Fisher distribution, which will be discussed in Chapter 7.
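The decomposition above is easy to verify numerically. The sketch below (Python/numpy; the function name and data are ours) computes T and T∗ using σ̂0² = n⁻¹Σ(Yi − θ0)² and σ̃² = n⁻¹Σ(Yi − θ̂)², with the Gaussian variance divergence K(σ̂0², σ̃²) = [log(σ̃²/σ̂0²) + σ̂0²/σ̃² − 1]/2. Since σ̂0² = σ̃² + (θ̂ − θ0)², the decomposition collapses to T = (n/2) log(σ̂0²/σ̃²), and 2T∗ coincides with the squared t-statistic up to the factor n/(n − 1).

```python
import numpy as np

def lr_stats_unknown_variance(y, theta0):
    """T and T* from Section 6.3.3 for the Gaussian shift model with
    unknown variance (theta_hat = mean, sigma_tilde2 = variance MLE)."""
    n = len(y)
    theta_hat = np.mean(y)
    sigma_tilde2 = np.mean((y - theta_hat) ** 2)   # variance MLE under the alternative
    sigma0_hat2 = np.mean((y - theta0) ** 2)       # variance MLE under the null
    T_star = n * (theta_hat - theta0) ** 2 / (2 * sigma_tilde2)
    # K(sigma0_hat2, sigma_tilde2) for two Gaussian variances
    K = 0.5 * (np.log(sigma_tilde2 / sigma0_hat2) + sigma0_hat2 / sigma_tilde2 - 1)
    T = T_star - n * K
    return T, T_star

y = np.array([0.3, -1.2, 0.5, 1.7, -0.4, 0.9])
T, T_star = lr_stats_unknown_variance(y, theta0=0.0)
```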
6.3.4 LR-tests. Examples
(to be inserted)
6.4 Testing problem for a univariate exponential family
Let (Pθ, θ ∈ Θ ⊆ IR1) be a univariate exponential family. The choice of parametrization is unimportant; any parametrization can be taken. To be specific, we assume the natural parametrization, which simplifies the expression for the maximum likelihood estimate.
We assume that two functions of θ are fixed, C(θ) and B(θ), with which the log-density of Pθ can be written in the form

    ℓ(y, θ) def= log p(y, θ) = yC(θ) − B(θ) − ℓ(y)

for some other function ℓ(y). The function C(θ) is monotonic in θ, and C(θ) and B(θ) are related (for the case of an EFn) by the identity B′(θ) = θC′(θ); see Section 2.11.
Let now Y = (Y1, . . . , Yn) be an i.i.d. sample from Pθ∗ for θ∗ ∈ Θ . The task is to
test a simple hypothesis θ∗ = θ0 against an alternative θ∗ ∈ Θ1 for some subset Θ1
that does not contain θ0 .
6.4.1 Two-sided alternative
We start with the case of a simple hypothesis H0 : θ∗ = θ0 against a full two-sided alter-
native H1 : θ∗ 6= θ0 . The likelihood ratio approach suggests to compare the likelihood at
θ0 with the maximum of the likelihood over the alternative, that effectively means the
maximum over the whole parameter set. In the case of a univariate exponential family,
this maximum is computed in Section 2.11. For

    L(θ, θ0) def= L(θ) − L(θ0) = S[C(θ) − C(θ0)] − n[B(θ) − B(θ0)]

with S = Y1 + ... + Yn, it holds

    T def= sup_θ L(θ, θ0) = nK(θ̂, θ0),

where K(θ, θ′) is the Kullback-Leibler divergence between the measures Pθ and Pθ′. For an EFn, the MLE θ̂ is the empirical mean of the observations Yi, θ̂ = S/n, and the KL divergence K(θ̂, θ0) is of the form

    K(θ̂, θ0) = θ̂[C(θ̂) − C(θ0)] − [B(θ̂) − B(θ0)].

Therefore, the test statistic T is a function of the empirical mean θ̂ = S/n:

    T = nK(θ̂, θ0) = nθ̂[C(θ̂) − C(θ0)] − n[B(θ̂) − B(θ0)].    (6.6)
The LR test rejects H0 if the test statistic T exceeds a critical value z . Given α ∈ (0, 1) ,
a proper CV zα can be specified by the level condition
IPθ0(T > zα) = α.
In view of (6.6), the LR test rejects the null if the “distance” K(θ̂, θ0) between the estimate θ̂ and the null θ0 is significantly larger than zero. In the case of an exponential family, one can simplify the test by considering the estimate θ̂ itself as test statistic. We use the following technical result for the KL divergence K(θ̂, θ0):
Lemma 6.4.1. Let (Pθ) be an EFn. Then for every z there are two positive values
t−(z) and t+(z) such that
{θ : K(θ, θ0) ≤ z} = {θ : θ0 − t−(z) ≤ θ ≤ θ0 + t+(z)}. (6.7)
In other words, the conditions K(θ, θ0) ≤ z and θ0 − t−(z) ≤ θ ≤ θ0 + t+(z) are
equivalent.
Proof. The function K(θ, θ0) of the first argument θ fulfills

    ∂K(θ, θ0)/∂θ = C(θ) − C(θ0),    ∂²K(θ, θ0)/∂θ² = C′(θ) > 0.
Therefore, it is convex in θ with minimum at θ0 , and it can cross the level z only once
from the left of θ0 and once from the right. This yields that for any z > 0 , there are
two positive values t−(z) and t+(z) such that (6.7) holds. Note that one or even both
of these values can be infinite.
Due to the result of this lemma, the LR test can be rewritten as

    φ = 1 − 1(−t−(z) ≤ θ̂ − θ0 ≤ t+(z)) = 1(θ̂ > θ0 + t+(z)) + 1(θ̂ < θ0 − t−(z)),

that is, the test rejects the null if the estimate θ̂ deviates significantly from θ0.
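For a concrete instance, take the Poisson family with C(θ) = log θ and B(θ) = θ, so that (6.6) becomes T = n[θ̂ log(θ̂/θ0) − (θ̂ − θ0)]. The sketch below (Python/numpy; the function names are ours) is a hedged illustration of this two-sided test; in practice the critical value z would be fixed by the level condition, e.g. by Monte Carlo under IPθ0.

```python
import numpy as np

def kl_poisson(theta, theta0):
    """K(theta, theta0) = theta*(C(theta) - C(theta0)) - (B(theta) - B(theta0))
    for the Poisson family: C(t) = log t, B(t) = t."""
    return theta * np.log(theta / theta0) - (theta - theta0)

def lr_test_two_sided(y, theta0, z):
    """T = n * K(theta_hat, theta0) with theta_hat = S/n; reject if T > z.
    Assumes theta_hat > 0 so that the logarithm is defined."""
    n = len(y)
    theta_hat = np.mean(y)
    T = n * kl_poisson(theta_hat, theta0)
    return T, T > z
```

In line with Lemma 6.4.1, rejecting for large K(θ̂, θ0) is the same as rejecting when θ̂ leaves an interval [θ0 − t−(z), θ0 + t+(z)], since the divergence is convex in its first argument with minimum at θ0.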
6.4.2 One-sided alternative
Now we consider the problem of testing the same null H0 : θ∗ = θ0 against the one-sided
alternative H1 : θ∗ > θ0 . Of course, the other one-sided alternative H1 : θ∗ < θ0 can be
considered as well.
The LR test requires computing the maximum of the log-likelihood over the alterna-
tive set {θ : θ > θ0}. This can be done as in the Gaussian shift model. If θ̂ > θ0, then this maximum coincides with the global maximum over all θ. Otherwise, it is attained at θ = θ0.
Lemma 6.4.2. Let (Pθ) be an EFn. Then

    sup_{θ>θ0} L(θ, θ0) = nK(θ̂, θ0) if θ̂ ≥ θ0, and = 0 otherwise.
Proof. It is only necessary to consider the case θ̂ < θ0. The difference L(θ̂) − L(θ) can be represented as nK(θ̂, θ). Next, one more use of the identity B′(θ) = θC′(θ) yields

    ∂K(θ̂, θ)/∂θ = (θ − θ̂)C′(θ) > 0

for any θ > θ0 > θ̂. This means that L(θ) − L(θ̂) decreases as θ grows beyond θ0, so L(θ, θ0) attains its maximum over the alternative at the edge θ = θ0.
This fact implies the following representation of the LR test in the case of a one-sided
alternative.
Theorem 6.4.3. Let (Pθ) be an EFn. Then the α -level LR test for the null H0 : θ∗ =
θ0 against the one-sided alternative H1 : θ∗ > θ0 is
    φ = 1(θ̂ > θ0 + tα),    (6.8)

where tα is selected to ensure IPθ0(θ̂ > θ0 + tα) = α.
Proof. Let T be the LR test statistic. Due to Lemma 6.4.2, the inequality T > z can be rewritten as θ̂ > θ0 + t(z) for some t(z). It remains to select a proper value t(z) to ensure the level condition.
This result can be extended naturally to the case of a composite null hypothesis
H0 : θ∗ ≤ θ0 .
Theorem 6.4.4. Let (Pθ) be an EFn. Then the α -level LR test for the composite null
H0 : θ∗ ≤ θ0 against the one-sided alternative H1 : θ∗ > θ0 is
    φ∗α = 1(θ̂ > θ0 + tα),    (6.9)

where tα is selected to ensure IPθ0(θ̂ > θ0 + tα) = α.
Proof. The same arguments as in the proof of Theorem 6.4.3 lead to exactly the same LR test statistic T and thus to the test of the form (6.8). In particular, the estimate θ̂ should significantly deviate from the null set. It remains to check that the level condition for the edge point θ0 ensures the level for all θ < θ0. This follows from the next monotonicity property.
Lemma 6.4.5. Let (Pθ) be an EFn. Then for any t ≥ 0

    IPθ(θ̂ > θ0 + t) ≤ IPθ0(θ̂ > θ0 + t),    ∀θ < θ0.
Proof. Let θ < θ0. We apply

    IPθ(θ̂ > θ0 + t) = IEθ0 exp{L(θ, θ0)} 1(θ̂ > θ0 + t).

Now, since C(θ) < C(θ0) for θ < θ0, the log-likelihood ratio L(θ, θ0) = n{θ̂[C(θ) − C(θ0)] − [B(θ) − B(θ0)]} is decreasing in θ̂, and at θ̂ = θ0 it equals −nK(θ0, θ) < 0. Hence L(θ, θ0) < 0 on the set {θ̂ > θ0 > θ}. This yields the result.
Therefore, if the level is controlled under IPθ0, it is automatically controlled for all other points in the null set.
A very nice feature of the LR test is that it can be universally represented in terms of θ̂ independently of the form of the alternative set. In particular, for the case of a one-sided alternative, this test just compares the estimate θ̂ with the value θ0 + tα. Moreover, the value tα only depends on the distribution of θ̂ under IPθ0 via the level condition. This and the monotonicity of the error probability from Lemma 6.4.5 allow us to state the nice optimality property of this test: φ∗α is uniformly most powerful in the sense of Definition 6.1.1, that is, it maximizes the test power under the level constraint.
Theorem 6.4.6. Let (Pθ) be an EFn, and let φ∗α be the test from (6.9) for testing H0 : θ∗ ≤ θ0 against H1 : θ∗ > θ0. For any (randomized) test φ satisfying IEθ0 φ ≤ α and any θ > θ0, it holds

    IEθ φ ≤ IPθ(φ∗α = 1).
In fact, this theorem repeats the Neyman-Pearson result of Theorem 6.2.2 because
the test φ∗α is at the same time the LR α -level test of the simple hypothesis θ∗ = θ0
against θ∗ = θ .
6.4.3 Interval hypothesis
In some applications, the null hypothesis is naturally formulated in the form that the
parameter θ∗ belongs to a given interval [θ0, θ1] . The alternative H1 : θ∗ ∈ Θ \ [θ0, θ1]
is the complement of this interval. The likelihood ratio test is based on the test statistic
T from (6.3) which compares the maximum of the log-likelihood L(θ) under the null
[θ0, θ1] with the maximum over the alternative set. The special structure of the log-
likelihood in the case of an EFn permits representing this test statistic in terms of the estimate θ̂: the hypothesis is rejected if the estimate θ̂ significantly deviates from the interval [θ0, θ1].
Theorem 6.4.7. Let (Pθ) be an EFn. Then the α-level LR test for the null H0 : θ∗ ∈ [θ0, θ1] against the alternative H1 : θ∗ ∉ [θ0, θ1] can be written as

    φ = 1(θ̂ > θ1 + t+α) + 1(θ̂ < θ0 − t−α),    (6.10)

where t+α and t−α are selected to ensure IPθ0(θ̂ < θ0 − t−α) = α/2 and IPθ1(θ̂ > θ1 + t+α) = α/2.
Exercise 6.4.1. Prove the result of Theorem 6.4.7.
Hint: Consider three cases: θ̂ ∈ [θ0, θ1], θ̂ > θ1, and θ̂ < θ0. For every case, apply the monotonicity argument of Lemma 6.4.2.
One can consider the alternative of the interval hypothesis as a combination of two one-sided alternatives. The LR test φ from (6.10) involves only one critical value z, and the parameters t−α and t+α are related via the structure of this test: they are obtained by transforming the inequality T > z into θ̂ > θ1 + t+α and θ̂ < θ0 − t−α. However, one can also just apply two one-sided tests independently: one for the alternative H−1 : θ∗ < θ0 and one for H+1 : θ∗ > θ1. This leads to two separate tests:

    φ− def= 1(θ̂ < θ0 − t−),    φ+ def= 1(θ̂ > θ1 + t+).
The values t−, t+ can be chosen by the so-called Bonferroni rule: just perform each of
the two tests at level α/2 .
Exercise 6.4.2. Let the values t−, t+ be selected to ensure

    IPθ0(θ̂ < θ0 − t−) = α/2,    IPθ1(θ̂ > θ1 + t+) = α/2.

Then for any θ ∈ [θ0, θ1], the combined test φ = φ− + φ+ fulfills

    IPθ(φ = 1) ≤ α.
Hint: use the monotonicity from Lemma 6.4.5.
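The Bonferroni construction can be sketched for the Gaussian shift model with known variance, where θ̂ ∼ N(θ, σ²/n) and both thresholds may be taken as t± = σ z_{α/2}/√n (Python; the function name and the model choice are ours, as an illustration of the rule above).

```python
import numpy as np
from statistics import NormalDist

def bonferroni_interval_test(y, theta0, theta1, sigma, alpha=0.05):
    """Test H0: theta in [theta0, theta1] by combining two one-sided
    tests, each at level alpha/2 (Bonferroni rule), in the Gaussian
    shift model with known sigma, where theta_hat ~ N(theta, sigma^2/n)."""
    n = len(y)
    theta_hat = float(np.mean(y))
    t = sigma * NormalDist().inv_cdf(1 - alpha / 2) / np.sqrt(n)
    return theta_hat < theta0 - t or theta_hat > theta1 + t
```

By the monotonicity of Lemma 6.4.5, the two edge points θ0 and θ1 are the worst cases, so the combined rejection probability stays below α everywhere on [θ0, θ1].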
Chapter 7
Testing in linear models
This chapter discusses the testing problem for linear Gaussian models given by the equa-
tion
Y = f + ε (7.1)
with the vector of observations Y , response vector f , and vector of errors ε in IRn .
The linear parametric assumption (linear PA) means that
Y = Ψ>θ + ε (7.2)
where Ψ is the p×n design matrix. By θ we denote the p -dimensional target parameter
vector, θ ∈ Θ ⊆ IRp . Usually we assume that the parameter set coincides with the
whole space IRp, i.e. Θ = IRp. The most general assumption about the vector of errors ε = (ε1, ..., εn)> is Var(ε) = Σ, which allows for inhomogeneous and correlated errors. However, for most results we assume i.i.d. errors εi ∼ N(0, σ²). The variance σ² could be unknown as well. As in previous chapters, θ∗ denotes the true value of the parameter vector (assuming that the model (7.2) is correct).
7.1 Likelihood ratio test for a simple null
This section discusses the problem of testing a simple hypothesis H0 : θ∗ = θ0 for a
given vector θ0 . A natural “non-informative” alternative is H1 : θ∗ 6= θ0 .
7.1.1 General errors
We start from the case of general errors with known covariance matrix Σ . The results
obtained for the estimation problem in Chapter 4 will be heavily used in our study. In
particular, the MLE θ̂ of θ∗ is

    θ̂ = (ΨΣ⁻¹Ψ>)⁻¹ ΨΣ⁻¹Y

and the corresponding maximized log-likelihood ratio is

    L(θ̂, θ0) = (θ̂ − θ0)> B (θ̂ − θ0)/2

with the p×p matrix B given by

    B = ΨΣ⁻¹Ψ>.
This immediately leads to the following representation for the likelihood ratio (LR) test statistic in this set-up:

    T def= sup_θ L(θ, θ0) = (θ̂ − θ0)> B (θ̂ − θ0)/2.    (7.3)

Moreover, Wilks’ phenomenon claims that under IPθ0, the test statistic T has a fixed distribution: namely, 2T is χ²p-distributed (chi-squared with p degrees of freedom).
Theorem 7.1.1. Consider the model (7.2) with ε ∼ N(0, Σ) for a known matrix Σ. Then the LR test statistic T is given by (7.3). Moreover, if zα fulfills IP(ζp > 2zα) = α with ζp ∼ χ²p, then the LR test φ with

    φ def= 1(T > zα)    (7.4)

is of exact level α:

    IPθ0(φ = 1) = α.
This result follows directly from Theorem 4.6.1. We see again the important pivotal
property of the test: the critical value zα only depends on the dimension of the parameter
space Θ . It does not depend on the design matrix Ψ , error covariance Σ , and the null
value θ0 .
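The pivotality of 2T is easy to check by simulation. The sketch below (Python/numpy; the design, covariance, and sample sizes are arbitrary choices of ours) draws data under IPθ0 for a random design and an inhomogeneous known Σ, computes T from (7.3), and checks that 2T behaves like χ²p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
Psi = rng.standard_normal((p, n))             # p x n design matrix, as in (7.2)
Sigma = np.diag(rng.uniform(0.5, 2.0, n))     # known (inhomogeneous) error covariance
Sigma_inv = np.linalg.inv(Sigma)
B = Psi @ Sigma_inv @ Psi.T                   # B = Psi Sigma^{-1} Psi^T
theta0 = np.zeros(p)

def lr_stat(Y):
    """T = (theta_hat - theta0)' B (theta_hat - theta0) / 2, cf. (7.3)."""
    theta_hat = np.linalg.solve(B, Psi @ Sigma_inv @ Y)
    d = theta_hat - theta0
    return 0.5 * d @ B @ d

L = np.linalg.cholesky(Sigma)                 # to draw errors with Var(eps) = Sigma
samples = [2 * lr_stat(Psi.T @ theta0 + L @ rng.standard_normal(n))
           for _ in range(2000)]
print(np.mean(samples))                       # close to p = 3, the chi-squared mean
```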
7.1.2 I.i.d. errors, known variance
We now specify this result for the case of i.i.d. errors. We also focus on the residuals

    ε̂ def= Y − Ψ>θ̂ = Y − f̂,

where f̂ = Ψ>θ̂ is the estimate of the true response f = Ψ>θ∗. We start with some geometric properties of the residuals ε̂ and the test statistic T from (7.3).
Theorem 7.1.2. Consider the model (7.1). Let T be the LR test statistic built under the assumptions f = Ψ>θ∗ and Var(ε) = σ²In with a known value σ². Then T is given by

    T = ‖Ψ>(θ̂ − θ0)‖²/(2σ²) = ‖f̂ − f0‖²/(2σ²).    (7.5)

Moreover, the following decompositions for the vector of observations Y and for the errors ε = Y − f0 hold:

    Y − f0 = (f̂ − f0) + ε̂,    (7.6)
    ‖Y − f0‖² = ‖f̂ − f0‖² + ‖ε̂‖²,    (7.7)

where f̂ − f0 is the estimation error and ε̂ = Y − f̂ is the vector of residuals.
Proof. The key step of the proof is the representation of the estimated response f̂ under the model assumption Y = f + ε as a projection of the data on the p-dimensional linear subspace L in IRn spanned by the rows of the matrix Ψ:

    f̂ = ΠY = Π(f + ε),

where Π = Ψ>(ΨΨ>)⁻¹Ψ is the projector onto L; see Section 4.3. Note that this decomposition is valid for the general linear model; the parametric form of the response f and the noise normality are not required. The identity Ψ>(θ̂ − θ0) = f̂ − f0 follows directly from the definition, implying the representation (7.5) for the test statistic T. The identity (7.6) follows from the definition. Next, Πf0 = f0 and thus f̂ − f0 = Π(Y − f0). Similarly,

    ε̂ = Y − f̂ = (In − Π)Y.

As Π and In − Π are orthogonal projectors, it follows that

    ‖Y − f0‖² = ‖(In − Π)Y + Π(Y − f0)‖² = ‖(In − Π)Y‖² + ‖Π(Y − f0)‖²,

and the decomposition (7.7) follows.
The decomposition (7.6), although straightforward, is very important for understanding the structure of the residuals under the null and under the alternative. Under the null H0, the response f is assumed to be known and coincides with f0, so the residuals Y − f0 coincide with the errors ε. The corresponding sum of squared residuals is usually abbreviated as RSS:

    RSS0 def= ‖Y − f0‖².
Under the alternative, the response is unknown and is estimated by f̂. The residuals are ε̂ = Y − f̂, resulting in the RSS

    RSS def= ‖Y − f̂‖².

The decomposition (7.7) can be rewritten as

    RSS0 = RSS + ‖f̂ − f0‖².    (7.8)

We see that the RSS under the null and the alternative can be essentially different only if the estimate f̂ significantly deviates from the null assumption f = f0. The test statistic T from (7.3) can be written as

    T = (RSS0 − RSS)/(2σ²).
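The decomposition (7.8) is an exact algebraic identity and can be checked numerically; the sketch below (Python/numpy; data and sizes are ours) builds the projector Π and verifies RSS0 = RSS + ‖f̂ − f0‖² on simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
Psi = rng.standard_normal((p, n))
theta0 = rng.standard_normal(p)
Y = Psi.T @ theta0 + rng.standard_normal(n)     # data generated under the null

Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)  # projector onto the row span of Psi
f0 = Psi.T @ theta0
f_hat = Pi @ Y                                   # estimated response

RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - f_hat) ** 2)
gap = abs(RSS0 - RSS - np.sum((f_hat - f0) ** 2))
print(gap)                                       # zero up to rounding
```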
Now we show that if the model assumptions are correct, the test T has the exact
level α and is pivotal.
Theorem 7.1.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for a known value σ², i.e. the εi are i.i.d. normal. The LR test φ from (7.4) is of exact level α. Moreover, f̂ − f0 and ε̂ are under IPθ0 zero-mean independent Gaussian vectors satisfying

    2T = σ⁻²‖f̂ − f0‖² ∼ χ²p,    σ⁻²‖ε̂‖² ∼ χ²_{n−p}.    (7.9)
Proof. The null assumption f = f0 together with Πf0 = f0 implies now the following decomposition:

    f̂ − f0 = Πε,    ε̂ = ε − Πε = (In − Π)ε.

Next, Π and In − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors Πε and (In − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.9) for the distribution of Πε was proved in Theorem 4.6.1. For ε̂ = (In − Π)ε, the proof is similar.
Next we discuss the power of the LR test φ, defined as the probability of detecting the alternative when the response f deviates from the null f0. In the next result we do not assume that the true response f follows the linear PA f = Ψ>θ, and we show that the test power depends on f only through the value ‖Π(f − f0)‖².
Theorem 7.1.4. Consider the model (7.1) with Var(ε) = σ²In for a known value σ². Define

    ∆ = σ⁻²‖Π(f − f0)‖².

Then the power of the LR test φ only depends on ∆, i.e. it is the same for all f with equal ∆-value. It holds

    IP(φ = 0) = IP(|ξ1 + √∆|² + ξ2² + ... + ξp² ≤ 2zα)

with ξ = (ξ1, ..., ξp)> ∼ N(0, IIp).
Proof. It follows from f̂ = ΠY = Π(f + ε) and f0 = Πf0 for the test statistic T = (2σ²)⁻¹‖f̂ − f0‖² that

    T = (2σ²)⁻¹‖Π(f − f0) + Πε‖².

Now we show that the distribution of T depends on the response f only via the value ∆. For this we compute the Laplace transform of T.

Lemma 7.1.5. It holds for µ < 1

    g(µ) def= log IE exp{µT} = µ∆/(2(1 − µ)) − (p/2) log(1 − µ).
Proof. For a standard Gaussian random variable ξ and any a, it holds

    IE exp{µ|ξ + a|²/2}
      = e^{µa²/2} (2π)^{−1/2} ∫ exp{µax + µx²/2 − x²/2} dx
      = exp{µa²/2 + µ²a²/(2(1 − µ))} (2π)^{−1/2} ∫ exp{−((1 − µ)/2)(x − aµ/(1 − µ))²} dx
      = exp{µa²/(2(1 − µ))} (1 − µ)^{−1/2}.
The projector Π can be represented as Π = U>ΛpU for an orthogonal transform U and the diagonal matrix Λp = diag(1, ..., 1, 0, ..., 0) with exactly p unit eigenvalues. This permits representing T in the form

    T = Σ_{j=1}^p (ξj + aj)²/2

with i.i.d. standard normal r.v.'s ξj and numbers aj satisfying Σj aj² = ∆. The independence of the ξj's implies

    g(µ) = Σ_{j=1}^p [µaj²/2 + µ²aj²/(2(1 − µ)) − (1/2) log(1 − µ)] = µ∆/(2(1 − µ)) − (p/2) log(1 − µ),

as required.
The result of Lemma 7.1.5 claims that the Laplace transform of T depends on f
only via ∆ and so this also holds for the distribution of T .
The distribution of the squared norm ‖ξ + h‖² for ξ ∼ N(0, IIp) and any fixed vector h ∈ IRp with ‖h‖² = ∆ is called non-central chi-squared with the non-centrality parameter ∆. In particular, for each α, α1, one can define the minimal value ∆ providing the prescribed error α1 of the second kind under the given level α by the conditions

    IP(‖ξ + h‖² ≥ 2zα) ≥ 1 − α1    subject to    IP(‖ξ‖² ≥ 2zα) ≤ α    (7.10)

for all h with ‖h‖² ≥ ∆. The results from Section 4.6 indicate that the value zα can be bounded from above by p + √(2p log α⁻¹) for moderate values of α⁻¹. For evaluating the value ∆, the following decomposition is useful:

    ‖ξ + h‖² − ‖h‖² − p = ‖ξ‖² − p + 2h>ξ.

The right-hand side of this equality is a sum of centered Gaussian quadratic and linear forms. In particular, the cross term 2h>ξ is a centered Gaussian r.v. with variance 4‖h‖², while Var(‖ξ‖²) = 2p. These arguments suggest taking ∆ of order p to ensure the prescribed error α1 of the second kind.
Theorem 7.1.6. For each α, α1 ∈ (0, 1), there are absolute constants C and C1 such that (7.10) is fulfilled for ‖h‖² ≥ ∆ with

    ∆^{1/2} = (Cp log α⁻¹)^{1/2} + (C1 p log α1⁻¹)^{1/2}.
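The key fact behind Theorem 7.1.4, namely that the law of ‖ξ + h‖² depends on h only through ∆ = ‖h‖², can be illustrated by simulation (Python/numpy; a hedged sketch with arbitrary choices of p and ∆ made by us): two different shift directions with the same norm give the same mean p + ∆.

```python
import numpy as np

def noncentral_mc(h, reps=20000, seed=0):
    """Monte Carlo draws of ||xi + h||^2 with xi ~ N(0, I_p):
    non-central chi-squared with non-centrality Delta = ||h||^2."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((reps, len(h)))
    return np.sum((xi + h) ** 2, axis=1)

# two shifts with the same Delta = 4 but different directions
s1 = noncentral_mc(np.array([2.0, 0.0, 0.0]))
s2 = noncentral_mc(np.array([0.0, 2.0, 0.0]), seed=7)
print(np.mean(s1), np.mean(s2))   # both close to p + Delta = 7
```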
7.1.3 Smooth Wald test
The result of Theorem 7.1.6 reveals a problem with the power of the LR test when the dimensionality of the parameter space grows. Indeed, the test remains insensitive for all alternatives in the zone σ⁻²‖Π(f − f0)‖² ≤ C·p, and this zone becomes larger and larger with p. (to be done)
7.1.4 I.i.d. errors with unknown variance
This section briefly discusses the case when the errors εi are i.i.d. but the variance σ² = Var(εi) is unknown. A natural idea in this case is to estimate the variance from the data. The decomposition (7.8) and the independence of RSS = ‖Y − f̂‖² and ‖f̂ − f0‖² are particularly helpful. Theorem 7.1.3 suggests estimating σ² from the RSS by

    σ̂² = RSS/(n − p) = ‖Y − f̂‖²/(n − p).
Indeed, due to the result (7.9), σ⁻² RSS ∼ χ²_{n−p}, yielding

    IE σ̂² = σ²,    Var σ̂² = 2σ⁴/(n − p),    (7.11)

and therefore σ̂² is an unbiased, root-n consistent estimate of σ².

Exercise 7.1.1. Check (7.11). Show that σ̂² − σ² → 0 in probability.
Now we consider the LR test statistic (7.5) in which the true variance is replaced by its estimate σ̂²:

    T̃ def= ‖f̂ − f0‖²/(2σ̂²) = (n − p)‖f̂ − f0‖²/(2‖Y − f̂‖²) = (RSS0 − RSS)/(2 RSS/(n − p)).
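The statistic based on the estimated variance relates to the classical F-statistic through 2T̃/p = [(RSS0 − RSS)/p] / [RSS/(n − p)]. The sketch below (Python/numpy; data and sizes are ours) verifies this identity on simulated data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
Psi = rng.standard_normal((p, n))
theta0 = np.ones(p)
Y = Psi.T @ theta0 + 0.7 * rng.standard_normal(n)   # sigma^2 unknown to the test

Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)
f0, f_hat = Psi.T @ theta0, Pi @ Y
RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - f_hat) ** 2)

T_tilde = (n - p) * (RSS0 - RSS) / (2 * RSS)
F = ((RSS0 - RSS) / p) / (RSS / (n - p))            # classical F statistic
print(abs(2 * T_tilde / p - F))                     # zero up to rounding
```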
The result of Theorem 7.1.3 implies the pivotal property for this test statistic as well.

Theorem 7.1.7. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then the distribution of the test statistic T̃ under IPθ0 only depends on p and n − p:

    2T̃/p ∼ F_{p,n−p},

where F_{p,n−p} denotes the Fisher distribution with parameters p, n − p:

    F_{p,n−p} = L( (‖ξp‖²/p) / (‖ξ_{n−p}‖²/(n − p)) ),

where ξp and ξ_{n−p} are two independent standard Gaussian vectors of dimension p and n − p. In particular, this distribution does not depend on the design matrix Ψ, the noise variance σ², and the null value θ0.
This result suggests fixing the critical value z for the test statistic T̃ using the quantiles of the Fisher distribution: if tα is such that F_{p,n−p}(tα) = 1 − α, then zα = p tα/2.

Theorem 7.1.8. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². If F_{p,n−p}(tα) = 1 − α and zα = p tα/2, then the test φ = 1(T̃ ≥ zα) is a level-α test:

    IPθ0(φ = 1) = IPθ0(T̃ ≥ zα) = α.
Exercise 7.1.2. Prove the result of Theorem 7.1.8.
If the sample size n is sufficiently large, then σ̂² is very close to σ², and one can apply an approximate choice of the critical value zα from the case of known σ²:

    φ̃ = 1(T̃ ≥ zα).

This test is not of exact level α, but it is of asymptotic level α. Its power function is also close to the power function of the test φ corresponding to the known variance σ².
Theorem 7.1.9. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then

    lim_{n→∞} IPθ0(φ̃ = 1) = α.    (7.12)

Moreover,

    lim_{n→∞} sup_f |IPθ0(φ̃ = 1) − IPθ0(φ = 1)| = 0.    (7.13)
Exercise 7.1.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for σ² unknown.

• Prove (7.12).
• Prove (7.13).

Hint:

• The consistency of σ̂² permits restricting to the case |σ̂²/σ² − 1| ≤ δn for some δn → 0.
• The independence of ‖f̂ − f0‖² and σ̂² permits considering the distribution of 2T̃ = ‖f̂ − f0‖²/σ̂² as if σ̂² were a fixed number close to σ².
• Use that for ζp ∼ χ²p

    IP(ζp ≥ zα(1 + δn)) − IP(ζp ≥ zα) → 0, n → ∞.
7.2 Likelihood ratio test for a linear hypothesis
The previous section dealt with the case of a simple null hypothesis. This section consid-
ers a more general situation when the null hypothesis concerns a subvector of the parameter vector. This means that the whole model is given by (7.2), but the vector θ is decomposed into two parts: θ = (γ, η), where γ is of dimension p0 < p. The null hypothesis assumes that η = η0, while γ remains unspecified. Usually η0 = 0, but the particular value of η0 is not important. To simplify the presentation, we assume η0 = 0, leading to the subset Θ0 of Θ

    Θ0 = {θ = (γ, 0)}.

Under the null hypothesis, the model is still linear:

    Y = Ψ>γ γ + ε,

where Ψγ denotes the submatrix of Ψ composed of the rows of Ψ corresponding to the γ-components of θ.
Fix any point θ0 ∈ Θ0, e.g. θ0 = 0, and define the corresponding response f0 = Ψ>θ0. The LR test statistic T can be written in the form

    T = max_{θ∈Θ} L(θ, θ0) − max_{θ∈Θ0} L(θ, θ0).    (7.14)

The results of both maximization problems are known:

    max_{θ∈Θ} L(θ, θ0) = ‖f̂ − f0‖²/(2σ²),    max_{θ∈Θ0} L(θ, θ0) = ‖f̂0 − f0‖²/(2σ²),

where f̂ and f̂0 are the estimates of the response under the alternative and under the null, respectively. As in Theorem 7.1.2, we can establish the following geometric decomposition.
Theorem 7.2.1. Consider the model (7.1). Let T be the LR test statistic from (7.14) built under the assumptions f = Ψ>θ∗ and Var(ε) = σ²In with a known value σ². Then T is given by

    T = ‖f̂ − f̂0‖²/(2σ²).

Moreover, the following decompositions for the vector of observations Y and for the residuals ε̂0 = Y − f̂0 from the null hold:

    Y − f̂0 = (f̂ − f̂0) + ε̂,
    ‖Y − f̂0‖² = ‖f̂ − f̂0‖² + ‖ε̂‖²,    (7.15)

where f̂ − f̂0 is the difference between the estimated response under the null and under the alternative, and ε̂ = Y − f̂ is the vector of residuals from the alternative.
Proof. The proof is similar to the proof of Theorem 7.1.2. We use that f̂ = ΠY, where Π = Ψ>(ΨΨ>)⁻¹Ψ is the projector on the space L spanned by the rows of Ψ. Similarly, f̂0 = Π0Y, where Π0 = Ψ>γ(ΨγΨ>γ)⁻¹Ψγ is the projector on the subspace L0 spanned by the rows of Ψγ. This yields the decompositions

    f̂ − f̂0 = (Π − Π0)Y,    ε̂ = Y − f̂ = (In − Π)Y.

As Π − Π0 and In − Π are orthogonal projectors, it follows that

    ‖Y − f̂0‖² = ‖(In − Π)Y + (Π − Π0)Y‖² = ‖(In − Π)Y‖² + ‖(Π − Π0)Y‖²,

and the decomposition (7.15) follows.
The decomposition (7.15) can again be represented as RSS0 = RSS + 2σ²T, where RSS = ‖Y − f̂‖² is the sum of squared residuals under the alternative, while now RSS0 = ‖Y − f̂0‖² is computed from the residuals of the null model.
Now we show that if the model assumptions are correct, the test T has the exact
level α and is pivotal.
Theorem 7.2.2. Consider the model (7.1) with ε ∼ N(0, σ²In) for a known value σ², i.e. the εi are i.i.d. normal. Then f̂ − f̂0 and ε̂ are under IPθ0 zero-mean independent Gaussian vectors satisfying

    2T = σ⁻²‖f̂ − f̂0‖² ∼ χ²_{p−p0},    σ⁻²‖ε̂‖² ∼ χ²_{n−p}.    (7.16)

Let zα fulfill IP(ζ_{p−p0} ≥ 2zα) = α with ζ_{p−p0} ∼ χ²_{p−p0}. Then the LR test φ = 1(T ≥ zα) is of exact level α.
Proof. The null assumption θ∗ ∈ Θ0 implies f ∈ L0. This, together with Π0f = f, implies now the following decomposition:

    f̂ − f̂0 = (Π − Π0)ε,    ε̂ = ε − Πε = (In − Π)ε.

Next, Π − Π0 and In − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors (Π − Π0)ε and (In − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.16) is proved similarly to (7.9).
If the variance σ² of the noise is unknown, one can proceed exactly as in the case of a simple null: estimate the variance from the residuals, using their independence of the test statistic T. This leads to the estimate

    σ̂² = RSS/(n − p) = ‖Y − f̂‖²/(n − p)

and to the test statistic

    T̃ = (RSS0 − RSS)/(2 RSS/(n − p)) = (n − p)‖f̂ − f̂0‖²/(2‖Y − f̂‖²).
The property of pivotality is preserved here as well: properly scaled, the test statistic T̃ has a Fisher distribution.

Theorem 7.2.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then 2T̃/(p − p0) has the Fisher distribution F_{p−p0,n−p} with parameters p − p0 and n − p. If tα is the 1 − α quantile of this distribution, then the test φ = 1(2T̃ > (p − p0)tα) is of exact level α.
If the sample size is sufficiently large, one can proceed as if σ̂² were the true variance, ignoring the error of the variance estimation. This would lead to the critical value zα from Theorem 7.2.2, and the corresponding test is of asymptotic level α.
Exercise 7.2.1. Prove Theorem 7.2.3.
The study of the power of the test T̃ does not differ from the case of a simple hypothesis. One only needs to redefine ∆ as

    ∆ def= σ⁻²‖(Π − Π0)f‖².
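The geometry behind (7.15) and (7.16) rests on the fact that, when the rows of Ψγ span a subspace of the row span of Ψ, the difference Π − Π0 is itself an orthogonal projector of rank p − p0, orthogonal to In − Π. A numerical sketch (Python/numpy; the sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, p0 = 40, 4, 2
Psi = rng.standard_normal((p, n))
Psi_g = Psi[:p0]                       # rows corresponding to the gamma-components

def projector(M):
    """Orthogonal projector onto the span of the rows of M."""
    return M.T @ np.linalg.solve(M @ M.T, M)

Pi, Pi0 = projector(Psi), projector(Psi_g)
D = Pi - Pi0
# D is idempotent of rank p - p0 and orthogonal to the residual projector I - Pi
print(np.max(np.abs(D @ D - D)),
      round(float(np.trace(D))),
      np.max(np.abs(D @ (np.eye(n) - Pi))))
```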
Chapter 8
Some other testing methods
This chapter discusses some classical testing procedures such as the Kolmogorov-Smirnov,
Cramer-Smirnov-von Mises, and chi-squared as particular cases of the substitution ap-
proach.
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample from a distribution P . The joint distribu-
tion IP of Y is the n -fold product of P , so a hypothesis about IP can be formulated
as a hypothesis about the marginal measure P . A simple hypothesis H0 means the
assumption that P = P0 for a given measure P0 . The empirical measure Pn is a natu-
ral empirical counterpart of P leading to the idea of testing the hypothesis by checking
whether Pn significantly deviates from P0. As in the estimation problem, this substitution idea can be realized in several different ways. We briefly discuss below the method of moments and the minimum distance method.
8.1 Method of moments for an i.i.d. sample
Let g(·) be any d -vector function on IR1 . The assumption P = P0 leads to the
population moment
m0 = IE0g(Y1).
The empirical counterpart of this quantity is given by
Mn = IEng(Y ) =1
n
∑i
g(Yi).
The method of moments (MM) suggests considering the difference Mn − m0 for building a reasonable test. The properties of Mn were stated in Section 2.4. In particular, under the null P = P0, the first two moments of the vector Mn − m0 can be easily computed:
IE0(Mn − m0) = 0 and

    Var0(Mn) = IE0[(Mn − m0)(Mn − m0)>] = n⁻¹V,
    V def= IE0[(g(Y1) − m0)(g(Y1) − m0)>].
For simplicity of presentation, we assume that the moment function g is selected to ensure a non-degenerate matrix V. Standardization by the covariance matrix leads to the vector

    ξn = n^{1/2} V^{−1/2}(Mn − m0),

which under the null measure has zero mean and unit covariance matrix. Moreover, it is asymptotically standard normal, i.e. its distribution is approximately standard normal if the sample size n is sufficiently large; see Theorem 2.4.9. The MM test rejects the null hypothesis if the vector ξn computed from the available data Y is very unlikely for a standard normal vector, that is, if it deviates significantly from zero. We specify the procedure separately for the univariate and multivariate cases.
Univariate case. Let g(·) be a univariate function with IE0 g(Y1) = m0 and IE0[g(Y1) − m0]² = σ². Define the linear test statistic

    Tn = (nσ²)^{−1/2} Σi [g(Yi) − m0] = n^{1/2} σ^{−1}(Mn − m0),

leading to the test

    φ = 1(|Tn| > z_{α/2}),    (8.1)

where z_{α/2} denotes the corresponding upper quantile of the standard normal law.
Theorem 8.1.1. Let Y be an i.i.d. sample from P. Then the test statistic Tn is asymptotically standard normal, and the test φ from (8.1) of H0 : P = P0 is of asymptotic level α, that is,

    IP0(φ = 1) → α,    n → ∞.
Similarly, one can consider a one-sided alternative H+1 : m > m0 or H−1 : m < m0 about the moment m = IE g(Y1) of the distribution P, using the tests

    φ+ = 1(Tn > zα),    φ− = 1(Tn < −zα).

As in Theorem 8.1.1, both tests φ+ and φ− are of asymptotic level α.
Multivariate case. The components of the vector function g(·) ∈ IRd are usually associated with “directions” in which the null hypothesis is tested. The multivariate situation means that we test simultaneously in d > 1 directions. The most natural test statistic is the squared Euclidean norm of the standardized vector ξn:

    Tn def= ‖ξn‖² = n‖V^{−1/2}(Mn − m0)‖².    (8.2)

By Theorem 2.4.9, the vector ξn is asymptotically standard normal, so that Tn is asymptotically chi-squared with d degrees of freedom. This yields the natural definition of the test φ using quantiles of χ²d:

    φ = 1(Tn > zα).    (8.3)
Theorem 8.1.2. Let Y be an i.i.d. sample from P. If zα fulfills IP(χ²d > zα) = α, then the test statistic Tn from (8.2) is asymptotically χ²d-distributed, and the test φ from (8.3) of H0 : P = P0 is of asymptotic level α.
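The multivariate MM test can be sketched as follows (Python/numpy; the function name and the choice of directions are ours). As an illustration, we test H0 : P = N(0, 1) in the two directions g(y) = (y, y² − 1), for which m0 = 0 and V = diag(1, 2) under the null.

```python
import numpy as np

def mm_test_stat(y, g, m0, V):
    """T_n = n * ||V^{-1/2}(M_n - m0)||^2 from (8.2), written via a linear solve."""
    n = len(y)
    Mn = np.mean(g(y), axis=0)
    d = Mn - m0
    return n * d @ np.linalg.solve(V, d)

# directions for testing H0: P = N(0,1); Cov(Y, Y^2 - 1) = E Y^3 = 0 under the null
g = lambda y: np.column_stack([y, y ** 2 - 1])
m0 = np.zeros(2)
V = np.diag([1.0, 2.0])

rng = np.random.default_rng(5)
Tn = mm_test_stat(rng.standard_normal(200), g, m0, V)
```

Under the null, Tn is approximately χ²₂, so H0 would be rejected when Tn exceeds the corresponding χ²₂ quantile.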
8.1.1 Series expansion
A standard method of building the moment tests or, alternatively, of choosing the directions g(·), is based on a series expansion. Let ψ1, ψ2, ... be a given set of basis functions in the related functional space. It is especially useful to select these basis functions to be orthonormal under the measure P0:

    ∫ψj(y)P0(dy) = 0,    ∫ψj(y)ψj′(y)P0(dy) = δj,j′,    ∀j, j′.    (8.4)
Select a fixed index d and take the first d basis functions ψ1, ..., ψd as “directions” or components of g. Then

    mj,0 def= ∫ψj(y)P0(dy) = 0

is the j-th population moment under the null hypothesis H0, and it is tested by checking whether the empirical moments Mj,n with

    Mj,n def= n⁻¹ Σi ψj(Yi)

do not deviate significantly from zero. The condition (8.4) effectively permits testing each direction ψj independently of the others.
For each d one obtains a test statistic Tn,d with

    Tn,d def= n(M²1,n + ... + M²d,n)
leading to the test

    φd = 1(Tn,d > zα,d),    (8.5)

where zα,d is the upper α-quantile of χ²d. In practical applications, the choice of d is particularly relevant and is the subject of various studies.
8.1.2 Chi-squared test
A special but popular case of the previous series approach is the chi-squared test. Let the observation space (which is a subset of IR1) be split into non-overlapping subsets A1, ..., Ad. Define for j = 1, ..., d

    ψj(y) = σj⁻¹[1(y ∈ Aj) − pj]    (8.6)

with

    pj = P0(Aj) = ∫_{Aj} P0(dy),    σ²j = pj(1 − pj).    (8.7)
Then the conditions (8.4) are fulfilled.
Exercise 8.1.1. Check the conditions (8.4) for the functions ψj from (8.6) and (8.7).
The frequencies

    νj,n = n⁻¹ Σi 1(Yi ∈ Aj)

are the empirical counterparts of the probabilities pj. The related test statistic Tn,d with

    Tn,d = Σ_{j=1}^d n(νj,n − pj)²/σ²j = Σ_{j=1}^d n(νj,n − pj)²/(pj(1 − pj))

is called the chi-squared test statistic, leading to the so-called chi-squared test (8.5).
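A minimal sketch of this statistic (Python/numpy; the function name, cells, and data are ours): the cells Aj are half-open intervals, νj,n are the cell frequencies, and the statistic vanishes exactly when the frequencies match the null probabilities.

```python
import numpy as np

def chi_squared_stat(y, cells, p):
    """T_{n,d} = sum_j n (nu_{j,n} - p_j)^2 / (p_j (1 - p_j)), Section 8.1.2.
    cells: list of half-open intervals [a, b) playing the role of the A_j;
    p: null probabilities P0(A_j)."""
    n = len(y)
    nu = np.array([np.mean((a <= y) & (y < b)) for a, b in cells])
    return np.sum(n * (nu - p) ** 2 / (p * (1 - p)))

# four equal cells for a null P0 = uniform on [0, 1)
cells = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
p = np.full(4, 0.25)
Tn = chi_squared_stat(np.array([0.1, 0.2, 0.3, 0.9]), cells, p)
```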
8.1.3 Testing a parametric hypothesis
The method of moments can be extended to the situation when the null hypothesis is parametric: H0 : P ∈ (Pθ, θ ∈ Θ0). It is natural to apply the method of moments both to estimate the parameter θ under the null and to test the null. So, we assume that two different moment vector functions g0 and g1 are given. The first one is selected to fulfill

    θ ≡ IEθ g0(Y1),    θ ∈ Θ0.

This permits estimating the parameter θ directly by the empirical moment:

    θ̂ = n⁻¹ Σi g0(Yi).

The second vector of moment functions is composed of directional alternatives. An identifiability condition suggests selecting the directional alternative functions orthogonal to g0: (to be continued)
8.2 Minimum distance method for an i.i.d. sample
The method of moments is especially useful for the case of a simple hypothesis because
it compares the population moments computed under the null with their empirical coun-
terpart. However, if a more complicated composite hypothesis is tested, the population
moments can not be computed directly: the null measure is not specified precisely. In
this case, the minimum distance idea appears to be useful. Let (Pθ,θ ∈ Θ ⊂ IRp)
be a parametric family and Θ0 be a subset of Θ . The null hypothesis about an i.i.d.
sample Y from P is that P ∈ (Pθ,θ ∈ Θ0) . Let ρ(P, P ′) denote some functional
(distance) defined for measures P, P ′ on the real line. We assume that ρ satisfies the
following conditions: ρ(Pθ1 , Pθ2) ≥ 0 and ρ(Pθ1 , Pθ2) = 0 iff θ1 = θ2 . The condition
P ∈ (Pθ,θ ∈ Θ0) can be rewritten in the form
inf_{θ∈Θ_0} ρ(P, P_θ) = 0 .
Now we can apply the substitution principle: use Pn in place of P . Define the value T
by
T def= inf_{θ∈Θ_0} ρ(P_n, P_θ) .  (8.8)
Large values of the test statistic T indicate a possible violation of the null hypothesis.
In particular, if H0 is a simple hypothesis, that is, if the set Θ0 consists of one point
θ0 , the test statistic reads as T = ρ(Pn, Pθ0) . The critical value for this test is usually
selected by the level condition:
IP_{θ_0}( ρ(P_n, P_{θ_0}) > t_α ) ≤ α .
Note that the test statistic (8.8) can be viewed as a combination of two different steps.
First we estimate under the null the parameter θ ∈ Θ0 which provides the best possible
parametric fit under the assumption P ∈ (Pθ,θ ∈ Θ0) :
θ̃_0 = arginf_{θ∈Θ_0} ρ(P_n, P_θ) .
Next we formally apply the minimum distance test for the simple hypothesis θ_0 = θ̃_0 .
Below we discuss some standard choices of the distance ρ .
8.2.1 Kolmogorov-Smirnov test
Let P0, P1 be two distributions on the real line with distribution functions F0, F1 :
Fj(y) = Pj(Y ≤ y) for j = 0, 1 . Define
ρ(P_0, P_1) = ρ(F_0, F_1) = sup_y | F_0(y) − F_1(y) | .  (8.9)
Now consider the related test starting from the case of a simple null hypothesis P = P0
with corresponding c.d.f. F0 . Then the distance ρ from (8.9) (properly scaled) leads to
the Kolmogorov-Smirnov test statistic
T_n def= sup_y n^{1/2} | F_0(y) − F_n(y) | .
A nice feature of this test is the property of asymptotic pivotality.
Theorem 8.2.1 (Kolmogorov). Let F0 be a continuous c.d.f. Then
T_n = sup_y n^{1/2} | F_0(y) − F_n(y) | w−→ η
where η is a fixed random variable (the maximum of the absolute value of a Brownian bridge on [0, 1] ).
Proof. Idea of the proof. The c.d.f. F_0 is monotone and continuous. Therefore, its inverse function F_0^{-1} is well defined. Consider the r.v.'s
U_i def= F_0(Y_i) .
The basic fact about this transformation is that the Ui ’s are i.i.d. uniform on the interval
[0, 1] .
Lemma 8.2.2. The r.v.'s U_i are i.i.d. with values in [0, 1] and for any u ∈ [0, 1] it holds
IP( U_i ≤ u ) = u .
By definition of F_0^{-1} , it holds for any u ∈ [0, 1]
F_0( F_0^{-1}(u) ) ≡ u .
Moreover, if G_n is the empirical c.d.f. of the U_i 's, that is, if
G_n(u) def= n^{-1} ∑_i 1(U_i ≤ u) ,
then
G_n(u) ≡ F_n[ F_0^{-1}(u) ] .  (8.10)
Exercise 8.2.1. Check Lemma 8.2.2 and (8.10).
Now by the change of variable y = F_0^{-1}(u) we obtain
T_n = sup_{u∈[0,1]} n^{1/2} | F_0(F_0^{-1}(u)) − F_n(F_0^{-1}(u)) | = sup_{u∈[0,1]} n^{1/2} | u − G_n(u) | .
It is obvious that the right-hand side of this expression does not depend on the original model. Actually, for fixed n it is a precisely described random variable, so its distribution only depends on n . It only remains to show that this distribution for large n is close to some fixed limit distribution with a continuous c.d.f. allowing for a choice of a proper critical value. We indicate the main steps of the proof.
Given a sample U_1, . . . , U_n , define the random function
ξ_n(u) def= n^{1/2} [ u − G_n(u) ] .
Clearly T_n = sup_{u∈[0,1]} |ξ_n(u)| . Next, the convergence of the random functions ξ_n(·) would imply the convergence of their maximum over u ∈ [0, 1] because the maximum is a continuous functional of a function. Finally, the weak convergence ξ_n(·) w−→ ξ(·) can be checked if for any continuous function h(u) , it holds
⟨ξ_n, h⟩ def= n^{1/2} ∫_0^1 h(u) [ u − G_n(u) ] du w−→ ⟨ξ, h⟩ def= ∫_0^1 h(u) ξ(u) du .
Now the result can be derived from the representation
⟨ξ_n, h⟩ = n^{1/2} ∫_0^1 h(u) [ u − G_n(u) ] du = n^{-1/2} ∑_{i=1}^{n} [ m(h) − H(U_i) ]
with H(t) = ∫_t^1 h(u) du and m(h) = IE H(U_i) = ∫_0^1 u h(u) du , and from the central limit theorem for a sum of i.i.d. random variables.
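The pivotality property can be probed numerically. The sketch below (an illustration, not part of the text) computes T_n = sup_y n^{1/2} |F_0(y) − F_n(y)| via the order statistics of U_i = F_0(Y_i), using the fact that the supremum is attained at a jump of F_n; the two sampled models are arbitrary choices.

```python
import numpy as np

def ks_statistic(y, F0):
    """T_n = sup_y n^{1/2} |F0(y) - F_n(y)|, evaluated at the jumps of F_n."""
    n = len(y)
    u = np.sort(F0(np.asarray(y)))        # U_(1) <= ... <= U_(n), uniform under H0
    i = np.arange(1, n + 1)
    # At U_(i) the empirical c.d.f. jumps from (i-1)/n to i/n.
    d = np.maximum(u - (i - 1) / n, i / n - u).max()
    return np.sqrt(n) * d

# Pivotality: the null distribution of T_n is the same for very different F0.
rng = np.random.default_rng(1)
t_exp = ks_statistic(rng.exponential(size=500), lambda y: 1.0 - np.exp(-y))
t_uni = ks_statistic(rng.uniform(size=500), lambda y: np.clip(y, 0.0, 1.0))
```

Both statistics are draws from the same (Kolmogorov) null law, so their typical magnitudes agree despite the different models.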
The case of a composite hypothesis. If H_0 : P ∈ (P_θ, θ ∈ Θ_0) , then the test statistic is described by (8.8). As we already mentioned, the case of a composite hypothesis can be viewed (to be continued)
8.2.2 ω² test (Cramer-Smirnov-von Mises)
Here we briefly discuss another distance also based on the c.d.f. of the null measure.
Namely, define for a measure P on the real line with c.d.f. F
ρ(P_n, P) = ρ(F_n, F) = n ∫ [ F_n(y) − F(y) ]² dF(y) .  (8.11)
For the case of a simple hypothesis P = P0 , the Cramer-Smirnov-von Mises (CSvM)
test statistic is given by (8.11) with F = F_0 . This is another functional of the path of the random function n^{1/2} [ F_n(y) − F_0(y) ] . The Kolmogorov test uses the maximum of this function while the CSvM test uses the integral of its square. The property of pivotality is preserved for the CSvM test statistic as well.
Theorem 8.2.3. Let F_0 be a continuous c.d.f. Then
T_n = n ∫ [ F_n(y) − F_0(y) ]² dF_0(y) w−→ η_ω
where η_ω is a fixed random variable (the integral of a Brownian bridge squared over [0, 1] ).
Proof. The idea of the proof is the same as in the case of the Kolmogorov-Smirnov test. First the transformation by F_0^{-1} translates the general case to the case of the uniform distribution on [0, 1] . Next one can again use the functional convergence of the process ξ_n(u) .
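For computation one does not need to evaluate the integral (8.11) directly: after the transformation u_(i) = F_0(Y_(i)), the statistic has the classical closed form T_n = 1/(12n) + ∑_i ( u_(i) − (2i−1)/(2n) )². A sketch (with an arbitrary uniform example) and a brute-force check of the identity:

```python
import numpy as np

def cvm_statistic(y, F0):
    """omega^2 statistic n * int (F_n - F0)^2 dF0 via its order-statistic form."""
    n = len(y)
    u = np.sort(F0(np.asarray(y)))
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + float(np.sum((u - (2 * i - 1) / (2 * n)) ** 2))

# Check against a direct Riemann approximation for a Uniform[0,1] null
rng = np.random.default_rng(2)
y = rng.uniform(size=200)
F0 = lambda t: np.clip(t, 0.0, 1.0)
T = cvm_statistic(y, F0)
grid = np.linspace(0.0, 1.0, 200_001)
Fn = np.searchsorted(np.sort(y), grid, side="right") / len(y)
T_direct = len(y) * float(np.mean((Fn - grid) ** 2))
```

The two computations agree up to the discretization error of the grid.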
8.3 Partially Bayes tests and Bayes testing
In the above sections we mostly focused on the likelihood ratio testing approach. As in estimation theory, the LR approach is very general and possesses some nice properties. This section briefly discusses some possible alternatives including the quasi likelihood ratio, partially Bayes, and Bayes approaches.
8.3.1 Quasi LR approach
The structure of the LR test statistic T from (6.3) only uses the geometric properties of the likelihood function L(θ) . The only point where the underlying data distribution is called for is the level condition: this condition must be checked under the null hypothesis about the data. Now we consider the situation when L(θ) is a quasi log-likelihood. See Section 2.10 for examples and details. Then the test statistic is still defined by (6.3), leading to the test 1(T > z) . The first question to be addressed in this situation is what the null hypothesis means. In the classical parametric approach it is the hypothesis about the underlying data distribution which is described by the likelihood function L(θ) . Now we consider the situation when the parametric assumption is possibly misspecified and
the process L(θ) is only a quasi log-likelihood. The answer depends on this type of
misspecification. (to be continued)
8.3.2 Partial Bayes approach and Bayes tests
Let Θ0 and Θ1 be two subsets of the parameter set Θ . We test the null hypothesis
H0 : θ∗ ∈ Θ0 against an alternative H1 : θ∗ ∈ Θ1 . The LR approach compares the
maximum of the likelihood process over Θ_0 with the similar maximum over Θ_1 . Let now two measures π_0 on Θ_0 and π_1 on Θ_1 be given. Instead of the maximum of L(θ) we consider its weighted sum (integral) over Θ_0 (resp. Θ_1 ) with weights π_0(θ) (resp. π_1(θ) ). More precisely, we consider the value
T_{π_0,π_1} = ∫_{Θ_1} L(θ) π_1(θ) λ(dθ) − ∫_{Θ_0} L(θ) π_0(θ) λ(dθ) .
Significantly positive values of this expression indicate that the hypothesis is likely wrong.
8.3.3 Bayes approach
Within the Bayes approach the true data distribution and the true parameter value are
not defined. Instead one considers the prior and posterior distribution of the parameter.
The parametric Bayes model can be represented as
Y | θ ∼ p(y|θ), θ ∼ π(θ).
The posterior density p(θ|Y) can be computed via the Bayes formula:
p(θ|Y) = p(Y|θ) π(θ) / p(Y)
with the marginal density p(Y) = ∫_Θ p(Y|θ) π(θ) λ(dθ) . Instead of checking a hypothesis about the location of the parameter θ , the Bayes approach suggests looking directly at the posterior distribution. Namely, one can construct the so-called credible sets which contain a prespecified fraction, say 1 − α , of the mass of the whole posterior distribution.
Then one can say that the probability for the parameter θ to lie outside of this credible
set is at most α . So, the testing problem in the frequentist approach is replaced by the
problem of confidence estimation for the Bayes method. (to be continued)
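As a minimal sketch of this program (the Beta prior and data below are invented for illustration), consider the conjugate Bernoulli model, where the posterior is available in closed form and a credible interval is read off from posterior quantiles:

```python
import numpy as np

# Conjugate model: theta ~ Beta(a, b) prior, Y_i | theta ~ Bernoulli(theta);
# the posterior is Beta(a + s, b + n - s) with s = number of successes.
def beta_credible_interval(y, a=1.0, b=1.0, alpha=0.05, draws=200_000, seed=0):
    rng = np.random.default_rng(seed)
    s, n = int(np.sum(y)), len(y)
    post = rng.beta(a + s, b + n - s, size=draws)   # Monte Carlo posterior draws
    lo = float(np.quantile(post, alpha / 2))
    hi = float(np.quantile(post, 1 - alpha / 2))
    return lo, hi

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])        # s = 7 successes out of n = 10
lo, hi = beta_credible_interval(y)                   # 95% credible interval
```

The interval (lo, hi) carries 95% of the posterior mass, so the posterior probability that θ lies outside it is at most α = 0.05.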
Chapter 9
Deviation probability for quadratic forms
The approximation results of the previous sections rely on probabilities of the form IP( ‖ξ‖ > y ) for a given random vector ξ ∈ IR^p . The only condition imposed on this vector is that
log IE exp( γ⊤ξ ) ≤ ν_0² ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ g .
To simplify the presentation we rewrite this condition as
log IE exp( γ⊤ξ ) ≤ ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ g .  (9.1)
The general case can be reduced to ν_0 = 1 by rescaling ξ and g :
log IE exp( γ⊤ξ / ν_0 ) ≤ ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ ν_0 g ,
that is, ν_0^{-1} ξ fulfills (9.1) with a slightly increased g . In typical situations like in Section ??, the value g is large (of order root-n ) while the value ν_0 is close to one.
9.1 Gaussian case
Our benchmark will be a deviation bound for ‖ξ‖2 for a standard Gaussian vector ξ .
The ultimate goal is to show that under (9.1) the norm of the vector ξ exhibits behavior
expected for a Gaussian vector, at least in the region of moderate deviations. For comparison, we begin by stating the result for a Gaussian vector ξ .
Theorem 9.1.1. Let ξ be a standard normal vector in IR^p . Then for any u > 0 , it holds
IP( ‖ξ‖² > p + u ) ≤ exp{ −(p/2) φ(u/p) }
with
φ(t) def= t − log(1 + t) .
Let φ^{-1}(·) stand for the inverse of φ(·) . For any x ,
IP( ‖ξ‖² > p + p φ^{-1}(2x/p) ) ≤ exp(−x) .
This particularly yields with κ = 6.6
IP( ‖ξ‖² > p + √(κxp) ∨ (κx) ) ≤ exp(−x) .
Proof. The proof utilizes the following well-known fact: for µ < 1
log IE exp( µ‖ξ‖²/2 ) = −0.5 p log(1 − µ) .
It can be obtained by straightforward calculus. Now consider any u > 0 . By the exponential Chebyshev inequality
IP( ‖ξ‖² > p + u ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖ξ‖²/2 )  (9.2)
= exp{ −µ(p + u)/2 − (p/2) log(1 − µ) } .
It is easy to see that the value µ = u/(u + p) maximizes µ(p + u) + p log(1 − µ) w.r.t. µ yielding
µ(p + u) + p log(1 − µ) = u − p log(1 + u/p) .
Further we use that x − log(1 + x) ≥ a_0 x² for x ≤ 1 and x − log(1 + x) ≥ a_0 x for x > 1 with a_0 = 1 − log(2) ≥ 0.3 . This implies with x = u/p for u = √(κxp) or u = κx and κ = 2/a_0 < 6.6 that
IP( ‖ξ‖² ≥ p + √(κxp) ∨ (κx) ) ≤ exp(−x)
as required.
The message of this result is that the squared norm of the Gaussian vector ξ concentrates around the value p and the deviations over the level p + √(xp) are exponentially small in x .
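The bound of Theorem 9.1.1 is easy to probe by simulation; the sketch below (dimension and level are arbitrary choices, not from the text) checks that the empirical exceedance frequency stays below e^{−x}:

```python
import numpy as np

# Monte Carlo check of IP(||xi||^2 > p + sqrt(kappa*x*p) v (kappa*x)) <= exp(-x)
# for a standard Gaussian vector xi in IR^p, with kappa = 6.6.
rng = np.random.default_rng(0)
p, x, kappa = 10, 1.0, 6.6
threshold = p + max(np.sqrt(kappa * x * p), kappa * x)
xi = rng.standard_normal((50_000, p))
freq = float(np.mean(np.sum(xi ** 2, axis=1) > threshold))
```

The empirical frequency is in fact well below the bound e^{−1} ≈ 0.37, reflecting the slack introduced by the constant κ.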
A similar bound can be obtained for the norm of the vector IBξ where IB is some given matrix. For notational simplicity we assume that IB is symmetric. Otherwise one should replace it with (IB⊤IB)^{1/2} .
Theorem 9.1.2. Let ξ be standard normal in IR^p . Then for every x > 0 and any symmetric matrix IB , it holds with p = tr(IB²) , v² = 2 tr(IB⁴) , and a* = ‖IB²‖_∞
IP( ‖IBξ‖² > p + (2v x^{1/2}) ∨ (6 a* x) ) ≤ exp(−x) .
Proof. The matrix IB² can be represented as U⊤ diag(a_1, . . . , a_p) U for an orthogonal matrix U . The vector ξ̃ = Uξ is also standard normal and ‖IBξ‖² = ξ̃⊤ U IB² U⊤ ξ̃ . This means that one can reduce the situation to the case of a diagonal matrix IB² = diag(a_1, . . . , a_p) . We can also assume without loss of generality that a_1 ≥ a_2 ≥ . . . ≥ a_p . The expressions for the quantities p and v² simplify to
p = tr(IB²) = a_1 + . . . + a_p ,
v² = 2 tr(IB⁴) = 2 (a_1² + . . . + a_p²) .
Moreover, rescaling the matrix IB2 by a1 reduces the situation to the case with a1 = 1 .
Lemma 9.1.3. It holds
IE ‖IBξ‖² = tr(IB²) ,   Var( ‖IBξ‖² ) = 2 tr(IB⁴) .
Moreover, for µ < 1
IE exp{ µ‖IBξ‖²/2 } = det( II_p − µ IB² )^{-1/2} = ∏_{i=1}^{p} (1 − µ a_i)^{-1/2} .  (9.3)
Proof. If IB² is diagonal, then ‖IBξ‖² = ∑_i a_i ξ_i² and the summands a_i ξ_i² are independent. It remains to note that IE(a_i ξ_i²) = a_i , Var(a_i ξ_i²) = 2 a_i² , and for µ a_i < 1 ,
IE exp{ µ a_i ξ_i² / 2 } = (1 − µ a_i)^{-1/2}
yielding (9.3).
Given u , fix µ < 1 . The exponential Markov inequality yields
IP( ‖IBξ‖² > p + u ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖IBξ‖²/2 )
≤ exp{ −µu/2 − (1/2) ∑_{i=1}^{p} [ µ a_i + log(1 − µ a_i) ] } .
We start with the case when x^{1/2} ≤ v/3 . Then u = 2 x^{1/2} v fulfills u ≤ 2v²/3 . Define µ = u/v² ≤ 2/3 and use that t + log(1 − t) ≥ −t² for t ≤ 2/3 . This implies
IP( ‖IBξ‖² > p + u ) ≤ exp{ −µu/2 + (1/2) ∑_{i=1}^{p} µ² a_i² } = exp( −u²/(4v²) ) = e^{−x} .  (9.4)
Next, let x^{1/2} > v/3 . Set µ = 2/3 . It holds similarly to the above
∑_{i=1}^{p} [ µ a_i + log(1 − µ a_i) ] ≥ − ∑_{i=1}^{p} µ² a_i² = −2v²/9 ≥ −2x .
Now, for u = 6x and µu/2 = 2x , (9.4) implies
IP( ‖IBξ‖² > p + u ) ≤ exp{ −(2x − x) } = exp(−x)
as required.
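Theorem 9.1.2 can be checked by simulation in the same way; the diagonal matrix below is an arbitrary illustrative choice with a* = 1:

```python
import numpy as np

# Monte Carlo check of IP(||B xi||^2 > p + (2 v sqrt(x)) v (6 a* x)) <= exp(-x)
# for a diagonal B^2 = diag(a_1, ..., a_p) with a* = a_1 = 1.
rng = np.random.default_rng(1)
a = np.array([1.0, 0.8, 0.6, 0.4, 0.2])         # eigenvalues of B^2
pB = float(a.sum())                              # p  = tr(B^2)
v = float(np.sqrt(2 * np.sum(a ** 2)))           # v^2 = 2 tr(B^4)
x = 1.0
threshold = pB + max(2 * v * np.sqrt(x), 6 * 1.0 * x)
q = (rng.standard_normal((50_000, a.size)) ** 2) @ a   # ||B xi||^2 = sum a_i xi_i^2
freq = float(np.mean(q > threshold))
```

Again the empirical frequency lies comfortably below the exponential bound e^{−x}.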
Below we establish similar bounds for a non-Gaussian vector ξ obeying (9.1).
9.2 A bound for the ℓ2-norm
This section presents a general exponential bound for the probability IP( ‖ξ‖ > y ) under (9.1). Given g and p , define the value w_0 = g p^{-1/2} and define w_c by the equation
w_c (1 + w_c) / (1 + w_c²)^{1/2} = w_0 = g p^{-1/2} .  (9.5)
It is easy to see that w_0/√2 ≤ w_c ≤ w_0 . Further define
µ_c def= w_c² / (1 + w_c²) ,   y_c def= √( (1 + w_c²) p ) ,   x_c def= 0.5 p [ w_c² − log(1 + w_c²) ] .  (9.6)
Note that for g² ≥ p , the quantities y_c and x_c can be evaluated as y_c² ≥ w_c² p ≥ g²/2 and x_c ≳ p w_c²/2 ≥ g²/4 .
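The quantities (9.5)-(9.6) are easy to compute numerically: the left-hand side of (9.5) is increasing in w, so plain bisection suffices. A sketch with illustrative values of g and p (not from the text):

```python
import numpy as np

def critical_values(g, p):
    """Solve (9.5) for w_c by bisection; return (w_c, mu_c, y_c, x_c) as in (9.6)."""
    w0 = g / np.sqrt(p)
    f = lambda w: w * (1 + w) / np.sqrt(1 + w ** 2) - w0   # monotone in w
    lo, hi = 0.0, w0                                       # root lies in [w0/sqrt(2), w0]
    for _ in range(200):                                   # bisection to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    wc = 0.5 * (lo + hi)
    mu_c = wc ** 2 / (1 + wc ** 2)
    yc = np.sqrt((1 + wc ** 2) * p)
    xc = 0.5 * p * (wc ** 2 - np.log(1 + wc ** 2))
    return wc, mu_c, yc, xc

wc, mu_c, yc, xc = critical_values(g=20.0, p=10.0)
```

The returned values satisfy the stated relations, for instance w_0/√2 ≤ w_c ≤ w_0 and y_c² ≥ g²/2.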
Theorem 9.2.1. Let ξ ∈ IR^p fulfill (9.1). Then it holds for each x ≤ x_c
IP( ‖ξ‖² > p + √(κxp) ∨ (κx), ‖ξ‖ ≤ y_c ) ≤ 2 exp(−x) ,
where κ = 6.6 . Moreover, for y ≥ y_c , it holds with g_c = g − √(µ_c p) = g w_c/(1 + w_c)
IP( ‖ξ‖ > y ) ≤ 8.4 exp{ −g_c y/2 − (p/2) log(1 − g_c/y) } ≤ 8.4 exp{ −x_c − g_c (y − y_c)/2 } .
Proof. The main step of the proof is the following exponential bound.
Lemma 9.2.2. Suppose (9.1). For any µ < 1 with g² > pµ , it holds
IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 (1 − µ)^{-p/2} .  (9.7)
Proof. Let ε be a standard normal vector in IR^p . The bound IP( ‖ε‖² > p ) ≤ 1/2 implies for any vector u and any r with r ≥ ‖u‖ + p^{1/2} that IP( ‖u + ε‖ ≤ r ) ≥ 1/2 . Let us fix some ξ with ‖ξ‖ ≤ g/µ − √(p/µ) and denote by IP_ξ the conditional probability given ξ . It holds with c_p = (2π)^{-p/2}
c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
= c_p exp( µ‖ξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{-1/2}γ − µ^{1/2}ξ ‖² ) 1I( µ^{-1/2}‖γ‖ ≤ µ^{-1/2}g ) dγ
= µ^{p/2} exp( µ‖ξ‖²/2 ) IP_ξ( ‖ε + µ^{1/2}ξ‖ ≤ µ^{-1/2}g ) ≥ 0.5 µ^{p/2} exp( µ‖ξ‖²/2 ) ,
because ‖µ^{1/2}ξ‖ + p^{1/2} ≤ µ^{-1/2}g . This implies in view of p < g²/µ that
exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 µ^{-p/2} c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ .
Further, by (9.1)
c_p IE ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) dγ = (µ^{-1} − 1)^{-p/2}
and (9.7) follows.
Due to this result, the scaled squared norm µ‖ξ‖²/2 after a proper truncation possesses the same exponential moments as in the Gaussian case. A straightforward implication is the probability bound for IP( ‖ξ‖² > p + u ) for moderate values u . Namely, given u > 0 , define µ = u/(u + p) . This value optimizes the inequality (9.2) in the Gaussian case. Now we can apply a similar bound under the constraint ‖ξ‖ ≤ g/µ − √(p/µ) . Therefore, the bound is only meaningful if √(u + p) ≤ g/µ − √(p/µ) with µ = u/(u + p) , or, with w = √(u/p) , if w ≤ w_c ; see (9.5). The largest value u for which this constraint is still valid is given by p + u = y_c² .
Hence, (9.7) yields for p + u ≤ y_c²
IP( ‖ξ‖² > p + u, ‖ξ‖ ≤ y_c ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) )
≤ 2 exp{ −0.5 [ µ(p + u) + p log(1 − µ) ] } = 2 exp{ −0.5 [ u − p log(1 + u/p) ] } .
Similarly to the Gaussian case, this implies with κ = 6.6 that
IP( ‖ξ‖² ≥ p + √(κxp) ∨ (κx), ‖ξ‖ ≤ y_c ) ≤ 2 exp(−x) .
The Gaussian case means that (9.1) holds with g = ∞ yielding y_c = ∞ . In the non-Gaussian case with a finite g , we have to accompany the moderate deviation bound with a large deviation bound for IP( ‖ξ‖ > y ) for y ≥ y_c . This is done by combining the bound (9.7) with the standard slicing arguments.
Lemma 9.2.3. Let µ_0 ≤ g²/p . Define y_0 = g/µ_0 − √(p/µ_0) and g_0 = µ_0 y_0 = g − √(µ_0 p) . It holds for y ≥ y_0
IP( ‖ξ‖ > y ) ≤ 8.4 (1 − g_0/y)^{-p/2} exp( −g_0 y/2 )  (9.8)
≤ 8.4 exp{ −x_0 − g_0 (y − y_0)/2 } ,  (9.9)
with x_0 defined by
2 x_0 = µ_0 y_0² + p log(1 − µ_0) .
Proof. Consider the growing sequence y_k with y_1 = y and g_0 y_{k+1} = g_0 y + k . Define also µ_k = g_0/y_k . In particular, µ_k ≤ µ_1 = g_0/y . Obviously
IP( ‖ξ‖ > y ) = ∑_{k=1}^{∞} IP( ‖ξ‖ > y_k, ‖ξ‖ ≤ y_{k+1} ) .
Now we evaluate every slicing probability in this expression. We use that
µ_{k+1} y_k² = (g_0 y + k − 1)² / (g_0 y + k) ≥ g_0 y + k − 2 ,
and also g/µ_k − √(p/µ_k) ≥ y_k because g − g_0 = √(µ_0 p) > √(µ_k p) and
g/µ_k − √(p/µ_k) − y_k = µ_k^{-1} ( g − √(µ_k p) − g_0 ) ≥ 0 .
Hence by (9.7)
IP( ‖ξ‖ > y ) ≤ ∑_{k=1}^{∞} IP( ‖ξ‖ > y_k, ‖ξ‖ ≤ y_{k+1} )
≤ ∑_{k=1}^{∞} exp( −µ_{k+1} y_k²/2 ) IE exp( µ_{k+1} ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ y_{k+1} )
≤ ∑_{k=1}^{∞} 2 (1 − µ_{k+1})^{-p/2} exp( −µ_{k+1} y_k²/2 )
≤ 2 (1 − µ_1)^{-p/2} ∑_{k=1}^{∞} exp( −(g_0 y + k − 2)/2 )
= 2 e^{1/2} (1 − e^{-1/2})^{-1} (1 − µ_1)^{-p/2} exp( −g_0 y/2 )
≤ 8.4 (1 − µ_1)^{-p/2} exp( −g_0 y/2 )
and the first assertion follows. For y = y_0 , it holds
g_0 y_0 + p log(1 − µ_0) = µ_0 y_0² + p log(1 − µ_0) = 2 x_0
and (9.8) implies IP( ‖ξ‖ > y_0 ) ≤ 8.4 exp(−x_0) . Now observe that the function f(y) = g_0 y/2 + (p/2) log(1 − g_0/y) fulfills f(y_0) = x_0 and f′(y) ≥ g_0/2 yielding f(y) ≥ x_0 + g_0 (y − y_0)/2 . This implies (9.9).
The statements of the theorem are obtained by applying the lemmas with µ_0 = µ_c = w_c²/(1 + w_c²) . This also implies y_0 = y_c , x_0 = x_c , and g_0 = g_c = g − √(µ_c p) ; cf. (9.6).
The statements of Theorem 9.2.1 can be simplified under the assumption g² ≥ p .
Corollary 9.2.4. Let ξ fulfill (9.1) and g² ≥ p . Then it holds for x ≤ x_c
IP( ‖ξ‖² ≥ z(x, p) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,  (9.10)
z(x, p) def= p + √(κxp) for x ≤ p/κ , and z(x, p) def= p + κx for p/κ < x ≤ x_c ,  (9.11)
with κ = 6.6 . For x > x_c
IP( ‖ξ‖² ≥ z_c(x, p) ) ≤ 8.4 e^{−x} ,   z_c(x, p) def= | y_c + 2(x − x_c)/g_c |² .
This result implicitly assumes that p ≤ κ x_c , which is fulfilled if w_0² = g²/p ≥ 1 :
κ x_c = 0.5 κ [ w_0² − log(1 + w_0²) ] p ≥ 3.3 [ 1 − log(2) ] p > p .
In the zone x ≤ p/κ we obtain sub-Gaussian behavior of the tail of ‖ξ‖² − p , while in the zone p/κ < x ≤ x_c it becomes sub-exponential. Note that the sub-exponential zone is empty if g² < p .
For x ≤ x_c , the function z(x, p) mimics the quantile behavior of the chi-squared distribution χ²_p with p degrees of freedom. Moreover, an increase of the value g yields a growth of the sub-Gaussian zone. In particular, for g = ∞ , a general quadratic form ‖ξ‖² has under (9.1) the same tail behavior as in the Gaussian case.
Finally, in the large deviation zone x > x_c the deviation probability decays as e^{−c x^{1/2}} for some fixed c . However, if the constant g in the condition (9.1) is sufficiently large relative to p , then x_c is large as well and the large deviation zone x > x_c can be ignored at the small price of 8.4 e^{−x_c} , and one can focus on the deviation bound described by (9.10) and (9.11).
9.3 A bound for a quadratic form
Now we extend the result to a more general bound for ‖IBξ‖² = ξ⊤IB²ξ with a given matrix IB and a vector ξ obeying the condition (9.1). Similarly to the Gaussian case we assume that IB is symmetric. Define the important characteristics of IB
p = tr(IB²) ,   v² = 2 tr(IB⁴) ,   λ* def= ‖IB²‖_∞ def= λ_max(IB²) .
For simplicity of formulation we suppose that λ* = 1 ; otherwise one has to replace p and v² with p/λ* and v²/λ* .
Let g be the constant from condition (9.1). Define similarly to the ℓ2-case w_c by the equation
w_c (1 + w_c) / (1 + w_c²)^{1/2} = g p^{-1/2} .
Define also µ_c = w_c²/(1 + w_c²) ∧ 2/3 . Note that w_c² ≥ 2 implies µ_c = 2/3 . Further define
y_c² = (1 + w_c²) p ,   2 x_c = µ_c y_c² + log det{ II_p − µ_c IB² } .  (9.12)
Similarly to the case with IB = II_p , under the condition g² ≥ p , one can bound y_c² ≥ g²/2 and x_c ≳ g²/4 .
Theorem 9.3.1. Let a random vector ξ in IR^p fulfill (9.1). Then for each x < x_c
IP( ‖IBξ‖² > p + (2v x^{1/2}) ∨ (6x), ‖IBξ‖ ≤ y_c ) ≤ 2 exp(−x) .
Moreover, for y ≥ y_c , with g_c = g − √(µ_c p) = g w_c/(1 + w_c) , it holds
IP( ‖IBξ‖ > y ) ≤ 8.4 exp( −x_c − g_c (y − y_c)/2 ) .
Proof. The main steps of the proof are similar to the proof of Theorem 9.2.1.
Lemma 9.3.2. Suppose (9.1). For any µ < 1 with g²/µ ≥ p , it holds
IE exp( µ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 det( II_p − µ IB² )^{-1/2} .  (9.13)
Proof. With c_p(IB) = (2π)^{-p/2} det(IB^{-1}) , it holds
c_p(IB) ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
= c_p(IB) exp( µ‖IBξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{1/2} IBξ − µ^{-1/2} IB^{-1}γ ‖² ) 1I( ‖γ‖ ≤ g ) dγ
= µ^{p/2} exp( µ‖IBξ‖²/2 ) IP_ξ( ‖ µ^{-1/2} IBε + IB²ξ ‖ ≤ g/µ ) ,
where ε denotes a standard normal vector in IR^p and IP_ξ means the conditional probability given ξ . Moreover, for any u ∈ IR^p and r ≥ p^{1/2} + ‖u‖ , it holds in view of IP( ‖IBε‖² > p ) ≤ 1/2
IP( ‖IBε − u‖ ≤ r ) ≥ IP( ‖IBε‖ ≤ √p ) ≥ 1/2 .
This implies
exp( µ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ − √(p/µ) )
≤ 2 µ^{-p/2} c_p(IB) ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ .
Further, by (9.1)
c_p(IB) IE ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p(IB) ∫ exp( ‖γ‖²/2 − ‖IB^{-1}γ‖²/(2µ) ) dγ
≤ det(IB^{-1}) det( µ^{-1} IB^{-2} − II_p )^{-1/2} = µ^{p/2} det( II_p − µ IB² )^{-1/2}
and (9.13) follows.
Now we evaluate the probability IP( ‖IBξ‖ > y ) for moderate values of y .
Lemma 9.3.3. Let µ_0 < 1 ∧ (g²/p) . With y_0 = g/µ_0 − √(p/µ_0) , it holds for any u > 0
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ 2 exp{ −0.5 µ_0 (p + u) − 0.5 log det( II_p − µ_0 IB² ) } .  (9.14)
In particular, if IB² is diagonal, that is, IB² = diag(a_1, . . . , a_p) , then
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ 2 exp{ −µ_0 u/2 − (1/2) ∑_{i=1}^{p} [ µ_0 a_i + log(1 − µ_0 a_i) ] } .  (9.15)
Proof. The exponential Chebyshev inequality and (9.13) imply
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ exp{ −µ_0 (p + u)/2 } IE exp( µ_0 ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ_0 − √(p/µ_0) )
≤ 2 exp{ −0.5 µ_0 (p + u) − 0.5 log det( II_p − µ_0 IB² ) } .
Moreover, the standard change-of-basis arguments allow us to reduce the problem to the case of a diagonal matrix IB² = diag(a_1, . . . , a_p) where 1 = a_1 ≥ a_2 ≥ . . . ≥ a_p > 0 . Note that p = a_1 + . . . + a_p . Then the claim (9.14) can be written in the form (9.15).
Now we evaluate the large deviation probability that ‖IBξ‖ > y for a large y . Note that the condition ‖IB²‖_∞ ≤ 1 implies ‖IB²ξ‖ ≤ ‖IBξ‖ . So, the bound (9.14) continues to hold when ‖IB²ξ‖ ≤ y_0 is replaced by ‖IBξ‖ ≤ y_0 .
Lemma 9.3.4. Let µ_0 < 1 and µ_0 p < g² . Define g_0 = g − √(µ_0 p) . For any y ≥ y_0 def= g_0/µ_0 , it holds
IP( ‖IBξ‖ > y ) ≤ 8.4 det{ II_p − (g_0/y) IB² }^{-1/2} exp( −g_0 y/2 )
≤ 8.4 exp( −x_0 − g_0 (y − y_0)/2 ) ,  (9.16)
where x_0 is defined by
2 x_0 = g_0 y_0 + log det{ II_p − (g_0/y_0) IB² } .
Proof. The slicing arguments of Lemma 9.2.3 apply here in the same manner. One has to replace ‖ξ‖ by ‖IBξ‖ and (1 − µ_1)^{-p/2} by det{ II_p − (g_0/y) IB² }^{-1/2} . We omit the details. In particular, with y = y_0 = g_0/µ_0 , this yields
IP( ‖IBξ‖ > y_0 ) ≤ 8.4 exp(−x_0) .
Moreover, for the function f(y) = g_0 y + log det{ II_p − (g_0/y) IB² } , it holds f′(y) ≥ g_0 and hence f(y) ≥ f(y_0) + g_0 (y − y_0) for y > y_0 . This implies (9.16).
One important feature of the results of Lemma 9.3.3 and Lemma 9.3.4 is that the value µ_0 < 1 ∧ (g²/p) can be selected arbitrarily. In particular, for y ≥ y_c , Lemma 9.3.4 with µ_0 = µ_c yields the large deviation probability IP( ‖IBξ‖ > y ) . For bounding the probability IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) , we use the inequality log(1 − t) ≥ −t − t² for t ≤ 2/3 . It implies for µ ≤ 2/3 that
− log IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) ≥ µ(p + u) + ∑_{i=1}^{p} log(1 − µ a_i)
≥ µ(p + u) − ∑_{i=1}^{p} ( µ a_i + µ² a_i² ) ≥ µu − µ² v²/2 .  (9.17)
Now we distinguish between µ_c = 2/3 and µ_c < 2/3 , starting with µ_c = 2/3 . The bound (9.17) with µ = 2/3 and with u = (2v x^{1/2}) ∨ (6x) yields
IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) ≤ 2 exp(−x) ;
see the proof of Theorem 9.1.2 for the Gaussian case.
Now consider µ_c < 2/3 . For x^{1/2} ≤ µ_c v/2 , use u = 2v x^{1/2} and µ_0 = u/v² . It holds µ_0 = u/v² ≤ µ_c and u²/(4v²) = x yielding the desired bound by (9.17). For x^{1/2} > µ_c v/2 , we select again µ_0 = µ_c . It holds with u = 4 µ_c^{-1} x that µ_c u/2 − µ_c² v²/4 ≥ 2x − x = x . This completes the proof.
Now we describe the value z(x, IB) ensuring a small value for the large deviation probability IP( ‖IBξ‖² > z(x, IB) ) . For ease of formulation, we suppose that g² ≥ 2p yielding µ_c^{-1} ≤ 3/2 . The other case can be easily adjusted.
Corollary 9.3.5. Let ξ fulfill (9.1) with g² ≥ 2p . Then it holds for x ≤ x_c with x_c from (9.12):
IP( ‖IBξ‖² ≥ z(x, IB) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,
z(x, IB) def= p + 2v x^{1/2} for x ≤ v/18 , and z(x, IB) def= p + 6x for v/18 < x ≤ x_c .  (9.18)
For x > x_c
IP( ‖IBξ‖² ≥ z_c(x, IB) ) ≤ 8.4 e^{−x} ,   z_c(x, IB) def= | y_c + 2(x − x_c)/g_c |² .
9.4 Rescaling and regularity condition
The result of Theorem 9.3.1 can be extended to a more general situation when the condition (9.1) is fulfilled for a vector ζ rescaled by a matrix V_0 . More precisely, let the random p-vector ζ fulfill for some p × p matrix V_0 the condition
sup_{γ∈IR^p} log IE exp( λ γ⊤ζ / ‖V_0 γ‖ ) ≤ ν_0² λ²/2 ,   |λ| ≤ g ,  (9.19)
with some constants g > 0 , ν_0 ≥ 1 . Again, a simple change of variables reduces the case of an arbitrary ν_0 ≥ 1 to ν_0 = 1 . Our aim is to bound the squared norm ‖D_0^{-1}ζ‖² of the vector D_0^{-1}ζ for another p × p positive symmetric matrix D_0² . Note that condition (9.19) implies (9.1) for the rescaled vector ξ = V_0^{-1}ζ . This leads to bounding the quadratic form ‖D_0^{-1} V_0 ξ‖² = ‖IBξ‖² with IB² = D_0^{-1} V_0² D_0^{-1} . It obviously holds
p = tr(IB²) = tr( D_0^{-2} V_0² ) .
Now we can apply the result of Corollary 9.3.5.
Corollary 9.4.1. Let ζ fulfill (9.19) with some V_0 and g . Given D_0 , define IB² = D_0^{-1} V_0² D_0^{-1} , and let g² ≥ 2p . Then it holds for x ≤ x_c with x_c from (9.12):
IP( ‖D_0^{-1}ζ‖² ≥ z(x, IB) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,
with z(x, IB) from (9.18). For x > x_c
IP( ‖D_0^{-1}ζ‖² ≥ z_c(x, IB) ) ≤ 8.4 e^{−x} ,   z_c(x, IB) def= | y_c + 2(x − x_c)/g_c |² .
Finally we briefly discuss the regular case with D_0 ≥ a V_0 for some a > 0 . This implies ‖IB‖_∞ ≤ a^{-1} and
v² = 2 tr(IB⁴) ≤ 2 a^{-2} p .
9.5 A chi-squared bound with norm-constraints
This section extends the results to the case when the bound (9.1) involves a constraint on some norm of the vector γ other than the ℓ2-norm. Namely, we suppose that
log IE exp( γ⊤ξ ) ≤ ‖γ‖²/2 ,   γ ∈ IR^p , ‖γ‖◦ ≤ g◦ ,  (9.20)
where ‖·‖◦ is a norm which differs from the usual Euclidean norm. Our driving example is the sup-norm case with ‖γ‖◦ ≡ ‖γ‖_∞ . We are interested in checking whether the previous results of Section 9.2 still apply. The answer depends on how massive the set A(r) = { γ : ‖γ‖◦ ≤ r } is in terms of the standard Gaussian measure on IR^p . Recall that the squared norm ‖ε‖² of a standard Gaussian vector ε in IR^p concentrates around p , at least for p large. We need a similar concentration property for the norm ‖·‖◦ . More precisely, we assume for a fixed r◦ that
IP( ‖ε‖◦ ≤ r◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .  (9.21)
This implies for any value u◦ > 0 and all u ∈ IR^p with ‖u‖◦ ≤ u◦ that
IP( ‖ε − u‖◦ ≤ r◦ + u◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .
For each z > p , consider
µ(z) = (z − p)/z .
Given u◦ , denote by z◦ = z◦(u◦) the root of the equation
g◦/µ(z◦) − r◦/µ^{1/2}(z◦) = u◦ .  (9.22)
One can easily see that this value exists and is unique if u◦ ≥ g◦ − r◦ , and it can be defined as the largest z for which g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ . Let µ◦ = µ(z◦) be the corresponding µ-value. Define also x◦ by
2 x◦ = µ◦ z◦ + p log(1 − µ◦) .
If u◦ < g◦ − r◦ , then set z◦ = ∞ , x◦ = ∞ .
Theorem 9.5.1. Let a random vector ξ in IR^p fulfill (9.20). Suppose (9.21) and let, given u◦ , the value z◦ be defined by (9.22). Then it holds for any u > 0
IP( ‖ξ‖² > p + u, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −(p/2) φ(u/p) } ,  (9.23)
yielding for x ≤ x◦
IP( ‖ξ‖² > p + √(κxp) ∨ (κx), ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp(−x) ,  (9.24)
where κ = 6.6 . Moreover, for z ≥ z◦ , it holds
IP( ‖ξ‖² > z, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −µ◦ z/2 − (p/2) log(1 − µ◦) } = 2 exp{ −x◦ − µ◦ (z − z◦)/2 } .
Proof. The arguments behind the result are the same as in the one-norm case of Theorem 9.2.1. We only outline the main steps.
Lemma 9.5.2. Suppose (9.20) and (9.21). For any µ < 1 with g◦ > µ^{1/2} r◦ , it holds
IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖◦ ≤ g◦/µ − r◦/µ^{1/2} ) ≤ 2 (1 − µ)^{-p/2} .  (9.25)
Proof. Let ε be a standard normal vector in IR^p . Let us fix some ξ with µ^{1/2}‖ξ‖◦ ≤ µ^{-1/2} g◦ − r◦ and denote by IP_ξ the conditional probability given ξ . It holds by (9.21) with c_p = (2π)^{-p/2}
c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ
= c_p exp( µ‖ξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{1/2}ξ − µ^{-1/2}γ ‖² ) 1I( ‖µ^{-1/2}γ‖◦ ≤ µ^{-1/2} g◦ ) dγ
= µ^{p/2} exp( µ‖ξ‖²/2 ) IP_ξ( ‖ε − µ^{1/2}ξ‖◦ ≤ µ^{-1/2} g◦ ) ≥ 0.5 µ^{p/2} exp( µ‖ξ‖²/2 ) .
This implies
exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖◦ ≤ g◦/µ − r◦/µ^{1/2} )
≤ 2 µ^{-p/2} c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ .
Further, by (9.20)
c_p IE ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ ≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) dγ = (µ^{-1} − 1)^{-p/2}
and (9.25) follows.
As in the Gaussian case, (9.25) implies for z > p with µ = µ(z) = (z − p)/z the bounds (9.23) and (9.24). Note that the value µ(z) clearly grows with z from zero to one, while g◦/µ(z) − r◦/µ^{1/2}(z) is strictly decreasing. The value z◦ is defined exactly as the point where g◦/µ(z) − r◦/µ^{1/2}(z) crosses u◦ , so that g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ for z ≤ z◦ .
For z > z◦ , the choice µ = µ(z) conflicts with the constraint g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ . So, we apply µ = µ◦ yielding by the Markov inequality
IP( ‖ξ‖² > z, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −µ◦ z/2 − (p/2) log(1 − µ◦) } ,
and the assertion follows.
It is easy to check that the result continues to hold for the norm of Πξ for a given sub-projector Π in IR^p satisfying Π = Π⊤ , Π² ≤ Π . As above, denote p def= tr(Π²) , v² def= 2 tr(Π⁴) . Let r◦ be fixed to ensure
IP( ‖Πε‖◦ ≤ r◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .
The next result is stated for g◦ ≥ r◦ + u◦ , which simplifies the formulation.
Theorem 9.5.3. Let a random vector ξ in IR^p fulfill (9.20) and let Π satisfy Π = Π⊤ , Π² ≤ Π . Let some u◦ be fixed. Then for any µ◦ ≤ 2/3 with g◦ µ◦^{-1} − r◦ µ◦^{-1/2} ≥ u◦ ,
IE exp{ (µ◦/2) ( ‖Πξ‖² − p ) } 1I( ‖Π²ξ‖◦ ≤ u◦ ) ≤ 2 exp( µ◦² v²/4 ) ,  (9.26)
where v² = 2 tr(Π⁴) . Moreover, if g◦ ≥ r◦ + u◦ , then for any x ≥ 0
IP( ‖Πξ‖² > p + (2v x^{1/2}) ∨ (6x), ‖Π²ξ‖◦ ≤ u◦ ) ≤ 2 exp(−x) .
Proof. Arguments from the proofs of Lemmas 9.3.2 and 9.5.2 yield in view of g◦ µ◦^{-1} − r◦ µ◦^{-1/2} ≥ u◦
IE exp{ µ◦‖Πξ‖²/2 } 1I( ‖Π²ξ‖◦ ≤ u◦ ) ≤ IE exp( µ◦‖Πξ‖²/2 ) 1I( ‖Π²ξ‖◦ ≤ g◦/µ◦ − r◦/µ◦^{1/2} ) ≤ 2 det( II_p − µ◦ Π² )^{-1/2} .
Now the inequality log(1 − t) ≥ −t − t² for t ≤ 2/3 implies
− log det( II_p − µ◦ Π² ) ≤ µ◦ p + µ◦² v²/2 ;
cf. (9.17); the assertion (9.26) follows.
9.6 A bound for the ℓ2-norm under Bernstein conditions
For comparison, we specify the results to the case considered recently in Baraud (2010). Let ζ be a random vector in IR^n whose components ζ_i are independent and satisfy the Bernstein type conditions: for all |λ| < c^{-1}
log IE e^{λζ_i} ≤ λ²σ² / (1 − c|λ|) .  (9.27)
Denote ξ = ζ/(2σ) and consider ‖γ‖◦ = ‖γ‖_∞ . Fix g◦ = σ/c . If ‖γ‖◦ ≤ g◦ , then 1 − c|γ_i|/(2σ) ≥ 1/2 and
log IE exp( γ⊤ξ ) ≤ ∑_i log IE exp( γ_i ζ_i/(2σ) ) ≤ ∑_i |γ_i/(2σ)|² σ² / (1 − c|γ_i|/(2σ)) ≤ ‖γ‖²/2 .
Let also S be some linear subspace of IR^n with dimension p and let Π_S denote the projector onto S . For applying the result of Theorem 9.5.1, the value r◦ has to be fixed. We use that the infinity norm ‖ε‖_∞ concentrates around √(2 log p) .
Lemma 9.6.1. It holds for a standard normal vector ε ∈ IR^p with r◦ = √(2 log p)
IP( ‖ε‖◦ ≤ r◦ ) ≥ 1/2 .
Proof. By definition
IP( ‖ε‖◦ > r◦ ) = IP( ‖ε‖_∞ > √(2 log p) ) ≤ p IP( |ε_1| > √(2 log p) ) ≤ 1/2
as required.
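The concentration claim of Lemma 9.6.1 is also easy to confirm by simulation (the dimension p below is an arbitrary illustrative choice):

```python
import numpy as np

# Monte Carlo check of IP(||eps||_inf <= sqrt(2 log p)) >= 1/2 for eps ~ N(0, I_p).
rng = np.random.default_rng(3)
p = 100
r_circ = np.sqrt(2 * np.log(p))
eps = rng.standard_normal((20_000, p))
freq = float(np.mean(np.max(np.abs(eps), axis=1) <= r_circ))
```

For p = 100 the empirical probability is well above 1/2, so the choice r◦ = √(2 log p) is in fact conservative.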
Now the general bound of Theorem 9.5.1 is applied to bounding the norm ‖Π_S ξ‖ . For simplicity of formulation we assume that g◦ ≥ u◦ + r◦ .
Theorem 9.6.2. Let S be some linear subspace of IR^n with dimension p . Let g◦ ≥ u◦ + r◦ . If the coordinates ζ_i of ζ are independent and satisfy (9.27), then for all x
IP( (4σ²)^{-1} ‖Π_S ζ‖² > p + √(κxp) ∨ (κx), ‖Π_S ζ‖_∞ ≤ 2σ u◦ ) ≤ 2 exp(−x) .
The bound of Baraud (2010) reads
IP( ‖Π_S ζ‖² > (3σ ∨ √(6 c u◦)) √(x + 3p), ‖Π_S ζ‖_∞ ≤ 2σ u◦ ) ≤ e^{−x} .
As expected, in the region x ≤ x_c of Gaussian approximation, the bound of Baraud is not sharp and actually quite rough.
Bibliography