Basics of Modern Parametric Statistics
Vladimir Spokoiny
Weierstrass-Institute,
Mohrenstr. 39, 10117 Berlin, Germany
February 13, 2012
2 parametric statistics: modern view
Contents
Preface 9
I Basics 13
1 Basic notions 15
1.1 Example of a Bernoulli experiment . . . . . . . . . . . . . . . . . . . . . . 15
1.2 Least squares estimation in a linear model . . . . . . . . . . . . . . . . . . 18
1.3 General parametric model . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
1.4 Statistical decision problem . . . . . . . . . . . . . . . . . . . . . . . . . . 22
1.5 Efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23
2 Parameter estimation for an i.i.d. model 25
2.1 Empirical distribution. Glivenko-Cantelli Theorem . . . . . . . . . . . . . 25
2.2 Substitution principle. Method of moments . . . . . . . . . . . . . . . . . 29
2.2.1 Method of moments. Univariate parameter . . . . . . . . . . . . . 30
2.2.2 Method of moments. Multivariate parameter . . . . . . . . . . . . 31
2.2.3 Method of moments. Examples . . . . . . . . . . . . . . . . . . . . 31
2.3 Unbiased estimates, bias, and quadratic risk . . . . . . . . . . . . . . . . . 36
2.3.1 Univariate parameter . . . . . . . . . . . . . . . . . . . . . . . . . 36
2.3.2 Multivariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4 Asymptotic properties . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
2.4.1 Root-n normality. Univariate parameter . . . . . . . . . . . . . . 38
2.4.2 Root-n normality. Multivariate parameter . . . . . . . . . . . . . 40
2.5 Some geometric properties of a parametric family . . . . . . . . . . . . . . 43
2.5.1 Kullback-Leibler divergence . . . . . . . . . . . . . . . . . . . . . . 43
2.5.2 Hellinger distance . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
2.5.3 Regularity and the Fisher Information. Univariate parameter . . . 46
2.6 Cramer-Rao Inequality . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.1 Univariate parameter . . . . . . . . . . . . . . . . . . . . . . . . . 50
2.6.2 Exponential families and R-efficiency . . . . . . . . . . . . . . . . . 51
2.7 Cramer-Rao inequality. Multivariate parameter . . . . . . . . . . . . . . . 53
2.7.1 Regularity and Fisher Information. Multivariate parameter . . . . 53
2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger
distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
2.7.3 Multivariate Cramer-Rao Inequality . . . . . . . . . . . . . . . . . 56
2.7.4 Exponential families and R-efficiency . . . . . . . . . . . . . . . . . 57
2.8 Maximum likelihood and other estimation methods . . . . . . . . . . . . . 59
2.8.1 Minimum distance estimation . . . . . . . . . . . . . . . . . . . . . 59
2.8.2 M -estimation and Maximum likelihood estimation . . . . . . . . . 59
2.9 Maximum Likelihood for some parametric families . . . . . . . . . . . . . 63
2.9.1 Gaussian shift . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
2.9.2 Variance estimation for the normal law . . . . . . . . . . . . . . . 65
2.9.3 Univariate normal distribution . . . . . . . . . . . . . . . . . . . . 66
2.9.4 Uniform distribution on [0, θ] . . . . . . . . . . . . . . . . . . . . . 66
2.9.5 Bernoulli or binomial model . . . . . . . . . . . . . . . . . . . . . . 66
2.9.6 Multinomial model . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.9.7 Exponential model . . . . . . . . . . . . . . . . . . . . . . . . . . . 67
2.9.8 Poisson model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
2.9.9 Shift of a Laplace (double exponential) law . . . . . . . . . . . . . 68
2.10 Quasi Maximum Likelihood approach . . . . . . . . . . . . . . . . . . . . 69
2.10.1 LSE as quasi likelihood estimation . . . . . . . . . . . . . . . . . . 69
2.10.2 LAD and robust estimation as quasi likelihood estimation . . . . . 71
2.11 Univariate exponential families . . . . . . . . . . . . . . . . . . . . . . . . 72
2.11.1 Natural parametrization . . . . . . . . . . . . . . . . . . . . . . . . 72
2.11.2 Canonical parametrization . . . . . . . . . . . . . . . . . . . . . . . 75
2.11.3 Deviation probabilities for the maximum likelihood . . . . . . . . . 78
3 Regression Estimation 85
3.1 Regression model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.1 Observations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
3.1.2 Design . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
3.1.3 Errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 87
3.1.4 Regression function . . . . . . . . . . . . . . . . . . . . . . . . . . 88
3.2 Method of substitution and M-estimation . . . . . . . . . . . . . . . . . . 89
3.2.1 Mean regression. Least squares estimate . . . . . . . . . . . . . . . 89
3.2.2 Median regression. Least absolute deviation estimate . . . . . . . . 90
3.2.3 Maximum likelihood regression . . . . . . . . . . . . . . . . . . . . 91
3.3 Linear regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
3.3.1 Projection estimation . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.2 Piecewise linear estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.3.3 Spline estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.4 Wavelet estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.3.5 Kernel estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4 Density function estimation . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.1 Linear projection estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.4.2 Wavelet density estimation . . . . . . . . . . . . . . . . . . . . . . 94
3.4.3 Kernel density estimation . . . . . . . . . . . . . . . . . . . . . . . 94
3.4.4 Estimation based on Fourier transformation . . . . . . . . . . . . . 94
3.5 Generalized regression . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 94
3.6 Generalized linear models . . . . . . . . . . . . . . . . . . . . . . . . . . . 96
3.6.1 Logit regression for binary data . . . . . . . . . . . . . . . . . . . . 97
3.6.2 Poisson regression . . . . . . . . . . . . . . . . . . . . . . . . . . . 98
3.7 Quasi Maximum Likelihood estimation . . . . . . . . . . . . . . . . . . . . 98
4 Estimation in linear models 101
4.1 Modeling assumptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
4.2 Quasi maximum likelihood estimation . . . . . . . . . . . . . . . . . . . . 102
4.2.1 Estimation under the homogeneous noise assumption . . . . . . . . 104
4.2.2 Linear basis transformation . . . . . . . . . . . . . . . . . . . . . . 104
4.2.3 Orthogonal and orthonormal design . . . . . . . . . . . . . . . . . 106
4.2.4 Spectral representation . . . . . . . . . . . . . . . . . . . . . . . . 107
4.3 Properties of the response estimate f . . . . . . . . . . . . . . . . . . . . 108
4.3.1 Decomposition into a deterministic and a stochastic component . . 109
4.3.2 Properties of the operator Π . . . . . . . . . . . . . . . . . . . . . 109
4.3.3 Quadratic loss and risk of the response estimation . . . . . . . . . 110
4.3.4 Misspecified “colored noise” . . . . . . . . . . . . . . . . . . . . . . 111
4.4 Properties of the MLE θ . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
4.4.1 Properties of the stochastic component . . . . . . . . . . . . . . . . 113
4.4.2 Properties of the deterministic component . . . . . . . . . . . . . . 114
4.4.3 Risk of estimation. R-efficiency . . . . . . . . . . . . . . . . . . . . 115
4.4.4 The case of a misspecified noise . . . . . . . . . . . . . . . . . . . . 118
4.5 Linear models and quadratic log-likelihood . . . . . . . . . . . . . . . . . . 119
4.6 Inference based on the maximum likelihood . . . . . . . . . . . . . . . . . 121
4.6.1 A misspecified LPA . . . . . . . . . . . . . . . . . . . . . . . . . . 124
4.6.2 A misspecified noise structure . . . . . . . . . . . . . . . . . . . . . 124
4.7 Ridge regression, projection, and shrinkage . . . . . . . . . . . . . . . . . 125
4.7.1 Regularization and ridge regression . . . . . . . . . . . . . . . . . . 126
4.7.2 Penalized likelihood. Bias and variance . . . . . . . . . . . . . . . 127
4.7.3 Inference for the penalized MLE . . . . . . . . . . . . . . . . . . . 130
4.7.4 Projection and shrinkage estimates . . . . . . . . . . . . . . . . . . 131
4.7.5 Smoothness constraints and roughness penalty approach . . . . . . 134
4.8 Shrinkage in a linear inverse problem . . . . . . . . . . . . . . . . . . . . . 134
4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates . . . 135
4.8.2 Galerkin method . . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
4.9 Semiparametric estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.9.1 (θ, η)- and υ-setup . . . . . . . . . . . . . . . . . . . . . . . . . . 138
4.9.2 Orthogonality and product structure . . . . . . . . . . . . . . . . . 139
4.9.3 Partial estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
4.9.4 Profile estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 142
4.9.5 Semiparametric efficiency bound . . . . . . . . . . . . . . . . . . . 145
4.9.6 Inference for the profile likelihood approach . . . . . . . . . . . . . 146
4.9.7 Plug-in method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
4.9.8 Two step procedure . . . . . . . . . . . . . . . . . . . . . . . . . . 149
4.9.9 Alternating method . . . . . . . . . . . . . . . . . . . . . . . . . . 149
5 Bayes estimation 153
5.1 Bayes formula . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 154
5.2 Conjugated priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.1 Examples . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 156
5.2.2 Exponential families and conjugated priors . . . . . . . . . . . . . 157
5.3 Linear Gaussian model and Gaussian priors . . . . . . . . . . . . . . . . . 157
5.3.1 Univariate case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 157
5.3.2 Linear Gaussian model and Gaussian prior . . . . . . . . . . . . . 158
5.3.3 Homogeneous errors, orthogonal design . . . . . . . . . . . . . . . 161
5.4 Non-informative priors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 162
5.5 Bayes estimate and posterior mean . . . . . . . . . . . . . . . . . . . . . . 163
5.5.1 Posterior mean and ridge regression . . . . . . . . . . . . . . . . . 165
6 Testing a statistical hypothesis 167
6.1 Testing problem . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1.1 Simple hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 167
6.1.2 Composite hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.1.3 A test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 168
6.1.4 Errors of the first kind, test level . . . . . . . . . . . . . . . . . . . 169
6.1.5 A randomized test . . . . . . . . . . . . . . . . . . . . . . . . . . . 169
6.1.6 An alternative, error of the second kind, power of the test . . . . 170
6.2 Neyman-Pearson test for two simple hypotheses . . . . . . . . . . . . . . . 171
6.2.1 Neyman-Pearson test for an i.i.d. sample . . . . . . . . . . . . . . 173
6.3 Likelihood ratio test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
6.3.1 Gaussian shift model . . . . . . . . . . . . . . . . . . . . . . . . . . 175
6.3.2 One-sided test . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 176
6.3.3 Testing the mean when the variance is unknown . . . . . . . . . . 177
6.3.4 LR-tests. Examples . . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4 Testing problem for a univariate exponential family . . . . . . . . . . . . . 178
6.4.1 Two-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . 178
6.4.2 One-sided alternative . . . . . . . . . . . . . . . . . . . . . . . . . 180
6.4.3 Interval hypothesis . . . . . . . . . . . . . . . . . . . . . . . . . . . 182
7 Testing in linear models 185
7.1 Likelihood ratio test for a simple null . . . . . . . . . . . . . . . . . . . . . 185
7.1.1 General errors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 185
7.1.2 I.i.d. errors, known variance . . . . . . . . . . . . . . . . . . . . . . 186
7.1.3 Smooth Wald test . . . . . . . . . . . . . . . . . . . . . . . . . . . 190
7.1.4 I.i.d. errors with unknown variance . . . . . . . . . . . . . . . . . . 190
7.2 Likelihood ratio test for a linear hypothesis . . . . . . . . . . . . . . . . . 192
8 Some other testing methods 197
8.1 Method of moments for an i.i.d. sample . . . . . . . . . . . . . . . . . . . 197
8.1.1 Series expansion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 199
8.1.2 Chi-squared test . . . . . . . . . . . . . . . . . . . . . . . . . . . . 200
8.1.3 Testing a parametric hypothesis . . . . . . . . . . . . . . . . . . . 200
8.2 Minimum distance method for an i.i.d. sample . . . . . . . . . . . . . . . 201
8.2.1 Kolmogorov-Smirnov test . . . . . . . . . . . . . . . . . . . . . . . 202
8.2.2 ω2 test (Cramer-Smirnov-von Mises) . . . . . . . . . . . . . . . . 204
8.3 Partially Bayes tests and Bayes testing . . . . . . . . . . . . . . . . . . . . 204
8.3.1 Quasi LR approach . . . . . . . . . . . . . . . . . . . . . . . . . . . 204
8.3.2 Partial Bayes approach and Bayes tests . . . . . . . . . . . . . . . 205
8.3.3 Bayes approach . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 205
9 Deviation probability for quadratic forms 207
9.1 Gaussian case . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 207
9.2 A bound for the ℓ2-norm . . . . . . . . . . . . . . . . . . . . . . . . . 210
9.3 A bound for a quadratic form . . . . . . . . . . . . . . . . . . . . . . . . . 214
9.4 Rescaling and regularity condition . . . . . . . . . . . . . . . . . . . . . . 217
9.5 A chi-squared bound with norm-constraints . . . . . . . . . . . . . . . . . 218
9.6 A bound for the ℓ2-norm under Bernstein conditions . . . . . . . . . . . . 221
Preface
This book was written on the basis of a graduate course on mathematical statistics given
at the mathematical faculty of the Humboldt University Berlin.
The classical theory of parametric estimation, since the seminal works by Fisher,
Wald and Le Cam, among many others, has now reached maturity and an elegant form.
It can be considered as more or less complete, at least for the so-called “regular case”.
The question of the optimality and efficiency of the classical methods has been rigorously
studied and typical results state the asymptotic normality and efficiency of the maximum
likelihood and/or Bayes estimates; see an excellent monograph by ? for a comprehensive
study.
Around 1984, when I started my own PhD at Lomonosov University, a popular joke in our statistical community in Moscow was that all the problems of parametric statistical theory had been solved and described in a complete way in ?, and that nothing was left for mathematical statisticians to do; at most, a few nonparametric problems remained open. After finishing my PhD I also moved to nonparametric statistics for a while, with a focus on local adaptive estimation. In 2005 I started to write a monograph on nonparametric estimation using local parametric methods, which was supposed to systematize my previous experience in this area. The very first draft of that book was available already in the autumn of 2005, and it included only a few sections about the basics of parametric estimation. However, attempts to prepare a more systematic and more general presentation of the nonparametric theory led me back to the very basic parametric concepts. In 2007 I significantly extended the part about parametric methods. In the spring of 2009 I taught a graduate course on parametric statistics at the mathematical faculty of the Humboldt University Berlin. My intention was to present a “modern” version of the theory which mainly addresses the following questions:
- “what do you need to know from parametric statistics to work on modern parametric and nonparametric methods?”
- “what kind of results can be established about the quality of general parametric methods if the underlying parametric model is misspecified and if the sample size does not tend to infinity?”
- “where is the borderline between parametric and nonparametric statistics?”
The classical viewpoint is that parametric statistics deals with a fixed finite-dimensional parameter space, while nonparametric statistics considers either an infinite-dimensional (functional) parameter space or a parameter space whose dimensionality grows with the sample size. Unfortunately, this distinction is not very useful or informative once we allow for model misspecification and finite samples. The book offers a slightly different vision. In particular, many problems usually treated within the nonparametric setup are included here as parametric ones. Examples are given by high-dimensional linear estimation, roughness penalties, posteriors for high-dimensional Gaussian priors, etc. The starting point of the “modern parametric” view can be stated as follows:
- any model is parametric;
- any parametric model is wrong;
- even a wrong model can be useful.
The model mentioned in the first item can be understood as a set of assumptions describing the unknown distribution of the underlying data. This description is usually given in terms of some parameters. The parameter space can be large or infinite-dimensional; however, the model is uniquely specified by the parameter value. In this sense “any model is parametric”.
The second statement, “any parametric model is wrong”, means that any model is only an idealization (approximation) of reality. It is unrealistic to assume that the data exactly follow the parametric model, even if this model is flexible and involves many parameters. Model misspecification naturally leads to the notion of the
modeling bias measuring the distance between the underlying model and the selected
parametric family. It also indicates a borderline between parametric and nonparametric
approaches. The parametric approach focuses on “estimation within the model” ignoring
the modeling bias. The nonparametric approach attempts to account for the modeling
bias and to optimize the joint impact of two kinds of errors: estimation error within
the model and the modeling bias. This volume is limited to parametric estimation for
some special models like exponential families or linear models. However, it prepares some
important tools for doing the general parametric theory presented in the second volume.
The last statement, “even a wrong model can be useful”, introduces the notion of a “useful” parametric specification. In some sense it indicates a paradigm change in parametric statistics: trying to find the true model is hopeless anyway. Instead, one aims at a potentially wrong parametric model which, however, possesses some useful properties. Among others, one can single out the following “useful” features:
- a nice geometric structure of the likelihood leading to a numerically efficient
estimation procedure;
- parameter identifiability.
Lack of identifiability is just an indication that the parametric model is poorly selected. A proper parametrization should involve a reasonable regularization ensuring both features: numerical efficiency/stability and proper parameter identification. The present volume presents some examples of “useful models”, like linear models or exponential families. The second volume will extend such models to
a quite general regular case involving some smoothness and moment conditions on the
log-likelihood process of the considered parametric family.
This book does not pretend to systematically cover the scope of the classical parametric theory. Some very important and even fundamental issues are not considered at all. One characteristic example is the notion of sufficiency, which can hardly be combined with model misspecification. At the same time, much more attention is paid to the questions of nonasymptotic inference under model misspecification, including concentration and confidence sets and the dimensionality of the parameter space.
The first volume of the book presents some basic issues and concepts of the statistical theory and illustrates them in detail for exponential families and linear models. A special focus on linear models can be explained by their role in the general theory, in which a linear model naturally arises from the local approximation of a general regular model. This volume can be used as a textbook for a graduate course in mathematical statistics. It assumes that the reader is familiar with the basic notions of probability theory including the Lebesgue measure, the Radon-Nikodym derivative, etc. Knowledge of basic statistics is not required. I tried to be as self-contained as possible; most of the presented results are proved in a rigorous way. Sometimes the details are left to the reader as exercises; in those cases some hints are given. The volume is structured as follows. The first chapter starts with a couple of examples illustrating the basic notions of the statistical estimation theory. Then it introduces some important notions like statistical experiment, regression model, i.i.d. sample.
Chapter 2 is very important for understanding the whole book. It starts with very
classical stuff: Glivenko-Cantelli results for the empirical measure that motivate the
famous substitution principle. Then the method of moments is studied in more detail, including the risk analysis and asymptotic properties. Some other classical estimation procedures are briefly discussed, including the method of minimum distance and M-estimation with its special cases: least squares, least absolute deviations and maximum likelihood estimates. The concept of efficiency is discussed in the context of the Cramer-Rao risk bound, which is given in the univariate and multivariate cases. The last sections of Chapter 2 start a kind of smooth transition from classical to “modern” parametric statistics and reveal the approach of the book. The presentation is focused on the (quasi) likelihood-based concentration and confidence sets. The basic concentration result is first introduced for the simplest Gaussian shift model and then extended to the case of a univariate exponential family in Section 2.11.
Chapter 3 extends the notions and approaches introduced for the i.i.d. case to more general regression models. Chapter 4 systematically studies the estimation problem for a linear model. The first four sections are fairly classical, and the presented results are based on a direct analysis of the linear estimation procedures. Section 4.6 reproduces the same results in a very short form, but now based on the likelihood analysis. The presentation rests on the celebrated chi-squared phenomenon, which appears to be the fundamental fact yielding the exact likelihood-based concentration and confidence properties. The further sections are complementary and can be recommended for more profound reading. Issues like regularization, shrinkage, smoothness and roughness are usually studied within the nonparametric theory; here I try to fit them into the classical linear parametric setup. A special focus is on semiparametric estimation in Section 4.9. In particular, efficient estimation and the chi-squared result are extended to the semiparametric framework. Chapter 5 briefly discusses the Bayes approach to the problem of parameter estimation.
The remaining chapters of the volume are devoted to the testing problem. Chapter 6 presents classical results like the Neyman-Pearson Lemma and properties of the likelihood ratio test for exponential families. Chapter 7 focuses on the testing problem for the linear Gaussian model. Finally, Chapter 8 presents an overview of some nonparametric testing procedures like the minimum distance, Kolmogorov-Smirnov, ω2 and χ2 tests. A brief look at the testing problem from the Bayes viewpoint is given at the end.
Part I
Basics
Chapter 1
Basic notions
The starting point of any statistical analysis is data, also called observations or a sample.
A statistical model is used to explain the nature of the data. A standard approach
assumes that the data is random and utilizes some probabilistic framework. In contrast to probability theory, the distribution of the data is not known precisely, and the goal of the analysis is to make inference about this unknown distribution.
The parametric approach assumes that the distribution of the data is known up to the
value of a parameter $\theta$ from some subset $\Theta$ of a finite-dimensional space $IR^p$. In this
case the statistical analysis is naturally reduced to the estimation of the parameter θ :
as soon as θ is known, we know the whole distribution of the data. Before introducing
the general notion of a statistical model, we discuss some popular examples.
1.1 Example of a Bernoulli experiment
Let $Y = (Y_1, \ldots, Y_n)^{\top}$ be a sequence of binary digits zero or one. We distinguish
between deterministic and random sequences. Deterministic sequences appear e.g. from
the binary representation of a real number, or from digitally coded images, etc. Random
binary sequences appear e.g. from coin throws, games, etc. In many situations incomplete
information can be treated as random data: the classification of healthy and sick patients,
individual vote results, the bankruptcy of a firm or credit default, etc.
Basic assumptions behind a Bernoulli experiment are:
• the observed data Yi are independent and identically distributed.
• each Yi assumes the value one with probability θ ∈ [0, 1] .
The parameter θ completely identifies the distribution of the data Y . Indeed, for every
$i \le n$ and $y \in \{0,1\}$,
$$ IP(Y_i = y) = \theta^{y} (1-\theta)^{1-y}, $$
and the independence of the $Y_i$'s implies for every sequence $y = (y_1, \ldots, y_n)$ that
$$ IP(Y = y) = \prod_{i=1}^{n} \theta^{y_i} (1-\theta)^{1-y_i}. \qquad (1.1) $$
To indicate this fact, we write IPθ in place of IP .
The equation (1.1) can be rewritten as
$$ IP_\theta(Y = y) = \theta^{s_n} (1-\theta)^{n - s_n}, $$
where
$$ s_n = \sum_{i=1}^{n} y_i . $$
The value $s_n$ is often interpreted as the number of successes in the sequence $y$.
Probabilistic theory focuses on the probabilistic properties of the data Y under the
given measure IPθ . The aim of the statistical analysis is to infer on the measure IPθ for
an unknown θ based on the available data Y . Typical examples of statistical problems
are:
1. Estimate the parameter $\theta$, i.e. build an estimate $\tilde\theta$ as a function of the data $Y$ with values in $[0,1]$ which approximates the unknown value $\theta$ as well as possible;

2. Build a confidence set for $\theta$, i.e. a random (data-based) set (usually an interval) containing $\theta$ with a prescribed probability;

3. Test a simple hypothesis that $\theta$ coincides with a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$;

4. Test a composite hypothesis that $\theta$ belongs to a prescribed subset $\Theta_0$ of the interval $[0,1]$.
Usually any statistical method is based on a preliminary probabilistic analysis of the
model under the given θ .
Theorem 1.1.1. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then the mean and the variance of the sum $S_n = Y_1 + \ldots + Y_n$ satisfy
$$ IE_\theta S_n = n\theta, \qquad Var_\theta S_n \stackrel{def}{=} IE_\theta \bigl( S_n - IE_\theta S_n \bigr)^2 = n\theta(1-\theta). $$
Exercise 1.1.1. Prove this theorem.
This result suggests that the empirical mean $\tilde\theta = S_n/n$ is a reasonable estimate of $\theta$. Indeed, the result of the theorem implies
$$ IE_\theta \tilde\theta = \theta, \qquad IE_\theta \bigl( \tilde\theta - \theta \bigr)^2 = \theta(1-\theta)/n. $$
The first equation means that $\tilde\theta$ is an unbiased estimate of $\theta$, that is, $IE_\theta \tilde\theta = \theta$ for all $\theta$. The second equation yields a kind of concentration (consistency) property of $\tilde\theta$: with $n$ growing, the estimate $\tilde\theta$ concentrates in a small neighborhood of the point $\theta$. By the Chebyshev inequality
$$ IP_\theta \bigl( |\tilde\theta - \theta| > \delta \bigr) \le \theta(1-\theta)/(n\delta^2). $$
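The unbiasedness and the variance formula can be checked by a quick simulation. The following Python sketch is my own illustration, not part of the text; all names and parameter values are ad hoc. It draws many Bernoulli samples and compares the Monte Carlo mean and variance of $\tilde\theta$ with $\theta$ and $\theta(1-\theta)/n$:

```python
import random

random.seed(1)

theta = 0.3    # true parameter (arbitrary illustrative value)
n = 100        # sample size
M = 20000      # Monte Carlo replications

estimates = []
for _ in range(M):
    sample = [1 if random.random() < theta else 0 for _ in range(n)]
    estimates.append(sum(sample) / n)   # empirical mean: theta-tilde = S_n / n

mc_mean = sum(estimates) / M
mc_var = sum((e - mc_mean) ** 2 for e in estimates) / M

print(mc_mean)   # close to theta = 0.3 (unbiasedness)
print(mc_var)    # close to theta*(1-theta)/n = 0.0021
```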
This result is refined by the famous de Moivre-Laplace theorem.
Theorem 1.1.2. Let $Y$ be i.i.d. Bernoulli with the parameter $\theta$. Then for every $k \le n$
$$ IP_\theta(S_n = k) = \binom{n}{k} \theta^{k} (1-\theta)^{n-k} \approx \frac{1}{\sqrt{2\pi n \theta(1-\theta)}} \exp\Bigl\{ -\frac{(k - n\theta)^2}{2 n \theta(1-\theta)} \Bigr\}, $$
where $a_n \approx b_n$ means $a_n / b_n \to 1$ as $n \to \infty$. Moreover, for any fixed $z > 0$,
$$ IP_\theta \Bigl( \Bigl| \frac{S_n}{n} - \theta \Bigr| > z \sqrt{\theta(1-\theta)/n} \Bigr) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2} \, dt . $$
This concentration result yields that the estimate $\tilde\theta$ deviates from the root-n neighborhood $A(z,\theta) \stackrel{def}{=} \{ u : |u - \theta| \le z \sqrt{\theta(1-\theta)/n} \}$ with probability of order $e^{-z^2/2}$.
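The local approximation in the de Moivre-Laplace theorem is easy to inspect numerically. The sketch below is illustrative (standard library only; the parameter values are my choice) and compares the exact binomial probability with its Gaussian approximation near $k = n\theta$:

```python
import math

theta = 0.4
n = 200

def binom_pmf(k):
    # exact probability IP_theta(S_n = k)
    return math.comb(n, k) * theta ** k * (1 - theta) ** (n - k)

def gauss_approx(k):
    # de Moivre-Laplace approximation of the same probability
    v = n * theta * (1 - theta)
    return math.exp(-((k - n * theta) ** 2) / (2 * v)) / math.sqrt(2 * math.pi * v)

# near the mean n*theta = 80 the ratio of the two expressions is close to one
for k in (70, 80, 90):
    print(k, binom_pmf(k) / gauss_approx(k))
```

For moderate $n$ the ratio already deviates from one by only a few percent, in line with the $a_n/b_n \to 1$ statement of the theorem.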
This result bounding the difference $|\tilde\theta - \theta|$ can also be used to build random confidence intervals around the point $\tilde\theta$. Indeed, by the result of the theorem, the random interval $E^*(z) = \{ u : |\tilde\theta - u| \le z \sqrt{\theta(1-\theta)/n} \}$ fails to cover the true point $\theta$ with approximately the same probability:
$$ IP_\theta \bigl( E^*(z) \not\ni \theta \bigr) \approx \frac{2}{\sqrt{2\pi}} \int_z^\infty e^{-t^2/2} \, dt . \qquad (1.2) $$
Unfortunately, the construction of this interval $E^*(z)$ is not entirely data-based: its width involves the true unknown value $\theta$. A data-based confidence set can be obtained by replacing the population variance $\sigma^2 \stackrel{def}{=} IE_\theta (Y_1 - \theta)^2 = \theta(1-\theta)$ with its empirical counterpart
$$ \tilde\sigma^2 \stackrel{def}{=} \frac{1}{n} \sum_{i=1}^{n} \bigl( Y_i - \tilde\theta \bigr)^2 . $$
The resulting confidence set $E(z)$ reads as
$$ E(z) \stackrel{def}{=} \bigl\{ u : |\tilde\theta - u| \le z \sqrt{\tilde\sigma^2 / n} \bigr\} . $$
It possesses the same asymptotic properties as $E^*(z)$, including (1.2).
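The claim that the data-based interval $E(z)$ keeps the asymptotic coverage can be checked by simulation. This Python sketch (mine, not from the book) estimates the non-coverage frequency for $z = 1.96$, for which the nominal non-coverage $2(1-\Phi(z))$ is about $0.05$:

```python
import math
import random

random.seed(2)

theta, n, z, M = 0.3, 500, 1.96, 4000   # illustrative values

misses = 0
for _ in range(M):
    y = [1 if random.random() < theta else 0 for _ in range(n)]
    tt = sum(y) / n                                # theta-tilde
    s2 = sum((yi - tt) ** 2 for yi in y) / n       # empirical variance sigma-tilde^2
    if abs(tt - theta) > z * math.sqrt(s2 / n):    # E(z) does not cover theta
        misses += 1

print(misses / M)   # close to the nominal non-coverage 0.05
```

Note that for Bernoulli data the empirical variance simplifies to $\tilde\sigma^2 = \tilde\theta(1-\tilde\theta)$, so the loop computes exactly the interval $E(z)$ of the text.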
The hypothesis that the value $\theta$ is equal to a prescribed value $\theta_0$, e.g. $\theta_0 = 1/2$, can be checked by examining the difference $|\tilde\theta - 1/2|$. If this value is too large compared with $\sigma n^{-1/2}$ or with $\tilde\sigma n^{-1/2}$, then the data contradict the hypothesis with high probability. Similarly one can consider a composite hypothesis that $\theta$ belongs to some interval $[\theta_1, \theta_2] \subset [0,1]$. If $\tilde\theta$ deviates from this interval by at least the value $z \tilde\sigma n^{-1/2}$ with a large $z$, then the data significantly contradict this hypothesis.
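As a small illustration (again my own sketch; the function names are ad hoc), a test of the simple hypothesis $\theta = \theta_0$ rejects when the studentized difference exceeds a threshold $z$. Its rejection rate is close to the nominal level under the hypothesis and close to one for a distant alternative:

```python
import math
import random

random.seed(3)

def reject(y, theta0, z=1.96):
    """Reject theta = theta0 when |theta-tilde - theta0| > z * sigma-tilde * n^{-1/2}."""
    n = len(y)
    tt = sum(y) / n
    sigma = math.sqrt(tt * (1 - tt))   # empirical sigma-tilde for Bernoulli data
    return abs(tt - theta0) > z * sigma / math.sqrt(n)

def rejection_rate(theta, theta0, n=1000, M=500):
    hits = 0
    for _ in range(M):
        y = [1 if random.random() < theta else 0 for _ in range(n)]
        hits += reject(y, theta0)
    return hits / M

print(rejection_rate(0.5, 0.5))   # close to the nominal level 0.05
print(rejection_rate(0.6, 0.5))   # close to one: 0.1 is many sigma*n^{-1/2} units
```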
1.2 Least squares estimation in a linear model
A linear model assumes a linear systematic dependence of the output (also called response or explained variable) $Y$ on the input (also called regressor or explanatory variable) $\Psi$, which in general can be multidimensional. The linear model is usually written in the form
$$ IE(Y) = \Psi^{\top} \theta^* $$
with an unknown vector of coefficients $\theta^* = (\theta_1^*, \ldots, \theta_p^*)^{\top}$. Equivalently one writes
$$ Y = \Psi^{\top} \theta^* + \varepsilon \qquad (1.3) $$
where $\varepsilon$ stands for the individual error with zero mean: $IE\varepsilon = 0$. Such a linear model is often used to describe the influence of the regressor $\Psi$ on the response $Y$ from a collection of data in the form of a sample $(Y_i, \Psi_i)$ for $i = 1, \ldots, n$.
Let $\theta$ be a vector of coefficients considered as a candidate for $\theta^*$. Then each observation $Y_i$ is approximated by $\Psi_i^{\top} \theta$. One often measures the quality of approximation by the sum of squared errors $\sum_i |Y_i - \Psi_i^{\top} \theta|^2$. Under the model assumption (1.3), the expected value of this sum is
$$ IE \sum_i |Y_i - \Psi_i^{\top} \theta|^2 = IE \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) + \varepsilon_i \bigr|^2 = \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) \bigr|^2 + \sum_i IE \varepsilon_i^2 . $$
The cross term cancels in view of $IE\varepsilon_i = 0$. Note that minimizing this expression w.r.t. $\theta$ is equivalent to minimizing the first sum, because the second sum does not depend on $\theta$. Therefore,
$$ \operatorname{argmin}_{\theta} IE \sum_i |Y_i - \Psi_i^{\top} \theta|^2 = \operatorname{argmin}_{\theta} \sum_i \bigl| \Psi_i^{\top}(\theta^* - \theta) \bigr|^2 = \theta^* . $$
In words, the true parameter vector $\theta^*$ minimizes the expected quadratic error of fitting the data with a linear combination of the $\Psi_i$'s. The least squares estimate of the parameter vector $\theta^*$ is defined by minimizing in $\theta$ its empirical counterpart, that is, the sum of the squared errors $|Y_i - \Psi_i^{\top} \theta|^2$ over all $i$:
$$ \tilde\theta \stackrel{def}{=} \operatorname{argmin}_{\theta} \sum_{i=1}^{n} \bigl| Y_i - \Psi_i^{\top} \theta \bigr|^2 . $$
This minimization problem can be solved explicitly under some condition on the $\Psi_i$'s. Define the $p \times n$ design matrix $\Psi = (\Psi_1, \ldots, \Psi_n)$. The aforementioned condition means that this matrix has rank $p$.
Theorem 1.2.1. Let $Y_i = \Psi_i^{\top} \theta^* + \varepsilon_i$ for $i = 1, \ldots, n$, where the $\varepsilon_i$ are independent and satisfy $IE\varepsilon_i = 0$, $IE\varepsilon_i^2 = \sigma^2$. Suppose that the matrix $\Psi$ has rank $p$. Then
$$ \tilde\theta = \bigl( \Psi \Psi^{\top} \bigr)^{-1} \Psi Y $$
where $Y = (Y_1, \ldots, Y_n)^{\top}$. Moreover, $\tilde\theta$ is unbiased in the sense that
$$ IE_{\theta^*} \tilde\theta = \theta^* $$
and its variance satisfies $Var(\tilde\theta) = \sigma^2 \bigl( \Psi \Psi^{\top} \bigr)^{-1}$.
For each vector $h \in IR^p$, the random value $\tilde a = \langle h, \tilde\theta \rangle = h^{\top} \tilde\theta$ is an unbiased estimate of $a^* = h^{\top} \theta^*$:
$$ IE_{\theta^*}(\tilde a) = a^* \qquad (1.4) $$
with the variance
$$ Var(\tilde a) = \sigma^2 h^{\top} \bigl( \Psi \Psi^{\top} \bigr)^{-1} h . $$
Proof. Define

L(θ) def= ∑_{i=1}^n |Yi − Ψ>i θ|² = ‖Y − Ψ>θ‖²,

where ‖y‖² def= ∑_i yi² . The normal equation dL(θ)/dθ = 0 can be written as ΨΨ>θ =
ΨY , yielding the representation of θ̃ . Now the model equation yields IEθ∗Y = Ψ>θ∗
and thus

IEθ∗ θ̃ = (ΨΨ>)⁻¹ Ψ IEθ∗Y = (ΨΨ>)⁻¹ ΨΨ>θ∗ = θ∗

as required.
Exercise 1.2.1. Check that Var(θ̃) = σ²(ΨΨ>)⁻¹ .

Similarly one obtains IEθ∗ ã = IEθ∗(h>θ̃) = h>θ∗ = a∗ , that is, ã is an unbiased
estimate of a∗ . Also

Var(ã) = Var(h>θ̃) = h> Var(θ̃) h = σ² h>(ΨΨ>)⁻¹ h,

which completes the proof.
The next result states that the proposed estimate ã is in some sense the best possible
one. Namely, we consider the class of all linear unbiased estimates of a∗ satisfying the
identity (1.4). It appears that the variance σ² h>(ΨΨ>)⁻¹ h of ã is the smallest possible
in this class.
Theorem 1.2.2 (Gauss-Markov). Let Yi = Ψ>i θ∗ + εi for i = 1, . . . , n with uncorrelated
εi satisfying IEεi = 0 and IEεi² = σ² . Let rank(Ψ) = p . Suppose that the value
a∗ def= ⟨h, θ∗⟩ = h>θ∗ is to be estimated for a given vector h ∈ IRp . Then ã = ⟨h, θ̃⟩ = h>θ̃
is an unbiased estimate of a∗ . Moreover, ã has the minimal possible variance over the
class of all linear unbiased estimates of a∗ .
This result was historically one of the first optimality results in statistics. It presents
a lower efficiency bound for any statistical procedure in this class. Under the imposed
restrictions it is impossible to do better than the LSE does. This and more general results
will be proved later in Chapter 4.
Define also the vector of residuals

ε̃ def= Y − Ψ>θ̃ .

If θ̃ is a good estimate of the vector θ∗ , then, due to the model equation, ε̃ is a good
estimate of the vector ε of individual errors. Many statistical procedures utilize this
observation by checking the quality of estimation via the analysis of the estimated vector
ε̃ . In the case when this vector still shows a nonzero systematic component, there is
evidence that the assumed linear model is incorrect. This vector can also be used to
estimate the noise variance σ² .
Theorem 1.2.3. Consider the linear model Yi = Ψ>i θ∗ + εi with independent homoge-
neous errors εi . Then the variance σ² = IEεi² can be estimated by

σ̃² = ‖ε̃‖²/(n − p) = ‖Y − Ψ>θ̃‖²/(n − p)

and σ̃² is an unbiased estimate of σ² , that is, IEθ∗ σ̃² = σ² for all θ∗ and σ .
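As a quick numerical illustration of Theorems 1.2.1 and 1.2.3, the sketch below works out the special case p = 2 with design vectors Ψi = (1, xi)> (simple linear regression). All data-generating values are illustrative, and the normal equations ΨΨ>θ = ΨY are solved by explicit 2 × 2 inversion.

```python
import random

def lse(x, y):
    """Solve the normal equations (Psi Psi^T) theta = Psi Y for the
    p = 2 design Psi_i = (1, x_i)^T by explicit 2x2 inversion."""
    n = len(x)
    sx, sxx = sum(x), sum(v * v for v in x)
    sy, sxy = sum(y), sum(v * w for v, w in zip(x, y))
    det = n * sxx - sx * sx  # det(Psi Psi^T); nonzero iff rank(Psi) = 2
    th1 = (sxx * sy - sx * sxy) / det
    th2 = (n * sxy - sx * sy) / det
    return th1, th2

def sigma2_tilde(x, y, th1, th2):
    """Unbiased variance estimate ||Y - Psi^T theta|^2 / (n - p), p = 2."""
    rss = sum((yi - th1 - th2 * xi) ** 2 for xi, yi in zip(x, y))
    return rss / (len(x) - 2)

# Illustrative simulated data: theta* = (1, 2), sigma = 0.5
random.seed(0)
x = [i / 10 for i in range(100)]
y = [1.0 + 2.0 * xi + random.gauss(0.0, 0.5) for xi in x]
th1, th2 = lse(x, y)
s2 = sigma2_tilde(x, y, th1, th2)
```

Both coefficient estimates land close to the true values, and the residual-based variance estimate is close to σ² = 0.25.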
Theorems 1.2.2 and 1.2.3 can be used to describe the concentration properties of the
estimate ã and to build confidence sets based on ã and σ̃ , especially if the errors εi
are normally distributed.
Theorem 1.2.4. Let Yi = Ψ>i θ∗ + εi for i = 1, . . . , n with εi ∼ N(0, σ²) . Let
rank(Ψ) = p . Then it holds for the estimate ã = h>θ̃ of a∗ = h>θ∗ :

ã − a∗ ∼ N(0, s²) with s² = σ² h>(ΨΨ>)⁻¹ h .

Corollary 1.2.5 (Concentration). If for some α > 0 , zα is the (1 − α/2) -quantile of the
standard normal law (i.e. Φ(zα) = 1 − α/2 ), then

IPθ∗( |ã − a∗| > zα s ) = α.
Exercise 1.2.2. Check Corollary 1.2.5.
The next result describes the confidence set for a∗ . The unknown variance s² is
replaced by its estimate

s̃² def= σ̃² h>(ΨΨ>)⁻¹ h .

Corollary 1.2.6 (Confidence set). If E(zα) def= {a : |a − ã| ≤ s̃ zα} , then

IPθ∗( E(zα) ∌ a∗ ) ≈ α.
1.3 General parametric model
Let Y denote the observed data with values in the observation space Y . In most cases,
Y ∈ IRn , that is, Y = (Y1, . . . , Yn)> . Here n denotes the sample size (number of
observations). The basic assumption about these data is that the vector Y is a random
variable on a probability space (Y,B(Y), IPθ∗) , where B(Y) is the Borel σ -algebra on
Y . The probabilistic approach assumes that the probability measure IPθ∗ is known and
studies the distributional (population) properties of the vector Y . On the contrary,
the statistical approach assumes that the data Y are given and tries to recover the
distribution IP on the basis of the available data Y . One can say that the statistical
problem is inverse to the probabilistic one.
The statistical analysis is usually based on the notion of statistical experiment. This
notion assumes that a family P = {IP} of probability measures IP on (Y,B(Y)) is fixed
and the unknown underlying measure IPθ∗ belongs to this family. Often this family is
parameterized by the value θ from some parameter set Θ : P = (IPθ,θ ∈ Θ) . The
corresponding statistical experiment can be written as
(Y,B(Y), (IPθ,θ ∈ Θ)
).
The value θ∗ denotes the “true” parameter value, that is, IP = IPθ∗ .
The statistical experiment is dominated if there exists a dominating σ -finite measure
µ0 such that all the IPθ are absolutely continuous w.r.t. µ0 . In what follows we assume
without further mention that the considered statistical models are dominated. Usually
the choice of a dominating measure is unimportant and any one can be used.
The parametric approach assumes that Θ is a subset of a finite-dimensional Euclidean
space IRp . In this case, the unknown data distribution is specified by the value of a finite-
dimensional parameter θ from Θ ⊆ IRp . Since in this case the parameter θ completely
identifies the distribution of the observations Y , the statistical estimation problem is
reduced to recovering (estimating) this parameter from the data. The nice feature of the
parametric theory is that the estimation problem can be solved in a rather general way.
1.4 Statistical decision problem. Loss and Risk
The statistical decision problem is usually formulated in terms of game theory, the statis-
tician playing as it were against nature. Let D denote the decision space that is assumed
to be a topological space. Next, let ℘(·, ·) be a loss function given on the product D×Θ .
The value ℘(d,θ) denotes the loss associated with the decision d ∈ D when the true
parameter value is θ ∈ Θ . The statistical decision problem is composed of a statistical
experiment (Y,B(Y),P) , a decision space D and a loss function ℘(·, ·) .
A statistical decision ρ = ρ(Y ) is a measurable function of the observed data Y
with values in the decision space D . Clearly, ρ(Y ) can be considered as a random
D -valued element on the space (Y,B(Y)) . The corresponding loss under the true model
(Y,B(Y), IPθ∗) reads as ℘(ρ(Y ),θ∗) . Finally, the risk is defined as the expected value
of the loss:
R(ρ,θ∗)def= IEθ∗℘(ρ(Y ),θ∗).
Below we present a list of typical statistical decision problems.
Example 1.4.1. [Point estimation problem] Let the target of analysis be the true pa-
rameter θ∗ itself, that is, let D coincide with Θ . Let ℘(·, ·) be a kind of distance on
Θ , that is, ℘(θ, θ∗) denotes the loss of estimation when the selected value is θ while
the true parameter is θ∗ . Typical examples of the loss function are the quadratic loss
℘(θ, θ∗) = ‖θ − θ∗‖² , the l1 -loss ℘(θ, θ∗) = ‖θ − θ∗‖1 , and the sup-loss ℘(θ, θ∗) =
‖θ − θ∗‖∞ = max_{j=1,...,p} |θj − θ∗j| .
If θ̃ is an estimate of θ∗ , that is, θ̃ is a Θ -valued function of the data Y , then the
corresponding risk is

R(θ̃, θ∗) def= IEθ∗ ℘(θ̃, θ∗).

In particular, the quadratic risk reads as IEθ∗ ‖θ̃ − θ∗‖² .
Example 1.4.2. [Testing problem] Let Θ0 and Θ1 be two complementary subsets of Θ ,
that is, Θ0 ∩Θ1 = ∅ , Θ0 ∪Θ1 = Θ . Our target is to check whether the true parameter
θ∗ belongs to the subset Θ0 . The decision space consists of two points {0, 1} for which
d = 0 means the acceptance of the hypothesis H0 : θ∗ ∈ Θ0 while d = 1 rejects H0 in
favor of the alternative H1 : θ∗ ∈ Θ1 . Define the loss
℘(d,θ) = 1(d = 1,θ ∈ Θ0) + 1(d = 0,θ ∈ Θ1).
A test φ is a binary valued function of the data, φ = φ(Y ) ∈ {0, 1} . The corresponding
risk R(φ,θ∗) = IEθ∗φ(Y ) can be interpreted as the probability of selecting the wrong
subset.
Example 1.4.3. [Confidence estimation] Let the target of analysis again be the pa-
rameter θ∗ . However, we aim to identify a subset A of Θ , as small as possible, that
covers with a prescribed probability the true value θ∗ . Our decision space D is now
the set of all measurable subsets in Θ . For any A ∈ D , the loss function is defined as
℘(A, θ∗) = 1(A ∌ θ∗) . A confidence set is a random set E selected from the data Y ,
E = E(Y ) . The corresponding risk R(E,θ∗) = IEθ∗℘(E,θ∗) is just the probability that
E does not cover θ∗ .
Example 1.4.4. [Estimation of a functional] Let the target of estimation be a given
function f(θ∗) of the parameter θ∗ with values in another space F . A typical example is
given by a single component of the vector θ∗ . An estimate ρ of f(θ∗) is a function of the
data Y into F : ρ = ρ(Y ) ∈ F . The loss function ℘ is defined on the product F ×F ,
yielding the loss ℘(ρ(Y ), f(θ∗)) and the risk R(ρ(Y ), f(θ∗)) = IEθ∗℘(ρ(Y ), f(θ∗)) .
Exercise 1.4.1. Define the statistical decision problem for testing a simple hypothesis
θ∗ = θ0 for a given point θ0 .
1.5 Efficiency
After the statistical decision problem is stated, one can ask for its optimal solution.
Equivalently one can say that the aim of statistical analysis is to build a decision with
the minimal possible risk. However, a comparison of any two decisions on the basis of
risk can be a nontrivial problem. Indeed, the risk R(ρ,θ∗) of a decision ρ depends on
the true parameter value θ∗ . It may happen that one decision performs better for some
points θ∗ ∈ Θ but worse at other points θ∗ . An extreme example of such an estimate is
the trivial deterministic decision θ = θ0 which sets the estimate equal to the value θ0
whatever the data is. This is, of course, a very strange and poor estimate, but it clearly
outperforms all other methods if the true parameter θ∗ is indeed θ0 .
Two approaches are typically used to compare different statistical decisions: the
minimax approach considers the maximum R(ρ) of the risks R(ρ,θ) over the parameter
set Θ while the Bayes approach is based on the weighted sum (integral) Rπ(ρ) of such
risks with respect to some measure π on the parameter set Θ which is called the prior
distribution:
R(ρ) = sup_{θ∈Θ} R(ρ, θ), Rπ(ρ) = ∫ R(ρ, θ) π(dθ).
The decision ρ∗ is called minimax if

R(ρ∗) = inf_ρ R(ρ) = inf_ρ sup_{θ∈Θ} R(ρ, θ),
where the infimum is taken over the set of all possible decisions ρ . The value R∗ = R(ρ∗)
is called the minimax risk.
Similarly, the decision ρπ is called Bayes for the prior π if

Rπ(ρπ) = inf_ρ Rπ(ρ).
The corresponding value Rπ(ρπ) is called the Bayes risk.
Exercise 1.5.1. Show that the minimax risk is greater than or equal to the Bayes risk
whatever the prior measure π is.
Hint: show that for any decision ρ , it holds R(ρ) ≥ Rπ(ρ) .
Usually the problem of finding a minimax or Bayes estimate is quite hard and a
closed form solution is available only in very few special cases. A standard way out of
this problem is to switch to an asymptotic set-up in which the sample size grows to
infinity.
Chapter 2

Parameter estimation for an i.i.d. model
In the present chapter we consider the estimation problem for a sample of independent
identically distributed (i.i.d.) observations. Throughout the chapter the data Y are
assumed to be given in the form of a sample (Y1, . . . , Yn) . We assume that the obser-
vations Y1, . . . , Yn are independent identically distributed; each Yi is from an unknown
distribution P also called a marginal measure. The joint data distribution IP is the
n -fold product of P : IP = P⊗n . Thus, the measure IP is uniquely identified by P and
the statistical problem can be reduced to recovering P .
The further step in model specification is based on a parametric assumption (PA):
the measure P belongs to a given parametric family.
2.1 Empirical distribution. Glivenko-Cantelli Theorem
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample. For simplicity we assume that the Yi ’s are
univariate with values in IR . Let P denote the distribution of each Yi :
P (B) = IP (Yi ∈ B), B ∈ B(IR).
One often says that Y is an i.i.d. sample from P . Let also F be the corresponding
distribution function (cdf):
F (y) = IP (Y1 ≤ y) = P ((−∞, y]).
The assumption that the Yi ’s are i.i.d. implies that the joint distribution IP of the data
Y is given by the n -fold product of the marginal measure P :
IP = P⊗n.
Let also Pn (resp. Fn ) be the empirical measure (resp. empirical distribution function
(edf))

Pn(B) = (1/n) ∑ 1(Yi ∈ B), Fn(y) = (1/n) ∑ 1(Yi ≤ y).

Here and everywhere in this chapter the symbol ∑ stands for ∑_{i=1}^n . One can consider
Fn as the distribution function of the empirical measure Pn defined as the atomic
measure at the Yi 's:

Pn(A) def= (1/n) ∑_{i=1}^n 1(Yi ∈ A).
So, Pn(A) is the empirical frequency of the event A , that is, the fraction of observations
Yi belonging to A . By the law of large numbers one can expect that this empirical
frequency is close to the true probability P (A) if the number of observations is sufficiently
large.
An equivalent definition of the empirical measure and empirical distribution function
can be given in terms of the empirical mean IEn g for a measurable function g :

IEn g def= ∫ g(y) Pn(dy) = ∫ g(y) dFn(y) = (1/n) ∑_{i=1}^n g(Yi).
The first result claims that, indeed, for every Borel set B on the real line, the empirical
mass Pn(B) (which is random) is close in probability to the population counterpart
P (B) .
Theorem 2.1.1. For any Borel set B , it holds:

1. IE Pn(B) = P (B) .

2. Var{Pn(B)} = n⁻¹ σB² with σB² = P (B){1 − P (B)} .

3. Pn(B) → P (B) in probability as n → ∞ .

4. √n {Pn(B) − P (B)} w−→ N(0, σB²) .
Proof. Denote ξi = 1(Yi ∈ B) . This is a Bernoulli r.v. with parameter P (B) = IEξi .
The first statement holds by definition of Pn(B) = n⁻¹ ∑_i ξi . Next, for each i ≤ n ,

Var ξi def= IEξi² − (IEξi)² = P (B){1 − P (B)}

in view of ξi² = ξi . Independence of the ξi 's yields

Var{Pn(B)} = Var( n⁻¹ ∑_{i=1}^n ξi ) = n⁻² ∑_{i=1}^n Var ξi = n⁻¹ σB².
The third statement follows by the law of large numbers for the i.i.d. r.v.'s ξi :

(1/n) ∑_{i=1}^n ξi IP−→ IEξ1 .

Finally, the last statement follows by the Central Limit Theorem for the ξi :

(1/√n) ∑_{i=1}^n ( ξi − IEξi ) w−→ N(0, σB²).
The next important result shows that the edf Fn is a good approximation of the cdf
F in the uniform norm.
Theorem 2.1.2 (Glivenko-Cantelli). It holds

sup_y |Fn(y) − F (y)| → 0, n → ∞.

Proof. Consider first the case when the function F is continuous in y . Fix any integer
N and define, with ε = 1/N , the points t1 < t2 < . . . < tN = +∞ such that F (tj) −
F (tj−1) = ε for j = 2, . . . , N . For every j , by (3) of Theorem 2.1.1, it holds Fn(tj) →
F (tj) . This implies that for some n(N) , it holds for all n ≥ n(N)

|Fn(tj) − F (tj)| ≤ ε, j = 1, . . . , N. (2.1)

Now for every t ∈ [tj−1, tj ] , it holds by definition

F (tj−1) ≤ F (t) ≤ F (tj), Fn(tj−1) ≤ Fn(t) ≤ Fn(tj).

This together with (2.1) implies |Fn(t) − F (t)| ≤ 2ε .
If the function F (·) is not continuous, then for every positive ε , there exists a finite set
Sε of points of discontinuity sm with F (sm) − F (sm − 0) ≥ ε . One can proceed as in
the continuous case by adding the points from Sε to the discrete set {tj} .
Exercise 2.1.1. Check the details of the proof of Theorem 2.1.2.
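The uniform convergence in Theorem 2.1.2 is easy to observe numerically. The sketch below (sample sizes and seed are illustrative) draws U[0, 1] samples, for which F(y) = y, and evaluates sup_y |Fn(y) − F(y)|; since Fn is piecewise constant with jumps at the order statistics, the supremum needs to be checked only just before and at those points.

```python
import random

def sup_deviation(n, seed=0):
    """sup_y |F_n(y) - F(y)| for an i.i.d. U[0,1] sample (so F(y) = y).
    F_n jumps by 1/n at each order statistic Y_(i): just below Y_(i) its
    value is i/n, at Y_(i) it is (i+1)/n, so the supremum is a maximum
    over these finitely many candidate values."""
    rng = random.Random(seed)
    ys = sorted(rng.random() for _ in range(n))
    return max(max(abs(i / n - y), abs((i + 1) / n - y))
               for i, y in enumerate(ys))

d_small, d_large = sup_deviation(100), sup_deviation(10000)
```

The deviation shrinks roughly at the rate n^{-1/2}, in line with the Kolmogorov-Smirnov scaling.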
The results of Theorems 2.1.1 and 2.1.2 can be extended to certain functionals of the
distribution P . Let g(y) be a function on the real line. Consider its expectation
s0 def= IEg(Y1) = ∫ g(y) dF (y).

Its empirical counterpart is defined by

Sn def= ∫ g(y) dFn(y) = (1/n) ∑_{i=1}^n g(Yi).
It appears that Sn indeed well estimates s0 , at least for large n .
Theorem 2.1.3. Let g(y) be a function on the real line such that

∫ g²(y) dF (y) < ∞.

Then

Sn IP−→ s0, √n (Sn − s0) w−→ N(0, σg²), n → ∞,

where

σg² def= ∫ g²(y) dF (y) − s0² = ∫ [g(y) − s0]² dF (y).

Moreover, if h(z) is a twice continuously differentiable function on the real line and
h′(s0) ≠ 0 , then

h(Sn) IP−→ h(s0), √n {h(Sn) − h(s0)} w−→ N(0, σh²), n → ∞,

where σh² def= |h′(s0)|² σg² .
Proof. The first statement is again the law of large numbers and the CLT for the i.i.d.
random variables ξi = g(Yi) having mean value s0 and variance σg² . It also implies the
second statement in view of the Taylor expansion h(Sn) − h(s0) ≈ h′(s0)(Sn − s0) .
Exercise 2.1.2. Complete the proof.
Hint: use the first result to show that Sn belongs with high probability to a small
neighborhood U of the point s0 . Then apply the second-order Taylor expansion to
h(Sn) − h(s0) = h(s0 + n^{-1/2} ξn) − h(s0) with ξn = √n (Sn − s0) :

|n^{1/2}[h(Sn) − h(s0)] − h′(s0) ξn| ≤ n^{-1/2} h∗ ξn²/2,

where h∗ = max_U |h′′(y)| . Show that n^{-1/2} ξn² IP−→ 0 because ξn is stochastically
bounded by the first statement of the theorem.
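The second statement of Theorem 2.1.3 can also be checked by a small Monte Carlo experiment. The setup below is illustrative: Yi ∼ U[0, 1] with g(y) = y, so s0 = 1/2 and σg² = 1/12, and h(z) = z², so that the limit variance is σh² = |h′(s0)|² σg² = 1 · 1/12.

```python
import math
import random

def delta_method_check(n=200, reps=2000, seed=1):
    """Sample the statistic sqrt(n)*(h(S_n) - h(s0)) for h(z) = z^2 with
    Y_i ~ U[0,1]; its mean should be near 0 and its variance near
    |h'(s0)|^2 * sigma_g^2 = 1/12."""
    rng = random.Random(seed)
    s0 = 0.5
    zs = []
    for _ in range(reps):
        sn = sum(rng.random() for _ in range(n)) / n
        zs.append(math.sqrt(n) * (sn * sn - s0 * s0))
    mean = sum(zs) / reps
    var = sum((z - mean) ** 2 for z in zs) / reps
    return mean, var

mean_z, var_z = delta_method_check()  # theory: mean near 0, var near 1/12
```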
The results of Theorems 2.1.2 and 2.1.3 can be extended to the case of a vectorial
function g(·) : IR¹ → IRm , that is, g(y) = (g1(y), . . . , gm(y))> for y ∈ IR¹ . Then
s0 = (s0,1, . . . , s0,m)> and its empirical counterpart Sn = (Sn,1, . . . , Sn,m)> are vectors
in IRm as well:

s0,j def= ∫ gj(y) dF (y), Sn,j def= ∫ gj(y) dFn(y), j = 1, . . . , m.
Theorem 2.1.4. Let g(y) be an IRm -valued function on the real line with a bounded
covariance matrix Σ = (Σjk)_{j,k=1,...,m} :

Σjk def= ∫ [gj(y) − s0,j][gk(y) − s0,k] dF (y) < ∞, j, k ≤ m.

Then

Sn IP−→ s0, √n (Sn − s0) w−→ N(0, Σ), n → ∞.

Moreover, if H(z) is a twice continuously differentiable function on IRm and ΣH′(s0) ≠ 0 ,
where H′(z) stands for the gradient of H at z , then

H(Sn) IP−→ H(s0), √n {H(Sn) − H(s0)} w−→ N(0, σH²), n → ∞,

where σH² def= H′(s0)> Σ H′(s0) .
Exercise 2.1.3. Prove Theorem 2.1.4.
Hint: consider for every h ∈ IRm the scalar products h>g(y) , h>s0 , h>Sn . For
the first statement, it suffices to show that

h>Sn IP−→ h>s0, √n h>(Sn − s0) w−→ N(0, h>Σh), n → ∞.

For the second statement, consider the expansion

|n^{1/2}[H(Sn) − H(s0)] − ξn>H′(s0)| ≤ n^{-1/2} H∗ ‖ξn‖²/2 IP−→ 0,

with ξn = n^{1/2}(Sn − s0) and H∗ = max_{y∈U} ‖H′′(y)‖ for a neighborhood U of s0 .
2.2 Substitution principle. Method of moments
By the Glivenko-Cantelli theorem the empirical measure Pn (resp. edf Fn ) is a good
approximation of the true measure P (resp. cdf F ), at least if n is sufficiently large.
This leads to the important substitution method of statistical estimation: represent the
target of estimation as a function of the distribution P , then replace P by Pn .
Suppose that there exists some functional g of a measure Pθ from the family P =
(Pθ, θ ∈ Θ) such that the following identity holds:

θ = g(Pθ), θ ∈ Θ.

This particularly implies θ∗ = g(Pθ∗) = g(P ) . The substitution estimate is defined by
substituting Pn for P :

θ̃ = g(Pn).

Sometimes the obtained value θ̃ can lie outside the parameter set Θ . Then one can
redefine the estimate θ̃ as the value providing the best fit of g(Pn) :

θ̃ = argmin_θ ‖g(Pθ) − g(Pn)‖.
Here ‖ · ‖ denotes some norm on the parameter set Θ , e.g. the Euclidean norm.
2.2.1 Method of moments. Univariate parameter
The method of moments is a special but at the same time the most frequently used
case of the substitution method. For illustration, we start with the univariate case. Let
Θ ⊆ IR , that is, θ is a univariate parameter. Let g(y) be a function on IR such that
the first moment

m(θ) def= IEθ g(Y1) = ∫ g(y) dPθ(y)

is continuous and monotonic. Then the parameter θ can be uniquely identified by the
value m(θ) , that is, there exists an inverse function m⁻¹ satisfying

θ = m⁻¹( ∫ g(y) dPθ(y) ).

The substitution method leads to the estimate

θ̃ = m⁻¹( ∫ g(y) dPn(y) ) = m⁻¹( (1/n) ∑ g(Yi) ).
Usually g(x) = x or g(x) = x2 , which explains the name of the method. This method
was proposed by Pearson and is historically the first regular method of constructing a
statistical estimate.
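The inversion step can be made concrete with a small sketch. The parametrization below is illustrative, not from the text: an exponential law with rate θ, so that with g(y) = y the moment map is m(θ) = IEθ Y1 = 1/θ and hence m⁻¹(s) = 1/s; all numerical values are arbitrary.

```python
import random

def mom_estimate(sample):
    """Method of moments with g(y) = y for an exponential law with
    rate theta: m(theta) = 1/theta, so the estimate applies the
    inverse map m^{-1}(s) = 1/s to the empirical mean."""
    s = sum(sample) / len(sample)  # empirical moment (1/n) sum g(Y_i)
    return 1.0 / s                 # apply the inverse map m^{-1}

random.seed(2)
theta_star = 2.0
sample = [random.expovariate(theta_star) for _ in range(5000)]
theta_tilde = mom_estimate(sample)
```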
2.2.2 Method of moments. Multivariate parameter
The method of moments can be easily extended to the multivariate case. Let Θ ⊆ IRp ,
and let g(y) = (g1(y), . . . , gp(y))> be a function with values in IRp . Define the moments
m(θ) = (m1(θ), . . . , mp(θ))> by

mj(θ) = IEθ gj(Y1) = ∫ gj(y) dPθ(y).

The main requirement on the choice of the vector function g is that the function m is
invertible, that is, the system of equations

mj(θ) = tj , j = 1, . . . , p,

has a unique solution for any t in the range of m . The empirical counterpart Mn of the
true moments m(θ∗) is given by

Mn def= ∫ g(y) dPn(y) = ( (1/n) ∑ g1(Yi), . . . , (1/n) ∑ gp(Yi) )>.

Then the estimate θ̃ can be defined as

θ̃ def= m⁻¹(Mn) = m⁻¹( (1/n) ∑ g1(Yi), . . . , (1/n) ∑ gp(Yi) ).
2.2.3 Method of moments. Examples
This section lists some widely used parametric families and discusses the problem of
constructing the parameter estimates by different methods. In all the examples we assume
that an i.i.d. sample from a distribution P is observed, and this measure P belongs to a
given parametric family (Pθ, θ ∈ Θ) , that is, P = Pθ∗ for some θ∗ ∈ Θ .
Gaussian shift
Let Pθ be the normal distribution on the real line with mean θ and the known variance
σ2 . The corresponding density w.r.t. the Lebesgue measure reads as
p(y, θ) = (2πσ²)^{-1/2} exp{ −(y − θ)²/(2σ²) }.

It holds IEθY1 = θ and Varθ(Y1) = σ² , leading to the moment estimate

θ̃ = ∫ y dPn(y) = (1/n) ∑ Yi

with mean IEθ θ̃ = θ and variance Varθ(θ̃) = σ²/n .
Univariate normal distribution
Let Yi ∼ N(α, σ²) as in the previous example, but now both the mean α and the
variance σ² are unknown. This leads to the problem of estimating the vector θ =
(θ1, θ2) = (α, σ²) from the i.i.d. sample Y .
The method of moments suggests to estimate the parameters from the first two em-
pirical moments of the Yi 's using the equations m1(θ) = IEθY1 = α , m2(θ) = IEθY1² =
α² + σ² . Inverting these equalities leads to

α = m1(θ), σ² = m2(θ) − m1²(θ).

Substituting the empirical measure Pn yields the expressions for θ̃ = (α̃, σ̃²)> :

α̃ = (1/n) ∑ Yi, σ̃² = (1/n) ∑ Yi² − ( (1/n) ∑ Yi )² = (1/n) ∑ (Yi − α̃)². (2.2)

As previously for the case of a known variance, it holds under IP = IPθ :

IEθ α̃ = α, Varθ(α̃) = σ²/n.

However, for the estimate σ̃² of σ² , the result is slightly different and it is described in
the next theorem.
Theorem 2.2.1. It holds

IEθ σ̃² = ((n − 1)/n) σ², Varθ(σ̃²) = (2(n − 1)/n²) σ⁴.
Proof. We use vector notation. Consider the unit vector e = n^{-1/2}(1, . . . , 1)> ∈ IRn
and denote by Π1 the projector on e :

Π1h = (e>h) e.

Then by definition α̃ = n^{-1/2} e>Π1Y and σ̃² = n⁻¹ ‖Y − Π1Y‖² . Moreover, the model
equation Y = n^{1/2} α e + ε implies, in view of Π1e = e , that

Π1Y = n^{1/2} α e + Π1ε.

Now

n σ̃² = ‖Y − Π1Y‖² = ‖ε − Π1ε‖² = ‖(In − Π1)ε‖²,

where In is the identity operator in IRn and In − Π1 is the projector on the hyperplane
in IRn orthogonal to the vector e . Obviously (In − Π1)ε is a Gaussian vector with zero
mean and the covariance matrix V defined by

V = IE[(In − Π1) εε> (In − Π1)] = (In − Π1) IE(εε>) (In − Π1) = σ²(In − Π1)² = σ²(In − Π1).

It remains to note that for any Gaussian vector ξ ∼ N(0, V ) it holds

IE‖ξ‖² = tr V, Var(‖ξ‖²) = 2 tr(V²).
Exercise 2.2.1. Check the details of the proof.
Hint: reduce to the case of diagonal V .
Exercise 2.2.2. Compute the covariance IE(α̃ − α)(σ̃² − σ²) . Show that α̃ and σ̃² are
independent.
Hint: represent α̃ − α = n^{-1/2} e>Π1ε and σ̃² = n⁻¹ ‖(In − Π1)ε‖² . Use that Π1ε and
(In − Π1)ε are independent if Π1 is a projector and ε is a Gaussian vector.
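Theorem 2.2.1 in particular says that σ̃² is slightly biased downward: IEθ σ̃² = (n − 1)σ²/n. A quick Monte Carlo sketch (sample size, σ and seed are illustrative):

```python
import random

def mean_sigma2_tilde(n=5, reps=20000, sigma=1.0, seed=3):
    """Average the moment estimate sigma2~ = (1/n) sum (Y_i - alpha~)^2
    over many normal samples; the average should be close to the
    theoretical expectation (n-1)/n * sigma^2."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        ys = [rng.gauss(0.0, sigma) for _ in range(n)]
        a = sum(ys) / n                          # alpha~, the sample mean
        total += sum((y - a) ** 2 for y in ys) / n
    return total / reps

avg = mean_sigma2_tilde()  # theory: (5 - 1)/5 * 1.0 = 0.8
```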
Uniform distribution on [0, θ]
Let Yi be uniformly distributed on the interval [0, θ] of the real line where the right
end point θ is unknown. The density p(y, θ) of Pθ w.r.t. the Lebesgue measure is
θ⁻¹ 1(0 ≤ y ≤ θ) . It is easy to compute that for an integer k

IEθ(Y1^k) = θ⁻¹ ∫_0^θ y^k dy = θ^k/(k + 1),

or θ = { (k + 1) IEθ(Y1^k) }^{1/k} . This leads to the family of estimates

θ̃k = ( ((k + 1)/n) ∑ Yi^k )^{1/k}.

Letting k tend to infinity leads to the estimate

θ̃∞ = max{Y1, . . . , Yn}.
This estimate is quite natural in the context of the uniform distribution. Later it will
appear once again as the maximum likelihood estimate. However, it is not a moment
estimate.
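The two kinds of estimates can be compared on simulated data (θ, n and the seed below are illustrative): the k = 1 moment estimate θ̃1 = (2/n)∑Yi fluctuates on the n^{-1/2} scale, while the maximum θ̃∞ sits just below θ and is much more accurate here.

```python
import random

def uniform_estimates(theta=3.0, n=1000, seed=4):
    """Moment estimate (k = 1) and maximum estimate of the right end
    point theta of the uniform law on [0, theta]."""
    rng = random.Random(seed)
    ys = [theta * rng.random() for _ in range(n)]
    theta_1 = 2.0 * sum(ys) / n   # k = 1 moment estimate
    theta_inf = max(ys)           # maximum of the sample, always below theta
    return theta_1, theta_inf

t1, t_inf = uniform_estimates()
```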
Bernoulli or binomial model
Let Pθ be a Bernoulli law for θ ∈ [0, 1] . Then every Yi is binary with

IEθYi = θ.

This leads to the moment estimate

θ̃ = ∫ y dPn(y) = (1/n) ∑ Yi .
Exercise 2.2.3. Compute the moment estimate for g(y) = y^k , k ≥ 1 .
Multinomial model
The multinomial distribution Bmθ describes the number of successes in m experiments
when each success has the probability θ ∈ [0, 1] . This distribution can be viewed as the
sum of m independent Bernoulli r.v.'s with the same parameter θ . Observed is the
sample Y where each Yi is the number of successes in the i th experiment. One has

Pθ(Y1 = k) = C(m, k) θ^k (1 − θ)^{m−k}, k = 0, . . . , m,

where C(m, k) is the binomial coefficient.

Exercise 2.2.4. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/(mn)) ∑ Yi .

Compute Varθ(θ̃) .
Hint: reduce the multinomial model to the sum of m Bernoulli experiments.
Exponential model
Let Pθ be an exponential distribution on the positive semiaxis with the parameter θ .
This means

IPθ(Y1 > y) = e^{−y/θ}, y ≥ 0.

Exercise 2.2.5. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Poisson model
Let Pθ be the Poisson distribution with the parameter θ . The Poisson random variable
Y1 is integer-valued with

Pθ(Y1 = k) = (θ^k/k!) e^{−θ}, k = 0, 1, 2, . . . .

Exercise 2.2.6. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Compute Varθ(θ) .
Shift of a Laplace (double exponential) law
Let P0 be a symmetric distribution defined by the equations

P0(|Y1| > y) = e^{−y/σ}, y ≥ 0,

for some given σ > 0 . Equivalently one can say that the absolute value of Y1 is
exponential with parameter σ under P0 . Now define Pθ by shifting P0 by the value
θ . This means that

Pθ(|Y1 − θ| > y) = e^{−y/σ}, y ≥ 0.

It is obvious that IE0Y1 = 0 and IEθY1 = θ .

Exercise 2.2.7. Check that the method of moments with g(x) = x leads to the estimate

θ̃ = (1/n) ∑ Yi .

Compute Varθ(θ̃) .
Shift of a symmetric density
Let the observations Yi be defined by the equation
Yi = θ∗ + εi
where θ∗ is an unknown parameter and the errors εi are i.i.d. with a density symmetric
around zero and finite second moment σ2 = IEε21 . This particularly yields that IEεi = 0
and IEYi = θ∗ . The method of moments immediately yields the empirical mean estimate

θ̃ = (1/n) ∑ Yi

with Varθ(θ̃) = σ²/n .
2.3 Unbiased estimates, bias, and quadratic risk
Consider a parametric i.i.d. experiment corresponding to a sample Y = (Y1, . . . , Yn)>
from a distribution Pθ∗ ∈ (Pθ, θ ∈ Θ ⊆ IRp) . By θ∗ we denote the true parameter from
Θ . Let θ̃ be an estimate of θ∗ , that is, a function of the available data Y with values
in Θ : θ̃ = θ̃(Y ) .
An estimate θ̃ of the parameter θ∗ is called unbiased if

IEθ∗ θ̃ = θ∗.

This property seems to be rather natural and desirable. However, it is often just a matter
of parametrization. Indeed, if g : Θ → Θ is a linear transformation of the parameter
set Θ , that is, g(θ) = Aθ + b , then the estimate ϑ̃ def= Aθ̃ + b of the new parameter
ϑ = Aθ + b is again unbiased. However, if m(·) is a nonlinear transformation, then the
identity IEθ∗ m(θ̃) = m(θ∗) is generally not preserved.
Example 2.3.1. Consider the Gaussian shift experiment for Yi i.i.d. N(θ∗, σ²) with
known variance σ² but unknown shift parameter θ∗ . Then θ̃ = n⁻¹(Y1 + . . . + Yn)
is an unbiased estimate of θ∗ . However, for m(θ) = θ² , it holds

IEθ∗ |θ̃|² = |θ∗|² + σ²/n,

that is, the estimate |θ̃|² of |θ∗|² is slightly biased.
The property of “no bias” is especially important in connection with the quadratic
risk of the estimate θ̃ . To illustrate this point, we first consider the case of a univariate
parameter.
2.3.1 Univariate parameter
Let θ ∈ Θ ⊆ IR¹ . Denote by Varθ∗(θ̃) the variance of the estimate θ̃ :

Varθ∗(θ̃) = IEθ∗( θ̃ − IEθ∗ θ̃ )².

The quadratic risk of θ̃ is defined by

R(θ̃, θ∗) def= IEθ∗ |θ̃ − θ∗|².

It is obvious that R(θ̃, θ∗) = Varθ∗(θ̃) if θ̃ is unbiased. It turns out that the quadratic
risk of θ̃ is larger than the variance when this property is not fulfilled. Define the bias
of θ̃ as

b(θ̃, θ∗) def= IEθ∗ θ̃ − θ∗.
Theorem 2.3.1. It holds for any estimate θ̃ of the univariate parameter θ∗ :

R(θ̃, θ∗) = Varθ∗(θ̃) + b²(θ̃, θ∗).
Due to this result, the bias b(θ̃, θ∗) contributes the value b²(θ̃, θ∗) to the quadratic
risk. This particularly explains why one is interested in considering unbiased or at least
nearly unbiased estimates.
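The decomposition of Theorem 2.3.1 can be observed directly in simulation. The sketch below reuses Example 2.3.1 with illustrative parameters: the biased estimate θ̃² of θ∗² is sampled many times, and the empirical risk splits into empirical variance plus squared empirical bias (the decomposition holds exactly for the empirical quantities as well).

```python
import random

def risk_decomposition(theta=1.0, sigma=1.0, n=10, reps=40000, seed=5):
    """Empirical risk, variance and bias of the estimate theta~^2 of
    theta^2, where theta~ is the sample mean of N(theta, sigma^2) data."""
    rng = random.Random(seed)
    ests = []
    for _ in range(reps):
        tbar = sum(rng.gauss(theta, sigma) for _ in range(n)) / n
        ests.append(tbar * tbar)
    mean = sum(ests) / reps
    var = sum((e - mean) ** 2 for e in ests) / reps
    risk = sum((e - theta ** 2) ** 2 for e in ests) / reps
    bias = mean - theta ** 2   # theory: sigma^2 / n
    return risk, var, bias

risk, var, bias = risk_decomposition()
```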
2.3.2 Multivariate case
Now we extend the result to the multivariate case with θ ∈ Θ ⊆ IRp . Then θ̃ is a vector
in IRp . The corresponding variance-covariance matrix Varθ∗(θ̃) is defined as

Varθ∗(θ̃) def= IEθ∗[ ( θ̃ − IEθ∗ θ̃ )( θ̃ − IEθ∗ θ̃ )> ].

As previously, θ̃ is unbiased if IEθ∗ θ̃ = θ∗ , and the bias of θ̃ is b(θ̃, θ∗) def= IEθ∗ θ̃ − θ∗ .
The quadratic risk of the estimate θ̃ in the multivariate case is usually defined via
the Euclidean norm of the difference θ̃ − θ∗ :

R(θ̃, θ∗) def= IEθ∗ ‖θ̃ − θ∗‖².

Theorem 2.3.2. It holds

R(θ̃, θ∗) = tr[ Varθ∗(θ̃) ] + ‖b(θ̃, θ∗)‖².

Proof. The result follows similarly to the univariate case using the identity ‖v‖² =
tr(vv>) for any vector v ∈ IRp .
Exercise 2.3.1. Complete the proof of Theorem 2.3.2.
2.4 Asymptotic properties
The properties of the previously introduced estimate θ heavily depend on the sample size
n . We therefore, use the notation θn to highlight this dependence. A natural extension
of the condition that θ is unbiased is the requirement that the bias b(θ,θ∗) becomes
negligible as the sample size n increases. This leads to the notion of consistency.
Definition 2.4.1. A sequence of estimates θn is consistent if
θnIP−→ θ∗ n→∞.
θn is mean consistent if
IEθ∗‖θn − θ∗‖ → 0, n→∞.
38
Clearly mean consistency implies consistency and also asymptotic unbiasedness:
b(θn,θ∗) = IEθn − θ∗
IP−→ 0, n→∞.
The property of consistency means that the difference θ̃n − θ∗ is small for n large. The
next natural question to address is how fast this difference tends to zero with n . The
results of Section 2.1 suggest that √n( θ̃n − θ∗ ) is asymptotically normal.

Definition 2.4.2. A sequence of estimates θ̃n is root-n normal if

√n( θ̃n − θ∗ ) w−→ N(0, V )

for some fixed matrix V .
We aim to show that the moment estimates are consistent and asymptotically root-n
normal under very general conditions. We start again with the univariate case.
2.4.1 Root-n normality. Univariate parameter
Our first result describes the simplest situation when the parameter of interest θ∗ can
be represented as an integral ∫ g(y) dPθ∗(y) for some function g(·) .

Theorem 2.4.3. Suppose that Θ ⊆ IR and a function g(·) : IR → IR satisfies for every
θ ∈ Θ

∫ g(y) p(y, θ) dµ0(y) = θ,
∫ [g(y) − θ]² p(y, θ) dµ0(y) = σ²(θ) < ∞.

Then the moment estimates θ̃n = n⁻¹ ∑ g(Yi) satisfy the following conditions:

1. Each θ̃n is unbiased, that is, IEθ∗ θ̃n = θ∗ .

2. The normalized quadratic risk fulfills

n IEθ∗( θ̃n − θ∗ )² = σ²(θ∗).

3. θ̃n is asymptotically root-n normal:

√n( θ̃n − θ∗ ) w−→ N(0, σ²(θ∗)).
This result has already been proved, see Theorem 2.1.3. Next we extend this result
to the more general situation when θ∗ is defined only implicitly via the moment of g :
there exists a function m(·) such that m(θ∗) = ∫ g(y) dPθ∗(y) .
Theorem 2.4.4. Suppose that Θ ⊆ IR and the functions g(y) : IR → IR and m(θ) :
Θ → IR satisfy

∫ g(y) p(y, θ∗) dµ0(y) = m(θ∗),
∫ {g(y) − m(θ∗)}² p(y, θ∗) dµ0(y) = σg²(θ∗) < ∞.

We also assume that m(·) is monotonic and twice continuously differentiable with
m′(θ∗) ≠ 0 . Then the moment estimates θ̃n = m⁻¹( n⁻¹ ∑ g(Yi) ) satisfy the following
conditions:

1. θ̃n is consistent, that is, θ̃n IP−→ θ∗ .

2. θ̃n is asymptotically root-n normal:

√n( θ̃n − θ∗ ) w−→ N(0, σ²(θ∗)), (2.3)

where σ²(θ∗) = |m′(θ∗)|⁻² σg²(θ∗) .
This result also follows directly from Theorem 2.1.3 with h(s) = m−1(s) .
The property of asymptotic normality allows us to study the asymptotic concentration
of θn and to build asymptotic confidence sets.
Corollary 2.4.5. Let θ̃n be asymptotically root-n normal; see (2.3). Then for any
z > 0

lim_{n→∞} IPθ∗( √n |θ̃n − θ∗| > z σ(θ∗) ) = 2Φ(−z),

where Φ(z) is the cdf of the standard normal law.

In particular, this result implies that the estimate θ̃n lies outside the root-n
neighborhood

A(z) def= [θ∗ − n^{-1/2} σ(θ∗) z, θ∗ + n^{-1/2} σ(θ∗) z]

with probability about 2Φ(−z) , which is small provided that z is sufficiently large.
Next we briefly discuss the problem of interval (or confidence) estimation of the
parameter θ∗ . This problem differs from the problem of point estimation: the target is
to build an interval (a set) Eα on the basis of the observations Y such that
IP( Eα ∋ θ∗ ) ≈ 1 − α for a given α ∈ (0, 1) . This problem can be attacked similarly to the problem
of concentration by considering the interval of width 2σ(θ∗)z centered at the estimate
θ̃ . However, the major difficulty arises from the fact that this construction involves the
true parameter value θ∗ via the variance σ²(θ∗) . In some situations this variance does
not depend on θ∗ : σ²(θ∗) ≡ σ² with a known value σ² . In this case the construction is
immediate.
Corollary 2.4.6. Let θ̃_n be asymptotically root-n normal: see (2.3). Then for any α ∈ (0,1), the set

E°(z_α) := [θ̃_n − n^{−1/2} σ(θ∗) z_α, θ̃_n + n^{−1/2} σ(θ∗) z_α],

where z_α is defined by 2Φ(−z_α) = α, satisfies

lim_{n→∞} IP_{θ∗}( E°(z_α) ∋ θ∗ ) = 1 − α.   (2.4)
Exercise 2.4.1. Check Corollaries 2.4.5 and 2.4.6.
Next we consider the case when the variance σ²(θ∗) is unknown. Instead we assume that a consistent variance estimate σ̃² is available. Then we plug this estimate into the construction of the confidence set in place of the unknown true variance σ²(θ∗), leading to the following confidence set:

E(z_α) := [θ̃_n − n^{−1/2} σ̃ z_α, θ̃_n + n^{−1/2} σ̃ z_α].   (2.5)
Theorem 2.4.7. Let θ̃_n be asymptotically root-n normal: see (2.3). Let σ(θ∗) > 0 and let σ̃² be a consistent estimate of σ²(θ∗) in the sense that σ̃² →IP σ²(θ∗). Then for any α ∈ (0,1), the set E(z_α) is asymptotically α-confident in the sense of (2.4).

One natural estimate of the variance σ²(θ∗) can be obtained by plugging in the estimate θ̃_n in place of θ∗, leading to σ̃ = σ(θ̃_n). If σ(θ) is a continuous function of θ in a neighborhood of θ∗, then consistency of θ̃_n implies consistency of σ̃.

Corollary 2.4.8. Let θ̃_n be asymptotically root-n normal and let the variance σ²(θ) be a continuous function of θ at θ∗. Then σ̃ := σ(θ̃_n) is a consistent estimate of σ(θ∗), and the set E(z_α) from (2.5) is asymptotically α-confident.
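A simulation can illustrate the plug-in construction of Theorem 2.4.7. The sketch below is an assumed example (not from the text): it uses the Poisson family, for which θ̃_n = mean(Y) and σ²(θ) = θ, so the plug-in estimate is σ̃ = √θ̃_n; sample size, number of repetitions, and α are arbitrary.

```python
import numpy as np

# Assumed Poisson example: coverage of the plug-in interval (2.5).
rng = np.random.default_rng(1)
theta_star, n, reps = 3.0, 200, 2000
z_alpha = 1.96                        # 2*Phi(-1.96) is approximately 0.05
covered = 0
for _ in range(reps):
    Y = rng.poisson(theta_star, size=n)
    theta_hat = Y.mean()              # moment estimate of theta
    half = z_alpha * np.sqrt(theta_hat) / np.sqrt(n)   # plug-in half-width
    covered += (theta_hat - half <= theta_star <= theta_hat + half)
coverage = covered / reps
print(coverage)                       # should be close to 1 - alpha = 0.95
```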
2.4.2 Root-n normality. Multivariate parameter
Let now Θ ⊆ IR^p and let θ∗ be the true parameter vector. The method of moments requires at least p different moment functions for identifying p parameters. Let g(y): IR → IR^p be a vector of moment functions, g(y) = (g₁(y), …, g_p(y))^T. Suppose first that the true parameter can be obtained just by integration: θ∗ = ∫ g(y) dP_{θ∗}(y). This yields the moment estimate θ̃_n = n⁻¹ Σ g(Y_i).

Theorem 2.4.9. Suppose that a vector-function g(y): IR → IR^p satisfies the following conditions:

∫ g(y) p(y, θ∗) dμ₀(y) = θ∗,
∫ {g(y) − θ∗} {g(y) − θ∗}^T p(y, θ∗) dμ₀(y) = Σ(θ∗).
Then the moment estimate θ̃_n = n⁻¹ Σ g(Y_i) satisfies:

1. θ̃_n is unbiased, that is, IE_{θ∗} θ̃_n = θ∗.

2. θ̃_n is asymptotically root-n normal:

√n (θ̃_n − θ∗) →w N(0, Σ(θ∗)).   (2.6)

3. The normalized quadratic risk fulfills

n IE_{θ∗} ‖θ̃_n − θ∗‖² = tr Σ(θ∗).
Similarly to the univariate case, this result yields corollaries about concentration and confidence sets, with intervals replaced by ellipsoids. Indeed, due to the second statement, the vector

ξ_n := √n {Σ(θ∗)}^{−1/2} (θ̃_n − θ∗)

is asymptotically standard normal: ξ_n →w ξ ~ N(0, I_p). This also implies that the squared norm of ξ_n is asymptotically χ²_p-distributed, where χ²_p is the law of ‖ξ‖² = ξ₁² + … + ξ_p². Define the value z_α via the quantiles of χ²_p by the relation

IP(‖ξ‖ > z_α) = α.   (2.7)
Corollary 2.4.10. Suppose that θ̃_n is root-n normal, see (2.6). Define for a given z the ellipsoid

A(z) := { θ : (θ − θ∗)^T {Σ(θ∗)}⁻¹ (θ − θ∗) ≤ z²/n }.

Then A(z_α) is an asymptotic (1 − α)-concentration set for θ̃_n in the sense that

lim_{n→∞} IP( θ̃_n ∉ A(z_α) ) = α.
The weak convergence ξ_n →w ξ suggests to build confidence sets also in the form of ellipsoids with the axes defined by the covariance matrix Σ(θ∗). Define for α > 0

E°(z_α) := { θ : √n ‖{Σ(θ∗)}^{−1/2} (θ − θ̃_n)‖ ≤ z_α }.

The result of Theorem 2.4.9 implies that this set covers the true value θ∗ with probability approaching 1 − α.
Unfortunately, in typical situations the matrix Σ(θ∗) is unknown because it depends on the unknown parameter θ∗. It is natural to replace it with the matrix Σ(θ̃_n), replacing the true value θ∗ with its consistent estimate θ̃_n. If Σ(θ) is a continuous function of θ, then Σ(θ̃_n) provides a consistent estimate of Σ(θ∗). This leads to the data-driven confidence set:

E(z_α) := { θ : √n ‖{Σ(θ̃_n)}^{−1/2} (θ − θ̃_n)‖ ≤ z_α }.

Corollary 2.4.11. Suppose that θ̃_n is root-n normal, see (2.6), with a non-degenerate matrix Σ(θ∗). Let the matrix function Σ(θ) be continuous at θ∗, and let z_α be defined by (2.7). Then E°(z_α) and E(z_α) are asymptotically (1 − α)-confidence sets for θ∗:

lim_{n→∞} IP( E°(z_α) ∋ θ∗ ) = lim_{n→∞} IP( E(z_α) ∋ θ∗ ) = 1 − α.
Exercise 2.4.2. Check Corollaries 2.4.10 and 2.4.11 about the set E◦(zα) .
Exercise 2.4.3. Check Corollary 2.4.11 about the set E(zα) .
Hint: θ̃_n is consistent and Σ(θ) is continuous and invertible at θ∗. This implies

Σ(θ̃_n) − Σ(θ∗) →IP 0,   {Σ(θ̃_n)}⁻¹ − {Σ(θ∗)}⁻¹ →IP 0,

and hence the sets E°(z_α) and E(z_α) are nearly the same.
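The ellipsoid construction can be illustrated numerically. The sketch below is an assumed example (not from the text): bivariate Gaussian data with known per-observation covariance Σ, p = 2, and the hard-coded value z_α² = 5.991, the 0.95 quantile of χ²₂; all sizes and the seed are arbitrary.

```python
import numpy as np

# Assumed example: coverage of the ellipsoid E°(z_alpha) for a 2-d mean.
rng = np.random.default_rng(2)
theta_star = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])   # covariance of one observation
Sigma_inv = np.linalg.inv(Sigma)
n, reps = 500, 2000
z2 = 5.991                                   # 0.95 quantile of chi^2 with p=2
covered = 0
for _ in range(reps):
    Y = rng.multivariate_normal(theta_star, Sigma, size=n)
    d = Y.mean(axis=0) - theta_star
    covered += (n * d @ Sigma_inv @ d <= z2) # inside the ellipsoid?
coverage = covered / reps
print(coverage)                              # should be close to 0.95
```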
Finally we discuss the general situation when the target parameter is a function of the moments. This means the relations

m(θ) = ∫ g(y) dP_θ(y),   θ = m⁻¹( m(θ) ).

Of course, these relations assume that the vector function m(·) is invertible. The substitution principle leads to the estimate

θ̃ := m⁻¹(M_n),

where M_n is the vector of empirical moments:

M_n := ∫ g(y) dP_n(y) = n⁻¹ Σ g(Y_i).

The central limit theorem implies (see Theorem 2.1.4) that M_n is a consistent estimate of m(θ∗) and that the vector √n [M_n − m(θ∗)] is asymptotically normal with some covariance matrix Σ_g(θ∗). Moreover, if m⁻¹ is differentiable at the point m(θ∗), then √n (θ̃ − θ∗) is asymptotically normal as well:

√n (θ̃ − θ∗) →w N(0, Σ(θ∗)),

where Σ(θ∗) = H^T Σ_g(θ∗) H and H is the p×p Jacobi matrix of m⁻¹ at m(θ∗):

H := (d/ds) m⁻¹(s) |_{s = m(θ∗)}.
2.5 Some geometric properties of a parametric family
The parametric situation means that the true marginal distribution P belongs to some
given parametric family (Pθ,θ ∈ Θ ⊆ IRp) . By θ∗ we denote the true value, that is,
P = Pθ∗ ∈ (Pθ) . The natural target of estimation in this situation is the parameter
θ∗ itself. Below we assume that the family (P_θ) is dominated, that is, there exists a dominating measure μ₀. The corresponding density is denoted by

p(y, θ) = (dP_θ/dμ₀)(y).

We also use the notation

ℓ(y, θ) := log p(y, θ)

for the log-density.
The following two important characteristics of the parametric family (Pθ) will be
frequently used in the sequel: the Kullback-Leibler divergence and Fisher information.
2.5.1 Kullback-Leibler divergence
Definition 2.5.1. For any two parameters θ, θ′, the value

K(P_θ, P_{θ′}) = ∫ log( p(y, θ) / p(y, θ′) ) p(y, θ) dμ₀(y) = ∫ [ℓ(y, θ) − ℓ(y, θ′)] p(y, θ) dμ₀(y)

is called the Kullback-Leibler divergence (KL-divergence) between P_θ and P_{θ′}.
We also write K(θ, θ′) instead of K(P_θ, P_{θ′}) if there is no risk of confusion. Equivalently one can represent the KL-divergence as

K(θ, θ′) = E_θ log( p(Y, θ) / p(Y, θ′) ) = E_θ [ℓ(Y, θ) − ℓ(Y, θ′)],
where Y ∼ Pθ . An important feature of the Kullback-Leibler divergence is that it is
always non-negative and it is equal to zero iff the measures Pθ and Pθ′ coincide.
Lemma 2.5.2. For any θ,θ′ , it holds
K(θ,θ′) ≥ 0.
Moreover, K(θ,θ′) = 0 implies that the densities p(y,θ) and p(y,θ′) coincide µ0 -a.s.
Proof. Define Z(y) = p(y, θ′)/p(y, θ). Then

∫ Z(y) p(y, θ) dμ₀(y) = ∫ p(y, θ′) dμ₀(y) = 1

because p(y, θ′) is the density of P_{θ′} w.r.t. μ₀. Next, (d²/dt²) log t = −t⁻² < 0, thus the log-function is strictly concave. The Jensen inequality implies

K(θ, θ′) = −∫ log(Z(y)) p(y, θ) dμ₀(y) ≥ −log( ∫ Z(y) p(y, θ) dμ₀(y) ) = −log(1) = 0.

Moreover, the strict concavity of the log-function implies that equality in this relation is only possible if Z(y) ≡ 1 P_θ-a.s. This implies the last statement of the lemma.
The two mentioned features of the Kullback-Leibler divergence suggest considering it as a kind of distance on the parameter space. In some sense, it measures how far P_{θ′} is from P_θ. Unfortunately, it is not a metric because it is not symmetric:

K(θ, θ′) ≠ K(θ′, θ),

with very few exceptions in some special situations.
Exercise 2.5.1. Compute KL-divergence for the Gaussian shift, Bernoulli, Poisson,
volatility and exponential families. Check in which cases it is symmetric.
Exercise 2.5.2. Consider the shift experiment given by the equation Y = θ + ε where
ε is an error with the given density function p(·) on IR . Compute the KL-divergence
and check for symmetry.
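A numeric sanity check in the spirit of Exercise 2.5.1 can be run for the Gaussian shift. The sketch below assumes the family N(θ, 1), for which the closed form K(θ, θ′) = (θ − θ′)²/2 holds; it compares that value with a Monte Carlo average of ℓ(Y, θ) − ℓ(Y, θ′). Parameter values, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed example: KL-divergence for the Gaussian shift N(theta, 1).
rng = np.random.default_rng(3)
theta, theta_p = 0.7, 1.5
Y = rng.normal(theta, 1.0, size=1_000_000)
# l(y, theta) = -(y - theta)^2/2 + const, so the log-ratio is:
log_ratio = 0.5 * (Y - theta_p) ** 2 - 0.5 * (Y - theta) ** 2
print(log_ratio.mean(), (theta - theta_p) ** 2 / 2)   # both close to 0.32
```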
One more important feature of the KL-divergence is its additivity.
Lemma 2.5.3. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ, θ′ ∈ Θ

K(P_θ, P_{θ′}) = K(P^{(1)}_θ, P^{(1)}_{θ′}) + K(P^{(2)}_θ, P^{(2)}_{θ′}).

Exercise 2.5.3. Prove Lemma 2.5.3. Extend the result to the case of the m-fold product of measures.

Hint: use that the log-density ℓ(y₁, y₂, θ) of the product measure P_θ fulfills

ℓ(y₁, y₂, θ) = ℓ^{(1)}(y₁, θ) + ℓ^{(2)}(y₂, θ).
The additivity of the KL-divergence helps to easily compute the KL quantity for two measures IP_θ and IP_{θ′} describing the i.i.d. sample Y = (Y₁, …, Y_n)^T. The log-density of the measure IP_θ w.r.t. μ₀ = μ₀^{⊗n} at the point y = (y₁, …, y_n)^T is given by

L(y, θ) = Σ ℓ(y_i, θ).

An extension of the result of Lemma 2.5.3 yields

K(IP_θ, IP_{θ′}) := IE_θ { L(Y, θ) − L(Y, θ′) } = n K(θ, θ′).
2.5.2 Hellinger distance
Another useful characteristic of a parametric family (P_θ) is the so-called Hellinger distance. For a fixed μ ∈ [0, 1] and any θ, θ′ ∈ Θ, define

h(μ, P_θ, P_{θ′}) = E_θ ( dP_{θ′}/dP_θ (Y) )^μ = ∫ ( p(y, θ′)/p(y, θ) )^μ dP_θ(y) = ∫ p^μ(y, θ′) p^{1−μ}(y, θ) dμ₀(y).

Note that this function can be represented as an exponential moment of the log-likelihood ratio ℓ(Y, θ′, θ) := ℓ(Y, θ′) − ℓ(Y, θ):

h(μ, P_θ, P_{θ′}) = E_θ exp{ μ ℓ(Y, θ′, θ) } = E_θ ( dP_{θ′}/dP_θ (Y) )^μ.

It is obvious that h(μ, P_θ, P_{θ′}) ≥ 0. Moreover, h(μ, P_θ, P_{θ′}) ≤ 1. Indeed, the function x^μ for μ ∈ [0, 1] is concave, and by the Jensen inequality

E_θ ( dP_{θ′}/dP_θ (Y) )^μ ≤ ( IE_θ dP_{θ′}/dP_θ (Y) )^μ = 1.

Similarly to the Kullback-Leibler divergence, we often write h(μ, θ, θ′) in place of h(μ, P_θ, P_{θ′}). Typically the Hellinger distance is considered for μ = 1/2. Then

h(1/2, θ, θ′) = ∫ p^{1/2}(y, θ′) p^{1/2}(y, θ) dμ₀(y).
In contrast to the Kullback-Leibler divergence, this quantity is symmetric and can be
used to define a metric on the parameter set Θ .
Introduce

m(μ, θ, θ′) := −log h(μ, θ, θ′) = −log E_θ exp{ μ ℓ(Y, θ′, θ) }.

The property h(μ, θ, θ′) ≤ 1 implies m(μ, θ, θ′) ≥ 0. This rate function will play an important role in the concentration properties of the maximum likelihood estimate; see Section ??.
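The quantities h(1/2, ·, ·) and m(1/2, ·, ·) are easy to evaluate in closed form for discrete families. The sketch below assumes the Bernoulli family, for which h(1/2, p, q) = √(pq) + √((1−p)(1−q)); the parameter values are arbitrary.

```python
import math

# Assumed Bernoulli example: h(1/2, p, q) and the rate function
# m(1/2, p, q) = -log h(1/2, p, q), which is non-negative and symmetric.
def hellinger_affinity(p, q):
    # h(1/2, p, q) = sum over y in {0, 1} of sqrt(p(y) * q(y))
    return math.sqrt(p * q) + math.sqrt((1 - p) * (1 - q))

p, q = 0.3, 0.6
m_half = -math.log(hellinger_affinity(p, q))
print(m_half)                                   # positive since p != q
print(hellinger_affinity(p, q) == hellinger_affinity(q, p))
```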
The rate function, like the KL-divergence, is additive.
Lemma 2.5.4. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ, θ′ ∈ Θ and any μ ∈ [0, 1]

m(μ, P_θ, P_{θ′}) = m(μ, P^{(1)}_θ, P^{(1)}_{θ′}) + m(μ, P^{(2)}_θ, P^{(2)}_{θ′}).

Exercise 2.5.4. Prove Lemma 2.5.4. Extend the result to the case of an m-fold product of measures.

Hint: use that the log-density ℓ(y₁, y₂, θ) of the product measure P_θ fulfills

ℓ(y₁, y₂, θ) = ℓ^{(1)}(y₁, θ) + ℓ^{(2)}(y₂, θ).

Application of this lemma to the i.i.d. product family yields

M(μ, θ′, θ) := −log IE_θ exp{ μ L(Y, θ′, θ) } = n m(μ, θ′, θ).
2.5.3 Regularity and the Fisher Information. Univariate parameter
An important assumption on the considered parametric family (P_θ) is that the corresponding density function p(y, θ) is absolutely continuous w.r.t. the parameter θ for almost all y. Then the log-density ℓ(y, θ) is differentiable as well with

∇ℓ(y, θ) := ∂ℓ(y, θ)/∂θ = (1/p(y, θ)) ∂p(y, θ)/∂θ,

with the convention 0/0 = 0. In the case of a univariate parameter θ ∈ IR, we also write ℓ′(y, θ) instead of ∇ℓ(y, θ).
Moreover, we usually assume some regularity conditions on the density p(y,θ) . The
next definition presents one possible set of such conditions for the case of a univariate
parameter θ .
Definition 2.5.5. The family (P_θ, θ ∈ Θ ⊂ IR) is regular if the following conditions are fulfilled:

1. The sets A(θ) := {y : p(y, θ) = 0} are the same for all θ ∈ Θ.

2. Differentiability under the integration sign: for any function s(y) satisfying

∫ s²(y) p(y, θ) dμ₀(y) ≤ C,   θ ∈ Θ,

it holds

(∂/∂θ) ∫ s(y) dP_θ(y) = (∂/∂θ) ∫ s(y) p(y, θ) dμ₀(y) = ∫ s(y) (∂p(y, θ)/∂θ) dμ₀(y).

3. Finite Fisher information: the log-density function ℓ(y, θ) is differentiable in θ and its derivative is square integrable w.r.t. P_θ:

∫ |ℓ′(y, θ)|² dP_θ(y) = ∫ ( |p′(y, θ)|² / p(y, θ) ) dμ₀(y) < ∞.   (2.8)
The quantity in the condition (2.8) plays an important role in asymptotic statistics.

Definition 2.5.6. Let (P_θ, θ ∈ Θ ⊂ IR) be a regular parametric family with a univariate parameter. Then the quantity

I(θ) := ∫ |ℓ′(y, θ)|² p(y, θ) dμ₀(y) = ∫ ( |p′(y, θ)|² / p(y, θ) ) dμ₀(y)

is called the Fisher information of (P_θ) at θ ∈ Θ.

The definition of I(θ) can be rewritten as

I(θ) = IE_θ |ℓ′(Y, θ)|²

with Y ~ P_θ.
A simple sufficient condition for regularity of a family (Pθ) is given by the next
lemma.
Lemma 2.5.7. Let the log-density `(y, θ) = log p(y, θ) of a dominated family (Pθ) be
differentiable in θ and let the Fisher information I(θ) be a continuous function on Θ .
Then (Pθ) is regular.
The proof is technical and can be found e.g. in Borovkov (1998). Some useful prop-
erties of the regular families are listed in the next lemma.
Lemma 2.5.8. Let (P_θ) be a regular family. Then for any θ ∈ Θ and Y ~ P_θ:

1. E_θ ℓ′(Y, θ) = ∫ ℓ′(y, θ) p(y, θ) dμ₀(y) = 0 and I(θ) = Var_θ[ ℓ′(Y, θ) ].

2. I(θ) = −E_θ ℓ″(Y, θ) = −∫ ℓ″(y, θ) p(y, θ) dμ₀(y).

Proof. Differentiating the identity

∫ p(y, θ) dμ₀(y) = ∫ exp{ℓ(y, θ)} dμ₀(y) ≡ 1

implies under the regularity conditions the first statement of the lemma. Differentiating once more yields the second statement with another representation of the Fisher information.
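Both representations of I(θ) from Lemma 2.5.8 can be checked by simulation. The sketch below assumes the Poisson family, for which ℓ′(y, θ) = y/θ − 1, ℓ″(y, θ) = −y/θ², and I(θ) = 1/θ; parameter value, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed Poisson example: I(theta) = E[l'(Y)^2] = -E[l''(Y)] = 1/theta.
rng = np.random.default_rng(4)
theta = 2.5
Y = rng.poisson(theta, size=1_000_000)
score = Y / theta - 1.0                 # l'(Y, theta)
print(np.mean(score))                   # ~ 0   (E l' = 0)
print(np.mean(score ** 2))              # ~ 0.4 (= 1/theta, first form)
print(np.mean(Y) / theta ** 2)          # ~ 0.4 (= -E l'', second form)
```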
Like the KL-divergence, the Fisher information possesses the important additivity
property.
Lemma 2.5.9. Let (P^{(1)}_θ, θ ∈ Θ) and (P^{(2)}_θ, θ ∈ Θ) be two parametric families with the same parameter set Θ, and let (P_θ = P^{(1)}_θ × P^{(2)}_θ, θ ∈ Θ) be the product family. Then for any θ ∈ Θ, the Fisher information I(θ) satisfies

I(θ) = I^{(1)}(θ) + I^{(2)}(θ),

where I^{(1)}(θ) (resp. I^{(2)}(θ)) is the Fisher information for (P^{(1)}_θ) (resp. for (P^{(2)}_θ)).

Exercise 2.5.5. Prove Lemma 2.5.9.

Hint: use that the log-density of the product experiment can be represented as ℓ(y₁, y₂, θ) = ℓ₁(y₁, θ) + ℓ₂(y₂, θ). The independence of Y₁ and Y₂ implies

I(θ) = Var_θ[ ℓ′(Y₁, Y₂, θ) ] = Var_θ[ ℓ′₁(Y₁, θ) + ℓ′₂(Y₂, θ) ] = Var_θ[ ℓ′₁(Y₁, θ) ] + Var_θ[ ℓ′₂(Y₂, θ) ].
Exercise 2.5.6. Compute the Fisher information for the Gaussian shift, Bernoulli, Pois-
son, volatility and exponential families. Check in which cases it is constant.
Exercise 2.5.7. Consider the shift experiment given by the equation Y = θ+ε where ε
is an error with the given density function p(·) on IR . Compute the Fisher information
and check whether it is constant.
Exercise 2.5.8. Check that the i.i.d. experiment from the uniform distribution on the
interval [0, θ] with unknown θ is not regular.
Now we consider the properties of the i.i.d. experiment from a given regular family (P_θ). The distribution of the whole i.i.d. sample Y is described by the product measure IP_θ = P_θ^{⊗n}, which is dominated by the measure μ₀ = μ₀^{⊗n}. The corresponding log-density L(y, θ) is given by

L(y, θ) := log (dIP_θ/dμ₀)(y) = Σ ℓ(y_i, θ).

The function exp L(y, θ) is the density of IP_θ w.r.t. μ₀, and hence, for any r.v. ξ,

IE_θ ξ = IE₀[ ξ exp L(Y, θ) ].

In particular, for ξ ≡ 1, this formula leads to the identity

IE₀[ exp L(Y, θ) ] = ∫ exp{ L(y, θ) } μ₀(dy) ≡ 1.   (2.9)
The next lemma claims that the product family (IPθ) for an i.i.d. sample from a
regular family is also regular.
Lemma 2.5.10. Let (P_θ) be a regular family and IP_θ = P_θ^{⊗n}. Then:

1. The set A_n := { y = (y₁, …, y_n)^T : Π p(y_i, θ) = 0 } is the same for all θ ∈ Θ.

2. For any r.v. S = S(Y) with IE_θ S² ≤ C, θ ∈ Θ, it holds

(∂/∂θ) IE_θ S = (∂/∂θ) IE₀[ S exp L(Y, θ) ] = IE₀[ S L′(Y, θ) exp L(Y, θ) ],

where L′(Y, θ) := (∂/∂θ) L(Y, θ).

3. The derivative L′(Y, θ) is square integrable and

IE_θ |L′(Y, θ)|² = n I(θ).
Local properties of the Kullback-Leibler divergence and Hellinger distance
Here we show that the quantities introduced so far are closely related to each other. We
start with the Kullback-Leibler divergence.
Lemma 2.5.11. Let (P_θ) be a regular family. Then the KL-divergence K(θ, θ′) satisfies:

K(θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) K(θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) K(θ, θ′) |_{θ′=θ} = I(θ).

In a small neighborhood of θ, the KL-divergence can be approximated by

K(θ, θ′) ≈ I(θ) |θ′ − θ|² / 2.
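The quality of this local approximation is easy to inspect numerically. The sketch below assumes the Poisson family, where the exact divergence is K(θ, θ′) = θ log(θ/θ′) + θ′ − θ and I(θ) = 1/θ; the chosen parameter values are arbitrary.

```python
import math

# Assumed Poisson example: exact KL vs. the local quadratic approximation.
theta, theta_p = 2.0, 2.1
K_exact = theta * math.log(theta / theta_p) + theta_p - theta
K_approx = (theta_p - theta) ** 2 / (2 * theta)   # I(theta) = 1/theta
print(K_exact, K_approx)   # the two values agree to a few percent
```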
Similar properties can be established for the rate function m(μ, θ, θ′).

Lemma 2.5.12. Let (P_θ) be a regular family. Then the rate function m(μ, θ, θ′) satisfies:

m(μ, θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) m(μ, θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) m(μ, θ, θ′) |_{θ′=θ} = μ(1 − μ) I(θ).

In a small neighborhood of θ, the rate function m(μ, θ, θ′) can be approximated by

m(μ, θ, θ′) ≈ μ(1 − μ) I(θ) |θ′ − θ|² / 2.

Moreover, for any θ, θ′ ∈ Θ,

m(μ, θ, θ′) |_{μ=0} = 0,
(d/dμ) m(μ, θ, θ′) |_{μ=0} = E_θ ℓ(Y, θ, θ′) = K(θ, θ′),
(d²/dμ²) m(μ, θ, θ′) |_{μ=0} = −Var_θ[ ℓ(Y, θ, θ′) ].

This implies an approximation for small μ:

m(μ, θ, θ′) ≈ μ K(θ, θ′) − (μ²/2) Var_θ[ ℓ(Y, θ, θ′) ].
Exercise 2.5.9. Check the statements of Lemmas 2.5.11 and 2.5.12.
2.6 Cramer-Rao Inequality
Let θ̃ be an estimate of the parameter θ∗. We are interested in establishing a lower bound for the risk of this estimate. This bound indicates that under some conditions the quadratic risk of an estimate can never be below a specific value.

2.6.1 Univariate parameter

We again start with the univariate case and first consider an unbiased estimate θ̃. Suppose that the family (P_θ, θ ∈ Θ) is dominated by a σ-finite measure μ₀ on the real line, and denote by p(y, θ) the density of P_θ w.r.t. μ₀:

p(y, θ) := (dP_θ/dμ₀)(y).

Theorem 2.6.1 (Cramer-Rao Inequality). Let θ̃ = θ̃(Y) be an unbiased estimate of θ for an i.i.d. sample from a regular family (P_θ). Then

IE_θ |θ̃ − θ|² = Var_θ(θ̃) ≥ 1 / ( n I(θ) ).

Moreover, if θ̃ is not unbiased and τ(θ) = IE_θ θ̃, then with τ′(θ) := (d/dθ) τ(θ), it holds

Var_θ(θ̃) ≥ |τ′(θ)|² / ( n I(θ) )

and

IE_θ |θ̃ − θ|² = Var_θ(θ̃) + |τ(θ) − θ|² ≥ |τ′(θ)|² / ( n I(θ) ) + |τ(θ) − θ|².
Proof. Consider first the case of an unbiased estimate θ̃ with IE_θ θ̃ ≡ θ. Differentiating the identity (2.9), IE₀ exp L(Y, θ) ≡ 1, w.r.t. θ yields

0 ≡ ∫ L′(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ L′(Y, θ).   (2.10)

Similarly, the identity IE_θ θ̃ = θ implies

1 ≡ ∫ θ̃(y) L′(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ[ θ̃ L′(Y, θ) ].

Together with (2.10), this gives

IE_θ[ (θ̃ − θ) L′(Y, θ) ] ≡ 1.

By the Cauchy-Schwarz inequality

1 = IE²_θ[ (θ̃ − θ) L′(Y, θ) ] ≤ IE_θ(θ̃ − θ)² · IE_θ |L′(Y, θ)|² = Var_θ(θ̃) n I(θ).   (2.11)

This implies the first assertion.

Now we consider the general case. The proof is similar. The property (2.10) continues to hold. Next, the identity IE_θ θ̃ = θ is replaced with IE_θ θ̃ = τ(θ), yielding

IE_θ[ θ̃ L′(Y, θ) ] ≡ τ′(θ)

and

IE_θ[ {θ̃ − τ(θ)} L′(Y, θ) ] ≡ τ′(θ).

Again by the Cauchy-Schwarz inequality

|τ′(θ)|² = IE²_θ[ {θ̃ − τ(θ)} L′(Y, θ) ] ≤ IE_θ {θ̃ − τ(θ)}² · IE_θ |L′(Y, θ)|² = Var_θ(θ̃) n I(θ),

and the second assertion follows. The last statement is the usual decomposition of the quadratic risk into the squared bias and the variance of the estimate.
2.6.2 Exponential families and R-efficiency
An interesting question is how good (precise) the Cramer-Rao lower bound is, in particular, when it becomes an equality. Indeed, if we restrict ourselves to unbiased estimates, no estimate can have quadratic risk smaller than [n I(θ)]⁻¹. If an estimate has exactly the risk [n I(θ)]⁻¹, then this estimate is automatically efficient in the sense that it is the best in the class in terms of the quadratic risk.

Definition 2.6.2. An unbiased estimate θ̃ is R-efficient if

Var_θ(θ̃) = [n I(θ)]⁻¹.

Theorem 2.6.3. An unbiased estimate θ̃ is R-efficient if and only if

θ̃ = n⁻¹ Σ U(Y_i),

where the function U(·) on IR satisfies ∫ U(y) dP_θ(y) ≡ θ, and the log-density ℓ(y, θ) of P_θ can be represented as

ℓ(y, θ) = C(θ) U(y) − B(θ) + ℓ(y),   (2.12)

for some functions C(·) and B(·) on Θ and a function ℓ(·) on IR.
Proof. Suppose first that the representation (2.12) for the log-density is correct. Then ℓ′(y, θ) = C′(θ) U(y) − B′(θ), and the identity E_θ ℓ′(Y, θ) = 0 implies the relation between the functions B(·) and C(·):

θ C′(θ) = B′(θ).   (2.13)

Next, differentiating the equality

0 ≡ ∫ {U(y) − θ} dP_θ(y) = ∫ {U(y) − θ} p(y, θ) dμ₀(y)

w.r.t. θ implies, in view of (2.13),

1 ≡ IE_θ[ {U(Y) − θ} { C′(θ) U(Y) − B′(θ) } ] = C′(θ) IE_θ {U(Y) − θ}².

This yields Var_θ{ U(Y) } = 1/C′(θ). This leads to the following representation for the Fisher information:

I(θ) = Var_θ{ ℓ′(Y, θ) } = Var_θ{ C′(θ) U(Y) − B′(θ) } = {C′(θ)}² Var_θ{ U(Y) } = C′(θ).

The estimate θ̃ = n⁻¹ Σ U(Y_i) satisfies

IE_θ θ̃ = θ,

that is, it is unbiased. Moreover,

Var_θ(θ̃) = Var_θ{ n⁻¹ Σ U(Y_i) } = n⁻² Σ Var{ U(Y_i) } = 1/( n C′(θ) ) = 1/( n I(θ) ),

and θ̃ is R-efficient.

Now we show the reverse statement. Due to the proof of the Cramer-Rao inequality, the only possibility of getting the equality in this inequality is if (2.11) holds as an equality. It is well known that the Cauchy-Schwarz inequality IE ξη ≤ √(IE ξ² IE η²) is an equality iff ξ and η are linearly dependent. This leads to the relation

L′(Y, θ) = c(θ) θ̃ − b(θ)

for some coefficients c(θ), b(θ). This implies for some fixed θ₀ and any θ

L(Y, θ) − L(Y, θ₀) = ∫_{θ₀}^{θ} L′(Y, u) du = θ̃ ∫_{θ₀}^{θ} c(u) du − ∫_{θ₀}^{θ} b(u) du = θ̃ C(θ) − B(θ)

with C(θ) = ∫_{θ₀}^{θ} c(u) du and B(θ) = ∫_{θ₀}^{θ} b(u) du. Applying this equality to a sample with n = 1 yields U(Y₁) = θ̃(Y₁), and

ℓ(Y₁, θ) = ℓ(Y₁, θ₀) + C(θ) U(Y₁) − B(θ).

The desired representation follows.
Exercise 2.6.1. Apply the Cramer-Rao inequality to the empirical mean estimate θ̃ = n⁻¹ Σ Y_i and check its R-efficiency for the Gaussian shift, Bernoulli, Poisson, exponential, and volatility families.
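One of these cases can be checked by simulation. The sketch below assumes the Poisson family: there I(θ) = 1/θ, so the Cramer-Rao bound is 1/(n I(θ)) = θ/n, which is exactly the variance of the empirical mean; sample sizes and seed are arbitrary.

```python
import numpy as np

# Assumed Poisson example: the empirical mean attains the Cramer-Rao
# bound 1/(n*I(theta)) = theta/n, i.e. it is R-efficient.
rng = np.random.default_rng(5)
theta, n, reps = 2.0, 50, 20_000
means = rng.poisson(theta, size=(reps, n)).mean(axis=1)
emp_var = means.var()
print(emp_var, theta / n)   # both should be close to 0.04
```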
2.7 Cramer-Rao inequality. Multivariate parameter
This section extends the notions and results of the previous sections from the case of a
univariate parameter to the case of a multivariate parameter with θ ∈ Θ ⊂ IRp .
2.7.1 Regularity and Fisher Information. Multivariate parameter
The definition of regularity naturally extends to the case of a multivariate parameter
θ = (θ1, . . . , θp)> . It suffices to check the same conditions as in the univariate case for
every partial derivative ∂p(y,θ)/∂θj of the density p(y,θ) for j = 1, . . . , p .
Definition 2.7.1. The family (P_θ, θ ∈ Θ ⊂ IR^p) is regular if the following conditions are fulfilled:

1. The sets A(θ) := {y : p(y, θ) = 0} are the same for all θ ∈ Θ.

2. Differentiability under the integration sign: for any function s(y) satisfying

∫ s²(y) p(y, θ) dμ₀(y) ≤ C,   θ ∈ Θ,

it holds

(∂/∂θ) ∫ s(y) dP_θ(y) = (∂/∂θ) ∫ s(y) p(y, θ) dμ₀(y) = ∫ s(y) (∂p(y, θ)/∂θ) dμ₀(y).

3. Finite Fisher information: the log-density function ℓ(y, θ) is differentiable in θ and its gradient ∇ℓ(y, θ) = ∂ℓ(y, θ)/∂θ is square integrable w.r.t. P_θ:

∫ ‖∇ℓ(y, θ)‖² dP_θ(y) = ∫ ( ‖∇p(y, θ)‖² / p(y, θ) ) dμ₀(y) < ∞.
In the case of a multivariate parameter, the notion of the Fisher information leads to the Fisher information matrix.

Definition 2.7.2. Let (P_θ, θ ∈ Θ ⊂ IR^p) be a parametric family. The matrix

I(θ) := ∫ ∇ℓ(y, θ) {∇ℓ(y, θ)}^T p(y, θ) dμ₀(y) = ∫ ∇p(y, θ) {∇p(y, θ)}^T (1/p(y, θ)) dμ₀(y)

is called the Fisher information matrix of (P_θ) at θ ∈ Θ.

This definition can be rewritten as

I(θ) = IE_θ[ ∇ℓ(Y₁, θ) {∇ℓ(Y₁, θ)}^T ].
The additivity property of the Fisher information extends to the multivariate case as
well.
Lemma 2.7.3. Let (P_θ, θ ∈ Θ) be a regular family. Then the n-fold product family (IP_θ) with IP_θ = P_θ^{⊗n} is also regular, and its Fisher information matrix satisfies

IE_θ[ ∇L(Y, θ) {∇L(Y, θ)}^T ] = n I(θ).   (2.14)
Exercise 2.7.1. Compute the Fisher information matrix for the i.i.d. experiment Yi =
θ + σεi with unknown θ and σ and εi i.i.d. standard normal.
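A Monte Carlo estimate of the score covariance can serve as a check for this exercise. The sketch below assumes the answer I(θ, σ) = diag(1/σ², 2/σ²) for the N(θ, σ²) family and estimates the covariance of the score vector (∂ℓ/∂θ, ∂ℓ/∂σ); the parameter values, sample size, and seed are arbitrary.

```python
import numpy as np

# Assumed example for Exercise 2.7.1: score covariance for N(theta, sigma^2).
rng = np.random.default_rng(6)
theta, sigma, N = 1.0, 2.0, 1_000_000
Y = rng.normal(theta, sigma, size=N)
s_theta = (Y - theta) / sigma ** 2                     # d l / d theta
s_sigma = -1.0 / sigma + (Y - theta) ** 2 / sigma ** 3 # d l / d sigma
I_hat = np.cov(np.vstack([s_theta, s_sigma]))
print(I_hat)   # close to [[1/sigma^2, 0], [0, 2/sigma^2]] = [[0.25, 0], [0, 0.5]]
```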
2.7.2 Local properties of the Kullback-Leibler divergence and Hellinger
distance
The local relations between the Kullback-Leibler divergence, rate function and Fisher
information naturally extend to the case of a multivariate parameter. We start with the
Kullback-Leibler divergence.
Lemma 2.7.4. Let (P_θ) be a regular family. Then the KL-divergence K(θ, θ′) satisfies:

K(θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) K(θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) K(θ, θ′) |_{θ′=θ} = I(θ).

In a small neighborhood of θ, the KL-divergence can be approximated by

K(θ, θ′) ≈ (θ′ − θ)^T I(θ) (θ′ − θ) / 2.
Similar properties can be established for the rate function m(μ, θ, θ′).

Lemma 2.7.5. Let (P_θ) be a regular family. Then the rate function m(μ, θ, θ′) satisfies:

m(μ, θ, θ′) |_{θ′=θ} = 0,
(d/dθ′) m(μ, θ, θ′) |_{θ′=θ} = 0,
(d²/dθ′²) m(μ, θ, θ′) |_{θ′=θ} = μ(1 − μ) I(θ).

In a small neighborhood of θ, the rate function can be approximated by

m(μ, θ, θ′) ≈ μ(1 − μ) (θ′ − θ)^T I(θ) (θ′ − θ) / 2.

Moreover, for any θ, θ′ ∈ Θ,

m(μ, θ, θ′) |_{μ=0} = 0,
(d/dμ) m(μ, θ, θ′) |_{μ=0} = E_θ ℓ(Y, θ, θ′) = K(θ, θ′),
(d²/dμ²) m(μ, θ, θ′) |_{μ=0} = −Var_θ[ ℓ(Y, θ, θ′) ].

This implies an approximation for small μ:

m(μ, θ, θ′) ≈ μ K(θ, θ′) − (μ²/2) Var_θ[ ℓ(Y, θ, θ′) ].
Exercise 2.7.2. Check the statements of Lemmas 2.7.4 and 2.7.5.
2.7.3 Multivariate Cramer-Rao Inequality
Let θ̃ = θ̃(Y) be an estimate of the unknown parameter vector. This estimate is called unbiased if

IE_θ θ̃ ≡ θ.

Theorem 2.7.6 (Multivariate Cramer-Rao Inequality). Let θ̃ = θ̃(Y) be an unbiased estimate of θ for an i.i.d. sample from a regular family (P_θ). Then

Var_θ(θ̃) ≥ { n I(θ) }⁻¹,
IE_θ ‖θ̃ − θ‖² = tr{ Var_θ(θ̃) } ≥ tr[ { n I(θ) }⁻¹ ].

Moreover, if θ̃ is not unbiased and τ(θ) = IE_θ θ̃, then with ∇τ(θ) := (d/dθ) τ(θ), it holds

Var_θ(θ̃) ≥ ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T,

and

IE_θ ‖θ̃ − θ‖² = tr[ Var_θ(θ̃) ] + ‖τ(θ) − θ‖² ≥ tr[ ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T ] + ‖τ(θ) − θ‖².
Proof. Consider first the case of an unbiased estimate θ̃ with IE_θ θ̃ ≡ θ. Differentiating the identity (2.9), IE₀ exp L(Y, θ) ≡ 1, w.r.t. θ yields

0 ≡ ∫ ∇L(y, θ) exp{ L(y, θ) } μ₀(dy) = IE_θ[ ∇L(Y, θ) ].   (2.15)

Similarly, the identity IE_θ θ̃ = θ implies

I ≡ ∫ θ̃(y) {∇L(y, θ)}^T exp{ L(y, θ) } μ₀(dy) = IE_θ[ θ̃ {∇L(Y, θ)}^T ].

Together with (2.15), this gives

IE_θ[ (θ̃ − θ) {∇L(Y, θ)}^T ] ≡ I.   (2.16)

Consider the random vector

h := { n I(θ) }⁻¹ ∇L(Y, θ).

By (2.15), IE_θ h = 0, and by (2.14),

Var_θ(h) = IE_θ( h h^T ) = n⁻² I⁻¹(θ) IE_θ[ ∇L(Y, θ) {∇L(Y, θ)}^T ] I⁻¹(θ) = { n I(θ) }⁻¹,

and the identities (2.15) and (2.16) imply that

IE_θ[ (θ̃ − θ − h) h^T ] = 0.   (2.17)

The "no bias" property yields IE_θ(θ̃ − θ) = 0 and IE_θ[ (θ̃ − θ)(θ̃ − θ)^T ] = Var_θ(θ̃). Finally, by the orthogonality (2.17),

Var_θ(θ̃) = Var_θ(h) + Var_θ( θ̃ − θ − h ) = { n I(θ) }⁻¹ + Var_θ( θ̃ − θ − h ),

and the variance of θ̃ is not smaller than { n I(θ) }⁻¹. Moreover, the equality is only possible if θ̃ − θ − h is equal to zero almost surely.

Now we consider the general case. The proof is similar. The property (2.15) continues to hold. Next, the identity IE_θ θ̃ = θ is replaced with IE_θ θ̃ = τ(θ), yielding

IE_θ[ θ̃ {∇L(Y, θ)}^T ] ≡ ∇τ(θ)

and

IE_θ[ {θ̃ − τ(θ)} {∇L(Y, θ)}^T ] ≡ ∇τ(θ).

Define

h := ∇τ(θ) { n I(θ) }⁻¹ ∇L(Y, θ).

Then, similarly to the above,

IE_θ[ h h^T ] = ∇τ(θ) { n I(θ) }⁻¹ {∇τ(θ)}^T,
IE_θ[ {θ̃ − τ(θ) − h} h^T ] = 0,

and the second assertion follows. The statements about the quadratic risk follow from its usual decomposition into squared bias and the variance of the estimate.
2.7.4 Exponential families and R-efficiency
The notion of R-efficiency naturally extends to the case of a multivariate parameter.
Definition 2.7.7. An unbiased estimate θ̃ is R-efficient if

Var_θ(θ̃) = { n I(θ) }⁻¹.

Theorem 2.7.8. An unbiased estimate θ̃ is R-efficient if and only if

θ̃ = n⁻¹ Σ U(Y_i),

where the vector function U(·) on IR satisfies ∫ U(y) dP_θ(y) ≡ θ, and the log-density ℓ(y, θ) of P_θ can be represented as

ℓ(y, θ) = C(θ)^T U(y) − B(θ) + ℓ(y),   (2.18)

for some functions C(·) and B(·) on Θ and a function ℓ(·) on IR.
Proof. Suppose first that the representation (2.18) for the log-density is correct. Denote by C′(θ) the p×p Jacobi matrix of the vector function C: C′(θ) := (d/dθ) C(θ). Then ∇ℓ(y, θ) = C′(θ) U(y) − ∇B(θ), and the identity E_θ ∇ℓ(Y, θ) = 0 implies the relation between the functions B(·) and C(·):

C′(θ) θ = ∇B(θ).   (2.19)

Next, differentiating the equality

0 ≡ ∫ [U(y) − θ] dP_θ(y) = ∫ [U(y) − θ] p(y, θ) dμ₀(y)

w.r.t. θ implies, in view of (2.19),

I ≡ IE_θ[ { C′(θ) U(Y) − ∇B(θ) } {U(Y) − θ}^T ] = C′(θ) IE_θ[ {U(Y) − θ} {U(Y) − θ}^T ].

This yields Var_θ[ U(Y) ] = {C′(θ)}⁻¹; note that C′(θ) is symmetric because its inverse is a covariance matrix. This leads to the following representation for the Fisher information matrix:

I(θ) = Var_θ[ ∇ℓ(Y, θ) ] = Var_θ[ C′(θ) U(Y) − ∇B(θ) ] = C′(θ) Var_θ[ U(Y) ] {C′(θ)}^T = C′(θ).

The estimate θ̃ = n⁻¹ Σ U(Y_i) satisfies

IE_θ θ̃ = θ,

that is, it is unbiased. Moreover,

Var_θ(θ̃) = Var_θ( n⁻¹ Σ U(Y_i) ) = n⁻² Σ Var[ U(Y_i) ] = n⁻¹ {C′(θ)}⁻¹ = { n I(θ) }⁻¹,

and θ̃ is R-efficient.

As in the univariate case, one can show that equality in the Cramer-Rao bound is only possible if ∇L(Y, θ) and θ̃ − θ are linearly dependent. This leads again to the exponential family structure of the likelihood function.
Exercise 2.7.3. Complete the proof of Theorem 2.7.8.
2.8 Maximum likelihood and other estimation methods
This section presents some other popular methods of estimating the unknown parameter
including minimum distance and M-estimation, maximum likelihood procedure, etc.
2.8.1 Minimum distance estimation
Let ρ(P, P′) denote some functional (distance) defined for measures P, P′ on the real line. We assume that ρ satisfies the following conditions: ρ(P_{θ₁}, P_{θ₂}) ≥ 0, and ρ(P_{θ₁}, P_{θ₂}) = 0 iff θ₁ = θ₂. This implies for every θ∗ ∈ Θ that

argmin_{θ∈Θ} ρ(P_θ, P_{θ∗}) = θ∗.

The Glivenko-Cantelli theorem states that P_n converges weakly to the true distribution P_{θ∗}. Therefore, it is natural to define an estimate θ̃ of θ∗ by replacing in this formula the true measure P_{θ∗} by its empirical counterpart P_n, that is, by minimizing the distance ρ between the measures P_θ and P_n over the set (P_θ). This leads to the minimum distance estimate

θ̃ = argmin_{θ∈Θ} ρ(P_θ, P_n).
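A concrete instance of this recipe can be sketched in a few lines. The example below is an assumption, not from the text: it takes ρ to be the Kolmogorov (sup) distance between the edf and the N(θ, 1) cdf and minimizes it over a crude grid of candidate values; the family, grid, sizes, and seed are all arbitrary choices.

```python
import math
import numpy as np

# Assumed example: minimum distance estimation with the Kolmogorov distance
# rho(P_theta, P_n) = sup_t |F_n(t) - F_theta(t)| for the N(theta, 1) family.
rng = np.random.default_rng(7)
theta_star, n = 0.5, 2000
Y = np.sort(rng.normal(theta_star, 1.0, size=n))
F_n = np.arange(1, n + 1) / n          # edf evaluated at the order statistics

def kolmogorov_dist(theta):
    # N(theta, 1) cdf evaluated at the sample points
    F_theta = 0.5 * (1.0 + np.array([math.erf((y - theta) / math.sqrt(2.0))
                                     for y in Y]))
    return float(np.max(np.abs(F_n - F_theta)))

grid = np.linspace(-1.0, 2.0, 301)
theta_hat = grid[int(np.argmin([kolmogorov_dist(t) for t in grid]))]
print(theta_hat)   # close to theta_star = 0.5
```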
2.8.2 M -estimation and Maximum likelihood estimation
Another general method of building an estimate of θ∗, so-called M-estimation, is defined via a contrast function ψ(y, θ) given for every y ∈ IR and θ ∈ Θ. The principal condition on ψ is that the integral IE_θ ψ(Y₁, θ′) is minimized at θ′ = θ:

θ = argmin_{θ′} ∫ ψ(y, θ′) dP_θ(y),   θ ∈ Θ.   (2.20)

In particular,

θ∗ = argmin_{θ∈Θ} ∫ ψ(y, θ) dP_{θ∗}(y),

and the M-estimate is again obtained by substitution, that is, by replacing the true measure P_{θ∗} with its empirical counterpart P_n:

θ̃ = argmin_{θ∈Θ} ∫ ψ(y, θ) dP_n(y) = argmin_{θ∈Θ} n⁻¹ Σ ψ(Y_i, θ).
Exercise 2.8.1. Let Y be an i.i.d. sample from P ∈ (P_θ, θ ∈ Θ ⊂ IR).

(i) Let g(y) satisfy ∫ g(y) dP_θ(y) ≡ θ, leading to the moment estimate

θ̃ = n⁻¹ Σ g(Y_i).

Show that this estimate can be obtained as the M-estimate for a properly selected function ψ(·).

(ii) Let ∫ g(y) dP_θ(y) ≡ m(θ) for given functions g(·) and m(·), where m(·) is monotone. Show that the moment estimate θ̃ = m⁻¹(M_n) with M_n = n⁻¹ Σ g(Y_i) can be obtained as the M-estimate for a properly selected function ψ(·).
We mention three prominent examples of the contrast function ψ and the resulting
estimates: least squares, least absolute deviation and maximum likelihood.
Least squares estimation
The least squares estimate (LSE) corresponds to the quadratic contrast ‖ψ(y) − θ‖², where ψ(y) is a p-dimensional function of the observation y satisfying

∫ ψ(y) dP_θ(y) ≡ θ,   θ ∈ Θ.

Then the true parameter θ∗ fulfills the relation

θ∗ = argmin_{θ∈Θ} ∫ ‖ψ(y) − θ‖² dP_{θ∗}(y)

because

∫ ‖ψ(y) − θ‖² dP_{θ∗}(y) = ‖θ∗ − θ‖² + ∫ ‖ψ(y) − θ∗‖² dP_{θ∗}(y).

The substitution method leads to the estimate θ̃ of θ∗ defined by minimizing the empirical version of the integral ∫ ‖ψ(y) − θ‖² dP_{θ∗}(y):

θ̃ := argmin_{θ∈Θ} ∫ ‖ψ(y) − θ‖² dP_n(y) = argmin_{θ∈Θ} Σ ‖ψ(Y_i) − θ‖².

This is again a quadratic optimization problem with a closed form solution called the least squares or ordinary least squares estimate.
Lemma 2.8.1. It holds

θ̃ = argmin_{θ∈Θ} Σ ‖ψ(Y_i) − θ‖² = n⁻¹ Σ ψ(Y_i).

One can see that the LSE θ̃ coincides with the moment estimate based on the function g(·) = ψ(·). Indeed, the equality ∫ g(y) dP_{θ∗}(y) = θ∗ leads directly to the LSE θ̃ = n⁻¹ Σ g(Y_i).
Least absolute deviation (median) estimation
The next example of an M-estimate is given by the absolute deviation contrast fit. For
simplicity of presentation, we consider here only the case of a univariate parameter. The
contrast function ψ(y, θ) is given by ψ(y, θ)def= |ψ(y) − θ| . The solution of the related
optimization problem (2.20) is given by the median med(Pθ) of the distribution Pθ .
Definition 2.8.2. The value t is called the median of a distribution function F if
F (t) ≥ 1/2, F (t−) < 1/2.
If F (·) is a continuous function then the median t = med(F ) satisfies F (t) = 1/2 .
Theorem 2.8.3. For any cdf F , the median med(F ) satisfies
infθ∈IR
∫|y − θ| dF (y) =
∫|y −med(F )| dF (y).
Proof. Consider for simplicity the case of a continuous distribution function F . One has
|y− θ| = (θ− y)1(y < θ) + (y− θ)1(y ≥ θ) . Differentiating w.r.t. θ yields the following
equation for any extreme point of∫|y − θ| dF (y) :
−∫ θ
−∞dF (y) +
∫ ∞θ
dF (y) = 0.
The median is the only solution of this equation.
Let the family $(P_\theta)$ be such that $\theta = \operatorname{med}(P_\theta)$ for all $\theta \in I\!R$. Then the M-estimation approach leads to the least absolute deviation (LAD) estimate
\[ \tilde\theta \stackrel{\rm def}{=} \operatorname{argmin}_{\theta \in I\!R} \int |y - \theta|\, dF_n(y) = \operatorname{argmin}_{\theta \in I\!R} \sum |Y_i - \theta|. \]
Due to Theorem 2.8.3, the solution of this problem is given by the median of the edf Fn .
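The statement of Theorem 2.8.3 can also be checked numerically. The following sketch (our own illustration, using simulated Gaussian data) verifies on a grid that the empirical median minimizes the LAD contrast $\sum |Y_i - \theta|$:

```python
import random
import statistics

# Numeric sketch (not from the text): the empirical median minimizes the
# least absolute deviation criterion sum_i |Y_i - theta| over theta.
random.seed(0)
Y = [random.gauss(0.0, 1.0) for _ in range(201)]  # odd n -> unique sample median

def lad(theta, sample):
    """LAD contrast: sum of absolute deviations from theta."""
    return sum(abs(y - theta) for y in sample)

med = statistics.median(Y)
# Compare the criterion at the median against a grid of competing values.
grid = [min(Y) + k * (max(Y) - min(Y)) / 400 for k in range(401)]
assert all(lad(med, Y) <= lad(t, Y) + 1e-12 for t in grid)
print("median minimizes the LAD criterion on the grid:", True)
```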
Maximum likelihood estimation
Let now $\psi(y,\theta) = -\ell(y,\theta) = -\log p(y,\theta)$, where $p(y,\theta)$ is the density of the measure $P_\theta$ at $y$ w.r.t. some dominating measure $\mu_0$. This choice leads to the maximum likelihood estimate (MLE):
\[ \tilde\theta = \operatorname{argmax}_{\theta \in \Theta} n^{-1} \sum \log p(Y_i, \theta). \]
The condition (2.20) is fulfilled because
\[ \operatorname{argmin}_{\theta'} \int \psi(y,\theta')\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} \int \big\{\psi(y,\theta') - \psi(y,\theta)\big\}\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} \int \log\frac{p(y,\theta)}{p(y,\theta')}\, dP_\theta(y)
 = \operatorname{argmin}_{\theta'} K(\theta,\theta') = \theta. \]
Here we used that the Kullback-Leibler divergence $K(\theta,\theta')$ attains its minimum, equal to zero, at the point $\theta' = \theta$, which in turn follows from the concavity of the log-function by the Jensen inequality.
Note that the definition of the MLE does not depend on the choice of the dominating
measure µ0 .
Exercise 2.8.2. Show that the MLE θ does not change if another dominating measure
is used.
Computing an M-estimate or MLE leads to solving an optimization problem for the empirical quantity $\sum \psi(Y_i,\theta)$ w.r.t. the parameter $\theta$. If the function $\psi$ is differentiable w.r.t. $\theta$, then the solution can be found from the estimating equation
\[ \frac{\partial}{\partial\theta} \sum \psi(Y_i,\theta) = 0. \]
Exercise 2.8.3. Show that any M-estimate, and particularly the MLE, can be represented as a minimum distance estimate with a properly defined distance $\rho$.
Hint: define $\rho(P_\theta, P_{\theta^*})$ as $\int \big[\psi(y,\theta) - \psi(y,\theta^*)\big]\, dP_{\theta^*}(y)$.
Recall that the MLE $\tilde\theta$ is defined by maximizing the expression $L(\theta) = \sum \ell(Y_i,\theta)$ w.r.t. $\theta$. Below we use the notation $L(\theta,\theta') \stackrel{\rm def}{=} L(\theta) - L(\theta')$, often called the log-likelihood ratio.
In our study we will focus on the value of the maximum $L(\tilde\theta) = \max_\theta L(\theta)$.
Definition 2.8.4. Let $L(\theta) = \sum \ell(Y_i,\theta)$ be the likelihood function. The value
\[ L(\tilde\theta) \stackrel{\rm def}{=} \max_\theta L(\theta) \]
is called the maximum log-likelihood or fitted log-likelihood. The excess $L(\tilde\theta) - L(\theta^*)$ is the difference between the maximum of the likelihood function $L(\theta)$ over $\theta$ and its particular value at the true parameter $\theta^*$:
\[ L(\tilde\theta,\theta^*) \stackrel{\rm def}{=} \max_\theta L(\theta) - L(\theta^*). \]
The next section collects some examples of computing the MLE θ and the corre-
sponding maximum log-likelihood.
2.9 Maximum Likelihood for some parametric families
The examples of this section focus on the structure of the log-likelihood and the corresponding MLE $\tilde\theta$ and maximum log-likelihood $L(\tilde\theta)$.
2.9.1 Gaussian shift
Let $P_\theta$ be the normal distribution on the real line with mean $\theta$ and known variance $\sigma^2$. The corresponding density w.r.t. the Lebesgue measure reads
\[ p(y,\theta) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\Big\{-\frac{(y-\theta)^2}{2\sigma^2}\Big\}. \]
The log-likelihood $L(\theta)$ is
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (Y_i - \theta)^2. \]
The corresponding normal equation $L'(\theta) = 0$ yields
\[ -\frac{1}{\sigma^2}\sum (Y_i - \theta) = 0 \tag{2.21} \]
leading to the empirical mean solution $\tilde\theta = n^{-1}\sum Y_i$.
The computation of the fitted likelihood is a bit more involved.
Theorem 2.9.1. Let $Y_i = \theta^* + \varepsilon_i$ with $\varepsilon_i \sim N(0,\sigma^2)$. For any $\theta$
\[ L(\tilde\theta, \theta) = n\sigma^{-2}(\tilde\theta - \theta)^2/2. \tag{2.22} \]
Moreover,
\[ L(\tilde\theta, \theta^*) = n\sigma^{-2}(\tilde\theta - \theta^*)^2/2 = \xi^2/2, \]
where $\xi$ is a standard normal r.v., so that $2L(\tilde\theta,\theta^*)$ has the fixed $\chi^2_1$ distribution with one degree of freedom. If $z_\alpha$ is the quantile of $\chi^2_1/2$ with $P(\xi^2/2 > z_\alpha) = \alpha$, then
\[ \mathcal{E}(z_\alpha) = \{u : L(\tilde\theta, u) \le z_\alpha\} \tag{2.23} \]
is an $\alpha$-confidence set: $IP_{\theta^*}\big(\mathcal{E}(z_\alpha) \not\ni \theta^*\big) = \alpha$.
For every $r > 0$,
\[ IE_{\theta^*}\big|2L(\tilde\theta, \theta^*)\big|^r = c_r, \]
where $c_r = IE|\xi|^{2r}$ with $\xi \sim N(0,1)$.
Proof 1. Consider $L(\tilde\theta,\theta) \stackrel{\rm def}{=} L(\tilde\theta) - L(\theta)$ as a function of the parameter $\theta$. Obviously
\[ L(\tilde\theta, \theta) = -\frac{1}{2\sigma^2}\sum\big[(Y_i - \tilde\theta)^2 - (Y_i - \theta)^2\big], \]
so that $L(\tilde\theta,\theta)$ is a quadratic function of $\theta$. Next, it holds $L(\tilde\theta,\theta)\big|_{\theta=\tilde\theta} = 0$ and
\[ \frac{d}{d\theta}L(\tilde\theta,\theta)\Big|_{\theta=\tilde\theta} = -\frac{d}{d\theta}L(\theta)\Big|_{\theta=\tilde\theta} = 0 \]
due to the normal equation (2.21). Finally,
\[ \frac{d^2}{d\theta^2}L(\tilde\theta,\theta)\Big|_{\theta=\tilde\theta} = -\frac{d^2}{d\theta^2}L(\theta)\Big|_{\theta=\tilde\theta} = n/\sigma^2. \]
This implies by the Taylor expansion of the quadratic function $L(\tilde\theta,\theta)$ at $\theta = \tilde\theta$:
\[ L(\tilde\theta, \theta) = \frac{n}{2\sigma^2}(\tilde\theta - \theta)^2. \]
Proof 2. First observe that for any two points $\theta', \theta$, the log-likelihood ratio $L(\theta',\theta) = \log(dIP_{\theta'}/dIP_\theta) = L(\theta') - L(\theta)$ can be represented in the form
\[ L(\theta', \theta) = \sigma^{-2}(S - n\theta)(\theta' - \theta) - n\sigma^{-2}(\theta' - \theta)^2/2, \qquad S = \sum Y_i. \]
Substituting the MLE $\tilde\theta = S/n$ in place of $\theta'$ implies
\[ L(\tilde\theta, \theta) = n\sigma^{-2}(\tilde\theta - \theta)^2/2. \]
Now we consider the second statement about the distribution of $L(\tilde\theta,\theta^*)$. The substitution $\theta = \theta^*$ in (2.22) and the model equation $Y_i = \theta^* + \varepsilon_i$ imply $\tilde\theta - \theta^* = n^{-1/2}\sigma\xi$, where
\[ \xi \stackrel{\rm def}{=} \frac{1}{\sigma\sqrt{n}}\sum \varepsilon_i \]
is standard normal. Therefore,
\[ L(\tilde\theta, \theta^*) = \xi^2/2. \]
This easily implies the result of the theorem.
We see that under $IP_{\theta^*}$ the variable $2L(\tilde\theta,\theta^*)$ is $\chi^2_1$-distributed, and this distribution depends neither on the sample size $n$ nor on the scale parameter $\sigma$. This fact is known in a more general form as the chi-squared theorem.
Exercise 2.9.1. Check that the confidence sets
\[ \mathcal{E}^\circ(z_\alpha) \stackrel{\rm def}{=} \big[\tilde\theta - n^{-1/2}\sigma z_\alpha,\; \tilde\theta + n^{-1/2}\sigma z_\alpha\big], \]
where $z_\alpha$ is defined by $2\Phi(-z_\alpha) = \alpha$, and $\mathcal{E}(z_\alpha)$ from (2.23) coincide.
Exercise 2.9.2. Compute the constant cr from Theorem 2.9.1 for r = 0.5, 1, 1.5, 2 .
Already now we point out an interesting feature of the fitted log-likelihood $L(\tilde\theta,\theta^*)$. It can be viewed as the normalized squared loss of the estimate $\tilde\theta$ because $L(\tilde\theta,\theta^*) = n\sigma^{-2}|\tilde\theta - \theta^*|^2/2$. The last statement of Theorem 2.9.1 yields that
\[ IE_{\theta^*}|\tilde\theta - \theta^*|^{2r} = c_r\sigma^{2r}n^{-r}. \]
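The distribution-free nature of $2L(\tilde\theta,\theta^*)$ can be checked by simulation. The following sketch (our own example; the values of $n$, $\sigma$, and $\theta^*$ are arbitrary) compares the empirical tail of $2L(\tilde\theta,\theta^*)$ with the $\chi^2_1$ tail:

```python
import random
import math

# Monte Carlo sketch (assumed setup, not from the text): for the Gaussian shift
# model Y_i = theta* + eps_i, the quantity 2 L(theta_hat, theta*)
# = n sigma^{-2} (theta_hat - theta*)^2 equals xi^2 with xi standard normal,
# so its distribution is chi^2_1 for every n and sigma.
random.seed(1)
n, sigma, theta_star = 50, 2.0, 1.5
draws = []
for _ in range(20000):
    Y = [theta_star + random.gauss(0.0, sigma) for _ in range(n)]
    theta_hat = sum(Y) / n
    draws.append(n * (theta_hat - theta_star) ** 2 / sigma ** 2)  # = 2 L

def chi2_1_tail(z):
    # P(xi^2 > z) = 2 (1 - Phi(sqrt(z))) for xi ~ N(0, 1)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(math.sqrt(z) / math.sqrt(2.0))))

for z in (0.5, 1.0, 2.0):
    emp = sum(d > z for d in draws) / len(draws)
    assert abs(emp - chi2_1_tail(z)) < 0.02
print("empirical tails match chi^2_1:", True)
```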
2.9.2 Variance estimation for the normal law
Let $Y_i$ be i.i.d. normal with mean zero and unknown variance $\theta^*$:
\[ Y_i \sim N(0, \theta^*), \qquad \theta^* \in I\!R_+. \]
The likelihood function reads
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\theta) - \frac{1}{2\theta}\sum Y_i^2. \]
The normal equation $L'(\theta) = 0$ yields
\[ L'(\theta) = -\frac{n}{2\theta} + \frac{1}{2\theta^2}\sum Y_i^2 = 0, \]
leading to
\[ \tilde\theta = \frac{1}{n}S_n \quad \text{with } S_n = \sum Y_i^2. \]
Moreover, for any $\theta$,
\[ L(\tilde\theta, \theta) = -\frac{n}{2}\log(\tilde\theta/\theta) - \frac{S_n}{2}\big(1/\tilde\theta - 1/\theta\big) = nK(\tilde\theta, \theta), \]
where
\[ K(\theta, \theta') = \frac{1}{2}\big[\log(\theta'/\theta) + \theta/\theta' - 1\big] \]
is the Kullback-Leibler divergence between the two Gaussian measures $N(0,\theta)$ and $N(0,\theta')$.
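The identity $L(\tilde\theta,\theta) = nK(\tilde\theta,\theta)$ can be verified numerically; the following sketch (our own example, with arbitrary true variance $\theta^*$ and comparison point $\theta$) checks it for simulated data:

```python
import random
import math

# Numeric sketch (assumed setup): for the variance model Y_i ~ N(0, theta*),
# check the identity L(theta_hat, theta) = n K(theta_hat, theta) with
# K(theta, theta') = (1/2)[log(theta'/theta) + theta/theta' - 1].
random.seed(2)
n, theta_star = 100, 4.0
Y = [random.gauss(0.0, math.sqrt(theta_star)) for _ in range(n)]
S_n = sum(y * y for y in Y)
theta_hat = S_n / n

def loglik(theta):
    return -0.5 * n * math.log(2 * math.pi * theta) - S_n / (2 * theta)

def K(theta, theta_p):
    return 0.5 * (math.log(theta_p / theta) + theta / theta_p - 1.0)

theta = 2.5  # an arbitrary comparison point
lhs = loglik(theta_hat) - loglik(theta)
rhs = n * K(theta_hat, theta)
assert abs(lhs - rhs) < 1e-9
print("L(theta_hat, theta) == n K(theta_hat, theta):", True)
```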
2.9.3 Univariate normal distribution
Let $Y_i$ be normal, $N(\alpha, \sigma^2)$, as in the previous examples, but now neither the mean $\alpha$ nor the variance $\sigma^2$ is known. This leads to estimating the vector $\theta = (\theta_1, \theta_2) = (\alpha, \sigma^2)$ from the i.i.d. sample $Y$.
The maximum likelihood approach leads to maximizing the log-likelihood w.r.t. the vector $\theta = (\alpha, \sigma^2)^\top$:
\[ L(\theta) = \sum \log p(Y_i,\theta) = -\frac{n}{2}\log(2\pi\theta_2) - \frac{1}{2\theta_2}\sum (Y_i - \theta_1)^2. \]
Exercise 2.9.3. Check that the ML approach leads to the same estimates (2.2) as the
method of moments.
2.9.4 Uniform distribution on [0, θ]
Let $Y_i$ be uniformly distributed on the interval $[0,\theta]$ of the real line, where the right end point $\theta$ is unknown. The density $p(y,\theta)$ of $P_\theta$ w.r.t. the Lebesgue measure is $\theta^{-1}\mathbf{1}(y \le \theta)$. The likelihood reads
\[ Z(\theta) = \theta^{-n}\,\mathbf{1}\big(\max_i Y_i \le \theta\big). \]
This likelihood is positive iff $\theta \ge \max_i Y_i$, and it is maximized exactly at $\theta = \max_i Y_i$, so $\tilde\theta = \max_i Y_i$. One can see that the MLE $\tilde\theta$ is the limiting case of the moment estimate $\tilde\theta_k$ as $k$ grows to infinity.
2.9.5 Bernoulli or binomial model
Let $P_\theta$ be a Bernoulli law for $\theta \in [0,1]$. The density of $Y_i$ under $P_\theta$ can be written as
\[ p(y,\theta) = \theta^y(1-\theta)^{1-y}. \]
The corresponding log-likelihood reads
\[ L(\theta) = \sum\big\{Y_i\log\theta + (1-Y_i)\log(1-\theta)\big\} = S_n\log\frac{\theta}{1-\theta} + n\log(1-\theta) \]
with $S_n = \sum Y_i$. Maximizing this expression w.r.t. $\theta$ results again in the empirical mean
\[ \tilde\theta = S_n/n. \]
This implies
\[ L(\tilde\theta, \theta) = n\tilde\theta\log\frac{\tilde\theta}{\theta} + n(1-\tilde\theta)\log\frac{1-\tilde\theta}{1-\theta} = nK(\tilde\theta, \theta), \]
where $K(\theta,\theta') = \theta\log(\theta/\theta') + (1-\theta)\log\{(1-\theta)/(1-\theta')\}$ is the Kullback-Leibler divergence for the Bernoulli law.
2.9.6 Multinomial model
The multinomial distribution $B^m_\theta$ describes the number of successes in $m$ experiments when one success has the probability $\theta \in [0,1]$. This distribution can be viewed as the sum of $m$ Bernoulli experiments with the same parameter $\theta$.
One has
\[ P_\theta(Y_1 = k) = \binom{m}{k}\theta^k(1-\theta)^{m-k}, \qquad k = 0,\ldots,m. \]
Exercise 2.9.4. Check that the ML approach leads to the estimate
\[ \tilde\theta = \frac{1}{mn}\sum Y_i. \]
Compute $L(\tilde\theta,\theta)$.
2.9.7 Exponential model
Let $Y_1,\ldots,Y_n$ be i.i.d. exponential random variables with parameter $\theta^* > 0$. This means that the $Y_i$ are nonnegative and satisfy $IP(Y_i > t) = e^{-t/\theta^*}$. The density of the exponential law w.r.t. the Lebesgue measure is $p(y,\theta^*) = e^{-y/\theta^*}/\theta^*$. The corresponding log-likelihood can be written as
\[ L(\theta) = -n\log\theta - \sum_{i=1}^{n}Y_i/\theta = -S/\theta - n\log\theta, \]
where $S = Y_1 + \ldots + Y_n$.
The ML estimating equation yields $S/\theta^2 = n/\theta$, or
\[ \tilde\theta = S/n. \]
For the fitted log-likelihood $L(\tilde\theta,\theta)$ this gives
\[ L(\tilde\theta, \theta) = n(\tilde\theta/\theta - 1) - n\log(\tilde\theta/\theta) = nK(\tilde\theta, \theta). \]
Here once again $K(\theta,\theta') = \theta/\theta' - 1 - \log(\theta/\theta')$ is the Kullback-Leibler divergence for the exponential law.
2.9.8 Poisson model
Let $Y_1,\ldots,Y_n$ be i.i.d. Poisson random variables with parameter $\theta^* > 0$, that is, $IP(Y_i = m) = (\theta^*)^m e^{-\theta^*}/m!$ for $m = 0,1,2,\ldots$. The corresponding log-likelihood can be written as
\[ L(\theta) = \sum_{i=1}^{n}\log\big(\theta^{Y_i}e^{-\theta}/Y_i!\big) = \log\theta\sum_{i=1}^{n}Y_i - n\theta - \sum_{i=1}^{n}\log(Y_i!) = S\log\theta - n\theta + R, \]
where $S = Y_1 + \ldots + Y_n$ and $R = -\sum_{i=1}^{n}\log(Y_i!)$. Here we use that $0! = 1$.
The ML estimating equation immediately yields $S/\theta = n$, or
\[ \tilde\theta = S/n. \]
For the fitted log-likelihood $L(\tilde\theta,\theta)$ this gives
\[ L(\tilde\theta, \theta) = n\tilde\theta\log(\tilde\theta/\theta) - n(\tilde\theta - \theta) = nK(\tilde\theta, \theta). \]
Here again $K(\theta,\theta') = \theta\log(\theta/\theta') - (\theta - \theta')$ is the Kullback-Leibler divergence for the Poisson law.
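The same identity can be checked numerically for the Poisson family; the sketch below (our own example, using a simple cdf-inversion sampler) verifies $L(\tilde\theta,\theta) = nK(\tilde\theta,\theta)$ on simulated data:

```python
import random
import math

# Numeric sketch (assumed setup): for i.i.d. Poisson observations check the
# identity L(theta_hat, theta) = n K(theta_hat, theta) with the Poisson
# divergence K(theta, theta') = theta log(theta/theta') - (theta - theta').
def rpois(lam, rng):
    """Draw one Poisson variate by inversion of the cdf."""
    u, k, p, cdf = rng.random(), 0, math.exp(-lam), math.exp(-lam)
    while u > cdf:
        k += 1
        p *= lam / k
        cdf += p
    return k

rng = random.Random(3)
n, theta_star = 200, 3.0
Y = [rpois(theta_star, rng) for _ in range(n)]
S = sum(Y)
theta_hat = S / n

def loglik(theta):
    # the additive term -sum(log Y_i!) cancels in likelihood ratios; dropped
    return S * math.log(theta) - n * theta

def K(theta, theta_p):
    return theta * math.log(theta / theta_p) - (theta - theta_p)

theta = 2.0
assert abs((loglik(theta_hat) - loglik(theta)) - n * K(theta_hat, theta)) < 1e-9
print("Poisson fitted log-likelihood equals n*K:", True)
```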
2.9.9 Shift of a Laplace (double exponential) law
Let $P_0$ be the symmetric distribution defined by the equations
\[ P_0(|Y_1| > y) = e^{-y/\sigma}, \qquad y \ge 0, \]
for some given $\sigma > 0$. Equivalently, one can say that the absolute value of $Y_1$ is exponential with parameter $\sigma$ under $P_0$. Now define $P_\theta$ by shifting $P_0$ by the value $\theta$. This means that
\[ P_\theta(|Y_1 - \theta| > y) = e^{-y/\sigma}, \qquad y \ge 0. \]
The density of $Y_1 - \theta$ under $P_\theta$ is $p(y) = (2\sigma)^{-1}e^{-|y|/\sigma}$. The maximum likelihood approach leads to maximizing the sum
\[ L(\theta) = -n\log(2\sigma) - \sum |Y_i - \theta|/\sigma, \]
or, equivalently, to minimizing the sum $\sum |Y_i - \theta|$:
\[ \tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta|. \tag{2.24} \]
This is just the least absolute deviation estimate given by the median of the edf:
\[ \tilde\theta = \operatorname{med}(F_n). \]
Exercise 2.9.5. Show that the median solves the problem (2.24).
Hint: suppose that $n$ is odd. Consider the ordered observations $Y_{(1)} \le Y_{(2)} \le \ldots \le Y_{(n)}$. Show that the median of $P_n$ is given by $Y_{((n+1)/2)}$. Show that this point solves (2.24).
2.10 Quasi Maximum Likelihood approach
Let $Y = (Y_1,\ldots,Y_n)^\top$ be a sample from a marginal distribution $P$. Let also $(P_\theta, \theta \in \Theta)$ be a given parametric family with the log-likelihood $\ell(y,\theta)$. The parametric approach is based on the assumption that the underlying distribution $P$ belongs to this family. The quasi maximum likelihood method applies the maximum likelihood approach to the family $(P_\theta)$ even if the underlying distribution $P$ does not belong to this family. This leads again to the estimate $\tilde\theta$ maximizing the expression $L(\theta) = \sum \ell(Y_i,\theta)$, which is called the quasi MLE. It might happen that the true distribution belongs to some other parametric family for which one can also construct the MLE. However, there can be serious reasons for applying the quasi maximum likelihood approach even in this misspecified case. One of them is that the properties of the estimate $\tilde\theta$ are essentially determined by the geometric structure of the log-likelihood. The use of a parametric family with a nice geometric structure (e.g. one whose log-likelihood is a quadratic or convex function of the parameter) can substantially reduce the algorithmic burden and improve the behavior of the method.
2.10.1 LSE as quasi likelihood estimation
Consider the model
\[ Y_i = \theta^* + \varepsilon_i \tag{2.25} \]
where $\theta^* \in I\!R$ is the parameter of interest and the $\varepsilon_i$ are random errors satisfying $IE\varepsilon_i = 0$. The assumption that the $\varepsilon_i$ are i.i.d. normal $N(0,\sigma^2)$ leads to the quasi log-likelihood
\[ L(\theta) = -\frac{n}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum (Y_i - \theta)^2. \]
Maximizing the expression $L(\theta)$ leads to minimizing the sum of the squared residuals $(Y_i - \theta)^2$:
\[ \tilde\theta = \operatorname{argmin}_\theta \sum (Y_i - \theta)^2 = \frac{1}{n}\sum Y_i. \]
This estimate is called a least squares estimate (LSE) or ordinary least squares estimate (oLSE).
Example 2.10.1. Consider the model (2.25) with heterogeneous errors, that is, the $\varepsilon_i$ are independent normal with zero mean and variances $\sigma_i^2$. The corresponding log-likelihood reads
\[ L^\circ(\theta) = -\frac{1}{2}\sum\Big\{\log(2\pi\sigma_i^2) + \frac{(Y_i - \theta)^2}{\sigma_i^2}\Big\}. \]
The MLE $\tilde\theta^\circ$ is
\[ \tilde\theta^\circ \stackrel{\rm def}{=} \operatorname{argmax}_\theta L^\circ(\theta) = N^{-1}\sum Y_i/\sigma_i^2, \qquad N = \sum \sigma_i^{-2}. \]
We now compare the estimates $\tilde\theta$ and $\tilde\theta^\circ$.
Lemma 2.10.1. The following assertions hold for the estimate $\tilde\theta$:
1. $\tilde\theta$ is unbiased: $IE_{\theta^*}\tilde\theta = \theta^*$.
2. The quadratic risk of $\tilde\theta$ is equal to the variance $\operatorname{Var}(\tilde\theta)$ given by
\[ \mathcal{R}(\tilde\theta,\theta^*) \stackrel{\rm def}{=} IE_{\theta^*}|\tilde\theta - \theta^*|^2 = \operatorname{Var}(\tilde\theta) = n^{-2}\sum \sigma_i^2. \]
3. $\tilde\theta$ is not R-efficient unless all the $\sigma_i^2$ are equal.
Now we consider the MLE $\tilde\theta^\circ$.
Lemma 2.10.2. The following assertions hold for the estimate $\tilde\theta^\circ$:
1. $\tilde\theta^\circ$ is unbiased: $IE_{\theta^*}\tilde\theta^\circ = \theta^*$.
2. The quadratic risk of $\tilde\theta^\circ$ is equal to the variance $\operatorname{Var}(\tilde\theta^\circ)$ given by
\[ \mathcal{R}(\tilde\theta^\circ,\theta^*) \stackrel{\rm def}{=} IE_{\theta^*}|\tilde\theta^\circ - \theta^*|^2 = \operatorname{Var}(\tilde\theta^\circ) = N^{-2}\sum \sigma_i^{-2} = N^{-1}. \]
3. $\tilde\theta^\circ$ is R-efficient.
Exercise 2.10.1. Check the statements of Lemmas 2.10.1 and 2.10.2.
Hint: compute the Fisher information for the model (2.25) using the additivity property:
\[ I(\theta) = \sum I^{(i)}(\theta) = \sum \sigma_i^{-2} = N, \]
where $I^{(i)}(\theta)$ is the Fisher information in the marginal model $Y_i = \theta + \varepsilon_i$ with just one observation $Y_i$. Apply the Cramer-Rao inequality to one observation of the vector $Y$.
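The gain of the weighted estimate $\tilde\theta^\circ$ over the plain mean $\tilde\theta$ can be illustrated by simulation; the sketch below (our own example with an arbitrary pattern of variances $\sigma_i^2$) compares the empirical variances with the formulas of Lemmas 2.10.1 and 2.10.2:

```python
import random

# Simulation sketch (assumed setup): under heterogeneous errors the weighted
# quasi MLE theta_circ = (sum Y_i/sigma_i^2)/N with N = sum 1/sigma_i^2 has
# variance 1/N, smaller than Var(mean) = n^{-2} sum sigma_i^2.
random.seed(4)
theta_star = 1.0
sigmas = [0.5, 0.5, 3.0, 0.5, 3.0, 0.5, 0.5, 3.0, 0.5, 0.5]
n = len(sigmas)
N = sum(s ** -2 for s in sigmas)

mean_est, wls_est = [], []
for _ in range(20000):
    Y = [theta_star + random.gauss(0.0, s) for s in sigmas]
    mean_est.append(sum(Y) / n)
    wls_est.append(sum(y / s ** 2 for y, s in zip(Y, sigmas)) / N)

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

# theoretical values: n^{-2} sum sigma_i^2 for the mean, 1/N for the MLE
assert abs(var(mean_est) - sum(s ** 2 for s in sigmas) / n ** 2) < 0.01
assert abs(var(wls_est) - 1.0 / N) < 0.01
assert var(wls_est) < var(mean_est)
print("weighted estimate has smaller variance:", True)
```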
2.10.2 LAD and robust estimation as quasi likelihood estimation
Consider again the model (2.25). The classical least squares approach faces serious problems if the available data $Y$ are contaminated with outliers. The reasons for contamination could be missing data, typing errors, etc. Unfortunately, even a single outlier can significantly disturb the sum $L(\theta)$ and thus the estimate $\tilde\theta$. A typical approach, proposed and developed by Huber, is to apply another "influence function" $\psi(Y_i - \theta)$ in the sum $L(\theta)$ in place of the squared residual $|Y_i - \theta|^2$, leading to the M-estimate
\[ \tilde\theta = \operatorname{argmin}_\theta \sum \psi(Y_i - \theta). \tag{2.26} \]
A popular $\psi$-function for robust estimation is the absolute value $|Y_i - \theta|$. The resulting estimate
\[ \tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta| \]
is called the least absolute deviation estimate, and the solution is the median of the empirical distribution $P_n$. Another proposal is called the Huber function: it is quadratic in a vicinity of zero and linear outside:
\[ \psi(x) = \begin{cases} x^2 & \text{if } |x| \le t, \\ a|x| + b & \text{otherwise.} \end{cases} \]
Exercise 2.10.2. Show that for each t > 0 , the coefficients a = a(t) and b = b(t) can
be selected to provide that ψ(x) and its derivative are continuous.
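One candidate choice (our own computation, not given in the text, obtained by matching the value and the derivative of $\psi$ at $|x| = t$) is $a(t) = 2t$ and $b(t) = -t^2$; the sketch below checks the continuity numerically:

```python
# Numeric sketch for Exercise 2.10.2 (a = 2t, b = -t^2 is one candidate
# choice obtained by matching psi and psi' at |x| = t; verify it here).
def huber(x, t, a, b):
    return x * x if abs(x) <= t else a * abs(x) + b

def dhuber(x, t, a, b):
    # derivative away from the switch point
    return 2 * x if abs(x) <= t else a * (1 if x > 0 else -1)

t = 1.345
a, b = 2 * t, -t * t
eps = 1e-7
# continuity of psi and psi' across the switch point x = t
assert abs(huber(t - eps, t, a, b) - huber(t + eps, t, a, b)) < 1e-5
assert abs(dhuber(t - eps, t, a, b) - dhuber(t + eps, t, a, b)) < 1e-5
print("psi and psi' continuous at |x| = t:", True)
```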
A remarkable fact about this approach is that every such estimate can be viewed as a quasi MLE for the model (2.25). Indeed, for a given function $\psi$, define the measure $P_\theta$ with the log-density $\ell(y,\theta) = -\psi(y - \theta)$. Then the log-likelihood is $L(\theta) = -\sum \psi(Y_i - \theta)$, and the corresponding (quasi) MLE coincides with (2.26).
Exercise 2.10.3. Suggest a $\sigma$-finite measure $\mu$ such that $\exp\{-\psi(y-\theta)\}$ is the density of $Y_i$ for the model (2.25) w.r.t. the measure $\mu$.
Hint: suppose for simplicity that
\[ C_\psi \stackrel{\rm def}{=} \int \exp\{-\psi(x)\}\, dx < \infty. \]
Show that $C_\psi^{-1}\exp\{-\psi(y-\theta)\}$ is a density w.r.t. the Lebesgue measure for any $\theta$.
Exercise 2.10.4. Show that the LAD estimate $\tilde\theta = \operatorname{argmin}_\theta \sum |Y_i - \theta|$ is the quasi MLE for the model (2.25) when the errors $\varepsilon_i$ are assumed Laplacian (double exponential) with density $p(x) = (1/2)e^{-|x|}$.
2.11 Univariate exponential families
Most parametric families considered in the previous sections are particular cases of exponential family (EF) distributions. This includes the Gaussian shift, Bernoulli, Poisson, exponential, and volatility models. The notion of an EF already appeared in the context of the Cramer-Rao inequality. Now we study such families in further detail.
We say that $\mathcal{P}$ is an EF if all measures $P_\theta \in \mathcal{P}$ are dominated by a $\sigma$-finite measure $\mu_0$ on $\mathcal{Y}$ and the density functions $p(y,\theta) = dP_\theta/d\mu_0(y)$ are of the form
\[ p(y,\theta) \stackrel{\rm def}{=} \frac{dP_\theta}{d\mu_0}(y) = p(y)\, e^{yC(\theta) - B(\theta)}. \]
Here $C(\theta)$ and $B(\theta)$ are given nondecreasing functions on $\Theta$ and $p(y)$ is a nonnegative function on $\mathcal{Y}$.
Usually one assumes some regularity conditions on the family $\mathcal{P}$. One possibility was already given when we discussed the Cramer-Rao inequality; see Definition 2.5.5. Below we assume that this condition is always fulfilled. It basically means that we can differentiate w.r.t. $\theta$ under the integral sign.
For an EF, the log-likelihood admits an especially simple representation, nearly linear in $y$:
\[ \ell(y,\theta) \stackrel{\rm def}{=} \log p(y,\theta) = yC(\theta) - B(\theta) + \log p(y), \]
so that the log-likelihood ratio for $\theta, \theta' \in \Theta$ reads
\[ \ell(y,\theta,\theta') \stackrel{\rm def}{=} \ell(y,\theta) - \ell(y,\theta') = y\big[C(\theta) - C(\theta')\big] - \big[B(\theta) - B(\theta')\big]. \]
2.11.1 Natural parametrization
Let $\mathcal{P} = (P_\theta)$ be an EF. By $Y$ we denote one observation from the distribution $P_\theta \in \mathcal{P}$. In addition to the regularity conditions, one often assumes the natural parametrization for the family $\mathcal{P}$, which means the relation $E_\theta Y = \theta$. Note that this relation is fulfilled for all the examples of EFs considered so far in the previous section. Obviously the natural parametrization is only possible under the following identifiability condition: any two different measures from the considered parametric family have different mean values. Under this condition the natural parametrization is always possible: just define $\theta$ as the expectation of $Y$. Below we use the abbreviation EFn for an exponential family with natural parametrization.
Some properties of an EFn. The natural parametrization implies an important property for the functions $B(\theta)$ and $C(\theta)$.
Lemma 2.11.1. Let $(P_\theta)$ be a naturally parameterized EF. Then
\[ B'(\theta) = \theta C'(\theta). \]
Proof. Differentiating both sides of the equation $\int p(y,\theta)\mu_0(dy) = 1$ w.r.t. $\theta$ yields
\[ 0 = \int\big\{yC'(\theta) - B'(\theta)\big\}p(y,\theta)\mu_0(dy) = \int\big\{yC'(\theta) - B'(\theta)\big\}P_\theta(dy) = \theta C'(\theta) - B'(\theta), \]
and the result follows.
The next lemma computes the important characteristics of a natural EF: the Kullback-Leibler divergence $K(\theta,\theta') = E_\theta\log\big(p(Y,\theta)/p(Y,\theta')\big)$, the Fisher information $I(\theta) \stackrel{\rm def}{=} E_\theta|\ell'(Y,\theta)|^2$, and the rate function $m(\mu,\theta,\theta') = -\log E_\theta\exp\{\mu\ell(Y,\theta,\theta')\}$.
Lemma 2.11.2. Let $(P_\theta)$ be an EFn. Then with $\theta, \theta' \in \Theta$ fixed, it holds for
• the Kullback-Leibler divergence $K(\theta,\theta') = E_\theta\log\big(p(Y,\theta)/p(Y,\theta')\big)$:
\[ K(\theta,\theta') = \int\log\frac{p(y,\theta)}{p(y,\theta')}P_\theta(dy) = \big\{C(\theta) - C(\theta')\big\}\int y\,P_\theta(dy) - \big\{B(\theta) - B(\theta')\big\} = \theta\big\{C(\theta) - C(\theta')\big\} - \big\{B(\theta) - B(\theta')\big\}; \tag{2.27} \]
• the Fisher information $I(\theta) \stackrel{\rm def}{=} E_\theta|\ell'(Y,\theta)|^2$:
\[ I(\theta) = C'(\theta); \]
• the rate function $m(\mu,\theta,\theta') = -\log E_\theta\exp\{\mu\ell(Y,\theta,\theta')\}$:
\[ m(\mu,\theta,\theta') = K\big(\theta, \theta + \mu(\theta' - \theta)\big); \]
• the variance $\operatorname{Var}_\theta(Y)$:
\[ \operatorname{Var}_\theta(Y) = 1/I(\theta) = 1/C'(\theta). \tag{2.28} \]
Proof. Differentiating the equality
\[ 0 \equiv \int (y - \theta)\,P_\theta(dy) = \int (y - \theta)\,e^{\ell(y,\theta)}\mu_0(dy) \]
w.r.t. $\theta$ implies, in view of Lemma 2.11.1,
\[ 1 \equiv IE_\theta\big[(Y - \theta)\big\{C'(\theta)Y - B'(\theta)\big\}\big] = C'(\theta)\,IE_\theta(Y - \theta)^2. \]
This yields $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$ and leads to the following representation of the Fisher information:
\[ I(\theta) = \operatorname{Var}_\theta\big[\ell'(Y,\theta)\big] = \operatorname{Var}_\theta\big[C'(\theta)Y - B'(\theta)\big] = \big[C'(\theta)\big]^2\operatorname{Var}_\theta(Y) = C'(\theta). \]
Exercise 2.11.1. Check the equations for the Kullback-Leibler divergence and Fisher
information from Lemma 2.11.2.
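A quick numerical sanity check of Lemma 2.11.2 (our own example, using the Bernoulli family, for which $C(\theta) = \log\{\theta/(1-\theta)\}$) verifies the relation $\operatorname{Var}_\theta(Y) = 1/C'(\theta)$:

```python
# Numeric sketch (Bernoulli as EFn, an example we chose): with
# C(theta) = log(theta/(1-theta)), Lemma 2.11.2 gives
# Var_theta(Y) = 1/C'(theta) = theta(1-theta) and I(theta) = C'(theta).
def C_prime(theta):
    # derivative of C(theta) = log(theta) - log(1 - theta)
    return 1.0 / theta + 1.0 / (1.0 - theta)  # = 1/(theta(1-theta))

for theta in (0.1, 0.3, 0.5, 0.8):
    var_bernoulli = theta * (1.0 - theta)
    assert abs(var_bernoulli - 1.0 / C_prime(theta)) < 1e-12
print("Var_theta(Y) = 1/C'(theta) for the Bernoulli family:", True)
```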
MLE and maximum likelihood for an EFn. Now we discuss maximum likelihood estimation for a sample from an EFn. The log-likelihood can be represented in the form
\[ L(\theta) = \sum_{i=1}^{n}\log p(Y_i,\theta) = C(\theta)\sum_{i=1}^{n}Y_i - B(\theta)\sum_{i=1}^{n}1 + \sum_{i=1}^{n}\log p(Y_i) = SC(\theta) - nB(\theta) + R, \tag{2.29} \]
where
\[ S = \sum_{i=1}^{n}Y_i, \qquad R = \sum_{i=1}^{n}\log p(Y_i). \]
The remainder term $R$ is unimportant because it does not depend on $\theta$ and thus does not enter the likelihood ratio. The maximum likelihood estimate $\tilde\theta$ is defined by maximizing $L(\theta)$ w.r.t. $\theta$, that is,
\[ \tilde\theta = \operatorname{argmax}_{\theta\in\Theta}L(\theta) = \operatorname{argmax}_{\theta\in\Theta}\big\{SC(\theta) - nB(\theta)\big\}. \]
In the case of an EF with the natural parametrization, this optimization problem admits a closed form solution given by the next theorem.
Theorem 2.11.3. Let $(P_\theta)$ be an EFn. Then the MLE $\tilde\theta$ fulfills
\[ \tilde\theta = S/n = n^{-1}\sum_{i=1}^{n}Y_i. \]
It holds
\[ IE_\theta\tilde\theta = \theta, \qquad \operatorname{Var}_\theta(\tilde\theta) = [nI(\theta)]^{-1} = [nC'(\theta)]^{-1}, \]
so that $\tilde\theta$ is R-efficient. Moreover, the fitted log-likelihood $L(\tilde\theta,\theta) \stackrel{\rm def}{=} L(\tilde\theta) - L(\theta)$ satisfies for any $\theta \in \Theta$:
\[ L(\tilde\theta, \theta) = nK(\tilde\theta, \theta). \tag{2.30} \]
Proof. Maximization of $L(\theta)$ w.r.t. $\theta$ leads to the estimating equation $nB'(\theta) - SC'(\theta) = 0$. This and the identity $B'(\theta) = \theta C'(\theta)$ yield the MLE
\[ \tilde\theta = S/n. \]
The variance $\operatorname{Var}_\theta(\tilde\theta)$ is computed using (2.28) from Lemma 2.11.2. The formula (2.27) for the Kullback-Leibler divergence and (2.29) yield the representation (2.30) for the fitted log-likelihood $L(\tilde\theta,\theta)$ for any $\theta \in \Theta$.
One can see that the estimate θ is the mean of the Yi ’s. As for the Gaussian
shift model, this estimate can be motivated by the fact that the expectation of every
observation Yi under Pθ is just θ and by the law of large numbers the empirical mean
converges to its expectation as the sample size n grows.
2.11.2 Canonical parametrization
Another useful representation of an EF is given by the so-called canonical parametrization. We say that $\upsilon$ is the canonical parameter for this EF if the density of each measure $P_\upsilon$ w.r.t. the dominating measure $\mu_0$ is of the form
\[ p(y,\upsilon) \stackrel{\rm def}{=} \frac{dP_\upsilon}{d\mu_0}(y) = p(y)\exp\big\{y\upsilon - d(\upsilon)\big\}. \]
Here $d(\upsilon)$ is a given convex function on $\Theta$ and $p(y)$ is a nonnegative function on $\mathcal{Y}$. The abbreviation EFc will indicate an EF with the canonical parametrization.
Some properties of an EFc. The next relation is an obvious corollary of the definition:
Lemma 2.11.4. An EFn $(P_\theta)$ always permits a unique canonical representation. The canonical parameter $\upsilon$ is related to the natural parameter $\theta$ by $\upsilon = C(\theta)$, $d(\upsilon) = B(\theta)$, and $\theta = d'(\upsilon)$.
Proof. The first two relations follow from the definition. They imply $B'(\theta) = d'(\upsilon)\cdot d\upsilon/d\theta = d'(\upsilon)\cdot C'(\theta)$, and the last statement follows from $B'(\theta) = \theta C'(\theta)$.
The log-likelihood ratio $\ell(y,\upsilon,\upsilon_1)$ for an EFc reads as
\[ \ell(Y,\upsilon,\upsilon_1) = Y(\upsilon - \upsilon_1) - d(\upsilon) + d(\upsilon_1). \]
The next lemma collects some useful facts about an EFc.
Lemma 2.11.5. Let $\mathcal{P} = (P_\upsilon, \upsilon \in U)$ be an EFc and let the function $d(\cdot)$ be two times continuously differentiable. Then it holds for any $\upsilon, \upsilon_1 \in U$:
(i). The mean $E_\upsilon Y$ and the variance $\operatorname{Var}_\upsilon(Y)$ fulfill
\[ E_\upsilon Y = d'(\upsilon), \qquad \operatorname{Var}_\upsilon(Y) = E_\upsilon(Y - E_\upsilon Y)^2 = d''(\upsilon). \]
(ii). The Fisher information $I(\upsilon) \stackrel{\rm def}{=} E_\upsilon|\ell'(Y,\upsilon)|^2$ satisfies
\[ I(\upsilon) = d''(\upsilon). \]
(iii). The Kullback-Leibler divergence $K_c(\upsilon,\upsilon_1) = E_\upsilon\ell(Y,\upsilon,\upsilon_1)$ satisfies
\[ K_c(\upsilon,\upsilon_1) = \int\log\frac{p(y,\upsilon)}{p(y,\upsilon_1)}P_\upsilon(dy) = d'(\upsilon)(\upsilon - \upsilon_1) - \big\{d(\upsilon) - d(\upsilon_1)\big\} = d''(\breve\upsilon)\,(\upsilon_1 - \upsilon)^2/2, \]
where $\breve\upsilon$ is a point between $\upsilon$ and $\upsilon_1$. Moreover, for $\upsilon \le \upsilon_1 \in U$
\[ K_c(\upsilon,\upsilon_1) = \int_{\upsilon}^{\upsilon_1}(\upsilon_1 - u)\,d''(u)\,du. \]
(iv). The rate function $m(\mu,\upsilon_1,\upsilon) \stackrel{\rm def}{=} -\log IE_\upsilon\exp\{\mu\ell(Y,\upsilon_1,\upsilon)\}$ fulfills
\[ m(\mu,\upsilon_1,\upsilon) = \mu K_c(\upsilon,\upsilon_1) - K_c\big(\upsilon, \upsilon + \mu(\upsilon_1 - \upsilon)\big). \]
Proof. Differentiating the equation $\int p(y,\upsilon)\mu_0(dy) = 1$ w.r.t. $\upsilon$ yields
\[ \int\big\{y - d'(\upsilon)\big\}p(y,\upsilon)\mu_0(dy) = 0, \]
that is, $E_\upsilon Y = d'(\upsilon)$. The expression for the variance can be proved by differentiating this equation once more. Similarly one can check (ii). The item (iii) can be checked by simple algebra, and (iv) follows from (i).
Further, for any $\upsilon, \upsilon_1 \in U$, it holds
\[ \ell(Y,\upsilon_1,\upsilon) - E_\upsilon\ell(Y,\upsilon_1,\upsilon) = (\upsilon_1 - \upsilon)\big\{Y - d'(\upsilon)\big\}, \]
and with $u = \mu(\upsilon_1 - \upsilon)$
\[ \log E_\upsilon\exp\big\{u\big(Y - d'(\upsilon)\big)\big\} = -ud'(\upsilon) + d(\upsilon + u) - d(\upsilon) + \log E_\upsilon\exp\big\{uY - d(\upsilon + u) + d(\upsilon)\big\} = d(\upsilon + u) - d(\upsilon) - ud'(\upsilon) = K_c(\upsilon, \upsilon + u), \]
because
\[ E_\upsilon\exp\big\{uY - d(\upsilon + u) + d(\upsilon)\big\} = E_\upsilon\frac{dP_{\upsilon+u}}{dP_\upsilon} = 1, \]
and (iv) follows by (iii).
Table 2.1 presents the canonical parameter and the Fisher information for the exam-
ples of exponential families from Section 2.9.
Table 2.1: $\upsilon(\theta)$, $d(\upsilon)$, $I(\upsilon) = d''(\upsilon)$, and $\theta = \theta(\upsilon)$ for the examples from Section 2.9.

Model                 | υ               | d(υ)             | I(υ)            | θ(υ)
Gaussian regression   | θ/σ²            | υ²σ²/2           | σ²              | σ²υ
Bernoulli model       | log(θ/(1−θ))    | log(1 + e^υ)     | e^υ/(1 + e^υ)²  | e^υ/(1 + e^υ)
Poisson model         | log θ           | e^υ              | e^υ             | e^υ
Exponential model     | 1/θ             | −log υ           | 1/υ²            | 1/υ
Volatility model      | −1/(2θ)         | −(1/2) log(−2υ)  | 1/(2υ²)         | −1/(2υ)
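The entries of Table 2.1 can be checked numerically via finite differences; the sketch below (our own check of the Bernoulli row) verifies $\theta(\upsilon) = d'(\upsilon)$ and $I(\upsilon) = d''(\upsilon)$:

```python
import math

# Numeric sketch: check the Bernoulli row of Table 2.1 by finite differences,
# i.e. theta(upsilon) = d'(upsilon) and I(upsilon) = d''(upsilon) for
# d(upsilon) = log(1 + e^upsilon).
def d(u):
    return math.log(1.0 + math.exp(u))

h = 1e-5
for u in (-1.0, 0.0, 0.7):
    d1 = (d(u + h) - d(u - h)) / (2 * h)            # ~ d'(u)
    d2 = (d(u + h) - 2 * d(u) + d(u - h)) / h ** 2  # ~ d''(u)
    theta = math.exp(u) / (1.0 + math.exp(u))       # table entry theta(upsilon)
    info = math.exp(u) / (1.0 + math.exp(u)) ** 2   # table entry I(upsilon)
    assert abs(d1 - theta) < 1e-8
    assert abs(d2 - info) < 1e-4
print("Bernoulli row of Table 2.1 verified numerically:", True)
```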
Exercise 2.11.2. Check (iii) and (iv) in Lemma 2.11.5.
Exercise 2.11.3. Check the entries of Table 2.1.
Exercise 2.11.4. Check that $K_c(\upsilon,\upsilon') = K\big(\theta(\upsilon), \theta(\upsilon')\big)$.
Exercise 2.11.5. Plot $K_c(\upsilon^*,\upsilon)$ as a function of $\upsilon$ for the families from Table 2.1.
Maximum likelihood estimation for an EFc. The structure of the log-likelihood in the case of the canonical parametrization is particularly simple:
\[ L(\upsilon) = \sum_{i=1}^{n}\log p(Y_i,\upsilon) = \upsilon\sum_{i=1}^{n}Y_i - d(\upsilon)\sum_{i=1}^{n}1 + \sum_{i=1}^{n}\log p(Y_i) = S\upsilon - nd(\upsilon) + R, \]
where
\[ S = \sum_{i=1}^{n}Y_i, \qquad R = \sum_{i=1}^{n}\log p(Y_i). \]
Again, as in the case of an EFn, we can ignore the remainder term $R$. The estimating equation $dL(\upsilon)/d\upsilon = 0$ for the maximum likelihood estimate $\tilde\upsilon$ reads as
\[ d'(\tilde\upsilon) = S/n. \]
This and the relation $\theta = d'(\upsilon)$ lead to the following result.
Theorem 2.11.6. The maximum likelihood estimates $\tilde\theta$ and $\tilde\upsilon$ for the natural and canonical parametrizations are related by the equations
\[ \tilde\theta = d'(\tilde\upsilon), \qquad \tilde\upsilon = C(\tilde\theta). \]
The next result describes the structure of the fitted log-likelihood and basically repeats the result of Theorem 2.11.3.
Theorem 2.11.7. Let $(P_\upsilon)$ be an EF with canonical parametrization. Then for any $\upsilon \in U$ the fitted log-likelihood $L(\tilde\upsilon,\upsilon) \stackrel{\rm def}{=} \max_{\upsilon'}L(\upsilon',\upsilon)$ satisfies
\[ L(\tilde\upsilon, \upsilon) = nK_c(\tilde\upsilon, \upsilon). \]
Exercise 2.11.6. Check the statement of Theorem 2.11.7.
2.11.3 Deviation probabilities for the maximum likelihood
Let $Y_1,\ldots,Y_n$ be i.i.d. observations from an EF $\mathcal{P}$. This section presents a probability bound for the fitted likelihood. To be more specific, we assume that $\mathcal{P}$ is canonically parameterized, $\mathcal{P} = (P_\upsilon)$. However, the bound applies to the natural and any other parametrization because the value of the maximum of the likelihood process $L(\theta)$ does not depend on the choice of parametrization. The log-likelihood ratio $L(\upsilon',\upsilon)$ is given by the expression (2.29), and its maximum over $\upsilon'$ leads to the fitted log-likelihood $L(\tilde\upsilon,\upsilon) = nK_c(\tilde\upsilon,\upsilon)$.
Our first result concerns a deviation bound for $L(\tilde\upsilon,\upsilon)$. It utilizes the representation for the fitted log-likelihood given by Theorem 2.11.3. As usual, we assume that the family $\mathcal{P}$ is regular. In addition, we require the following condition.
$(\mathcal{P}c)$ $\mathcal{P} = (P_\upsilon, \upsilon \in U \subseteq I\!R)$ is a regular EF. The parameter set $U$ is convex. The function $d(\upsilon)$ is two times continuously differentiable, and the Fisher information $I(\upsilon) = d''(\upsilon)$ satisfies $I(\upsilon) > 0$ for all $\upsilon$.
The condition $(\mathcal{P}c)$ implies that for any compact set $U_0$ there is a constant $a = a(U_0) > 0$ such that
\[ \big|I(\upsilon_1)/I(\upsilon_2)\big|^{1/2} \le a, \qquad \upsilon_1, \upsilon_2 \in U_0. \]
Theorem 2.11.8. Let the $Y_i$ be i.i.d. from a distribution $P_{\upsilon^*}$ which belongs to an EFc satisfying $(\mathcal{P}c)$. Then for any $z > 0$
\[ IP_{\upsilon^*}\big(L(\tilde\upsilon,\upsilon^*) > z\big) = IP_{\upsilon^*}\big(nK_c(\tilde\upsilon,\upsilon^*) > z\big) \le 2e^{-z}. \]
Proof. The proof is based on two properties of the log-likelihood. The first one is that the expectation of the likelihood ratio is just one: $IE_{\upsilon^*}\exp L(\upsilon,\upsilon^*) = 1$. This and the exponential Markov inequality imply for $z \ge 0$
\[ IP_{\upsilon^*}\big(L(\upsilon,\upsilon^*) \ge z\big) \le e^{-z}. \tag{2.31} \]
The second property is specific to the considered univariate EF and is based on geometric properties of the log-likelihood function: linearity in the observations $Y_i$ and convexity in the parameter $\upsilon$. We formulate this important fact in a separate lemma.
Lemma 2.11.9. Let the EFc $\mathcal{P}$ fulfill $(\mathcal{P}c)$. For given $z$ and any $\upsilon_0 \in U$, there exist two values $\upsilon^+ > \upsilon_0$ and $\upsilon^- < \upsilon_0$ satisfying $K_c(\upsilon^{\pm},\upsilon_0) = z/n$ such that
\[ \{L(\tilde\upsilon,\upsilon_0) > z\} \subseteq \{L(\upsilon^+,\upsilon_0) > z\} \cup \{L(\upsilon^-,\upsilon_0) > z\}. \]
Proof. It holds
\[ \{L(\tilde\upsilon,\upsilon_0) > z\} = \Big\{\sup_\upsilon\big[S(\upsilon - \upsilon_0) - n\big\{d(\upsilon) - d(\upsilon_0)\big\}\big] > z\Big\} \subseteq \Big\{S > \inf_{\upsilon>\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon - \upsilon_0}\Big\} \cup \Big\{-S > \inf_{\upsilon<\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon_0 - \upsilon}\Big\}. \]
Define for every $u > 0$
\[ f(u) = \frac{z + n\big\{d(\upsilon_0 + u) - d(\upsilon_0)\big\}}{u}. \]
This function attains its minimum at a point $u$ satisfying the equation
\[ z/n + d(\upsilon_0 + u) - d(\upsilon_0) - d'(\upsilon_0 + u)\,u = 0 \]
or, equivalently,
\[ K_c(\upsilon_0 + u, \upsilon_0) = z/n. \]
The condition $(\mathcal{P}c)$ provides that there is only one solution $u \ge 0$ of this equation.
Exercise 2.11.7. Check that the equation $K_c(\upsilon_0 + u, \upsilon_0) = z/n$ has only one positive solution for any $z > 0$.
Hint: use that $K_c(\upsilon_0 + u, \upsilon_0)$ is a convex function of $u$ with minimum at $u = 0$.
Now, with $\upsilon^+ = \upsilon_0 + u$, it holds
\[ \Big\{S > \inf_{\upsilon>\upsilon_0}\frac{z + n\big[d(\upsilon) - d(\upsilon_0)\big]}{\upsilon - \upsilon_0}\Big\} = \Big\{S > \frac{z + n\big[d(\upsilon^+) - d(\upsilon_0)\big]}{\upsilon^+ - \upsilon_0}\Big\} \subseteq \{L(\upsilon^+,\upsilon_0) > z\}. \]
Similarly,
\[ \Big\{-S > \inf_{\upsilon<\upsilon_0}\frac{z + n\big\{d(\upsilon) - d(\upsilon_0)\big\}}{\upsilon_0 - \upsilon}\Big\} = \Big\{-S > \frac{z + n\big[d(\upsilon^-) - d(\upsilon_0)\big]}{\upsilon_0 - \upsilon^-}\Big\} \subseteq \{L(\upsilon^-,\upsilon_0) > z\} \]
for some $\upsilon^- < \upsilon_0$.
The assertion of the theorem is now easy to obtain. Indeed,
\[ IP_{\upsilon^*}\big(L(\tilde\upsilon,\upsilon^*) \ge z\big) \le IP_{\upsilon^*}\big(L(\upsilon^+,\upsilon^*) \ge z\big) + IP_{\upsilon^*}\big(L(\upsilon^-,\upsilon^*) \ge z\big) \le 2e^{-z}, \]
yielding the result.
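The bound of Theorem 2.11.8 can also be illustrated by simulation. The sketch below (our own example, using the Bernoulli family; the bound holds in any parametrization, see Lemma 2.11.12 below) compares the empirical tail of $nK(\tilde\theta,\theta^*)$ with $2e^{-z}$:

```python
import random
import math

# Monte Carlo sketch (assumed Bernoulli example): check the deviation bound
# P(n K(theta_hat, theta*) > z) <= 2 exp(-z), where K is the Bernoulli
# Kullback-Leibler divergence.
def K(p, q):
    def xlog(x, y):
        return 0.0 if x == 0.0 else x * math.log(x / y)
    return xlog(p, q) + xlog(1.0 - p, 1.0 - q)

random.seed(5)
n, theta_star, M = 40, 0.3, 20000
excess = []
for _ in range(M):
    theta_hat = sum(random.random() < theta_star for _ in range(n)) / n
    excess.append(n * K(theta_hat, theta_star))  # fitted log-likelihood

for z in (0.5, 1.0, 2.0, 3.0):
    emp_tail = sum(e > z for e in excess) / M
    assert emp_tail <= 2.0 * math.exp(-z)
print("deviation bound 2e^{-z} holds empirically:", True)
```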
Exercise 2.11.8. Let $(P_\upsilon)$ be a Gaussian shift experiment, that is, $P_\upsilon = N(\upsilon, 1)$.
• Check that $L(\tilde\upsilon,\upsilon) = n|\tilde\upsilon - \upsilon|^2/2$;
• Given $z \ge 0$, find the points $\upsilon^+$ and $\upsilon^-$ such that
\[ \{L(\tilde\upsilon,\upsilon^*) > z\} \subseteq \{L(\upsilon^+,\upsilon^*) > z\} \cup \{L(\upsilon^-,\upsilon^*) > z\}. \]
• Plot the mentioned sets $\{\upsilon : L(\tilde\upsilon,\upsilon) > z\}$, $\{\upsilon : L(\upsilon^+,\upsilon) > z\}$, and $\{\upsilon : L(\upsilon^-,\upsilon) > z\}$ as functions of $\upsilon$ for a fixed $S = \sum Y_i$.
Remark 2.11.1. Note that the mentioned result only utilizes the geometric structure of the univariate EFc. The most important feature of the log-likelihood ratio $L(\upsilon,\upsilon^*) = S(\upsilon - \upsilon^*) - n\big\{d(\upsilon) - d(\upsilon^*)\big\}$ is its linearity w.r.t. the stochastic term $S$. This allows us to replace the maximum over the whole set $U$ by the maximum over the set consisting of the two points $\upsilon^{\pm}$. Note that the proof does not rely on the distribution of the observations $Y_i$. In particular, Lemma 2.11.9 continues to hold even within the quasi likelihood approach when $L(\upsilon)$ is not the true log-likelihood. However, the bound (2.31) relies on the nature of $L(\upsilon,\upsilon^*)$. Namely, it utilizes that $IE_{\upsilon^*}\exp\{L(\upsilon^{\pm},\upsilon^*)\} = 1$, which is generally false in the quasi likelihood setup. Nevertheless, the exponential bound can be extended to the quasi likelihood approach under the condition of bounded exponential moments for $L(\upsilon,\upsilon^*)$: for some $\mu > 0$, it should hold $IE\exp\{\mu L(\upsilon,\upsilon^*)\} = C(\mu) < \infty$.
Theorem 2.11.8 yields a simple construction of a confidence interval for the parameter $\upsilon^*$ and the concentration property of the MLE $\tilde\upsilon$.
Theorem 2.11.10. Let the $Y_i$ be i.i.d. from $P_{\upsilon^*} \in \mathcal{P}$ with $\mathcal{P}$ satisfying $(\mathcal{P}c)$.
1. If $z_\alpha$ satisfies $e^{-z_\alpha} \le \alpha/2$, then
\[ \mathcal{E}(z_\alpha) = \big\{\upsilon : nK_c(\tilde\upsilon, \upsilon) \le z_\alpha\big\} \]
is an $\alpha$-confidence set for the parameter $\upsilon^*$.
2. Define for any $z > 0$ the set $\mathcal{A}(z,\upsilon^*) = \{\upsilon : K_c(\upsilon,\upsilon^*) \le z/n\}$. Then
\[ IP_{\upsilon^*}\big(\tilde\upsilon \notin \mathcal{A}(z,\upsilon^*)\big) \le 2e^{-z}. \]
The second assertion of the theorem claims that the estimate $\tilde\upsilon$ belongs with high probability to the vicinity $\mathcal{A}(z,\upsilon^*)$ of the central point $\upsilon^*$ defined via the Kullback-Leibler divergence. Due to Lemma 2.11.5(iii), $K_c(\upsilon,\upsilon^*) \approx I(\upsilon^*)(\upsilon - \upsilon^*)^2/2$, where $I(\upsilon^*)$ is the Fisher information at $\upsilon^*$. Hence this vicinity is an interval around $\upsilon^*$ of length of order $n^{-1/2}$. In other words, this result implies the root-n consistency of $\tilde\upsilon$.
The deviation bound for the fitted log-likelihood from Theorem 2.11.8 can be viewed as a bound for the normalized loss of the estimate $\tilde\upsilon$. Indeed, define the loss function $\wp(\upsilon',\upsilon) = K_c^{1/2}(\upsilon',\upsilon)$. Then Theorem 2.11.8 yields that the loss is with high probability bounded by $\sqrt{z/n}$, provided that $z$ is sufficiently large. Similarly one can establish a bound for the risk.
Theorem 2.11.11. Let the $Y_i$ be i.i.d. from the distribution $P_{\upsilon^*}$ which belongs to a canonically parameterized EF satisfying $(\mathcal{P}c)$. The following properties hold:
(i). For any $r > 0$ there is a constant $\mathfrak{r}_r$ such that
\[ IE_{\upsilon^*}L^r(\tilde\upsilon,\upsilon^*) = n^r IE_{\upsilon^*}K_c^r(\tilde\upsilon,\upsilon^*) \le \mathfrak{r}_r. \]
(ii). For every $\lambda < 1$
\[ IE_{\upsilon^*}\exp\big\{\lambda L(\tilde\upsilon,\upsilon^*)\big\} = IE_{\upsilon^*}\exp\big\{\lambda nK_c(\tilde\upsilon,\upsilon^*)\big\} \le (1+\lambda)/(1-\lambda). \]
Proof. By Theorem 2.11.8
\[ IE_{\upsilon^*}L^r(\tilde\upsilon,\upsilon^*) = -\int_{z\ge0}z^r\,dIP_{\upsilon^*}\big\{L(\tilde\upsilon,\upsilon^*) > z\big\} = r\int_{z\ge0}z^{r-1}IP_{\upsilon^*}\big\{L(\tilde\upsilon,\upsilon^*) > z\big\}\,dz \le r\int_{z\ge0}2z^{r-1}e^{-z}\,dz, \]
and the first assertion is fulfilled with $\mathfrak{r}_r = 2r\int_{z\ge0}z^{r-1}e^{-z}\,dz$. The assertion (ii) is proved similarly.
Deviation bounds for other parametrizations. The results for the maximum likelihood and their corollaries have been stated for an EFc. An immediate question that arises in this respect is whether the use of the canonical parametrization is essential. The answer is "no": a similar result can be stated for any EF, whatever parametrization is used. This fact is based on the simple observation that the maximum likelihood is the value of the maximum of the likelihood process; this value does not depend on the parametrization.
Lemma 2.11.12. Let $(P_\theta)$ be an EF. Then for any $\theta$
\[ L(\tilde\theta,\theta) = nK(P_{\tilde\theta}, P_\theta). \tag{2.32} \]
Exercise 2.11.9. Check the result of Lemma 2.11.12.
Hint: use that both sides of (2.32) depend only on the measures $P_{\tilde\theta}, P_\theta$ and not on the parametrization.
Below we write, as before, $K(\tilde\theta,\theta)$ instead of $K(P_{\tilde\theta}, P_\theta)$. The property (2.32) and the exponential bound of Theorem 2.11.8 imply the bound for a general EF:
Theorem 2.11.13. Let $(P_\theta)$ be a univariate EF. Then for any $z > 0$
\[ IP_{\theta^*}\big(L(\tilde\theta,\theta^*) > z\big) = IP_{\theta^*}\big(nK(\tilde\theta,\theta^*) > z\big) \le 2e^{-z}. \]
This result allows us to build confidence sets for the parameter $\theta^*$ and concentration sets for the MLE $\tilde\theta$ in terms of the Kullback-Leibler divergence:
\[ \mathcal{A}(z,\theta^*) = \{\theta : K(\theta,\theta^*) \le z/n\}, \qquad \mathcal{E}(z) = \{\theta : K(\tilde\theta,\theta) \le z/n\}. \]
Corollary 2.11.14. Let $(P_\theta)$ be an EF. If $e^{-z_\alpha} = \alpha/2$, then
\[ IP_{\theta^*}\big(\tilde\theta \notin \mathcal{A}(z_\alpha,\theta^*)\big) \le \alpha, \qquad IP_{\theta^*}\big(\mathcal{E}(z_\alpha) \not\ni \theta^*\big) \le \alpha. \]
Moreover, for any $r > 0$
\[ IE_{\theta^*}L^r(\tilde\theta,\theta^*) = n^r IE_{\theta^*}K^r(\tilde\theta,\theta^*) \le \mathfrak{r}_r. \]
Asymptotic against likelihood-based approach The asymptotic approach recommends
applying symmetric confidence and concentration sets with width of order
[nI(θ∗)]^{−1/2} :

An(z, θ∗) = {θ : I(θ∗) (θ − θ∗)² ≤ 2z/n},
En(z) = {θ : I(θ∗) (θ − θ̃)² ≤ 2z/n},
E′n(z) = {θ : I(θ̃) (θ − θ̃)² ≤ 2z/n}.

Asymptotically, i.e. for large n , these sets do approximately the same job as the
non-asymptotic sets A(z, θ∗) and E(z) . However, the difference for finite samples can
be quite significant. In particular, in some cases, e.g. the Bernoulli or Poisson families,
the sets An(z, θ∗) and E′n(z) may extend beyond the parameter set Θ .
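This boundary effect is easy to see numerically. The following sketch (θ∗, n, and z are illustrative values) computes both sets for a Bernoulli family with θ∗ close to zero: the asymptotic set An(z, θ∗) sticks out below zero, whereas the KL-based set stays inside Θ = (0, 1) by construction:

```python
import numpy as np

# Compare the KL-based set A(z, theta*) with its asymptotic counterpart
# A_n(z, theta*) for a Bernoulli family.  theta*, n and z are illustrative.
theta_star, n, z = 0.05, 20, 3.0

def kl_bernoulli(t, s):
    """K(P_t, P_s) for Bernoulli laws with success probabilities t and s."""
    return t * np.log(t / s) + (1 - t) * np.log((1 - t) / (1 - s))

# Asymptotic set: I(theta*) (theta - theta*)^2 <= 2z/n with I(t) = 1/(t(1-t)).
half_width = np.sqrt(2 * z * theta_star * (1 - theta_star) / n)
asym_lo, asym_hi = theta_star - half_width, theta_star + half_width

# KL-based set {theta : K(theta, theta*) <= z/n}, located on a grid in (0,1).
grid = np.linspace(1e-4, 1 - 1e-4, 100_000)
inside = grid[kl_bernoulli(grid, theta_star) <= z / n]
kl_lo, kl_hi = inside.min(), inside.max()

print(f"asymptotic set: [{asym_lo:.3f}, {asym_hi:.3f}]  (extends below 0)")
print(f"KL-based set:   [{kl_lo:.3f}, {kl_hi:.3f}]  (stays inside (0,1))")
```

Note also that the two sets are asymmetric around θ∗ in quite different ways, which is exactly the finite-sample discrepancy discussed above.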
Chapter 3
Regression Estimation
This chapter discusses the estimation problem for the regression model. First the linear
regression model is considered, then generalized linear modeling is discussed. We also
mention median and quantile regression.
3.1 Regression model
The (mean) regression model can be written in the form IE(Y |X) = f(X) , or equiva-
lently,
Y = f(X) + ε, (3.1)
where Y is the dependent (explained) variable and X is the explanatory variable (regressor),
which can be multidimensional. The target of analysis is the systematic dependence
of the explained variable Y on the explanatory variable X . The regression function
f describes the mean of Y as a function of X . The value ε can be
treated as an individual deviation (error). It is usually assumed to be random with zero
mean. Below we discuss the components of the regression model (3.1) in more detail.
3.1.1 Observations
In almost all practical situations, regression analysis is performed on the basis of available
data (observations) given in the form of a sample of pairs (Xi, Yi) for i = 1, . . . , n , where
n is the sample size. Here Y1, . . . , Yn are observed values of the regression variable Y
and X1, . . . , Xn are the corresponding values of the explanatory variable X . For each
observation Yi , the regression model reads as:
Yi = f(Xi) + εi
where εi is the individual error of the i th observation.
3.1.2 Design
The set X1, . . . , Xn of the regressor’s values is called a design. The set X of all possible
values of the regressor X is called the design space. If this set X is compact, then one
speaks of a compactly supported design.
The nature of the design can be different for different statistical models. However,
it is important to mention that the design is always observable. Two kinds of design
assumptions are usually used in statistical modeling. A deterministic design assumes
that the points X1, . . . , Xn are nonrandom and given in advance. Here are typical
examples:
Example 3.1.1. [Time series] Let Yt0 , Yt0+1, . . . , YT be a time series. The time points
t0, t0 + 1, . . . , T build a regular deterministic design. The regression function f explains
the trend of the time series Yt as a function of time.
Example 3.1.2. [Imaging] Let Yij be the observed grey value at the pixel (i, j) of an
image. The coordinate Xij of this pixel is the corresponding design value. The regression
function f(Xij) gives the true image value at Xij which is to be recovered from the
noisy observations Yij .
If the design is supported on a cube in IRd and the design points Xi form a grid in
this cube, then the design is called equidistant. An important feature of such a design
is that the number NA of design points in any “massive” subset A of the unit cube is
nearly the volume VA of this subset multiplied by the sample size n : NA ≈ nVA . Design
regularity means that the value NA is nearly proportional to nVA , that is, NA ≈ cnVA
for some positive constant c which may depend on the set A .
In some applications, it is natural to assume that the design values Xi are randomly
drawn from some design distribution. Typical examples are given by sociological studies.
In this case one speaks of a random design. The design values X1, . . . , Xn are assumed
to be independent and identically distributed from a law PX on the design space X
which is a subset of the Euclidean space IRd . The design variables X are also assumed
to be independent of the observations Y .
One special case of random design is the uniform design when the design distribution
is uniform on the unit cube in IRd . The uniform design possesses an important property
similar to that of an equidistant design: the number of design points in a “massive” subset of
the unit cube is on average close to the volume of this set multiplied by n . The random
design is called regular on X if the design distribution is absolutely continuous with
respect to the Lebesgue measure and the design density p(x) = dPX(x)/dλ is positive
and continuous on X . This again ensures with a probability close to one the regularity
property NA ≈ cnVA with c = p(x) for some x ∈ A .
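The regularity property NA ≈ nVA for a uniform design is easy to check empirically; in the sketch below the box A and the sample size are arbitrary choices:

```python
import numpy as np

# Empirical check of the regularity property N_A ~ n * Vol(A) for a uniform
# random design on the unit square.  The subset A below is an arbitrary box.
rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(size=(n, 2))            # uniform design on [0,1]^2

a_lo, a_hi = np.array([0.2, 0.3]), np.array([0.5, 0.7])
vol_A = np.prod(a_hi - a_lo)            # Vol(A) = 0.3 * 0.4 = 0.12
N_A = np.sum(np.all((X >= a_lo) & (X <= a_hi), axis=1))

print(f"N_A = {N_A}, n * Vol(A) = {n * vol_A:.0f}")
```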
It is worth mentioning that the case of a random design can be reduced to the case of
a deterministic design by considering the conditional distribution of the data given the
design variables X1, . . . , Xn .
3.1.3 Errors
The decomposition of the observed response variable Y into the systematic component
f(x) and the error ε in the model equation (3.1) is not uniquely defined and cannot be
done without some assumptions on the errors εi . The standard approach is to assume
that the mean value of every εi is zero. Equivalently this means that the expected
value of the observation Yi is just the regression function f(Xi) . This case is called
mean regression or simply regression. It is usually assumed that the errors εi have finite
second moments. The homogeneous errors case means that all the errors εi have the same
variance σ² = Var εi . The variance of heterogeneous errors εi may vary with i . In
many applications not only the systematic component f(Xi) = IEYi but also the error
variance VarYi = Var εi depend on the regressor (location) Xi . Such models are often
written in the form
Yi = f(Xi) + σ(Xi)εi .
The observation (noise) variance σ2(x) can be the target of analysis similarly to the
mean regression function.
The assumption of zero mean noise, IEεi = 0 , is very natural and has a clear in-
terpretation. However, in some applications, it can cause trouble, especially if data are
contaminated by outliers. In this case, the assumption of a zero mean can be replaced by
a more robust assumption of a zero median. This leads to the median regression model
which assumes IP(εi ≤ 0) = 1/2 , or, equivalently,

IP( Yi − f(Xi) ≤ 0 ) = 1/2.
A further important assumption concerns the joint distribution of the errors εi . In the
majority of applications the errors are assumed to be independent. However, in some
situations, the dependence of the errors is quite natural. One example can be given by
time series analysis. The errors εi are defined as the difference between the observed
values Yi and the trend function fi at the i th time moment. These errors are often
serially correlated and indicate short or long range dependence. Another example comes
from imaging. The neighboring observations in an image are often correlated due to the
imaging technique used for recording the images. The correlation particularly results from
the automatic movement correction.
For theoretical study one often assumes that the errors εi are not only independent
but also identically distributed. This, of course, yields a homogeneous noise. The theo-
retical study can be simplified even further if the error distribution is normal. This case
is called Gaussian regression and is denoted as εi ∼ N(0, σ2) . This assumption is very
useful and greatly simplifies the theoretical study. The main advantage of Gaussian noise
is that the observations and their linear combinations are also normally distributed. This
is an exclusive property of the normal law which helps to simplify the exposition and
avoid technicalities.
Under the given distribution of the errors, the joint distribution of the observations
Yi is determined by the regression function f(·) .
3.1.4 Regression function
By the equation (3.1), the regression variable Y can be decomposed into a systematic
component and a (random) error ε . The systematic component is a deterministic func-
tion f of the explanatory variable X called the regression function. Classical regression
theory considers the case of linear dependence, that is, one fits a linear relation between
Y and X :
f(x) = a+ bx
leading to the model equation
Yi = θ1 + θ2Xi + εi .
Here θ1 and θ2 are the parameters of the linear model. If the regressor x is multidimensional,
then θ2 is a vector from IRd and θ2x is understood as the scalar product of the two vectors.
In many practical examples the assumption of linear dependence is too restrictive. It can
be extended in several ways. One can try a more sophisticated functional dependence
of Y on X , for instance polynomial. More generally, one can assume that the regression
function f is known up to the finite-dimensional parameter θ = (θ1, . . . , θp)> ∈ IRp .
This situation is called parametric regression and denoted by f(·) = fθ(·) . If the func-
tion fθ depends on θ linearly, that is, fθ(x) = θ1ψ1(x) + . . .+ θpψp(x) for some given
functions ψ1, . . . , ψp , then the model is called linear regression. An important special
case is given by polynomial regression when f(x) is a polynomial function of degree
p− 1 : f(x) = θ1 + θ2x+ . . .+ θpxp−1 .
In many applications a parametric form of the regression function cannot be justified.
Then one speaks of nonparametric regression.
3.2 Method of substitution and M-estimation
Observe that the parametric regression equation can be rewritten as
εi = Yi − f(Xi,θ∗).
If θ̃ is an estimate of the parameter θ∗ , then the residuals ε̃i = Yi − f(Xi, θ̃) are
estimates of the individual errors εi . So, the idea of the method is to select the parameter
estimate θ̃ in such a way that the empirical distribution Pn of the residuals ε̃i mimics as well
as possible certain prescribed features of the error distribution. We consider one approach
called minimum contrast or M-estimation. Let ψ(y) be an influence or contrast function.
The main condition on the choice of this function is that

IE ψ(εi + z) ≥ IE ψ(εi)

for all i = 1, . . . , n and all z . Then the true value θ∗ clearly minimizes the expectation
of the sum ∑i ψ( Yi − f(Xi, θ) ):

θ∗ = argminθ IE ∑i ψ( Yi − f(Xi, θ) ).

This leads to the M-estimate

θ̃ = argminθ ∑i ψ( Yi − f(Xi, θ) ).
This estimation method can be treated as replacing the true error distribution by
the empirical distribution of the residuals.
We specify this approach for regression estimation by the classical examples of least
squares, least absolute deviation and maximum likelihood estimation corresponding to
ψ(x) = x² , ψ(x) = |x| and ψ(x) = −log p(x) , where p(x) is the error density. All these
examples belong within the framework of M-estimation and the quasi maximum likelihood
approach.
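The following sketch illustrates the M-estimation principle in the simplest location model f(Xi, θ) ≡ θ on simulated, outlier-contaminated data: the contrast ψ(x) = x² reproduces the sample mean, while ψ(x) = |x| reproduces the more robust sample median. The grid minimization is a crude stand-in for a proper optimizer:

```python
import numpy as np

# M-estimation in the location model f(X_i, theta) = theta: the contrast
# psi(x) = x^2 yields the sample mean, psi(x) = |x| the sample median.
# Illustrative sketch on simulated, outlier-contaminated data.
rng = np.random.default_rng(2)
Y = np.concatenate([rng.normal(1.0, 0.5, size=99), [50.0]])  # one gross outlier

def m_estimate(Y, psi):
    """Minimize sum_i psi(Y_i - theta) over a fine grid of theta values."""
    grid = np.linspace(Y.min(), Y.max(), 50_001)
    contrast = psi(Y[None, :] - grid[:, None]).sum(axis=1)
    return grid[np.argmin(contrast)]

lse = m_estimate(Y, lambda x: x ** 2)   # least squares contrast -> mean
lad = m_estimate(Y, np.abs)             # least absolute deviation -> median

print(f"LSE {lse:.3f} vs mean {Y.mean():.3f}")
print(f"LAD {lad:.3f} vs median {np.median(Y):.3f}")
```

The outlier drags the least squares solution away from the bulk of the data, while the absolute deviation contrast is barely affected.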
3.2.1 Mean regression. Least squares estimate
The observations Yi are assumed to follow the model
Yi = f(Xi,θ∗) + εi , IEεi = 0 (3.2)
with an unknown target θ∗ . Suppose in addition that σi² = IEεi² < ∞ . Then for every
θ ∈ Θ and every i ≤ n , due to (3.2),

IEθ∗{ Yi − f(Xi, θ) }² = IEθ∗{ εi + f(Xi, θ∗) − f(Xi, θ) }² = σi² + | f(Xi, θ∗) − f(Xi, θ) |².

This yields for the whole sample

IEθ∗ ∑{ Yi − f(Xi, θ) }² = ∑{ σi² + | f(Xi, θ∗) − f(Xi, θ) |² }.

This expression is clearly minimized at θ = θ∗ . This leads to the idea of estimating the
parameter θ∗ by minimizing its empirical counterpart. The resulting estimate is called
the (ordinary) least squares estimate (LSE):

θ̃LSE = argminθ ∑{ Yi − f(Xi, θ) }².

This estimate is very natural and requires minimal information about the errors εi : just
IEεi = 0 and IEεi² < ∞ .
3.2.2 Median regression. Least absolute deviation estimate
Consider the same regression model as in (3.2), but now the errors εi are not assumed
to have zero mean. Instead we assume that their median is zero:

Yi = f(Xi, θ∗) + εi , med(εi) = 0.

As previously, the target of estimation is the parameter θ∗ . Observe that εi = Yi − f(Xi, θ∗)
and hence, the latter r.v. has median zero. We now use the following simple
fact: if med(ε) = 0 , then for any z ≠ 0

IE|ε + z| ≥ IE|ε|. (3.3)

Exercise 3.2.1. Prove (3.3).

The property (3.3) implies for every θ

IEθ∗ ∑| Yi − f(Xi, θ) | ≥ IEθ∗ ∑| Yi − f(Xi, θ∗) |,

that is, θ∗ minimizes over θ the expectation under the true measure of the sum
∑| Yi − f(Xi, θ) | . This leads to the empirical counterpart of θ∗ given by

θ̃ = argminθ∈Θ ∑| Yi − f(Xi, θ) |.
3.2.3 Maximum likelihood regression
Let the density function p(·) of the errors εi be known. The regression equation (3.2)
implies εi = Yi − f(Xi, θ∗) . Therefore, every Yi has the density p(y − f(Xi, θ∗)) . Independence
of the Yi ’s implies the product structure of the density of the joint distribution:
∏ p(yi − f(Xi, θ)) , yielding the log-likelihood

L(θ) = ∑ ℓ( Yi − f(Xi, θ) )

with ℓ(t) = log p(t) . The MLE is the point of maximum of L(θ) :

θ̃ = argmaxθ L(θ) = argmaxθ ∑ ℓ( Yi − f(Xi, θ) ).
A closed form solution for this equation exists only in some special cases like linear
Gaussian regression. Otherwise this equation has to be solved numerically.
Consider an important special case corresponding to i.i.d. Gaussian errors, when
p(y) is the density of the normal law with mean zero and variance σ² . Then

L(θ) = −(n/2) log(2πσ²) − (2σ²)^{−1} ∑| Yi − f(Xi, θ) |².

The corresponding MLE maximizes L(θ) or, equivalently, minimizes the sum
∑| Yi − f(Xi, θ) |² :

θ̃ = argmaxθ∈Θ L(θ) = argminθ∈Θ ∑| Yi − f(Xi, θ) |².
This estimate has already been introduced as the ordinary least squares estimate (oLSE).
A small extension of the previous example is given by inhomogeneous Gaussian regression,
when the errors εi are independent Gaussian zero-mean but the variances depend
on i : IEεi² = σi² . Then the log-likelihood L(θ) is given by the sum

L(θ) = ∑{ −| Yi − f(Xi, θ) |² / (2σi²) − (1/2) log(2πσi²) }.

Maximizing this expression w.r.t. θ is equivalent to minimizing the weighted sum
∑ σi^{−2} | Yi − f(Xi, θ) |² :

θ̃ = argminθ ∑ σi^{−2} | Yi − f(Xi, θ) |².

Such an estimate is also called the weighted least squares estimate (wLSE).
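Stacking the feature vectors into a p × n matrix Ψ (the notation of Section 3.3), the wLSE admits the closed-form solution ( ΨWΨ> )^{−1} ΨWY with W = diag(σ1^{−2}, . . . , σn^{−2}). A minimal numerical sketch on simulated heteroscedastic data (the model and noise specification are illustrative):

```python
import numpy as np

# Weighted least squares for an inhomogeneous Gaussian linear model
# Y_i = Psi_i^T theta* + sigma_i eps_i (simulated data; a sketch only).
rng = np.random.default_rng(3)
n, theta_true = 200, np.array([2.0, -1.0])
X = rng.uniform(-1, 1, size=n)
Psi = np.vstack([np.ones(n), X])          # p x n matrix of feature vectors
sigma = 0.1 + np.abs(X)                   # error standard deviation depends on X_i
Y = Psi.T @ theta_true + sigma * rng.normal(size=n)

# The wLSE minimizes sum_i sigma_i^{-2} |Y_i - Psi_i^T theta|^2; closed-form
# solution (Psi W Psi^T)^{-1} Psi W Y with W = diag(sigma_i^{-2}).
W = np.diag(sigma ** -2)
theta_wlse = np.linalg.solve(Psi @ W @ Psi.T, Psi @ W @ Y)
print("wLSE:", theta_wlse)
```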
Another example corresponds to the case when the errors εi are i.i.d. double exponential
(Laplace), so that IP(±ε1 > t) = e^{−t/σ}/2 for t ≥ 0 and some given σ > 0 . Then
p(y) = (2σ)^{−1} e^{−|y|/σ} and

L(θ) = −n log(2σ) − σ^{−1} ∑| Yi − f(Xi, θ) |.

The MLE θ̃ maximizes L(θ) or, equivalently, minimizes the sum ∑| Yi − f(Xi, θ) | :

θ̃ = argmaxθ∈Θ L(θ) = argminθ∈Θ ∑| Yi − f(Xi, θ) |.

So the maximum likelihood regression with Laplacian errors leads back to the least
absolute deviation (LAD) estimate.
3.3 Linear regression
One standard way of modeling the regression relationship is based on a linear expansion
of the regression function. This approach is based on the assumption that the unknown
regression function f(·) can be represented as a linear combination of given basis func-
tions ψ1(·), . . . , ψp(·) :
f(x) = θ1ψ1(x) + . . .+ θpψp(x).
Typical examples are:
Example 3.3.1. [Multivariate linear regression] Let the regressor x = (x1, . . . , xd) be
d -dimensional. The linear regression function f(x) can be written as
f(x) = a+ b1x1 + . . .+ bdxd.
Here we have p = d + 1 and the basis functions are ψ1(x) ≡ 1 and ψm(x) = x_{m−1} for
m = 2, . . . , p . The coefficient a is often called the intercept and b1, . . . , bd are the slope
coefficients. The vector of coefficients θ = (a, b1, . . . , bd)> uniquely describes the linear
relation.
Example 3.3.2. [Polynomial regression] Let x be univariate and f(·) be a polynomial
function of degree p − 1 , that is,

f(x) = θ1 + θ2 x + . . . + θp x^{p−1}.

Then the basis functions are ψ1(x) ≡ 1 , ψ2(x) ≡ x , . . . , ψp(x) ≡ x^{p−1} , while θ =
(θ1, . . . , θp)> is the corresponding vector of coefficients.
Example 3.3.3. [Series expansion] Let ψ1(x), . . . , ψp(x), . . . be a given system of func-
tions. Specific examples are trigonometric (Fourier, cosine), orthogonal polynomial
(Chebyshev, Legendre, Jacobi), and wavelet systems among many others. The com-
pleteness of this system means that a given function f under mild regularity conditions
can be uniquely expanded in the form

f(x) = ∑_{m=1}^{∞} θm ψm(x).

However, such an expansion is intractable because it involves infinitely many coefficients
θm . A standard procedure is to truncate this expansion after the first p terms, leading
to the approximation

f(x) ≈ ∑_{m=1}^{p} θm ψm(x). (3.4)

Such an approximation becomes better and better as the number p of terms grows, but
then one has to estimate more and more coefficients. The choice of a proper truncation value
p is one of the central problems in nonparametric function estimation. The parametric
approach simply assumes that the value p is fixed and the approximation (3.4) is treated
as an exact equality: f(x) ≡ θ1ψ1(x) + . . . + θpψp(x) .
Exercise 3.3.1. Let the regressor x be d -dimensional, x = (x1, . . . , xd)> . Describe
the basis system and the corresponding vector of coefficients for the case when f is a
quadratic function of x .
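As an illustration of how such basis systems look in the multivariate case: for d = 2 a quadratic f is linear in the p = 6 basis functions {1, x1, x2, x1², x1x2, x2²}. A sketch of the corresponding feature map (the function name is ours):

```python
import numpy as np

# A quadratic regression function of a d-dimensional regressor is linear in
# the basis {1, x_1,...,x_d, x_j x_k for j <= k}; sketch for d = 2, where
# p = 1 + d + d(d+1)/2 = 6.
def quadratic_basis(x):
    """Return the basis vector (1, x1, x2, x1^2, x1*x2, x2^2) at x in IR^2."""
    x1, x2 = x
    return np.array([1.0, x1, x2, x1 ** 2, x1 * x2, x2 ** 2])

p = quadratic_basis(np.array([0.5, -1.0])).size
print("number of coefficients p =", p)
```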
Linear regression is often described using vector-matrix notation. Let Ψi be the
vector in IRp whose entries are the values ψm(Xi) of the basis functions at the design
point Xi , m = 1, . . . , p . Then f(Xi) = Ψi>θ∗ , and the linear regression model can be
written as

Yi = Ψi>θ∗ + εi , i = 1, . . . , n.

Denote by Y = (Y1, . . . , Yn)> the vector of observations (responses), and by ε = (ε1, . . . , εn)>
the vector of errors. Let finally Ψ be the p × n matrix with columns Ψ1, . . . , Ψn , that
is, Ψ = ( ψm(Xi) )_{m=1,...,p; i=1,...,n} . Note that each row of Ψ is composed of the values of the
corresponding basis function ψm at the design points Xi . Now the regression equation
reads as

Y = Ψ>θ∗ + ε.
The estimation problem for this linear model will be discussed in detail in Chapter 4.
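A short numerical sketch of this notation: build Ψ for the polynomial basis ψm(x) = x^{m−1} and fit Y = Ψ>θ + ε by least squares. The fit anticipates Chapter 4, and the data are simulated:

```python
import numpy as np

# Build the p x n matrix Psi for the polynomial basis psi_m(x) = x^{m-1}
# and fit Y = Psi^T theta + eps by least squares (simulated data; a sketch
# of the estimation problem treated in Chapter 4).
rng = np.random.default_rng(4)
n, p = 100, 3
X = np.sort(rng.uniform(0, 1, size=n))
theta_true = np.array([1.0, -2.0, 3.0])

Psi = np.vstack([X ** m for m in range(p)])      # row m holds psi_{m+1} at the design points
Y = Psi.T @ theta_true + 0.05 * rng.normal(size=n)

theta_hat, *_ = np.linalg.lstsq(Psi.T, Y, rcond=None)
print("estimated coefficients:", theta_hat)
```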
3.3.1 Projection estimation
3.3.2 Piecewise linear estimation
3.3.3 Spline estimation
3.3.4 Wavelet estimation
3.3.5 Kernel estimation
3.4 Density function estimation
3.4.1 Linear projection estimation
3.4.2 Wavelet density estimation
3.4.3 Kernel density estimation
3.4.4 Estimation based on Fourier transformation
3.5 Generalized regression
Let the response Yi be observed at the design point Xi ∈ IRd , i = 1, . . . , n . A (mean)
regression model assumes that the observed values Yi are independent and can be de-
composed into the systematic component f(Xi) and the individual centered stochastic
error εi . In some cases such a decomposition is questionable. This especially concerns
the case when the data Yi are categorical, e.g. binary or discrete. Another striking
example is given by nonnegative observations Yi . In such cases one usually assumes
that the distribution of Yi belongs to some given parametric family (Pυ, υ ∈ U) and
only the parameter of this distribution depends on the design point Xi . We denote this
parameter value as f(Xi) ∈ U and write the model in the form
Yi ∼ Pf(Xi) .
As previously, f(·) is called a regression function and its values at the design points Xi
completely specify the joint data distribution:
Y ∼ ∏i P_{f(Xi)} .
Below we assume that (Pυ) is a univariate exponential family with the log-density
`(y, υ) .
The parametric modeling approach assumes that the regression function f can be
specified by a finite-dimensional parameter θ ∈ Θ ⊂ IRp : f(x) = f(x,θ) . As usual, by
θ∗ we denote the true parameter value. The log-likelihood function for this model reads
as

L(θ) = ∑i ℓ( Yi, f(Xi, θ) ).

The corresponding MLE θ̃ maximizes L(θ) :

θ̃ = argmaxθ ∑i ℓ( Yi, f(Xi, θ) ).

The estimating equation ∇L(θ) = 0 reads as

∑i ℓ′( Yi, f(Xi, θ) ) ∇f(Xi, θ) = 0,

where ℓ′(y, υ) := ∂ℓ(y, υ)/∂υ .
The approach essentially depends on the parametrization of the considered EF. Usually
one applies either the natural or canonical parametrization. In the case of the
natural parametrization, ℓ(y, υ) = C(υ)y − B(υ) , where the functions C(·), B(·) satisfy
B′(υ) = υC′(υ) . This implies ℓ′(y, υ) = yC′(υ) − B′(υ) = (y − υ)C′(υ) , and the
estimating equation reads as

∑i ( Yi − f(Xi, θ) ) C′( f(Xi, θ) ) ∇f(Xi, θ) = 0.
Unfortunately, a closed form solution for this equation exists only in very special cases.
Even the questions of existence and uniqueness of the solution cannot be studied in
full generality. Some numerical algorithms are usually applied to solve the estimating
equation.
Exercise 3.5.1. Specify the estimating equation for generalized EFn regression and find
the solution for the case of the constant regression function f(Xi, θ) ≡ θ .
Hint: If f(Xi, θ) ≡ θ , then the Yi are i.i.d. from Pθ .
The canonical parametrization is often applied in combination with linear modeling
of the regression function. If (Pυ) is an EFc with the log-density ℓ(y, υ) = yυ − d(υ) ,
then the log-likelihood L(θ) can be represented in the form

L(θ) = ∑i { Yi f(Xi, θ) − d( f(Xi, θ) ) }.

The corresponding estimating equation is

∑i { Yi − d′( f(Xi, θ) ) } ∇f(Xi, θ) = 0.
Exercise 3.5.2. Specify the estimating equation for generalized EFc regression and find
the solution for the case of constant regression with f(Xi, υ) ≡ υ . Relate the natural
and canonical representation.
3.6 Generalized linear models
Consider the generalized regression model
Yi ∼ Pf(Xi) ∈ P.
In addition we assume a linear (in parameters) structure of the regression function f(X) .
Such modeling is particularly useful to combine with the canonical parametrization of
the considered EF with the log-density `(y, υ) = yυ − d(υ) . The reason is that the
stochastic part in the log-likelihood of an EFc linearly depends on the parameter. So,
below we assume that P = (Pυ, υ ∈ U) is an EFc.
Linear regression f(Xi) = Ψi>θ with given feature vectors Ψi ∈ IRp leads to the
model with the log-likelihood

L(θ) = ∑i { Yi Ψi>θ − d( Ψi>θ ) }.

Such a setup is called a generalized linear model (GLM). Note that the log-likelihood can
be represented as

L(θ) = S>θ − A(θ),

where

S = ∑i Yi Ψi , A(θ) = ∑i d( Ψi>θ ).
The corresponding MLE θ maximizes L(θ) . Again, a closed form solution only exists
in special cases. However, an important advantage of the GLM approach is that the
solution always exists and is unique. The reason is that the log-likelihood function L(θ)
is concave in θ .
Lemma 3.6.1. The MLE θ̃ solves the following estimating equation:

∇L(θ) = S − ∇A(θ) = ∑_{i=1}^{n} ( Yi − d′(Ψi>θ) ) Ψi = 0. (3.5)

The solution exists and is unique.
Proof. Define the matrix

B(θ) = ∑_{i=1}^{n} d′′(Ψi>θ) Ψi Ψi>.

Since d′′(υ) is strictly positive for all υ , the matrix B(θ) is positive definite as well.
It holds

∇²L(θ) = −∇²A(θ) = −∑_{i=1}^{n} d′′(Ψi>θ) Ψi Ψi> = −B(θ).

Thus, the function L(θ) is strictly concave w.r.t. θ and the estimating equation
∇L(θ) = S − ∇A(θ) = 0 has a unique solution θ̃ .
The solution of (3.5) can easily be obtained numerically by the Newton-Raphson
algorithm: select an initial estimate θ^{(0)} ; then for every k ≥ 0 apply

θ^{(k+1)} = θ^{(k)} + { B(θ^{(k)}) }^{−1} { S − ∇A(θ^{(k)}) }

until convergence.
3.6.1 Logit regression for binary data
Suppose that the observed data Yi are independent and binary, that is, each Yi is
either zero or one, i = 1, . . . , n . Such models are often used e.g. in sociological and
medical studies, two-class classification, and binary imaging, among many other fields. We
treat each Yi as a Bernoulli r.v. with the corresponding parameter fi = f(Xi) . This is
a special case of generalized regression, also called a binary response model. The parametric
modeling assumption means that the regression function f(·) can be represented in the
form f(Xi) = f(Xi, θ) for a given class of functions {f(·, θ), θ ∈ Θ ⊂ IRp} . Then the
log-likelihood L(θ) reads as

L(θ) = ∑i ℓ( Yi, f(Xi, θ) ),
where `(y, υ) is the log-density of the Bernoulli law. For linear modeling, it is more
useful to work with the canonical parametrization. Then `(y, υ) = yυ− log(1 + eυ) , and
the log-likelihood reads

L(θ) = ∑i [ Yi f(Xi, θ) − log( 1 + e^{f(Xi,θ)} ) ].

In particular, if the regression function f(·, θ) is linear, that is, f(Xi, θ) = Ψi>θ , then

L(θ) = ∑i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ].
The corresponding estimate reads as

θ̃ = argmaxθ L(θ) = argmaxθ ∑i [ Yi Ψi>θ − log( 1 + e^{Ψi>θ} ) ].

This modeling is usually referred to as logit regression.
Exercise 3.6.1. Specify the estimating equation for the case of logit regression.
3.6.2 Poisson regression
Suppose that the observations Yi are nonnegative integer numbers. The Poisson distri-
bution is a natural candidate for modeling such data. It is supposed that the underlying
Poisson parameter depends on the regressor Xi . Typical examples arise in different
types of imaging including medical positron emission and magnet resonance tomography,
satellite and low-luminosity imaging, queueing theory, high frequency trading, etc. (to
be continued).
3.7 Quasi Maximum Likelihood estimation
This section very briefly discusses an extension of the maximum likelihood approach. A
more detailed discussion will be given in the context of linear modeling in Chapter 4. To be
specific, consider a regression model
Yi = f(Xi) + εi.
The maximum likelihood approach requires specifying the two main ingredients of this
model: a parametric class {f(x, θ), θ ∈ Θ} of regression functions and the distribution
of the errors εi . Sometimes such information is lacking. One or even both modeling
assumptions can be misspecified. In such situations one speaks of a quasi maximum
likelihood approach, where the estimate θ̃ is defined via maximizing over θ the random
function L(θ) even though it is not necessarily the real log-likelihood. Some examples
of this approach have already been given.
Below we distinguish between misspecification of the first and second kind. The first
kind corresponds to the parametric assumption about the regression function: one assumes
the equality f(Xi) = f(Xi, θ∗) for some θ∗ ∈ Θ . In reality one can only expect
a reasonable quality of approximation of f(·) by f(·, θ∗) . A typical example is given by
linear (polynomial) regression. The linear structure of the regression function is useful
and tractable but it can only be a rough approximation of the real relation between Y
and X . The quasi maximum likelihood approach suggests to ignore this misspecification
and proceed as if the parametric assumption is fulfilled. This approach raises a number
of questions: what is the target of estimation and what is really estimated by such
a quasi ML procedure? In Chapter 4 we show in the context of linear modeling that
the target of estimation can be naturally defined as the parameter θ† providing the best
approximation of the true regression function f(·) by its parametric counterpart f(·,θ) .
The second kind of misspecification concerns the assumption about the errors εi . In
most applications, the distribution of the errors is unknown. Moreover, the errors can
be dependent or non-identically distributed. The assumption of a specific i.i.d. structure
leads to a model misspecification and thus to the quasi maximum likelihood approach.
We illustrate this situation by a few examples.
Consider the regression model Yi = f(Xi, θ∗) + εi and suppose for a moment that
the errors εi are i.i.d. normal. Then the principal term of the corresponding log-likelihood
is given by the negative sum of the squared residuals: −∑| Yi − f(Xi, θ) |² , and
its maximization leads to the least squares method. So, one can say that the LSE method
is the quasi MLE when the errors are assumed to be i.i.d. normal. That is, the LSE can
be obtained as the MLE for the imaginary Gaussian regression model even when the errors
εi are not necessarily i.i.d. Gaussian.
If the data are contaminated or the errors have heavy tails, it could be unwise to
apply the LSE method. The LAD method is known to be more robust against outliers
and data contamination. At the same time, it has already been shown in Section 3.2.3
that the LAD estimate is the MLE when the errors are Laplacian (double exponential).
In other words, LAD is the quasi MLE for the model with Laplacian errors.
Inference for the quasi ML approach is discussed in detail in Chapter 4 in the context
of linear modeling.
Chapter 4
Estimation in linear models
This chapter discusses the estimation problem for a linear model by a quasi maximum
likelihood method. We especially focus on the validity of the presented results under
possible model misspecification. Another important issue is the way of measuring the
estimation loss and risk. We distinguish below between response estimation or predic-
tion and the parameter estimation. The most advanced results like chi-squared result
in Section 4.6 are established under the assumption of a Gaussian noise. However, a
misspecification of noise structure is allowed and addressed.
4.1 Modeling assumptions
A linear model assumes that the observations Yi follow the equation:
Yi = Ψ>i θ∗ + εi (4.1)
for i = 1, . . . , n , where θ∗ = (θ∗1, . . . , θ∗p)> ∈ IRp is an unknown parameter vector, Ψi
are given vectors in IRp and the εi ’s are individual errors with zero mean. A typical
example is given by linear regression (see Section 3.3), when the vectors Ψi are the values
of a set of basis functions (e.g. polynomial or trigonometric) at the design points Xi .
A linear Gaussian model assumes in addition that the vector of errors ε = (ε1, . . . , εn)>
is normally distributed with zero mean and a covariance matrix Σ :
ε ∼ N(0, Σ).
In this chapter we suppose that Σ is given in advance. We will distinguish between
three cases:
1. the errors εi are i.i.d. N(0, σ2) , or equivalently, the matrix Σ is equal to σ2IIn
with IIn being the unit matrix in IRn .
2. the errors are independent but not homogeneous, that is, IEεi² = σi² . Then the
matrix Σ is diagonal: Σ = diag(σ1², . . . , σn²) .
3. the errors εi are dependent with a covariance matrix Σ .
In practical applications one mostly starts with the white Gaussian noise assumption
and more general cases 2 and 3 are only considered if there are clear indications of the
noise inhomogeneity or correlation. The second situation is typical e.g. for the eigenvector
decomposition in an inverse problem. The last case is the most general and includes the
first two.
4.2 Quasi maximum likelihood estimation
Denote by Y = (Y1, . . . , Yn)> (resp. ε = (ε1, . . . , εn)> ) the vector of observations (resp.
of errors) in IRn and by Ψ the p× n matrix with columns Ψi . Let also Ψ> denote its
transpose. Then the model equation can be rewritten as:
Y = Ψ>θ∗ + ε, ε ∼ N(0, Σ).
An equivalent formulation is that Σ^{−1/2}(Y − Ψ>θ∗) is a standard normal vector in IRn .
The log-density of the distribution of the vector Y = (Y1, . . . , Yn)> w.r.t. the Lebesgue
measure in IRn is therefore of the form
L(θ) = −(n/2) log(2π) − (1/2) log( det Σ ) − (1/2) ‖Σ^{−1/2}(Y − Ψ>θ)‖²
= −(n/2) log(2π) − (1/2) log( det Σ ) − (1/2) (Y − Ψ>θ)>Σ^{−1}(Y − Ψ>θ).

In case 1 this expression can be rewritten as

L(θ) = −(n/2) log(2πσ²) − (2σ²)^{−1} ∑_{i=1}^{n} (Yi − Ψi>θ)².
In case 2 the expression is similar:

L(θ) = −∑_{i=1}^{n} { (1/2) log(2πσi²) + (Yi − Ψi>θ)² / (2σi²) }.
The maximum likelihood estimate (MLE) θ̃ of θ∗ is defined by maximizing the log-likelihood
L(θ) :

θ̃ = argmaxθ∈IRp L(θ) = argminθ∈IRp (Y − Ψ>θ)>Σ^{−1}(Y − Ψ>θ). (4.2)
We omit the other terms in the expression of L(θ) because they do not depend on θ .
This estimate is the least squares estimate (LSE) because it minimizes the sum of squared
distances between the observations Yi and the linear responses Ψi>θ . Note that (4.2) is
a quadratic optimization problem which has a closed form solution. Differentiating the
right-hand side of (4.2) w.r.t. θ yields the normal equation

ΨΣ^{−1}Ψ>θ = ΨΣ^{−1}Y .
If the p × p matrix ΨΣ^{−1}Ψ> is non-degenerate then the normal equation has the unique
solution

θ̃ = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = SY , (4.3)

where

S = ( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}

is a p × n matrix. We denote by θ̃m the entries of the vector θ̃ , m = 1, . . . , p .
If the matrix ΨΣ^{−1}Ψ> is degenerate, then the normal equation has infinitely many
solutions. However, one can still apply the formula (4.3), where ( ΨΣ^{−1}Ψ> )^{−1} is a
pseudo-inverse of the matrix ΨΣ^{−1}Ψ> .
The ML approach leads to the parameter estimate θ̃ . Note that due to the model
(4.1), the product f̃ = Ψ>θ̃ is an estimate of the mean f := IEY of the vector of
observations Y :

f̃ = Ψ>θ̃ = Ψ>( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}Y = ΠY ,

where

Π = Ψ>( ΨΣ^{−1}Ψ> )^{−1} ΨΣ^{−1}

is an n × n matrix (linear operator) in IRn . The vector f̃ is called a prediction or
response regression estimate.
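The algebra behind these formulas can be verified numerically: SΨ> = IIp, so Π is idempotent and reproduces every response vector of the form f = Ψ>θ. A sketch with an arbitrary random design and a diagonal Σ:

```python
import numpy as np

# Properties of the estimator matrix S and the operator Pi from (4.3):
# S Psi^T = I_p, so Pi is idempotent and reproduces any response of the
# form f = Psi^T theta.  Small random design; Sigma is an arbitrary diagonal.
rng = np.random.default_rng(6)
n, p = 30, 3
Psi = rng.normal(size=(p, n))
Sigma_inv = np.diag(1 / rng.uniform(0.5, 2.0, size=n))    # Sigma^{-1}

S = np.linalg.solve(Psi @ Sigma_inv @ Psi.T, Psi @ Sigma_inv)
Pi = Psi.T @ S                                            # n x n operator

theta = np.array([1.0, 2.0, -1.0])
f = Psi.T @ theta
print("Pi idempotent:", np.allclose(Pi @ Pi, Pi))
print("Pi f = f for f in the span:", np.allclose(Pi @ f, f))
print("S Psi^T = identity:", np.allclose(S @ Psi.T, np.eye(p)))
```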
Below we study the properties of the estimates $\tilde{\theta}$ and $\tilde{f}$. In this study we try to address both types of possible model misspecification: a wrong assumption about the error distribution and a possibly wrong linear parametric structure. Namely, we consider the model
\[
Y_i = f_i + \varepsilon_i, \qquad \varepsilon \sim N(0, \Sigma_0). \tag{4.4}
\]
The response values $f_i$ are usually treated as the values of the regression function $f(\cdot)$ at the design points $X_i$. The parametric model (4.1) can be viewed as an approximation of
(4.4), while $\Sigma$ is an approximation of the true covariance matrix $\Sigma_0$. If $f$ is indeed equal to $\Psi^{\top}\theta^{*}$ and $\Sigma = \Sigma_0$, then $\tilde{\theta}$ and $\tilde{f}$ are MLEs, otherwise quasi MLEs. In our study we mostly restrict ourselves to the case 1 assumption about the noise $\varepsilon$: $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. The general case can be reduced to this one by a simple data transformation, namely, by multiplying the equation (4.4) $Y = f + \varepsilon$ by the matrix $\Sigma^{-1/2}$; see Section 4.6 for more detail.
4.2.1 Estimation under the homogeneous noise assumption
If a homogeneous noise is assumed, that is, $\Sigma = \sigma^{2}\mathbb{I}_n$ and $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then the formulae for the MLEs $\tilde{\theta}, \tilde{f}$ slightly simplify. In particular, the variance $\sigma^{2}$ cancels and the resulting estimate is the ordinary least squares estimate (oLSE):
\[
\tilde{\theta} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y = SY
\]
with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. Also
\[
\tilde{f} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y = \Pi Y
\]
with $\Pi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$.
Exercise 4.2.1. Derive the formulae for θ, f directly from the log-likelihood L(θ) for
homogeneous noise.
If the assumption ε ∼ N(0, σ2IIn) about the errors is not precisely fulfilled, then the
oLSE can be viewed as a quasi MLE.
4.2.2 Linear basis transformation
Denote by $\psi_1^{\top},\dots,\psi_p^{\top}$ the rows of the matrix $\Psi$. Then the $\psi_i$'s are vectors in $\mathbb{R}^{n}$, and we call them the basis vectors. In the linear regression case the $\psi_i$'s are obtained as the values of the basis functions at the design points. Our linear parametric assumption simply means that the underlying vector $f$ can be represented as a linear combination of the vectors $\psi_1,\dots,\psi_p$:
\[
f = \theta_1^{*}\psi_1 + \dots + \theta_p^{*}\psi_p.
\]
In other words, $f$ belongs to the linear subspace in $\mathbb{R}^{n}$ spanned by the vectors $\psi_1,\dots,\psi_p$.
It is clear that this assumption still holds if we select another basis in this subspace.
Let $U$ be any linear orthogonal transformation in $\mathbb{R}^{p}$ with $UU^{\top} = \mathbb{I}_p$. Then the linear relation $f = \Psi^{\top}\theta^{*}$ can be rewritten as
\[
f = \Psi^{\top} U U^{\top} \theta^{*} = \bar{\Psi}^{\top} u^{*}
\]
with $\bar{\Psi} = U^{\top}\Psi$ and $u^{*} = U^{\top}\theta^{*}$. Here the rows of $\bar{\Psi}$ are the new basis vectors $\bar{\psi}_m$ in the same subspace, while $u^{*}$ is the vector of coefficients describing the decomposition of the vector $f$ w.r.t. this new basis:
\[
f = u_1^{*}\bar{\psi}_1 + \dots + u_p^{*}\bar{\psi}_p.
\]
The natural question is how the expressions for the MLEs $\tilde{\theta}$ and $\tilde{f}$ change with the change of the basis. The answer is straightforward. For notational simplicity, we only consider the case with $\Sigma = \sigma^{2}\mathbb{I}_n$. The model can be rewritten as
\[
Y = \bar{\Psi}^{\top}u^{*} + \varepsilon,
\]
yielding the solutions
\[
\tilde{u} = \bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi}Y = \bar{S}Y, \qquad
\tilde{f} = \bar{\Psi}^{\top}\bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi}Y = \bar{\Pi}Y,
\]
where $\bar{\Psi} = U^{\top}\Psi$ implies
\[
\bar{S} = \bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi} = U^{\top}S, \qquad
\bar{\Pi} = \bar{\Psi}^{\top}\bigl(\bar{\Psi}\bar{\Psi}^{\top}\bigr)^{-1}\bar{\Psi} = \Pi.
\]
This yields
\[
\tilde{u} = U^{\top}\tilde{\theta},
\]
and moreover, the estimate $\tilde{f}$ is not changed by any linear orthogonal transformation of the basis.
The first statement can be expected in view of θ∗ = Uu∗ , while the second one will be
explained in the next section: Π is the linear projector on the subspace spanned by the
basis vectors and this projector is invariant w.r.t. basis transformations.
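Both statements are easy to confirm numerically. A minimal numpy sketch (ours, assuming homogeneous noise; the helper `lse` is our own name):

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
Psi = rng.standard_normal((p, n))   # rows = basis vectors
Y = rng.standard_normal(n)

def lse(Psi, Y):
    # ordinary least squares estimate and fitted response
    theta = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
    return theta, Psi.T @ theta

# a random orthogonal p x p matrix U via the QR decomposition
U, _ = np.linalg.qr(rng.standard_normal((p, p)))
theta_hat, f_hat = lse(Psi, Y)
u_hat, f_hat_rotated = lse(U.T @ Psi, Y)   # same subspace, rotated basis
```

The fitted response is identical in both bases, while the coefficients transform as $\tilde{u} = U^{\top}\tilde{\theta}$.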
Exercise 4.2.2. Consider univariate polynomial regression of degree $p-1$. This means that $f$ is a polynomial function of degree $p-1$ observed at the points $X_i$ with errors $\varepsilon_i$ that are assumed to be i.i.d. normal. The function $f$ can be represented as
\[
f(x) = \theta_1^{*} + \theta_2^{*}x + \dots + \theta_p^{*}x^{p-1}
\]
using the basis functions $\psi_m(x) = x^{m-1}$ for $m = 1,\dots,p$. At the same time, for any point $x_0$, this function can also be written as
\[
f(x) = u_1^{*} + u_2^{*}(x - x_0) + \dots + u_p^{*}(x - x_0)^{p-1}
\]
using the basis functions $\bar{\psi}_m(x) = (x - x_0)^{m-1}$.

• Write the matrices $\Psi$ and $\Psi\Psi^{\top}$ and similarly $\bar{\Psi}$ and $\bar{\Psi}\bar{\Psi}^{\top}$.

• Describe the linear transformation $A$ such that $u = A\theta$ for $p = 1$.

• Describe the transformation $A$ such that $u = A\theta$ for $p > 1$.

Hint: use the formula
\[
u_m^{*} = \frac{1}{(m-1)!}\, f^{(m-1)}(x_0), \qquad m = 1,\dots,p,
\]
to identify the coefficient $u_m^{*}$ via $\theta_m^{*},\dots,\theta_p^{*}$.
4.2.3 Orthogonal and orthonormal design
Orthogonality of the design matrix $\Psi$ means that the basis vectors $\psi_1,\dots,\psi_p$ are orthogonal in the sense
\[
\psi_m^{\top}\psi_{m'} = \sum_{i=1}^{n} \psi_{m,i}\psi_{m',i} =
\begin{cases}
0 & \text{if } m \ne m', \\
\lambda_m & \text{if } m = m',
\end{cases}
\]
for some positive values $\lambda_1,\dots,\lambda_p$. Equivalently one can write
\[
\Psi\Psi^{\top} = \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p).
\]
This feature of the design is very useful: it essentially simplifies the computation and the analysis of the properties of $\tilde{\theta}$. Indeed, $\Psi\Psi^{\top} = \Lambda$ implies
\[
\tilde{\theta} = \Lambda^{-1}\Psi Y, \qquad \tilde{f} = \Psi^{\top}\tilde{\theta} = \Psi^{\top}\Lambda^{-1}\Psi Y
\]
with $\Lambda^{-1} = \operatorname{diag}(\lambda_1^{-1},\dots,\lambda_p^{-1})$. In particular, the first relation means
\[
\tilde{\theta}_m = \lambda_m^{-1} \sum_{i=1}^{n} Y_i \psi_{m,i},
\]
that is, $\tilde{\theta}_m$ is, up to the factor $\lambda_m^{-1}$, the scalar product of the data and the basis vector $\psi_m$, $m = 1,\dots,p$. The estimate of the response $f$ reads as
\[
\tilde{f} = \tilde{\theta}_1\psi_1 + \dots + \tilde{\theta}_p\psi_p.
\]
Theorem 4.2.1. Consider the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with homogeneous errors $\varepsilon$: $\mathbb{E}\varepsilon\varepsilon^{\top} = \sigma^{2}\mathbb{I}_n$. If the design $\Psi$ is orthogonal, that is, if $\Psi\Psi^{\top} = \Lambda$ for a diagonal matrix $\Lambda$, then the estimated coefficients $\tilde{\theta}_m$ are uncorrelated: $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\Lambda^{-1}$. Moreover, if $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then $\tilde{\theta} \sim N(\theta^{*}, \sigma^{2}\Lambda^{-1})$.
An important message of this result is that an orthogonal design allows for splitting the original multivariate problem into a collection of independent univariate problems: each coefficient $\theta_m^{*}$ is estimated by $\tilde{\theta}_m$ independently of the remaining coefficients.
The calculus can be further simplified in the case of an orthogonal design with $\Psi\Psi^{\top} = \mathbb{I}_p$. Then one speaks about an orthonormal design. This also implies that every basis function (vector) $\psi_m$ is standardized: $\|\psi_m\|^{2} = \sum_{i=1}^{n} \psi_{m,i}^{2} = 1$. In the case of an orthonormal design, the estimate $\tilde{\theta}$ is particularly simple: $\tilde{\theta} = \Psi Y$. Correspondingly, the target of estimation $\theta^{*}$ satisfies $\theta^{*} = \Psi f$. In other words, the target is the collection $(\theta_m^{*})$ of the Fourier coefficients of the underlying function (vector) $f$ w.r.t. the basis $\Psi$, while the estimate $\tilde{\theta}$ is the collection of empirical Fourier coefficients $\tilde{\theta}_m$:
\[
\theta_m^{*} = \sum_{i=1}^{n} f_i \psi_{m,i}, \qquad \tilde{\theta}_m = \sum_{i=1}^{n} Y_i \psi_{m,i}.
\]
An important feature of the orthonormal design is that it preserves the noise homogeneity: $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\mathbb{I}_p$.
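A small numerical illustration of the orthonormal case (our own sketch, not from the text): the rows of $\Psi$ are made orthonormal via a QR decomposition, and the estimate reduces to the empirical Fourier coefficients $\Psi Y$.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 64, 5
# orthonormal design: rows of Psi are orthonormal vectors in R^n
Q, _ = np.linalg.qr(rng.standard_normal((n, p)))
Psi = Q.T                              # p x n with Psi @ Psi.T = I_p
f = rng.standard_normal(n)             # arbitrary underlying response vector
sigma = 0.1
Y = f + sigma * rng.standard_normal(n)

theta_target = Psi @ f                 # Fourier coefficients of f, the target
theta_hat = Psi @ Y                    # empirical Fourier coefficients
```

Each $\tilde{\theta}_m$ deviates from $\theta_m^{*}$ by an independent $N(0, \sigma^{2})$ noise, in line with $\operatorname{Var}(\tilde{\theta}) = \sigma^{2}\mathbb{I}_p$.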
4.2.4 Spectral representation
Consider a linear model
\[
Y = \Psi^{\top}\theta + \varepsilon \tag{4.5}
\]
with homogeneous errors $\varepsilon$: $\operatorname{Var}(\varepsilon) = \sigma^{2}\mathbb{I}_n$. The rows of the matrix $\Psi$ can be viewed as basis vectors in $\mathbb{R}^{n}$, and the product $\Psi^{\top}\theta$ is a linear combination of these vectors with the coefficients $(\theta_1,\dots,\theta_p)$. Effectively, linear least squares estimation projects the data onto the subspace generated by the basis functions. This projection
is of course invariant w.r.t. a basis transformation within this linear subspace. This
fact can be used to reduce the model to the case of an orthogonal design considered in
the previous section. Namely, one can always find a linear orthogonal transformation $U: \mathbb{R}^{p} \to \mathbb{R}^{p}$ ensuring the orthogonality of the transformed basis. This means that the rows of the matrix $\bar{\Psi} = U\Psi$ are orthogonal and the matrix $\bar{\Psi}\bar{\Psi}^{\top}$ is diagonal:
\[
\bar{\Psi}\bar{\Psi}^{\top} = U\Psi\Psi^{\top}U^{\top} = \Lambda = \operatorname{diag}(\lambda_1,\dots,\lambda_p).
\]
After this transformation the original model reads
\[
Y = \bar{\Psi}^{\top}u + \varepsilon, \qquad \bar{\Psi}\bar{\Psi}^{\top} = \Lambda,
\]
where $u = U\theta \in \mathbb{R}^{p}$. Within this model, the transformed parameter $u$ can be estimated using the empirical Fourier coefficients $Z_m = \bar{\psi}_m^{\top}Y$, where $\bar{\psi}_m$ is the $m$th row of $\bar{\Psi}$, $m = 1,\dots,p$. The original parameter vector $\theta$ can be recovered via the equation $\theta = U^{\top}u$. This set of equations can be written in the form
\[
Z = \Lambda u + \Lambda^{1/2}\xi, \tag{4.6}
\]
where $Z = \bar{\Psi}Y = U\Psi Y$ is a vector in $\mathbb{R}^{p}$ and $\xi = \Lambda^{-1/2}\bar{\Psi}\varepsilon = \Lambda^{-1/2}U\Psi\varepsilon \in \mathbb{R}^{p}$. The equation (4.6) is called the spectral representation of the linear model (4.5). The reason is that the basis transformation $U$ can be built by a singular value decomposition of $\Psi$. This representation is widely used in the context of linear inverse problems; see Section 4.8.
Theorem 4.2.2. Consider the model (4.5) with homogeneous errors $\varepsilon$: $\mathbb{E}\varepsilon\varepsilon^{\top} = \sigma^{2}\mathbb{I}_n$. Then there exists an orthogonal transform $U: \mathbb{R}^{p} \to \mathbb{R}^{p}$ leading to the spectral representation (4.6) with homogeneous uncorrelated errors $\xi$: $\mathbb{E}\xi\xi^{\top} = \sigma^{2}\mathbb{I}_p$. If $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, then the vector $\xi$ is normal as well: $\xi \sim N(0, \sigma^{2}\mathbb{I}_p)$.

Exercise 4.2.3. Prove the result of Theorem 4.2.2.

Hint: select any $U$ ensuring $U\Psi\Psi^{\top}U^{\top} = \Lambda$. Then
\[
\mathbb{E}\xi\xi^{\top} = \Lambda^{-1/2}U\Psi \,\mathbb{E}\varepsilon\varepsilon^{\top}\, \Psi^{\top}U^{\top}\Lambda^{-1/2} = \sigma^{2}\Lambda^{-1/2}U\Psi\Psi^{\top}U^{\top}\Lambda^{-1/2} = \sigma^{2}\mathbb{I}_p.
\]
A special case of the spectral representation corresponds to the orthonormal design
with ΨΨ> = IIp . In this situation, the spectral model reads as Z = u + ξ , that is, we
simply observe the target u corrupted with a homogeneous noise ξ . Such an equation
is often called the sequence space model and it is intensively used in the literature for the
theoretical study; cf. Section 4.7 below.
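The construction of $U$ via the singular value decomposition, as mentioned above, can be sketched as follows (an illustration of ours; variable names are our own):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 40, 3
Psi = rng.standard_normal((p, n))
theta = np.array([0.5, 1.0, -1.0])
eps = 0.2 * rng.standard_normal(n)
Y = Psi.T @ theta + eps

# U from the singular value decomposition Psi = W diag(s) Vt:
# the rows of U @ Psi are then orthogonal with squared norms s**2
W, s, Vt = np.linalg.svd(Psi, full_matrices=False)
U = W.T
Psi_bar = U @ Psi
Lam = Psi_bar @ Psi_bar.T              # = diag(lambda_1, ..., lambda_p)
lam = np.diag(Lam)
u = U @ theta                          # transformed parameter
Z = Psi_bar @ Y                        # empirical Fourier coefficients
xi = (Psi_bar @ eps) / np.sqrt(lam)    # spectral noise
```

One can check that `Lam` is (numerically) diagonal and that `Z` satisfies the spectral equation (4.6).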
4.3 Properties of the response estimate f
This section discusses some properties of the estimate $\tilde{f} = \Psi^{\top}\tilde{\theta} = \Pi Y$ of the response vector $f$. It is worth noting that the first and essential part of the analysis does not rely on the underlying model distribution, only on our parametric assumptions that $f = \Psi^{\top}\theta^{*}$ and $\operatorname{Cov}(\varepsilon) = \Sigma = \sigma^{2}\mathbb{I}_n$. The real model only appears when studying the risk of estimation. We will comment on the cases of misspecified $f$ and $\Sigma$.
When $\Sigma = \sigma^{2}\mathbb{I}_n$, the operator $\Pi$ in the representation $\tilde{f} = \Pi Y$ of the estimate $\tilde{f}$ reads as
\[
\Pi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi. \tag{4.7}
\]
First we make use of the linear structure of the model (4.1) and of the estimate f to
derive a number of its simple but important properties.
4.3.1 Decomposition into a deterministic and a stochastic component
The model equation $Y = f + \varepsilon$ yields
\[
\tilde{f} = \Pi Y = \Pi(f + \varepsilon) = \Pi f + \Pi\varepsilon. \tag{4.8}
\]
The first element of this sum, $\Pi f$, is purely deterministic, but it depends on the unknown response vector $f$. Moreover, it will be shown in the next lemma that $\Pi f = f$ if the parametric assumption holds and the vector $f$ indeed can be represented as $\Psi^{\top}\theta^{*}$. The second element is stochastic as a linear transformation of the stochastic vector $\varepsilon$, but it is independent of the model response $f$. The properties of the estimate $\tilde{f}$ heavily rely on the properties of the linear operator $\Pi$ from (4.7), which we collect in the next section.
4.3.2 Properties of the operator Π
Let ψ1, . . . ,ψp be the columns of the matrix Ψ> . These are the vectors in IRn also
called the basis vectors.
Lemma 4.3.1. Let the matrix ΨΨ> be non-degenerate. Then the operator Π fulfills
the following conditions:
(i) Π is symmetric (self-adjoint), that is, Π> = Π .
(ii) Π is a projector in IRn , i.e. Π>Π = Π2 = Π and Π(1n −Π) = 0 , where 1n
means the unity operator in IRn .
(iii) For an arbitrary vector v from IRn , it holds ‖v‖2 = ‖Πv‖2 + ‖v −Πv‖2 .
(iv) The trace of Π is equal to the dimension of its image, tr Π = p .
(v) $\Pi$ projects the linear space $\mathbb{R}^{n}$ onto the linear subspace $\mathcal{L}_p = \langle \psi_1,\dots,\psi_p \rangle$ spanned by the basis vectors $\psi_1,\dots,\psi_p$, that is,
\[
\|f - \Pi f\| = \inf_{g \in \mathcal{L}_p} \|f - g\|.
\]
(vi) The matrix $\Pi$ can be represented in the form
\[
\Pi = U^{\top}\Lambda_p U,
\]
where $U$ is an orthonormal matrix and $\Lambda_p$ is a diagonal matrix with the first $p$ diagonal elements equal to 1 and the others equal to zero:
\[
\Lambda_p = \operatorname{diag}\{\underbrace{1,\dots,1}_{p}, \underbrace{0,\dots,0}_{n-p}\}.
\]
Proof. It holds
\[
\bigl\{\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\bigr\}^{\top} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi
\]
and
\[
\Pi^{2} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi = \Pi,
\]
which proves the first two statements of the lemma. The third one follows directly from the first two. Next,
\[
\operatorname{tr} \Pi = \operatorname{tr}\bigl\{\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\bigr\} = \operatorname{tr}\bigl\{\Psi\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\bigr\} = \operatorname{tr} \mathbb{I}_p = p.
\]
The second property means that $\Pi$ is a projector in $\mathbb{R}^{n}$, and the fourth one means that the dimension of its image space is equal to $p$. The basis vectors $\psi_1,\dots,\psi_p$ are the rows of the matrix $\Psi$. It is clear that
\[
\Pi\Psi^{\top} = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top} = \Psi^{\top}.
\]
Therefore, the vectors $\psi_m$ are invariant under the operator $\Pi$; in particular, all these vectors belong to the image space of this operator. If now $g$ is a vector in $\mathcal{L}_p$, then it can be represented as $g = c_1\psi_1 + \dots + c_p\psi_p$, and therefore $\Pi g = g$ and $\Pi\mathcal{L}_p = \mathcal{L}_p$. Finally, the non-singularity of the matrix $\Psi\Psi^{\top}$ means that the vectors $\psi_1,\dots,\psi_p$ forming the rows of $\Psi$ are linearly independent. Therefore, the space $\mathcal{L}_p$ spanned by these vectors is of dimension $p$, and hence it coincides with the image space of the operator $\Pi$.
The last property is the usual diagonal decomposition of a projector.
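The properties (i)–(iv) of the lemma are easy to confirm numerically for a random design. A small sketch of ours (not from the text):

```python
import numpy as np

rng = np.random.default_rng(4)
n, p = 25, 6
Psi = rng.standard_normal((p, n))
# the projector from (4.7)
Pi = Psi.T @ np.linalg.inv(Psi @ Psi.T) @ Psi
v = rng.standard_normal(n)   # an arbitrary test vector for property (iii)
```

The assertions below check symmetry, idempotence, the trace identity, the invariance of the basis vectors, and the Pythagorean decomposition of $\|v\|^{2}$.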
Exercise 4.3.1. Consider the case of an orthonormal design with $\Psi\Psi^{\top} = \mathbb{I}_p$. Specify the projector $\Pi$ of Lemma 4.3.1 for this situation, particularly its decomposition from (vi).
4.3.3 Quadratic loss and risk of the response estimation
In this section we study the quadratic risk of estimating the response $f$. The reason for focusing on the response rather than on the parameter will become clear when we discuss the properties of the fitted likelihood in the next section.

The loss $\wp(\tilde{f}, f)$ of the estimate $\tilde{f}$ can be naturally defined as the squared norm of the difference $\tilde{f} - f$:
\[
\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2} = \sum_{i=1}^{n} |\tilde{f}_i - f_i|^{2}.
\]
Correspondingly, the quadratic risk of the estimate $\tilde{f}$ is the mean of this loss:
\[
R(\tilde{f}) = \mathbb{E}\wp(\tilde{f}, f) = \mathbb{E}\bigl[(\tilde{f} - f)^{\top}(\tilde{f} - f)\bigr]. \tag{4.9}
\]
The next result describes the loss and risk decomposition for two cases: when the
parametric assumption f = Ψ>θ∗ is correct and in the general case.
Theorem 4.3.2. Suppose that the errors $\varepsilon_i$ from (4.1) are independent with $\mathbb{E}\varepsilon_i = 0$ and $\mathbb{E}\varepsilon_i^{2} = \sigma^{2}$, i.e. $\Sigma = \sigma^{2}\mathbb{I}_n$. Then the loss $\wp(\tilde{f}, f) = \|\Pi Y - f\|^{2}$ and the risk $R(\tilde{f})$ of the LSE $\tilde{f}$ fulfill
\[
\wp(\tilde{f}, f) = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \|f - \Pi f\|^{2} + p\sigma^{2}.
\]
Moreover, if $f = \Psi^{\top}\theta^{*}$, then
\[
\wp(\tilde{f}, f) = \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = p\sigma^{2}.
\]
Proof. We apply (4.9) and the decomposition (4.8) of the estimate $\tilde{f}$. It follows that
\[
\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2} = \|f - \Pi f - \Pi\varepsilon\|^{2}
= \|f - \Pi f\|^{2} - 2(f - \Pi f)^{\top}\Pi\varepsilon + \|\Pi\varepsilon\|^{2}.
\]
The cross term vanishes because $\Pi(f - \Pi f) = 0$ by Lemma 4.3.1, (ii); this implies the decomposition for the loss of $\tilde{f}$. Next we compute the mean of $\|\Pi\varepsilon\|^{2}$, applying again Lemma 4.3.1. Indeed,
\[
\mathbb{E}\|\Pi\varepsilon\|^{2} = \mathbb{E}(\Pi\varepsilon)^{\top}\Pi\varepsilon = \mathbb{E}\operatorname{tr}\bigl\{\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr\} = \mathbb{E}\operatorname{tr}\bigl(\Pi\varepsilon\varepsilon^{\top}\Pi^{\top}\bigr) = \operatorname{tr}\bigl\{\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr\} = \sigma^{2}\operatorname{tr}(\Pi^{2}) = p\sigma^{2}.
\]
Now consider the case when $f = \Psi^{\top}\theta^{*}$. By Lemma 4.3.1, $f = \Pi f$, and the last two statements of the theorem clearly follow.
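A Monte Carlo check of the risk decomposition of Theorem 4.3.2 (our own sketch; the response $f$ is deliberately chosen outside the span of the basis so that the bias term $\|f - \Pi f\|^{2}$ is non-zero):

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 40, 3, 0.5
Psi = rng.standard_normal((p, n))
Pi = Psi.T @ np.linalg.inv(Psi @ Psi.T) @ Psi
f = np.sin(np.linspace(0, 3, n))          # response not linear in the basis
bias2 = np.sum((f - Pi @ f) ** 2)         # squared bias ||f - Pi f||^2

losses = []
for _ in range(2000):
    eps = sigma * rng.standard_normal(n)
    f_hat = Pi @ (f + eps)                # the LSE of the response
    losses.append(np.sum((f_hat - f) ** 2))
risk_mc = np.mean(losses)                 # Monte Carlo risk
risk_theory = bias2 + p * sigma ** 2      # decomposition from Theorem 4.3.2
```

The simulated risk agrees with $\|f - \Pi f\|^{2} + p\sigma^{2}$ up to Monte Carlo error.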
4.3.4 Misspecified “colored noise”
Here we briefly comment on the case when ε is not a white noise. So, our assumption
about the errors εi is that they are uncorrelated and homogeneous, that is, Σ = σ2IIn
while the true covariance matrix is given by Σ0 . Many properties of the estimate f =
ΠY which are simply based on the linearity of the model (4.1) and of the estimate
$\tilde{f}$ itself continue to apply. In particular, the loss $\wp(\tilde{f}, f) = \|\tilde{f} - f\|^{2}$ can again be decomposed as
\[
\|\tilde{f} - f\|^{2} = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}.
\]
Theorem 4.3.3. Suppose that $\mathbb{E}\varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. Then the loss $\wp(\tilde{f}, f)$ and the risk $R(\tilde{f})$ of the LSE $\tilde{f}$ fulfill
\[
\wp(\tilde{f}, f) = \|f - \Pi f\|^{2} + \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \|f - \Pi f\|^{2} + \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]
Moreover, if $f = \Psi^{\top}\theta^{*}$, then
\[
\wp(\tilde{f}, f) = \|\Pi\varepsilon\|^{2}, \qquad
R(\tilde{f}) = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr).
\]
Proof. The decomposition of the loss from Theorem 4.3.2 only relies on the geometric properties of the projector $\Pi$ and does not use the covariance structure of the noise. Hence, it only remains to check the expectation of $\|\Pi\varepsilon\|^{2}$. Observe that
\[
\mathbb{E}\|\Pi\varepsilon\|^{2} = \mathbb{E}\operatorname{tr}\bigl[\Pi\varepsilon(\Pi\varepsilon)^{\top}\bigr] = \operatorname{tr}\bigl[\Pi\,\mathbb{E}(\varepsilon\varepsilon^{\top})\,\Pi\bigr] = \operatorname{tr}\bigl(\Pi\Sigma_0\Pi\bigr),
\]
as required.
4.4 Properties of the MLE θ
In this section we focus on the properties of the quasi MLE $\tilde{\theta}$ built for the idealized linear Gaussian model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. As in the previous section, we do not assume the parametric structure of the underlying model and consider the more general model $Y = f + \varepsilon$ with an unknown vector $f$ and errors $\varepsilon$ with zero mean and covariance matrix $\Sigma_0$. Due to (4.3), it holds $\tilde{\theta} = SY$ with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. An important feature of this estimate is its linear dependence on the data. The linear model equation $Y = f + \varepsilon$ and the linear structure of the estimate $\tilde{\theta} = SY$ allow us to decompose the vector $\tilde{\theta}$ into a deterministic and a stochastic term:
\[
\tilde{\theta} = SY = S(f + \varepsilon) = Sf + S\varepsilon. \tag{4.10}
\]
The first term $Sf$ is deterministic but depends on the unknown vector $f$, while the second term $S\varepsilon$ is stochastic but does not involve the model response $f$. Below we study the properties of each component separately.
4.4.1 Properties of the stochastic component
The next result describes the distributional properties of the stochastic component $\zeta = S\varepsilon$ for $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$ and thus of the estimate $\tilde{\theta}$.

Theorem 4.4.1. Assume $Y = f + \varepsilon$ with $\mathbb{E}\varepsilon = 0$ and $\operatorname{Var}(\varepsilon) = \Sigma_0$. The stochastic component $\zeta = S\varepsilon$ in (4.10) fulfills
\[
\mathbb{E}\zeta = 0, \qquad V^{2} \stackrel{\mathrm{def}}{=} \operatorname{Var}(\zeta) = S\Sigma_0 S^{\top}, \qquad \mathbb{E}\|\zeta\|^{2} = \operatorname{tr} V^{2} = \operatorname{tr}\bigl(S\Sigma_0 S^{\top}\bigr).
\]
Moreover, if $\Sigma = \Sigma_0 = \sigma^{2}\mathbb{I}_n$, then
\[
V^{2} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}, \qquad \mathbb{E}\|\zeta\|^{2} = \operatorname{tr}(V^{2}) = \sigma^{2}\operatorname{tr}\bigl[\bigl(\Psi\Psi^{\top}\bigr)^{-1}\bigr]. \tag{4.11}
\]
Similarly, for the estimate $\tilde{\theta}$ it holds
\[
\mathbb{E}\tilde{\theta} = Sf, \qquad \operatorname{Var}\bigl(\tilde{\theta}\bigr) = V^{2}.
\]
If the errors $\varepsilon$ are Gaussian, then both $\zeta$ and $\tilde{\theta}$ are Gaussian as well:
\[
\zeta \sim N(0, V^{2}), \qquad \tilde{\theta} \sim N(Sf, V^{2}).
\]
Proof. For the variance $V^{2}$ of $\zeta$ it holds
\[
\operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^{\top} = \mathbb{E} S\varepsilon\varepsilon^{\top}S^{\top} = S\Sigma_0 S^{\top}.
\]
Next we use that $\mathbb{E}\|\zeta\|^{2} = \mathbb{E}\zeta^{\top}\zeta = \mathbb{E}\operatorname{tr}(\zeta\zeta^{\top}) = \operatorname{tr} V^{2}$. If $\Sigma = \Sigma_0 = \sigma^{2}\mathbb{I}_n$, then (4.11) follows by simple algebra.

If $\varepsilon$ is a Gaussian vector, then $\zeta$, as its linear transformation, is Gaussian as well. The properties of $\tilde{\theta}$ follow directly from the decomposition (4.10).
With $\Sigma_0 \ne \sigma^{2}\mathbb{I}_n$, the variance $V^{2}$ can be represented as
\[
V^{2} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]

Exercise 4.4.1. Let $\zeta$ be the stochastic component of $\tilde{\theta}$ built for the misspecified linear model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$, and let the true noise variance be $\Sigma_0$. Then $\operatorname{Var}(\tilde{\theta}) = V^{2}$ with
\[
V^{2} = \bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}\Psi\Sigma^{-1}\Sigma_0\Sigma^{-1}\Psi^{\top}\bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}.
\]
The main finding in the presented study is that the stochastic part ζ = Sε of the
estimate θ is completely independent of the structure of the vector f . In other words,
the behavior of the stochastic component ζ does not change even if the linear parametric
assumption is misspecified.
4.4.2 Properties of the deterministic component
Now we study the deterministic term starting with the parametric situation f = Ψ>θ∗ .
Here we only specify the results for the case 1 with Σ = σ2IIn .
Theorem 4.4.2. Let $f = \Psi^{\top}\theta^{*}$. Then $\tilde{\theta} = SY$ with $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$ is unbiased, that is, $\mathbb{E}\tilde{\theta} = Sf = \theta^{*}$.

Proof. For the proof, just observe that $Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Psi^{\top}\theta^{*} = \theta^{*}$.
Now we briefly discuss what happens when the linear parametric assumption is not fulfilled, that is, $f$ cannot be represented as $\Psi^{\top}\theta^{*}$. In this case it is not yet clear what $\tilde{\theta}$ really estimates. The answer is given in the context of the general theory of minimum contrast estimation. Namely, define $\theta^{\dagger}$ as the point which maximizes the expectation of the (quasi) log-likelihood $L(\theta)$:
\[
\theta^{\dagger} = \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta). \tag{4.12}
\]
Theorem 4.4.3. The solution $\theta^{\dagger}$ of the optimization problem (4.12) is given by
\[
\theta^{\dagger} = Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f.
\]
Moreover,
\[
\Psi^{\top}\theta^{\dagger} = \Pi f = \Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f.
\]
In particular, if $f = \Psi^{\top}\theta^{*}$, then $\theta^{\dagger} = \theta^{*}$ and $\Psi^{\top}\theta^{\dagger} = f$.
Proof. The model equation $Y = f + \varepsilon$ and the properties of the stochastic component $\zeta$ yield by simple algebra
\[
\operatorname*{argmax}_{\theta} \mathbb{E}L(\theta)
= \operatorname*{argmin}_{\theta} \mathbb{E}\bigl(f - \Psi^{\top}\theta + \varepsilon\bigr)^{\top}\bigl(f - \Psi^{\top}\theta + \varepsilon\bigr)
= \operatorname*{argmin}_{\theta} \bigl\{(f - \Psi^{\top}\theta)^{\top}(f - \Psi^{\top}\theta) + \mathbb{E}\bigl(\varepsilon^{\top}\varepsilon\bigr)\bigr\}
= \operatorname*{argmin}_{\theta} (f - \Psi^{\top}\theta)^{\top}(f - \Psi^{\top}\theta),
\]
where the cross term disappears because $\mathbb{E}\varepsilon = 0$. Differentiating w.r.t. $\theta$ leads to the equation
\[
\Psi(f - \Psi^{\top}\theta) = 0
\]
and the solution $\theta^{\dagger} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f$, which is exactly the expected value of $\tilde{\theta}$ by Theorem 4.4.1.
Exercise 4.4.2. State the result of Theorems 4.4.2 and 4.4.3 for the MLE $\tilde{\theta}$ built in the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\operatorname{Var}(\varepsilon) = \Sigma$.

Hint: check that the statements continue to apply with $S = \bigl(\Psi\Sigma^{-1}\Psi^{\top}\bigr)^{-1}\Psi\Sigma^{-1}$.
The last results and the decomposition (4.10) explain the behavior of the estimate $\tilde{\theta}$ in a very general situation. The considered model is $Y = f + \varepsilon$. We assume a linear parametric structure and independent homogeneous noise. The estimation procedure is in fact a kind of projection of the data $Y$ onto a $p$-dimensional linear subspace of $\mathbb{R}^{n}$ spanned by the given basis vectors $\psi_1,\dots,\psi_p$. This projection, as a linear operator, can be decomposed into a projection of the deterministic vector $f$ and a projection of the random noise $\varepsilon$. If the linear parametric assumption $f \in \langle \psi_1,\dots,\psi_p \rangle$ is correct, that is, $f = \theta_1^{*}\psi_1 + \dots + \theta_p^{*}\psi_p$, then this projection keeps $f$ unchanged and only the random noise is reduced. If $f$ cannot be exactly expanded using the basis $\psi_1,\dots,\psi_p$, then the procedure recovers the projection of $f$ onto this subspace. The latter projection can be written as $\Psi^{\top}\theta^{\dagger}$, and the vector $\theta^{\dagger}$ can be viewed as the target of estimation.
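The following sketch (ours, not from the text) illustrates that under a misspecified linear parametric assumption the estimate $\tilde{\theta}$ is centered at the projection target $\theta^{\dagger} = (\Psi\Psi^{\top})^{-1}\Psi f$ of Theorem 4.4.3:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma = 60, 2, 0.3
x = np.linspace(0, 1, n)
Psi = np.vstack([np.ones(n), x])     # linear basis (intercept and slope)
f = np.cos(2 * x)                    # true response, NOT linear in x
S = np.linalg.inv(Psi @ Psi.T) @ Psi
theta_dagger = S @ f                 # target of estimation

# repeated estimates over fresh noise realizations
est = np.array([S @ (f + sigma * rng.standard_normal(n)) for _ in range(3000)])
```

Averaging the estimates over many noise realizations reproduces $\theta^{\dagger}$, not any "true" parameter, since none exists here.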
4.4.3 Risk of estimation. R-efficiency
This section briefly discusses how the obtained properties of the estimate $\tilde{\theta}$ can be used to evaluate the risk of estimation. A particularly important question is the optimality of the MLE $\tilde{\theta}$. The main result of the section claims that $\tilde{\theta}$ is R-efficient if the model is correctly specified, and is not if there is a misspecification.

We start with the case of a correct parametric specification $Y = \Psi^{\top}\theta^{*} + \varepsilon$, that is, the linear parametric assumption (LPA) $f = \Psi^{\top}\theta^{*}$ is exactly fulfilled and the noise $\varepsilon$ is homogeneous: $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. Later we extend the result to the case when the LPA is not fulfilled and to the case when the noise is not homogeneous but still correctly specified. Finally we discuss the case when the noise structure is misspecified.
Under the LPA $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, the estimate $\tilde{\theta}$ is also normal with mean $\theta^{*}$ and variance $V^{2} = \sigma^{2}SS^{\top} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}$. Define a $p\times p$ symmetric matrix $D$ by the equation
\[
D^{2} = \frac{1}{\sigma^{2}} \sum_{i=1}^{n} \Psi_i\Psi_i^{\top} = \frac{1}{\sigma^{2}}\Psi\Psi^{\top}.
\]
Clearly $V^{2} = D^{-2}$.
Now we show that $\tilde{\theta}$ is R-efficient. Actually this fact can be derived from the Cramer-Rao Theorem because the Gaussian model is a special case of an exponential family. However, we check this statement directly by computing the Cramer-Rao efficiency bound. Recall that the Fisher information matrix $I(\theta)$ for the log-likelihood
L(θ) is defined as the variance of ∇L(θ) under IPθ .
Theorem 4.4.4 (Gauss-Markov). Let $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. Then $\tilde{\theta}$ is an R-efficient estimate of $\theta^{*}$: $\mathbb{E}\tilde{\theta} = \theta^{*}$,
\[
\mathbb{E}\bigl[\bigl(\tilde{\theta} - \theta^{*}\bigr)\bigl(\tilde{\theta} - \theta^{*}\bigr)^{\top}\bigr] = \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2},
\]
and for any linear unbiased estimate $\bar{\theta}$ satisfying $\mathbb{E}\bar{\theta} = \theta^{*}$, it holds
\[
\operatorname{Var}\bigl(\bar{\theta}\bigr) \ge \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}.
\]
Proof. Theorems 4.4.1 and 4.4.2 imply that $\tilde{\theta} \sim N(\theta^{*}, V^{2})$ with $V^{2} = \sigma^{2}(\Psi\Psi^{\top})^{-1} = D^{-2}$. Next we show that for any $\theta$
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2},
\]
that is, the Fisher information does not depend on the model function $f$. The log-likelihood $L(\theta)$ for the model $Y \sim N(\Psi^{\top}\theta^{*}, \sigma^{2}\mathbb{I}_n)$ reads as
\[
L(\theta) = -\frac{1}{2\sigma^{2}}(Y - \Psi^{\top}\theta)^{\top}(Y - \Psi^{\top}\theta) - \frac{n}{2}\log(2\pi\sigma^{2}).
\]
This yields for its gradient $\nabla L(\theta)$:
\[
\nabla L(\theta) = \sigma^{-2}\Psi(Y - \Psi^{\top}\theta),
\]
and in view of $\operatorname{Var}(Y) = \Sigma = \sigma^{2}\mathbb{I}_n$, it holds
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = \sigma^{-4}\Psi \operatorname{Var}(Y)\Psi^{\top} = \sigma^{-2}\Psi\Psi^{\top},
\]
as required.
The R-efficiency of $\tilde{\theta}$ follows from the Cramer-Rao efficiency bound because $\bigl\{\operatorname{Var}(\tilde{\theta})\bigr\}^{-1} = \operatorname{Var}\bigl\{\nabla L(\theta)\bigr\}$. However, we present an independent proof of this fact. Actually we prove a sharper result: the variance of a linear unbiased estimate $\bar{\theta}$ coincides with the variance of $\tilde{\theta}$ only if $\bar{\theta}$ coincides almost surely with $\tilde{\theta}$; otherwise it is larger. The idea of the proof is quite simple. Consider the difference $\bar{\theta} - \tilde{\theta}$ and show that the condition $\mathbb{E}\bar{\theta} = \mathbb{E}\tilde{\theta} = \theta^{*}$ implies the orthogonality $\mathbb{E}\bigl\{\tilde{\theta}(\bar{\theta} - \tilde{\theta})^{\top}\bigr\} = 0$. This, in turn, implies $\operatorname{Var}(\bar{\theta}) = \operatorname{Var}(\tilde{\theta}) + \operatorname{Var}(\bar{\theta} - \tilde{\theta}) \ge \operatorname{Var}(\tilde{\theta})$. So, it remains to check the orthogonality of $\tilde{\theta}$ and $\bar{\theta} - \tilde{\theta}$. Let $\bar{\theta} = AY$ for a $p\times n$ matrix $A$ with $\mathbb{E}\bar{\theta} \equiv \theta^{*}$ for all $\theta^{*}$. This identity and $\mathbb{E}Y = \Psi^{\top}\theta^{*}$ imply that $A\Psi^{\top}\theta^{*} \equiv \theta^{*}$, i.e. $A\Psi^{\top}$ is the identity $p\times p$ matrix. The same is true for $\tilde{\theta} = SY$, yielding $S\Psi^{\top} = \mathbb{I}_p$. Next, in view of $\mathbb{E}\bar{\theta} = \mathbb{E}\tilde{\theta} = \theta^{*}$,
\[
\mathbb{E}\bigl\{(\bar{\theta} - \tilde{\theta})\tilde{\theta}^{\top}\bigr\} = \mathbb{E}(A - S)\varepsilon\varepsilon^{\top}S^{\top} = \sigma^{2}(A - S)\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1} = 0,
\]
and the assertion follows.
Exercise 4.4.3. Check the details of the proof of the theorem. Show that the statement $\operatorname{Var}(\bar{\theta}) \ge \operatorname{Var}(\tilde{\theta})$ only uses that $\bar{\theta}$ is unbiased and that $\mathbb{E}Y = \Psi^{\top}\theta^{*}$ and $\operatorname{Var}(Y) = \sigma^{2}\mathbb{I}_n$.
Exercise 4.4.4. Compute $\nabla^{2}L(\theta)$. Check that it is non-random, does not depend on $\theta$, and fulfills for every $\theta$ the identity
\[
\nabla^{2}L(\theta) \equiv -\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = -D^{2}.
\]
A colored noise
The majority of the presented results continue to apply in the case of heterogeneous and even dependent noise with $\operatorname{Var}(\varepsilon) = \Sigma_0$. The key facts behind this extension are the decomposition (4.10) and the properties of the stochastic component $\zeta$ from Section 4.4.1: $\zeta \sim N(0, V^{2})$. In the case of a colored noise, the definition of $V$ and $D$ is changed to
\[
D^{2} \stackrel{\mathrm{def}}{=} V^{-2} = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]

Exercise 4.4.5. State and prove the analog of Theorem 4.4.4 for the colored noise $\varepsilon \sim N(0, \Sigma_0)$.
A misspecified LPA
An interesting feature of our results so far is that they apply equally to the correct linear specification $f = \Psi^{\top}\theta^{*}$ and to the case when the identity $f = \Psi^{\top}\theta$ is not precisely fulfilled whatever $\theta$ is taken. In this situation the target of analysis is the vector $\theta^{\dagger}$ describing the best linear approximation of $f$ by $\Psi^{\top}\theta$. We already know from the results of Sections 4.4.1 and 4.4.2 that the estimate $\tilde{\theta}$ is also normal with mean $\theta^{\dagger} = Sf = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi f$ and variance $V^{2} = \sigma^{2}SS^{\top} = \sigma^{2}\bigl(\Psi\Psi^{\top}\bigr)^{-1}$.
Theorem 4.4.5. Assume $Y = f + \varepsilon$ with $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$, and let $\theta^{\dagger} = Sf$. Then $\tilde{\theta}$ is an R-efficient estimate of $\theta^{\dagger}$: $\mathbb{E}\tilde{\theta} = \theta^{\dagger}$,
\[
\mathbb{E}\bigl[\bigl(\tilde{\theta} - \theta^{\dagger}\bigr)\bigl(\tilde{\theta} - \theta^{\dagger}\bigr)^{\top}\bigr] = \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2},
\]
and for any linear unbiased estimate $\bar{\theta}$ satisfying $\mathbb{E}\bar{\theta} = \theta^{\dagger}$, it holds
\[
\operatorname{Var}\bigl(\bar{\theta}\bigr) \ge \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}.
\]
Proof. The proof only utilizes that $\tilde{\theta} \sim N(\theta^{\dagger}, V^{2})$ with $V^{2} = D^{-2}$. The only small remark concerns the equality $\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2}$ from Theorem 4.4.4.

Exercise 4.4.6. Check the identity $\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = D^{2}$ from Theorem 4.4.4 for $\varepsilon \sim N(0, \Sigma_0)$.
4.4.4 The case of a misspecified noise
Here we again consider the linear parametric assumption $Y = \Psi^{\top}\theta^{*} + \varepsilon$. However, contrary to the previous section, we admit that the noise $\varepsilon$ is not homogeneous normal: $\varepsilon \sim N(0, \Sigma_0)$, while our estimation procedure is the quasi MLE based on the assumption of noise homogeneity $\varepsilon \sim N(0, \sigma^{2}\mathbb{I}_n)$. We already know that the estimate $\tilde{\theta}$ is unbiased with mean $\theta^{*}$ and variance $V^{2} = S\Sigma_0 S^{\top}$, where $S = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi$. This gives
\[
V^{2} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi\Sigma_0\Psi^{\top}\bigl(\Psi\Psi^{\top}\bigr)^{-1}.
\]
The question is whether the estimate $\tilde{\theta}$ based on the misspecified distributional assumption is efficient. The Cramer-Rao result delivers the lower bound for the quadratic risk in the form $\operatorname{Var}(\bar{\theta}) \ge \bigl[\operatorname{Var}\bigl(\nabla L(\theta)\bigr)\bigr]^{-1}$. We already know that the use of the correctly specified covariance matrix of the errors leads to an R-efficient estimate $\tilde{\theta}$. The next result shows that the use of a misspecified matrix $\Sigma$ results in an estimate which is unbiased but not R-efficient, that is, the best estimation risk is achieved if we apply the correct model assumptions.
Theorem 4.4.6. Let $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma_0)$. Then
\[
\operatorname{Var}\bigl[\nabla L(\theta)\bigr] = \Psi\Sigma_0^{-1}\Psi^{\top}.
\]
The estimate $\tilde{\theta} = \bigl(\Psi\Psi^{\top}\bigr)^{-1}\Psi Y$ is unbiased, that is, $\mathbb{E}\tilde{\theta} = \theta^{*}$, but it is not R-efficient unless $\Sigma_0 = \Sigma$.
Proof. Let $\tilde{\theta}_0$ be the MLE for the correct model specification with the noise $\varepsilon \sim N(0, \Sigma_0)$. As $\tilde{\theta}$ is unbiased, the difference $\tilde{\theta} - \tilde{\theta}_0$ is orthogonal to $\tilde{\theta}_0$, and it holds for the variance of $\tilde{\theta}$
\[
\operatorname{Var}(\tilde{\theta}) = \operatorname{Var}(\tilde{\theta}_0) + \operatorname{Var}(\tilde{\theta} - \tilde{\theta}_0);
\]
cf. the proof of the Gauss-Markov Theorem 4.4.4.

Exercise 4.4.7. Compare directly the variances of $\tilde{\theta}$ and of $\tilde{\theta}_0$.
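In the spirit of Exercise 4.4.7, the two variance matrices can be compared directly in a numerical sketch (ours, not from the text): the gap between the variance of the quasi MLE and that of the correctly specified MLE should be positive semidefinite.

```python
import numpy as np

rng = np.random.default_rng(7)
n, p = 30, 3
Psi = rng.standard_normal((p, n))
Sigma0 = np.diag(rng.uniform(0.2, 3.0, size=n))   # true heterogeneous covariance

G = np.linalg.inv(Psi @ Psi.T)
# variance of the quasi MLE built under the (wrong) assumption Sigma = sigma^2 I
V2_quasi = G @ Psi @ Sigma0 @ Psi.T @ G
# variance of the MLE using the correct covariance Sigma0
V2_true = np.linalg.inv(Psi @ np.linalg.inv(Sigma0) @ Psi.T)

gap = V2_quasi - V2_true
eigs = np.linalg.eigvalsh(gap)
```

All eigenvalues of the gap are (numerically) nonnegative, confirming that using the wrong covariance never helps.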
4.5 Linear models and quadratic log-likelihood
Linear Gaussian modeling leads to a specific log-likelihood structure; see Section 4.2. Namely, the log-likelihood function $L(\theta)$ is quadratic in $\theta$, the coefficients of the quadratic terms are deterministic, and the cross term is linear both in $\theta$ and in the observations $Y_i$. Here we show that this geometric structure of the log-likelihood characterizes linear models. We say that $L(\theta)$ is quadratic if it is a quadratic function of $\theta$ and there is a deterministic symmetric matrix $D^{2}$ such that for any $\theta^{\circ}, \theta$
\[
L(\theta) - L(\theta^{\circ}) = (\theta - \theta^{\circ})^{\top}\nabla L(\theta^{\circ}) - (\theta - \theta^{\circ})^{\top}D^{2}(\theta - \theta^{\circ})/2. \tag{4.13}
\]
Here $\nabla L(\theta) \stackrel{\mathrm{def}}{=} \frac{dL(\theta)}{d\theta}$. As usual we define
\[
\tilde{\theta} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} L(\theta), \qquad
\theta^{*} = \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta).
\]
The next result describes some properties of the estimate $\tilde{\theta}$ which are entirely based on the geometric (quadratic) structure of the function $L(\theta)$. All the results are stated using the matrix $D^{2}$ and the vector $\zeta = \nabla L(\theta^{*})$.
Theorem 4.5.1. Let $L(\theta)$ be quadratic with a non-degenerate matrix $D^{2}$. Then
\[
\tilde{\theta} - \theta^{*} = D^{-2}\zeta \tag{4.14}
\]
with $\zeta \stackrel{\mathrm{def}}{=} \nabla L(\theta^{*})$. Moreover, $\mathbb{E}\zeta = 0$, and it holds with $V^{2} = \operatorname{Var}(\zeta) = \mathbb{E}\zeta\zeta^{\top}$
\[
\mathbb{E}\tilde{\theta} = \theta^{*}, \qquad \operatorname{Var}\bigl(\tilde{\theta}\bigr) = D^{-2}V^{2}D^{-2}.
\]
Further, for any $\theta$,
\[
L(\tilde{\theta}) - L(\theta) = (\tilde{\theta} - \theta)^{\top}D^{2}(\tilde{\theta} - \theta)/2 = \|D(\tilde{\theta} - \theta)\|^{2}/2. \tag{4.15}
\]
Finally, it holds for the excess $L(\tilde{\theta}, \theta^{*}) \stackrel{\mathrm{def}}{=} L(\tilde{\theta}) - L(\theta^{*})$
\[
2L(\tilde{\theta}, \theta^{*}) = (\tilde{\theta} - \theta^{*})^{\top}D^{2}(\tilde{\theta} - \theta^{*}) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^{2} \tag{4.16}
\]
with $\xi = D^{-1}\zeta$.
Proof. The equation (4.13) with $\theta^{\circ} = \theta^{*}$ implies for any $\theta$
\[
\nabla L(\theta) = \nabla L(\theta^{\circ}) - D^{2}(\theta - \theta^{\circ}) = \zeta - D^{2}(\theta - \theta^{*}). \tag{4.17}
\]
In particular, $\nabla L(\tilde{\theta}) = 0$ yields (4.14). Therefore, it holds for the expectation $\mathbb{E}L(\theta)$
\[
\nabla \mathbb{E}L(\theta) = \mathbb{E}\zeta - D^{2}(\theta - \theta^{*}),
\]
and the equation $\nabla \mathbb{E}L(\theta^{*}) = 0$ implies $\mathbb{E}\zeta = 0$.

To show (4.15), apply again the property (4.13) with $\theta^{\circ} = \tilde{\theta}$:
\[
L(\theta) - L(\tilde{\theta}) = (\theta - \tilde{\theta})^{\top}\nabla L(\tilde{\theta}) - (\theta - \tilde{\theta})^{\top}D^{2}(\theta - \tilde{\theta})/2
= -(\theta - \tilde{\theta})^{\top}D^{2}(\theta - \tilde{\theta})/2.
\]
Here we used that $\nabla L(\tilde{\theta}) = 0$ because $\tilde{\theta}$ is an extreme point of $L(\theta)$. The last result (4.16) is the special case $\theta = \theta^{*}$, in view of (4.14).
This theorem delivers an important message: the main properties of the MLE $\tilde{\theta}$ can be explained via the geometric (quadratic) structure of the log-likelihood. An interesting question to clarify is whether a quadratic log-likelihood structure is specific to linear Gaussian models. The answer is positive: there is a one-to-one correspondence between linear Gaussian models and quadratic log-likelihood functions. Indeed, the identity (4.17) with $\theta^{\circ} = \theta^{*}$ can be rewritten as
\[
\nabla L(\theta) + D^{2}\theta \equiv \zeta + D^{2}\theta^{*},
\]
so the left-hand side does not depend on $\theta$. If we fix any $\theta$ and define $Y = \nabla L(\theta) + D^{2}\theta$, this yields
\[
Y = D^{2}\theta^{*} + \zeta.
\]
Similarly, $Y \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^{2}\theta\bigr\}$ yields the equation
\[
Y = D\theta^{*} + \xi, \tag{4.18}
\]
where $\xi = D^{-1}\zeta$. We can summarize as follows.
Theorem 4.5.2. Let $L(\theta)$ be quadratic with a non-degenerate matrix $D^{2}$. Then $Y \stackrel{\mathrm{def}}{=} D^{-1}\bigl\{\nabla L(\theta) + D^{2}\theta\bigr\}$ does not depend on $\theta$, and $L(\theta) - L(\theta^{*})$ is the quasi log-likelihood ratio for the linear Gaussian model (4.18) with $\xi$ standard normal. It is the true log-likelihood if and only if $\zeta \sim N(0, D^{2})$.

Proof. The model (4.18) with $\xi \sim N(0, \mathbb{I}_p)$ leads to the log-likelihood ratio
\[
(\theta - \theta^{*})^{\top}D(Y - D\theta^{*}) - \|D(\theta - \theta^{*})\|^{2}/2 = (\theta - \theta^{*})^{\top}\zeta - \|D(\theta - \theta^{*})\|^{2}/2,
\]
which coincides with $L(\theta) - L(\theta^{*})$ from (4.13). Also $\zeta \sim N(0, D^{2})$ if and only if $\xi = D^{-1}\zeta$ is standard normal.
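The identities (4.14) and (4.16) can be verified exactly (up to rounding) for a homogeneous Gaussian linear model, where $\zeta = \sigma^{-2}\Psi(Y - \Psi^{\top}\theta^{*})$ and $D^{2} = \sigma^{-2}\Psi\Psi^{\top}$. A sketch of ours:

```python
import numpy as np

rng = np.random.default_rng(8)
n, p, sigma = 20, 4, 0.7
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p)
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

def L(th):
    # Gaussian log-likelihood up to a constant not depending on th
    return -0.5 / sigma ** 2 * np.sum((Y - Psi.T @ th) ** 2)

D2 = Psi @ Psi.T / sigma ** 2
zeta = Psi @ (Y - Psi.T @ theta_star) / sigma ** 2   # score nabla L(theta*)
theta_hat = np.linalg.solve(Psi @ Psi.T, Psi @ Y)    # the MLE
```

Both $\tilde{\theta} - \theta^{*} = D^{-2}\zeta$ and $2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta$ hold as exact algebraic identities, independently of the noise realization.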
4.6 Inference based on the maximum likelihood
All the results presented above for linear models were based on the explicit representation of the (quasi) MLE $\tilde{\theta}$. Here we present an approach based on the analysis of the maximum likelihood itself. This approach does not require any analytic expression for the point of maximum of the (quasi) likelihood process $L(\theta)$; instead we work directly with the maximum of this process. We establish exponential inequalities for the "fitted likelihood" $L(\tilde{\theta}, \theta^{*})$. We also show how these results can be used to study the accuracy of the MLE $\tilde{\theta}$, in particular, for building confidence sets.

One more benefit of the ML-based approach is that it applies equally to a homogeneous and to a heterogeneous noise, provided that the noise structure is not misspecified. The celebrated chi-squared result about the maximum likelihood $L(\tilde{\theta}, \theta^{*})$ claims that the distribution of $2L(\tilde{\theta}, \theta^{*})$ is chi-squared with $p$ degrees of freedom, $\chi^{2}_p$, and it does not depend on the noise covariance; see below in this section.
Now we specify the setup. The starting point of the ML-approach is the linear Gaussian model assumption $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$. The corresponding log-likelihood ratio $L(\theta)$ can be written as
\[
L(\theta) = -\frac{1}{2}(Y - \Psi^{\top}\theta)^{\top}\Sigma^{-1}(Y - \Psi^{\top}\theta) + R, \tag{4.19}
\]
where the remainder term $R$ does not depend on $\theta$. Now one can see that $L(\theta)$ is a quadratic function of $\theta$. Moreover, $\nabla^{2}L(\theta) = -\Psi\Sigma^{-1}\Psi^{\top}$, so that $L(\theta)$ is quadratic with $D^{2} = \Psi\Sigma^{-1}\Psi^{\top}$. This enables us to apply the general results of Section 4.5, which are only based on the geometric (quadratic) structure of the log-likelihood $L(\theta)$: the true data distribution can be arbitrary.
Theorem 4.6.1. Consider $L(\theta)$ from (4.19). For any $\theta$, it holds with $D^{2} = \Psi\Sigma^{-1}\Psi^{\top}$
\[
L(\tilde{\theta}, \theta) = (\tilde{\theta} - \theta)^{\top}D^{2}(\tilde{\theta} - \theta)/2. \tag{4.20}
\]
In particular, if $\Sigma = \sigma^{2}\mathbb{I}_n$, then the fitted log-likelihood is proportional to the quadratic loss $\|\tilde{f} - f_{\theta}\|^{2}$ for $\tilde{f} = \Psi^{\top}\tilde{\theta}$ and $f_{\theta} = \Psi^{\top}\theta$:
\[
L(\tilde{\theta}, \theta) = \frac{1}{2\sigma^{2}}\bigl\|\Psi^{\top}(\tilde{\theta} - \theta)\bigr\|^{2} = \frac{1}{2\sigma^{2}}\bigl\|\tilde{f} - f_{\theta}\bigr\|^{2}.
\]
If $\theta^{*} \stackrel{\mathrm{def}}{=} \operatorname*{argmax}_{\theta} \mathbb{E}L(\theta) = D^{-2}\Psi\Sigma^{-1}f$ for $f = \mathbb{E}Y$, then
\[
2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta = \|\xi\|^{2} \tag{4.21}
\]
with $\zeta = \nabla L(\theta^{*})$ and $\xi \stackrel{\mathrm{def}}{=} D^{-1}\zeta$. Moreover, if the model $Y = \Psi^{\top}\theta^{*} + \varepsilon$ with $\varepsilon \sim N(0, \Sigma)$ is correct, then $\xi \sim N(0, \mathbb{I}_p)$ and $2L(\tilde{\theta}, \theta^{*}) \sim \chi^{2}_p$ is chi-squared with $p$ degrees of freedom.
Proof. The results (4.20) and (4.21) follow from Theorem 4.5.1; see (4.15) and (4.16). Further,
\[
\zeta = \nabla L(\theta^{*}) = \Psi\Sigma^{-1}(Y - \Psi^{\top}\theta^{*}) = \Psi\Sigma^{-1}\varepsilon.
\]
So, if $Y$ is Gaussian, then $\zeta$ is Gaussian as well, as a linear transformation of a Gaussian vector. By Theorem 4.5.1, $\mathbb{E}\zeta = 0$. Moreover, $\operatorname{Var}(\varepsilon) = \Sigma$ implies
\[
\operatorname{Var}(\zeta) = \Psi\Sigma^{-1}\,\mathbb{E}\varepsilon\varepsilon^{\top}\,\Sigma^{-1}\Psi^{\top} = \Psi\Sigma^{-1}\Psi^{\top} = D^{2},
\]
yielding that $\xi = D^{-1}\zeta$ is standard normal.
The last result $2L(\tilde{\theta}, \theta^{*}) \sim \chi^{2}_p$ is sometimes called the "chi-squared phenomenon": the distribution of the maximum likelihood only depends on the number of parameters to be estimated and is independent of the design $\Psi$, of the noise covariance matrix $\Sigma$, etc. This explains the use of the word "phenomenon" in the name of the result.
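The chi-squared phenomenon is easy to observe by simulation (our own sketch, not from the text): for an arbitrary design and an arbitrary known covariance $\Sigma = \Sigma_0$, the excess $2L(\tilde{\theta}, \theta^{*}) = \zeta^{\top}D^{-2}\zeta$ behaves like a $\chi^{2}_p$ variable, with mean $p$ and variance $2p$.

```python
import numpy as np

rng = np.random.default_rng(9)
n, p = 35, 4
Psi = rng.standard_normal((p, n))
Sigma = np.diag(rng.uniform(0.5, 2.0, size=n))   # arbitrary known covariance
Si = np.linalg.inv(Sigma)
D2 = Psi @ Si @ Psi.T
root = np.linalg.cholesky(Sigma)                 # to sample eps ~ N(0, Sigma)

vals = []
for _ in range(4000):
    eps = root @ rng.standard_normal(n)
    zeta = Psi @ Si @ eps                        # score at theta*
    vals.append(zeta @ np.linalg.solve(D2, zeta))  # 2 L(theta_hat, theta*)
vals = np.array(vals)
```

Neither the design nor the covariance enters the limiting distribution; only $p$ does.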
Exercise 4.6.1. Check that the linear transformation Y′ = Σ−1/2Y of the data does
not change the value of the log-likelihood ratio L(θ,θ∗) and hence of the maximum
likelihood L(θ̃,θ∗) .
Hint: use the representation

L(θ) = −(1/2)(Y − Ψ>θ)>Σ−1(Y − Ψ>θ) + R
     = −(1/2)(Y′ − Ψ′>θ)>(Y′ − Ψ′>θ) + R

and check that the transformed data Y′ follow the model Y′ = Ψ′>θ∗ + ε′ with
Ψ′ = ΨΣ−1/2 and ε′ = Σ−1/2ε ∼ N(0, IIn) , yielding the same log-likelihood ratio as in
the original model.
Exercise 4.6.2. Assume homogeneous noise in (4.19) with Σ = σ2 IIn . Then it holds

2L(θ̃,θ∗) = σ−2‖Πε‖2 ,

where Π = Ψ>(ΨΨ>)−1Ψ is the projector in IRn onto the subspace spanned by the
vectors ψ1, . . . ,ψp .
Hint: use that ζ = σ−2Ψε , D2 = σ−2ΨΨ> , and

σ−2‖Πε‖2 = σ−2 ε>Π>Πε = σ−2 ε>Πε = ζ>D−2ζ.
We write the result of Theorem 4.6.1 in the form 2L(θ̃,θ∗) ∼ χ2p , where χ2p stands
for the chi-squared distribution with p degrees of freedom. This result can be used to
build likelihood-based confidence ellipsoids for the parameter θ∗ . Given z > 0 , define

E(z) = {θ : L(θ̃,θ) ≤ z} = {θ : supθ′ L(θ′) − L(θ) ≤ z}. (4.22)
Theorem 4.6.2. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) and consider the MLE θ̃ .
Define zα by IP(χ2p > 2zα) = α . Then E(zα) from (4.22) is an α -confidence set for θ∗ .
Exercise 4.6.3. Let D2 = ΨΣ−1Ψ> . Check that the likelihood-based CS E(zα) and
the estimate-based CS {θ : ‖D(θ̃ − θ)‖ ≤ z′α} with (z′α)2 = 2zα coincide in the case of
linear modeling:

E(zα) = {θ : ‖D(θ̃ − θ)‖2 ≤ 2zα}.
Another corollary of the chi-squared result is a concentration bound for the maximum
likelihood. A similar result was stated for the univariate exponential family model: the
value L(θ̃,θ∗) is stochastically bounded with exponential moments, and the bound does
not depend on the particular family, parameter value, sample size, etc. Now we extend
this result to the case of a linear Gaussian model. Indeed, Theorem 4.6.1 states
that the distribution of 2L(θ̃,θ∗) is chi-squared and only depends on the number of
parameters to be estimated. The latter distribution concentrates on a ball of radius of
order p1/2 , and the deviation probability is exponentially small.
Theorem 4.6.3. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) . Then for every x > 0 and
any κ ≥ 6.6 , it holds

IP(2L(θ̃,θ∗) > p + √(κxp) ∨ (κx)) = IP(‖D(θ̃ − θ∗)‖2 > p + √(κxp) ∨ (κx)) ≤ exp(−x). (4.23)
Proof. Define ξ def= D(θ̃ − θ∗) . By Theorem 4.4.4, ξ is a standard normal vector in IRp ,
and by Theorem 4.6.1, 2L(θ̃,θ∗) = ‖ξ‖2 . Now the statement (4.23) follows from the
general deviation bound for Gaussian quadratic forms; see Theorem 9.1.1.
The main message of this result can be explained as follows: the probability that the
estimate θ̃ deviates from the elliptic set {θ : ‖D(θ − θ∗)‖ ≤ z} starts to vanish
when z2 exceeds the dimensionality p of the parameter space. Similarly, the probability
that the true parameter θ∗ is not covered by the confidence set E(z) starts to vanish
when 2z exceeds p .
Corollary 4.6.4. Assume Y = Ψ>θ∗ + ε with ε ∼ N(0, Σ) . Then for every x > 0 and
κ ≥ 6.6 , it holds with 2z = p + √(κxp) ∨ (κx) that

IP(E(z) ∌ θ∗) ≤ exp(−x).
Exercise 4.6.4. Compute z ensuring a coverage of 95% in dimensions p = 1, 2, 10, 20 .
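One possible numerical answer to this exercise: since 2zα is the (1 − α) -quantile of χ2p , the values z follow from the chi-squared quantiles. The sketch below estimates them by Monte Carlo with NumPy alone (the simulation size 400 000 is an arbitrary choice; tabulated quantiles would do equally well).

```python
import numpy as np

rng = np.random.default_rng(1)
zs = {}
for p in (1, 2, 10, 20):
    # 2*z_alpha is the (1 - alpha)-quantile of chi^2_p; here alpha = 0.05
    two_z = np.quantile(rng.chisquare(p, size=400_000), 0.95)
    zs[p] = two_z / 2
    print(p, zs[p])
```

The printed values should be close to 1.92, 3.00, 9.15 and 15.71 for p = 1, 2, 10, 20 respectively.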
4.6.1 A misspecified LPA
Now we discuss the behavior of the fitted log-likelihood under a misspecified linear
parametric assumption IEY = Ψ>θ∗ . Let the response function f not be representable
as f = Ψ>θ∗ . Following Theorem 4.4.3, define θ† = Sf with S = (ΨΣ−1Ψ>)−1ΨΣ−1 .
This point provides the best approximation of the nonlinear response f by a linear
parametric fit Ψ>θ .
Theorem 4.6.5. Assume Y = f + ε with ε ∼ N(0, Σ) . Let θ† = Sf . Then

2L(θ̃,θ†) = ζ>D−2ζ = ‖ξ‖2 ∼ χ2p ,

where D2 = ΨΣ−1Ψ> , ζ = ∇L(θ†) = ΨΣ−1ε , ξ = D−1ζ is a standard normal vector in
IRp , and χ2p is a chi-squared random variable with p degrees of freedom. In particular,
E(zα) is an α -CS for the vector θ† and the bound of Corollary 4.6.4 applies.
Exercise 4.6.5. Prove the result of Theorem 4.6.5.
4.6.2 A misspecified noise structure
This section addresses the features of the maximum likelihood in the case when the
likelihood is built under a wrong assumption about the noise structure. To be specific,
we consider the likelihood for homogeneous noise Σ = σ2 IIn , while the true noise
covariance is only assumed to be non-degenerate. As one can expect, the chi-squared
result is no longer valid in this situation, and the distribution of the maximum likelihood
depends on the true noise covariance. However, the nice geometric structure of the
maximum likelihood manifested by Theorems 4.6.1 and 4.6.2 does not rely on the true
data distribution; it is only based on our structural assumptions on the considered model.
This helps to get rigorous results about the behavior of the maximum likelihood and
particularly about its concentration properties.
Recall the notation D2 = σ−2ΨΨ> . It is a symmetric p × p matrix describing the
covariance structure of the estimate θ̃ under the homogeneous noise.
Theorem 4.6.6. Let θ̃ be built for the model Y = Ψ>θ∗ + ε with ε ∼ N(0, σ2 IIn) ,
while the true noise covariance is Σ0 : IEε = 0 and Var(ε) = Σ0 . Then

2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2, (4.24)

where ξ is a random vector in IRp with IEξ = 0 and

Var(ξ) = B def= DSΣ0S>D = σ−4D−1ΨΣ0Ψ>D−1 for S = (ΨΨ>)−1Ψ .
Moreover, if ε ∼ N(0, Σ0) , then ξ ∼ N(0, B) .
Proof. The equality 2L(θ̃,θ∗) = ‖D(θ̃ − θ∗)‖2 = ‖ξ‖2 has already been proved in
Theorem 4.6.1. Moreover, by Theorem 4.4.1, θ̃ − θ∗ = Sε with S = (ΨΨ>)−1Ψ , so that

Var(θ̃) = S Var(ε)S> = SΣ0S> = σ−4D−2ΨΣ0Ψ>D−2 .

This implies

Var(ξ) = IEξξ> = D Var(θ̃)D = DSΣ0S>D = σ−4D−1ΨΣ0Ψ>D−1 .

It remains to note that if ε is a Gaussian vector, then ξ = DSε is Gaussian as well.
One can see that the chi-squared result is no longer valid if the noise structure is
misspecified. An interesting question is whether the CS E(z) can still be applied in the
case of misspecified noise under a proper adjustment of the value z . Surprisingly, the
answer is not entirely negative. The reason is that the vector ξ from (4.24) is zero-mean
and its norm behaves similarly to the case of correct noise specification: the probability
IP(‖ξ‖ > z) starts to degenerate when z2 exceeds IE‖ξ‖2 . A general bound from
Theorem 9.1.2 in Section 9 implies the following bound for the coverage probability.
Corollary 4.6.7. Under the conditions of Theorem 4.6.6, for every x > 0 , it holds with
pB = tr(B) , v2 = 2 tr(B2) , and a∗ = ‖B‖∞ that

IP(2L(θ̃,θ∗) > pB + (2vx1/2) ∨ (6a∗x)) ≤ exp(−x).
Exercise 4.6.6. Show that an overestimation of the noise in the sense Σ ≥ Σ0 preserves
the coverage probability for the CS E(zα) , that is, if 2zα is the (1 − α) -quantile of χ2p ,
then IP(E(zα) ∌ θ∗) ≤ α .
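The identity IE 2L(θ̃,θ∗) = tr(B) behind Corollary 4.6.7 can be checked by simulation using the representation Var(ξ) = DSΣ0S>D from the proof of Theorem 4.6.6. The NumPy sketch below uses an illustrative diagonal Σ0 and arbitrary sizes; it is not part of the text.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 4, 1.0
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p)
Sigma0 = np.diag(rng.uniform(0.5, 2.0, size=n))  # true heteroscedastic covariance

S = np.linalg.solve(Psi @ Psi.T, Psi)            # S = (Psi Psi^T)^{-1} Psi
D2 = Psi @ Psi.T / sigma**2                      # D^2 assumed by the (wrong) likelihood
w, V = np.linalg.eigh(D2)
D = V @ np.diag(np.sqrt(w)) @ V.T                # symmetric square root of D^2
B = D @ S @ Sigma0 @ S.T @ D                     # Var(xi) from Theorem 4.6.6

sd = np.sqrt(np.diag(Sigma0))
vals = np.empty(4000)
for i in range(4000):
    eps = sd * rng.standard_normal(n)
    theta_hat = S @ (Psi.T @ theta_star + eps)   # qMLE built for homogeneous noise
    d = D @ (theta_hat - theta_star)
    vals[i] = d @ d                              # = 2 L(theta_hat, theta_star)

print(vals.mean(), np.trace(B))                  # the two values should be close
```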
4.7 Ridge regression, projection, and shrinkage
This section discusses the important situation when the number of predictors ψj and
hence the number of parameters p in the linear model Y = Ψ>θ∗ + ε is not small
relative to the sample size. Then the application of the least squares or maximum
likelihood approach meets serious problems. The first is numerical: the definition of
the LSE θ̃ involves the inversion of the p × p matrix ΨΨ> , and such an inversion
becomes a delicate task for large p . The other problem concerns the inference for the
estimated parameter θ∗ . The risk bound and the width of the confidence set are
proportional to the parameter dimension p , and thus, with large p , the inference
statements become almost uninformative. In particular, if p is of the order of the sample
size n , even consistency is not achievable. One faces a really critical situation. We already
know that the MLE is the efficient estimate in the class of all unbiased estimates. At
the same time, it is highly inefficient in overparametrized models. The only way out
of this situation is to sacrifice the unbiasedness property in favor of reducing the model
complexity: some procedures can be more efficient than the MLE even if they are biased. This
section discusses one way of resolving these problems by regularization or shrinkage. To
be more specific, for the rest of the section we consider the following setup. The observed
vector Y follows the model
Y = f + ε (4.25)
with a homogeneous error vector ε : IEε = 0 , Var(ε) = σ2IIn . Noise misspecification is
not considered in this section.
Furthermore, we assume that a basis, or a collection of basis vectors ψ1, . . . ,ψp , is given
with p large. This allows for approximating the response vector f = IEY in the form
f = Ψ>θ∗ , or, equivalently,
f = θ∗1ψ1 + . . .+ θ∗pψp .
In many cases we will assume that the basis is already orthogonalized: ΨΨ> = IIp . The
model (4.25) can be rewritten as
Y = Ψ>θ∗ + ε, Var(ε) = σ2IIn .
The MLE or LSE of the parameter vector θ∗ for this model reads as

θ̃ = (ΨΨ>)−1ΨY , f̃ = Ψ>θ̃ = Ψ>(ΨΨ>)−1ΨY .
If the matrix ΨΨ> is degenerate or ill-conditioned, computing the MLE θ̃ meets serious
problems. Below we discuss how these problems can be treated.
4.7.1 Regularization and ridge regression
Let R be a positive symmetric p × p matrix. Then the sum ΨΨ> + R is positive
symmetric as well and can be inverted whatever the matrix Ψ is. This suggests to
replace (ΨΨ>)−1 by (ΨΨ> + R)−1 , leading to the regularized least squares estimate θ̃R
of the parameter vector θ and the corresponding response estimate f̃R :

θ̃R def= (ΨΨ> + R)−1ΨY , f̃R def= Ψ>(ΨΨ> + R)−1ΨY . (4.26)
Such a method is also called ridge regression. An example of choosing R is a multiple
of the identity matrix: R = α IIp , where α > 0 and IIp stands for the identity matrix
in IRp . This method is also called Tikhonov regularization, and it results in the
parameter estimate θ̃α and the response estimate f̃α :

θ̃α def= (ΨΨ> + α IIp)−1ΨY , f̃α def= Ψ>(ΨΨ> + α IIp)−1ΨY . (4.27)
A proper choice of the matrix R for the ridge regression method (4.26) or of the
parameter α for the Tikhonov regularization (4.27) is an important issue. Below we
discuss several approaches which lead to the estimate (4.26) with a specific choice of the
matrix R . The properties of the estimates θ̃R and f̃R will be studied in the context of
penalized likelihood estimation in the next section.
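For concreteness, the ridge estimate (4.26) and its Tikhonov special case (4.27) take one line each in NumPy. The sketch below uses arbitrary illustrative sizes and α , with decaying true coefficients as an assumption; it also shows that ridge shrinks the norm of the plain LSE.

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, alpha, sigma = 30, 25, 1.0, 0.5
Psi = rng.standard_normal((p, n))
theta_star = rng.standard_normal(p) / np.arange(1, p + 1)  # decaying coefficients
Y = Psi.T @ theta_star + sigma * rng.standard_normal(n)

# ridge with penalty matrix R = alpha * I, i.e. Tikhonov regularization, eq. (4.27)
R = alpha * np.eye(p)
theta_R = np.linalg.solve(Psi @ Psi.T + R, Psi @ Y)
# plain LSE for comparison; may be numerically fragile when p is close to n
theta_ls = np.linalg.solve(Psi @ Psi.T, Psi @ Y)
print(np.linalg.norm(theta_R), np.linalg.norm(theta_ls))   # ridge shrinks the norm
```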
4.7.2 Penalized likelihood. Bias and variance
The estimate (4.26) can be obtained in a natural way within the (quasi) ML approach
using penalized least squares. The classical unpenalized method is based on minimizing
the sum of squared residuals:

θ̃ = argmaxθ L(θ) = arginfθ ‖Y − Ψ>θ‖2

with L(θ) = −σ−2‖Y − Ψ>θ‖2/2 . (Here we omit the terms which do not depend on θ .)
Now we introduce an additional penalty into the objective function which penalizes
the complexity of the candidate vector θ , expressed by the value ‖Gθ‖2/2 for a
given symmetric matrix G . This choice of complexity measure implicitly assumes that
the vector θ ≡ 0 has the smallest complexity, equal to zero, and that the complexity
increases with the norm of Gθ . Define the penalized log-likelihood

LG(θ) def= L(θ) − ‖Gθ‖2/2
       = −(2σ2)−1‖Y − Ψ>θ‖2 − ‖Gθ‖2/2 − (n/2) log(2πσ2). (4.28)
The penalized ML problem reads as

θ̃G = argmaxθ LG(θ) = argminθ {(2σ2)−1‖Y − Ψ>θ‖2 + ‖Gθ‖2/2}.

A straightforward calculation leads to the expression (4.26) for θ̃G with R = σ2G2 :

θ̃G def= (ΨΨ> + σ2G2)−1ΨY . (4.29)
We see that θ̃G is again a linear estimate: θ̃G = SGY with SG = (ΨΨ> + σ2G2)−1Ψ .
The results of Section 4.4 explain that θ̃G in fact estimates the value θG defined by

θG = argmaxθ IELG(θ) = arginfθ IE{‖Y − Ψ>θ‖2 + σ2‖Gθ‖2} = (ΨΨ> + σ2G2)−1Ψf = SGf . (4.30)
In particular, if f = Ψ>θ∗ , then

θG = (ΨΨ> + σ2G2)−1ΨΨ>θ∗ (4.31)

and θG ≠ θ∗ unless G = 0 . In other words, the penalized MLE θ̃G is biased.
Exercise 4.7.1. Check that IEθ̃α = θα for θα = (ΨΨ> + α IIp)−1ΨΨ>θ∗ , and that the
bias ‖θα − θ∗‖ grows with the regularization parameter α .
The penalized MLE θ̃G leads to the response estimate f̃G = Ψ>θ̃G .
Exercise 4.7.2. Check that the penalized ML approach leads to the response estimate

f̃G = Ψ>θ̃G = Ψ>(ΨΨ> + σ2G2)−1ΨY = ΠGY

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ . Show that ΠG is a sub-projector in the sense that
‖ΠGu‖ ≤ ‖u‖ for any u ∈ IRn .
Exercise 4.7.3. Let Ψ be orthonormal: ΨΨ> = IIp . Then the penalized MLE θ̃G can
be represented as

θ̃G = (IIp + σ2G2)−1Z,

where Z = ΨY is the vector of empirical Fourier coefficients. Specify the result for the
case of a diagonal matrix G = diag(g1, . . . , gp) and describe the corresponding response
estimate f̃G .
The previous results indicate that introducing the penalization leads to some estimation
bias. One can ask about the benefit of using a penalized procedure. The next result
shows that penalization decreases the variance of estimation and thus makes the
procedure more stable.
Theorem 4.7.1. Let θ̃G be the penalized MLE from (4.29). Under noise homogeneity
Var(ε) = σ2 IIn , it holds IEθ̃G = θG , see (4.31), and

Var(θ̃G) = σ2SGS>G = σ2(ΨΨ> + σ2G2)−1ΨΨ>(ΨΨ> + σ2G2)−1.

In particular, Var(θ̃G) ≤ Var(θ̃) and Var(θ̃G) ≤ (σ−2ΨΨ> + G2)−1 . Moreover, the bias
‖θG − θ∗‖ monotonously increases in G2 , while the variance monotonously decreases
with the penalization G .
If ε ∼ N(0, σ2 IIn) , then θ̃G is also normal with mean θG and variance σ2SGS>G .
Proof. The first two moments of θ̃G are computed from θ̃G = SGY . Monotonicity of
the bias and variance of θ̃G is proved below in Exercise 4.7.6.
Exercise 4.7.4. Let Ψ be orthonormal: ΨΨ> = IIp . Describe Var(θ̃G) . Show that the
variance decreases with the penalization G in the sense that G1 ≥ G implies
Var(θ̃G1) ≤ Var(θ̃G) .
Exercise 4.7.5. Let ΨΨ> = IIp and let G = diag(g1, . . . , gp) be a diagonal matrix.
Compute the squared bias ‖θG − θ∗‖2 and show that it monotonously increases in each
gj for j = 1, . . . , p .
Exercise 4.7.6. Let G be a symmetric matrix and θ̃G the corresponding penalized
MLE. Show that the variance Var(θ̃G) decreases while the bias ‖θG − θ∗‖ increases in
G2 .
Hint: first reduce the situation to the case of an orthogonal design matrix Ψ with
ΨΨ> = Λ = diag(λ1, . . . , λp) by an orthonormal basis transformation. For ΨΨ> = Λ ,
show that for any vector w ∈ IRp and u = Λ1/2w , it holds

w>Var(θ̃G)w = σ2 u>(IIp + σ2Λ−1/2G2Λ−1/2)−2u ,

and this value decreases with G2 because IIp + σ2Λ−1/2G2Λ−1/2 increases. Show in a
similar way that

‖θG − θ∗‖2 = σ4‖(Λ + σ2G2)−1G2θ∗‖2 = σ2 u>B(IIp + B)−1Bu

with u = Λ1/2θ∗ and B = Λ−1/2G2Λ−1/2 . Show that the matrix B(IIp + B)−1B is
monotonously increasing in B , and thus in G2 , using diagonalization arguments and
the monotonicity of the function x2/(1 + x) in x ≥ 0 .
Putting together the results about the bias and the variance of θ̃G yields the statement
about the quadratic risk.

Theorem 4.7.2. Assume the model Y = Ψ>θ∗ + ε with Var(ε) = σ2 IIn . Then the
estimate θ̃G fulfills

IE‖θ̃G − θ∗‖2 = ‖θG − θ∗‖2 + σ2 tr(SGS>G).
This result is called the bias-variance decomposition. The choice of a proper regular-
ization is usually based on this decomposition: one selects a regularization from a given
class to provide the minimal possible risk. This approach is referred to as bias-variance
trade-off.
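The bias-variance decomposition of Theorem 4.7.2 can be verified by Monte Carlo. The NumPy sketch below uses an illustrative diagonal penalty G and arbitrary sizes; all constants are assumptions for the sake of the demonstration.

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 40, 5, 1.0
Psi = rng.standard_normal((p, n))
G = np.diag(np.arange(1.0, p + 1))           # an illustrative diagonal penalty
theta_star = rng.standard_normal(p)
f = Psi.T @ theta_star

S_G = np.linalg.solve(Psi @ Psi.T + sigma**2 * G @ G, Psi)  # S_G from (4.29)
theta_G = S_G @ f                            # biased target theta_G, eq. (4.30)
bias2 = np.sum((theta_G - theta_star) ** 2)
var = sigma**2 * np.trace(S_G @ S_G.T)

# empirical quadratic risk over repeated noise draws
emp = np.mean([np.sum((S_G @ (f + sigma * rng.standard_normal(n)) - theta_star) ** 2)
               for _ in range(4000)])
print(emp, bias2 + var)                      # both sides of Theorem 4.7.2
```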
4.7.3 Inference for the penalized MLE
Here we discuss some properties of the penalized MLE θ̃G . In particular, we focus on the
construction of confidence and concentration sets based on the penalized log-likelihood.
We know that the regularized estimate θ̃G is the empirical counterpart of the value θG ,
which solves the regularized deterministic problem (4.30). We also know that the key
results are expressed via the excess LG(θ̃G,θG) = supθ LG(θ) − LG(θG) . The next
result extends Theorem 4.6.1 to the penalized likelihood.
Theorem 4.7.3. Let LG(θ) be the penalized log-likelihood from (4.28). Then

2LG(θ̃G,θG) = (θ̃G − θG)>(σ−2ΨΨ> + G2)(θ̃G − θG) (4.32)
            = σ−2 ε>ΠG ε (4.33)

with ΠG = Ψ>(ΨΨ> + σ2G2)−1Ψ .
In general, the matrix ΠG is not a projector; hence σ−2 ε>ΠG ε is not χ2 -distributed
and the chi-squared result does not apply.
Exercise 4.7.7. Prove (4.32).
Hint: apply the Taylor expansion to LG(θ) at θ̃G . Use that ∇LG(θ̃G) = 0 and
−∇2LG(θ) ≡ σ−2ΨΨ> + G2 .
Exercise 4.7.8. Prove (4.33).
Hint: show that θ̃G − θG = SGε with SG = (ΨΨ> + σ2G2)−1Ψ .
Straightforward corollaries of Theorem 4.7.3 are the concentration and confidence
probabilities. Define the confidence set EG(z) for θG as

EG(z) def= {θ : LG(θ̃G,θ) ≤ z}.

The definition implies the following bound for the coverage probability:
IP(EG(z) ∌ θG) ≤ IP(LG(θ̃G,θG) > z) . Now the representation (4.33) for LG(θ̃G,θG)
reduces the problem to a deviation bound for a quadratic form. We apply the general
result of Section 9.
Theorem 4.7.4. Let LG(θ) be the penalized log-likelihood from (4.28), and let
ε ∼ N(0, σ2 IIn) . Then it holds with pG = tr(ΠG) and v2G = 2 tr(Π2G) that

IP(2LG(θ̃G,θG) > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).
Similarly one can state the concentration result. Define D2G = σ−2ΨΨ> + G2 . Then

2LG(θ̃G,θG) = ‖DG(θ̃G − θG)‖2 ,

and the result of Theorem 4.7.4 can be restated as the concentration bound

IP(‖DG(θ̃G − θG)‖2 > pG + (2vGx1/2) ∨ (6x)) ≤ exp(−x).

In other words, θ̃G concentrates on the set A(z,θG) = {θ : ‖DG(θ − θG)‖2 ≤ 2z} for
2z > pG .
4.7.4 Projection and shrinkage estimates
Consider a linear model Y = Ψ>θ∗ + ε in which the matrix Ψ is orthonormal in the
sense ΨΨ> = IIp . Then multiplication by Ψ maps this model into the sequence space
model Z = θ∗ + ξ , where Z = ΨY = (z1, . . . , zp)> is the vector of empirical Fourier
coefficients zj = ψ>j Y . The noise ξ = Ψε inherits the features of the original noise ε :
if ε is zero-mean and homogeneous, the same applies to ξ . The number of coefficients
p can be large or even infinite. To get a sensible estimate, one has to apply some
regularization method. The simplest one is called projection: one just keeps the first m
empirical coefficients z1, . . . , zm and drops the others. The corresponding parameter
estimate θ̃m reads as

θ̃m,j = zj for j ≤ m , and θ̃m,j = 0 otherwise.
The response vector f = IEY is estimated by Ψ>θ̃m , leading to the representation

f̃m = z1ψ1 + . . . + zmψm

with zj = ψ>j Y . In other words, f̃m is just the projection of the observed vector Y
onto the subspace Lm spanned by the first m basis vectors: Lm = ⟨ψ1, . . . ,ψm⟩ .
This explains the name of the method. Clearly one can study the properties of θ̃m
or f̃m using the methods of the previous sections. However, one question for this
approach is still open: a proper choice of m . The standard way of addressing this issue
is based on the analysis of the quadratic risk.
Consider first the prediction risk defined as R(f̃m) = IE‖f̃m − f‖2 . Below we focus
on the case of homogeneous noise with Var(ε) = σ2 IIn . An extension to colored
noise is possible. Recall that f̃m effectively estimates the vector fm = Πmf , where
Πm is the projector onto Lm ; see Section 4.3.3. Moreover, the quadratic risk R(f̃m)
can be decomposed as

R(f̃m) = ‖f − Πmf‖2 + σ2m = σ2m + ∑_{j=m+1}^p |θ∗j|2 .
Obviously, the squared bias ‖f − Πmf‖2 decreases with m , while the variance σ2m
grows linearly in m . Risk minimization leads to the so-called bias-variance trade-off:
one selects the m that minimizes the risk R(f̃m) over all possible m :

m∗ def= argmin_m R(f̃m) = argmin_m {‖f − Πmf‖2 + σ2m}.
Unfortunately this choice requires some information about the bias ‖f − Πmf‖ , which
depends on the unknown vector f . As this information is not available in typical
situations, the value m∗ is also called an oracle choice. A data-driven choice of m is
one of the central issues in nonparametric statistics.
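For illustration, the oracle m∗ can be computed when θ∗ is known. The deterministic NumPy sketch below assumes a polynomial decay of the coefficients and an illustrative noise level; both are arbitrary choices, not taken from the text.

```python
import numpy as np

p, sigma2 = 100, 0.01                       # sigma2 plays the role of sigma^2
theta_star = 1.0 / np.arange(1, p + 1) ** 1.5   # assumed smooth signal: decaying coefficients
s = theta_star ** 2
# tail[m-1] = sum_{j > m} theta*_j^2 , the squared bias of the projection estimate
tail = np.append(np.cumsum(s[::-1])[::-1][1:], 0.0)
m_grid = np.arange(1, p + 1)
risk = sigma2 * m_grid + tail               # R(f_m) = sigma^2 m + ||f - Pi_m f||^2
m_oracle = int(m_grid[np.argmin(risk)])
print(m_oracle, risk.min())
```

Increasing the noise level shifts the oracle towards smaller m , as the bias-variance trade-off suggests.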
The situation does not change if we consider the estimation risk IE‖θ̃m − θ∗‖2 . Indeed,
the basis orthogonality ΨΨ> = IIp implies for f = Ψ>θ∗

‖f̃m − f‖2 = ‖Ψ>θ̃m − Ψ>θ∗‖2 = ‖θ̃m − θ∗‖2 ,

and minimization of the estimation risk coincides with minimization of the prediction
risk.
A disadvantage of the projection method is that it either keeps an empirical coefficient
zj completely or discards it entirely. An extension of the projection method is called
shrinkage: one multiplies every empirical coefficient zj by a factor αj ∈ (0, 1) . This
leads to the shrinkage estimate θ̃α with

θ̃α,j = αj zj .

Here α stands for the vector of coefficients αj for j = 1, . . . , p . The projection method
is a special case of this shrinkage with αj equal to one or zero. Another popular choice
of the coefficients αj is given by

αj = (1 − j/m)β 1(j ≤ m) (4.34)
for some β > 0 and m ≤ p . This choice ensures that the coefficients αj smoothly
approach zero as j approaches the value m , and vanish for j > m . In this case,
the vector α is completely specified by the two parameters m and β . The projection
method corresponds to β = 0 . The design orthogonality ΨΨ> = IIp yields again that
the estimation risk IE‖θ̃α − θ∗‖2 coincides with the prediction risk IE‖f̃α − f‖2 .
Exercise 4.7.9. Let Var(ε) = σ2 IIn . Show that the risk R(f̃α) of the shrinkage
estimate f̃α fulfills

R(f̃α) def= IE‖f̃α − f‖2 = ∑_{j=1}^p |θ∗j|2 (1 − αj)2 + σ2 ∑_{j=1}^p α2j .

Specify the case of α = α(m,β) from (4.34). Evaluate the variance term σ2 ∑_j α2j .
Hint: approximate the sum over j by the integral ∫ (1 − x/m)^{2β}_+ dx .
The oracle choice is again defined by risk minimization:

α∗ def= argmin_α R(f̃α),

where the minimization is taken over the class of all considered coefficient vectors α .
One way of obtaining a shrinkage estimate in the sequence space model Z = θ∗ + ξ
is by using a roughness penalization. Let G be a symmetric matrix, and consider the
regularized estimate θ̃G from (4.29). The next result claims that if G is a diagonal
matrix, then θ̃G is a shrinkage estimate. Moreover, a general penalized MLE can be
represented as a shrinkage estimate after an orthogonal basis transformation.
Theorem 4.7.5. Let G be a diagonal matrix, G = diag(g1, . . . , gp) . The penalized MLE
θ̃G in the sequence space model Z = θ∗ + ξ with ξ ∼ N(0, σ2 IIp) coincides with the
shrinkage estimate θ̃α for αj = (1 + σ2g2j )−1 ≤ 1 . Moreover, a penalized MLE θ̃G for
a general matrix G can be reduced to a shrinkage estimate by a basis transformation in
the sequence space model.

Proof. The first statement for a diagonal matrix G follows from the representation
θ̃G = (IIp + σ2G2)−1Z . Next, let U be an orthogonal transform leading to the diagonal
representation G2 = U>D2U with D2 = diag(d21, . . . , d2p) . Then

U θ̃G = (IIp + σ2D2)−1UZ,

that is, U θ̃G is a shrinkage estimate in the transformed model UZ = Uθ∗ + Uξ .
In other words, roughness penalization results in some kind of shrinkage. Interestingly,
the inverse statement holds as well.
Exercise 4.7.10. Let θ̃α be a shrinkage estimate for a vector α = (αj) with αj ∈ (0, 1] .
Then there is a diagonal penalty matrix G such that θ̃α = θ̃G .
Hint: define the j th diagonal entry gj by the equation αj = (1 + σ2g2j )−1 .
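The equivalence between diagonal penalization and shrinkage stated in Theorem 4.7.5 and Exercise 4.7.10 can be verified directly. A minimal sketch with arbitrary values g1, . . . , gp and an arbitrary coefficient vector Z :

```python
import numpy as np

rng = np.random.default_rng(6)
p, sigma = 6, 0.3
g = np.arange(1.0, p + 1)                 # diagonal penalty G = diag(g1, ..., gp)
Z = rng.standard_normal(p)                # empirical Fourier coefficients Z = Psi Y

# penalized MLE in the sequence space model: (I + sigma^2 G^2)^{-1} Z
theta_pen = np.linalg.solve(np.eye(p) + sigma**2 * np.diag(g**2), Z)
# equivalent shrinkage weights alpha_j = (1 + sigma^2 g_j^2)^{-1}
alpha = 1.0 / (1.0 + sigma**2 * g**2)
theta_shr = alpha * Z
print(np.allclose(theta_pen, theta_shr))  # True
```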
4.7.5 Smoothness constraints and roughness penalty approach
Another way of reducing the complexity of the estimation procedure is based on
smoothness constraints. The notion of smoothness originates from regression estimation.
A non-linear regression function f is expanded using a Fourier or some other functional
basis, and θ∗ is the corresponding vector of coefficients. Smoothness properties of the
regression function imply a certain rate of decay of the corresponding Fourier coefficients:
the higher the frequency, the less information about the regression function is contained
in the related coefficient. This leads to the natural idea of replacing the original
optimization problem over the whole parameter space with a constrained optimization
over a subset of “smooth” parameter vectors. Here we consider one popular example,
Sobolev smoothness constraints, which effectively means that the s th derivative of the
function f has a bounded L2 -norm. A general Sobolev ball can be defined using a
diagonal matrix G :

BG(R) def= {θ : ‖Gθ‖ ≤ R}.

Now we consider the constrained ML problem:

θ̃G,R = argmaxθ∈BG(R) L(θ) = argminθ∈Θ: ‖Gθ‖≤R ‖Y − Ψ>θ‖2. (4.35)
The Lagrange multiplier method leads to the unconstrained problem

θ̃G,λ = argminθ {‖Y − Ψ>θ‖2 + λ‖Gθ‖2}.

A proper choice of λ ensures that the solution θ̃G,λ belongs to BG(R) and also solves
the problem (4.35). So the approach based on a Sobolev smoothness assumption leads
back to regularization and shrinkage.
4.8 Shrinkage in a linear inverse problem
This section extends the previous approaches to the situation with indirect observations.
More precisely, we focus on the model

Y = Af + ε, (4.36)

where A is a given linear operator (matrix) and f is the target of analysis. With an
obvious change of notation this problem can be put back into the general linear setup
Y = Ψ>θ + ε . The special focus is due to the facts that the target can be high-
dimensional or even functional and that the product A>A is usually ill-posed, so that
its inversion is a hard task. Below we consider separately the case when the spectral
representation for this problem is available and the general case.
4.8.1 Spectral cut-off and spectral penalization. Diagonal estimates
Suppose that the eigenvectors of the matrix A>A are available. This allows for reducing
the model to the spectral representation by an orthogonal change of the coordinate
system: Z = Λu + Λ1/2ξ with a diagonal matrix Λ = diag{λ1, . . . , λp} and a
homogeneous noise Var(ξ) = σ2 IIp ; see Section 4.2.4. Below we assume without loss
of generality that the eigenvalues λj are ordered and decrease with j . This spectral
representation means that one observes empirical Fourier coefficients zj described by
the equations zj = λj uj + λ1/2j ξj for j = 1, . . . , p . The LSE or qMLE estimate of the
spectral parameter u is given by

ũ = Λ−1Z = (λ−11 z1, . . . , λ−1p zp)> .

Exercise 4.8.1. Consider the spectral representation Z = Λu + Λ1/2ξ . Check that the
LSE ũ indeed reads as ũ = Λ−1Z .

If the dimension p of the model is high or, specifically, if the spectral values λj
rapidly go to zero, it might be useful to track only the first few coefficients u1, . . . , um
and to set all the remaining ones to zero. The corresponding estimate
ũm = (ũm,1, . . . , ũm,p)> reads as

ũm,j def= λ−1j zj for j ≤ m , and ũm,j def= 0 otherwise.
It is usually referred to as a spectral cut-off estimate.
Exercise 4.8.2. Consider the linear model Y = Af + ε . Let U be an orthogonal
transform in IRp providing UA>AU> = Λ with a diagonal matrix Λ , leading to the
spectral representation for Z = UA>Y . Write the corresponding spectral cut-off
estimate f̃m for the original vector f . Show that computing this estimate only requires
knowing the first m eigenvalues and eigenvectors of the matrix A>A .
Similarly to the direct case, the spectral cut-off can be extended to spectral shrinkage:
one multiplies every empirical coefficient zj by a factor αj ∈ (0, 1) . This leads to the
spectral shrinkage estimate ũα with ũα,j = αj λ−1j zj . Here α stands for the vector
of coefficients αj for j = 1, . . . , p . The spectral cut-off method is a special case of this
shrinkage with αj equal to one or zero.
Exercise 4.8.3. Specify the spectral shrinkage ũα with a given vector α for the
situation of Exercise 4.8.2.
The spectral cut-off method can be described as follows. Let ψ1,ψ2, . . . be the
intrinsic orthonormal basis of the problem, composed of the standardized eigenvectors
of A>A and leading to the spectral representation Z = Λu + Λ1/2ξ with the target
vector u . In terms of the original target f , one is looking for a solution or an estimate
of the form f = ∑_j uj ψj . The design orthogonality allows to estimate every coefficient
uj independently of the others using the empirical Fourier coefficient zj = ψ>j Y .
Namely, ũj = λ−1j ψ>j Y = λ−1j zj . The LSE procedure tries to recover f as the full sum
f̃ = ∑_j ũj ψj . The projection method suggests to cut this sum at the index m :
f̃m = ∑_{j≤m} ũj ψj , while the shrinkage procedure is based on downweighting the
empirical coefficients ũj : f̃α = ∑_j αj ũj ψj .
Next we study the risk of the shrinkage method. Orthonormality of the basis {ψj}
allows to represent the loss as ‖ũα − u∗‖2 = ‖f̃α − f‖2 . Under noise homogeneity one
obtains the following result.

Theorem 4.8.1. Let Z = Λu∗ + Λ1/2ξ with Var(ξ) = σ2 IIp . It holds for the shrinkage
estimate ũα

R(ũα) def= IE‖ũα − u∗‖2 = ∑_{j=1}^p |αj − 1|2 |u∗j|2 + σ2 ∑_{j=1}^p α2j λ−1j .

Proof. The empirical Fourier coefficients zj are uncorrelated, with IEzj = λj u∗j and
Var(zj) = σ2λj . This implies

IE‖ũα − u∗‖2 = ∑_{j=1}^p IE|αj λ−1j zj − u∗j|2 = ∑_{j=1}^p {|αj − 1|2 |u∗j|2 + α2j σ2 λ−1j}

as required.
Risk minimization leads to the oracle choice of the vector α :

α∗ = argmin_α R(ũα) ,

where the minimum is taken over the set of all admissible vectors α .
A similar analysis can be done for the spectral cut-off method.

Exercise 4.8.4. Show that the risk of the spectral cut-off estimate ũm fulfills

R(ũm) = σ2 ∑_{j=1}^m λ−1j + ∑_{j=m+1}^p |u∗j|2 .

Specify the choice of the oracle cut-off index m∗ .
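A sketch of the oracle cut-off computation for this risk under assumed polynomial decay of the eigenvalues λj and of the true spectral coefficients u∗j (all numerical values are illustrative only). Note how, unlike in the direct case, the variance term grows with the inverse eigenvalues λ−1j , so the oracle cuts off much earlier when the spectrum decays fast.

```python
import numpy as np

p, sigma = 50, 0.05
lam = 1.0 / np.arange(1, p + 1) ** 2             # assumed decaying eigenvalues of A^T A
u_star = 1.0 / np.arange(1, p + 1)               # assumed true spectral coefficients
var_term = sigma**2 * np.cumsum(1.0 / lam)       # sigma^2 * sum_{j<=m} 1/lambda_j
s = u_star ** 2
bias_term = np.append(np.cumsum(s[::-1])[::-1][1:], 0.0)  # sum_{j>m} |u*_j|^2
risk = var_term + bias_term                      # R(u_m) from Exercise 4.8.4
m_oracle = int(np.argmin(risk)) + 1
print(m_oracle, risk.min())
```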
4.8.2 Galerkin method
A general problem with the spectral shrinkage approach is that it requires precise
knowledge of the intrinsic basis ψ1,ψ2, . . . , or equivalently of the eigenvalue
decomposition of A leading to the spectral representation. After this basis is fixed, one
can apply the projection or shrinkage method using the corresponding Fourier
coefficients. In some situations this basis is hardly available or difficult to compute. A
possible way out of this problem is to take some other orthogonal basis φ1,φ2, . . . which
is tractable and convenient but does not lead to the spectral representation of the model.
The Galerkin method is based on projecting the original high-dimensional problem to a
lower dimensional problem in terms of the new basis {φj} . Namely, without loss of
generality suppose that the target function f can be decomposed as

f = ∑_j uj φj .
This can be achieved, e.g., if f belongs to some Hilbert space and {φj} is an
orthonormal basis in this space. Now we cut this sum and replace the exact
decomposition by a finite approximation

f ≈ fm = ∑_{j≤m} uj φj = Φ>m um ,

where um = (u1, . . . , um)> and Φ>m = (φ1, . . . ,φm) is the matrix with columns
φ1, . . . ,φm . Now we plug this decomposition into the original equation Y = Af + ε .
This leads to the linear model Y = AΦ>m um + ε = Ψ>m um + ε with Ψm = ΦmA> .
The corresponding (quasi) MLE reads as

ũm = (ΨmΨ>m)−1ΨmY .
Note that computing this estimate only requires evaluating the action of the operator
A on the basis functions φ1, . . . ,φm and on the data Y . With this estimate ũm of the
vector um , one obtains the response estimate f̃m of the form

f̃m = Φ>m ũm = ũ1φ1 + . . . + ũmφm .

The properties of this estimate can be studied in the same way as for a general qMLE in
a linear model: the true data distribution follows (4.36), while we use the approximating
model Y = AΦ>m um + ε with ε ∼ N(0, σ2 IIn) for building the quasi likelihood.
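The Galerkin recipe can be sketched numerically. In the NumPy example below, A is a toy integration-type operator, the cosine vectors {φj} are a convenient basis that is deliberately not the eigenbasis of A>A , and the estimate is ũm = (ΨmΨ>m)−1ΨmY with Ψm = ΦmA> ; every specific choice (the operator, the basis, the sizes, the signal sin(2πt) ) is an assumption for illustration.

```python
import numpy as np

rng = np.random.default_rng(8)
n, m, sigma = 60, 6, 0.01
t = np.arange(1, n + 1) / n
A = np.tril(np.ones((n, n))) / n                 # toy integration-type operator
f = np.sin(2 * np.pi * t)                        # unknown response to recover
Y = A @ f + sigma * rng.standard_normal(n)

# convenient cosine basis phi_0, ..., phi_{m-1} (not the eigenbasis of A^T A)
Phi = np.array([np.cos(np.pi * k * t) for k in range(m)]) * np.sqrt(2.0 / n)
Phi[0] /= np.sqrt(2.0)
Psi_m = Phi @ A.T                                # Psi_m = Phi_m A^T
u_hat = np.linalg.solve(Psi_m @ Psi_m.T, Psi_m @ Y)
f_hat = Phi.T @ u_hat                            # Galerkin estimate of f
rel_err = np.linalg.norm(f_hat - f) / np.linalg.norm(f)
print(rel_err)
```

Only the action of A on the m basis vectors is needed; no eigendecomposition of A>A is computed.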
A further extension of the qMLE approach concerns the case when the operator A is
not precisely known. Instead, an approximation or an estimate Â is available. The
pragmatic way of tackling this problem is to use the model Y = Âfm + ε for building
the quasi likelihood. The use of the Galerkin method is quite natural in this situation
because the spectral representation for Â will not necessarily result in a similar
representation for the true operator A .
4.9 Semiparametric estimation
This section discusses the situation when the target of estimation does not coincide with
the parameter vector. This problem is usually referred to as semiparametric estimation.
One typical example is the problem of estimating a part of the parameter vector. More
generally one can try to estimate a given function/functional of the unknown parameter.
We focus here on linear modeling, that is, both the considered model and the considered
mapping of the parameter space to the target space are linear. For ease of presentation
we assume everywhere homogeneous noise with Var(ε) = σ2 IIn .
4.9.1 (θ,η) - and υ -setup
This section presents two equivalent descriptions of the semiparametric problem. The
first one assumes that the total parameter vector can be decomposed into the target
parameter θ and the nuisance parameter η . The second one operates with the total
parameter υ and the target θ is a linear mapping of υ .
We start with the (θ,η) -setup. Let the response Y be modeled in dependence of two
sets of factors: {ψj , j = 1, . . . , p} and {φm,m = 1, . . . , p1} . We are mostly interested
in understanding the impact of the first set {ψj} but we cannot ignore the influence of
the {φm} ’s. Otherwise the model would be incomplete. This situation can be described
by the linear model
Y = Ψ>θ∗ + Φ>η∗ + ε, (4.37)
where Ψ is the p× n matrix with the columns ψj , while Φ is the p1 × n -matrix with
the columns φm . We primarily aim at recovering the vector θ∗ , while the coefficients
η∗ are of secondary importance. The corresponding (quasi) log-likelihood reads as
L(θ,η) = −(2σ2)−1‖Y − Ψ>θ − Φ>η‖2 +R,
where R denotes the remainder term which does not depend on the parameters θ,η .
The more general υ -setup considers a general linear model
Y = Υ>υ∗ + ε, (4.38)
where Υ is the p∗ × n matrix of p∗ factors, and the target of estimation is a linear mapping θ∗ = Pυ∗ for a given operator P from IRp∗ to IRp . Obviously the (θ,η) -setup is a special case of the υ -setup. Conversely, a general υ -setup can be reduced back to the (θ,η) -setup by a change of variables.
Exercise 4.9.1. Consider the sequence space model Y = υ∗ + ξ in IRp and let the
target of estimation be the sum of the coefficients υ∗1 + . . . + υ∗p . Describe the υ -setup
for the problem. Reduce to (θ,η) -setup by an orthogonal change of the basis.
In the υ -setup, the (quasi) log-likelihood reads as
L(υ) = −(2σ2)−1‖Y − Υ>υ‖2 +R,
where R is the remainder which does not depend on υ . It implies quadraticity of the log-likelihood L(υ) with the full-dimensional matrix 𝒟2 = −∇2L(υ) and the gradient ∇L(υ∗) given by

𝒟2 = σ−2ΥΥ>,    ∇L(υ∗) = σ−2Υε.
Exercise 4.9.2. Show that for the model (4.37) with the stacked design Υ = (Ψ> Φ>)> it holds

𝒟2 = σ−2ΥΥ> = σ−2 ( ΨΨ>   ΨΦ>
                      ΦΨ>   ΦΦ> ),    ∇L(υ∗) = σ−2Υε = σ−2 ( Ψε
                                                              Φε ).
4.9.2 Orthogonality and product structure
Consider the model (4.37) under the orthogonality condition ΨΦ> = 0 . This condition
effectively means that the factors of interest {ψj} are orthogonal to the nuisance factors
{φm} . An important feature of this orthogonal case is that the model has the product
structure leading to the additive form of the log-likelihood. Consider the partial θ -model
Y = Ψ>θ + ε with the (quasi) log-likelihood

L(θ) = −(2σ2)−1‖Y − Ψ>θ‖2 +R.

Similarly, L1(η) = −(2σ2)−1‖Y − Φ>η‖2 +R1 denotes the log-likelihood in the partial η -model Y = Φ>η + ε .
Theorem 4.9.1. Assume the condition ΨΦ> = 0 . Then

L(θ,η) = L(θ) + L1(η) +R(Y ) (4.39)

where R(Y ) is independent of θ and η . This implies the block diagonal structure of the full-dimensional matrix 𝒟2 = σ−2ΥΥ> :

𝒟2 = σ−2 ( ΨΨ>    0
             0    ΦΦ> ) = ( D2   0
                             0   H2 ),

with D2 = σ−2ΨΨ> , H2 = σ−2ΦΦ> . Moreover, for any υ = (θ,η) ,

∇L(υ) = ( ∇L(θ)
           ∇L1(η) ).
Now we demonstrate how the general case can be reduced to the orthogonal one by a linear transformation of the nuisance parameter. Let C be a p × p1 matrix. Define η̆ = η + C>θ . Then the model equation Y = Ψ>θ + Φ>η + ε can be rewritten as

Y = Ψ>θ + Φ>(η̆ − C>θ) + ε = (Ψ − CΦ)>θ + Φ>η̆ + ε.

Now we select C to ensure the orthogonality. This leads to the equation

(Ψ − CΦ)Φ> = 0,

that is, C = ΨΦ>(ΦΦ>)−1 . So the original model can be rewritten as

Y = Ψ̆>θ + Φ>η̆ + ε,    Ψ̆ = Ψ − CΦ = Ψ(IIn −Πη), (4.40)

where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the nuisance factors {φm} . This construction has a natural interpretation: correcting the θ -factors ψ1, . . . ,ψp by removing their interaction with the nuisance factors φ1, . . . ,φp1 reduces the general case to the orthogonal one. We summarize:
Theorem 4.9.2. The linear model (4.37) can be represented in the orthogonal form

Y = Ψ̆>θ + Φ>η̆ + ε,

where Ψ̆ from (4.40) satisfies Ψ̆Φ> = 0 and η̆ = η + C>θ for C = ΨΦ>(ΦΦ>)−1 . Moreover, it holds for υ = (θ,η)

L(υ) = L(θ) + L1(η̆) +R(Y ) (4.41)

with

L(θ) = −(2σ2)−1‖Y − Ψ̆>θ‖2 +R,
L1(η̆) = −(2σ2)−1‖Y − Φ>η̆‖2 +R1.

Exercise 4.9.3. Show that for C = ΨΦ>(ΦΦ>)−1

∇L(θ) = ∇θL(υ)− C∇ηL(υ).
Exercise 4.9.4. Show that the remainder term R(Y ) in the decomposition (4.41) is the same as in the orthogonal case (4.39).

Exercise 4.9.5. Show that Ψ̆Ψ̆> ≤ ΨΨ> , with Ψ̆Ψ̆> ≠ ΨΨ> if ΨΦ> ≠ 0 .
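The orthogonalization (4.40) is easy to check numerically. The following sketch (not from the book; it uses a randomly simulated design and numpy, with illustrative variable names) verifies the orthogonality Ψ̆Φ> = 0 and the matrix ordering of Exercise 4.9.5:

```python
# Sketch (not from the book): numerical check of the orthogonalization
# construction (4.40) on a small simulated design. As in the text,
# Psi is the p x n matrix of target factors, Phi the p1 x n nuisance factors.
import numpy as np

rng = np.random.default_rng(0)
p, p1, n = 3, 4, 50
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))

# Projector on the span of the nuisance factors: Pi_eta = Phi' (Phi Phi')^{-1} Phi
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)

# Corrected factors Psi_breve = Psi (I - Pi_eta); cf. (4.40)
Psi_b = Psi @ (np.eye(n) - Pi_eta)

# Orthogonality: Psi_breve Phi' = 0 up to rounding
print(np.abs(Psi_b @ Phi.T).max())   # ~ 0

# Psi_breve Psi_breve' <= Psi Psi': the difference is positive semi-definite
diff = Psi @ Psi.T - Psi_b @ Psi_b.T
print(np.linalg.eigvalsh(diff).min() >= -1e-8)   # True
```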
4.9.3 Partial estimation
This section explains the important notion of partial estimation which is quite natural
and transparent in the (θ,η) -setup. Let some value η◦ of the nuisance parameter be
fixed. A particular case of this sort is just ignoring the factors {φm} corresponding to
the nuisance component, that is, one uses η◦ ≡ 0 . This approach is reasonable in certain situations, e.g. in the context of the projection method or a spectral cut-off.
Define the estimate θ(η◦) by partial optimization of the joint log-likelihood L(θ,η◦) w.r.t. the first parameter θ :

θ(η◦) = argmaxθ L(θ,η◦).

Obviously θ(η◦) is the MLE in the residual model Y − Φ>η◦ = Ψ>θ∗ + ε :

θ(η◦) = (ΨΨ>)−1Ψ(Y − Φ>η◦).
This allows for describing the properties of the partial estimate θ(η◦) similarly to the
usual parametric situation.
Theorem 4.9.3. Consider the model (4.37). Then the partial estimate θ(η◦) fulfills

IEθ(η◦) = θ∗ + (ΨΨ>)−1ΨΦ>(η∗ − η◦),    Var{θ(η◦)} = σ2(ΨΨ>)−1.

In words, θ(η◦) has the same variance as the MLE in the partial model Y = Ψ>θ∗ + ε but it is biased if ΨΦ>(η∗ − η◦) ≠ 0 . The ideal situation corresponds to the case when η◦ = η∗ . Then θ(η∗) is the MLE in the correctly specified θ -model: with Y (η∗) def= Y − Φ>η∗ ,

Y (η∗) = Ψ>θ∗ + ε.
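The bias formula of Theorem 4.9.3 can be checked on noise-free data, where the estimate equals its expectation. A minimal sketch with a randomly simulated design (illustrative only, not from the book):

```python
# Sketch (not from the book): check the bias formula of Theorem 4.9.3
# on noise-free data, where IE theta(eta°) can be read off exactly.
import numpy as np

rng = np.random.default_rng(1)
p, p1, n = 2, 3, 40
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
theta_s = rng.standard_normal(p)     # theta*
eta_s = rng.standard_normal(p1)      # eta*

Y_mean = Psi.T @ theta_s + Phi.T @ eta_s   # IE Y (epsilon set to zero)
eta0 = np.zeros(p1)                        # fixed nuisance value eta°

# Partial estimate theta(eta°) = (Psi Psi')^{-1} Psi (Y - Phi' eta°)
G = Psi @ Psi.T
theta_part = np.linalg.solve(G, Psi @ (Y_mean - Phi.T @ eta0))

# Theorem 4.9.3: IE theta(eta°) = theta* + (Psi Psi')^{-1} Psi Phi' (eta* - eta°)
bias = np.linalg.solve(G, Psi @ Phi.T @ (eta_s - eta0))
print(np.allclose(theta_part, theta_s + bias))   # True
```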
An interesting and natural question concerns the legitimacy of the partial estimation method: under which conditions is it justified and produces no estimation bias? The answer is given by Theorem 4.9.1: the orthogonality condition ΨΦ> = 0 ensures the desired feature because of the decomposition (4.39).
Theorem 4.9.4. Assume orthogonality ΨΦ> = 0 . Then the partial estimate θ(η◦) does not depend on the nuisance value η◦ used:

θ = θ(η◦) = θ(η∗) = (ΨΨ>)−1ΨY .

In particular, one can ignore the nuisance parameter and estimate θ∗ from the partial incomplete model Y = Ψ>θ∗ + ε .
Exercise 4.9.6. Check that the partial derivative ∂L(θ,η)/∂θ does not depend on η under the orthogonality condition.
The partial estimation can also be applied to the nuisance parameter η by inverting the roles of θ and η . Namely, given a fixed value θ◦ , one can optimize the joint log-likelihood L(θ◦,η) w.r.t. the second argument η , leading to the estimate

η(θ◦) def= argmaxη L(θ◦,η).

In the orthogonal situation the initial point θ◦ is not important and one can use the partial incomplete model Y = Φ>η∗ + ε .
4.9.4 Profile estimation
This section discusses one general profile likelihood method of estimating the target pa-
rameter θ in the semiparametric situation. Later we show its optimality and R-efficiency.
The method suggests to first estimate the entire parameter vector υ by using the (quasi)
ML method. Then the operator P is applied to the obtained estimate υ to produce the
estimate θ . One can describe this method as
υ = argmaxυ L(υ),    θ = P υ. (4.42)

The first step here is the usual LS estimation of υ∗ in the linear model (4.38):

υ = argminυ ‖Y − Υ>υ‖2 = (ΥΥ>)−1ΥY .

The estimate θ is obtained by applying P to υ :

θ = P υ = P (ΥΥ>)−1ΥY = SY (4.43)

with S = P (ΥΥ>)−1Υ . The properties of this estimate can be studied using the decomposition Y = f + ε with f = IEY ; cf. Section 4.4. In particular, it holds

IEθ = Sf ,    Var(θ) = S Var(ε)S>. (4.44)
If the noise ε is homogeneous with Var(ε) = σ2IIn , then

Var(θ) = σ2SS> = σ2P (ΥΥ>)−1P>. (4.45)

The next theorem summarizes our findings.

Theorem 4.9.5. Consider the model (4.38) with homogeneous errors Var(ε) = σ2IIn . The profile MLE θ follows (4.43). Its mean and variance are given by (4.44) and (4.45).
The profile MLE is usually written in the (θ,η) -setup. Let υ = (θ,η) . Then the target estimate θ is obtained by projecting the MLE (θ,η) onto the θ -coordinates. This procedure can be formalized as

θ = argmaxθ maxη L(θ,η).

Another way of describing the profile MLE is based on the partial optimization considered in the previous section. Define for each θ the value L(θ) by optimizing the log-likelihood L(υ) under the condition Pυ = θ :

L(θ) def= supυ: Pυ=θ L(υ) = supη L(θ,η). (4.46)

Then θ is defined by maximizing the profile fit L(θ) :

θ def= argmaxθ L(θ). (4.47)
Exercise 4.9.7. Check that (4.42) and (4.47) lead to the same estimate θ .
We use for the function L(θ) obtained by partial optimization (4.46) the same notation as for the function obtained by the orthogonal decomposition (4.41) in Section 4.9.2. Later we show that these two functions indeed coincide. This helps in understanding the structure of the profile estimate θ .
Consider first the orthogonal case ΨΦ> = 0 . This assumption greatly simplifies the study. In particular, the result of Theorem 4.9.4 for partial estimation obviously extends to the profile method in view of the product structure (4.39): when estimating the parameter θ , one can ignore the nuisance parameter η and proceed as if the partial model Y = Ψ>θ∗ + ε were correct. Theorem 4.9.1 implies:
Theorem 4.9.6. Assume that ΨΦ> = 0 in the model (4.37). Then the profile MLE θ from (4.47) coincides with the MLE in the partial model Y = Ψ>θ∗ + ε :

θ = argmaxθ L(θ) = argminθ ‖Y − Ψ>θ‖2 = (ΨΨ>)−1ΨY .

It holds IEθ = θ∗ and

θ − θ∗ = D−2ζ = D−1ξ

with D2 = σ−2ΨΨ> , ζ = σ−2Ψε , and ξ = D−1ζ . Finally, L(θ) from (4.46) fulfills

2{L(θ)− L(θ∗)} = ‖D(θ − θ∗)‖2 = ζ>D−2ζ = ‖ξ‖2. (4.48)
The general case can be reduced to the orthogonal one by the construction from Theorem 4.9.2. Let

Ψ̆ = Ψ − ΨΠη = Ψ − ΨΦ>(ΦΦ>)−1Φ

be the corrected Ψ -factors after removing their interactions with the Φ -factors.

Theorem 4.9.7. Consider the model (4.37), and let the matrix D̆2 = σ−2Ψ̆Ψ̆> be non-degenerate. Then the profile MLE θ reads as

θ = argminθ ‖Y − Ψ̆>θ‖2 = (Ψ̆Ψ̆>)−1Ψ̆Y . (4.49)

It holds IEθ = θ∗ and

θ − θ∗ = (Ψ̆Ψ̆>)−1Ψ̆ε = D̆−2ζ̆ = D̆−1ξ̆ (4.50)

with D̆2 = σ−2Ψ̆Ψ̆> , ζ̆ = σ−2Ψ̆ε , and ξ̆ = D̆−1ζ̆ . Finally, L(θ) from (4.46) fulfills

2{L(θ)− L(θ∗)} = ‖D̆(θ − θ∗)‖2 = ζ̆>D̆−2ζ̆ = ‖ξ̆‖2. (4.51)
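The identity behind (4.49), namely that the θ -block of the full least squares solution equals the estimate built from the corrected factors, can be verified numerically. A sketch with simulated data (not part of the original text):

```python
# Sketch (not from the book): check that the profile MLE, i.e. the
# theta-block of the full least squares estimate, coincides with the
# corrected-design formula (4.49).
import numpy as np

rng = np.random.default_rng(2)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Y = rng.standard_normal(n)           # arbitrary data vector

# Full LSE in the upsilon-setup with the stacked design Upsilon = (Psi; Phi)
Ups = np.vstack([Psi, Phi])          # p* x n
ups_hat = np.linalg.solve(Ups @ Ups.T, Ups @ Y)
theta_profile = ups_hat[:p]          # P projects on the theta-coordinates

# Corrected factors and formula (4.49)
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_breve = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)

print(np.allclose(theta_profile, theta_breve))   # True
```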
Finally we present the same result in terms of the original log-likelihood L(υ) .

Theorem 4.9.8. Write 𝒟2 = −∇2L(υ) for the model (4.37) in the block form

𝒟2 = ( D2   A
        A>   H2 ). (4.52)

Let D2 and H2 be invertible. Then D̆2 and ζ̆ in (4.50) can be represented as

D̆2 = D2 −AH−2A>,
ζ̆ = ∇θL(υ∗)−AH−2∇ηL(υ∗).

Proof. In view of Theorem 4.9.7, it suffices to check the formulas for D̆2 and ζ̆ . One has for Ψ̆ = Ψ(IIn −Πη) and A = σ−2ΨΦ>

D̆2 = σ−2Ψ̆Ψ̆> = σ−2Ψ(IIn −Πη)Ψ> = σ−2ΨΨ> − σ−2ΨΦ>(ΦΦ>)−1ΦΨ> = D2 −AH−2A>.

Similarly, in view of AH−2 = ΨΦ>(ΦΦ>)−1 , ∇θL(υ∗) = σ−2Ψε , and ∇ηL(υ∗) = σ−2Φε ,

ζ̆ = σ−2Ψ̆ε = σ−2Ψε− σ−2ΨΦ>(ΦΦ>)−1Φε = ∇θL(υ∗)−AH−2∇ηL(υ∗),

as required.
It is worth stressing again that the result of Theorems 4.9.6 through 4.9.8 is purely
geometrical. We only used the condition IEε = 0 in the model (4.37) and the quadratic
structure of the log-likelihood function L(υ) . The distribution of the vector ε does not
enter in the results and proofs. However, the representation (4.50) allows for straightfor-
ward analysis of the probabilistic properties of the estimate θ .
Theorem 4.9.9. Consider the model (4.37) and let Var(Y ) = Var(ε) = Σ0 . Then

Var(θ) = σ−4D̆−2Ψ̆Σ0Ψ̆>D̆−2,    Var(ξ̆) = σ−4D̆−1Ψ̆Σ0Ψ̆>D̆−1.

In particular, if Var(Y ) = σ2IIn , this implies that

Var(θ) = D̆−2,    Var(ξ̆) = IIp.

Exercise 4.9.8. Check the result of Theorem 4.9.9. Specialize this result to the orthogonal case ΨΦ> = 0 .
4.9.5 Semiparametric efficiency bound
The main goal of this section is to show that the profile method in semiparametric estimation leads to R-efficient procedures. Recall that the target of estimation is θ∗ = Pυ∗ for a given linear mapping P . The profile MLE θ is one natural candidate. The next result claims its optimality.

Theorem 4.9.10 (Gauss-Markov). Let Y follow Y = Υ>υ∗ + ε with homogeneous errors ε . Then the estimate θ of θ∗ = Pυ∗ from (4.43) is unbiased and

Var(θ) = σ2P (ΥΥ>)−1P>,

yielding

IE‖θ − θ∗‖2 = σ2 tr{P (ΥΥ>)−1P>}.

Moreover, this risk is minimal in the class of all unbiased linear estimates of θ∗ .
Proof. The statements about the properties of θ have already been proved. The lower bound can be proved by the same arguments as in the case of the MLE estimation in Section 4.4.3. We only outline the main steps. Let θ′ be any unbiased linear estimate of θ∗ . The idea is to show that the difference θ′ − θ is orthogonal to θ in the sense IE{(θ′ − θ)θ>} = 0 . This implies that the variance of θ′ is the sum of Var(θ) and Var(θ′ − θ) and is therefore larger than Var(θ) .

Let θ′ = BY for some matrix B . Then IEθ′ = BIEY = BΥ>υ∗ . The no-bias property yields the identity IEθ′ = θ∗ = Pυ∗ and thus

BΥ> − P = 0. (4.53)

Next, IEθ′ = IEθ = θ∗ and thus

IEθ′θ′> = θ∗θ∗> + Var(θ′),
IEθθ> = θ∗θ∗> + Var(θ).

Obviously θ′ − IEθ′ = Bε and θ − IEθ = Sε , yielding Var(θ) = σ2SS> and IE{Bε(Sε)>} = σ2BS> . So

IE{(θ′ − θ)θ>} = σ2(B − S)S>.

The identity (4.53) implies

(B − S)S> = {B − P (ΥΥ>)−1Υ}Υ>(ΥΥ>)−1P> = (BΥ> − P )(ΥΥ>)−1P> = 0,

and the result follows.
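The Gauss-Markov statement can be illustrated numerically by comparing the estimate SY with a competing unbiased linear estimate. The weighted least squares competitor below is an arbitrary illustrative choice, not taken from the book:

```python
# Sketch (not from the book): compare the estimate theta = S Y with another
# unbiased linear estimate theta' = B Y and check that the excess variance
# Var(theta') - Var(theta) is positive semi-definite, as Theorem 4.9.10 asserts.
import numpy as np

rng = np.random.default_rng(3)
p, p_star, n = 2, 5, 30
Ups = rng.standard_normal((p_star, n))
P = np.hstack([np.eye(p), np.zeros((p, p_star - p))])  # maps upsilon to theta

# Efficient weights: S = P (Ups Ups')^{-1} Ups
S = P @ np.linalg.solve(Ups @ Ups.T, Ups)

# A competing estimate: weighted LSE with arbitrary positive weights W.
# It is still unbiased, since B Ups' = P, i.e. condition (4.53) holds.
W = np.diag(rng.uniform(0.5, 2.0, size=n))
B = P @ np.linalg.solve(Ups @ W @ Ups.T, Ups @ W)
print(np.allclose(B @ Ups.T, P))     # True: the no-bias condition (4.53)

# For homogeneous noise Var(eps) = sigma^2 I, the variances are sigma^2 S S'
# and sigma^2 B B'; their difference must be positive semi-definite.
excess = B @ B.T - S @ S.T
print(np.linalg.eigvalsh(excess).min() >= -1e-10)   # True
```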
Now we specify the efficiency bound for the (θ,η) -setup (4.37). In this case P is
just the projector onto the θ -coordinates.
4.9.6 Inference for the profile likelihood approach
This section discusses the construction of confidence and concentration sets for the profile ML estimation. The key fact behind this construction is the chi-squared result, which extends without any change from the parametric to the semiparametric framework.
The definition of θ from (4.47) suggests to define a CS for θ∗ as a level set of the profile likelihood L(θ) = supυ: Pυ=θ L(υ) :

E(z) def= {θ : supθ′ L(θ′)− L(θ) ≤ z}.

This definition can be rewritten as

E(z) def= {θ : supυ L(υ)− supυ: Pυ=θ L(υ) ≤ z}.
It is obvious that the unconstrained maximum of the log-likelihood L(υ) w.r.t. υ is not smaller than the maximum under the constraint Pυ = θ . The point θ belongs to E(z) if the difference between these two values does not exceed z . As usual, the main question is the choice of a value z which ensures the prescribed coverage probability for θ∗ . This naturally leads to studying the deviation probability

IP( supυ L(υ)− supυ: Pυ=θ∗ L(υ) > z ).

The study of this quantity is especially simple in the orthogonal case, and the answer can be anticipated: the expression and the value are exactly the same as in the case without any nuisance parameter η ; the nuisance simply has no impact. In particular, the chi-squared result still holds.
In this section we follow the line and the notation of Section 4.9.4. In particular, we use the block notation (4.52) for the matrix 𝒟2 = −∇2L(υ) .

Theorem 4.9.11. Consider the model (4.37). Let the matrix D̆2 be non-degenerate. If ε ∼ N(0, σ2IIn) , then

2{L(θ)− L(θ∗)} ∼ χ2p , (4.54)

that is, 2{L(θ)− L(θ∗)} is chi-squared with p degrees of freedom.

Proof. The result is based on the representation (4.51) 2{L(θ)− L(θ∗)} = ‖ξ̆‖2 from Theorem 4.9.7. It remains to note that normality of ε implies normality of ξ̆ , and the moment conditions IEξ̆ = 0 , Var(ξ̆) = IIp imply (4.54).
This result means that the chi-squared result continues to hold in the general semiparametric framework as well. One possible explanation is as follows: it applies in the orthogonal case, and the general situation can be reduced to the orthogonal case by a change of coordinates which preserves the value of the maximum likelihood.

The statement (4.54) of Theorem 4.9.11 has an interesting geometric interpretation which is often used in analysis of variance. Consider the expansion

L(θ)− L(θ∗) = {L(θ)− L(θ∗,η∗)} − {L(θ∗)− L(θ∗,η∗)}.

The quantity L1 def= L(θ)− L(υ∗) coincides with the maximum likelihood excess of the full estimate υ from (4.42). Thus, 2L1 is chi-squared with p∗ degrees of freedom by the chi-squared result. Moreover, 2σ2L1 = ‖Πυε‖2 , where Πυ = Υ>(ΥΥ>)−1Υ is the projector on the linear subspace spanned by the joint collection of factors {ψj} and {φm} . Similarly, the quantity L2 def= L(θ∗)− L(θ∗,η∗) = supη L(θ∗,η)− L(θ∗,η∗) is the maximum likelihood excess in the partial η -model. Therefore, 2L2 is chi-squared with p1 degrees of freedom, and 2σ2L2 = ‖Πηε‖2 , where Πη = Φ>(ΦΦ>)−1Φ is the projector on the linear subspace spanned by the η -factors {φm} . Now we use the decomposition Πυ = Πη + (Πυ −Πη) , in which Πυ −Πη is again a projector, onto a subspace of dimension p . This explains the result (4.54): the difference of the two quantities is chi-squared with p = p∗ − p1 degrees of freedom. The above consideration leads to the following result.
Theorem 4.9.12. It holds for the model (4.37) with Πθ = Πυ −Πη

2L(θ)− 2L(θ∗) = σ−2(‖Πυε‖2 − ‖Πηε‖2) = σ−2‖Πθε‖2 = σ−2ε>Πθε. (4.55)

Exercise 4.9.9. Check the formula (4.55). Show that it implies (4.54).
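The projector identity (4.55) and the rank count behind (4.54) can be checked directly. A sketch with a simulated design (illustrative only, not from the book):

```python
# Sketch (not from the book): numerical check of the projector identity (4.55)
# behind the chi-squared result. Pi_theta = Pi_ups - Pi_eta is itself a
# projector of rank p = p* - p1, so eps' Pi_theta eps / sigma^2 is chi^2_p
# for Gaussian eps.
import numpy as np

rng = np.random.default_rng(4)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Ups = np.vstack([Psi, Phi])

Pi_ups = Ups.T @ np.linalg.solve(Ups @ Ups.T, Ups)
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Pi_theta = Pi_ups - Pi_eta

# Pi_theta is a projector ...
print(np.allclose(Pi_theta @ Pi_theta, Pi_theta))   # True
# ... of rank p = p* - p1 = 2
print(int(round(np.trace(Pi_theta))))               # 2

# For any eps, |Pi_ups eps|^2 - |Pi_eta eps|^2 = eps' Pi_theta eps
eps = rng.standard_normal(n)
lhs = eps @ Pi_ups @ eps - eps @ Pi_eta @ eps
print(np.isclose(lhs, eps @ Pi_theta @ eps))        # True
```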
4.9.7 Plug-in method
Although the profile MLE can be represented in closed form, its computation can be a hard task if the dimensionality p1 of the nuisance parameter is high. Here we discuss an approach which simplifies the computations but leads to a suboptimal solution.
We start with the approach called plug-in. It is based on the assumption that a pilot estimate η of the nuisance parameter η∗ is available. Then one obtains the estimate θ of the target θ∗ from the residuals Y − Φ>η .

This means that the residual vector Y̆ = Y − Φ>η is used as observations, and the estimate θ is defined as the best fit to these observations in the θ -model:

θ = argminθ ‖Y̆ − Ψ>θ‖2 = (ΨΨ>)−1ΨY̆ . (4.56)
A very particular case of the plug-in method is partial estimation from Section 4.9.3 with
η ≡ η◦ .
The plug-in method can be naturally described in the context of partial estimation via the representation θ = θ(η) , i.e. the partial estimate with the pilot η plugged in for the nuisance parameter.

Exercise 4.9.10. Check the identity θ = θ(η) for the plug-in method. Describe the plug-in estimate for η ≡ 0 .

The behavior of θ heavily depends upon the quality of the pilot η . A detailed study is complicated, and a closed form solution is only available in the special case of a linear pilot estimate. Let η = AY . Then (4.56) implies

θ = (ΨΨ>)−1Ψ(Y − Φ>AY ) = SY

with S = (ΨΨ>)−1Ψ(IIn − Φ>A) . This is a linear estimate whose properties can be studied in the usual way.
4.9.8 Two step procedure
The ideas of partial and plug-in estimation can be combined, yielding the so-called two step procedures. One starts with an initial guess θ◦ for the target θ∗ ; a very special choice is θ◦ ≡ 0 . This leads to the partial η -model Y (θ◦) = Φ>η + ε for the residuals Y (θ◦) = Y − Ψ>θ◦ . Next compute the partial MLE η(θ◦) = (ΦΦ>)−1ΦY (θ◦) in this model and use it as a pilot for the plug-in method: compute the residuals

Y̆ (θ◦) = Y − Φ>η(θ◦) = Y −ΠηY (θ◦)

with Πη = Φ>(ΦΦ>)−1Φ , and then estimate the target parameter θ by fitting Ψ>θ to the residuals Y̆ (θ◦) . This method results in the estimate

θ(θ◦) = (ΨΨ>)−1ΨY̆ (θ◦). (4.57)
A simple comparison with the formula (4.49) reveals that the pragmatic two step approach is sub-optimal: the resulting estimate does not coincide with the profile MLE θ unless we are in the orthogonal situation with ΨΠη = 0 . In particular, the estimate θ(θ◦) from (4.57) is biased.
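The sub-optimality of the two step approach is easy to see on noise-free data, where the profile MLE recovers θ∗ exactly while the two step estimate retains its bias. A sketch with simulated designs (not from the book):

```python
# Sketch (not from the book): on noise-free data the profile MLE recovers
# theta* exactly, while the two step estimate (4.57) started at theta° = 0
# is off unless the designs are orthogonal.
import numpy as np

rng = np.random.default_rng(5)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
theta_s = rng.standard_normal(p)
eta_s = rng.standard_normal(p1)
Y = Psi.T @ theta_s + Phi.T @ eta_s     # eps = 0: estimates equal their means

# Two step estimate with theta° = 0
Pi_eta = Phi.T @ np.linalg.solve(Phi @ Phi.T, Phi)
Y_resid = Y - Pi_eta @ Y                # residuals after the eta-step
theta_two = np.linalg.solve(Psi @ Psi.T, Psi @ Y_resid)

# Profile MLE via the corrected design, formula (4.49)
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_prof = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)

print(np.allclose(theta_prof, theta_s))   # True: the profile MLE is unbiased
print(np.allclose(theta_two, theta_s))    # False: the two step estimate is biased
```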
Exercise 4.9.11. Consider the orthogonal case with ΨΦ> = 0 . Show that the two step estimate θ(θ◦) coincides with the partial MLE θ = (ΨΨ>)−1ΨY .

Exercise 4.9.12. Compute the mean of θ(θ◦) . Show that there exists some θ∗ such that IE{θ(θ◦)} ≠ θ∗ unless the orthogonality condition ΨΦ> = 0 is fulfilled.

Exercise 4.9.13. Compute the variance of θ(θ◦) .
Hint: use that Var{Y (θ◦)} = Var(Y ) = σ2IIn . Derive that Var{Y̆ (θ◦)} = σ2(IIn −Πη) .

Exercise 4.9.14. Let Ψ be orthonormal, i.e. ΨΨ> = IIp . Show that Var{θ(θ◦)} = σ2(IIp − ΨΠηΨ>) .
4.9.9 Alternating method
The ideas of partial and two step estimation can be applied in an iterative way. One starts with some initial value θ◦ and sequentially performs the two steps of partial estimation. Set

η0 = η(θ◦) = argminη ‖Y − Ψ>θ◦ − Φ>η‖2 = (ΦΦ>)−1Φ(Y − Ψ>θ◦).

With this estimate fixed, compute θ1 = θ(η0) and continue in this way. Generically, with θk and ηk computed, one recomputes

θk+1 = θ(ηk) = (ΨΨ>)−1Ψ(Y − Φ>ηk), (4.58)
ηk+1 = η(θk+1) = (ΦΦ>)−1Φ(Y − Ψ>θk+1). (4.59)

The procedure is especially transparent if the partial design matrices Ψ and Φ are orthonormal: ΨΨ> = IIp , ΦΦ> = IIp1 . Then

θk+1 = Ψ(Y − Φ>ηk),
ηk+1 = Φ(Y − Ψ>θk+1).
In words, having an estimate θ of the parameter θ∗ , one computes the residuals Y̆ = Y − Ψ>θ and then builds the estimate η of the nuisance η∗ via the empirical coefficients ΦY̆ . Then this estimate η is used in a similar way to recompute the estimate of θ∗ , and so on.

It is worth noting that every doubled step of alternation improves the current value L(θk,ηk) . Indeed, θk+1 is defined by maximizing L(θ,ηk) , that is, L(θk+1,ηk) ≥ L(θk,ηk) . Similarly, L(θk+1,ηk+1) ≥ L(θk+1,ηk) , yielding

L(θk+1,ηk+1) ≥ L(θk,ηk). (4.60)
A very interesting question is whether the procedure (4.58), (4.59) converges and
whether it converges to the maximum likelihood solution. The answer is positive and in
the simplest orthogonal case the result is straightforward.
Exercise 4.9.15. Consider the orthogonal situation with ΨΦ> = 0 . Show that the above procedure stabilizes in one step with the solution from Theorem 4.9.4.
In the non-orthogonal case the situation is much more complicated. The idea is to show that the alternating procedure can be represented as a sequence of applications of a contracting linear operator to the data. The key observation behind the result is the following recurrent formula for Ψ>θk and Φ>ηk :

Ψ>θk+1 = Πθ(Y − Φ>ηk) = (Πθ −ΠθΠη)Y +ΠθΠηΨ>θk, (4.61)
Φ>ηk+1 = Πη(Y − Ψ>θk+1) = (Πη −ΠηΠθ)Y +ΠηΠθΦ>ηk, (4.62)

with Πθ = Ψ>(ΨΨ>)−1Ψ and Πη = Φ>(ΦΦ>)−1Φ .
Exercise 4.9.16. Show (4.61) and (4.62).
This representation explains necessary and sufficient conditions for convergence of
the alternating procedure. Namely, the spectral norm ‖ΠηΠθ‖∞ (the largest singular
value) of the product operator ΠηΠθ should be strictly less than one, and similarly for
ΠθΠη .
Exercise 4.9.17. Show that ‖ΠθΠη‖∞ = ‖ΠηΠθ‖∞ .
Theorem 4.9.13. Suppose that ‖ΠηΠθ‖∞ = λ < 1 . Then the alternating procedure converges geometrically, the limiting values θ and η are unique and fulfill

Ψ>θ = (IIn −ΠθΠη)−1(Πθ −ΠθΠη)Y ,
Φ>η = (IIn −ΠηΠθ)−1(Πη −ΠηΠθ)Y , (4.63)

and the limiting θ coincides with the profile MLE from (4.47).
Proof. The convergence will be discussed below. Here we comment on the identity between the limiting value θ and the profile MLE. A direct comparison of the formulas for these two estimates can be a hard task. Instead we use the monotonicity property (4.60). By definition, the profile solution (θ,η) maximizes L(θ,η) globally. If we start the procedure at this solution, the value L(θ,η) cannot decrease at any step. By uniqueness of the limit, the procedure stabilizes with θk and ηk equal to the profile solution for every k .
Exercise 4.9.18. 1. Show by induction that

Φ>ηk+1 = Ak+1Y + (ΠηΠθ)kΦ>η1,

where the linear operator Ak fulfills A1 = 0 and

Ak+1 = Πη −ΠηΠθ +ΠηΠθAk = ∑i=0,...,k−1 (ΠηΠθ)i(Πη −ΠηΠθ).

2. Show that Ak converges to A = (IIn −ΠηΠθ)−1(Πη −ΠηΠθ) and evaluate ‖A−Ak‖∞ and ‖Φ>(ηk − η)‖ .
Hint: use that ‖Πη −ΠηΠθ‖∞ ≤ 1 and ‖(ΠηΠθ)i‖∞ ≤ ‖ΠηΠθ‖i∞ ≤ λi .
3. Prove (4.63) by inserting the limiting η in place of ηk and ηk+1 in (4.62).
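The iterations (4.58)-(4.59) and their convergence to the profile MLE can be checked numerically. A sketch with a simulated design (the iteration count is an arbitrary illustrative choice):

```python
# Sketch (not from the book): run the alternating iterations (4.58)-(4.59)
# and check convergence to the profile MLE under the contraction condition
# of Theorem 4.9.13.
import numpy as np

rng = np.random.default_rng(6)
p, p1, n = 2, 3, 30
Psi = rng.standard_normal((p, n))
Phi = rng.standard_normal((p1, n))
Y = rng.standard_normal(n)

G_psi = Psi @ Psi.T
G_phi = Phi @ Phi.T
Pi_theta = Psi.T @ np.linalg.solve(G_psi, Psi)
Pi_eta = Phi.T @ np.linalg.solve(G_phi, Phi)
lam = np.linalg.norm(Pi_eta @ Pi_theta, 2)   # spectral norm, must be < 1
print(lam < 1)                               # True

theta = np.zeros(p)                          # initial guess theta° = 0
for _ in range(500):
    eta = np.linalg.solve(G_phi, Phi @ (Y - Psi.T @ theta))    # (4.59)
    theta = np.linalg.solve(G_psi, Psi @ (Y - Phi.T @ eta))    # (4.58)

# Profile MLE via the corrected design (4.49), for comparison
Psi_b = Psi @ (np.eye(n) - Pi_eta)
theta_prof = np.linalg.solve(Psi_b @ Psi_b.T, Psi_b @ Y)
print(np.allclose(theta, theta_prof))        # True
```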
Chapter 5
Bayes estimation
This chapter discusses the Bayes approach to parameter estimation. This approach
differs essentially from classical parametric modeling also called the frequentist approach.
Classical frequentist modeling assumes that the observed data Y follow a distribution
law IP from a given parametric family (IPθ,θ ∈ Θ ⊂ IRp) , that is,
IP = IPθ∗ ∈ (IPθ).
Suppose that the family (IPθ) is dominated by a measure µ0 and denote by p(y |θ) the corresponding density:

p(y |θ) = (dIPθ/dµ0)(y).
The likelihood is defined as the density at the observed point and the maximum likelihood
approach tries to recover the true parameter θ∗ by maximizing this likelihood over
θ ∈ Θ .
In the Bayes approach, the paradigm changes: the true data distribution is not assumed to be specified by a single parameter value θ∗ . Instead, the unknown parameter is considered to be a random variable ϑ with a distribution π on the parameter space Θ called the prior. The measure IPθ can be viewed as the data distribution conditioned on the event that the randomly selected parameter is exactly θ . The target of analysis is not a single value θ∗ ; such a value is no longer defined. Instead one is interested in the posterior distribution of the random parameter ϑ given the observed data:

what is the distribution of ϑ given the prior π and the data Y ?

In other words, one aims at inferring on the distribution of ϑ on the basis of the observed data Y and the prior knowledge π . Below we distinguish between the random variable ϑ and its particular values θ . However, one often uses the same symbol θ for both objects.
5.1 Bayes formula
The Bayes modeling assumptions can be put together in the form
Y | θ ∼ p(· |θ),
ϑ ∼ π(·).
The first line has to be understood as the conditional distribution of Y given the par-
ticular value θ of the random parameter ϑ : Y | θ means Y | ϑ = θ . This section
formalizes and states the Bayes approach in a formal mathematical way. The answer is
given by the Bayes formula for the conditional distribution of ϑ given Y . First consider
the joint distribution IP of Y and ϑ . If B is a Borel set in the observation space and A is a measurable subset of Θ , then

IP (B ×A) = ∫A ( ∫B IPθ(dy) ) π(dθ).

The marginal or unconditional distribution of Y is given by averaging the joint probability w.r.t. the distribution of ϑ :

IP (B) = ∫Θ ∫B IPθ(dy)π(dθ) = ∫Θ IPθ(B)π(dθ).

The posterior (conditional) distribution of ϑ given the event {Y ∈ B} is defined as the ratio of the joint and marginal probabilities:

IP (ϑ ∈ A | Y ∈ B) = IP (B ×A) / IP (B).
Equivalently one can write this formula in terms of the related densities. In what follows we denote by the same letter π both the prior measure π and its density w.r.t. some dominating measure λ , e.g. the Lebesgue or the uniform measure on Θ . Then the joint measure IP has the density

p(y,θ) = p(y |θ)π(θ),

while the marginal density p(y) is the integral of the joint density w.r.t. the prior π :

p(y) = ∫Θ p(y,θ)λ(dθ) = ∫Θ p(y |θ)π(θ)λ(dθ).

Finally, the posterior (conditional) density p(θ |y) of ϑ given y is defined as the ratio of the joint density p(y,θ) and the marginal density p(y) :

p(θ |y) = p(y,θ) / p(y) = p(y |θ)π(θ) / ∫Θ p(y |θ)π(θ)λ(dθ).
Our definitions are summarized in the next lines:

Y | θ ∼ p(y |θ),
ϑ ∼ π(θ),
Y ∼ p(y) = ∫Θ p(y |θ)π(θ)λ(dθ),
ϑ | Y ∼ p(θ |Y ) = p(Y ,θ) / p(Y ) = p(Y |θ)π(θ) / ∫Θ p(Y |θ)π(θ)λ(dθ). (5.1)
Note that given the prior π and the observations Y , the posterior density p(θ |Y ) is
uniquely defined and can be viewed as the solution or target of analysis within the Bayes
approach. The expression (5.1) for the posterior density is called the Bayes formula.
The value p(y) of the marginal density of Y at y does not depend on the parameter
θ . Given the data Y , it is just a numeric normalizing factor. Often one skips this factor
writing
ϑ | Y ∝ p(Y |θ)π(θ).
Below we consider a couple of examples.
Example 5.1.1. Let Y = (Y1, . . . , Yn)> be a sequence of zeros and ones considered to be a realization of a Bernoulli experiment with n = 10 . Let also the underlying parameter θ be random and let it take the values 1/2 or 1 , each with probability 1/2 , that is,

π(1/2) = π(1) = 1/2.

Then the probability of observing y = (1, . . . , 1)> , i.e. ten ones, is

IP (y) = (1/2) IP (y | ϑ = 1/2) + (1/2) IP (y | ϑ = 1).

The first conditional probability is quite small, namely 2−10 , while the second one equals one. Therefore, IP (y) = (2−10 + 1)/2 . If we observed y = (1, . . . , 1)> , then the posterior probability of ϑ = 1 is

IP (ϑ = 1 | y) = IP (y | ϑ = 1)IP (ϑ = 1) / IP (y) = 1 / (2−10 + 1),

that is, it is quite close to one.
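The computation of Example 5.1.1 can be reproduced with the Bayes formula (5.1) applied to the two-point prior. An illustrative sketch, not from the book:

```python
# Sketch (not from the book): the two-point posterior of Example 5.1.1
# computed directly from the Bayes formula (5.1).
import numpy as np

n = 10
y = np.ones(n)                        # observed: ten ones
thetas = np.array([0.5, 1.0])         # support of the prior
prior = np.array([0.5, 0.5])

# Bernoulli likelihood p(y | theta) = prod_i theta^{y_i} (1 - theta)^{1 - y_i}
lik = np.array([(t ** y * (1 - t) ** (1 - y)).prod() for t in thetas])

# Posterior via the Bayes formula: likelihood times prior, renormalized
posterior = lik * prior / (lik * prior).sum()
print(np.isclose(posterior[1], 1 / (2 ** -10 + 1)))   # True: about 0.999
```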
Exercise 5.1.1. Consider the Bernoulli experiment Y = (Y1, . . . , Yn)> with n = 10 and let

π(1/2) = π(0.9) = 1/2.

Compute the posterior distribution of ϑ if we observe y = (y1, . . . , yn)> with

• y = (1, . . . , 1)> ;
• the number of successes S = y1 + . . .+ yn equal to 5 .

Show that the posterior density p(θ |y) depends on y only through the number of successes S .
5.2 Conjugated priors
Let (IPθ) be a dominated parametric family with the density function p(y |θ) . For a
prior π with the density π(θ) , the posterior density is proportional to p(y |θ)π(θ) . Now
consider the case when the prior π belongs to some other parametric family indexed by a
parameter α , that is, π(θ) = π(θ,α) . A very desirable situation is when the posterior density also belongs to this family. Then computing the posterior is equivalent to fixing the related parameter value α = α(Y ) . Such priors are usually called conjugated.
5.2.1 Examples
To illustrate this notion, we present some examples.
Example 5.2.1. [Gaussian Shift] Let Y ∼ N(θ, σ2) with σ known. Consider the prior ϑ ∼ N(τ, g2) , i.e. α = (τ, g2) . Then

p(y | θ)π(θ,α) ∝ exp{−(y − θ)2/(2σ2)− (θ − τ)2/(2g2)}.

The expression in the exponent is a quadratic form in θ , and expanding it around θ = τ yields

π(θ | y) ∝ exp{−(y − τ)2/(2σ2) + (y − τ)(θ − τ)/σ2 − 0.5(σ−2 + g−2)(θ − τ)2}.

This representation indicates that the conditional distribution of ϑ given y is normal. The parameters of the posterior will be computed in the next section.
Example 5.2.2. [Bernoulli] Let Y be a Bernoulli r.v. with IP (Y = 1) = θ . Then p(y | θ) = θ^y (1− θ)^{1−y} . Consider the family of priors of Beta type: π(θ,α) ∝ θ^a (1− θ)^b for α = (a, b) .
Example 5.2.3. [Exponential] Let
Example 5.2.4. [Poisson] Let
Example 5.2.5. [Volatility] Let
5.2.2 Exponential families and conjugated priors
All the previous examples can be systematically treated as special cases of an exponential family model equipped with a conjugated prior.
5.3 Linear Gaussian model and Gaussian priors
An interesting and important class of prior distributions is given by Gaussian priors.
A very nice and desirable feature of this class is that for a Gaussian model with a Gaussian prior the posterior distribution is also Gaussian.
5.3.1 Univariate case
We start with the case of a univariate parameter and one observation Y ∼ N(θ, σ2) ,
where the variance σ2 is known and only the mean θ is unknown. The Bayes approach
suggests to treat θ as a random variable. Suppose that the prior π is also normal with
mean τ and variance r2 .
Theorem 5.3.1. Let Y ∼ N(θ, σ2) , and let the prior π be the normal distribution
N(τ, r2) :
Y | θ ∼ N(θ, σ2),
ϑ ∼ N(τ, r2).
Then the joint, marginal, and posterior distributions are normal as well. Moreover, it
holds
Y ∼ N(τ, σ2 + r2),
ϑ | Y ∼ N( (τσ2 + Y r2)/(σ2 + r2), σ2r2/(σ2 + r2) ).
Proof. It holds Y = ϑ + ε with ϑ ∼ N(τ, r2) and ε ∼ N(0, σ2) independent of ϑ .
Therefore, Y is normal with mean IEY = IEϑ+ IEε = τ and the variance is
Var(Y ) = IE(Y − τ)2 = r2 + σ2.
This implies the formula for the marginal density p(Y ) . Next, for ρ = σ2/(r2 + σ2) ,
IE[(ϑ− τ)(Y − τ)
]= IE(ϑ− τ)2 = r2 = (1− ρ) Var(Y ).
Thus, the random variables Y − τ and ζ with
ζ = ϑ− τ − (1− ρ)(Y − τ) = ρ(ϑ− τ)− (1− ρ)ε
are Gaussian and uncorrelated and therefore independent. The conditional distribution of ζ given Y coincides with the unconditional distribution and hence it is normal with mean zero and variance

Var(ζ) = ρ2 Var(ϑ) + (1− ρ)2 Var(ε) = ρ2r2 + (1− ρ)2σ2 = σ2r2/(σ2 + r2).
This yields the result because ϑ = ζ + ρτ + (1− ρ)Y .
Exercise 5.3.1. Check the result of Theorem 5.3.1 by direct calculation using Bayes
formula (5.1).
So the posterior mean of ϑ is a weighted average of the prior mean τ and the
sample estimate Y ; the sample estimate is pulled back (or shrunk) toward the prior
mean. Moreover, the weight ρ on the prior mean is close to one if σ2 is large relative
to r2 (i.e. our prior knowledge is more precise than the data information), producing
substantial shrinkage. If σ2 is small (i.e., our prior knowledge is imprecise relative to
the data information), ρ is close to zero and the direct estimate Y is moved very little
towards the prior mean.
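The formulas of Theorem 5.3.1 can be cross-checked against a brute-force evaluation of the Bayes formula on a grid. A sketch with arbitrary illustrative numbers (not from the book):

```python
# Sketch (not from the book): check the posterior of Theorem 5.3.1 against
# a numerical evaluation of the Bayes formula on a fine grid.
import numpy as np

sigma2, tau, r2 = 2.0, 1.0, 0.5      # sigma^2, prior mean tau, prior variance r^2
Y = 3.0                              # one observation

# Closed form: weighted average with rho = sigma^2 / (sigma^2 + r^2)
rho = sigma2 / (sigma2 + r2)
post_mean = rho * tau + (1 - rho) * Y            # = (tau sigma^2 + Y r^2)/(sigma^2 + r^2)
post_var = sigma2 * r2 / (sigma2 + r2)

# Numerical posterior: likelihood times prior, renormalized on a grid
t = np.linspace(-10, 10, 400001)
unnorm = np.exp(-(Y - t) ** 2 / (2 * sigma2) - (t - tau) ** 2 / (2 * r2))
w = unnorm / unnorm.sum()
num_mean = (w * t).sum()
num_var = (w * (t - num_mean) ** 2).sum()

print(np.isclose(post_mean, num_mean, atol=1e-6))   # True
print(np.isclose(post_var, num_var, atol=1e-6))     # True
```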
Now consider the i.i.d. model from N(θ, σ2) where the variance σ2 is known.
Theorem 5.3.2. Let Y = (Y1, . . . , Yn)> be i.i.d. and for each Yi
Yi | θ ∼ N(θ, σ2), (5.2)
ϑ ∼ N(τ, r2). (5.3)
Then for the sample mean Ȳ = (Y1 + . . .+ Yn)/n

ϑ | Y ∼ N( (τσ2/n+ Ȳ r2)/(r2 + σ2/n), (r2σ2/n)/(r2 + σ2/n) ).
Exercise 5.3.2. Prove Theorem 5.3.2 using the technique of the proof of Theorem 5.3.1.
Hint: consider Yi = ϑ+ εi , Ȳ = S/n with S = Y1 + . . .+ Yn , and define ζ = ϑ− τ − (1− ρ)(Ȳ − τ) for ρ = (σ2/n)/(r2 + σ2/n) . Check that ζ and Ȳ are uncorrelated and hence independent.

The result of Theorem 5.3.2 can formally be derived from Theorem 5.3.1 by replacing the n i.i.d. observations Y1, . . . , Yn with the single observation Ȳ with conditional mean θ and variance σ2/n .
5.3.2 Linear Gaussian model and Gaussian prior
Now we consider the general case when both Y and ϑ are vectors. Namely we consider
the linear model Y = Ψ>ϑ+ ε with Gaussian errors ε in which the random parameter
vector ϑ is multivariate normal as well:
ϑ ∼ N(τ ,R), Y | θ ∼ N(Ψ>θ, Σ). (5.4)
Here Ψ is a given p×n design matrix, and Σ is a given error covariance matrix. Below
we assume that both Σ and R are non-degenerate. The model (5.4) can be represented
in the form
ϑ = τ + ξ, ξ ∼ N(0,R), (5.5)
Y = Ψ>τ + Ψ>ξ + ε, ε ∼ N(0, Σ), ε ⊥ ξ, (5.6)
where ξ ⊥ ε means independence of the error vectors ξ and ε . This representation
makes clear that the vectors ϑ,Y are jointly normal. Now we state the result about the
conditional distribution of ϑ given Y .
Theorem 5.3.3. Assume (5.4). Then the joint distribution of ϑ,Y is normal with
IE [ϑ; Y ] = [τ ; Ψ>τ ],   Var [ϑ; Y ] = [ R , RΨ ; Ψ>R , Ψ>RΨ + Σ ],
where the semicolons separate the ϑ -block from the Y -block of the mean vector and of the covariance matrix.
Moreover, the posterior ϑ | Y is also normal. With B = R−1 + ΨΣ−1Ψ> ,

IE(ϑ | Y ) = τ + RΨ(Ψ>RΨ + Σ)−1(Y − Ψ>τ ) = B−1R−1τ + B−1ΨΣ−1Y , (5.7)
Var(ϑ | Y ) = B−1. (5.8)
Proof. The following technical lemma explains a very important property of the normal
law: normal conditioned on a normal is again a normal.
Lemma 5.3.4. Let ξ and η be jointly normal. Denote U = Var(ξ) , W = Var(η) ,
C = Cov(ξ,η) = IE(ξ − IEξ)(η − IEη)> . Then the conditional distribution of ξ given
η is also normal with
IE[ξ | η] = IEξ + CW−1(η − IEη),
Var[ξ | η] = U − CW−1C>.
Proof. First consider the case when ξ and η are zero-mean. Then the vector
ζdef= ξ − CW−1η
is also normal zero-mean and fulfills
IE(ζη>) = IE[(ξ − CW−1η)η>] = IE(ξη>) − CW−1IE(ηη>) = 0,
Var(ζ) = IE[(ξ − CW−1η)(ξ − CW−1η)>] = U − CW−1C>.
The vectors ζ and η are jointly normal and uncorrelated, thus, independent. This
means that the conditional distribution of ζ given η coincides with the unconditional
one. It remains to note that ξ = ζ + CW−1η , and conditioned on η , the vector ξ is just a shift of the normal vector ζ by the fixed vector CW−1η . Therefore, the conditional distribution of ξ given η is normal with mean CW−1η and variance Var(ζ) = U − CW−1C> .
Exercise 5.3.3. Extend the proof of Lemma 5.3.4 to the case when the vectors ξ and
η are not zero mean.
It remains to deduce the desired result about posterior distribution from this lemma.
The formulas for the first two moments of ϑ and Y follow directly from (5.5) and (5.6).
Now we apply Lemma 5.3.4 with U = R , C = RΨ , W = Ψ>RΨ + Σ . It follows that
the vector ϑ conditioned on Y is normal with
IE(ϑ | Y ) = τ + RΨW−1(Y − Ψ>τ ), (5.9)
Var(ϑ | Y ) = R − RΨW−1Ψ>R.
Straightforward calculus implies {R − RΨW−1Ψ>R} B = Ip with B = R−1 + ΨΣ−1Ψ> , so that Var(ϑ | Y ) = B−1 , which is (5.8). Moreover, using Σ = W − Ψ>RΨ ,
RΨW−1 = RΨW−1(W − Ψ>RΨ)Σ−1 = RΨΣ−1 − RΨW−1Ψ>RΨΣ−1 = (R − RΨW−1Ψ>R)ΨΣ−1 = B−1ΨΣ−1.
This implies (5.7) by (5.9).
Exercise 5.3.4. Check the details of the proof of Theorem 5.3.3.
Exercise 5.3.5. Derive the result of Theorem 5.3.3 by direct computation of the density
of ϑ given Y .
Hint: use that ϑ and Y are jointly normal vectors. Consider their joint density p(θ,Y ) for Y fixed and obtain the conditional density by analyzing its linear and quadratic terms w.r.t. θ .
Exercise 5.3.6. Show that Var(ϑ | Y ) < Var(ϑ) = R .
Hint: use that Var(ϑ | Y ) = B−1 and B def= R−1 + ΨΣ−1Ψ> > R−1 .
The last exercise delivers an important message: the variance of the posterior is smaller than the variance of the prior. This is intuitively clear because the posterior utilizes both sources of information: those contained in the prior and those we get from the data Y . However, even in the simple Gaussian case, the proof is quite involved. Another interpretation of this fact will be given later: the Bayes approach effectively performs a kind of regularization and thus leads to a reduction of the variance; cf. Section 4.7.
Another conclusion from the formulas (5.7), (5.8) is that the moments of the posterior distribution approach the moments of the MLE θ = (ΨΣ−1Ψ>)−1ΨΣ−1Y as R grows.
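The two equivalent forms of the posterior mean in (5.7), and the two forms of the posterior variance, can be verified numerically. The following Python/NumPy sketch, with a randomly generated design Ψ , prior (τ , R) , and data Y of our own choosing, evaluates both sides of each identity:

```python
import numpy as np

rng = np.random.default_rng(1)
p, n = 3, 8
Psi = rng.normal(size=(p, n))            # p x n design matrix Psi
R = 0.5 * np.eye(p)                      # prior covariance of theta
Sigma = 2.0 * np.eye(n)                  # error covariance
tau = rng.normal(size=p)                 # prior mean
Y = rng.normal(size=n)                   # observed data

W = Psi.T @ R @ Psi + Sigma              # Var(Y)
B = np.linalg.inv(R) + Psi @ np.linalg.inv(Sigma) @ Psi.T

# posterior mean: both forms of (5.7)
mean_a = tau + R @ Psi @ np.linalg.solve(W, Y - Psi.T @ tau)
mean_b = np.linalg.solve(B, np.linalg.solve(R, tau) + Psi @ np.linalg.solve(Sigma, Y))

# posterior variance: the (5.9)-form and (5.8)
var_a = R - R @ Psi @ np.linalg.solve(W, Psi.T @ R)
var_b = np.linalg.inv(B)
```

Both pairs agree up to numerical precision, which is exactly the matrix identity established in the proof of Theorem 5.3.3.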
5.3.3 Homogeneous errors, orthogonal design
Consider a linear model Yi = Ψ>i ϑ+ εi for i = 1, . . . , n , where Ψi are given vectors in
IRp and εi are i.i.d. normal N(0, σ2) . This model is a special case of the model (5.4)
with Ψ = (Ψ1, . . . , Ψn) and uncorrelated homogeneous errors ε yielding Σ = σ2In .
Then Σ−1 = σ−2In , B = R−1 + σ−2ΨΨ> , and

IE(ϑ | Y ) = B−1R−1τ + σ−2B−1ΨY , (5.10)
Var(ϑ | Y ) = B−1,
where ΨΨ> = ∑i ΨiΨ>i . If the prior variance is also homogeneous, that is, R = r2Ip , then the formulas can be further simplified. In particular,
Var(ϑ | Y ) = (r−2Ip + σ−2ΨΨ>)−1.
The most transparent case corresponds to the orthogonal design with ΨΨ> = η2Ip for
some η2 > 0 . Then
IE(ϑ | Y ) = (σ2/r2)/(η2 + σ2/r2) τ + 1/(η2 + σ2/r2) ΨY , (5.11)
Var(ϑ | Y ) = σ2/(η2 + σ2/r2) Ip. (5.12)
Exercise 5.3.7. Derive (5.11) and (5.12) from Theorem 5.3.3 with Σ = σ2In , R = r2Ip , and ΨΨ> = η2Ip .
Exercise 5.3.8. Show that the posterior mean is a convex combination of the MLE θ = η−2ΨY and the prior mean τ :
IE(ϑ | Y ) = ρτ + (1 − ρ)θ
with ρ = (σ2/r2)/(η2 + σ2/r2) . Moreover, ρ → 0 as η → ∞ , that is, the posterior mean approaches the MLE θ .
5.4 Non-informative priors
The Bayes approach requires fixing a prior distribution on the values of the parameter ϑ . What happens if no such information is available? Is the Bayes approach still applicable? An immediate answer is “no”, but it would be a bit hasty. Actually one can still apply the Bayes approach with priors that do not give any preference to one point over the others. Such priors are called non-informative. Consider first the case when the set Θ is
finite: Θ = {θ1, . . . ,θM} . Then the non-informative prior is just the uniform measure
on Θ giving to every point θm the equal probability 1/M . Then the joint probability
of Y and ϑ is the average of the measures IPθm and the same holds for the marginal
distribution of the data:
p(y) = (1/M) ∑_{m=1}^{M} p(y | θm).
The posterior distribution is already “informative” and it differs from the uniform prior:
p(θk | y) = p(y | θk)π(θk)/p(y) = p(y | θk) / ∑_{m=1}^{M} p(y | θm), k = 1, . . . ,M.
Exercise 5.4.1. Check that the posterior measure is non-informative iff all the measures
IPθm coincide.
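For a finite Θ the posterior under the uniform prior is just the vector of likelihoods normalized to sum to one, as in the display above. A small Python/NumPy sketch (the Gaussian likelihood and the helper name are our own illustration) computes it in a numerically stable way:

```python
import numpy as np

def posterior_finite(y, thetas, sigma):
    """Posterior weights over a finite set Theta = {theta_1, ..., theta_M}
    under the uniform prior, for one observation y ~ N(theta, sigma^2).
    The constant prior 1/M cancels from numerator and denominator."""
    loglik = -0.5 * ((y - thetas) / sigma) ** 2   # log p(y | theta_m) up to a constant
    w = np.exp(loglik - loglik.max())             # subtract max for stability
    return w / w.sum()                            # normalize by sum_m p(y | theta_m)

post = posterior_finite(0.4, np.array([-1.0, 0.0, 1.0]), 1.0)
```

The resulting weights sum to one and concentrate on the candidate value closest to the observation, illustrating that the posterior is “informative” even though the prior is not.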
A similar situation arises if the set Θ is a non-discrete bounded subset in IRp . A
typical example is given by the case of a univariate parameter restricted to a finite interval
[a, b] . Define π(θ) = 1/π(Θ) , where π(Θ) def= ∫Θ dθ.
Then
p(y) = (1/π(Θ)) ∫Θ p(y | θ) dθ,
p(θ | y) = p(y | θ)π(θ)/p(y) = p(y | θ) / ∫Θ p(y | θ) dθ. (5.13)
In some cases the non-informative uniform prior can be used even for unbounded parameter sets. Indeed, what we really need is that the integral in the denominator of the last formula is finite:
∫Θ p(y | θ) dθ < ∞ for all y.
Then we can apply (5.13) even if Θ is unbounded.
Exercise 5.4.2. Consider the Gaussian shift model (5.2) with the non-informative prior.
(i) Check that for n = 1 , the value ∫_{−∞}^{∞} p(y | θ) dθ is finite for every y and the posterior distribution of ϑ given Y is normal with mean Y and variance σ2 .
(ii) Compute the posterior for n > 1 .
Exercise 5.4.3. Consider the Gaussian regression model Y = Ψ>ϑ+ ε , ε ∼ N(0, Σ) ,
and the non-informative prior π which is the Lebesgue measure on the space IRp . Show
that the posterior for ϑ is normal with mean θ = (ΨΣ−1Ψ>)−1ΨΣ−1Y and variance
(ΨΣ−1Ψ>)−1 . Compare with the result of Theorem 5.3.3.
Note that the result of this exercise can be formally derived from Theorem 5.3.3 by
replacing R−1 with 0.
Another way of tackling the case of an unbounded parameter set is to consider a
sequence of priors that approaches the uniform distribution on the whole parameter set.
In the case of linear Gaussian models and normal priors, a natural way is to let the
prior variance tend to infinity. Consider first the univariate case; see Section 5.3.1. A
non-informative prior can be approximated by the normal distribution with mean zero
and variance r2 tending to infinity. Then
ϑ | Y ∼ N( Y r2/(σ2 + r2), σ2r2/(σ2 + r2) ) w−→ N(Y, σ2) as r → ∞.
It is interesting to note that the case of an i.i.d. sample in fact reduces the situation to that of a non-informative prior. Indeed, the result of Theorem 5.3.2 can be rewritten with r2n = nr2 as
ϑ | Y ∼ N( (τσ2 + Y r2n)/(σ2 + r2n), σ2r2/(σ2 + r2n) ).
One says that the prior information “washes out” from the posterior distribution as the
sample size n tends to infinity.
5.5 Bayes estimate and posterior mean
Given a loss function ℘(θ,θ′) on Θ × Θ , the Bayes risk of an estimate θ = θ(Y ) is defined as
Rπ(θ) def= IE℘(θ,ϑ) = ∫Θ ( ∫Y ℘(θ(y),θ) p(y | θ)µ0(dy) ) π(θ)λ(dθ).
Note that ϑ in this formula is treated as a random variable that follows the prior
distribution π . One can represent this formula symbolically in the form
Rπ(θ) = IE[IE(℘(θ,ϑ) | ϑ)] = IE R(θ,ϑ).
Here the external integration averages the pointwise risk R(θ,ϑ) over all possible values
of ϑ due to the prior distribution.
The Bayes formula p(y |θ)π(θ) = p(θ |y)p(y) and change of order of integration can
be used to represent the Bayes risk via the posterior density:
Rπ(θ) = ∫Y ( ∫Θ ℘(θ(y),θ) p(θ | y)λ(dθ) ) p(y)µ0(dy) = IE[IE{℘(θ,ϑ) | Y }].
The estimate θπ is called Bayes or π -Bayes if it minimizes the corresponding risk:
θπ = argminθ Rπ(θ),
where the infimum is taken over the class of all feasible estimates. The most widespread
choice of the loss function is the quadratic one:
℘(θ,θ′)def= ‖θ − θ′‖2.
The great advantage of this choice is that the Bayes solution can be given explicitly: it
is the posterior mean:
θπ def= IE(ϑ | Y ) = ∫Θ θ p(θ | Y )λ(dθ).
Note that due to Bayes’ formula, this value can be rewritten as
θπ = (1/p(Y )) ∫Θ θ p(Y | θ)π(θ)λ(dθ), p(Y ) = ∫Θ p(Y | θ)π(θ)λ(dθ).
Theorem 5.5.1. It holds for any estimate θ
Rπ(θ) ≥ Rπ(θπ).
Proof. The main feature of the posterior mean is that it provides a kind of projection of
the data. This property can be formalized as follows:
IE(θπ − ϑ | Y ) = ∫Θ (θπ − θ) p(θ | Y )λ(dθ) = 0,
yielding for any estimate θ = θ(Y )
IE(‖θ − ϑ‖2 | Y ) = IE(‖θπ − ϑ‖2 | Y ) + IE(‖θπ − θ‖2 | Y ) + 2(θ − θπ)>IE(θπ − ϑ | Y )
= IE(‖θπ − ϑ‖2 | Y ) + IE(‖θπ − θ‖2 | Y )
≥ IE(‖θπ − ϑ‖2 | Y ).
Here we have used that both θ and θπ are functions of Y and can be considered as
constant when taking the conditional expectation w.r.t. Y . Now
Rπ(θ) = IE‖θ − ϑ‖2 = IE[IE(‖θ − ϑ‖2 | Y )] ≥ IE[IE(‖θπ − ϑ‖2 | Y )] = Rπ(θπ)
and the result follows.
Exercise 5.5.1. Consider the univariate case with the loss function |θ−θ′| . Check that
the posterior median minimizes the Bayes risk.
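Theorem 5.5.1 can be illustrated by simulation: drawing ϑ from the prior and then Y given ϑ , the Monte Carlo Bayes risk of the posterior mean should fall below that of the naive estimate Y and match the posterior variance σ2r2/(σ2 + r2) from Theorem 5.3.1. A Python/NumPy sketch, with parameter values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(3)
tau, r2, sigma2, n_rep = 0.0, 1.0, 4.0, 20000

# draw theta from the prior, then one observation Y | theta ~ N(theta, sigma2)
theta = rng.normal(tau, np.sqrt(r2), size=n_rep)
Y = theta + rng.normal(0.0, np.sqrt(sigma2), size=n_rep)

rho = sigma2 / (r2 + sigma2)
bayes = rho * tau + (1 - rho) * Y        # posterior mean, Theorem 5.3.1

risk_bayes = np.mean((bayes - theta) ** 2)   # Monte Carlo Bayes risk
risk_mle = np.mean((Y - theta) ** 2)         # risk of the naive estimate Y
```

Here `risk_bayes` is close to σ2r2/(σ2 + r2) = 0.8 and clearly below `risk_mle` ≈ σ2 = 4, in line with the theorem.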
5.5.1 Posterior mean and ridge regression
Here we again consider the case of a linear Gaussian model
Y = Ψ>ϑ+ ε, ε ∼ N(0, σ2In).
(To simplify the presentation, we focus here on the case of homogeneous errors with
Σ = σ2In .) Recall that the maximum likelihood estimate θ for this model reads as
θ = (ΨΨ>)−1ΨY .
The regularized MLE is defined as
θR = (ΨΨ> + R2)−1ΨY ,
where R is a regularizing matrix; cf. Section 4.7.1. It turns out that a similar estimate appears in quite a natural way within the Bayes approach. Consider the normal prior distribution ϑ ∼ N(0, R2) . The posterior will be normal as well with the posterior mean
θπ = σ−2B−1ΨY = (ΨΨ> + σ2R−2)−1ΨY ;
see (5.10). It follows that θπ = θR for the normal prior π = N(0, σ2R−2) .
One can say that the Bayes approach leads to a regularization of the least squares
method. The degree of regularization is inversely proportional to the variance of the
prior. The larger the variance, the closer the prior is to the non-informative one and the
posterior mean θπ to the MLE θ .
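The coincidence of the ridge-type estimate and the posterior mean is easy to confirm numerically. In the Python/NumPy sketch below (dimensions and matrices are our own choice), the prior covariance σ2R−2 is the one matching the regularizing matrix R2 , as discussed above:

```python
import numpy as np

rng = np.random.default_rng(2)
p, n = 4, 20
Psi = rng.normal(size=(p, n))            # p x n design matrix
Y = rng.normal(size=n)
sigma2 = 1.5
R2 = 0.7 * np.eye(p)                     # regularizing matrix R^2

# regularized (ridge-type) MLE
theta_ridge = np.linalg.solve(Psi @ Psi.T + R2, Psi @ Y)

# posterior mean under the prior N(0, G) with G = sigma2 * inv(R^2)
G = sigma2 * np.linalg.inv(R2)
B = np.linalg.inv(G) + (Psi @ Psi.T) / sigma2
theta_bayes = np.linalg.solve(B, Psi @ Y) / sigma2   # sigma^{-2} B^{-1} Psi Y, cf. (5.10)
```

The two estimates agree up to numerical precision, since B = σ−2(R2 + ΨΨ>) implies σ−2B−1ΨY = (ΨΨ> + R2)−1ΨY .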
Chapter 6
Testing a statistical hypothesis
Let Y be the observed sample. The hypothesis testing problem assumes that there is
some external information (hypothesis) about the distribution of this sample and the
target is to check this hypothesis on the basis of the available data.
6.1 Testing problem
This section specifies the main notions of the theory of hypothesis testing. We start
with a simple hypothesis. Afterwards a composite hypothesis will be discussed. We also
introduce the notions of the testing error, level, power, etc.
6.1.1 Simple hypothesis
The classical testing problem is to check, on the basis of the available data, the hypothesis that the data follow a precisely known distribution. We illustrate this notion by several examples.
Example 6.1.1. [Simple game] Let Y = (Y1, . . . , Yn)> be a Bernoulli sequence of zeros and ones. This sequence can be viewed as a record of successes and failures, e.g. the results of tossing a coin. The hypothesis about this sequence is that wins (associated with one) and
losses (associated with zero) are equally frequent in the long run. This hypothesis can
be formalized as follows: IP = IPθ∗ with θ∗ = 1/2 , where IPθ describes the Bernoulli
experiment with parameter θ .
Example 6.1.2. [No effect treatment] Let (Yi, Ψi) be experimental results, i = 1, . . . , n .
The linear regression model assumes certain dependence of the form Yi = Ψ>i θ+ εi with
errors εi having zero mean. The “no effect” hypothesis means that there is no systematic
dependence of Yi on the factors Ψi , i.e. θ = θ∗ = 0 and the observations Yi are just
noise.
Example 6.1.3. [Quality control] Let Yi be the results of a production process which
can be represented in the form Yi = θ∗ + εi , where θ∗ is a nominal value and εi is
a measurement error. The hypothesis is that the observed process indeed follows this
model.
The general problem of testing a simple hypothesis is stated as follows: to check on
the basis of the available observations Y that their distribution is described by a given
measure IP . The hypothesis is often called a null hypothesis or just null.
6.1.2 Composite hypothesis
More generally, one can speak about the problem of testing a composite hypothesis. Let
(IPθ,θ ∈ Θ ⊂ IRp) be a given parametric family, and let Θ0 ⊆ Θ be a subset in Θ . The
hypothesis is that the data distribution IP belongs to the set (IPθ,θ ∈ Θ0) .
We give some typical examples where such a formulation is natural.
Example 6.1.4. [Testing a subvector] Let the vector θ ∈ Θ be decomposed into two parts: θ = (γ,η) . The subvector γ is the target of analysis, while the subvector η matters for the distribution of the data but is not the target of analysis; it is often called the nuisance parameter. The hypothesis we want to test is γ = γ∗ for some fixed value γ∗ . A typical situation where such problems arise is factor analysis: one checks for “no effect” of one particular factor in the presence of many other factors.
Example 6.1.5. [Interval testing] Let Θ be the real line and Θ0 be an interval. The
hypothesis is that IP = IPθ∗ for θ∗ ∈ Θ0 . Such problems are typical for quality control or
warning (monitoring) systems when the controlled parameter should be in the prescribed
range.
Example 6.1.6. [Testing a hypothesis about error distribution] Consider the regression
model Yi = Ψ>i θ + εi . The typical assumption about the errors εi is that they are
zero-mean normal. One can test this assumption having in mind the cases with discrete,
or heavy-tailed, or heteroscedastic errors.
6.1.3 A test
A test is a statistical decision on the basis of the available data whether the hypothesis is
accepted or rejected. So the decision space consists of only two points, which we denote
by zero and one. A decision φ is a mapping of the data Y to this space and is called a
test :
φ : Y→ {0, 1}.
The event φ = 1 means that the hypothesis is rejected and the opposite event means
the acceptance of the null. Usually the testing results are qualified in the following way:
rejection of the hypothesis means that the data are not consistent with the null, or,
equivalently, the data contain some evidence against the null hypothesis. Acceptance
simply means that the data do not contradict the null.
The region of acceptance is a subset of the observation space Y on which φ = 0 .
One also says that this region is the set of values for which we fail to reject the null
hypothesis. The region of rejection or critical region is on the other hand the subset of
Y on which φ = 1 .
6.1.4 Errors of the first kind, test level
In the hypothesis testing framework one distinguishes between error of the first and
second kind. The error of the first kind means that the hypothesis is wrongly rejected
when it was correct. We formalize this notion first for the case of a simple hypothesis
and then extend it to the general case.
Let H0 : Y ∼ IPθ∗ be a null hypothesis. The error of the first kind is the situation
when the data indeed follow the null, but the decision of the test is to reject this hypoth-
esis: φ = 1 . Clearly the probability of such an error is IPθ∗(φ = 1) . One says that φ is
a test of level α for some α ∈ (0, 1) if
IPθ∗(φ = 1) = α.
The value α is called level (size) of the test or significance level.
If the hypothesis is composite, then the level of the test is the maximum rejection
probability over the null subset. A test φ is of level α if
supθ∈Θ0 IPθ(φ = 1) ≤ α.
6.1.5 A randomized test
In some situations it is difficult to decide about acceptance or rejection of the hypothesis.
A randomized test can be viewed as a weighted decision: with a certain probability the
hypothesis is rejected, otherwise accepted. The decision space for a randomized test φ
is an interval [0, 1] , that is, φ(Y ) is a number between zero and one. The hypothesis
H0 is rejected with probability φ(Y ) on the basis of the observed data Y . If φ(Y )
only admits the binary values 0 and 1 for every Y , then we come back to the usual
non-randomized test. The probability of the first-kind error is naturally given by the
value IEφ(Y ) . For a simple hypothesis H0 : IP = IPθ∗ , a test φ is of level α if
IEφ(Y ) = α.
In the case of a composite hypothesis H0 : IP ∈ (IPθ,θ ∈ Θ0) , the level condition reads
as
supθ∈Θ0 IEφ(Y ) ≤ α.
In what follows we mostly consider non-randomized tests and only comment on whether
a randomization can be useful. Note that any randomized test can be reduced to a
non-randomized test by extending the probability space.
Exercise 6.1.1. Construct for any randomized test φ its non-randomized version using
a random data generator.
6.1.6 An alternative, error of the second kind, power of the test
The set-up of hypothesis testing focuses on the null hypothesis. However, for a complete
analysis, one has to specify the data distribution when the hypothesis is wrong. Within
the parametric framework, one usually makes the assumption that the unknown data
distribution belongs to some parametric family (IPθ,θ ∈ Θ ⊆ IRp) . This assumption has
to be fulfilled independently of whether the hypothesis is true or false. In other words,
we assume that IP ∈ (IPθ,θ ∈ Θ) and there is a subset Θ0 ⊂ Θ corresponding to the
null hypothesis. The measure IP = IPθ for θ 6∈ Θ0 is called an alternative.
Now we can consider the performance of a test φ when the hypothesis H0 is wrong.
The decision to accept the hypothesis when it is wrong is called the error of the second
kind. The probability of such error is equal to IP (φ = 0) . This value certainly depends
on the alternative IP = IPθ for θ 6∈ Θ0 . The value β(θ) = 1− IPθ(φ = 0) is often called
the test power at θ 6∈ Θ0 . The function β(θ) of θ ∈ Θ \Θ0 given by
β(θ)def= 1− IPθ(φ = 0)
is called a power function. Ideally one would wish to build a test which simultaneously minimizes the level and maximizes the power. These two wishes are contradictory: a decrease of the level usually results in a decrease of the power and vice versa. Usually one imposes the level- α constraint on the test and tries to optimize its power.
Definition 6.1.1. A test φ∗ is called uniformly most powerful (UMP) of level α if it is of level α and for any other test φ of level α , it holds
1 − IPθ(φ∗ = 0) ≥ 1 − IPθ(φ = 0), θ 6∈ Θ0.
Unfortunately, such UMP tests exist only in very few special models; otherwise,
optimization of the power given the level is a complicated task.
In the case of a univariate parameter θ ∈ Θ ⊂ IR1 and a simple hypothesis θ = θ∗ ,
one often considers one-sided alternatives
H1 : θ ≥ θ∗ or H1 : θ ≤ θ∗
or a two-sided alternative
H1 : θ 6= θ∗ .
6.2 Neyman-Pearson test for two simple hypotheses
This section discusses one very special case of hypothesis testing when both the hypothesis
and alternative are simple one-point sets. This special situation by itself can be viewed
as a toy problem, but it is very important from the methodological point of view. In
particular, it introduces and justifies the so-called likelihood ratio test and demonstrates
its efficiency.
For simplicity we write IP0 for the null hypothesis and IP1 for the alternative measure.
A test φ is a measurable function of the observations with values in the two-point set
{0, 1} . The event φ = 0 is treated as acceptance of the null hypothesis H0 while φ = 1
means rejection of the null hypothesis against H1 .
For ease of presentation we assume that the measure IP1 is absolutely continuous
w.r.t. the measure IP0 and denote by Z(Y ) the corresponding derivative at the obser-
vation point:
Z(Y ) def= (dIP1/dIP0)(Y ).
Similarly L(Y ) means the log-density:
L(Y ) def= logZ(Y ) = log (dIP1/dIP0)(Y ).
The solution of the testing problem in the case of two simple hypotheses is known as the Neyman-Pearson test: reject the hypothesis H0 if the likelihood ratio Z(Y ) exceeds a specific critical value t :
φ∗t def= 1(Z(Y ) > t).
The Neyman-Pearson test is known as the one minimizing the weighted sum of the errors
of the first and second kind. For a non-randomized test this sum is equal to
℘0IP0(φ = 1) + ℘1IP1(φ = 0),
while the weighted error of a randomized test φ is
℘0IE0φ+ ℘1IE1(1− φ). (6.1)
Theorem 6.2.1. For every two positive values ℘0 and ℘1 , the test φ∗t with t = ℘0/℘1 minimizes (6.1) over all possible (randomized) tests φ :
φ∗t def= 1(Z(Y ) ≥ t) = argminφ {℘0IE0φ + ℘1IE1(1 − φ)}.
Proof. We use the formula for a change of measure:
IE1ξ = IE0[ξZ(Y )]
for any r.v. ξ . It holds for any test φ with t = ℘0/℘1
℘0IE0φ + ℘1IE1(1 − φ) = IE0[℘0φ − ℘1Z(Y )φ] + ℘1 = −℘1IE0[Z(Y ) − t]φ + ℘1 ≥ −℘1IE0[Z(Y ) − t]+ + ℘1
with the equality for φ = 1(Z(Y ) ≥ t) .
The Neyman-Pearson test belongs to a large class of tests of the form
φ = 1(T ≥ t),
where T is a function of the observations Y . This random variable is usually called a
test statistic while the threshold t is called a critical value. The hypothesis is rejected
if the test statistic exceeds the critical value. For the Neyman-Pearson test, the test
statistic is the likelihood ratio Z(Y ) and the critical value is selected as its quantile.
The next result shows that the Neyman-Pearson test φ∗t with a proper critical value
t can be constructed to maximize the power IE1φ under the level constraint IE0φ ≤ α .
Theorem 6.2.2. Given α ∈ (0, 1) , let tα be such that
IP0(Z(Y ) ≥ tα) = α. (6.2)
Then it holds
φ∗tα def= 1(Z(Y ) ≥ tα) = argmax_{φ : IE0φ≤α} IE1φ.
Proof. Let φ satisfy IE0φ ≤ α . Then
IE1φ − αtα ≤ IE0{Z(Y )φ} − tαIE0φ = IE0{(Z(Y ) − tα)φ} ≤ IE0[Z(Y ) − tα]+
with the equality for φ = 1(Z(Y ) ≥ tα) .
The previous result assumes that for a given α there is a critical value tα such that
(6.2) is fulfilled. However, this is not always the case.
Exercise 6.2.1. Let Z(Y ) = (dIP1/dIP0)(Y ) .
• Show that the relation (6.2) can always be fulfilled with a proper choice of tα if the distribution function of Z(Y ) under IP0 is continuous.
• Suppose that the distribution function of Z(Y ) under IP0 has a jump at tα , so that
IP0(Z(Y ) ≥ tα) > α, IP0(Z(Y ) > tα) < α.
Construct a randomized test φ that fulfills IE0φ = α and maximizes the test power IE1φ among all such tests.
The Neyman-Pearson test can be viewed as a special case of the general likelihood
ratio test. Indeed, it decides in favor of the null or the alternative by looking at the
likelihood ratio. Informally one can say: we select the null if it is more likely at the point
of observation Y .
An interesting question that arises in relation to the Neyman-Pearson result is how to interpret it when the true distribution IP coincides neither with IP0 nor with IP1 and is possibly not even within the considered parametric family (IPθ) . Wald called this situation the third-kind error. It is worth mentioning that the test φ∗t remains meaningful: it decides which of the two given measures IP0 and IP1 better describes the given data. However, it is no longer a likelihood ratio test. In analogy with estimation theory, one can call it a quasi likelihood ratio test.
6.2.1 Neyman-Pearson test for an i.i.d. sample
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample from a measure P . Suppose that P belongs
to some parametric family (Pθ,θ ∈ Θ ⊂ IRp) , that is, P = Pθ∗ for θ∗ ∈ Θ . Let also a
special point θ0 (a null) be fixed. The null hypothesis can be formulated as θ∗ = θ0 .
Similarly, a simple alternative is θ∗ = θ1 for some other point θ1 ∈ Θ . The Neyman-
Pearson test situation is a bit artificial: one reduces the whole parameter set Θ to just
these two points θ0 and θ1 and tests θ0 against θ1 .
As usual, the distribution of the data Y is described by the product measure IPθ = P⊗nθ . If µ0 is a dominating measure for (Pθ) and `(y,θ) def= log[dPθ(y)/dµ0] , then the log-likelihood L(Y ,θ) is
L(Y ,θ) def= log (dIPθ/dµ0)(Y ) = ∑i `(Yi,θ),
where µ0 = µ⊗n0 . The log-likelihood ratio of IPθ1 w.r.t. IPθ0 can be defined as
L(Y ,θ1,θ0) def= L(Y ,θ1) − L(Y ,θ0).
The related Neyman-Pearson test can be written as
φ∗t def= 1(L(Y ,θ1,θ0) > z) with z = log t .
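The Neyman-Pearson construction can be sketched in code for the Gaussian shift model treated in Section 6.3.1; the function name, the simulated data, and the hard-coded constant 1.6449 (the upper 5% quantile of the standard normal law) are our own additions, not part of the text.

```python
import numpy as np

Q95 = 1.6449  # upper 5% quantile of the standard normal law

def np_test(y, theta0, theta1, sigma, q=Q95):
    """Neyman-Pearson test of H0: theta = theta0 against H1: theta = theta1
    (theta1 > theta0) for i.i.d. N(theta, sigma^2) data."""
    n = len(y)
    S = y.sum()
    delta = theta1 - theta0
    # log-likelihood ratio L(Y, theta1, theta0)
    lr = ((S - n * theta0) * delta - n * delta**2 / 2) / sigma**2
    # level-alpha critical value; it depends on (theta0, theta1) only via delta
    z = (q * delta * sigma * np.sqrt(n) - n * delta**2 / 2) / sigma**2
    return lr > z

# Monte Carlo check of the level under the null
rng = np.random.default_rng(4)
level = np.mean([np_test(rng.normal(0.0, 1.0, 50), 0.0, 0.5, 1.0)
                 for _ in range(4000)])
```

The empirical rejection frequency under the null stays close to α = 0.05 , confirming the level calibration of the critical value.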
6.3 Likelihood ratio test
This section introduces a general likelihood ratio test in the framework of parametric
testing theory. Let, as usual, Y be the observed data, and IP be their distribution. The
parametric assumption is that IP ∈ (IPθ,θ ∈ Θ) , that is, IP = IPθ∗ for θ∗ ∈ Θ . Let
now two subsets Θ0 and Θ1 of the set Θ be given. The hypothesis H0 that we would
like to test is that IP ∈ (IPθ,θ ∈ Θ0) , or equivalently, θ∗ ∈ Θ0 . The alternative is that
θ∗ ∈ Θ1 .
The general likelihood approach leads to comparing the likelihood values L(Y ,θ) on
the hypothesis and alternative sets. Namely, the hypothesis is rejected if there is one
alternative point θ1 ∈ Θ1 such that the value L(Y ,θ) exceeds all similar values for
θ ∈ Θ0 . In other words, observing the sample Y under alternative IPθ1 is more likely
than under any measure IPθ from the null. Formally this relation can be written as:
supθ∈Θ0 L(Y ,θ) < supθ∈Θ1 L(Y ,θ).
In particular, a simple hypothesis means that the set Θ0 consists of one point θ0 and
this relation becomes of the form
L(Y ,θ0) < supθ∈Θ1 L(Y ,θ).
In general, the likelihood ratio (LR) test corresponds to the test statistic
T def= supθ∈Θ1 L(Y ,θ) − supθ∈Θ0 L(Y ,θ). (6.3)
The hypothesis is rejected if this test statistic exceeds some critical value z . Usually this critical value is selected to ensure the level condition:
IP(T > zα) ≤ α
for a given level α .
We have already seen that the LR test is optimal for testing two simple hypotheses. Later we show that this optimality property can be extended to some more general situations. Now we consider further examples of the LR test.
6.3.1 Gaussian shift model
For all examples considered in this section, we assume that the data Y in form of an
i.i.d. sample (Y1, . . . , Yn)> follow the model Yi = θ∗ + εi with εi ∼ N(0, σ2) for σ2
known. Equivalently Yi ∼ N(θ∗, σ2) . The log-likelihood L(Y , θ) (which we also denote
by L(θ) ) reads as
L(θ) = −(n/2) log(2πσ2) − (1/(2σ2)) ∑i (Yi − θ)2 (6.4)
and the log-likelihood ratio L(θ, θ0) = L(θ) − L(θ0) is given by
L(θ, θ0) = σ−2[(S − nθ0)(θ − θ0) − n(θ − θ0)2/2] (6.5)
with S def= Y1 + . . . + Yn . Moreover, under the measure IPθ0 , the variable S − nθ0 is zero-mean normal with variance nσ2 . This particularly implies that (S − nθ0)/(σ√n) is standard normal under IPθ0 :
L( (S − nθ0)/(σ√n) | IPθ0 ) = N(0, 1).
We start with the simplest case of a simple null and simple alternative.
Simple null and simple alternative Let the null H0 : θ∗ = θ0 be tested against the
alternative H1 : θ∗ = θ1 for some fixed θ1 6= θ0 . The log-likelihood ratio L(θ1, θ0) is given by (6.5), leading to the test statistic
T = σ−2[(S − nθ0)(θ1 − θ0) − n(θ1 − θ0)2/2].
The proper critical value zα can be selected from the α -level condition IPθ0(T > zα) = α . We use that the sum S − nθ0 is under the null zero-mean normal with variance nσ2 . With ξ = (S − nθ0)/(σ√n) ∼ N(0, 1) , the level condition can be rewritten as
IP( ξ > [σ2zα + n(θ1 − θ0)2/2] / (|θ1 − θ0|σ√n) ) = α.
As ξ is standard normal, the proper zα can be computed via a quantile of the standard normal law: if qα is defined by IP (ξ > qα) = α , then
[σ2zα + n(θ1 − θ0)2/2] / (|θ1 − θ0|σ√n) = qα
or
zα = σ−2[qα|θ1 − θ0|σ√n − n|θ1 − θ0|2/2].
It is worth noting that this value actually does not depend on θ0 . It only depends on the
difference |θ1 − θ0| between the null and the alternative. This is a very important and
useful property of the normal family and it is called pivotality. Another way of selecting
the critical value z is given by minimizing the sum of the first and second-kind error
probabilities. Theorem 6.2.1 leads to the choice z = 0 , or equivalently (for θ1 > θ0 ), to the test
φ = 1(S/n > (θ0 + θ1)/2) = 1(θ > (θ0 + θ1)/2).
This test is also called the Fisher discrimination. It naturally appears in classification
problems.
Two-sided test Now we consider a more general situation when the simple null θ∗ =
θ0 is tested against the alternative θ∗ 6= θ0 . Then the LR test compares the likelihood at
θ0 with the maximum likelihood over Θ\{θ0} which clearly coincides with the maximum
over the whole parameter set. This leads to the test statistic:
T = maxθ L(θ, θ0) = n|θ − θ0|2/(2σ2)
(see Section 2.9), where θ = S/n is the MLE. Now for a critical value z , the LR test
rejects the null if T ≥ z . The value z can be selected from the level condition:
IPθ0(T > z) = IPθ0(nσ−2|θ − θ0|2 > 2z) = α.
Now we use that nσ−2|θ− θ0|2 is χ21 -distributed. If zα is defined by IP (ξ2 ≥ 2zα) = α
for standard normal ξ , then the test φ = 1(T > zα) is of level α . Again, this value
does not depend on the null point θ0 , and the LR test is pivotal.
Exercise 6.3.1. Compute the power function of the test φ = 1(T > zα) .
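The two-sided LR test above can be sketched in a few lines; the function name, the simulated data, and the constant 3.8415 (the 95% quantile of the χ21 distribution, i.e. 2zα ) are our own additions.

```python
import numpy as np

CHI2_1_95 = 3.8415  # upper 5% quantile of the chi^2 distribution with 1 df

def two_sided_lr_test(y, theta0, sigma):
    """Two-sided LR test of H0: theta = theta0 in the Gaussian shift model
    with known sigma: T = n |mle - theta0|^2 / (2 sigma^2), reject if
    2 T exceeds the chi^2_1 quantile."""
    n = len(y)
    mle = y.mean()                        # MLE: the sample mean S/n
    T = n * (mle - theta0) ** 2 / (2 * sigma**2)
    return 2 * T > CHI2_1_95

# Monte Carlo check of the level under the null; note the test is pivotal:
# the rejection rule does not involve theta0 beyond centering
rng = np.random.default_rng(5)
level = np.mean([two_sided_lr_test(rng.normal(0.0, 2.0, 40), 0.0, 2.0)
                 for _ in range(4000)])
```

The empirical level stays close to α = 0.05 , as the χ21 calibration predicts.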
6.3.2 One-sided test
Now we consider the problem of testing the null θ∗ = θ0 against the one-sided alternative
H1 : θ > θ0 . To apply the LR test we have to compute the maximum of the log-likelihood
ratio L(θ, θ0) over the set Θ1 = {θ > θ0} .
Exercise 6.3.2. Check that
supθ>θ0 L(θ, θ0) = nσ−2|θ − θ0|2/2 if θ ≥ θ0 , and 0 otherwise.
Hint: if θ ≥ θ0 , then the maximum over Θ1 coincides with the global maximum; otherwise it is attained at the edge θ0 .
Now the LR test rejects the null if θ > θ0 and nσ−2|θ − θ0|2 > 2z for a critical value z . That is,
φ = 1(θ − θ0 > σ√(2z/n)).
The critical value z can again be chosen by the level condition. As ξ = √n(θ − θ0)/σ is standard normal under IPθ0 , one has to select z to ensure IP(ξ > √(2z)) = α .
6.3.3 Testing the mean when the variance is unknown
This section discusses the two-sided testing problem H0 : θ∗ = θ0 against H1 : θ∗ 6= θ0 for the Gaussian shift model Yi = θ∗ + σ∗εi with standard normal errors εi and unknown variance σ∗2 . Here the null hypothesis is composite because it involves the unknown variance σ∗2 .
The log-likelihood function is still given by (6.4) but now σ∗2 is a part of the parameter vector. Maximizing the log-likelihood L(θ, σ2) under the null leads to the value L(θ0, σ20) with
σ20 def= argmaxσ2 L(θ0, σ2) = n−1 ∑i (Yi − θ0)2.
As in Section 2.9.2 for the problem of variance estimation, it holds for any σ
L(θ0, σ20) − L(θ0, σ2) = nK(σ20, σ2).
At the same time, maximizing L(θ, σ2) over the alternative is equivalent to the global maximization leading to the value L(θ, σ2) with
θ = S/n, σ2 = n−1 ∑i (Yi − θ)2.
The LR test statistic reads as
T = L(θ, σ2)− L(θ0, σ20).
This expression can be decomposed in the following way:
T = L(θ, σ2) − L(θ0, σ2) + L(θ0, σ2) − L(θ0, σ20) = (n/(2σ2))(θ − θ0)2 − nK(σ20, σ2).
Often one considers another test in which the variance is only estimated under the alter-
native, that is, σ is used in place of σ0 . This is quite natural because the null can be
viewed as a particular case of the alternative. This leads to the test statistic
T ∗ = L(θ, σ2) − L(θ0, σ2) = (n/(2σ2))(θ − θ0)2.
An advantage of this expression is that its distribution under the measure IPθ0,σ² does not depend on θ0 or on σ². After proper rescaling, this statistic follows a Fisher distribution, which will be discussed in Chapter 7.
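The decomposition above is easy to verify numerically. The sketch below (Python/numpy; the function name and data are ours) computes T and T∗ using σ̂0² = n⁻¹Σ(Yi − θ0)² and σ̃² = n⁻¹Σ(Yi − θ̂)², with the Gaussian variance divergence K(σ̂0², σ̃²) = [log(σ̃²/σ̂0²) + σ̂0²/σ̃² − 1]/2. Since σ̂0² = σ̃² + (θ̂ − θ0)², the decomposition collapses to T = (n/2) log(σ̂0²/σ̃²), and 2T∗ coincides with the squared t-statistic up to the factor n/(n − 1).

```python
import numpy as np

def lr_stats_unknown_variance(y, theta0):
    """T and T* from Section 6.3.3 for the Gaussian shift model with
    unknown variance (theta_hat = mean, sigma_tilde2 = variance MLE)."""
    n = len(y)
    theta_hat = np.mean(y)
    sigma_tilde2 = np.mean((y - theta_hat) ** 2)   # variance MLE under the alternative
    sigma0_hat2 = np.mean((y - theta0) ** 2)       # variance MLE under the null
    T_star = n * (theta_hat - theta0) ** 2 / (2 * sigma_tilde2)
    # K(sigma0_hat2, sigma_tilde2) for two Gaussian variances
    K = 0.5 * (np.log(sigma_tilde2 / sigma0_hat2) + sigma0_hat2 / sigma_tilde2 - 1)
    T = T_star - n * K
    return T, T_star

y = np.array([0.3, -1.2, 0.5, 1.7, -0.4, 0.9])
T, T_star = lr_stats_unknown_variance(y, theta0=0.0)
```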
6.3.4 LR-tests. Examples
(to be inserted)
6.4 Testing problem for a univariate exponential family
Let (Pθ, θ ∈ Θ ⊆ IR1) be a univariate exponential family. The choice of parametrization is unimportant; any parametrization can be taken. To be specific, we assume the natural parametrization, which simplifies the expression for the maximum likelihood estimate.
We assume that two functions of θ are fixed, C(θ) and B(θ), with which the log-density of Pθ can be written in the form

    ℓ(y, θ) def= log p(y, θ) = yC(θ) − B(θ) − ℓ(y)

for some other function ℓ(y). The function C(θ) is monotonic in θ, and C(θ) and B(θ) are related (for the case of an EFn) by the identity B′(θ) = θC′(θ); see Section 2.11.
Let now Y = (Y1, . . . , Yn) be an i.i.d. sample from Pθ∗ for θ∗ ∈ Θ . The task is to
test a simple hypothesis θ∗ = θ0 against an alternative θ∗ ∈ Θ1 for some subset Θ1
that does not contain θ0 .
6.4.1 Two-sided alternative
We start with the case of a simple hypothesis H0 : θ∗ = θ0 against a full two-sided alter-
native H1 : θ∗ 6= θ0 . The likelihood ratio approach suggests to compare the likelihood at
θ0 with the maximum of the likelihood over the alternative, that effectively means the
maximum over the whole parameter set. In the case of a univariate exponential family,
this maximum is computed in Section 2.11. For

    L(θ, θ0) def= L(θ) − L(θ0) = S[C(θ) − C(θ0)] − n[B(θ) − B(θ0)]

with S = Y1 + ... + Yn, it holds

    T def= sup_θ L(θ, θ0) = nK(θ̂, θ0),

where K(θ, θ′) is the Kullback-Leibler divergence between the measures Pθ and Pθ′. For an EFn, the MLE θ̂ is the empirical mean of the observations Yi, θ̂ = S/n, and the KL divergence K(θ̂, θ0) is of the form

    K(θ̂, θ0) = θ̂[C(θ̂) − C(θ0)] − [B(θ̂) − B(θ0)].

Therefore, the test statistic T is a function of the empirical mean θ̂ = S/n:

    T = nK(θ̂, θ0) = nθ̂[C(θ̂) − C(θ0)] − n[B(θ̂) − B(θ0)].    (6.6)
The LR test rejects H0 if the test statistic T exceeds a critical value z . Given α ∈ (0, 1) ,
a proper CV zα can be specified by the level condition
IPθ0(T > zα) = α.
In view of (6.6), the LR test rejects the null if the “distance” K(θ̂, θ0) between the estimate θ̂ and the null θ0 is significantly larger than zero. In the case of an exponential family, one can simplify the test by considering the estimate θ̂ itself as test statistic. We use the following technical result for the KL divergence K(θ̂, θ0):
Lemma 6.4.1. Let (Pθ) be an EFn. Then for every z there are two positive values
t−(z) and t+(z) such that
{θ : K(θ, θ0) ≤ z} = {θ : θ0 − t−(z) ≤ θ ≤ θ0 + t+(z)}. (6.7)
In other words, the conditions K(θ, θ0) ≤ z and θ0 − t−(z) ≤ θ ≤ θ0 + t+(z) are
equivalent.
Proof. The function K(θ, θ0) of the first argument θ fulfills

    ∂K(θ, θ0)/∂θ = C(θ) − C(θ0),    ∂²K(θ, θ0)/∂θ² = C′(θ) > 0.
Therefore, it is convex in θ with minimum at θ0 , and it can cross the level z only once
from the left of θ0 and once from the right. This yields that for any z > 0 , there are
two positive values t−(z) and t+(z) such that (6.7) holds. Note that one or even both
of these values can be infinite.
Due to the result of this lemma, the LR test can be rewritten as

    φ = 1 − 1(−t−(z) ≤ θ̂ − θ0 ≤ t+(z)) = 1(θ̂ > θ0 + t+(z)) + 1(θ̂ < θ0 − t−(z)),

that is, the test rejects the null if the estimate θ̂ deviates significantly from θ0.
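For a concrete instance, take the Poisson family with C(θ) = log θ and B(θ) = θ, so that (6.6) becomes T = n[θ̂ log(θ̂/θ0) − (θ̂ − θ0)]. The sketch below (Python/numpy; the function names are ours) is a hedged illustration of this two-sided test; in practice the critical value z would be fixed by the level condition, e.g. by Monte Carlo under IPθ0.

```python
import numpy as np

def kl_poisson(theta, theta0):
    """K(theta, theta0) = theta*(C(theta) - C(theta0)) - (B(theta) - B(theta0))
    for the Poisson family: C(t) = log t, B(t) = t."""
    return theta * np.log(theta / theta0) - (theta - theta0)

def lr_test_two_sided(y, theta0, z):
    """T = n * K(theta_hat, theta0) with theta_hat = S/n; reject if T > z.
    Assumes theta_hat > 0 so that the logarithm is defined."""
    n = len(y)
    theta_hat = np.mean(y)
    T = n * kl_poisson(theta_hat, theta0)
    return T, T > z
```

In line with Lemma 6.4.1, rejecting for large K(θ̂, θ0) is the same as rejecting when θ̂ leaves an interval [θ0 − t−(z), θ0 + t+(z)], since the divergence is convex in its first argument with minimum at θ0.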
6.4.2 One-sided alternative
Now we consider the problem of testing the same null H0 : θ∗ = θ0 against the one-sided
alternative H1 : θ∗ > θ0 . Of course, the other one-sided alternative H1 : θ∗ < θ0 can be
considered as well.
The LR test requires computing the maximum of the log-likelihood over the alterna-
tive set {θ : θ > θ0}. This can be done as in the Gaussian shift model. If θ̂ > θ0, then this maximum coincides with the global maximum over all θ. Otherwise, it is attained at θ = θ0.
Lemma 6.4.2. Let (Pθ) be an EFn. Then

    sup_{θ>θ0} L(θ, θ0) = nK(θ̂, θ0) if θ̂ ≥ θ0, and = 0 otherwise.
Proof. It is only necessary to consider the case θ̂ < θ0. The difference L(θ̂) − L(θ) can be represented as nK(θ̂, θ). Next, one more use of the identity B′(θ) = θC′(θ) yields

    ∂K(θ̂, θ)/∂θ = (θ − θ̂)C′(θ) > 0

for any θ > θ0 > θ̂. This means that L(θ) − L(θ̂) decreases as θ grows beyond θ0, so L(θ, θ0) attains its maximum over the alternative at the edge θ = θ0.
This fact implies the following representation of the LR test in the case of a one-sided
alternative.
Theorem 6.4.3. Let (Pθ) be an EFn. Then the α -level LR test for the null H0 : θ∗ =
θ0 against the one-sided alternative H1 : θ∗ > θ0 is
    φ = 1(θ̂ > θ0 + tα),    (6.8)

where tα is selected to ensure IPθ0(θ̂ > θ0 + tα) = α.
Proof. Let T be the LR test statistic. Due to Lemma 6.4.2, the inequality T > z can be rewritten as θ̂ > θ0 + t(z) for some t(z). It remains to select a proper value t(z) to ensure the level condition.
This result can be extended naturally to the case of a composite null hypothesis
H0 : θ∗ ≤ θ0 .
Theorem 6.4.4. Let (Pθ) be an EFn. Then the α -level LR test for the composite null
H0 : θ∗ ≤ θ0 against the one-sided alternative H1 : θ∗ > θ0 is
    φ∗α = 1(θ̂ > θ0 + tα),    (6.9)

where tα is selected to ensure IPθ0(θ̂ > θ0 + tα) = α.
Proof. The same arguments as in the proof of Theorem 6.4.3 lead to exactly the same LR test statistic T and thus to the test of the form (6.8). In particular, the estimate θ̂ should significantly deviate from the null set. It remains to check that the level condition for the edge point θ0 ensures the level for all θ < θ0. This follows from the next monotonicity property.
Lemma 6.4.5. Let (Pθ) be an EFn. Then for any t ≥ 0

    IPθ(θ̂ > θ0 + t) ≤ IPθ0(θ̂ > θ0 + t),    ∀θ < θ0.
Proof. Let θ < θ0. We apply

    IPθ(θ̂ > θ0 + t) = IEθ0 exp{L(θ, θ0)} 1(θ̂ > θ0 + t).

Now, since C(θ) < C(θ0) for θ < θ0, the log-likelihood ratio L(θ, θ0) = n{θ̂[C(θ) − C(θ0)] − [B(θ) − B(θ0)]} is decreasing in θ̂, and at θ̂ = θ0 it equals −nK(θ0, θ) < 0. Hence L(θ, θ0) < 0 on the set {θ̂ > θ0 > θ}. This yields the result.
Therefore, if the level is controlled under IPθ0, it is automatically controlled for all other points in the null set.
A very nice feature of the LR test is that it can be universally represented in terms of θ̂ independently of the form of the alternative set. In particular, for the case of a one-sided alternative, this test just compares the estimate θ̂ with the value θ0 + tα. Moreover, the value tα only depends on the distribution of θ̂ under IPθ0 via the level condition. This and the monotonicity of the error probability from Lemma 6.4.5 allow us to state the nice optimality property of this test: φ∗α is uniformly most powerful in the sense of Definition 6.1.1, that is, it maximizes the test power under the level constraint.
Theorem 6.4.6. Let (Pθ) be an EFn, and let φ∗α be the test from (6.9) for testing H0 : θ∗ ≤ θ0 against H1 : θ∗ > θ0. For any (randomized) test φ satisfying IEθ0 φ ≤ α and any θ > θ0, it holds

    IEθ φ ≤ IPθ(φ∗α = 1).
In fact, this theorem repeats the Neyman-Pearson result of Theorem 6.2.2 because
the test φ∗α is at the same time the LR α -level test of the simple hypothesis θ∗ = θ0
against θ∗ = θ .
6.4.3 Interval hypothesis
In some applications, the null hypothesis is naturally formulated in the form that the
parameter θ∗ belongs to a given interval [θ0, θ1] . The alternative H1 : θ∗ ∈ Θ \ [θ0, θ1]
is the complement of this interval. The likelihood ratio test is based on the test statistic
T from (6.3) which compares the maximum of the log-likelihood L(θ) under the null
[θ0, θ1] with the maximum over the alternative set. The special structure of the log-
likelihood in the case of an EFn permits representing this test statistic in terms of the estimate θ̂: the hypothesis is rejected if the estimate θ̂ significantly deviates from the interval [θ0, θ1].
Theorem 6.4.7. Let (Pθ) be an EFn. Then the α-level LR test for the null H0 : θ∗ ∈ [θ0, θ1] against the alternative H1 : θ∗ ∉ [θ0, θ1] can be written as

    φ = 1(θ̂ > θ1 + t+α) + 1(θ̂ < θ0 − t−α),    (6.10)

where t+α and t−α are selected to ensure IPθ0(θ̂ < θ0 − t−α) = α/2 and IPθ1(θ̂ > θ1 + t+α) = α/2.
Exercise 6.4.1. Prove the result of Theorem 6.4.7.
Hint: Consider three cases: θ̂ ∈ [θ0, θ1], θ̂ > θ1, and θ̂ < θ0. For every case, apply the monotonicity argument of Lemma 6.4.2.
One can consider the alternative of the interval hypothesis as a combination of two one-sided alternatives. The LR test φ from (6.10) involves only one critical value z, and the parameters t−α and t+α are related via the structure of this test: they are obtained by transforming the inequality T > z into θ̂ > θ1 + t+α and θ̂ < θ0 − t−α. However, one can also just apply two one-sided tests independently: one for the alternative H−1 : θ∗ < θ0 and one for H+1 : θ∗ > θ1. This leads to two separate tests:

    φ− def= 1(θ̂ < θ0 − t−),    φ+ def= 1(θ̂ > θ1 + t+).
The values t−, t+ can be chosen by the so-called Bonferroni rule: just perform each of
the two tests at level α/2 .
Exercise 6.4.2. Let the values t−, t+ be selected to ensure

    IPθ0(θ̂ < θ0 − t−) = α/2,    IPθ1(θ̂ > θ1 + t+) = α/2.

Then for any θ ∈ [θ0, θ1], the combined test φ = φ− + φ+ fulfills

    IPθ(φ = 1) ≤ α.
Hint: use the monotonicity from Lemma 6.4.5.
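The Bonferroni construction can be sketched for the Gaussian shift model with known variance, where θ̂ ∼ N(θ, σ²/n) and both thresholds may be taken as t± = σ z_{α/2}/√n (Python; the function name and the model choice are ours, as an illustration of the rule above).

```python
import numpy as np
from statistics import NormalDist

def bonferroni_interval_test(y, theta0, theta1, sigma, alpha=0.05):
    """Test H0: theta in [theta0, theta1] by combining two one-sided
    tests, each at level alpha/2 (Bonferroni rule), in the Gaussian
    shift model with known sigma, where theta_hat ~ N(theta, sigma^2/n)."""
    n = len(y)
    theta_hat = float(np.mean(y))
    t = sigma * NormalDist().inv_cdf(1 - alpha / 2) / np.sqrt(n)
    return theta_hat < theta0 - t or theta_hat > theta1 + t
```

By the monotonicity of Lemma 6.4.5, the two edge points θ0 and θ1 are the worst cases, so the combined rejection probability stays below α everywhere on [θ0, θ1].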
Chapter 7
Testing in linear models
This chapter discusses the testing problem for linear Gaussian models given by the equa-
tion
Y = f + ε (7.1)
with the vector of observations Y , response vector f , and vector of errors ε in IRn .
The linear parametric assumption (linear PA) means that
Y = Ψ>θ + ε (7.2)
where Ψ is the p×n design matrix. By θ we denote the p -dimensional target parameter
vector, θ ∈ Θ ⊆ IRp . Usually we assume that the parameter set coincides with the
whole space IRp, i.e. Θ = IRp. The most general assumption about the vector of errors ε = (ε1, ..., εn)> is Var(ε) = Σ, which allows for inhomogeneous and correlated errors. However, for most results we assume i.i.d. errors εi ∼ N(0, σ²). The variance σ² could be unknown as well. As in previous chapters, θ∗ denotes the true value of the parameter vector (assuming that the model (7.2) is correct).
7.1 Likelihood ratio test for a simple null
This section discusses the problem of testing a simple hypothesis H0 : θ∗ = θ0 for a
given vector θ0 . A natural “non-informative” alternative is H1 : θ∗ 6= θ0 .
7.1.1 General errors
We start from the case of general errors with known covariance matrix Σ . The results
obtained for the estimation problem in Chapter 4 will be heavily used in our study. In
particular, the MLE θ̂ of θ∗ is

    θ̂ = (ΨΣ⁻¹Ψ>)⁻¹ ΨΣ⁻¹Y

and the corresponding maximized log-likelihood ratio is

    L(θ̂, θ0) = (θ̂ − θ0)> B (θ̂ − θ0)/2

with the p×p matrix B given by

    B = ΨΣ⁻¹Ψ>.
This immediately leads to the following representation for the likelihood ratio (LR) test statistic in this set-up:

    T def= sup_θ L(θ, θ0) = (θ̂ − θ0)> B (θ̂ − θ0)/2.    (7.3)

Moreover, Wilks’ phenomenon claims that under IPθ0, the test statistic T has a fixed distribution: namely, 2T is χ²p-distributed (chi-squared with p degrees of freedom).
Theorem 7.1.1. Consider the model (7.2) with ε ∼ N(0, Σ) for a known matrix Σ. Then the LR test statistic T is given by (7.3). Moreover, if zα fulfills IP(ζp > 2zα) = α with ζp ∼ χ²p, then the LR test φ with

    φ def= 1(T > zα)    (7.4)

is of exact level α:

    IPθ0(φ = 1) = α.
This result follows directly from Theorem 4.6.1. We see again the important pivotal
property of the test: the critical value zα only depends on the dimension of the parameter
space Θ . It does not depend on the design matrix Ψ , error covariance Σ , and the null
value θ0 .
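The pivotality of 2T is easy to check by simulation. The sketch below (Python/numpy; the design, covariance, and sample sizes are arbitrary choices of ours) draws data under IPθ0 for a random design and an inhomogeneous known Σ, computes T from (7.3), and checks that 2T behaves like χ²p.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 40, 3
Psi = rng.standard_normal((p, n))             # p x n design matrix, as in (7.2)
Sigma = np.diag(rng.uniform(0.5, 2.0, n))     # known (inhomogeneous) error covariance
Sigma_inv = np.linalg.inv(Sigma)
B = Psi @ Sigma_inv @ Psi.T                   # B = Psi Sigma^{-1} Psi^T
theta0 = np.zeros(p)

def lr_stat(Y):
    """T = (theta_hat - theta0)' B (theta_hat - theta0) / 2, cf. (7.3)."""
    theta_hat = np.linalg.solve(B, Psi @ Sigma_inv @ Y)
    d = theta_hat - theta0
    return 0.5 * d @ B @ d

L = np.linalg.cholesky(Sigma)                 # to draw errors with Var(eps) = Sigma
samples = [2 * lr_stat(Psi.T @ theta0 + L @ rng.standard_normal(n))
           for _ in range(2000)]
print(np.mean(samples))                       # close to p = 3, the chi-squared mean
```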
7.1.2 I.i.d. errors, known variance
We now specify this result for the case of i.i.d. errors. We also focus on the residuals

    ε̂ def= Y − Ψ>θ̂ = Y − f̂,

where f̂ = Ψ>θ̂ is the estimate of the true response f = Ψ>θ∗. We start with some geometric properties of the residuals ε̂ and the test statistic T from (7.3).
Theorem 7.1.2. Consider the model (7.1). Let T be the LR test statistic built under the assumptions f = Ψ>θ∗ and Var(ε) = σ²In with a known value σ². Then T is given by

    T = ‖Ψ>(θ̂ − θ0)‖²/(2σ²) = ‖f̂ − f0‖²/(2σ²).    (7.5)

Moreover, the following decompositions for the vector of observations Y and for the errors ε = Y − f0 hold:

    Y − f0 = (f̂ − f0) + ε̂,    (7.6)
    ‖Y − f0‖² = ‖f̂ − f0‖² + ‖ε̂‖²,    (7.7)

where f̂ − f0 is the estimation error and ε̂ = Y − f̂ is the vector of residuals.
Proof. The key step of the proof is the representation of the estimated response f̂ under the model assumption Y = f + ε as a projection of the data on the p-dimensional linear subspace L in IRn spanned by the rows of the matrix Ψ:

    f̂ = ΠY = Π(f + ε),

where Π = Ψ>(ΨΨ>)⁻¹Ψ is the projector onto L; see Section 4.3. Note that this decomposition is valid for the general linear model; the parametric form of the response f and the noise normality are not required. The identity Ψ>(θ̂ − θ0) = f̂ − f0 follows directly from the definition, implying the representation (7.5) for the test statistic T. The identity (7.6) follows from the definition. Next, Πf0 = f0 and thus f̂ − f0 = Π(Y − f0). Similarly,

    ε̂ = Y − f̂ = (In − Π)Y.

As Π and In − Π are orthogonal projectors, it follows that

    ‖Y − f0‖² = ‖(In − Π)Y + Π(Y − f0)‖² = ‖(In − Π)Y‖² + ‖Π(Y − f0)‖²,

and the decomposition (7.7) follows.
The decomposition (7.6), although straightforward, is very important for understanding the structure of the residuals under the null and under the alternative. Under the null H0, the response f is assumed to be known and coincides with f0, so the residuals Y − f0 coincide with the errors ε. The corresponding sum of squared residuals is usually abbreviated as RSS:

    RSS0 def= ‖Y − f0‖².
Under the alternative, the response is unknown and is estimated by f̂. The residuals are ε̂ = Y − f̂, resulting in the RSS

    RSS def= ‖Y − f̂‖².

The decomposition (7.7) can be rewritten as

    RSS0 = RSS + ‖f̂ − f0‖².    (7.8)

We see that the RSS under the null and the alternative can be essentially different only if the estimate f̂ significantly deviates from the null assumption f = f0. The test statistic T from (7.3) can be written as

    T = (RSS0 − RSS)/(2σ²).
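The decomposition (7.8) is an exact algebraic identity and can be checked numerically; the sketch below (Python/numpy; data and sizes are ours) builds the projector Π and verifies RSS0 = RSS + ‖f̂ − f0‖² on simulated data.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 4
Psi = rng.standard_normal((p, n))
theta0 = rng.standard_normal(p)
Y = Psi.T @ theta0 + rng.standard_normal(n)     # data generated under the null

Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)  # projector onto the row span of Psi
f0 = Psi.T @ theta0
f_hat = Pi @ Y                                   # estimated response

RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - f_hat) ** 2)
gap = abs(RSS0 - RSS - np.sum((f_hat - f0) ** 2))
print(gap)                                       # zero up to rounding
```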
Now we show that if the model assumptions are correct, the test T has the exact
level α and is pivotal.
Theorem 7.1.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for a known value σ², i.e. the εi are i.i.d. normal. The LR test φ from (7.4) is of exact level α. Moreover, f̂ − f0 and ε̂ are under IPθ0 zero-mean independent Gaussian vectors satisfying

    2T = σ⁻²‖f̂ − f0‖² ∼ χ²p,    σ⁻²‖ε̂‖² ∼ χ²_{n−p}.    (7.9)
Proof. The null assumption f = f0 together with Πf0 = f0 implies now the following decomposition:

    f̂ − f0 = Πε,    ε̂ = ε − Πε = (In − Π)ε.

Next, Π and In − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors Πε and (In − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.9) for the distribution of Πε was proved in Theorem 4.6.1. For ε̂ = (In − Π)ε, the proof is similar.
Next we discuss the power of the LR test φ, defined as the probability of detecting the alternative when the response f deviates from the null f0. In the next result we do not assume that the true response f follows the linear PA f = Ψ>θ, and we show that the test power depends on f only through the value ‖Π(f − f0)‖².
Theorem 7.1.4. Consider the model (7.1) with Var(ε) = σ²In for a known value σ². Define

    ∆ = σ⁻²‖Π(f − f0)‖².

Then the power of the LR test φ only depends on ∆, i.e. it is the same for all f with equal ∆-value. It holds

    IP(φ = 0) = IP(|ξ1 + √∆|² + ξ2² + ... + ξp² ≤ 2zα)

with ξ = (ξ1, ..., ξp)> ∼ N(0, IIp).
Proof. It follows from f̂ = ΠY = Π(f + ε) and f0 = Πf0 for the test statistic T = (2σ²)⁻¹‖f̂ − f0‖² that

    T = (2σ²)⁻¹‖Π(f − f0) + Πε‖².

Now we show that the distribution of T depends on the response f only via the value ∆. For this we compute the Laplace transform of T.

Lemma 7.1.5. It holds for µ < 1

    g(µ) def= log IE exp{µT} = µ∆/(2(1 − µ)) − (p/2) log(1 − µ).
Proof. For a standard Gaussian random variable ξ and any a, it holds

    IE exp{µ|ξ + a|²/2}
      = e^{µa²/2} (2π)^{−1/2} ∫ exp{µax + µx²/2 − x²/2} dx
      = exp{µa²/2 + µ²a²/(2(1 − µ))} (2π)^{−1/2} ∫ exp{−((1 − µ)/2)(x − aµ/(1 − µ))²} dx
      = exp{µa²/(2(1 − µ))} (1 − µ)^{−1/2}.
The projector Π can be represented as Π = U>ΛpU for an orthogonal transform U and the diagonal matrix Λp = diag(1, ..., 1, 0, ..., 0) with exactly p unit eigenvalues. This permits representing T in the form

    T = Σ_{j=1}^p (ξj + aj)²/2

with i.i.d. standard normal r.v.'s ξj and numbers aj satisfying Σj aj² = ∆. The independence of the ξj's implies

    g(µ) = Σ_{j=1}^p [µaj²/2 + µ²aj²/(2(1 − µ)) − (1/2) log(1 − µ)] = µ∆/(2(1 − µ)) − (p/2) log(1 − µ),

as required.
The result of Lemma 7.1.5 claims that the Laplace transform of T depends on f
only via ∆ and so this also holds for the distribution of T .
The distribution of the squared norm ‖ξ + h‖² for ξ ∼ N(0, IIp) and any fixed vector h ∈ IRp with ‖h‖² = ∆ is called non-central chi-squared with the non-centrality parameter ∆. In particular, for each α, α1, one can define the minimal value ∆ providing the prescribed error α1 of the second kind under the given level α by the conditions

    IP(‖ξ + h‖² ≥ 2zα) ≥ 1 − α1    subject to    IP(‖ξ‖² ≥ 2zα) ≤ α    (7.10)

for all h with ‖h‖² ≥ ∆. The results from Section 4.6 indicate that the value zα can be bounded from above by p + √(2p log α⁻¹) for moderate values of α⁻¹. For evaluating the value ∆, the following decomposition is useful:

    ‖ξ + h‖² − ‖h‖² − p = ‖ξ‖² − p + 2h>ξ.

The right-hand side of this equality is a sum of centered Gaussian quadratic and linear forms. In particular, the cross term 2h>ξ is a centered Gaussian r.v. with variance 4‖h‖², while Var(‖ξ‖²) = 2p. These arguments suggest taking ∆ of order p to ensure the prescribed error α1 of the second kind.
Theorem 7.1.6. For each α, α1 ∈ (0, 1), there are absolute constants C and C1 such that (7.10) is fulfilled for ‖h‖² ≥ ∆ with

    ∆^{1/2} = (Cp log α⁻¹)^{1/2} + (C1 p log α1⁻¹)^{1/2}.
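The key fact behind Theorem 7.1.4, namely that the law of ‖ξ + h‖² depends on h only through ∆ = ‖h‖², can be illustrated by simulation (Python/numpy; a hedged sketch with arbitrary choices of p and ∆ made by us): two different shift directions with the same norm give the same mean p + ∆.

```python
import numpy as np

def noncentral_mc(h, reps=20000, seed=0):
    """Monte Carlo draws of ||xi + h||^2 with xi ~ N(0, I_p):
    non-central chi-squared with non-centrality Delta = ||h||^2."""
    rng = np.random.default_rng(seed)
    xi = rng.standard_normal((reps, len(h)))
    return np.sum((xi + h) ** 2, axis=1)

# two shifts with the same Delta = 4 but different directions
s1 = noncentral_mc(np.array([2.0, 0.0, 0.0]))
s2 = noncentral_mc(np.array([0.0, 2.0, 0.0]), seed=7)
print(np.mean(s1), np.mean(s2))   # both close to p + Delta = 7
```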
7.1.3 Smooth Wald test
The result of Theorem 7.1.6 reveals a problem with the power of the LR test when the dimensionality of the parameter space grows. Indeed, the test remains insensitive for all alternatives in the zone σ⁻²‖Π(f − f0)‖² ≤ C·p, and this zone becomes larger and larger with p. (to be done)
7.1.4 I.i.d. errors with unknown variance
This section briefly discusses the case when the errors εi are i.i.d. but the variance σ² = Var(εi) is unknown. A natural idea in this case is to estimate the variance from the data. The decomposition (7.8) and the independence of RSS = ‖Y − f̂‖² and ‖f̂ − f0‖² are particularly helpful. Theorem 7.1.3 suggests estimating σ² from the RSS by

    σ̂² = RSS/(n − p) = ‖Y − f̂‖²/(n − p).
Indeed, due to the result (7.9), σ⁻² RSS ∼ χ²_{n−p}, yielding

    IE σ̂² = σ²,    Var σ̂² = 2σ⁴/(n − p),    (7.11)

and therefore σ̂² is an unbiased, root-n consistent estimate of σ².

Exercise 7.1.1. Check (7.11). Show that σ̂² − σ² → 0 in probability.
Now we consider the LR test statistic (7.5) in which the true variance is replaced by its estimate σ̂²:

    T̃ def= ‖f̂ − f0‖²/(2σ̂²) = (n − p)‖f̂ − f0‖²/(2‖Y − f̂‖²) = (RSS0 − RSS)/(2 RSS/(n − p)).
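The statistic based on the estimated variance relates to the classical F-statistic through 2T̃/p = [(RSS0 − RSS)/p] / [RSS/(n − p)]. The sketch below (Python/numpy; data and sizes are ours) verifies this identity on simulated data.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 50, 3
Psi = rng.standard_normal((p, n))
theta0 = np.ones(p)
Y = Psi.T @ theta0 + 0.7 * rng.standard_normal(n)   # sigma^2 unknown to the test

Pi = Psi.T @ np.linalg.solve(Psi @ Psi.T, Psi)
f0, f_hat = Psi.T @ theta0, Pi @ Y
RSS0 = np.sum((Y - f0) ** 2)
RSS = np.sum((Y - f_hat) ** 2)

T_tilde = (n - p) * (RSS0 - RSS) / (2 * RSS)
F = ((RSS0 - RSS) / p) / (RSS / (n - p))            # classical F statistic
print(abs(2 * T_tilde / p - F))                     # zero up to rounding
```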
The result of Theorem 7.1.3 implies the pivotal property for this test statistic as well.

Theorem 7.1.7. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then the distribution of the test statistic T̃ under IPθ0 only depends on p and n − p:

    2T̃/p ∼ F_{p,n−p},

where F_{p,n−p} denotes the Fisher distribution with parameters p, n − p:

    F_{p,n−p} = L( (‖ξp‖²/p) / (‖ξ_{n−p}‖²/(n − p)) ),

where ξp and ξ_{n−p} are two independent standard Gaussian vectors of dimension p and n − p. In particular, this distribution does not depend on the design matrix Ψ, the noise variance σ², and the null value θ0.
This result suggests fixing the critical value z for the test statistic T̃ using the quantiles of the Fisher distribution: if tα is such that F_{p,n−p}(tα) = 1 − α, then zα = p tα/2.

Theorem 7.1.8. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². If F_{p,n−p}(tα) = 1 − α and zα = p tα/2, then the test φ = 1(T̃ ≥ zα) is a level-α test:

    IPθ0(φ = 1) = IPθ0(T̃ ≥ zα) = α.
Exercise 7.1.2. Prove the result of Theorem 7.1.8.
If the sample size n is sufficiently large, then σ̂² is very close to σ², and one can apply an approximate choice of the critical value zα from the case of known σ²:

    φ̃ = 1(T̃ ≥ zα).

This test is not of exact level α, but it is of asymptotic level α. Its power function is also close to the power function of the test φ corresponding to the known variance σ².
Theorem 7.1.9. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then

    lim_{n→∞} IPθ0(φ̃ = 1) = α.    (7.12)

Moreover,

    lim_{n→∞} sup_f |IPθ0(φ̃ = 1) − IPθ0(φ = 1)| = 0.    (7.13)
Exercise 7.1.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for σ² unknown.

• Prove (7.12).
• Prove (7.13).

Hint:

• The consistency of σ̂² permits restricting to the case |σ̂²/σ² − 1| ≤ δn for some δn → 0.
• The independence of ‖f̂ − f0‖² and σ̂² permits considering the distribution of 2T̃ = ‖f̂ − f0‖²/σ̂² as if σ̂² were a fixed number close to σ².
• Use that for ζp ∼ χ²p

    IP(ζp ≥ zα(1 + δn)) − IP(ζp ≥ zα) → 0, n → ∞.
7.2 Likelihood ratio test for a linear hypothesis
The previous section dealt with the case of a simple null hypothesis. This section consid-
ers a more general situation when the null hypothesis concerns a subvector of the parameter vector. This means that the whole model is given by (7.2), but the vector θ is decomposed into two parts: θ = (γ, η), where γ is of dimension p0 < p. The null hypothesis assumes that η = η0, while γ remains unspecified. Usually η0 = 0, but the particular value of η0 is not important. To simplify the presentation, we assume η0 = 0, leading to the subset Θ0 of Θ

    Θ0 = {θ = (γ, 0)}.

Under the null hypothesis, the model is still linear:

    Y = Ψ>γ γ + ε,

where Ψγ denotes the submatrix of Ψ composed of the rows of Ψ corresponding to the γ-components of θ.
Fix any point θ0 ∈ Θ0, e.g. θ0 = 0, and define the corresponding response f0 = Ψ>θ0. The LR test statistic T can be written in the form

    T = max_{θ∈Θ} L(θ, θ0) − max_{θ∈Θ0} L(θ, θ0).    (7.14)

The results of both maximization problems are known:

    max_{θ∈Θ} L(θ, θ0) = ‖f̂ − f0‖²/(2σ²),    max_{θ∈Θ0} L(θ, θ0) = ‖f̂0 − f0‖²/(2σ²),

where f̂ and f̂0 are the estimates of the response under the alternative and under the null, respectively. As in Theorem 7.1.2, we can establish the following geometric decomposition.
Theorem 7.2.1. Consider the model (7.1). Let T be the LR test statistic from (7.14) built under the assumptions f = Ψ>θ∗ and Var(ε) = σ²In with a known value σ². Then T is given by

    T = ‖f̂ − f̂0‖²/(2σ²).

Moreover, the following decompositions for the vector of observations Y and for the residuals ε̂0 = Y − f̂0 from the null hold:

    Y − f̂0 = (f̂ − f̂0) + ε̂,
    ‖Y − f̂0‖² = ‖f̂ − f̂0‖² + ‖ε̂‖²,    (7.15)

where f̂ − f̂0 is the difference between the estimated response under the null and under the alternative, and ε̂ = Y − f̂ is the vector of residuals from the alternative.
Proof. The proof is similar to the proof of Theorem 7.1.2. We use that f̂ = ΠY, where Π = Ψ>(ΨΨ>)⁻¹Ψ is the projector on the space L spanned by the rows of Ψ. Similarly, f̂0 = Π0Y, where Π0 = Ψ>γ(ΨγΨ>γ)⁻¹Ψγ is the projector on the subspace L0 spanned by the rows of Ψγ. This yields the decompositions

    f̂ − f̂0 = (Π − Π0)Y,    ε̂ = Y − f̂ = (In − Π)Y.

As Π − Π0 and In − Π are orthogonal projectors, it follows that

    ‖Y − f̂0‖² = ‖(In − Π)Y + (Π − Π0)Y‖² = ‖(In − Π)Y‖² + ‖(Π − Π0)Y‖²,

and the decomposition (7.15) follows.
The decomposition (7.15) can again be represented as RSS0 = RSS + 2σ²T, where RSS = ‖Y − f̂‖² is the sum of squared residuals under the alternative, while now RSS0 = ‖Y − f̂0‖² is computed from the residuals of the null model.
Now we show that if the model assumptions are correct, the test T has the exact
level α and is pivotal.
Theorem 7.2.2. Consider the model (7.1) with ε ∼ N(0, σ²In) for a known value σ², i.e. the εi are i.i.d. normal. Then f̂ − f̂0 and ε̂ are under IPθ0 zero-mean independent Gaussian vectors satisfying

    2T = σ⁻²‖f̂ − f̂0‖² ∼ χ²_{p−p0},    σ⁻²‖ε̂‖² ∼ χ²_{n−p}.    (7.16)

Let zα fulfill IP(ζ_{p−p0} ≥ 2zα) = α with ζ_{p−p0} ∼ χ²_{p−p0}. Then the LR test φ = 1(T ≥ zα) is of exact level α.
Proof. The null assumption θ∗ ∈ Θ0 implies f ∈ L0. This, together with Π0f = f, implies now the following decomposition:

    f̂ − f̂0 = (Π − Π0)ε,    ε̂ = ε − Πε = (In − Π)ε.

Next, Π − Π0 and In − Π are orthogonal projectors, implying orthogonal and thus uncorrelated vectors (Π − Π0)ε and (In − Π)ε. Under normality of ε, these vectors are also normal, and uncorrelatedness implies independence. The property (7.16) is proved similarly to (7.9).
If the variance σ² of the noise is unknown, one can proceed exactly as in the case of a simple null: estimate the variance from the residuals, using their independence of the test statistic T. This leads to the estimate

    σ̂² = RSS/(n − p) = ‖Y − f̂‖²/(n − p)

and to the test statistic

    T̃ = (RSS0 − RSS)/(2 RSS/(n − p)) = (n − p)‖f̂ − f̂0‖²/(2‖Y − f̂‖²).
The property of pivotality is preserved here as well: properly scaled, the test statistic T̃ has a Fisher distribution.

Theorem 7.2.3. Consider the model (7.1) with ε ∼ N(0, σ²In) for an unknown value σ². Then 2T̃/(p − p0) has the Fisher distribution F_{p−p0,n−p} with parameters p − p0 and n − p. If tα is the 1 − α quantile of this distribution, then the test φ = 1(2T̃ > (p − p0)tα) is of exact level α.
If the sample size is sufficiently large, one can proceed as if σ̂² were the true variance, ignoring the error of the variance estimation. This would lead to the critical value zα from Theorem 7.2.2, and the corresponding test is of asymptotic level α.
Exercise 7.2.1. Prove Theorem 7.2.3.
The study of the power of the test T̃ does not differ from the case of a simple hypothesis. One only needs to redefine ∆ as

    ∆ def= σ⁻²‖(Π − Π0)f‖².
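The geometry behind (7.15) and (7.16) rests on the fact that, when the rows of Ψγ span a subspace of the row span of Ψ, the difference Π − Π0 is itself an orthogonal projector of rank p − p0, orthogonal to In − Π. A numerical sketch (Python/numpy; the sizes are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, p0 = 40, 4, 2
Psi = rng.standard_normal((p, n))
Psi_g = Psi[:p0]                       # rows corresponding to the gamma-components

def projector(M):
    """Orthogonal projector onto the span of the rows of M."""
    return M.T @ np.linalg.solve(M @ M.T, M)

Pi, Pi0 = projector(Psi), projector(Psi_g)
D = Pi - Pi0
# D is idempotent of rank p - p0 and orthogonal to the residual projector I - Pi
print(np.max(np.abs(D @ D - D)),
      round(float(np.trace(D))),
      np.max(np.abs(D @ (np.eye(n) - Pi))))
```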
Chapter 8
Some other testing methods
This chapter discusses some classical testing procedures such as the Kolmogorov-Smirnov,
Cramer-Smirnov-von Mises, and chi-squared as particular cases of the substitution ap-
proach.
Let Y = (Y1, . . . , Yn)> be an i.i.d. sample from a distribution P . The joint distribu-
tion IP of Y is the n -fold product of P , so a hypothesis about IP can be formulated
as a hypothesis about the marginal measure P . A simple hypothesis H0 means the
assumption that P = P0 for a given measure P0 . The empirical measure Pn is a natu-
ral empirical counterpart of P leading to the idea of testing the hypothesis by checking
whether Pn significantly deviates from P0. As in the estimation problem, this substitution idea can be realized in several different ways. We briefly discuss below the method of moments and the minimum distance method.
8.1 Method of moments for an i.i.d. sample
Let g(·) be any d -vector function on IR1 . The assumption P = P0 leads to the
population moment
m0 = IE0g(Y1).
The empirical counterpart of this quantity is given by
Mn = IEng(Y ) =1
n
∑i
g(Yi).
The method of moments (MM) suggests considering the difference Mn − m0 for building a reasonable test. The properties of Mn were stated in Section 2.4. In particular, under the null P = P0, the first two moments of the vector Mn − m0 can be easily computed:
IE0(Mn − m0) = 0 and

    Var0(Mn) = IE0[(Mn − m0)(Mn − m0)>] = n⁻¹V,
    V def= IE0[(g(Y1) − m0)(g(Y1) − m0)>].
For simplicity of presentation, we assume that the moment function g is selected to ensure a non-degenerate matrix V. Standardization by the covariance matrix leads to the vector

    ξn = n^{1/2} V^{−1/2}(Mn − m0),

which under the null measure has zero mean and unit covariance matrix. Moreover, it is asymptotically standard normal, i.e. its distribution is approximately standard normal if the sample size n is sufficiently large; see Theorem 2.4.9. The MM test rejects the null hypothesis if the vector ξn computed from the available data Y is very unlikely for a standard normal vector, that is, if it deviates significantly from zero. We specify the procedure separately for the univariate and multivariate cases.
Univariate case. Let g(·) be a univariate function with IE0 g(Y1) = m0 and IE0[g(Y1) − m0]² = σ². Define the linear test statistic

    Tn = (nσ²)^{−1/2} Σi [g(Yi) − m0] = n^{1/2} σ^{−1}(Mn − m0),

leading to the test

    φ = 1(|Tn| > z_{α/2}),    (8.1)

where z_{α/2} denotes the corresponding upper quantile of the standard normal law.
Theorem 8.1.1. Let Y be an i.i.d. sample from P. Then the test statistic Tn is asymptotically standard normal, and the test φ from (8.1) of H0 : P = P0 is of asymptotic level α, that is,

    IP0(φ = 1) → α,    n → ∞.
Similarly, one can consider a one-sided alternative H+1 : m > m0 or H−1 : m < m0 about the moment m = IE g(Y1) of the distribution P, using the tests

    φ+ = 1(Tn > zα),    φ− = 1(Tn < −zα).

As in Theorem 8.1.1, both tests φ+ and φ− are of asymptotic level α.
Multivariate case. The components of the vector function g(·) ∈ IRd are usually associated with “directions” in which the null hypothesis is tested. The multivariate situation means that we test simultaneously in d > 1 directions. The most natural test statistic is the squared Euclidean norm of the standardized vector ξn:

    Tn def= ‖ξn‖² = n‖V^{−1/2}(Mn − m0)‖².    (8.2)

By Theorem 2.4.9, the vector ξn is asymptotically standard normal, so that Tn is asymptotically chi-squared with d degrees of freedom. This yields the natural definition of the test φ using quantiles of χ²d:

    φ = 1(Tn > zα).    (8.3)
Theorem 8.1.2. Let Y be an i.i.d. sample from P. If zα fulfills IP(χ²d > zα) = α, then the test statistic Tn from (8.2) is asymptotically χ²d-distributed, and the test φ from (8.3) of H0 : P = P0 is of asymptotic level α.
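The multivariate MM test can be sketched as follows (Python/numpy; the function name and the choice of directions are ours). As an illustration, we test H0 : P = N(0, 1) in the two directions g(y) = (y, y² − 1), for which m0 = 0 and V = diag(1, 2) under the null.

```python
import numpy as np

def mm_test_stat(y, g, m0, V):
    """T_n = n * ||V^{-1/2}(M_n - m0)||^2 from (8.2), written via a linear solve."""
    n = len(y)
    Mn = np.mean(g(y), axis=0)
    d = Mn - m0
    return n * d @ np.linalg.solve(V, d)

# directions for testing H0: P = N(0,1); Cov(Y, Y^2 - 1) = E Y^3 = 0 under the null
g = lambda y: np.column_stack([y, y ** 2 - 1])
m0 = np.zeros(2)
V = np.diag([1.0, 2.0])

rng = np.random.default_rng(5)
Tn = mm_test_stat(rng.standard_normal(200), g, m0, V)
```

Under the null, Tn is approximately χ²₂, so H0 would be rejected when Tn exceeds the corresponding χ²₂ quantile.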
8.1.1 Series expansion
A standard method of building the moment tests or, alternatively, of choosing the directions g(·), is based on a series expansion. Let ψ1, ψ2, ... be a given set of basis functions in the related functional space. It is especially useful to select these basis functions to be orthonormal under the measure P0:

    ∫ψj(y)P0(dy) = 0,    ∫ψj(y)ψj′(y)P0(dy) = δj,j′,    ∀j, j′.    (8.4)
Select a fixed index d and take the first d basis functions ψ1, ..., ψd as “directions” or components of g. Then

    mj,0 def= ∫ψj(y)P0(dy) = 0

is the j-th population moment under the null hypothesis H0, and it is tested by checking whether the empirical moments Mj,n with

    Mj,n def= n⁻¹ Σi ψj(Yi)

do not deviate significantly from zero. The condition (8.4) effectively permits testing each direction ψj independently of the others.
For each d one obtains a test statistic Tn,d with

    Tn,d def= n(M²1,n + ... + M²d,n)
leading to the test

    φd = 1(Tn,d > zα,d),    (8.5)

where zα,d is the upper α-quantile of χ²d. In practical applications, the choice of d is particularly relevant and is the subject of various studies.
8.1.2 Chi-squared test
A special but popular case of the previous series approach is the chi-squared test. Let the observation space (which is a subset of IR1) be split into non-overlapping subsets A1, ..., Ad. Define for j = 1, ..., d

    ψj(y) = σj⁻¹[1(y ∈ Aj) − pj]    (8.6)

with

    pj = P0(Aj) = ∫_{Aj} P0(dy),    σ²j = pj(1 − pj).    (8.7)
Then the conditions (8.4) are fulfilled.
Exercise 8.1.1. Check the conditions (8.4) for the functions ψj from (8.6) and (8.7).
The frequencies

    νj,n = n⁻¹ Σi 1(Yi ∈ Aj)

are the empirical counterparts of the probabilities pj. The related test statistic Tn,d with

    Tn,d = Σ_{j=1}^d n(νj,n − pj)²/σ²j = Σ_{j=1}^d n(νj,n − pj)²/(pj(1 − pj))

is called the chi-squared test statistic, leading to the so-called chi-squared test (8.5).
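A minimal sketch of this statistic (Python/numpy; the function name, cells, and data are ours): the cells Aj are half-open intervals, νj,n are the cell frequencies, and the statistic vanishes exactly when the frequencies match the null probabilities.

```python
import numpy as np

def chi_squared_stat(y, cells, p):
    """T_{n,d} = sum_j n (nu_{j,n} - p_j)^2 / (p_j (1 - p_j)), Section 8.1.2.
    cells: list of half-open intervals [a, b) playing the role of the A_j;
    p: null probabilities P0(A_j)."""
    n = len(y)
    nu = np.array([np.mean((a <= y) & (y < b)) for a, b in cells])
    return np.sum(n * (nu - p) ** 2 / (p * (1 - p)))

# four equal cells for a null P0 = uniform on [0, 1)
cells = [(0.0, 0.25), (0.25, 0.5), (0.5, 0.75), (0.75, 1.0)]
p = np.full(4, 0.25)
Tn = chi_squared_stat(np.array([0.1, 0.2, 0.3, 0.9]), cells, p)
```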
8.1.3 Testing a parametric hypothesis
The method of moments can be extended to the situation when the null hypothesis is parametric: H0 : P ∈ (Pθ, θ ∈ Θ0). It is natural to apply the method of moments both to estimate the parameter θ under the null and to test the null. So, we assume that two different moment vector functions g0 and g1 are given. The first one is selected to fulfill

    θ ≡ IEθ g0(Y1),    θ ∈ Θ0.

This permits estimating the parameter θ directly by the empirical moment:

    θ̂ = n⁻¹ Σi g0(Yi).

The second vector of moment functions is composed of directional alternatives. An identifiability condition suggests selecting the directional alternative functions orthogonal to g0: (to be continued)
8.2 Minimum distance method for an i.i.d. sample
The method of moments is especially useful for the case of a simple hypothesis because
it compares the population moments computed under the null with their empirical coun-
terpart. However, if a more complicated composite hypothesis is tested, the population
moments can not be computed directly: the null measure is not specified precisely. In
this case, the minimum distance idea appears to be useful. Let (Pθ,θ ∈ Θ ⊂ IRp)
be a parametric family and Θ0 be a subset of Θ . The null hypothesis about an i.i.d.
sample Y from P is that P ∈ (Pθ,θ ∈ Θ0) . Let ρ(P, P ′) denote some functional
(distance) defined for measures P, P ′ on the real line. We assume that ρ satisfies the
following conditions: ρ(Pθ1 , Pθ2) ≥ 0 and ρ(Pθ1 , Pθ2) = 0 iff θ1 = θ2 . The condition
P ∈ (Pθ,θ ∈ Θ0) can be rewritten in the form
inf_{θ∈Θ_0} ρ(P, P_θ) = 0 .
Now we can apply the substitution principle: use Pn in place of P . Define the value T
by
T def= inf_{θ∈Θ_0} ρ(P_n, P_θ) .  (8.8)
Large values of the test statistic T indicate a possible violation of the null hypothesis.
In particular, if H0 is a simple hypothesis, that is, if the set Θ0 consists of one point
θ0 , the test statistic reads as T = ρ(Pn, Pθ0) . The critical value for this test is usually
selected by the level condition:
IP_{θ_0}( ρ(P_n, P_{θ_0}) > t_α ) ≤ α .
Note that the test statistic (8.8) can be viewed as a combination of two different steps.
First we estimate under the null the parameter θ ∈ Θ0 which provides the best possible
parametric fit under the assumption P ∈ (Pθ,θ ∈ Θ0) :
θ̃_0 = arginf_{θ∈Θ_0} ρ(P_n, P_θ) .
Next we formally apply the minimum distance test for the simple hypothesis θ_0 = θ̃_0 .
Below we discuss some standard choices of the distance ρ .
8.2.1 Kolmogorov-Smirnov test
Let P0, P1 be two distributions on the real line with distribution functions F0, F1 :
Fj(y) = Pj(Y ≤ y) for j = 0, 1 . Define
ρ(P_0, P_1) = ρ(F_0, F_1) = sup_y | F_0(y) − F_1(y) | .  (8.9)
Now consider the related test starting from the case of a simple null hypothesis P = P0
with corresponding c.d.f. F0 . Then the distance ρ from (8.9) (properly scaled) leads to
the Kolmogorov-Smirnov test statistic
T_n def= sup_y n^{1/2} | F_0(y) − F_n(y) | .
A nice feature of this test is the property of asymptotic pivotality.
Theorem 8.2.1 (Kolmogorov). Let F0 be a continuous c.d.f. Then
T_n = sup_y n^{1/2} | F_0(y) − F_n(y) | w−→ η
where η is a fixed random variable (the maximum of the absolute value of a Brownian bridge on [0, 1] ).
Proof. Idea of the proof. The c.d.f. F_0 is monotone and continuous. Therefore, its inverse function F_0^{-1} is well defined. Consider the r.v.'s
U_i def= F_0(Y_i) .
The basic fact about this transformation is that the Ui ’s are i.i.d. uniform on the interval
[0, 1] .
Lemma 8.2.2. The r.v.'s U_i are i.i.d. with values in [0, 1] and for any u ∈ [0, 1] it holds
IP( U_i ≤ u ) = u .
By definition of F_0^{-1} , it holds for any u ∈ [0, 1]
F_0( F_0^{-1}(u) ) ≡ u .
Moreover, if G_n is the empirical c.d.f. of the U_i 's, that is, if
G_n(u) def= n^{-1} ∑_i 1(U_i ≤ u) ,
then
G_n(u) ≡ F_n[ F_0^{-1}(u) ] .  (8.10)
Exercise 8.2.1. Check Lemma 8.2.2 and (8.10).
Now by the change of variable y = F_0^{-1}(u) we obtain
T_n = sup_{u∈[0,1]} n^{1/2} | F_0(F_0^{-1}(u)) − F_n(F_0^{-1}(u)) | = sup_{u∈[0,1]} n^{1/2} | u − G_n(u) | .
It is obvious that the right-hand side of this expression does not depend on the original model. Actually, for fixed n it is a precisely described random variable, so its distribution only depends on n . It only remains to show that this distribution for large n is close to some fixed limit distribution with a continuous c.d.f. allowing for a choice of a proper critical value. We indicate the main steps of the proof.
Given a sample U_1, . . . , U_n , define the random function
ξ_n(u) def= n^{1/2} [ u − G_n(u) ] .
Clearly T_n = sup_{u∈[0,1]} |ξ_n(u)| . Next, the convergence of the random functions ξ_n(·) would imply the convergence of their maximum over u ∈ [0, 1] because the maximum is a continuous functional of a function. Finally, the weak convergence ξ_n(·) w−→ ξ(·) can be checked if for any continuous function h(u) , it holds
⟨ξ_n, h⟩ def= n^{1/2} ∫_0^1 h(u) [ u − G_n(u) ] du w−→ ⟨ξ, h⟩ def= ∫_0^1 h(u) ξ(u) du .
Now the result can be derived from the representation
⟨ξ_n, h⟩ = n^{1/2} ∫_0^1 h(u) [ u − G_n(u) ] du = n^{-1/2} ∑_{i=1}^{n} [ m(h) − H(U_i) ]
with H(t) = ∫_t^1 h(u) du and m(h) = IE H(U_i) = ∫_0^1 u h(u) du , and from the central limit theorem for a sum of i.i.d. random variables.
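The pivotality property can be probed numerically. The sketch below (an illustration, not part of the text) computes T_n = sup_y n^{1/2} |F_0(y) − F_n(y)| via the order statistics of U_i = F_0(Y_i), using the fact that the supremum is attained at a jump of F_n; the two sampled models are arbitrary choices.

```python
import numpy as np

def ks_statistic(y, F0):
    """T_n = sup_y n^{1/2} |F0(y) - F_n(y)|, evaluated at the jumps of F_n."""
    n = len(y)
    u = np.sort(F0(np.asarray(y)))        # U_(1) <= ... <= U_(n), uniform under H0
    i = np.arange(1, n + 1)
    # At U_(i) the empirical c.d.f. jumps from (i-1)/n to i/n.
    d = np.maximum(u - (i - 1) / n, i / n - u).max()
    return np.sqrt(n) * d

# Pivotality: the null distribution of T_n is the same for very different F0.
rng = np.random.default_rng(1)
t_exp = ks_statistic(rng.exponential(size=500), lambda y: 1.0 - np.exp(-y))
t_uni = ks_statistic(rng.uniform(size=500), lambda y: np.clip(y, 0.0, 1.0))
```

Both statistics are draws from the same (Kolmogorov) null law, so their typical magnitudes agree despite the different models.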
The case of a composite hypothesis. If H_0 : P ∈ (P_θ, θ ∈ Θ_0) , then the test statistic is described by (8.8). As we already mentioned, the case of a composite hypothesis can be viewed (to be continued)
8.2.2 ω² test (Cramer-Smirnov-von Mises)
Here we briefly discuss another distance also based on the c.d.f. of the null measure.
Namely, define for a measure P on the real line with c.d.f. F
ρ(P_n, P) = ρ(F_n, F) = n ∫ [ F_n(y) − F(y) ]² dF(y) .  (8.11)
For the case of a simple hypothesis P = P0 , the Cramer-Smirnov-von Mises (CSvM)
test statistic is given by (8.11) with F = F_0 . This is another functional of the path of the random function n^{1/2} [ F_n(y) − F_0(y) ] . The Kolmogorov test uses the maximum of this function while the CSvM test uses the integral of its square. The property of pivotality is preserved for the CSvM test statistic as well.
Theorem 8.2.3. Let F_0 be a continuous c.d.f. Then
T_n = n ∫ [ F_n(y) − F_0(y) ]² dF_0(y) w−→ η_ω
where η_ω is a fixed random variable (the integral of a Brownian bridge squared over [0, 1] ).
Proof. The idea of the proof is the same as in the case of the Kolmogorov-Smirnov test. First the transformation by F_0^{-1} translates the general case to the case of the uniform distribution on [0, 1] . Next one can again use the functional convergence of the process ξ_n(u) .
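For computation one does not need to evaluate the integral (8.11) directly: after the transformation u_(i) = F_0(Y_(i)), the statistic has the classical closed form T_n = 1/(12n) + ∑_i ( u_(i) − (2i−1)/(2n) )². A sketch (with an arbitrary uniform example) and a brute-force check of the identity:

```python
import numpy as np

def cvm_statistic(y, F0):
    """omega^2 statistic n * int (F_n - F0)^2 dF0 via its order-statistic form."""
    n = len(y)
    u = np.sort(F0(np.asarray(y)))
    i = np.arange(1, n + 1)
    return 1.0 / (12 * n) + float(np.sum((u - (2 * i - 1) / (2 * n)) ** 2))

# Check against a direct Riemann approximation for a Uniform[0,1] null
rng = np.random.default_rng(2)
y = rng.uniform(size=200)
F0 = lambda t: np.clip(t, 0.0, 1.0)
T = cvm_statistic(y, F0)
grid = np.linspace(0.0, 1.0, 200_001)
Fn = np.searchsorted(np.sort(y), grid, side="right") / len(y)
T_direct = len(y) * float(np.mean((Fn - grid) ** 2))
```

The two computations agree up to the discretization error of the grid.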
8.3 Partially Bayes tests and Bayes testing
In the above sections we mostly focused on the likelihood ratio testing approach. As in estimation theory, the LR approach is very general and possesses some nice properties. This section briefly discusses some possible alternatives including the quasi likelihood ratio, partially Bayes, and Bayes approaches.
8.3.1 Quasi LR approach
The structure of the LR test statistic T from (6.3) only uses the geometric properties of the likelihood function L(θ) . The only point where the underlying data distribution is called for is the level condition: this condition must be checked under the null hypothesis about the data. Now we consider the situation when L(θ) is a quasi log-likelihood. See Section 2.10 for examples and details. Then the test statistic is still defined by (6.3), leading to the test 1(T > z) . The first question to be addressed in this situation is what the null hypothesis means. In the classical parametric approach it is the hypothesis about the underlying data distribution which is described by the likelihood function L(θ) . Now we consider the situation when the parametric assumption is possibly misspecified and
the process L(θ) is only a quasi log-likelihood. The answer depends on this type of
misspecification. (to be continued)
8.3.2 Partial Bayes approach and Bayes tests
Let Θ0 and Θ1 be two subsets of the parameter set Θ . We test the null hypothesis
H0 : θ∗ ∈ Θ0 against an alternative H1 : θ∗ ∈ Θ1 . The LR approach compares the
maximum of the likelihood process over Θ_0 with the similar maximum over Θ_1 . Let now two measures π_0 on Θ_0 and π_1 on Θ_1 be given. Instead of the maximum of L(θ) we consider its weighted sum (integral) over Θ_0 (resp. Θ_1 ) with weights π_0(θ) (resp. π_1(θ) ). More precisely, we consider the value
T_{π_0,π_1} = ∫_{Θ_1} L(θ) π_1(θ) λ(dθ) − ∫_{Θ_0} L(θ) π_0(θ) λ(dθ) .
Significantly positive values of this expression indicate that the hypothesis is likely wrong.
8.3.3 Bayes approach
Within the Bayes approach the true data distribution and the true parameter value are
not defined. Instead one considers the prior and posterior distribution of the parameter.
The parametric Bayes model can be represented as
Y | θ ∼ p(y|θ), θ ∼ π(θ).
The posterior density p(θ|Y) can be computed via the Bayes formula:
p(θ|Y) = p(Y|θ) π(θ) / p(Y)
with the marginal density p(Y) = ∫_Θ p(Y|θ) π(θ) λ(dθ) . Instead of checking a hypothesis about the location of the parameter θ , the Bayes approach suggests looking directly at the posterior distribution. Namely, one can construct the so-called credible sets which contain a prespecified fraction, say 1 − α , of the mass of the whole posterior distribution.
Then one can say that the probability for the parameter θ to lie outside of this credible
set is at most α . So, the testing problem in the frequentist approach is replaced by the
problem of confidence estimation for the Bayes method. (to be continued)
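As a minimal sketch of this program (the Beta prior and data below are invented for illustration), consider the conjugate Bernoulli model, where the posterior is available in closed form and a credible interval is read off from posterior quantiles:

```python
import numpy as np

# Conjugate model: theta ~ Beta(a, b) prior, Y_i | theta ~ Bernoulli(theta);
# the posterior is Beta(a + s, b + n - s) with s = number of successes.
def beta_credible_interval(y, a=1.0, b=1.0, alpha=0.05, draws=200_000, seed=0):
    rng = np.random.default_rng(seed)
    s, n = int(np.sum(y)), len(y)
    post = rng.beta(a + s, b + n - s, size=draws)   # Monte Carlo posterior draws
    lo = float(np.quantile(post, alpha / 2))
    hi = float(np.quantile(post, 1 - alpha / 2))
    return lo, hi

y = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])        # s = 7 successes out of n = 10
lo, hi = beta_credible_interval(y)                   # 95% credible interval
```

The interval (lo, hi) carries 95% of the posterior mass, so the posterior probability that θ lies outside it is at most α = 0.05.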
Chapter 9
Deviation probability for quadratic forms
The approximation results of the previous sections rely on probabilities of the form IP( ‖ξ‖ > y ) for a given random vector ξ ∈ IR^p . The only condition imposed on this vector is that
log IE exp( γ⊤ξ ) ≤ ν_0² ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ g .
To simplify the presentation we rewrite this condition as
log IE exp( γ⊤ξ ) ≤ ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ g .  (9.1)
The general case can be reduced to ν_0 = 1 by rescaling ξ and g :
log IE exp( γ⊤ξ / ν_0 ) ≤ ‖γ‖² / 2 ,   γ ∈ IR^p , ‖γ‖ ≤ ν_0 g ,
that is, ν_0^{-1} ξ fulfills (9.1) with a slightly increased g . In typical situations like in Section ??, the value g is large (of order root-n ) while the value ν_0 is close to one.
9.1 Gaussian case
Our benchmark will be a deviation bound for ‖ξ‖2 for a standard Gaussian vector ξ .
The ultimate goal is to show that under (9.1) the norm of the vector ξ exhibits behavior
expected for a Gaussian vector, at least in the region of moderate deviations. For comparison, we begin by stating the result for a Gaussian vector ξ .
Theorem 9.1.1. Let ξ be a standard normal vector in IR^p . Then for any u > 0 , it holds
IP( ‖ξ‖² > p + u ) ≤ exp{ −(p/2) φ(u/p) }
with
φ(t) def= t − log(1 + t) .
Let φ^{-1}(·) stand for the inverse of φ(·) . For any x ,
IP( ‖ξ‖² > p + p φ^{-1}(2x/p) ) ≤ exp(−x) .
This particularly yields with κ = 6.6
IP( ‖ξ‖² > p + √(κxp) ∨ (κx) ) ≤ exp(−x) .
Proof. The proof utilizes the following well-known fact: for µ < 1
log IE exp( µ‖ξ‖²/2 ) = −0.5 p log(1 − µ) .
It can be obtained by straightforward calculus. Now consider any u > 0 . By the exponential Chebyshev inequality
IP( ‖ξ‖² > p + u ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖ξ‖²/2 )  (9.2)
= exp{ −µ(p + u)/2 − (p/2) log(1 − µ) } .
It is easy to see that the value µ = u/(u + p) maximizes µ(p + u) + p log(1 − µ) w.r.t. µ yielding
µ(p + u) + p log(1 − µ) = u − p log(1 + u/p) .
Further we use that x − log(1 + x) ≥ a_0 x² for x ≤ 1 and x − log(1 + x) ≥ a_0 x for x > 1 with a_0 = 1 − log(2) ≥ 0.3 . This implies with x = u/p for u = √(κxp) or u = κx and κ = 2/a_0 < 6.6 that
IP( ‖ξ‖² ≥ p + √(κxp) ∨ (κx) ) ≤ exp(−x)
as required.
The message of this result is that the squared norm of the Gaussian vector ξ concentrates around the value p and the deviations over the level p + √(xp) are exponentially small in x .
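The bound of Theorem 9.1.1 is easy to probe by simulation; the sketch below (dimension and level are arbitrary choices, not from the text) checks that the empirical exceedance frequency stays below e^{−x}:

```python
import numpy as np

# Monte Carlo check of IP(||xi||^2 > p + sqrt(kappa*x*p) v (kappa*x)) <= exp(-x)
# for a standard Gaussian vector xi in IR^p, with kappa = 6.6.
rng = np.random.default_rng(0)
p, x, kappa = 10, 1.0, 6.6
threshold = p + max(np.sqrt(kappa * x * p), kappa * x)
xi = rng.standard_normal((50_000, p))
freq = float(np.mean(np.sum(xi ** 2, axis=1) > threshold))
```

The empirical frequency is in fact well below the bound e^{−1} ≈ 0.37, reflecting the slack introduced by the constant κ.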
A similar bound can be obtained for the norm of the vector IBξ where IB is some given matrix. For notational simplicity we assume that IB is symmetric. Otherwise one should replace it with (IB⊤IB)^{1/2} .
Theorem 9.1.2. Let ξ be standard normal in IR^p . Then for every x > 0 and any symmetric matrix IB , it holds with p = tr(IB²) , v² = 2 tr(IB⁴) , and a* = ‖IB²‖_∞
IP( ‖IBξ‖² > p + (2v x^{1/2}) ∨ (6 a* x) ) ≤ exp(−x) .
Proof. The matrix IB² can be represented as U⊤ diag(a_1, . . . , a_p) U for an orthogonal matrix U . The vector ξ̃ = Uξ is also standard normal and ‖IBξ‖² = ξ̃⊤ U IB² U⊤ ξ̃ . This means that one can reduce the situation to the case of a diagonal matrix IB² = diag(a_1, . . . , a_p) . We can also assume without loss of generality that a_1 ≥ a_2 ≥ . . . ≥ a_p . The expressions for the quantities p and v² simplify to
p = tr(IB²) = a_1 + . . . + a_p ,
v² = 2 tr(IB⁴) = 2 (a_1² + . . . + a_p²) .
Moreover, rescaling the matrix IB2 by a1 reduces the situation to the case with a1 = 1 .
Lemma 9.1.3. It holds
IE ‖IBξ‖² = tr(IB²) ,   Var( ‖IBξ‖² ) = 2 tr(IB⁴) .
Moreover, for µ < 1
IE exp{ µ‖IBξ‖²/2 } = det( II_p − µ IB² )^{-1/2} = ∏_{i=1}^{p} (1 − µ a_i)^{-1/2} .  (9.3)
Proof. If IB² is diagonal, then ‖IBξ‖² = ∑_i a_i ξ_i² and the summands a_i ξ_i² are independent. It remains to note that IE(a_i ξ_i²) = a_i , Var(a_i ξ_i²) = 2 a_i² , and for µ a_i < 1 ,
IE exp{ µ a_i ξ_i² / 2 } = (1 − µ a_i)^{-1/2}
yielding (9.3).
Given u , fix µ < 1 . The exponential Markov inequality yields
IP( ‖IBξ‖² > p + u ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖IBξ‖²/2 )
≤ exp{ −µu/2 − (1/2) ∑_{i=1}^{p} [ µ a_i + log(1 − µ a_i) ] } .
We start with the case when x^{1/2} ≤ v/3 . Then u = 2 x^{1/2} v fulfills u ≤ 2v²/3 . Define µ = u/v² ≤ 2/3 and use that t + log(1 − t) ≥ −t² for t ≤ 2/3 . This implies
IP( ‖IBξ‖² > p + u ) ≤ exp{ −µu/2 + (1/2) ∑_{i=1}^{p} µ² a_i² } = exp( −u²/(4v²) ) = e^{−x} .  (9.4)
Next, let x^{1/2} > v/3 . Set µ = 2/3 . It holds similarly to the above
∑_{i=1}^{p} [ µ a_i + log(1 − µ a_i) ] ≥ − ∑_{i=1}^{p} µ² a_i² = −2v²/9 ≥ −2x .
Now, for u = 6x and µu/2 = 2x , (9.4) implies
IP( ‖IBξ‖² > p + u ) ≤ exp{ −(2x − x) } = exp(−x)
as required.
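Theorem 9.1.2 can be checked by simulation in the same way; the diagonal matrix below is an arbitrary illustrative choice with a* = 1:

```python
import numpy as np

# Monte Carlo check of IP(||B xi||^2 > p + (2 v sqrt(x)) v (6 a* x)) <= exp(-x)
# for a diagonal B^2 = diag(a_1, ..., a_p) with a* = a_1 = 1.
rng = np.random.default_rng(1)
a = np.array([1.0, 0.8, 0.6, 0.4, 0.2])         # eigenvalues of B^2
pB = float(a.sum())                              # p  = tr(B^2)
v = float(np.sqrt(2 * np.sum(a ** 2)))           # v^2 = 2 tr(B^4)
x = 1.0
threshold = pB + max(2 * v * np.sqrt(x), 6 * 1.0 * x)
q = (rng.standard_normal((50_000, a.size)) ** 2) @ a   # ||B xi||^2 = sum a_i xi_i^2
freq = float(np.mean(q > threshold))
```

Again the empirical frequency lies comfortably below the exponential bound e^{−x}.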
Below we establish similar bounds for a non-Gaussian vector ξ obeying (9.1).
9.2 A bound for the ℓ2-norm
This section presents a general exponential bound for the probability IP( ‖ξ‖ > y ) under (9.1). Given g and p , define the value w_0 = g p^{-1/2} and define w_c by the equation
w_c (1 + w_c) / (1 + w_c²)^{1/2} = w_0 = g p^{-1/2} .  (9.5)
It is easy to see that w_0/√2 ≤ w_c ≤ w_0 . Further define
µ_c def= w_c² / (1 + w_c²) ,   y_c def= √( (1 + w_c²) p ) ,   x_c def= 0.5 p [ w_c² − log(1 + w_c²) ] .  (9.6)
Note that for g² ≥ p , the quantities y_c and x_c can be evaluated as y_c² ≥ w_c² p ≥ g²/2 and x_c ≳ p w_c²/2 ≥ g²/4 .
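The quantities (9.5)-(9.6) are easy to compute numerically: the left-hand side of (9.5) is increasing in w, so plain bisection suffices. A sketch with illustrative values of g and p (not from the text):

```python
import numpy as np

def critical_values(g, p):
    """Solve (9.5) for w_c by bisection; return (w_c, mu_c, y_c, x_c) as in (9.6)."""
    w0 = g / np.sqrt(p)
    f = lambda w: w * (1 + w) / np.sqrt(1 + w ** 2) - w0   # monotone in w
    lo, hi = 0.0, w0                                       # root lies in [w0/sqrt(2), w0]
    for _ in range(200):                                   # bisection to machine precision
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if f(mid) < 0 else (lo, mid)
    wc = 0.5 * (lo + hi)
    mu_c = wc ** 2 / (1 + wc ** 2)
    yc = np.sqrt((1 + wc ** 2) * p)
    xc = 0.5 * p * (wc ** 2 - np.log(1 + wc ** 2))
    return wc, mu_c, yc, xc

wc, mu_c, yc, xc = critical_values(g=20.0, p=10.0)
```

The returned values satisfy the stated relations, for instance w_0/√2 ≤ w_c ≤ w_0 and y_c² ≥ g²/2.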
Theorem 9.2.1. Let ξ ∈ IR^p fulfill (9.1). Then it holds for each x ≤ x_c
IP( ‖ξ‖² > p + √(κxp) ∨ (κx), ‖ξ‖ ≤ y_c ) ≤ 2 exp(−x) ,
where κ = 6.6 . Moreover, for y ≥ y_c , it holds with g_c = g − √(µ_c p) = g w_c/(1 + w_c)
IP( ‖ξ‖ > y ) ≤ 8.4 exp{ −g_c y/2 − (p/2) log(1 − g_c/y) } ≤ 8.4 exp{ −x_c − g_c (y − y_c)/2 } .
Proof. The main step of the proof is the following exponential bound.
Lemma 9.2.2. Suppose (9.1). For any µ < 1 with g² > pµ , it holds
IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 (1 − µ)^{-p/2} .  (9.7)
Proof. Let ε be a standard normal vector in IR^p . The bound IP( ‖ε‖² > p ) ≤ 1/2 implies for any vector u and any r with r ≥ ‖u‖ + p^{1/2} that IP( ‖u + ε‖ ≤ r ) ≥ 1/2 . Let us fix some ξ with ‖ξ‖ ≤ g/µ − √(p/µ) and denote by IP_ξ the conditional probability given ξ . It holds with c_p = (2π)^{-p/2}
c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
= c_p exp( µ‖ξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{-1/2}γ − µ^{1/2}ξ ‖² ) 1I( µ^{-1/2}‖γ‖ ≤ µ^{-1/2}g ) dγ
= µ^{p/2} exp( µ‖ξ‖²/2 ) IP_ξ( ‖ε + µ^{1/2}ξ‖ ≤ µ^{-1/2}g ) ≥ 0.5 µ^{p/2} exp( µ‖ξ‖²/2 ) ,
because ‖µ^{1/2}ξ‖ + p^{1/2} ≤ µ^{-1/2}g . This implies in view of p < g²/µ that
exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 µ^{-p/2} c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ .
Further, by (9.1)
c_p IE ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) dγ = (µ^{-1} − 1)^{-p/2}
and (9.7) follows.
Due to this result, the scaled squared norm µ‖ξ‖²/2 after a proper truncation possesses the same exponential moments as in the Gaussian case. A straightforward implication is the probability bound for IP( ‖ξ‖² > p + u ) for moderate values u . Namely, given u > 0 , define µ = u/(u + p) . This value optimizes the inequality (9.2) in the Gaussian case. Now we can apply a similar bound under the constraint ‖ξ‖ ≤ g/µ − √(p/µ) . Therefore, the bound is only meaningful if √(u + p) ≤ g/µ − √(p/µ) with µ = u/(u + p) , or, with w = √(u/p) , if w ≤ w_c ; see (9.5). The largest value u for which this constraint is still valid is given by p + u = y_c² .
Hence, (9.7) yields for p + u ≤ y_c²
IP( ‖ξ‖² > p + u, ‖ξ‖ ≤ y_c ) ≤ exp{ −µ(p + u)/2 } IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ g/µ − √(p/µ) )
≤ 2 exp{ −0.5 [ µ(p + u) + p log(1 − µ) ] } = 2 exp{ −0.5 [ u − p log(1 + u/p) ] } .
Similarly to the Gaussian case, this implies with κ = 6.6 that
IP( ‖ξ‖² ≥ p + √(κxp) ∨ (κx), ‖ξ‖ ≤ y_c ) ≤ 2 exp(−x) .
The Gaussian case means that (9.1) holds with g = ∞ yielding y_c = ∞ . In the non-Gaussian case with a finite g , we have to accompany the moderate deviation bound with a large deviation bound for IP( ‖ξ‖ > y ) for y ≥ y_c . This is done by combining the bound (9.7) with the standard slicing arguments.
Lemma 9.2.3. Let µ_0 ≤ g²/p . Define y_0 = g/µ_0 − √(p/µ_0) and g_0 = µ_0 y_0 = g − √(µ_0 p) . It holds for y ≥ y_0
IP( ‖ξ‖ > y ) ≤ 8.4 (1 − g_0/y)^{-p/2} exp( −g_0 y/2 )  (9.8)
≤ 8.4 exp{ −x_0 − g_0 (y − y_0)/2 } ,  (9.9)
with x_0 defined by
2 x_0 = µ_0 y_0² + p log(1 − µ_0) .
Proof. Consider the growing sequence y_k with y_1 = y and g_0 y_{k+1} = g_0 y + k . Define also µ_k = g_0/y_k . In particular, µ_k ≤ µ_1 = g_0/y . Obviously
IP( ‖ξ‖ > y ) = ∑_{k=1}^{∞} IP( ‖ξ‖ > y_k, ‖ξ‖ ≤ y_{k+1} ) .
Now we evaluate every slicing probability in this expression. We use that
µ_{k+1} y_k² = (g_0 y + k − 1)² / (g_0 y + k) ≥ g_0 y + k − 2 ,
and also g/µ_k − √(p/µ_k) ≥ y_k because g − g_0 = √(µ_0 p) > √(µ_k p) and
g/µ_k − √(p/µ_k) − y_k = µ_k^{-1} ( g − √(µ_k p) − g_0 ) ≥ 0 .
Hence by (9.7)
IP( ‖ξ‖ > y ) ≤ ∑_{k=1}^{∞} IP( ‖ξ‖ > y_k, ‖ξ‖ ≤ y_{k+1} )
≤ ∑_{k=1}^{∞} exp( −µ_{k+1} y_k²/2 ) IE exp( µ_{k+1} ‖ξ‖²/2 ) 1I( ‖ξ‖ ≤ y_{k+1} )
≤ ∑_{k=1}^{∞} 2 (1 − µ_{k+1})^{-p/2} exp( −µ_{k+1} y_k²/2 )
≤ 2 (1 − µ_1)^{-p/2} ∑_{k=1}^{∞} exp( −(g_0 y + k − 2)/2 )
= 2 e^{1/2} (1 − e^{-1/2})^{-1} (1 − µ_1)^{-p/2} exp( −g_0 y/2 )
≤ 8.4 (1 − µ_1)^{-p/2} exp( −g_0 y/2 )
and the first assertion follows. For y = y_0 , it holds
g_0 y_0 + p log(1 − µ_0) = µ_0 y_0² + p log(1 − µ_0) = 2 x_0
and (9.8) implies IP( ‖ξ‖ > y_0 ) ≤ 8.4 exp(−x_0) . Now observe that the function f(y) = g_0 y/2 + (p/2) log(1 − g_0/y) fulfills f(y_0) = x_0 and f′(y) ≥ g_0/2 yielding f(y) ≥ x_0 + g_0 (y − y_0)/2 . This implies (9.9).
The statements of the theorem are obtained by applying the lemmas with µ_0 = µ_c = w_c²/(1 + w_c²) . This also implies y_0 = y_c , x_0 = x_c , and g_0 = g_c = g − √(µ_c p) ; cf. (9.6).
The statements of Theorem 9.2.1 can be simplified under the assumption g² ≥ p .
Corollary 9.2.4. Let ξ fulfill (9.1) and g² ≥ p . Then it holds for x ≤ x_c
IP( ‖ξ‖² ≥ z(x, p) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,  (9.10)
z(x, p) def= p + √(κxp) for x ≤ p/κ , and z(x, p) def= p + κx for p/κ < x ≤ x_c ,  (9.11)
with κ = 6.6 . For x > x_c
IP( ‖ξ‖² ≥ z_c(x, p) ) ≤ 8.4 e^{−x} ,   z_c(x, p) def= | y_c + 2(x − x_c)/g_c |² .
This result implicitly assumes that p ≤ κ x_c , which is fulfilled if w_0² = g²/p ≥ 1 :
κ x_c = 0.5 κ [ w_0² − log(1 + w_0²) ] p ≥ 3.3 [ 1 − log(2) ] p > p .
In the zone x ≤ p/κ we obtain sub-Gaussian behavior of the tail of ‖ξ‖² − p , while in the zone p/κ < x ≤ x_c it becomes sub-exponential. Note that the sub-exponential zone is empty if g² < p .
For x ≤ x_c , the function z(x, p) mimics the quantile behavior of the chi-squared distribution χ²_p with p degrees of freedom. Moreover, an increase of the value g yields a growth of the sub-Gaussian zone. In particular, for g = ∞ , a general quadratic form ‖ξ‖² has under (9.1) the same tail behavior as in the Gaussian case.
Finally, in the large deviation zone x > x_c the deviation probability decays as e^{−c x^{1/2}} for some fixed c . However, if the constant g in the condition (9.1) is sufficiently large relative to p , then x_c is large as well and the large deviation zone x > x_c can be ignored at the small price of 8.4 e^{−x_c} , and one can focus on the deviation bound described by (9.10) and (9.11).
9.3 A bound for a quadratic form
Now we extend the result to a more general bound for ‖IBξ‖² = ξ⊤IB²ξ with a given matrix IB and a vector ξ obeying the condition (9.1). Similarly to the Gaussian case we assume that IB is symmetric. Define the important characteristics of IB
p = tr(IB²) ,   v² = 2 tr(IB⁴) ,   λ* def= ‖IB²‖_∞ def= λ_max(IB²) .
For simplicity of formulation we suppose that λ* = 1 ; otherwise one has to replace p and v² with p/λ* and v²/λ* .
Let g be the constant from condition (9.1). Define similarly to the ℓ2-case w_c by the equation
w_c (1 + w_c) / (1 + w_c²)^{1/2} = g p^{-1/2} .
Define also µ_c = w_c²/(1 + w_c²) ∧ 2/3 . Note that w_c² ≥ 2 implies µ_c = 2/3 . Further define
y_c² = (1 + w_c²) p ,   2 x_c = µ_c y_c² + log det{ II_p − µ_c IB² } .  (9.12)
Similarly to the case with IB = II_p , under the condition g² ≥ p , one can bound y_c² ≥ g²/2 and x_c ≳ g²/4 .
Theorem 9.3.1. Let a random vector ξ in IR^p fulfill (9.1). Then for each x < x_c
IP( ‖IBξ‖² > p + (2v x^{1/2}) ∨ (6x), ‖IBξ‖ ≤ y_c ) ≤ 2 exp(−x) .
Moreover, for y ≥ y_c , with g_c = g − √(µ_c p) = g w_c/(1 + w_c) , it holds
IP( ‖IBξ‖ > y ) ≤ 8.4 exp( −x_c − g_c (y − y_c)/2 ) .
Proof. The main steps of the proof are similar to the proof of Theorem 9.2.1.
Lemma 9.3.2. Suppose (9.1). For any µ < 1 with g²/µ ≥ p , it holds
IE exp( µ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ − √(p/µ) ) ≤ 2 det( II_p − µ IB² )^{-1/2} .  (9.13)
Proof. With c_p(IB) = (2π)^{-p/2} det(IB^{-1}) , it holds
c_p(IB) ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
= c_p(IB) exp( µ‖IBξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{1/2} IBξ − µ^{-1/2} IB^{-1}γ ‖² ) 1I( ‖γ‖ ≤ g ) dγ
= µ^{p/2} exp( µ‖IBξ‖²/2 ) IP_ξ( ‖ µ^{-1/2} IBε + IB²ξ ‖ ≤ g/µ ) ,
where ε denotes a standard normal vector in IR^p and IP_ξ means the conditional probability given ξ . Moreover, for any u ∈ IR^p and r ≥ p^{1/2} + ‖u‖ , it holds in view of IP( ‖IBε‖² > p ) ≤ 1/2
IP( ‖IBε − u‖ ≤ r ) ≥ IP( ‖IBε‖ ≤ √p ) ≥ 1/2 .
This implies
exp( µ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ − √(p/µ) )
≤ 2 µ^{-p/2} c_p(IB) ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ .
Further, by (9.1)
c_p(IB) IE ∫ exp( γ⊤ξ − ‖IB^{-1}γ‖²/(2µ) ) 1I( ‖γ‖ ≤ g ) dγ
≤ c_p(IB) ∫ exp( ‖γ‖²/2 − ‖IB^{-1}γ‖²/(2µ) ) dγ
≤ det(IB^{-1}) det( µ^{-1} IB^{-2} − II_p )^{-1/2} = µ^{p/2} det( II_p − µ IB² )^{-1/2}
and (9.13) follows.
Now we evaluate the probability IP( ‖IBξ‖ > y ) for moderate values of y .
Lemma 9.3.3. Let µ_0 < 1 ∧ (g²/p) . With y_0 = g/µ_0 − √(p/µ_0) , it holds for any u > 0
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ 2 exp{ −0.5 µ_0 (p + u) − 0.5 log det( II_p − µ_0 IB² ) } .  (9.14)
In particular, if IB² is diagonal, that is, IB² = diag(a_1, . . . , a_p) , then
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ 2 exp{ −µ_0 u/2 − (1/2) ∑_{i=1}^{p} [ µ_0 a_i + log(1 − µ_0 a_i) ] } .  (9.15)
Proof. The exponential Chebyshev inequality and (9.13) imply
IP( ‖IBξ‖² > p + u, ‖IB²ξ‖ ≤ y_0 )
≤ exp{ −µ_0 (p + u)/2 } IE exp( µ_0 ‖IBξ‖²/2 ) 1I( ‖IB²ξ‖ ≤ g/µ_0 − √(p/µ_0) )
≤ 2 exp{ −0.5 µ_0 (p + u) − 0.5 log det( II_p − µ_0 IB² ) } .
Moreover, the standard change-of-basis arguments allow us to reduce the problem to the case of a diagonal matrix IB² = diag(a_1, . . . , a_p) where 1 = a_1 ≥ a_2 ≥ . . . ≥ a_p > 0 . Note that p = a_1 + . . . + a_p . Then the claim (9.14) can be written in the form (9.15).
Now we evaluate the large deviation probability that ‖IBξ‖ > y for a large y . Note that the condition ‖IB²‖_∞ ≤ 1 implies ‖IB²ξ‖ ≤ ‖IBξ‖ . So, the bound (9.14) continues to hold when ‖IB²ξ‖ ≤ y_0 is replaced by ‖IBξ‖ ≤ y_0 .
Lemma 9.3.4. Let µ_0 < 1 and µ_0 p < g² . Define g_0 = g − √(µ_0 p) . For any y ≥ y_0 def= g_0/µ_0 , it holds
IP( ‖IBξ‖ > y ) ≤ 8.4 det{ II_p − (g_0/y) IB² }^{-1/2} exp( −g_0 y/2 )
≤ 8.4 exp( −x_0 − g_0 (y − y_0)/2 ) ,  (9.16)
where x_0 is defined by
2 x_0 = g_0 y_0 + log det{ II_p − (g_0/y_0) IB² } .
Proof. The slicing arguments of Lemma 9.2.3 apply here in the same manner. One has to replace ‖ξ‖ by ‖IBξ‖ and (1 − µ_1)^{-p/2} by det{ II_p − (g_0/y) IB² }^{-1/2} . We omit the details. In particular, with y = y_0 = g_0/µ_0 , this yields
IP( ‖IBξ‖ > y_0 ) ≤ 8.4 exp(−x_0) .
Moreover, for the function f(y) = g_0 y + log det{ II_p − (g_0/y) IB² } , it holds f′(y) ≥ g_0 and hence f(y) ≥ f(y_0) + g_0 (y − y_0) for y > y_0 . This implies (9.16).
One important feature of the results of Lemma 9.3.3 and Lemma 9.3.4 is that the value µ_0 < 1 ∧ (g²/p) can be selected arbitrarily. In particular, for y ≥ y_c , Lemma 9.3.4 with µ_0 = µ_c yields the large deviation probability IP( ‖IBξ‖ > y ) . For bounding the probability IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) , we use the inequality log(1 − t) ≥ −t − t² for t ≤ 2/3 . It implies for µ ≤ 2/3 that
− log IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) ≥ µ(p + u) + ∑_{i=1}^{p} log(1 − µ a_i)
≥ µ(p + u) − ∑_{i=1}^{p} ( µ a_i + µ² a_i² ) ≥ µu − µ² v²/2 .  (9.17)
Now we distinguish between µ_c = 2/3 and µ_c < 2/3 , starting with µ_c = 2/3 . The bound (9.17) with µ = 2/3 and with u = (2v x^{1/2}) ∨ (6x) yields
IP( ‖IBξ‖² > p + u, ‖IBξ‖ ≤ y_c ) ≤ 2 exp(−x) ;
see the proof of Theorem 9.1.2 for the Gaussian case.
Now consider µ_c < 2/3 . For x^{1/2} ≤ µ_c v/2 , use u = 2v x^{1/2} and µ_0 = u/v² . It holds µ_0 = u/v² ≤ µ_c and u²/(4v²) = x yielding the desired bound by (9.17). For x^{1/2} > µ_c v/2 , we select again µ_0 = µ_c . It holds with u = 4 µ_c^{-1} x that µ_c u/2 − µ_c² v²/4 ≥ 2x − x = x . This completes the proof.
Now we describe the value z(x, IB) ensuring a small value for the large deviation probability IP( ‖IBξ‖² > z(x, IB) ) . For ease of formulation, we suppose that g² ≥ 2p yielding µ_c^{-1} ≤ 3/2 . The other case can be easily adjusted.
Corollary 9.3.5. Let ξ fulfill (9.1) with g² ≥ 2p . Then it holds for x ≤ x_c with x_c from (9.12):
IP( ‖IBξ‖² ≥ z(x, IB) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,
z(x, IB) def= p + 2v x^{1/2} for x ≤ v/18 , and z(x, IB) def= p + 6x for v/18 < x ≤ x_c .  (9.18)
For x > x_c
IP( ‖IBξ‖² ≥ z_c(x, IB) ) ≤ 8.4 e^{−x} ,   z_c(x, IB) def= | y_c + 2(x − x_c)/g_c |² .
9.4 Rescaling and regularity condition
The result of Theorem 9.3.1 can be extended to a more general situation when the condition (9.1) is fulfilled for a vector ζ rescaled by a matrix V_0 . More precisely, let the random p-vector ζ fulfill for some p × p matrix V_0 the condition
sup_{γ∈IR^p} log IE exp( λ γ⊤ζ / ‖V_0 γ‖ ) ≤ ν_0² λ²/2 ,   |λ| ≤ g ,  (9.19)
with some constants g > 0 , ν_0 ≥ 1 . Again, a simple change of variables reduces the case of an arbitrary ν_0 ≥ 1 to ν_0 = 1 . Our aim is to bound the squared norm ‖D_0^{-1}ζ‖² of the vector D_0^{-1}ζ for another p × p positive symmetric matrix D_0² . Note that condition (9.19) implies (9.1) for the rescaled vector ξ = V_0^{-1}ζ . This leads to bounding the quadratic form ‖D_0^{-1} V_0 ξ‖² = ‖IBξ‖² with IB² = D_0^{-1} V_0² D_0^{-1} . It obviously holds
p = tr(IB²) = tr( D_0^{-2} V_0² ) .
Now we can apply the result of Corollary 9.3.5.
Corollary 9.4.1. Let ζ fulfill (9.19) with some V_0 and g . Given D_0 , define IB² = D_0^{-1} V_0² D_0^{-1} , and let g² ≥ 2p . Then it holds for x ≤ x_c with x_c from (9.12):
IP( ‖D_0^{-1}ζ‖² ≥ z(x, IB) ) ≤ 2 e^{−x} + 8.4 e^{−x_c} ,
with z(x, IB) from (9.18). For x > x_c
IP( ‖D_0^{-1}ζ‖² ≥ z_c(x, IB) ) ≤ 8.4 e^{−x} ,   z_c(x, IB) def= | y_c + 2(x − x_c)/g_c |² .
Finally we briefly discuss the regular case with D_0 ≥ a V_0 for some a > 0 . This implies ‖IB‖_∞ ≤ a^{-1} and
v² = 2 tr(IB⁴) ≤ 2 a^{-2} p .
9.5 A chi-squared bound with norm-constraints
This section extends the results to the case when the bound (9.1) involves a constraint on some norm of the vector γ other than the ℓ2-norm. Namely, we suppose that
log IE exp( γ⊤ξ ) ≤ ‖γ‖²/2 ,   γ ∈ IR^p , ‖γ‖◦ ≤ g◦ ,  (9.20)
where ‖·‖◦ is a norm which differs from the usual Euclidean norm. Our driving example is the sup-norm case with ‖γ‖◦ ≡ ‖γ‖_∞ . We are interested in checking whether the previous results of Section 9.2 still apply. The answer depends on how massive the set A(r) = { γ : ‖γ‖◦ ≤ r } is in terms of the standard Gaussian measure on IR^p . Recall that the squared norm ‖ε‖² of a standard Gaussian vector ε in IR^p concentrates around p , at least for p large. We need a similar concentration property for the norm ‖·‖◦ . More precisely, we assume for a fixed r◦ that
IP( ‖ε‖◦ ≤ r◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .  (9.21)
This implies for any value u◦ > 0 and all u ∈ IR^p with ‖u‖◦ ≤ u◦ that
IP( ‖ε − u‖◦ ≤ r◦ + u◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .
For each z > p , consider
µ(z) = (z − p)/z .
Given u◦ , denote by z◦ = z◦(u◦) the root of the equation
g◦/µ(z◦) − r◦/µ^{1/2}(z◦) = u◦ .  (9.22)
One can easily see that this value exists and is unique if u◦ ≥ g◦ − r◦ , and it can be defined as the largest z for which g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ . Let µ◦ = µ(z◦) be the corresponding µ-value. Define also x◦ by
2 x◦ = µ◦ z◦ + p log(1 − µ◦) .
If u◦ < g◦ − r◦ , then set z◦ = ∞ , x◦ = ∞ .
Theorem 9.5.1. Let a random vector ξ in IR^p fulfill (9.20). Suppose (9.21) and let, given u◦ , the value z◦ be defined by (9.22). Then it holds for any u > 0
IP( ‖ξ‖² > p + u, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −(p/2) φ(u/p) } ,  (9.23)
yielding for x ≤ x◦
IP( ‖ξ‖² > p + √(κxp) ∨ (κx), ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp(−x) ,  (9.24)
where κ = 6.6 . Moreover, for z ≥ z◦ , it holds
IP( ‖ξ‖² > z, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −µ◦ z/2 − (p/2) log(1 − µ◦) } = 2 exp{ −x◦ − µ◦ (z − z◦)/2 } .
Proof. The arguments behind the result are the same as in the one-norm case of Theorem 9.2.1. We only outline the main steps.
Lemma 9.5.2. Suppose (9.20) and (9.21). For any µ < 1 with g◦ > µ^{1/2} r◦ , it holds
IE exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖◦ ≤ g◦/µ − r◦/µ^{1/2} ) ≤ 2 (1 − µ)^{-p/2} .  (9.25)
Proof. Let ε be a standard normal vector in IR^p . Let us fix some ξ with µ^{1/2}‖ξ‖◦ ≤ µ^{-1/2} g◦ − r◦ and denote by IP_ξ the conditional probability given ξ . It holds by (9.21) with c_p = (2π)^{-p/2}
c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ
= c_p exp( µ‖ξ‖²/2 ) ∫ exp( −(1/2) ‖ µ^{1/2}ξ − µ^{-1/2}γ ‖² ) 1I( ‖µ^{-1/2}γ‖◦ ≤ µ^{-1/2} g◦ ) dγ
= µ^{p/2} exp( µ‖ξ‖²/2 ) IP_ξ( ‖ε − µ^{1/2}ξ‖◦ ≤ µ^{-1/2} g◦ ) ≥ 0.5 µ^{p/2} exp( µ‖ξ‖²/2 ) .
This implies
exp( µ‖ξ‖²/2 ) 1I( ‖ξ‖◦ ≤ g◦/µ − r◦/µ^{1/2} )
≤ 2 µ^{-p/2} c_p ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ .
Further, by (9.20)
c_p IE ∫ exp( γ⊤ξ − ‖γ‖²/(2µ) ) 1I( ‖γ‖◦ ≤ g◦ ) dγ ≤ c_p ∫ exp( −(µ^{-1} − 1) ‖γ‖²/2 ) dγ = (µ^{-1} − 1)^{-p/2}
and (9.25) follows.
As in the Gaussian case, (9.25) implies for z > p with µ = µ(z) = (z − p)/z the bounds (9.23) and (9.24). Note that the value µ(z) clearly grows with z from zero to one, while g◦/µ(z) − r◦/µ^{1/2}(z) is strictly decreasing. The value z◦ is defined exactly as the point where g◦/µ(z) − r◦/µ^{1/2}(z) crosses u◦ , so that g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ for z ≤ z◦ .
For z > z◦ , the choice µ = µ(z) conflicts with the constraint g◦/µ(z) − r◦/µ^{1/2}(z) ≥ u◦ . So, we apply µ = µ◦ yielding by the Markov inequality
IP( ‖ξ‖² > z, ‖ξ‖◦ ≤ u◦ ) ≤ 2 exp{ −µ◦ z/2 − (p/2) log(1 − µ◦) } ,
and the assertion follows.
It is easy to check that the result continues to hold for the norm of Πξ for a given sub-projector Π in IR^p satisfying Π = Π⊤ , Π² ≤ Π . As above, denote p def= tr(Π²) , v² def= 2 tr(Π⁴) . Let r◦ be fixed to ensure
IP( ‖Πε‖◦ ≤ r◦ ) ≥ 1/2 ,   ε ∼ N(0, II_p) .
The next result is stated for g◦ ≥ r◦ + u◦ , which simplifies the formulation.
Theorem 9.5.3. Let a random vector ξ in IR^p fulfill (9.20) and let Π satisfy Π = Π⊤ , Π² ≤ Π . Let some u◦ be fixed. Then for any µ◦ ≤ 2/3 with g◦ µ◦^{-1} − r◦ µ◦^{-1/2} ≥ u◦ ,
IE exp{ (µ◦/2) ( ‖Πξ‖² − p ) } 1I( ‖Π²ξ‖◦ ≤ u◦ ) ≤ 2 exp( µ◦² v²/4 ) ,  (9.26)
where v² = 2 tr(Π⁴) . Moreover, if g◦ ≥ r◦ + u◦ , then for any x ≥ 0
IP( ‖Πξ‖² > p + (2v x^{1/2}) ∨ (6x), ‖Π²ξ‖◦ ≤ u◦ ) ≤ 2 exp(−x) .
Proof. Arguments from the proofs of Lemmas 9.3.2 and 9.5.2 yield in view of g◦ µ◦^{-1} − r◦ µ◦^{-1/2} ≥ u◦
IE exp{ µ◦‖Πξ‖²/2 } 1I( ‖Π²ξ‖◦ ≤ u◦ ) ≤ IE exp( µ◦‖Πξ‖²/2 ) 1I( ‖Π²ξ‖◦ ≤ g◦/µ◦ − r◦/µ◦^{1/2} ) ≤ 2 det( II_p − µ◦ Π² )^{-1/2} .
Now the inequality log(1 − t) ≥ −t − t² for t ≤ 2/3 implies
− log det( II_p − µ◦ Π² ) ≤ µ◦ p + µ◦² v²/2 ;
cf. (9.17); the assertion (9.26) follows.
9.6 A bound for the ℓ2-norm under Bernstein conditions
For comparison, we specify the results to the case considered recently in Baraud (2010). Let ζ be a random vector in IR^n whose components ζ_i are independent and satisfy the Bernstein type conditions: for all |λ| < c^{-1}
log IE e^{λζ_i} ≤ λ²σ² / (1 − c|λ|) .  (9.27)
Denote ξ = ζ/(2σ) and consider ‖γ‖◦ = ‖γ‖_∞ . Fix g◦ = σ/c . If ‖γ‖◦ ≤ g◦ , then 1 − c|γ_i|/(2σ) ≥ 1/2 and
log IE exp( γ⊤ξ ) ≤ ∑_i log IE exp( γ_i ζ_i/(2σ) ) ≤ ∑_i |γ_i/(2σ)|² σ² / (1 − c|γ_i|/(2σ)) ≤ ‖γ‖²/2 .
Let also S be some linear subspace of IR^n with dimension p and let Π_S denote the projector onto S . For applying the result of Theorem 9.5.1, the value r◦ has to be fixed. We use that the infinity norm ‖ε‖_∞ concentrates around √(2 log p) .
Lemma 9.6.1. It holds for a standard normal vector ε ∈ IR^p with r◦ = √(2 log p)
IP( ‖ε‖◦ ≤ r◦ ) ≥ 1/2 .
Proof. By definition
IP( ‖ε‖◦ > r◦ ) = IP( ‖ε‖_∞ > √(2 log p) ) ≤ p IP( |ε_1| > √(2 log p) ) ≤ 1/2
as required.
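The concentration claim of Lemma 9.6.1 is also easy to confirm by simulation (the dimension p below is an arbitrary illustrative choice):

```python
import numpy as np

# Monte Carlo check of IP(||eps||_inf <= sqrt(2 log p)) >= 1/2 for eps ~ N(0, I_p).
rng = np.random.default_rng(3)
p = 100
r_circ = np.sqrt(2 * np.log(p))
eps = rng.standard_normal((20_000, p))
freq = float(np.mean(np.max(np.abs(eps), axis=1) <= r_circ))
```

For p = 100 the empirical probability is well above 1/2, so the choice r◦ = √(2 log p) is in fact conservative.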
Now the general bound of Theorem 9.5.1 is applied to bounding the norm ‖Π_S ξ‖ . For simplicity of formulation we assume that g◦ ≥ u◦ + r◦ .
Theorem 9.6.2. Let S be some linear subspace of IR^n with dimension p . Let g◦ ≥ u◦ + r◦ . If the coordinates ζ_i of ζ are independent and satisfy (9.27), then for all x
IP( (4σ²)^{-1} ‖Π_S ζ‖² > p + √(κxp) ∨ (κx), ‖Π_S ζ‖_∞ ≤ 2σ u◦ ) ≤ 2 exp(−x) .
The bound of Baraud (2010) reads
IP( ‖Π_S ζ‖² > (3σ ∨ √(6 c u◦)) √(x + 3p), ‖Π_S ζ‖_∞ ≤ 2σ u◦ ) ≤ e^{−x} .
As expected, in the region x ≤ x_c of Gaussian approximation, the bound of Baraud is not sharp and actually quite rough.
Bibliography