+ All Categories
Home > Documents > Reproducing Kernel Hilbert Spaces in Probability and Statistics || RKHS and Stochastic Processes

Reproducing Kernel Hilbert Spaces in Probability and Statistics || RKHS and Stochastic Processes

Date post: 04-Dec-2016
Category:
Upload: christine
View: 213 times
Download: 0 times
Share this document with a friend
54
Chapter 2 RKHS AND STOCHASTIC PROCESSES 1. INTRODUCTION In Chapter 1, we have studied the relationships between reproducing kernels and positive definite functions. In this chapter , the central result due to Loeve is that the class of covariance functions of second order- sto chastic processes coincide with the class of positive definite functions. This link has been used to translate some problems rela ted to stochastic processes into funct ional ones. Such equivalence results are potentially interesting either to use functional methods for solving stochastic prob- lems but also to use stochastic method s for improving algorithms for functional prob lems , as will be illus trated in Section 4.2, and belong to the large field of interactions between approximation theor y and statis- t ics. Bayesian numerical analysis for example is surveyed in Diaconis (1988) who traces it back to Poincare (1912). We will come back to this topic la ter on in Chapter 3. In the 1960's, Parzen popularized the use of Mercer and Karhunen representation theorems to write formal solutions to best linear prediction problems for stochastic processes. This lead Wahba in the 1970 's to reveal the spline nature of the solution of some filtering problems. 2. COVARIANCE FUNCTION OF A SECOND ORDER STOCHASTIC PROCESS 2.1. CASE OF ORDINARY STOCHASTIC PROCESSES Let (S1, A,P) be a fixed probability space and L 2 (S1, A,P) be the space of second order random variables on S1. We recall t hat L 2 (S1 , A,P) is a Hilbert space traditionally equipped with the inner product < X,Y >= A. Berlinet et al., Reproducing Kernel Hilbert Spaces In Probability and Statistics © Springer Science+Business Media New York 2004
Transcript

Chapter 2

RKHS AND STOCHASTIC PROCESSES

1. INTRODUCTIONIn Chapter 1, we have studied the relationships between reproducing

kernels and positive definite functions. In this chapter, the central resultdue to Loeve is that the class of covariance functions of second order­stochastic processes coincide with t he class of positive definite functions.This link has been used to translate some problems related to stochasticprocesses into functional ones. Such equivalence results are potentiallyinteresting either to use functional methods for solving stochastic prob­lems but also to use stochastic methods for improving algorithms forfunctional problems , as will be illustrated in Section 4.2 , and belong tothe large field of interactions between approximation theory and statis­t ics. Bayesian numerical analysis for example is surveyed in Diaconis(1988) who traces it back to Poincare (1912) . We will come back to thistopic later on in Chapter 3. In the 1960 's, Parzen popularized the use ofMercer and Karhunen representation t heorems to write formal solutionsto best linear prediction problems for stochastic processes. This leadWahba in the 1970 's to reveal the spline nature of the solut ion of somefiltering problems.

2. COVARIANCE FUNCTION OF A SECONDORDER STOCHASTIC PROCESS

2.1. CASE OF ORDINARY STOCHASTICPROCESSES

Let (S1, A ,P) be a fixed probability space and L2 (S1, A ,P) be the spaceof second order random var iables on S1. We recall t hat L2 (S1 , A ,P) is aHilbert space traditionally equipped with t he inner product < X ,Y >=

A. Berlinet et al., Reproducing Kernel Hilbert Spaces In Probability and Statistics

© Springer Science+Business Media New York 2004

56 RKHS IN PROBABILITY AND STATISTICS

E(XY). Let X t, t ranging in a set T, be a second order stochasticprocess defined on the probability space (n, A, P) with values in lR orC. We will denote by m the mean function of the process

by R the second moment function

R(t, s) = E(XtXs )

and by K t he covariance function

Note thatR(s, t) = m(t)m(s) + K(s, t)

Some authors use covariance function for the function R and some othersspecify proper covariance function for the function K. We may do thesame when it is clear from the background. T is often a subset of lRP•

The term process is used for X, when p = 1, whereas the term randomfield is preferred when p > 1. Since most of our statements will coverboth cases, we will use process indifferently. Complex valued processesare not so common in applications, hence in order to avoid unnecessarycomplications, we will consider only the case of real valued processes inthe rest of the chapter, even though what follows remains valid up to afew complex conjugate signs.

2.1.1 CASE OF GENERALIZED STOCHASTICPROCESSES

Some processes used in physical models such as white noise or itsderivatives require a more elaborate theory to be defined rigorously dueto the lack of continuity of their covariance function and to the fact thatthey have no point values. The physician does not have direct accessto a process X t but rather to its measure through a physical deviceJXt<p(t)d>..(t) where <p is a weighting function characterizing this device .Hence it is natural to define these processes as linear operators on a setof weighing functions.Generalized stochastic processes are introduced in Ito (1954) and Gelfandand Vilenkin (1967) from ordinary stochastic processes in the same waygeneralized functions are defined from ordinary functions. As in Meidan(1979), we will consider rather Ito 's definition (Ito, 1954) which is morerestrictive in the sense that the random variables involved are requiredto be square integrable. For an open subset T of lRP, according to Ito's

RKHS and stochastic processes 57

definition , a second order generalized stochastic process (hereafter GSP)is a linear and continuous operator from V(T) to £2(0, A, Pl. In con­trast, an ordinary second order stochastic process is a mapping from Tinto £2(0). When this mapping is moreover Bochner integrable on com­pact sets of T (see Chapter 4, Section 5), it induces a GSP, and hencethis definition generalizes the traditional one. For these processes, weneed to introduce a covariance operator as we did for random variableswith values in a RKHS in Chapter 1, Section 4.3. We can proceed as forthe case of Banach valued processes (see Bosq , 2000). For a topologicalvector subspace £ of ~.T, if X is a linear and continuous operator fromE to £2(0) and u an element of the topological bidual E", let

Cx(u) = E(((u,X»X) (2.1)

2.2.2.2.1

define the covariance operator of X from E" to E, To avoid the use ofthe bidual, some authors prefer to restrict attention to the covariancestructure given by the bilinear form

R(¢, 7/;) = E(((¢,X»((7/; ,X») .

For example continuous white noise on the interval (0, a) can be intro­duced as the zero-mean generalized process with covariance structure

R(¢, 7/;) = l a

¢(t)7/;(t)d>.(t)

for two elements ¢ and 7/; in V(T) . Coming back to GSP's, if X is aGSP and X' its transpose, which can be viewed as a an operator from£2(n) into 1J'(T) , Meidan (1979) defines its covariance operator to bethe operator X'X from V(T) into V'(T). It is the restriction to V(T)of the covariance operator defined by 2.1. When the GSP is inducedby an ordinary stochastic process , its covariance operator is an integraloperator whose kernel coincide with the ordinary covariance function .

POSITIVITY AND COVARIANCEPOSITIVE TYPE FUNCTIONS ANDCOVARIANCE FUNCTIONS

The following theorem due to Loeve (1978, p. 132 of volume 2) es­tablishes the exact coincidence between the class of functions of positivetype and the class of covariance functions .

THEOREM 27 (LOEVE 's THEOREM) R is a second moment funct ion ofa second order stochastic process indexed by T if and only if R is afunction of positive type on TxT.

58 RKHS IN PROBABILITY AND STATISTICS

Proof. For simplicity let us restrict to the case of real-valued kernels. Itis clear that a second moment function is of positive type. Conversely, letR be of positive type on T xT. For any finite subset Tn = {t l , ... , tn} ET , the quadratic form Q(u) = L:?'j=l R(ti ' tj)UiUj being positive definite,there exists jointly normal random variables X ti such that R(ti, tj) =E(XtiXtj ) (simply work in a diagonalization basis of this form). Itis then enough to check that these laws defined for any finite subsetTn = {tl,"" tn} E T are consistent which is clear. •

Note that it also coincides with the class of proper covariance functions.Let us now give some examples of classes of processes with covariancefunctions of particular types.Example 1 stationary processesA useful class of processes for statistical applications is that of station­ary processes on T C !Rd. Let us recall that a second order process issaid to be strongly stationary when its distribution is translation invari­ant i.e. when for any increment vector h E T, the joint distribution ofX t1+h, ... , X tn+h is the same as the distribution of X t1, ... , X tn for alltl , . .. , tn E T . In the vocabulary of random fields, the term "homoge­neous" is sometimes preferred . The proper covariance of a stationaryprocess is translation invariant i.e. K(s, t) = k(s - t). The positive defi­niteness of the function K corresponds to the property of the one variablefunction k to be of positive type (see Chapter 1 Section Positivetype).A larger class of processes still retaining the statistical advantages of theprevious one is that of second order (or weakly) stationary processes:a second order process is said to be weakly stationary when its meanis constant and its proper covariance is translation invariant. Bochnertheorem (see Chapter 1) allows one to define the spectral measure of astationary process as the Fourier transform of the positive type functionk.Example 2 markovian gaussian processesA gaussian process {Xt , t E T} with zero mean and continuous propercovariance K can be shown to be markovian if and only if

K(s, t) =g(s)G(min(s, t))g(t), (2.2)

where 9 is a non vanishing continuous function on T with g(O) = 1, andG is continuous and monotone increasing on T (see Neveu, 1968).In the case G(O) > 0, equation (2.2) is equivalent to

K(s, t) = iT g(s)( s - u)~g(t)(t - u)~dG(u) +g(s)g(t)G(O)

where x+ is equal to x if x > 0 and to 0 otherwise. The correspondingreproducing kernel Hilber t space is the set of functions f of the form

RKHS and stochastic processes 59

f(t) = JoT g(t)(t-u)~F(u)dG(u)+Fog(t)G(O) for some real Fa and some

function F such that JoT IF(u) 12 dG(u) < +00 endowed with the norm

II f 117<= JoT IF(uW dG(u)+FJG(O). This result can be derived from theKarhunen representation theorem (see Subsection 3.3). See Exercise 7.Particular cases are the Wiener process with K(s , t) = min(s, t), g(t) = 1and G(t) = a2t and the pinned Wiener process or Brownian bridge withK(s, t) = min(s, t) - st, g(t) = 1 - t and G(t) = l~t. For [3 > 0,K (5, t) = a 2 exp( -[3 It - sl) defines a stationary markovian process withg(t) = exp(-(3) and G(t) = a 2 exp(2[3t). Note that a kernel of the form(2.2), also called triangular kernel, belongs to the family of factorizablekernels (see Chapter 7, Section 3).For the case of generalized stochastic processes, Meidan (1979) states anextension of LOEwe's theorem which exhibits the correspondence betweenthe class of covariance operators of GSP 's on T and the class of Schwartzkernels of 1J'(T).

THEOREM 28 The covariance operator of a GSP on T is a Schwartzkernel relative to 1)' (T) and conversely, given a Schwartz kernel R from1J(T) into 1J'(T) , there exists a (non necessarily unique) GSP X suchthat R is its covariance operator.

As for the case of stationary ordinary stochastic process (OSP), Gelfandand Vilenkin (1967) define t he spectral measure of a stationary GSPusing the Bochner-Schwartz theorem (see Subsection 1.7.2).Note that the characteristic function of a process is also of positive type:see Loeve (1963) for the case of an OSP and Gelfand and Vilenkin (1967)for the characteristic functional of a GSP defined by

C(¢) = E (exp (ih. x, ¢(t) d>.(t)))

for ¢ E 1J(T).

2.2.2 GENERALIZED COVARIANCES ANDCONDITIONALLY OF POSITIVE TYPEFUNCTIONS

The t heory of processes with stationary increments aims at gener­alizing stationarity while keeping some of its advantages in terms ofstatistical inference for applications to larger ranges of non-stationaryphenomena. Yaglom (1957) and Gelfand and Vilenkin (1961) developthis theory in the framework of GSPs. Considering the case of ordinarystochastic processes on T C lRd with stationary increments of order m ,

60 RKHS IN PROBABILITY AND STATISTICS

Matheron (1973) introduces the class of Intrinsic Random Functions oforder m , designated by m-IRF, thus inducing the development of appli­cations to the field of geostatistics.If a process is stationary, its increments, i.e. differences Xt+h - Xt, arealso stationary. The first step in generalizing stationarity is to considerthe larger family of processes whose increments are stationary. By tak­ing increments of increments, one is lead to the following definition ofgeneralized increments. Let us recall t hat Pm is the set of polynomialsof degree less than or equal to m.

DEFINITION 11 Given a set of spatial locations s; E T , 1 ~ i ~ k ,a vector v E IRk is a generalized increment of order m if it satisfies

2::7=1 ViP(Si) = 0, VP E Pm.

Ordinary increments Xt+h - X; correspond to order m = O. It is theneasy to define the class of order m stationary processes.

DEFINITION 12 For an integer m, a second order stochastic process X,is an m-IRF if for any integer k , for any spatial locations s; E T, 1 ~

i ~ k and any gen eral ized increment vector v of order m associated withthese locations, the stochastic process (2:7=1ViXsi+t , t E T) is secondorder stationary.

It is then clear that a O-IRF is such that ordinary increments (Xt+h ­Xt, t E T) are stationary. By convention , one can say that stationaryprocesses are (-I)-IRF. In the time domain, the m-IRF are known asintegrated processes: an ARIMA(O, m, 0) is an (m-l)-IRF. The class ofm-IRF is contained in the class of (m+l)-IRF.Let us introduce the variogram of a O-IRF. For a O-IRF, the variance ofX t+h - Xt , also equal to E(Xt+h - X t)2 is a function of h denoted by

where 2")' is called the variogram associated with the process and")' thesemi-variogram. An example of O-IRF is given by the fractional isotropicbrownian motion on IRd with semi-variogram ,(h) =11 h 11 2H for 0 < H <1. For a (-I)-IRF, we have the following relationship between variogramand covariance function

, (h) = K(O) - K(h) .

A function which plays a central role for m-IRF is the generalized co­variance. Matheron (1973) proves that for a continuous m-IRF X, thereexists a class of continuous functions GK such that for any generalized in­crement vectors v and u' associated with the locations s; E T , 1 :::; i :::; k ,

RKHS and stochastic processes

we havek k m

COV(LViXS j,LViXs,) = L VjVjGK(Si - Sj)i=l i=l i,j=l

61

The generalized covariance is said to be the stationary part of a nonstationary covariance. It is symmetric and continuous except perhaps atthe origin. Any two such function differ by arbitrary even polynomialof degree less than or equal to 2m. The reader should be aware thatthe term generalized here does not refer to generalized processes. It isinteresting to see what is the generalized covariance for a stationary RFand more generally for a O-IRF.

LEMMA 12 For a stationary RF and more generally for a O-IRF, thegeneralized covariance is the negative of the variogram.

Proof. Inserting a random variable placed at the origin (assuming thisone belongs to T), for a generalized increment vector t/ ,

k k m

Var(L ViXs;) = Var(L Vi(XSj - X o)) = L VjVjGK(Si - Sj).i=l i=l i,j=l

On the other hand1

= 2E (X Si - X o+ X o - XsJ

= -y(Xs;) + -y(XSj ) - GK(Si - Sj)

Hence solving for GK(Si - Sj) and reporting in the previous equation

k m

var(L ViXs;) (L Vi)( L VjXsJi=l j=lm

m m

+ (L Vj)( L ViXs;) - L VjVjGK(Si - Sj)i i= l m i,j=l

m

- L VjVjGK(Si - Sj)i,j=l

•It is clear from the definition and from the characterization of Section1.7.3 that the generalized covariance of a continuous m-IRF is a func­tion which is conditionally of positive type of order m. Matheron (1973)proves the converse, thus generalizing the correspondence between co­variances and functions of positive type to the case of generalized co­variances and functions conditionally of positive type. Matheron (1973)

62 RKHS IN PROBABILITY AND STATISTICS

also gives a spectral representation of the generalized covariance whichincorporates the Bochner theorem as a special case. Johansen (1960)proves that conditionally positive type functions are essentially the log­arithms of positive type functions.Gelfand and Vilenkin (1967) study the case of GSP with stationary in­crements of order m. The definition does not involve generalized differ­ences but rather derivatives of the GSP. They also develop the spectralanalysis of their covariance operator in connexion with the extension ofthe Bochner-Schwartz theorem mentionned in Section 1.7.3. Howeverthe natural extension of the correspondence between covariances andfunctions of positive type to the case of covariance operators of GSPwith stationary increments and Schwartz distributions conditionally ofpositive type is only implicit in their work.Covariances must be bounded , variograms must grow slower than aquadratic and in general the growth condition is related to the pres­ence or absence of a drift (Matheron, 1973) .

2.3. HILBERT SPACE GENERATED BY APROCESS

If 1£ is a Hilbert space and {4>(t) , t E T} is a family of vectors of1£, we will denote by l(4)(t), t E T) the closure in 1£ of the linear span£(4)(t) , t E T) generated by the vectors {4>(t), t E T} . This subspacel(4)(t) , t E T) of 1i. will be called the Hilbert subspace generated by{4>(t),t E T}.Similarly, let us now define the Hilbert space generated by theprocess l(X). First let the linear space spanned by the process £(X)be the set of finite linear combinations of random variables of the formX t i for ti E T . .L:(X) can be equipped with an inner product but does notnecessarily possess the completeness property. Let therefore l(Xt , t ET) (or simply £(X)) be the closure in £2(12, A, P) of £(X). £(X) isequipped with the inner product induced by that of £2(12 , A, P)

< U, V> l(x) = E(UV).

£(X) contains the random variables attainable by linear operations, in­cluding limits, on the measurements of the process.It is classical to add a subscript m to the expectation sign when it seemsnecessary to remind the reader that this expectation depends on themean function. Since E(UV) = Cov(U, V) + Em(U)Em(V) for any ran­dom variables U and V in l(X) , the inner product does depend on themean and therefore the random variables belonging to l(X) may not bethe same for all values of m, except if T is a finite set .It is interesting to know the links between the smoothness properties

RKHS and stochastic processes 63

of the process and the smoothness properties of its covariance kernel.Let us first recall the definitions of continuity and differentiability of aprocess.

DEFINITION 131fT is a metric space, the process X; is said to be con­tinuous in quadratic mean on T if

for all t E T, lim E IXs ..... X t l2 = 0s--tt

DEFINITION 141fT is a metric space, X, is said to be weakly continuousif for all t E T and all U E l(X)

lim E(UX s ) = E(UX t )s-.....tt

DEFINITION 15 X, is said to be weakly differentiable at t if the limit

lim E((XHh - Xt)(XHhl - X t ) )

h.h'~O h h'

exists and is finite.

The continuity of the process can be linked to the continuity of thecovariance kernel by the following two theorems.

THEOREM 29 The process X; is continuous in quadratic mean on T ifand only if(i) R is continuous on TxTor if(ii) for all t E T, R(., t) is continuous on T and the map s --+ R(s, s)is continuous on T.

Note that this property is a remarquable property of covariance functionrelative to continuity which was proved by Loeve (1978).

THEOREM 30 The process X, is weakly continuous on T if and only iffor any point t in T and any sequence t-. converging to t as n tends to 00,

X tn cotiverqes uieaklu to X, as elements ofl(X), i.e. limn~ooE(UXtn) =E(UX t) for all U E l(X).

The differentiability of the process can be linked to the differentiabiltyof its kernel by the following theorem.

THEOREM 31 X, is weakly differentiable at t if and only if the secondpartial derivative a~;t' R(t, t') exists and is finite on the diagonal t = t'.

Sufficient conditions for the separability of l(X) are given by the thefollowing two theorems.

64 RKHS IN PROBABILITY AND STATISTICS

THEOREM 32 If T is a separable metric space and if X, is continuousin quadratic mean, then £(X) is a separable Hilbert space.

THEOREM 33 If T is a separable metric space and if X, is weakly con­tinuous, then £(X) is a separable Hilbert space.

In the case of generalized stochastic processes, £(X) is defined to be theclosure in L2(0) of the range space of X.Let us now introduce another important Hilbert space attached to aprocess which can be called the non linear Hilbert space generated bythe process. Let us denote by N(X) the space of all linear and nonlinear functionals of the process, i. e. the space of random variableswhich have finite second moments and are of the form h(Yi1 , ••• , Yin) forsome integer n, some t1 , ••• .t; E T, and some measurable function h on]RT . It will be endowed with the same inner product as £(X).

3. REPRESENTATION THEOREMSLet {Xt , t E T} be a second order stochastic with covariance func­

tion R. The purpose of the representation theorems is to find variousconcrete function spaces isometrically isomorphic to the Hilbert spacegenerated by the process £(X).

DEFINITION 16 A family of vectors {4>(t), t E T}, in a Hilbert space Hequipped with an inner product < .,. >H is said to be a representationof the process X, if for every sand t in T

< 4>(t) , 4>(s) >H= R(t, s).

In other words, the family {4>(t), t E T} is a representation of the processX, if the Hilbert space £(4)(t), t E T) is congruent (or isomorphic to theHilbert space generated by the process.The proofs of representation theorems are generally based on the follow­ing result.

THEOREM 34 (BASIC CONGRUENCE THEOREM) Let HI and H 2 be twoabstract Hilbert spaces respectively equipped with the inner products< .,. >1 and < .,. >2. Let {u(t) , t E T} be a family of vectors whichspans HI and {v(t) , t E T} be a family of vectors which spans H 2 • If forevery sand t in T

< u(s), u(t) >1=< v(s), v(t) >2

then HI is congruent to H2.

If {4>(t) , t E T} is a family of vectors in a Hilbert space 1l, and if K is de­fined by K(s, t) =< </>(s) , </>(t) >, then 1lK is congruent to £(</>(t) , t E T)

RKHS and stochastic processes

by the map J:J(K(. , t)) = ¢(t)

called the "canonical congruence".

3.1. THE LOEVE REPRESENTATIONTHEOREM

65

The first representation theorem due to LOEwe yields 1lR as a repre­sentation for X« .

THEOREM 35 (LOEVE'S THEOREM) The Hilbert space l(X) generatedby the process {Xt, t E T} with covariance function R is congruent tothe RKHS 1lR.

Proof. Let 'l/J be the map from l(X) to Hn defined on .c(X) by

n n

Val, a2,·· .an E JR , 'l/J(2: aiX d = 2:aiR(ti,.)i=l i=l

Extend 'l/J to l(X) by continuity. It is then easy to see that this map isan isometry from l(X) to Hn since

< x.;x, >l(X)= R(t,s) =< R(t , .) , R(s,.) >'H.R

•The following properties are simple consequences of this theorem.

(2.3)

and

Proof. Let us give details for (2.3) only. Em ('l/J-I(R(t, .))) = Em (Xt ) =m(t) =< m, R(t,.) >'H.R and therefore (2.3) is true for g = R(t , .). It isthen true for any g in 1lR by linearity and continuity. •

66 RKHS IN PROBABILITY AND STATISTICS

A corollary of this theorem is the following.

COROLLARY 6 If a linear map p from l(X) into 1iR satisfies for all hin 1iR and all t in T , E(p-l(h)Xt ) = h(t) , then p coincides with thecongruence map which maps X, into R(t , .).

Another corollary is that 'HR coincides with the space of functions h E]RT of the form h(t) =E(XtU) for some random variable U E £(X).Let {a(t , .}, t E T} be a family of elements of 1iR and{Zt = 1/J- I(a(t , .)), t E T} the corresponding family of random variablesof l(X). a(t ,.) is called the representer of Zt and it is intuitively ob­tained from R (t ,.) by t he same linear operations by which Zt is obtainedfrom X t . For example

z, =hA(t, s)XsdA(s) ¢::::::> a(t,.) =hA(t, s)R(t, s)dA(s)

Symetrically, any random variable Zt in £(X) can be expressed asZt = 1/J (h (t, .)) for some h(t ,.) E lRT .

Unfortunately, this nice correspondence has the following limitation:in general , with probability one, the sample paths of the process donot belong to the associated RKHS. For a proof of this fact see Ha­jek (1962) . Wahba (1990) gives the following heuristic argument: ifX, = E~=l ( n<Pn(t) is the Karhunen expansion of the process (see Sec­tion 3.2) , t hen for any finite integer k,

k 2 k (~E( L ( n<Pn(t) ) = E(L An) = k

n=l n=l

and therefore tends to 00 as n tends to 00. Let us consider for examplea Brownian motion process on (0,1) with R(s , t) = min(s, t) . Then1iR is the reproducing kernel Hilbert space described in Example 4 ofChapter 1. It is well known that the sample path of this process arealmost surely not differentiable unlike elements of HI (0,1). In the samefashion, Driscoll(1973) gives a necessary and sufficient condition for thesample paths of X to belong to 1iR under certain additional conditions.

THEOREM 36 Let (Xt , t E T) be gaussian with m ean value m and propercovariance funct ion K . A ssume that K is continuous on T xT andthat almost all the sample paths of X are continuous on T . Let Rbe a continuous posit ive kernel on T xT such that m E 'HR . Theneither P(X E 1iR) = 1 or P(X E 'HR) = 0 according as the seriessUPnEN trace(KnRn) is summable or not where R n and K n denote re­spectively the restrictions of Rand K to the first n elements of a count­able den se subse t of T .

RKHS and stochastic processes 67

The answer to such question will be of interest in the detection of signalin noise problems (see Section 6.4).Let us characterize 1{n when the process is stationary on T = JR with auniformly bounded spectral density Ix . If fx never vanishes, then Hnconsists of all square integrable functions 9 such that

r 2 1JnJFg(w) I fx (w)d>.(w) < 00. (2.4)

If fx vanishes on the set N, then Hn consists of all square integrablefunctions 9 such that :F9 vanishes on N and satisfying (2.4). For thedetails see Parzen (1963) or Kimeldorf and Wahba (1970) .Is it possible to use the reproducing kernel Hilbert space associated withthe proper covariance function to build a representation for X t? Asalready mentioned , the space l(X) may depend upon m. However withthe additional assumption that m belongs to a subset M of 1{K , thenaccording to Parzen (1961b) , this space is the same for all m: the set 1{K

is equal to the set 1{n but the two spaces are equipped with a differentnorm (see examples of this situation in Chapter 6).Then one can define an isomorphism "p between L2(X) and H« by

"p(X t ) = K(t, .),

and it is such that

Additional properties in that case are

Em ("p- l (g)) =< m,g >K , 'V mE M , 'V g E H« ,

and

Var("p-l(g)) =11 9 Ilk .Meidan (1979) establishes the same correspondence for the case of aGSP.

THEOREM 37 Let X be a GSP and R its covariance operator. Thenthere exists an hilbertian isomorphism between the subspace of distribu­tions ?in generated by R and the linear space l (X) generated by X .Conversely, given a Hilbert space of distributions H on T , there exists aGSP X such that the kernel operator R of H is related to X by

R=X'X

68 RKHS IN PROBABILITY AND STATISTICS

The following two theorems show that a number of properties of theprocess can be inferred from simple properties of the elements of 1iRand that the spaces 1iR will generally consist of continuous or evendifferentiable functions for regular processes (see Parzen (1959)).

THEOREM 38 If T is a metric space and X, a process with covariancefunction R , then X, is weakly continuous if and only if any function of1iR is continuous on T.

THEOREM 39 If T is an interval of the real line, and if X, a processwith covar iance kernel R, then X, is weakly differentiable at all pointsof T if and only if any function of 1iR is differentiable on T.

For an m-IRF, we conjecture that one could establish a similar type ofcongruence between the space of generalized increments and any semi­Hilbert space associated with the generalized covariance.A representation theorem for the non linear Hilbert space N(X) is foundin Duttweiler and Kailath (1973). Without giving details , let us mentionthat it is related to the characteristic functional of t he process C (¢) =E(exp(i IT X t¢(t ) dA(t)) ) for ¢ a real function ranging in an appropriatelinear space and the isome tr y between N(X) and 1ic is linked to thefact that

E (exp (i lr Xt¢(t)dA(t)) exp (i lr Xt'ljJ(t)dA(t))) = C('ljJ - ¢).

3.2. THE MERCER REPRESENTATIONTHEOREM

Mercer 's theorem (Riesz and Nagy, 1955) yields a representation the­orem for a process which is continuous in quadratic mean (continuouscovariance function). Let X, be a second order stochastic process, con­tinuous in quadratic mean, indexed by a finite closed interval T = (a, b)with covariance function R.

THEOREM 40 (MERCER'S THEOREM) If R is a continuous positive def­inite fun ction, then there exists a sequence of eigenf unctions ¢n (.) E 1iRand a sequence of corresponding nonnegative eigen values such that

lb

R (s , t) ¢n (s)dA(s) = An¢n(t)

I b

¢n(s) <Pm(S)dA(S) = c:

RKHS and stochastic processes

where omn is the Kronecker delta function . Moreover we have

00

R(s, t) = L An¢>n(S)¢>n(t)n=l

69

(2.5)

where the series converges absolutely and uniformly on T .

It is then easy to characterize 1iR as the set of 9 E L2(T) for whichthere exists a sequence (an) with L:~=l a;An < 00 such that g(t) =L:~=l Anan¢>n(t) endowed with the inner product

00 00 00

< L Anan¢>n(.), L Anbn¢>n(') >= L anbnAn.n=l n=l n=l

Equivalently, as in Nashed and Wahba (1974), it is the set of 9 E L2(T)

such that

00 1 rL An (J7 g(t)¢>n(t)dA(t))2 < 00,n=l T

with inner product

< I, 9 >=~ ;n !r f(t)¢>n(t)dA(t) !r g(t)¢>n(t)dA(t)

One can define an operator R from L2(T) to L2(T) by

R(g)(s) = !r R(s, t)g(t)dA(t),g E L2(T).

This operator is a self adjoint Hilbert-Schmidt operator. Its square roothas the following representation

The pseudo-inverses of Rand Rl/2 are respectively given by

and

70 RKHS IN PROBABILITY AND STATISTICS

where xt = 0 if x = 0 and xt = ~ otherwise (see Nashed and Wahba(1974)).The following orthogonal decomposition of the process in terms of aseries is called the Karhunen-Loeve expansion.

COROLLARY 7 ( K A RHUNEN - L o EV E EXPANSION) Under the above con­dit ions, there exists a sequence of random variables (n such that E (n(m) =>'n~m ,n and x, = E~=l (n¢n(t).

Note that the advantage of this representation of the process is that itisolates t he manner in which the random function X t(w) depends upont and upon w.Proof. Let f:n(t) = >'n¢n(t) . Define en by en = "p- l (f:n). The secondmoments of en then follow from the properties of "p. The rest followsfrom

00 00

"p (X t ) = R(t ,.) = L:>'n¢n(t) ¢n(.) = L:f:(.)¢n(t).n=l n=l

•For the case of a GSP, Meidan (1979) proves an extension of Mercer 'stheorem with a series expansion of a GSP and of its associated covarianceoperator.

3.3. THE KARHUNEN REPRESENTATIONTHEOREM

At last, we will give the Karhunen representation theorem or integralrepresentation theorem which represents the process as a stochastic in­tegral for the case of kernels of the form (2.6) below. Given a measurespace (Q, B, J-L) , a family of random variables {Z(B) , B E B} is calledan orthogonal random set function with covariance measure J-L if for anytwo sets B1 and B2 in B, E(Z(BdZ(B2 ) ) = J-L(B1 n B2 ) . As in Sec­tion 2.3 , the Hilbert space generated by {Z(B) ,B E B} is the closure of£(ZB, BE B) (see Parzen , 1961a).

THEOREM 41 (KARHUNEN'S REPRESENTATION THEOREM) Let X t de­note a second order stochastic process defined on the probability space(D,A,P) with covariance function R. Let {f(t, .) ,t E T} be a fam­ily of fun ctions in L2 (Q, B , J-L) for a m easure space (Q ,B, J-L) such thatdimL2(Q , B , J-L) ::; dimL2(D , A, P) and such that

R(s , t) = kf(s, ,)f(t, ,)J-L(J) (2.6)

holds. Th en {f(t , .) , t E T} is a representation for X t . If moreover{f(t, .) , t E T} spans L2 (Q, B , f..L) , then there ex ists an orthogonal random

RKHS and stochastic processes

set Junction {Z(B), BE B} with covariance measure J.l such that

x, = hJ(t)dZ.

71

We refer the reader to Karhunen (1947) or Parzen (1959) for a proof ofthis result. Note that a kernel of the form (2.5) is also of the form (2.6)if we let Q be the set of integers, B be the family of all subsets of Q andJ.l be the discrete measure J.l(n) = An.Under the conditions of Theorem 41, it is easy to see that 1iR consistsof all functions ¢> on T such that ¢>(t) = JQ F(,)¢>(t,,)dJ.l(') for some

unique FE £(¢>(t , .), t E T) endowed with the norm

II ¢> lI~n=11 F 11l,2(1J) .

COROLLARY 8 Under the assumptions oj Karhunen's theorem, any ran­dom variable U in £(X) may be represented as U = JQ g(t)dZ Jor someunique 9 E L2(Q,B, It).

In the case J(t,,) = exp(it,), K is a translation invariant kernel corre­sponding to a stationary process and this representation is the classicalspectral decomposition.Example 1 The Wiener process on (0, T) has covariance function R(s, t) =min(s, t) = J[ (s - u)~ (t - u)~dA(u). It follows that 1iR is the set offunctions of the form J(t) = Ji F(u)dA(u), t E (0, T) with F E L2(0 , T)endowed with the norm: II J II1ln =11 J' 11l,2([o,T))'Example 2 Continuous time autoregressive process of order mA continuous time autoregressive process of order m may be defined asa continuous time stationary process with covariance function given by

K(s, t) = 100

exp(i(s - t)w) 2 dA(w)- 00 211" 1L:~o ak(iw)m -kl

where the polynomial L:;=o akZm-k has no zeros in the right half of thecomplex plane. Its corresponding RKHS Hn is given in Chapter 7. Ofparticular interest are the cases m = 1 and m = 2 corresponding to pro­cesses which are solution of a first (respectively second order) differentialequation whose input is white noise.Example 3 The isotropic fractional Brownian motion on R d is a cen­tered gaussian field whose covariance is given by

K(s, t) = ~{II s 112H + II t 11 2H

- II s- t 112H

} ,

72 RKHS IN PROBABILITY AND STATISTICS

where 0 < H < 1. Cohen (2002) describes the associated RKHS bywriting the covariance as

K( ) = _1 r (exp(isO) - l)(exp( -itO) - 1) d>'(O)s, t Gil JRd (2rr )d/2 11 0 IId+2H .

He then derives its Karhunen-Loeve expansion and uses the RKHS tobuild generalizations offractional Brownian motion and study the smooth­ness of the sample paths.

3.4. APPLICATIONSAn important class of applications of congruence maps is provided by

the following theorem. This result is important in giving an explicit wayof computing the projection of an element of a Hilbert space onto theHilbert subspace generated by a given family of vectors. Examples inspecific context will be given later in Section 4.

THEOREM 42 Let {f(t) , t E T} be a family of vectors in a Hilbert space1£, and K be the kern el of the Hilbert subspace generated by this fam­ily. Let h be a function on the index set T. A necessary and sufficientcondition for the problem

< x , f(t) >1i= h(t) Vt E T (2.7)

to have a necessarily unique solution in the space l(J(t) , t E T) is thath belong to 1£K. In that case, the solution is also the vector of minimumnorm in 1£ which sat isfies the "interpolating conditions" (2.7). It isgiven by

where 'if; is the canonical congruence between l(J(t), t E T ) and 1£K ,and its norm is II x 11=11 h II1iK'

Proof. If (2.7) has a solution x in £(J(t) , t E T), let 9 = J:' (x), then

h(t) =< J(g ) , f(t) > = < J(g), J(K(t, .)) >< g , K(t , .) > = g(t)

and hence h = 9 and h E 1£K. Conversely, if h E 1£K , then J(h) is asolution. To check the minimum norm property, it is enough to notethat if Xl denotes the orthogonal projection of a vector x E 1£ onto£(f(t) , t E T), then x satisfies (2.7) if and only if Xl satisfies (2.7) andII X II~II X l II· •

RKHS and stochastic processes 73

Example 1 In particular, if T is finite and the vectors of the family{f(t) , t E T} are linearly independent, then the matrix (K(s, t))s,tET isnon-singular and the solution is given by

x = L h(t)K-1(s, t)f(s)s,tET

(2.8)

Example 2 If (2.7) is a system of m equations in ~n corresponding tothe rectangular matrix of coefficients A, this theorem defines the Moore­Penrose inverse (or pseudo-inversejAl of the matrix A since the elementof minimal norm satisfying the system is then x = Ath where At =A'(AA')-l. In the square matrix case At coincides with the ordinaryinverse.Example 3 A particular case of equation (2.7) appears when one seeksthe solution in L2(a, b) of an integral equation (of the first kind) of the

form J: x(s)K(s, t)d>.(s) = h(t). In this sense (2.7) may be consideredas a generalized integral equation (see Nashed and Wahba, 1974).Concrete applications of a representation theorem necessitate a criterionfor showing that a given function h belongs to 1iR and ways of calculatingtjJ-l(h). The following theorem (Parzen, 1959) give selected examples ofsuch results when h is obtained by some linear operation on the familyof kernels.

THEOREM 43 Let H be a Hilbert space, {f(t) , t E T} a family of vectorsof 1i with K(s, t) =< f(s), f(t) >1{ and T an interval of the real line.Let h be a function ofRT.

(1) If there exists a finite subset Tn of T and real numbers c(s), s E Tn'such that

h(t) = L c(s)K(t, s)sETn

then h E 1iK and

tjJ-l(h) = L c(s)f(s)sETn

II h 112= L c(s)c(t)K(s, t) = L c(t)h(t)

s,tETn tETn

(2) If there exists a countable subset T= ofT and real numbers c(s), fors E T= , such that

L c(s)c(t)K(s, t) < 00

s,tET""

74

and the function h satisfies

RKHS IN PROBABILITY AND STATISTICS

h(t) = 2: c(s)K(t, s),sEToo

then h E llK and'IjJ - l (h) = 2: c(s)f(s)

«r;

" h W= 2: c(s)c (t)K(s, t) = 2: c(t)h(t)s ,tET00 tET00

(3) If there exists a continuous funct ion c on T such that

1.1. c(s)c(t)K(s, t)d>.(s)d>.(t) < 00,

and if the function h satisfies

h(t) = 1 c(s)K(t, s)d>'(s)sET

then h E H« and

'IjJ - l (h) = 1 c(s)f(s)d>.(s)sET

II h 112= 1 1 c(s)c(t)K(s, t)d>.(s)d>.(t) = [ c(t)h(t)d>'(t)

sET tET JET

(4) If there exists a function of bounded variation V on T such that theRiemann Stielges integral IT IT K(s, t)dV(s)dV(t) is finite and if thefunction h satisfies

h(t) = 1. K(t, s)dV(s),

then h E llK and

'IjJ - l (h) = hf(s)dV(s)

II h 112= l h(t)dV(t)

(5) If there exists an integer m such that the partial derivatives

a2i

at i at i K (s , t)

RKHS and stochastic processes 75

are finite for i = 1, ... , mand if h(t) = L:~o aj%ti, K(t , to) for some to,

then h E 1lK and

Example 4 Let X be a continuous time autoregressive process of order1 on (a, b) as in Example 2 of Section 3.2 with covariance K(s , t) =C exp(-(3 Is - tl) for (3 > O. Then it is easy (see Parzen, 1963) to seethat

2~C {(321b

h(t)Xtd>..(t) +l bh'(t)dXt }

1+2C {h(a)Xa + h(b)Xb}.

4. APPLICATIONS TO STOCHASTICFILTERING

Historically, what follows is in the direct line of the works of Kol­mogorov (1941) and Wiener (1949). Wiener established the relationshipbetween time series prediction problems and solutions of the so-calledWiener-Hopf integral equations (see Kailath (2000) and 2.10) I althoughat the time nothing was explained in terms of reproducing kernels.It is in the works of Parzen (1961) that this tool is introduced to tacklethese prediction problems and that the Loeve representation theoremis shown to yield explicit solutions. Given the link that we establishedbetween positive definite functions and covariance functions of secondorder stochastic processes, it is natural to expect that problems relatedto these processes may be solved by reproducing kernel methods. In thenext sections, we present examples of correspondence between stochasticfiltering problems and functional ones. More precisely, we will show that,in many instances, best linear prediction or filtering problems can betranslated into an optimization problem in a reproducing kernel Hilbertspace. The dictionary for this translation is based on the Loeve repre­sen tation theorem.In statistical communication theory, the signal plus noise models, used tomodel digital data transmission systems, present the data X, as the sumof two components: the signal component S, and the noise component

76 RKHS IN PROBABILITY AND STATISTICS

Nt with zero mean. In the sure signal case the signal S, is a nonrandomfunction. In the stochastic signal case , the signal is independent of thenoise. The proper covariance of the signal and noise will be denotedrespectively by Ks and KN.In some cases the signal will be assumed to belong to the class of signalsof the form St = Ef:=l (hd/(t) where () are unknown (random or not)parameters and dl are known functions.

4.1. BEST PREDICTIONIn this section, we summarize the work of Parzen concerning the fact

that reproducing kernel Hilbert spaces provide a formal solution of t heproblems of minimum mean square error prediction.

4.1.1 BEST PREDICTION AND BEST LINEARPREDICTION

Let us define the problem of best prediction and best linear predictionof a random variable U based on the observation of a process {yt , t E T}.Assume that U and {yt , t E T} belong to L 2 (0 , A ,P) . Let

E(UYt ) pu(t )E(ytYs ) = R(t , s)

E(yt) = m(t).

We assume that E(U2) , pu(t), and R(s, t) are known.As in the previous section, llR and l(Y) denote respectively the repro­ducing kernel Hilbert space with kernel R and the Hilbert space gener­ated by the process {yt, t E T}, and 'lj; denotes the isometric isomorphismbetween them. Recall that N(Y) denotes the non linear Hilbert spacegenerated by Y.

LEMMA 13 For any U in l (Y) , the function PU belongs to llR

Proof. If U = Ei=l aiYi , then pu(t) = L:~l aiR(t , ti) is a linear com­bination of functions of Hn and therefore it is true by completeness andcontinuity. Now if U ~ l(Y), let U = U1+ U2 with U1 E l(Y) andU2 E l(Y).L. Then E(Uyt) = E(U1yt), hence pu(t) = PUI (t) which~~~~. .DEFINITION 17 A random variable U* is called best predictor (BP) ofU based on {Yi, t E T} if it minimizes E(Z - U)2 among elements Z ofN(Y).

It is a classical resul t (see Neveu , 1968) that the best prediction of Ubased on {yt, t E T} corresponds to the conditional expectation of U

RKHS and stochastic processes 77

given {yt, t E T} denoted by E(U I yt, t E T) and that it is the uniquerandom variable in N(Y) with the property that

E(VE(U I yt, t E T») = E(VU), VV E N(Y).

However, the computation of the conditional expectation involves know­ing the joint distribution of U and yt . When this distribution is un­known, but only first and second moments are available, best linearprediction is a possible alternative.

DEFINITION 18 A random variable U* is called best linear predictor(ELP) of U based on {yt, t E T} if it minimizes E(Z - U)2 amongelements of l(Y) .

Note that the BLP is nothing else than the projection in the Hilbertspace £2(0, A, P) of the random variable U onto the closed subspacel(Y) generated by the process. It is therefore the unique random vari­able U* in l(Y) with the property that

E(VU*) = E(VU), VV E l(Y) .

In the gaussian case, Neveu (1968) proves that BLP and BP coincide.The following theorem characterizes BLP.

THEOREM 44 The ELP of U based on {yt, t E T} is given by U* ='l/J-l(PU), where'l/J is the canonical congruence betweenl(yt,t E T) and1iR where R(t, s) = E(ytYs ) . The mean square error of prediction isthen

E(U* - U)2 = E IUI 2 - II PU IIhProof. It is clear that the BLP Z is solution in l(Y) of the equationE(Zyt) = E(Uyt) ,Vi E T. Note that this system of equations may haveno solution or an infinity of solutions in L2 (0, A,P). Restricting thesearch to the subspace l(Y) is equivalent to looking for an element ofminimal norm solving this system, and in this subspace the projectiontheorem ensures the existence and uniqueness of the solution. It is thenenough to use the congruence map 'l/J to translate the problem into aproblem in 1lR in which the solution g is the unique solution in 1lR ofthe system of equations

< g, R(t,.) >= E(Uyt) = pu(t) Vt E T. (2.9)

To compute the mean square error of prediction , just note that if Z =1jJ-l(g) then

E(Z - U)2 = E(U2)- < oii.ou >R + < g - PU,g - PU >R

78 RKHS IN PROBABILITY AND STATISTICS

(2.10)

When T is an interval (a,b), one may write heuristically a random vari­able of l(Y) in the form J: w(t)X (t)d>.(t) and the function w corre­sponding to the BLP must satisfy the Wiener-Hopf equation

lb

w(t)R(s, t)d>.(t) = pu(s) , 'V s E (a, b).

To solve problem (2.9) in 1iR, one may use Theorem 42 and Theorem43. If one can express E(UYt) in terms of linear operations of the familyof functions {R(t,.), t E T}, then the BP may be expressed in terms ofthe corresponding linear operations on {Yt, t E T}. Let us illustrate thisresult by two examples from Parzen.Example 1. For a mean zero process {Yt, t E T}, find the best predic­tion ~~ of U = Yto based on the observation of {Yt, t E Tn} for a finitenumber of points Tn = {t1, ... ,tn}. From (2.8), the solution is thengiven by

n

~~ = LaiYt;i= l

where the vector of weights ai is given in terms of the matrixR = (R(ti, tj))i,j=l ,..n and the vector rto = (R(to, ti))i=l, ..n by a =R-1rto'Example 2. For the autoregressive process of order 1 described inExample 4 of Section 3.4, with covariance K(s, t) = C exp(-{3lt - sl)for ({3 > 0, C > 0), we look for the BP Y* of Yb+c , C > 0 given thatwe observe {Yt, t E (a, b)}. It is enough to note that Cov(Yt, Yb+c) =exp(-{3c)K(b,t) to conclude y* = exp(-{3cYb)'Example 3. Under Mercer's theorem assumptions, the BLP of U basedon the observation of Yt, t E (a, b) is given by

00 1 Ib Ib

U* = ~ >'n a PU(t)<Pn(t)d>.(t) a Ys<Pn(s)d>.(s)

A similar result for characterizing BP can be found in Parzen (1962a).Let ~u(t, v) = E(U exp(ivYt)) and K(s, u; t, v) = E(exp(iuYs) exp(ivYt)).Note that K is the two-dimensionnal characteristic function of the pro­cess Yt and that it is also a reproducing kernel.

LEMMA 14 For any U in L2 (y ), the function ~u(t, v) = E(U exp(ivyt))belongs to 1iK.

Then the BP of U based on {Yt, t E T} is the unique random variableU* of minimal norm E(U*2) satisfying

E(U* exp(ivYt)) = ~u(t, v) 'V t, 'V v.

RKHS and stochastic processes 79

The following result is then clear.

THEOREM 45 The BP of U based on {l't , t E T} is given by U* ='l/J-l(~U), where 'l/J is the canonical congruence between N 2(l't, t E T)and llK where K(s, u; t , v) = E(exp(iuYs ) exp(ivl't)).

Therefore if one can find a representation of ~u(t, v) in terms of linearoperations on the kernels K(.,.; t, v), then the BP can be expressed interms of the corresponding operations on the family of random variables{exp(ivl't), t E T, v E lR} .Best linear prediction can be applied to signal plus noise models as inParzen (1971). The general filtering model is of the following form.Given an observed process X t , predict the signal St knowing that X; =s, + n; 0:S t :S T, with E(Nt) = 0, Nt uncorrelated with X t . If wedenote by Ks and KN the respective covariances of the signal and thenoise, the BLP of S, based on X, is given by 'l/J-l(Ks( ., t)) where 'l/J isthe canonical isomorphism between l(X) and llKs+KN"

4.1.2 BEST LINEAR UNBIASED PREDICTIONAssume now that we have the additional information that the un­

known m(.) belongs to a known family of mean functions M C H«. Inthat case, the reproducing kernel Hilbert space associated with {l't , t ET} by Loeve theorem is included in Htc- Let Cov(U, l't) = pu(t) andCov(l't, Ys ) = K(t, s). We assume that pu(t) , K(t, s) and Var(U) areknown.Assume moreover that Var(U) and Cov(U, l't) is independent of the un­known mean m and that the random variable U is predictable that is tosay, there exists a function u« Htc such that Em(U) =< h, m >K , forall mE M.As in the previous paragraph, it is easy to check that PU belongs to H«.

DEFINITION 19 A random variable U* is called uniformly best linearunbiased predictor (BLUP) of U based on {l't, t E T} if it minimizesE(Z - U)2 among elements Z of £(Y) satisfying E(Z) = E(U) for allmEM.

THEOREM 46 Under the above assumptions, the uniformly BLUP of thepredictable random variable U based on {l't, t E T} is given by U* ='l/J-l (g), where'l/J is the canonical congruence between £(Y) and 1iK, and9 is the unique minimizer in 1iK of II 9 - pu 11 2 subject to < g , m >=E(U) for all m EM. It is also given by

80 RKHS IN PROBABILITY AND STATISTICS

where h is the un ique function of minimal norm in JiK among functionssatisfying < h, m >= E(U)- < PU, m > for all m E M. The meansquare error of prediction is then

E(U* - U)2 = Var(U)- II PU Ilk + II E*(h - PU IM) Ilk .

Proof. It is easy to check that for any h in JiK, we have

Cov(U, ljJ-l(g)) =< PU,9 > .

Therefore

E(1j;-l(g) _ U)2 = Var(U)+ II PU 112 + II 9 - PU 11

2•

Hence it is enough to minimize II 9 - PU 11 2 in JiR among functionssatisfying < g, m >= E(U) since E(1j;-l(g)) =< g, m >. •

A frequent case in the applications is when the Hilbert space generatedby the mean functions in H«, ll(M), is finite dimensional. It is thenpossible to write explicit solutions.

4.2. FILTERING AND SPLINE FUNCTIONSIn the situations we examine in this section, the equivalent problem

in Hn can be transformed further into a classical type of minimizationproblem, thus relating filtering problems to spline functions. A briefintroduction to the variational theory of spline functions can be foundin Chapter 3.The interest in these relationships dates back to Dolph and Woodbury(1952) . Several authors report such links at different levels of generality:Kimeldorf and Wahba (1970a and b, 1971), Duchon (1976), Weinert andSidhu (1978), Weinert, Byrd and Sidhu (1980), Matheron (1981), Salka­uskas (1982), Dubrule (1983), Watson (1984), Kohn and Ansley (1983,1988) , Heckman (1986), Thomas-Agnan (1991)' Myers (1992), Cressie(1993), Kent and Mardia (1994) and Mardia et al (1995) .Most of these papers are concerned with the Kriging model. In geo­statistics, Kriging is the name given by Matheron (1963) to the BLUPof intrinsic random functions after the South African mining engineer D.Krige . Kriging models account for large scale variation in spatial databy spatial trend and small scale variation by spatial correlation. Besidesmining where it originated, the range of applications encompasses forexample hydrology, environmental monitoring, meteorology, econome­try. Mathematically, the method relies on BLUP. Several approaches tothe connection between Kriging and splines range from a purely alge­braic level to more sophisticated presentations. Our point of view will

RKHS and stochastic processes 81

be to underline that this link is due to the common role of reproducingkernel Hilbert spaces in these two theories. We will use a frameworkcommon to the above quoted equivalence results. However in order notto hide the mechanism with superfluous abstraction , we will not choosethe most general framework possible and will mention later the possibleextensions. For the same reason, we favor simple algebraic arguments.Many of the papers discussing the relative merits of Kriging and splines(see for example Lasslet, 1994) raise the controversial questions of whichis more general or more efficient . In the Kriging approach , the gener­alized covariance is usually estimated from the data whereas it is an apriori choice in the spline approach. Our belief is that the links are moregeneral than they are usually presented, in particular by those who claimthat Kriging is more general than splines. As Matheron (1980) pointsout, if we limit ourselves to Lg splines, we encompass only a limitedclass of random functions. This attitude reflects the fact that the con­crete splines often used just form a small fraction of the abstract splinesfamily.In the model we consider, the process of interest or signal yt is decom­posed into a deterministic part (or drift, or trend) D(t) and a randompart (or fluctuation) X, which will be assumed to be a mean zero secondorder process with known proper covariance K

yt = D(t) +X,

In practical applications, the second order structure must be estimatedfrom the data. The set T is a subset of]Rd whose elements we will callpositions or time in dimension d = 1. For simplicity, we now limit ourinvestigation to the case of a finite number of observations Yi , i rangingfrom 1 to n, which are realizations of random variables Yi related to theprocess yt by

i = 1, ..n (2.11)

The variables Ei, which model the noise in the measurement of the sig­nal, are LLd. with variance (12 and independent of the process Xt. Wewill assume throughout that the design points ti are such that the ma­trix K(t j, tj) is invertible. Since K(t j, tj) =< K(tj, .) , K(tj, .) >1iK' thisGram matrix is invertible if and only if the functions K (t j, .) are linearlyindependent which happens if and only if there exists no linear combi­nation of the Xti which is almost surely constant.As in the previous section , H« and l(X) denote respectively the re­producing kernel Hilbert space with kernel K and the Hilbert spacegenerated by a process X t, and 'l/J denotes the isometric isomorphismbetween them.

82 RKHS IN PROBABILITY AND STATISTICS

4.2.1

The aim is to predict, from the observations, the signal Yi for a value oft possibly distinct from the design values tl, ..tn .

We will proceed gradually starting from the simplest model in order toemphasize the respective role of each factor in the correspondence. Wefirst give three rather general correspondence results before specializingto the Kriging models.

NO DRIFT-NO NOISE MODEL ANDINTERPOLATING SPLINES

In this paragraph, we assume that there is no noise, i.e. 0 2 =0, andno deterministic component D(t) = 0, so that Yi = X, and Yi = X t i .

THEOREM 47 The ELP of Y, based on Y1"'.Yn is equal to 'l/J-l(ht ),

where ht is the interpolating spline which solves the following optimiza­tion problem in 1iK

Minllgll~Kg E 1iK9(ti) = K(til t), i = 1, ... , n

(2.12)

where 'l/J is the canonical isometry between l(X) and 1iK .

Proof. First note that in the notations of Theorem 44, U = Yi, T ={t1, ... tn } and PU(ti) = K(t ,ti). Since T is finite, l(Y1, ... ,Yn ) is theset of finite linear combinations of Y1 , . .. Yn . Furthermore, if we denoteby K T the restriction of the kernel K to the set T , then the 1lR ofTheorem 44 is the finite dimensional vector space 1iKT generated by thecolumns of the matrix E with (i, j)th element (K (ti, tj) (i, j = 1, ... , n).By Theorem 44 the BLP of Yi based on Y1 , .•. Yn is given by 'l/J1} (pu)where 'l/JT is the canonical isomorphism between £(Y1 , ... , Yn ) and 1lKT'By Lemma 13, PU belongs to 1lK as well as to 1lKT' Now if we applyTheorem 42 with H = 1lK and the family of vectors f(t) being givenby the functions K (ti , .), i = 1, ... , n or we apply this same theoremwith 1l = 1lKT and the family of vectors f(t) being given by the vectorswhich are restrictions of these functions to the set T, we get the samelinear system of equations for the coefficients. Therefore we concludethat 'l/Ji } (pu) coincides with 'l/J - l (h) where h is the element of minimalnorm in 1lK satisfying the interpolating conditions h(ti) = K(ti, t), i =1, ... ,n. _

For practical purposes, the BLP U" of Yi based on Y1 , • . . Yn is given bythe finite linear combination U" = L:~=l >'iYi where

RKHS and stochastic processes 83

Let Zt = (PYt(tl), . . . , pYt(tn))' (K(t,tI), . .. , K (t, tn))' . When thedesign is such that the matrix L: is invertible, we thus get A = L:- 1Zt.It is important to note the following corollary.

COROLLARY 9 The solution ht(s) of problem (2.12) is a spline as afunction of s as well as a function of t .

Proof. From the previous computations, we have

ht(s) = Z:L:- 1Zs

anf therefore ht(s) = hs(t) which proves the claim . •

We will see in Chapter 3 that problem (2.12) defines an abstract interpo­lating spline function of a particular form in the following sense . First,in the most general formulation of interpolating splines, the underlyingHilbert space does not have to be a reproducing kernel Hilbert space.Nevertheless, interpolating conditions are frequently true interpolatingconditions in the sense that the functionals they involve are evaluationfunctionals, in which case the requirement of their continuity necessitatesthe reproducing property. Secondly, in the most general formulation ofinterpolating splines the functional which is minimized is a semi-normwhereas in this case, it is a norm . Note that we derived Theorem 47as a simple consequence of Parzen's results. Note also that this casedoes not necessarily belong to the Kriging family unless the process X,is stationary.

4.2.2 NOISE WITHOUT DRIFT MODEL AND SMOOTHING SPLINES

In this paragraph, we still have no drift, i.e. D(t) = 0, but there is some noise, i.e. σ² > 0. The presence of noise will change the nature of the splines, which will be smoothing splines instead of interpolating splines.

THEOREM 48 If one substitutes the realizations y_i for the random variables Y_{t_i}, the BLP of Y_t based on Y_1, ..., Y_n becomes the smoothing spline which solves the following optimization problem in H_K

    Min_{g ∈ H_K}  Σ_{i=1}^n (g(t_i) − y_i)² + σ² ||g||²_{H_K}        (2.13)

Proof. Although one could write more abstract arguments here, we prefer to do this proof with simple algebra. From the projection theorem, we know that U* = Σ_{i=1}^n λ_i Y_{t_i} = λ'Y where Y = (Y_{t_1}, ..., Y_{t_n})' and λ solves the following linear system

    E(Y_s U*) = ρ_{Y_t}(s),   ∀ s ∈ T.


Therefore for t distinct from one of the t_i, with the previous notations this linear system can be written

    (Σ + σ² I_n) λ = Z_t        (2.14)

where I_n is the identity matrix of size n. Note that Σ + σ² I_n is automatically invertible. Therefore

    U* = Z_t' (Σ + σ² I_n)^{-1} Y.

On the other hand, as we will see in Chapter 3, the solution of (2.13) is a linear combination of the functions K(·, t_i) that we write g*(t) = μ' Z_t with μ ∈ R^n. Since

    Σ_{i=1}^n (g*(t_i) − y_i)² = Σ_{i=1}^n (<g*, K(·, t_i)> − y_i)² = (Σμ − y)'(Σμ − y)

and ||g*||²_{H_K} = μ'Σμ, μ minimizes (Σμ − y)'(Σμ − y) + σ² μ'Σμ where y = (y_1, ..., y_n)', and therefore μ = (Σ + σ² I_n)^{-1} y. Hence g*(t) = Z_t'(Σ + σ² I_n)^{-1} y, which proves the claim. ∎

COROLLARY 10 The solution h_t(s) of problem (2.13) for y_i = K(t, t_i) is a spline as a function of s as well as a function of t.

Proof. From the previous computations, we have

    h_t(s) = Z_t' (Σ + σ² I_n)^{-1} Z_s

and therefore h_t(s) = h_s(t), which proves the claim. ∎

In contrast with the previous case, to be able to derive this equivalence from Theorems 44 and 42, one would need to consider GSP with ε_t as Wiener white noise with covariance given by the Dirac measure δ(t − s). One can also relate this case to Nashed and Wahba (1974). It is important to note that the interpolation case is obtained as a limit of this case when σ tends to 0.
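The only change in the computation is the regularized Gram matrix; a minimal sketch of the smoothing predictor, under the same placeholder-kernel convention as the previous sketch:

```python
import numpy as np

def smoothing_spline_predict(t_design, y_obs, t_new, kernel, sigma2):
    """Smoothing form of the BLP: g*(t) = Z_t' (Sigma + sigma^2 I_n)^{-1} y."""
    n = len(t_design)
    Sigma = kernel(t_design[:, None], t_design[None, :])
    Z = kernel(t_design[:, None], t_new[None, :])
    mu = np.linalg.solve(Sigma + sigma2 * np.eye(n), y_obs)   # mu = (Sigma + sigma^2 I)^{-1} y
    return Z.T @ mu
```

Setting sigma2 to 0 recovers the interpolating predictor, in line with the limiting remark above.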

4.2.3 COMPLETE MODEL AND PARTIAL SMOOTHING SPLINES

Let us now consider the complete model with noise and drift. The introduction of the drift will again change the nature of the splines, which will be called partial splines (or inf-convolution splines) and which may be, as previously, interpolating or smoothing according to the presence of noise. The drift space V is a finite dimensional subspace of R^T subject to the following condition: if d_1, ..., d_L is a basis of V, the n × L matrix D = (d_j(t_i), i = 1, ..., n, j = 1, ..., L) should be of rank L. This condition corresponds to the fact that V should not contain non null functions that vanish on all of the design points t_i.

THEOREM 49 If one substitutes the realizations y_i for the random variables Y_{t_i}, the BLUP of Y_t based on Y_1, ..., Y_n becomes the function h(t) = Σ_{l=1}^L θ_l d_l(t) + g*(t) where g* is the partial smoothing spline which solves the following optimization problem in H_K

    Min_{g ∈ H_K, θ ∈ R^L}  Σ_{i=1}^n (g(t_i) + Σ_{l=1}^L θ_l d_l(t_i) − y_i)² + σ² ||g||²_{H_K}        (2.15)

Proof. The BLUP U* of Y_t is of the form U* = Σ_{i=1}^n λ_i Y_{t_i} where λ minimizes E(Σ_{i=1}^n λ_i Y_{t_i} − Y_t)² with the constraints Σ_{i=1}^n λ_i D(t_i) = D(t), ∀ D ∈ V. Since E(Y_t Y_s) = D(t)D(s) + K(t, s) + σ² δ(t − s), λ is solution of the following optimization problem

    Min_λ  λ'Σ_1 λ − 2 λ'Z_t   subject to   D'λ = D_t,

where Z_t = (K(t, t_1), ..., K(t, t_n))', D_t = (d_1(t), ..., d_L(t))' and Σ_1 = Σ + σ² I_n. Introducing a Lagrange multiplier δ ∈ R^L, we need to minimize λ'Σ_1λ − 2λ'Z_t + 2δ'(D'λ − D_t), which leads to the following linear system

    Σ_1 λ + D δ = Z_t,    D'λ = D_t.

It is easy to solve this last system by substitution and one gets

    δ = (D'Σ_1^{-1}D)^{-1}(D'Σ_1^{-1}Z_t − D_t),
    λ = Σ_1^{-1}(I_n − D(D'Σ_1^{-1}D)^{-1}D'Σ_1^{-1})Z_t + Σ_1^{-1}D(D'Σ_1^{-1}D)^{-1}D_t.

Hence

    U* = Z_t'(I_n − Σ_1^{-1}D(D'Σ_1^{-1}D)^{-1}D')Σ_1^{-1} Y + D_t'(D'Σ_1^{-1}D)^{-1}D'Σ_1^{-1} Y.        (2.16)

On the other hand, as we will see in Chapter 3, the solution of (2.15) can be written h(s) = θ'D_s + g*(s), with g*(s) = μ'Z_s, μ ∈ R^n and θ ∈ R^L.


Since by easy linear algebra


    Σ_{i=1}^n (g*(t_i) + Σ_{l=1}^L θ_l d_l(t_i) − y_i)² = Σ_{i=1}^n (<g* + Σ_{l=1}^L θ_l d_l, K(·, t_i)> − y_i)²
                                                      = (Σμ + Dθ − y)'(Σμ + Dθ − y)

and ||g*||²_{H_K} = μ'Σμ, μ minimizes (Σμ + Dθ − y)'(Σμ + Dθ − y) + σ² μ'Σμ. Taking partial derivatives with respect to μ and θ yields the following system

    Σ((Σ + σ²I_n)μ + Dθ − y) = 0,    D'(Σμ + Dθ − y) = 0,

which is equivalent to

    (Σ + σ²I_n)μ + Dθ = y,    D'μ = 0.

The solution is therefore

    θ = (D'Σ_1^{-1}D)^{-1} D'Σ_1^{-1} y,    μ = Σ_1^{-1}(I_n − D(D'Σ_1^{-1}D)^{-1} D'Σ_1^{-1}) y.        (2.17)

It is then easy to check that θ'D_t + μ'Z_t coincides with U* when one substitutes y for Y. ∎

Examples for this case will be presented in the next Section. Let us just underline the fact that if one wants in model (2.11) a stationary process for X together with a non-zero mean function, one gets automatically partial splines. In a similar model with additional regularity assumptions, Huang and Lu (2001) derive the BLUP and its relationship with penalized least squares methods but without acknowledging their abstract spline nature. They present the model as a nonparametric mixed effects model and the estimation as an empirical bayesian approach in the gaussian case. Huang and Lu (2000) construct wavelet predictors in that same framework.
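A minimal numerical sketch of formula (2.17): it returns the drift coefficients θ and the spline coefficients μ, from which h(t) = Σ_l θ_l d_l(t) + μ'Z_t can be evaluated. The kernel and drift functions passed in are placeholders chosen by the caller.

```python
import numpy as np

def partial_smoothing_spline(t_design, y_obs, kernel, drift_funcs, sigma2):
    """Coefficients (theta, mu) of the partial smoothing spline, following (2.17)."""
    n = len(t_design)
    Sigma1 = kernel(t_design[:, None], t_design[None, :]) + sigma2 * np.eye(n)
    D = np.column_stack([d(t_design) for d in drift_funcs])   # n x L drift matrix
    S_inv_D = np.linalg.solve(Sigma1, D)                      # Sigma_1^{-1} D
    S_inv_y = np.linalg.solve(Sigma1, y_obs)                  # Sigma_1^{-1} y
    theta = np.linalg.solve(D.T @ S_inv_D, D.T @ S_inv_y)     # (D'S^{-1}D)^{-1} D'S^{-1} y
    mu = S_inv_y - S_inv_D @ theta                            # Sigma_1^{-1} (y - D theta)
    return theta, mu
```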

4.2.4 CASE OF GAUSSIAN PROCESSES

In the case of gaussian processes, since the BLUP coincides with the conditional expectation, the equivalence result has been presented as an equivalence between spline smoothing estimation and bayesian estimation for a given prior distribution. It is the case in Kohn and Ansley (1988) and Wahba (1990). More precisely, let us assume that in the Kriging model with gaussian fluctuation, instead of being an unknown constant vector, θ has a prior gaussian distribution given by N_L(0, aσ²I_L), independent of ε_i and of X_t, for a positive real a. Let θ_a and U*_a be respectively the bayesian estimate of θ and the bayesian predictor of Y_t in this model.

THEOREM 50 When a tends to ∞, the limit of θ_a is θ and the almost sure limit of U*_a is U*.

Proof. We follow Wahba's proof in Wahba (1990). With the previous notations we have

    Var(Y) = aσ² DD' + Σ + σ² I_n,    Cov(Y_t, Y) = aσ² D D_t + Z_t.

Therefore by the classical formula for conditional expectation for gaussian vectors, one gets

    E(Y_t | Y) = (aσ² D D_t + Z_t)'(aσ² DD' + Σ + σ² I_n)^{-1} Y
               = D_t' aσ² D'(aσ² DD' + Σ_1)^{-1} Y + Z_t'(aσ² DD' + Σ_1)^{-1} Y.

Comparing with (2.16), we need to check that

    lim_{a→∞} aσ² D'(aσ² DD' + Σ_1)^{-1} = (D'Σ_1^{-1}D)^{-1} D'Σ_1^{-1}

and that

    lim_{a→∞} (aσ² DD' + Σ_1)^{-1} = Σ_1^{-1} − Σ_1^{-1}D(D'Σ_1^{-1}D)^{-1}D'Σ_1^{-1}.

These two equations follow from letting a tend to ∞ in

    (aσ² DD' + Σ_1)^{-1} = Σ_1^{-1} − Σ_1^{-1}D(D'Σ_1^{-1}D)^{-1}(I_L + (1/(aσ²))(D'Σ_1^{-1}D)^{-1})^{-1}D'Σ_1^{-1}. ∎

When a tends to ∞, the variance of the prior increases and that is why one talks about a diffuse prior. It corresponds practically to the uncertainty about the practical range of values of the parameter. Kohn and Ansley (1988) express the fluctuation process in state space form and use the Kalman filter to obtain the solution, thus yielding an efficient computing algorithm for the equivalent smoothing spline. This had been pointed out previously by Weinert et al (1980).
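A quick finite-sample check of Theorem 50: with entirely illustrative choices of kernel, drift and data, the bayesian predictor (aσ²DD_t + Z_t)'(aσ²DD' + Σ_1)^{-1}y approaches the BLUP of (2.16)-(2.17) as a grows.

```python
import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 1.0, 6)
K = np.exp(-0.5 * (t[:, None] - t[None, :]) ** 2 / 0.1)   # placeholder covariance of X
sigma2 = 0.05
Sigma1 = K + sigma2 * np.eye(len(t))
D = np.column_stack([np.ones_like(t), t])                  # drift basis d_1 = 1, d_2 = t
y = rng.normal(size=len(t))
t_new = 0.37
Z_t = np.exp(-0.5 * (t - t_new) ** 2 / 0.1)
D_t = np.array([1.0, t_new])

# BLUP via (2.17): theta = (D'S^-1 D)^-1 D'S^-1 y, mu = S^-1 (y - D theta)
theta = np.linalg.solve(D.T @ np.linalg.solve(Sigma1, D), D.T @ np.linalg.solve(Sigma1, y))
blup = D_t @ theta + Z_t @ np.linalg.solve(Sigma1, y - D @ theta)

# Bayesian predictor with prior variance a * sigma2 on theta (prior mean 0)
for a in (1.0, 1e3, 1e6):
    C = a * sigma2 * D @ D.T + Sigma1
    bayes = (a * sigma2 * D @ D_t + Z_t) @ np.linalg.solve(C, y)
    print(a, bayes - blup)   # the difference shrinks as a grows
```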


4.2.5 THE KRIGING MODELS

The Kriging model is a particular case of the complete model when the X process has stationary increments. Let us assume that X is a continuous (m−1)-IRF with generalized covariance G. The complete model is then called the universal Kriging model, whereas the case of known and constant mean is called simple Kriging and the case of unknown mean with 0-IRF is called ordinary Kriging. The generalized covariance by itself does not completely characterize the second order structure of the process, since m − 1 degrees of freedom are still unspecified. However, we will see that it is all that matters in the final solution of the Kriging problem. What makes the specificity of this case is that the Kriging equations can be expressed in terms of the generalized covariance rather than the covariance (they are then called dual Kriging equations). At the same time, the corresponding spline is shown to solve an optimization problem involving a semi-norm instead of a norm. The drift is classically modelled as an unknown linear combination of known functions, some of which are polynomials in the location variable and others can be explanatory variables or other types of dependence in the location variable. It is usual to assume that the drift subspace contains the set P_{m−1} of polynomials of degree less or equal to m − 1, of dimension M = C(d + m − 1, d). Let d_1, ..., d_L be L functions in the drift space V that span a subspace in direct sum with P_{m−1}. The dimension of V is then M + L. Let D be the n × L matrix D = (d_j(t_i), i = 1, ..., n, j = 1, ..., L). Let us define a covariance K_G associated with a given generalized covariance G by the following construction. Fix M locations x_1, ..., x_M in R^d such that the restrictions to these locations of the polynomials in P_{m−1} form a subspace of dimension M of R^M and let P_i, i = 1, ..., M denote a basis of P_{m−1} such that P_i(x_j) = δ_{ij}. Such a set of locations is called a P_{m−1}-unisolvent set. Let P be the n × M matrix (P_j(t_i), i = 1, ..., n, j = 1, ..., M). By the previous assumptions, we have that the rank of the matrix (P D) is L + M. Note that in dimension d = 1, this condition reduces to n ≥ m. Then the function

    K_G(s, t) = Σ_{i=1}^M P_i(t)P_i(s) + G(s − t) − Σ_{i=1}^M P_i(s) G(x_i − t)
                − Σ_{i=1}^M P_i(t) G(s − x_i) + Σ_{i,j=1}^M P_i(s)P_j(t) G(x_i − x_j)


defines a positive definite function which generates the same variances for generalized increments of order m − 1 of the process. Indeed, if ν is a generalized increment of order m − 1, then Var(Σ_{i=1}^n ν_i X_{t_i}) = Σ_{i,j=1}^n ν_i ν_j G(t_i − t_j) by definition of the generalized covariance, and it is easy to see by using the above formula for K_G(s, t) that

    Σ_{i,j=1}^n ν_i ν_j K_G(t_i, t_j) = Σ_{i,j=1}^n ν_i ν_j G(t_i − t_j).
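The construction of K_G is mechanical once a unisolvent set and its dual polynomial basis are fixed; the sketch below simply codes the displayed formula. The choice G(h) = |h|³ with m = 2 in dimension d = 1 (a generalized covariance associated with cubic splines, up to sign and scaling conventions that vary in the literature) and the unisolvent set {0, 1} are illustrative assumptions only.

```python
import numpy as np

def make_kg(G, unisolvent, poly_basis):
    """Covariance K_G built from a generalized covariance G, a P_{m-1}-unisolvent set
    x_1, ..., x_M and the dual basis P_i of P_{m-1} (P_i(x_j) = delta_ij)."""
    x = list(unisolvent)
    M = len(x)
    def KG(s, t):
        Ps = np.array([p(s) for p in poly_basis])   # (P_1(s), ..., P_M(s))
        Pt = np.array([p(t) for p in poly_basis])
        val = Pt @ Ps + G(s - t)
        val -= sum(Ps[i] * G(x[i] - t) for i in range(M))
        val -= sum(Pt[i] * G(s - x[i]) for i in range(M))
        val += sum(Ps[i] * Pt[j] * G(x[i] - x[j]) for i in range(M) for j in range(M))
        return val
    return KG

# d = 1, m = 2: the dual basis of P_1 on the unisolvent set {0, 1} is P_1(t) = 1 - t, P_2(t) = t
KG = make_kg(lambda h: abs(h) ** 3, [0.0, 1.0], [lambda t: 1 - t, lambda t: t])
print(KG(0.3, 0.7))
```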

In the following theorem, we are going to prove that G can be substituted for K_G in the Kriging equations. It is clear that the reproducing kernel Hilbert space H_{K_G} contains P_{m−1}. Let Π be the orthogonal projection from H_{K_G} onto the orthogonal complement of P_{m−1}.

THEOREM 51 In the Kriging model described above, if one substitutes the realizations y_i for the random variables Y_{t_i}, the BLUP of Y_t based on Y_1, ..., Y_n becomes the function h(t) = Σ_{l=1}^L θ_l d_l(t) + g*(t) where g* is the partial smoothing spline which solves the following optimization problem in H_{K_G}

    Min_{g ∈ H_{K_G}, θ ∈ R^L}  Σ_{i=1}^n (g(t_i) + Σ_{l=1}^L θ_l d_l(t_i) − y_i)² + σ² ||Π g||²_{H_{K_G}}        (2.18)

where Π is the orthogonal projection onto the orthogonal complement of P_{m−1} in H_{K_G}.

Note that in the limiting case of a stationary fluctuation, we have m = 0, M = 0, G = K_G and Π is the identity.
Proof. The BLUP U* of Y_t is of the form U* = Σ_{i=1}^n λ_i Y_{t_i} where λ minimizes E(Σ_{i=1}^n λ_i Y_{t_i} − Y_t)² with the unbiasedness constraints

    Σ_{i=1}^n λ_i P(t_i) = P(t),   ∀ P ∈ P_{m−1},        (2.19)

and

    Σ_{i=1}^n λ_i D(t_i) = D(t),   ∀ D ∈ span(d_1, ..., d_L).

Let D_t = (d_1(t), ..., d_L(t))' and P_t = (P_1(t), ..., P_M(t))'. Condition (2.19) is equivalent to saying that the set of (n+1) coefficients (λ_1, ..., λ_n, −1) is a generalized increment of order m−1 relative to the locations (t_1, ..., t_n, t).


Therefore using the generalized covariance

    E(Σ_{i=1}^n λ_i Y_{t_i} − Y_t)² = E(Σ_{i=1}^n λ_i X_{t_i} − X_t + Σ_{i=1}^n λ_i ε_i)²
                                    = Σ_{i,j=1}^n λ_i λ_j G(t_i − t_j) − 2 Σ_{i=1}^n λ_i G(t_i − t) + σ² Σ_{i=1}^n λ_i².

Let G = (G(t_i − t_j), i, j = 1, ..., n), G_1 = G + σ² I_n and G_t = (G(t_1 − t), ..., G(t_n − t))'. G_1 is automatically invertible. Then λ is solution of the following optimization problem

    Min_λ  λ'G_1 λ − 2 λ'G_t   subject to   D'λ = D_t and P'λ = P_t.        (2.20)

Introducing two Lagrange multipliers δ ∈ R^L and γ ∈ R^M, we need to minimize λ'G_1λ − 2λ'G_t + 2δ'(D'λ − D_t) + 2γ'(P'λ − P_t), which leads to the following linear system of Kriging equations

    G_1 λ + D δ + P γ = G_t,    D'λ = D_t,    P'λ = P_t.

By substitution, one gets the following solution

    λ = (Ω − ΩD(D'ΩD)^{-1}D'Ω) G_t + (I_n − ΩD(D'ΩD)^{-1}D') G_1^{-1}P(P'G_1^{-1}P)^{-1} P_t + ΩD(D'ΩD)^{-1} D_t,        (2.21)

where

    Ω = G_1^{-1} − G_1^{-1}P(P'G_1^{-1}P)^{-1}P'G_1^{-1}.

Solving this system by substitution can be done as in the previous theorem, but writing the solution in terms of Ω is not completely straightforward and the details can be found in Exercise 3. We conclude that

    U* = G_t' Ω (I_n − D(D'ΩD)^{-1}D'Ω) Y + P_t'(P'G_1^{-1}P)^{-1}P'G_1^{-1}(I_n − D(D'ΩD)^{-1}D'Ω) Y + D_t'(D'ΩD)^{-1}D'Ω Y.

On the other hand, as we will see in Chapter 3, the solution of (2.18) can be written h(s) = θ'D_s + ξ'P_s + g*(s) with g*(s) = μ'G_s, μ ∈ R^n, θ ∈ R^L and ξ ∈ R^M. Since by easy linear algebra

    Σ_{i=1}^n (g*(t_i) + Σ_{l=1}^L θ_l d_l(t_i) + Σ_{k=1}^M ξ_k P_k(t_i) − y_i)² = (Gμ + Dθ + Pξ − y)'(Gμ + Dθ + Pξ − y)

and ||Πg*||²_{H_{K_G}} = μ'Gμ, μ minimizes

    (Gμ + Dθ + Pξ − y)'(Gμ + Dθ + Pξ − y) + σ² μ'Gμ.


Taking partial derivatives with respect to μ, ξ and θ gives the following system

    (G + σ²I_n)μ + Dθ + Pξ = y
    D'Dθ + D'(Gμ + Pξ − y) = 0
    P'Pξ + P'(Gμ + Dθ − y) = 0

which is equivalent to

    (G + σ²I_n)μ + Dθ + Pξ = y
    D'μ = 0
    P'μ = 0.

The solution is therefore

    θ = (D'ΩD)^{-1}D'Ω y,    ξ = (P'G_1^{-1}P)^{-1}P'G_1^{-1}(y − Dθ),    μ = Ω(y − Dθ).

It is then easy to check that θ'D_t + ξ'P_t + μ'G_t coincides with U* when one substitutes y for Y. ∎

Note that the kernel of the semi-norm is P_{m−1}. System (2.21) bears the name of dual Kriging equations. The origin of this vocabulary can be understood in Exercise 1. The Kriging type interpolators are obtained as a limit of the previous case when σ tends to 0. As a function of t the final solution can be viewed as a polynomial in t plus a linear combination of n copies of the covariance function or of the generalized covariance centered at the data sites. The behavior of the predictor outside the convex hull of the sample locations is largely determined by the polynomial terms. When the generalized covariance is constant beyond a certain range, it is easy to see that the second term vanishes when all

the distances t − t_i exceed that range. Otherwise it can be shown that it vanishes asymptotically.
Example 1. D^m-splines and thin plate splines.
By far the most popular example of such correspondence results is the case of D^m-splines in dimension d = 1 and thin plate splines in dimension d > 1. Let E_m denote any fundamental solution of the m-iterated Laplacian Δ^m (see Chapter 6 for more details). The (m−1)-fold integrated Wiener process X is an (m−1)-IRF with generalized covariance given by E_m. It formally satisfies the stochastic differential equation D^m X = dW/dt where W is a standard Wiener process and dW/dt is white noise. In dimension 1, the Hilbert space generated by {X_t, t ∈ (0, 1)} is isometrically isomorphic to the subspace of the Sobolev space H^m(0, 1) of functions f satisfying the boundary conditions f^{(ν)}(0) = 0, ν = 0, 1, ..., m − 1, with the norm ||f||² = ∫₀¹ (f^{(m)}(t))² dλ(t). It is a particular case of Example 4 with a = 1. Its covariance is

    R(s, t) = ∫₀¹ (s − u)₊^{m−1} (t − u)₊^{m−1} / ((m−1)!)² dλ(u).        (2.22)
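As a small numerical check of this covariance, the sketch below approximates the integral by a midpoint Riemann sum (the grid size is an arbitrary choice); for m = 1 it recovers the Wiener covariance min(s, t).

```python
import numpy as np
from math import factorial

def integrated_wiener_cov(s, t, m, n_grid=2000):
    """Midpoint-rule approximation of R(s, t) for the (m-1)-fold integrated Wiener process."""
    u = (np.arange(n_grid) + 0.5) / n_grid                 # midpoints of a uniform grid on (0, 1)
    a = np.where(u < s, (s - u) ** (m - 1), 0.0)           # (s - u)_+^{m-1}
    b = np.where(u < t, (t - u) ** (m - 1), 0.0)           # (t - u)_+^{m-1}
    return (a * b).mean() / factorial(m - 1) ** 2

print(integrated_wiener_cov(0.3, 0.8, 1))   # ~ 0.3 = min(0.3, 0.8)
```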

Hence filtering integrated Brownian motion is associated with polynomial splines in dimension 1 and filtering IRFs is associated with thin plate splines in higher dimensions.
Example 2. L-splines.
The first easy extension of D^m-splines is obtained by replacing D^m by a differential operator with constant coefficients (see Chapter 3). In Kimeldorf and Wahba (1970b), model (2.11) is considered with the dimension d = 1, the mean D(t) = 0, the process X_t being stationary with spectral density proportional to |P_L(ω)|^{−2}, where P_L is the characteristic polynomial of a linear differential operator L with constant coefficients. The corresponding process X is then an autoregressive process AR(p), where p is the degree of P_L. The corresponding spline is called an L-spline.
Example 3. Lg-splines.
The next extension is then to let the coefficients of the differential operator be non constant functions with some smoothness properties. Such splines are described for example in Kimeldorf and Wahba (1971 and 1970a). An application of the correspondence result described in this example is the recursive computation of Lg-spline functions interpolating extended Hermite-Birkhoff data (see Weinert and Sidhu (1978), Kohn and Ansley (1983), Wecker and Ansley (1983)). The coefficients of L are used to parametrize a dynamical model generating the corresponding stochastic process of the form LY = dW/dt where W is a standard Wiener process and dW/dt is white noise.
Example 4. a-splines.
Thomas-Agnan (1991) considers a Kriging model with an additional assumption on the generalized covariance. By paragraph (1.7.3), using the fact that G is conditionally of positive type of order m − 1, if F denotes the Fourier transform in S'(R^d), the measure dμ_X(ω) = ||ω||^{2m} FG(ω) is a positive slowly increasing measure referred to as the m-spectral measure. The additional assumption requires this m-spectral measure to be absolutely continuous with respect to Lebesgue measure and that (1 + ||t||²)^{−m} be integrable with respect to dμ_X. It is then possible to define a function a satisfying

    ||ω||^{2m} FG(ω) = |a(ω)|^{−2}.        (2.23)

In fact, equation (2.23) generalizes the definition of the fundamental solutions of the iterated Laplacian, which is obtained for a(ω) = 1. With these assumptions, Thomas-Agnan (1991) proves that the reproducing kernel Hilbert space H_{K_G} is then a Beppo-Levi space, described in paragraph (6.1.5), and the corresponding spline is an a-spline (see Chapter 3). The smoothing parameter of the spline corresponds to the variance of the noise σ². The order of the spline m is determined by the smallest integer m for which the fluctuation process is an (m−1)-IRF. Note that the case of stationary processes corresponds to the case when the semi-norm in the spline minimization problem is in fact a norm. The span of the d_l functions of the partial spline corresponds to any complement of P_{m−1} in the drift space. Finally the a function of the a-spline determines the generalized covariance of the fluctuation process by (2.23). Let us give some examples of a functions which yield classical models. First, consider a stationary process X_t with a rational spectral density given by

    f_X(ω) = |Q(2πi ||ω||)|² / |P(2πi ||ω||)|²,        (2.24)

where P and Q are real coefficient polynomials of degree p and q respectively, and where all the zeros of P and Q have positive real part. With the condition p − q > d/2, the spectral density is integrable and the process X_t may be called an isotropic ARMA(p, q) process. If we let

    a(ω) = P(2πi ||ω||) / Q(2πi ||ω||),        (2.25)

by Theorem 51, the BLUP of Y_t based on Y_1, ..., Y_n in this model corresponds to a smoothing a-spline of order 0. Still in the stationary

framework, a common model of isotropic stationary covariance used in magnetic fields studies is given by

    K(s, t) = (1 + ||s − t||² / L²)^{−3/2},

for a constant L, corresponding to the following a function

    a(ω) = L^{−1}(2π)^{−1/2} exp(πL ||ω||).

In the non stationary case, if we let X_t be an isotropic ARIMA(p, m, q) process, i.e. an (m−1)-IRF with m-spectral density of the form (2.24) with the condition m + p − q > d/2, the BLUP in model (2.11) corresponds to an a-spline of order m, with a being given by (2.25). In particular, the so-called polynomial generalized covariance (Matheron, 1973) corresponds to the case when the numerator of a (denominator of the m-spectral measure) is a constant and therefore to an ARIMA process with p = 0. The case of ARIMA(0, m, 0) yields the thin plate splines in odd dimension. To get the same correspondence in even dimension, the generalized covariance has to be a linear combination of functions of the type ||t||^{2p} and ||t||^{2p} log(||t||) with certain restrictions on the coefficients. In the numerical analysis literature, interpolators of the form Σ a_i g(t − t_i) + Σ b_i h_i(t) are known as radial basis function interpolators, where g is called the radial basis function (radial because only isotropic functions are used). Approximation properties of interpolation of the Kriging type have been studied for example in Duchon (1983).
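To make the structure of such interpolators concrete, the following sketch builds a noise-free dual-Kriging / radial basis function interpolator in dimension d = 1, with g(h) = |h|³ and a linear polynomial part; these choices, like the data, are purely illustrative assumptions.

```python
import numpy as np

def rbf_interpolator(t_design, y_obs, g=lambda h: np.abs(h) ** 3):
    """Interpolator sum_i a_i g(t - t_i) + b_0 + b_1 t, solving [G P; P' 0][a; b] = [y; 0]."""
    n = len(t_design)
    G = g(t_design[:, None] - t_design[None, :])     # matrix (g(t_i - t_j))
    P = np.column_stack([np.ones(n), t_design])      # monomials 1, t at the design points
    A = np.block([[G, P], [P.T, np.zeros((2, 2))]])
    coef = np.linalg.solve(A, np.concatenate([y_obs, np.zeros(2)]))
    a, b = coef[:n], coef[n:]
    return lambda t: g(t - t_design) @ a + b[0] + b[1] * t

f = rbf_interpolator(np.array([0.0, 0.4, 0.6, 1.0]), np.array([0.0, 0.5, 0.4, 1.0]))
print(f(0.4))   # reproduces the observed value 0.5 at a design point
```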

4.2.6 DIRECTIONS OF GENERALIZATION

The prediction of block averages or derivative values rather than simple function values is also considered throughout this literature. The extension to a continuum of data is not a major difficulty (Matheron, 1973). The extension to an arbitrary set T is also possible although maybe of limited practical interest. Disjunctive Kriging (Matheron, 1976) extends the Kriging predictor from a linear to a non linear form. Myers (1992) and Mardia et al (1995) relax the condition that polynomials in the location variable belong to the drift subspace by using an extension of the conditional positive-definiteness definition. This extension necessitates also an extension of the definition of IRF and that of generalized increments, necessarily linked to the drift space basis. One can conjecture that some equivalence result may be obtained with that extension in the case of a semi-norm with an arbitrary but finite dimensional null space.
Myers (1982, 1988, 1991, 1992) treats the vector valued case called cokriging, which requires an extension of the conditional positive definiteness condition to the matrix-valued case (see also Narcowich and Ward, 1994). Cokriging allows the use of data on correlated variables to enhance the prediction of a primary variable. Pure interpolation constraints can be generalized in a straightforward fashion to arbitrary continuous linear constraints, for example involving derivative values on the spline side but also on the Kriging side (Mardia et al, 1995, Mardia and Little, 1994). The case when the signal process is observed through a transformation is considered in Parzen (1971). Generalized processes with stationary increments are often cited in the Kriging literature, but to our knowledge, no author has fully considered this generalization in a Kriging model.

5. UNIFORM MINIMUM VARIANCE UNBIASED ESTIMATION

Parzen (1959, page 339) shows how reproducing kernel Hilbert space theory allows an elegant presentation of the theory of Uniform Minimum Variance Unbiased Estimates (UMVUE hereafter). In a classical parametric model, let X be a random variable whose probability law P_θ, where the parameter θ varies in Θ, belongs to a family dominated by a probability measure μ. Assume that the Radon-Nikodym derivatives dP_θ/dμ belong to L²(μ) for all θ in Θ. Given a function f on the parameter space Θ, an estimate γ̂ is MVUE of γ = f(θ) if γ̂ is an unbiased estimate of γ with minimal variance among unbiased estimates. When the model is dominated by P_{θ_0}, γ̂ is called locally MVUE at θ_0 if it is MVUE for μ = P_{θ_0}. It is said UMVUE if it is locally MVUE at θ_0 for all values of the parameter θ_0. We define a kernel function J_μ on Θ × Θ by:

    J_μ(θ_1, θ_2) = < dP_{θ_1}/dμ, dP_{θ_2}/dμ >_{L²(μ)}.

THEOREM 52 Given a function f on the parameter space Θ, there exists an unbiased estimate of f(θ) if and only if f belongs to H_{J_μ}. In that case, the MVUE of f(θ) is given by ψ(f) where ψ is the congruence from H_{J_μ} onto £(dP_θ/dμ, θ ∈ Θ) satisfying

    ψ(J_μ(·, θ)) = dP_θ/dμ.

If V is an unbiased estimate of f(θ), then the projection of V onto the subspace £(dP_θ/dμ, θ ∈ Θ) is the MVUE of f(θ). The norm of the MVUE of f(θ) is equal to ||f||_{J_μ}.


In practice, to find the locally MVUE of f(θ) at θ_0, it is enough to find a representation of f in terms of linear operations on the reproducing kernel J_{P_{θ_0}}.

Note that the projection of V onto the subspace £(dP_θ/dP_{θ_0}, θ ∈ Θ) coincides with the conditional expectation of V given {dP_θ/dP_{θ_0}, θ ∈ Θ}. Once the locally MVUE at θ_0 has been found, a UMVUE exists if there exists a determination of these conditional expectations which is functionally independent of θ_0. An example of application can be found in Duttweiler and Kailath (1973a). An illustration of this theorem is the problem of UMVLUE of the mean function of a process with known covariance. Let X_t be a process with known covariance K and whose mean function m(t) is unknown but supposed to belong to a known subset M of H_K. The extra L in MVLUE means that we are considering only estimates which are linear functionals over the observed process, i.e. elements of £(X_t, t ∈ T). The solution is mainly interesting in the non-finite index T case where it enables one to have a theory of regression with an infinite number of observations. If K_M denotes the restriction of K to M, it can be shown that a function f(m) is linearly estimable if and only if f belongs to H_{K_M}.

THEOREM 53 If ψ denotes the canonical isomorphism between £(X) and H_K, and M̄ denotes the Hilbert subspace spanned by M, ψ^{-1}(g*) is the UMVLUE of f(m) if and only if g* satisfies any one of the following equivalent conditions

• g* is the function in H_K which has minimum norm among all functions g ∈ H_K satisfying <m, g> = f(m), ∀ m ∈ M

• g* is the unique function g in M̄ satisfying <m, g> = f(m), for all m ∈ M

• g* is the projection onto M̄ of any element g ∈ H_K satisfying <m, g> = f(m), for all m ∈ M.

Moreover the minimum variance is equal to ||g*||²_K.

If we rewrite the problem in H_K, this theorem is just a consequence of the projection theorem. When the covariance is only known up to a constant factor, Cov(X_t, X_s) = σ² K(s, t), the same result holds except that the minimum variance is given by σ² ||g*||²_K and it is then necessary to estimate σ² (see Parzen, 1961a).


6. DENSITY FUNCTIONAL OF A GAUSSIAN PROCESS AND APPLICATIONS TO EXTRACTION AND DETECTION PROBLEMS

In this section, we first consider the problem of computing, when it exists, the probability density functional of a gaussian process X_t with respect to a gaussian process Y_t when they have the same covariance and different means. We then apply these results to the signal plus noise models. In contrast with the models of Section (4.2), the data is no longer discrete, the noise is no longer i.i.d. and these two processes are assumed to be gaussian. The questions that arise in these models are of different natures:

• Estimating θ when it is nonrandom

• Detecting the presence of a signal of a specified shape

• Detecting the presence of a stochastic signal

• Classifying signals

The first problem is referred to as the extraction of signal in noise problem or sometimes as the regression of time series problem. In a series of papers, Parzen (1962, 1963) develops a unified approach to these problems based on reproducing kernel Hilbert space, applicable whether the process be stationary or not, discrete "time" or not, univariate or multivariate. The remainder of this section summarizes this approach. All these problems involve likelihood ratios and the object of the next section is their computation. Important contributions in this area are also due to Kailath (1967, 1970) and Duttweiler and Kailath (1972, 1973a and 1973b).

6.1. DENSITY FUNCTIONAL OF A GAUSSIAN PROCESS

Let X_t and Y_t be separable stochastic processes on an index set T with the same covariance function K and with respective means m_X and m_Y. T will be either countable or a separable metric space. Let Ω be the set of real valued functions defined on T and let P_X and P_Y be the probability measures on Ω with the sigma field of cylinder sets respectively induced by X_t and Y_t. The following problems date back to Hajek (1958):

• to determine when P_X will be absolutely continuous with respect to P_Y,


• to compute the Radon-Nikodym derivative of P_X with respect to P_Y when it exists,

• to determine when P_X and P_Y are orthogonal, i.e. whether there exists a set A such that P_X(A) = 0 and P_Y(A) = 1.

These questions have been addressed by Hajek (1958), Kallianpur and Oodaira (1963, 1973), Capon (1964), Rozanov (1966), Neveu (1968), Jorsboe (1968), Kailath (1967, 1970), Kallianpur (1970, 1971), Duttweiler and Kailath (1973), Fortet (1973). Let T_n be a monotone increasing sequence of finite subsets T_n = {t_1, ..., t_n} of T such that the union of the T_n is equal to T in the countable case and is dense in T in the separable metric space case. Let K_{T_n} be the restriction of K to T_n. Let P_X^n and P_Y^n be the probability distributions of {X_t, t ∈ T_n} and respectively of {Y_t, t ∈ T_n}. The following theorem is sometimes called the dichotomy theorem.

THEOREM 54 We assume that the index set T is either countable or a separable metric space, that K is weakly continuous and that K has the property that it is non singular on every finite subset of T. Then

• the measures P_X and P_Y are either equivalent or orthogonal

• P_X is orthogonal to P_Y if and only if m_Y − m_X does not belong to H_K

• P_X is equivalent to P_Y if and only if m_Y − m_X belongs to H_K and in that case the density dP_Y/dP_X of P_Y with respect to P_X is given by

    dP_Y/dP_X = exp( ψ^{-1}(m_Y − m_X) − <m_X, m_Y − m_X>_K − (1/2) ||m_Y − m_X||²_K )        (2.26)

where ψ is the canonical congruence between £(X) and H_K.

The proof makes use of the theory of martingales. In formula (2.26), dP_Y/dP_X is a random variable called the likelihood ratio and is the result of plugging X in the actual density. Note that the sequence of densities dP_Y^n/dP_X^n can be shown to converge pointwise to dP_Y/dP_X and that the random variable ψ^{-1}(m_Y − m_X) is the limit in mean square sense as well as P_X almost sure sense of ψ_n^{-1}(m_Y − m_X) where ψ_n is the canonical isomorphism between H_{K_{T_n}} and £(X_t, t ∈ T_n). The reader will check that this formula generalizes the corresponding well known formula in the case when X_t


and Y_t are multivariate gaussian vectors (finite T). Note the alternative formula

    dP_Y/dP_X = exp( ψ^{-1}(m_Y − m_X) − (1/2)(||m_Y||²_K − ||m_X||²_K) ).
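For a finite index set, formula (2.26) is just the ratio of two multivariate normal densities with common covariance; the short check below compares the two sides numerically. The covariance matrix, the means and the use of scipy are illustrative choices, not part of the original treatment.

```python
import numpy as np
from scipy.stats import multivariate_normal

K = np.array([[2.0, 0.5], [0.5, 1.0]])                       # common covariance
m_x, m_y = np.array([0.0, 0.0]), np.array([1.0, -0.5])       # the two mean vectors
x = np.random.default_rng(0).multivariate_normal(m_x, K)     # one realization of X

d = np.linalg.solve(K, m_y - m_x)                            # K^{-1}(m_Y - m_X)
# psi^{-1}(m_Y - m_X) = (m_Y - m_X)' K^{-1} X and <u, v>_K = u' K^{-1} v in finite dimension
rhs = np.exp(x @ d - m_x @ d - 0.5 * (m_y - m_x) @ d)
lhs = multivariate_normal(m_y, K).pdf(x) / multivariate_normal(m_x, K).pdf(x)
print(np.isclose(lhs, rhs))                                  # True
```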

In the signal plus noise models, let P_N and P_{S+N} be the probability measures respectively induced by the noise and by the data on the set of all real valued functions on the index set T. Let p_n(X) denote the density of P^n_{S+N} with respect to P^n_N. Let K_N^n denote the restriction of K_N to T_n. The divergence J_n between the measures P_{S+N} and P_N based on the data (X_t, t ∈ T_n) is defined by

    J_n = E_{S+N}( log p_n(X) ) − E_N( log p_n(X) ).

This quantity originating from information theory is a measure of how far it is possible to discriminate between the presence and absence of the signal, therefore a measure of signal to noise ratio. In the sure signal case, Parzen shows that one can express p_n and J_n in terms of reproducing kernels:

    log(p_n) = ψ_n^{-1}(S) − (1/2) <S, S>_{K_N^n},    J_n = <S, S>_{K_N^n},

where ψ_n denotes the canonical isomorphism given by the Loève representation theorem between £(N) and H_{K_N^n}. Using Theorem 54, one sees that P_{S+N} is equivalent to P_N if and only if S(·) belongs to H_{K_N}, i.e. the signal S belongs to the reproducing kernel Hilbert space H_{K_N} representing the noise. Moreover this happens if and only if the sequence of divergences J_n = <S, S>_{K_N^n} converges as n → ∞ to a finite limit J_∞ = <S, S>_{K_N}. In that case the sequence of densities p_n converges pointwise to the density p of P_{S+N} with respect to P_N and we have

    log(p) = ψ^{-1}(S) − (1/2) <S, S>_{K_N}.        (2.27)

In the context of Mercer's theorem, Kutoyants (1984) shows that S(·) belongs to H_{K_N} if and only if Σ_{n=1}^∞ (1/λ_n) <S, φ_n>² < ∞ and that then the density p is given by

    p(x) = exp( Σ_{n=1}^∞ (1/λ_n) x_n <S, φ_n> − (1/2) Σ_{n=1}^∞ (1/λ_n) <S, φ_n>² ).

We will not give details about the case of a stochastic signal and the reader is referred to Parzen (1963). Let us just mention informally that


in that case the absolute continuity happens when almost all the sample paths of the signal process belong to the reproducing kernel Hilbert space representing the noise. The problem with correlated signal and noise is studied in Kailath (1970). The equivalence between gaussian measures with equal means and unequal covariances has been considered in Kailath (1970), Parzen (1971) and Kailath and Weinert (1975). This same question in a non gaussian framework is addressed in Duttweiler and Kailath (1973b) using reproducing kernel Hilbert space tools, in particular the congruence between the non linear Hilbert space generated by the process and the reproducing kernel Hilbert space associated with the characteristic functional.

6.2. MINIMUM VARIANCE UNBIASED ESTIMATION OF THE MEAN VALUE OF A GAUSSIAN PROCESS WITH KNOWN COVARIANCE

Coming back to the problem of UMVUE of the mean function of a gaussian process with known covariance, Parzen (1959) proves that the UMVUE coincides with the UMVLUE in the gaussian case with the assumptions of the previous section. In particular the mean class M is supposed to be a subset of H_K. With the same notations as in Section 5,

THEOREM 55 If f is linearly estimable, the UMVUE of f(m) is equal to ψ^{-1}(g*) where ψ is the canonical congruence from £(X) to H_K and g* is the orthogonal projection onto M̄ of any function g in H_K such that <m, g> = f(m), ∀ m ∈ M. In particular the UMVUE of m and m' are respectively given by

    m̂(t) = ψ^{-1}( (K(·, t))* ),    m̂'(t) = ψ^{-1}( (∂K(·, t)/∂t)* ).

Proof. The parameter is θ = m. Since M is included in H_K, the density of P_m with respect to P_{m_0} exists and by formula (2.26), it is given by

    p(m) = exp( ψ^{-1}(m − m_0) − <m_0, m − m_0> − (1/2) ||m − m_0||² ).        (2.28)

The proof of this theorem relies on the following lemma.

LEMMA 15

    E_{m_0}( (dP_{m_1}/dP_{m_0}) (dP_{m_2}/dP_{m_0}) ) = exp( <m_1 − m_0, m_2 − m_0> ).


To prove the lemma, by formula (2.28) and using the fact that for a gaussian random variable Z we have E(exp(Z)) = exp(E(Z) + (1/2)Var(Z)),

we get

    E_{m_0}( (dP_{m_1}/dP_{m_0}) (dP_{m_2}/dP_{m_0}) )
      = exp( E_{m_0}(ψ^{-1}(m_1 − m_0)) − <m_0, m_1 − m_0> − (1/2)||m_1 − m_0||²
             + E_{m_0}(ψ^{-1}(m_2 − m_0)) − <m_0, m_2 − m_0> − (1/2)||m_2 − m_0||²
             + (1/2) Var( ψ^{-1}(m_1 − m_0) + ψ^{-1}(m_2 − m_0) ) ).

One concludes the proof of the lemma using equations (2.3) and (2.3) and some easy linear algebra. Next, in order to apply Theorem 52, we need to define a kernel J_{m_0} by

    J_{m_0}(m_1, m_2) = < dP_{m_1}/dP_{m_0}, dP_{m_2}/dP_{m_0} >_{L²(P_{m_0})} = E_{P_{m_0}}( p(m_1) p(m_2) ).

∎
By the previous lemma, we have J_{m_0}(m_1, m_2) = exp(<m_1 − m_0, m_2 − m_0>). As in Theorem 52, we are going to look for a representation of f(m) in terms of J_{m_0}(m, ·). Let φ_n, n ∈ N be an orthonormal basis for M̄ and let β_n be the coefficients of m_1 − m_0 in this basis. It is easy to see that the derivative ∂/∂β_n J_{m_0}(m_1, m) evaluated at β = 0 (i.e. m_1 = m_0) is equal to <φ_n, m − m_0>. For a function g satisfying

    <m, g> = f(m),   ∀ m ∈ M,

write

    f(m) = <g, m_0> + Σ_{n=1}^∞ <g, φ_n> <m − m_0, φ_n>

and therefore

    ψ_{J_{m_0}}(f) = <g, m_0> + Σ_{n=1}^∞ <g, φ_n> (∂p/∂β_n)(m_0)

where ψ_{J_{m_0}} is the congruence between H_{J_{m_0}} and £(dP_m/dP_{m_0}, m ∈ M). To evaluate this derivative, write

    (∂p/∂β_n)(m_0) = ψ^{-1}(φ_n) − <m_0, φ_n>.


We thus get that

    ψ_{J_{m_0}}(f) = <g, m_0> + Σ_{n=1}^∞ <g, φ_n> ( ψ^{-1}(φ_n) − <m_0, φ_n> )

and therefore that

    ψ_{J_{m_0}}(f) = Σ_{n=1}^∞ <g, φ_n> ψ^{-1}(φ_n) = ψ^{-1}(g*). ∎

The case when the set M is parametrized is covered in the next section. Stulajter (1978) uses Parzen's and Kallianpur's results for the non linear estimation of polynomials of the mean Q(m), Q ∈ P_n.

6.3. APPLICATIONS TO EXTRACTION PROBLEMS

In the linearly parametrized signal case, S_t = Σ_{l=1}^L θ_l d_l(t) where the θ_l are unknown (random or not) parameters and the d_l are known functions belonging to H_{K_N}. For the sake of simplicity, we will only work out the case L = 1 and write θd(t), knowing that the results can be extended in a straightforward way to the multiparameter model. As is usual, we will introduce the parameter θ in the notation for the density p(X | θ).

LEMMA 16 In this model, we have

    p(X | θ) = exp( θ ψ^{-1}(d) − (θ²/2) <d, d>_{K_N} ).

THEOREM 56 If θ follows a gaussian prior distribution N(θ_0, a²), the Bayes estimate of θ is equal to

    θ* = ( ψ^{-1}(d) + a^{-2} θ_0 ) / ( <d, d>_{K_N} + a^{-2} ),

with mean square estimation error given by ( <d, d>_{K_N} + a^{-2} )^{-1}.

The maximum likelihood estimate of θ is equal to

    θ** = <d, d>_{K_N}^{-1} ψ^{-1}(d),

with mean square estimation error given by ( <d, d>_{K_N} )^{-1}.

Moreover θ** is the almost sure limit of θ* when the prior variance a² tends to ∞.

Finally, θ** is also the minimum variance unbiased estimate and the minimum variance unbiased linear estimate of θ.
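When the process is observed at finitely many points, ψ^{-1}(d) reduces to d'K_N^{-1}X and <d, d>_{K_N} to d'K_N^{-1}d, so the estimates of Theorem 56 are immediate to compute; a minimal sketch (all inputs, including the noise covariance matrix on the design, are supplied by the caller):

```python
import numpy as np

def extraction_estimates(x_obs, d_vec, K_N, theta0=0.0, a2=1.0):
    """Maximum likelihood and Bayes estimates of theta in X = theta*d + N on a finite design."""
    Kinv_d = np.linalg.solve(K_N, d_vec)
    dd = d_vec @ Kinv_d                  # <d, d>_{K_N}
    psi_inv_d = x_obs @ Kinv_d           # psi^{-1}(d) = d' K_N^{-1} X
    theta_ml = psi_inv_d / dd                                        # theta**
    theta_bayes = (psi_inv_d + theta0 / a2) / (dd + 1.0 / a2)        # theta*
    return theta_ml, theta_bayes
```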


Corresponding formulas exist for the multiparameter model where the Gram matrix <d_i, d_j>_{K_N} takes the place of <d, d>_{K_N}. One can then derive the classical expressions of the estimates of θ in the case of a finite number of observations (see Parzen, 1961). Ylvisaker (1962 and 1964) uses the inequality 1.13 of Chapter 1 to derive lower bounds on the covariance matrix of MVUE estimators for regression problems on time series. A series of papers (Sacks and Ylvisaker (1966, 1968, 1969) and Wahba (1971, 1974)) are devoted to the problem of regression design. The regression design problem is to choose a subset (or design) T_n = {t_1, ..., t_n} of given size n of T such that the variance (or mean square error) of the maximum likelihood estimator θ**_{T_n} of θ based on the observations X_{t_1}, ..., X_{t_n} is as small as possible. Let Π_{T_n} be the projection operator in H_{K_N} onto the subspace H_n spanned by {K_N(t_1, ·), ..., K_N(t_n, ·)}. Let σ²_T = E(θ − θ**_T)² and σ²_{T_n} = E(θ − θ**_{T_n})². Since by Theorem 56 the mean square error of θ**_{T_n} is given by <d, d>_{H_n}^{-1}, it is clear that minimizing σ²_{T_n} is obtained by minimizing ||d − Π_{T_n}(d)||²_{K_N}. Therefore the problem becomes that of choosing an optimal subspace spanned by a finite number of functions K_N(t_i, ·) for approximating the function d. If D_n denotes the set of designs of size n in T, a sequence of designs T_n* is said by Sacks and Ylvisaker (1966, 1968, 1969) to be asymptotically optimal if

    lim_{n→∞} ( σ²_{T_n*} − σ²_T ) / ( inf_{T_n ∈ D_n} σ²_{T_n} − σ²_T ) = 1,

or equivalently if

    lim_{n→∞} ||d − Π_{T_n*}(d)||² / inf_{T_n ∈ D_n} ||d − Π_{T_n}(d)||² = 1.

At this degree of generality, the problem is intractable. Revising this definition of asymptotic optimality by allowing some derivatives of the process to be observable at the design points, Sacks and Ylvisaker (1966, 1968, 1969) characterize asymptotically optimal sequences and the rate of convergence of the approximation error for some classes of noise process. This result is extended later by Wahba (1971) to larger classes of noise process and by Ylvisaker (1975) to the case of a noise process indexed by a two dimensional parameter (random field). Kutoyants (1978) considers the case when the signal is parametrized in a possibly non linear fashion and investigates the asymptotic properties of the maximum likelihood and bayesian estimates of the parameter using reproducing kernel Hilbert space theory.
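As an illustration of the quantity being optimized, the sketch below greedily selects design points from a finite candidate grid so as to maximize <d, d>_{H_n} = d_T' K_T^{-1} d_T, which is equivalent to reducing ||d − Π_{T_n}(d)||². Greedy selection is only a heuristic, not the optimal designs characterized by Sacks and Ylvisaker; the kernel and the values of d on the grid are placeholders supplied by the caller.

```python
import numpy as np

def greedy_design(candidates, d_vals, kernel, n_points):
    """Greedy choice of design points maximizing d_T' K_T^{-1} d_T over a candidate grid."""
    chosen = []
    for _ in range(n_points):
        best, best_gain = None, -np.inf
        for j in range(len(candidates)):
            if j in chosen:
                continue
            idx = chosen + [j]
            K_T = kernel(candidates[idx][:, None], candidates[idx][None, :])
            d_T = d_vals[idx]
            gain = d_T @ np.linalg.solve(K_T, d_T)   # squared norm of the projection of d
            if gain > best_gain:
                best, best_gain = j, gain
        chosen.append(best)
    return candidates[chosen]
```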


6.4. APPLICATIONS TO DETECTION PROBLEMS

The simple hypotheses for testing the presence of a signal

    H_0: X_t = N_t
    H_1: X_t = S_t + N_t        (2.29)

are said perfectly detectable if the measures P_N and P_{S+N} are orthogonal, in which case the decision problem is said to be singular, i.e. capable of resolution with zero probability of error. For the detection problem (2.29), the optimum rejection region for a Bayes test or a Neyman-Pearson test is shown to be the set where the density p is above a certain threshold, which corresponds to regions where ψ^{-1}(S) is above a certain threshold. Kailath (1975) relates the likelihood ratio test to reproducing kernel Hilbert space theory. Kailath (1972) exploits this relationship to obtain recursive solutions for certain Fredholm equations of the first kind.
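In the finite-sample gaussian case the statistic ψ^{-1}(S) is S'K_N^{-1}X and, under H_0, it is N(0, <S, S>_{K_N}), which calibrates the threshold; a minimal Neyman-Pearson style sketch (the level, kernel matrix and signal vector are placeholder inputs):

```python
import numpy as np
from scipy.stats import norm

def detect_signal(x_obs, signal, K_N, alpha=0.05):
    """Reject H0 (noise only) when psi^{-1}(S) = S' K_N^{-1} X exceeds the level-alpha threshold."""
    Kinv_S = np.linalg.solve(K_N, signal)
    stat = x_obs @ Kinv_S                                     # psi^{-1}(S)
    tau = norm.ppf(1 - alpha) * np.sqrt(signal @ Kinv_S)      # threshold under H0
    return stat > tau
```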


7. EXERCISES



1 Alternative proofs of Theorem 47.
1) Write the Lagrangian for the minimization problem (2.12) and the associated dual problem. Solve the inf part of the dual problem and use the canonical isomorphism to write the obtained dual problem in £(X).
2) Prove that for any function f satisfying f(t_i) = y_i, the spline minimizing ||g||_{H_K} under the constraints g(t_i) = y_i is the projection of f onto the linear span of (K(t_1, ·), ..., K(t_n, ·)). Using the canonical isometry, translate this property in the Hilbert space generated by Y_t.

2 Prove that in the case of a gaussian random field following the Kriging model with no measurement error, if the coefficients of the drift are known, the Kriging predictor coincides with the conditional expectation of Y_t given the data, and if the coefficients are unknown, the Kriging predictors are obtained from the same formula replacing these coefficients by their generalized least squares estimator.

3 Proof of formula (2.21). First prove that the system of Kriging equations is equivalent to

    λ = G_1^{-1}( G_t − Dδ − Pγ )
    D'G_1^{-1}D δ + D'G_1^{-1}P γ = D'G_1^{-1}G_t − D_t
    P'G_1^{-1}D δ + P'G_1^{-1}P γ = P'G_1^{-1}G_t − P_t.        (2.30)

From the definition of Ω and the last equation of (2.30), prove that

From the definition of Ω and the last two equations of (2.30), prove that

Compute δ from the last result and plug it into the next to last to get formula (2.21).

4 Let a random function X_t have the prior distribution given by the solution of the stochastic differential equation

    d^m X / dt^m (t) = dW(t)/dt        (2.31)

where W(t) is a zero-mean Wiener process with variance 1.

1) prove that the best predictor of X_{t+h} based on X_t, X'_t, ..., X_t^{(m−1)} is its Taylor series expansion of order m − 1. You will have to generalize Theorem 47 to the case of interpolating constraints given by general continuous functionals.
2) prove that the best predictor of X_{t+h}^{(m−1)} based on X_t, X'_t, ..., X_t^{(m−1)} is given by X_t^{(m−1)}.

5 Prove that (2.17) can be rewritten θ = A'y and μ = By, where Q^− denotes the Moore-Penrose generalized inverse of a matrix Q.

6 In the complete model of Section 4.2.3, assume that the derivative of the fluctuation process X_t exists in mean square sense, which happens when the second derivative of its covariance K exists at zero. The data consists of a set of function values y = (Y_{t_1}, ..., Y_{t_n}) and a set of derivative values z = (Y'_{t_1}, ..., Y'_{t_n}).
1) prove that Cov(X'_s, X'_t) = −K''(s − t) and Cov(X_s, X'_t) = K'(t − s).
2) derive the Kriging predictor of Y_t based on this data.

7 Let X_t be an Ornstein-Uhlenbeck process on the interval (a, b), i.e. the centered stationary gaussian process on this interval with covariance function given by R(s, t) = exp(−β|s − t|), for β > 0. Prove that H_R is the set of absolutely continuous functions on (a, b) with the inner product

    <u, v>_R = (1/2)( u(a)v(a) + u(b)v(b) ) + (1/(2β)) ∫_a^b ( u'(t)v'(t) + β² u(t)v(t) ) dλ(t)

8 Let X_t, t ∈ T be a second order stochastic process with known proper covariance K and unknown mean value function m(t) belonging to a finite dimensional subspace of H_K spanned by q linearly independent functions w_1, ..., w_q. If β is the vector of coefficients of m(t) in the basis w_1, ..., w_q, and if ψ is a vector of known constants, prove that the UMVUE of ψ'β is given by ψ'β̂ where β̂ is any solution of the linear system of equations Wβ̂ = W_t, W is the matrix with elements W_{ij} = <w_i, w_j>_K and W_t = (ψ^{-1}(w_1), ..., ψ^{-1}(w_q))'. Derive the UMVUE of m(t) for q = 1.

9 Assuming in the signal plus noise model that the signal is deterministic of the form S(t) = a + bt, and that the noise has covariance K(s, t) = C exp(−β|s − t|), find the UMVUE of S(t) based on the continuous observations X_t, 0 ≤ t ≤ T.

10 In the gaussian signal plus noise model of Section 6.3, consider two signals S_1 and S_2 in H_{K_N} and let X_t = S_1(t) + N_t and Y_t = S_2(t) + N_t. For any real p > 0, prove the following formula

    E_Y( (dP_X/dP_Y)^p ) = exp( (1/2)(p² − p) ||S_1 − S_2||²_{K_N} ).

11 This exercise has relationships with Section 3 of Chapter 7. Let u and v be two continuous functions of bounded variation on an interval (a, b), such that u(s) > 0 for s > a, v(s) > 0 for all s and u/v strictly increasing. Let dμ be the measure generated by u/v, which is a non atomic measure apart from perhaps an atom of weight u(a)/v(a) at a.
1) check that f(s, t) = u(min(s, t)) v(max(s, t)) can be written

    f(s, t) = ∫_a^b v(s) 1_{(a,s)}(τ) v(t) 1_{(a,t)}(τ) dμ(τ)

2) check that f(s, t) = u(s ∧ t) v(s ∨ t) is a covariance function.
3) prove that the reproducing kernel Hilbert space H_f is the set of functions g such that there exists g* ∈ L²((a, b), dμ) such that g(t) = v(t) ∫_a^t g*(τ) dμ(τ).
4) write the corresponding norm.

12 Let φ(·, ·) be a function from R^n × R such that φ(x, y) = 0 for y < 0 and x ∈ R^n. Let ε_n be a sequence of i.i.d. random variables with the same distribution as that of ε, and t_n be a sequence such that t_k − t_{k−1} are i.i.d. exponentially distributed with parameter λ and independent of the ε_n. Assume that E(φ(ε, t)) = 0 for all t > 0. Let R(s, t) = E(φ(ε, t)φ(ε, s)) and assume that R(t, t) < ∞. Let Z_t be the filtered Poisson process (or shot noise) Z_t = Σ_{n=1}^∞ φ(ε_n, t − t_n).
1) If K(s, t) = E(Z_s Z_t), prove that K(s, t) = λ ∫_0^{min(t,s)} R(t − z, s − z) dλ(z).
2) Compute K for E(ε) = 0, E(ε²) = σ_0² and φ(ε, t) = ε exp(−αt) if t > 0 and 0 otherwise.

13 Cramér-Rao inequality (from Parzen (1959)).
1) Let v_1, ..., v_n be n linearly independent vectors in a Hilbert space H and let the matrix K be defined by K_{ij} = <v_i, v_j>_H. Prove that

    ||u||²_H ≥ Σ_{i,j=1}^n <u, v_i>_H (K^{-1})_{ij} <u, v_j>_H.


Hint: note that the right hand side is the squared norm of the projection of u onto the span of v_1, ..., v_n.

2) With the notations of Section 5, for Θ ⊂ R^n, let k(θ) denote the density of X with respect to P_{θ_0}, and let v_i = ∂/∂θ_i log k(θ) for i = 1 to n. Assuming that the first and second derivatives of the log-density are defined as limits in quadratic mean, check that E_θ(v_i) = 0 and that K_{ij} = E_θ(v_i v_j) satisfy K_{ij} = −E_θ( ∂²/∂θ_i ∂θ_j log k(θ) ).

3) For a random variable U in the Hilbert space generated by X, and a function g on Θ, apply the inequality proved in 1) to the vectors v_i and to u = U − g(θ) to prove that

    E_θ( U − g(θ) )² ≥ Σ_{i,j=1}^n ( ∂/∂θ_i E_θ(U) ) (K^{-1})_{ij} ( ∂/∂θ_j E_θ(U) ).

Check that this last inequality is the classical Cramér-Rao lower bound.

