School of Computer Science
Probabilistic Graphical Models
Generalized linear models
Eric Xing, Lecture 5, February 1, 2017
Reading: KF-chap 17
[Figure: example directed and undirected graphs over nodes X1, X2, X3, X4]
Parameterizing graphical models: Bayesian network

P(\mathbf{X}) = \prod_{i=1:d} P(X_i \mid \mathbf{X}_{\pi_i})

Example: A -> C <- B

Discrete (tabular CPDs):

P(A):  a0 0.75   a1 0.25
P(B):  b0 0.33   b1 0.67
P(C | A, B):
        a0b0   a0b1   a1b0   a1b1
  c0    0.45   1      0.9    0.7
  c1    0.55   0      0.1    0.3

Or continuous (Gaussian CPDs):

A ~ N(\mu_a, \Sigma_a),   B ~ N(\mu_b, \Sigma_b),   C ~ N(A + B, \Sigma_c)

Or hybrid: the local conditionals can mix discrete and continuous variables.
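As a small illustration, the sketch below shows how the discrete CPTs above define the joint P(A, B, C) = P(A) P(B) P(C | A, B); the table values are taken from the slide, and the dictionary encoding is just one convenient choice.

```python
# Minimal sketch: the discrete CPTs above define the joint
# P(A, B, C) = P(A) P(B) P(C | A, B) for the network A -> C <- B.
P_A = {"a0": 0.75, "a1": 0.25}
P_B = {"b0": 0.33, "b1": 0.67}
P_C_given_AB = {  # keyed by (a, b); values give P(C = c | a, b)
    ("a0", "b0"): {"c0": 0.45, "c1": 0.55},
    ("a0", "b1"): {"c0": 1.0,  "c1": 0.0},
    ("a1", "b0"): {"c0": 0.9,  "c1": 0.1},
    ("a1", "b1"): {"c0": 0.7,  "c1": 0.3},
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the BN factorization."""
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c]

if __name__ == "__main__":
    print(joint("a0", "b1", "c0"))   # 0.75 * 0.67 * 1.0
    # The joint sums to one over all assignments.
    total = sum(joint(a, b, c) for a in P_A for b in P_B for c in ("c0", "c1"))
    print(total)                     # 1.0 (up to floating point)
```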
Recall: Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

y_i = \theta^T \mathbf{x}_i + \epsilon_i

where \epsilon_i is an error term capturing unmodeled effects or random noise.

Now assume that \epsilon follows a Gaussian N(0, \sigma^2); then we have:

p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)

We can use the LMS algorithm, a gradient ascent/descent approach, to estimate the parameters.
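A minimal sketch of gradient-based (LMS-style) estimation under this Gaussian noise model is shown below; the synthetic data, step size, and iteration count are illustrative assumptions, not part of the slide.

```python
import numpy as np

# Sketch of batch gradient ascent for y_i = theta^T x_i + eps_i, eps_i ~ N(0, sigma^2).
# Data, step size rho, and iteration count are assumed for illustration.
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
rho = 0.1                                # step size
for _ in range(200):
    grad = X.T @ (y - X @ theta) / N     # gradient of the average log-likelihood (up to 1/sigma^2)
    theta += rho * grad                  # gradient ascent step
print(theta)                             # approaches theta_true
```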
Recall: Logistic Regression (sigmoid classifier, perceptron, etc.)
The conditional distribution: a Bernoulli

p(y \mid x) = \mu(x)^y \big(1 - \mu(x)\big)^{1-y}

where \mu(x) is a logistic function

\mu(x) = \frac{1}{1 + e^{-\theta^T x}}

We can use the brute-force gradient method as in linear regression, but we can also apply generic results by observing that p(y|x) is an exponential family distribution; more specifically, a generalized linear model!
Parameterizing graphical models: Markov random fields

p(\mathbf{x}) = \frac{1}{Z} \exp\Big\{ \sum_{c \in C} \phi_c(\mathbf{x}_c) \Big\} = \frac{1}{Z} \exp\{ -H(\mathbf{x}) \}

e.g., for a pairwise (Ising-like) model over binary variables:

p(\mathbf{X}) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{(i,j) \in N} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i \Big\}
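Below is a toy sketch of this pairwise parameterization on a 4-node chain of ±1 spins; the edge couplings and node biases are made up, and the partition function Z is computed by brute-force enumeration (only feasible for tiny models).

```python
import itertools
import numpy as np

# Sketch: p(X) proportional to exp{ sum_(i,j) theta_ij X_i X_j + sum_i theta_i0 X_i }
# on a 4-node chain; weights are assumptions for illustration.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
theta_edge = {e: 0.8 for e in edges}           # assumed coupling strengths
theta_node = np.array([0.1, -0.2, 0.0, 0.3])   # assumed biases

def score(x):
    """Exponent (negative energy) for configuration x in {-1,+1}^n."""
    pair = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return pair + float(theta_node @ x)

configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
Z = sum(np.exp(score(x)) for x in configs)     # partition function by enumeration
p = lambda x: np.exp(score(np.array(x))) / Z
print(p([1, 1, 1, 1]), sum(p(c) for c in configs))  # the second number is 1.0
```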
Restricted Boltzmann Machines

[Figure: a bipartite graph of hidden units h and visible units x]

p(\mathbf{x}, \mathbf{h} \mid \theta) = \exp\Big\{ \sum_i \theta_i \phi_i(x_i) + \sum_j \theta_j \phi_j(h_j) + \sum_{i,j} \theta_{i,j} \phi_{i,j}(x_i, h_j) - A(\theta) \Big\}
Conditional Random Fields
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \Big\}

[Figure: linear-chain CRFs over labels Y_1, ..., Y_T with observations X_1, ..., X_T, and a general CRF over labels Y_1, Y_2, ..., Y_5 with features X_1, ..., X_n]

Discriminative
The X_i's are treated as features and may be inter-dependent
When labeling X_i, future observations are taken into account
Conditional Distribution

If the graph G = (V, E) of Y is a tree, then by the Hammersley-Clifford theorem the conditional distribution over the label sequence Y = y, given X = x, is:

p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)

─ x is a data sequence
─ y is a label sequence
─ v is a vertex from vertex set V = set of label random variables
─ e is an edge from edge set E over V
─ f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
─ k is the index number of the features
─ \theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); \lambda_k and \mu_k are parameters to be estimated
─ y|_e is the set of components of y defined by edge e
─ y|_v is the set of components of y defined by vertex v
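To make the form above concrete, here is a toy sketch for a length-3 binary label chain; the observation sequence, the Boolean vertex/edge features g and f, and the weights are all hypothetical, and Z(x) is computed by enumerating the (small) label space.

```python
import itertools
import math

# Toy sketch of p(y|x) proportional to
#   exp{ sum_e lambda * f(e, y|_e, x) + sum_v mu * g(v, y|_v, x) }
# with hypothetical features and weights.
x = ["rainy", "rainy", "sunny"]          # observed sequence (assumed)
labels = [0, 1]

def g(v, y_v, x):                        # vertex feature: label matches observation type
    return 1.0 if (x[v] == "sunny") == (y_v == 1) else 0.0

def f(e, y_e, x):                        # edge feature: neighbouring labels agree
    return 1.0 if y_e[0] == y_e[1] else 0.0

mu, lam = 2.0, 1.0                       # hypothetical weights

def score(y):
    vertex = sum(mu * g(v, y[v], x) for v in range(len(x)))
    edge = sum(lam * f(e, (y[e], y[e + 1]), x) for e in range(len(x) - 1))
    return vertex + edge

ys = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(y)) for y in ys)              # Z(x) by enumeration
p = {y: math.exp(score(y)) / Z for y in ys}
print(max(p, key=p.get))                             # most probable labeling: (0, 0, 1)
```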
2-D Conditional Random Fields
Allow arbitrary dependencies on input
Clique dependencies on labels
Use approximate inference for general graphs
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \Big\}
Exponential family, a basic building block

For a numeric random variable X,

p(x \mid \eta) = h(x) \exp\{ \eta^T T(x) - A(\eta) \} = \frac{1}{Z(\eta)}\, h(x) \exp\{ \eta^T T(x) \}

is an exponential family distribution with natural (canonical) parameter \eta.

Function T(x) is a sufficient statistic.
Function A(\eta) = \log Z(\eta) is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
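As a quick sanity check of this form, the sketch below writes the Bernoulli distribution (mentioned on the slide only by name) in the h(x) exp{η T(x) − A(η)} representation and compares it with the standard parameterization; the particular value of μ is arbitrary.

```python
import numpy as np

# Sketch: Bernoulli(x | mu) = exp{ x * log(mu/(1-mu)) + log(1-mu) }
#   => eta = log(mu/(1-mu)), T(x) = x, A(eta) = log(1 + e^eta), h(x) = 1.
def bernoulli_pmf_standard(x, mu):
    return mu**x * (1 - mu)**(1 - x)

def bernoulli_pmf_expfam(x, eta):
    A = np.log1p(np.exp(eta))            # log normalizer
    return np.exp(eta * x - A)           # h(x) = 1

mu = 0.3
eta = np.log(mu / (1 - mu))
for x in (0, 1):
    print(bernoulli_pmf_standard(x, mu), bernoulli_pmf_expfam(x, eta))  # identical
```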
Example: Multivariate Gaussian Distribution

For a continuous vector random variable X in R^k, the moment parameterization (\mu, \Sigma) gives

p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \Big\}
 = \frac{1}{(2\pi)^{k/2}} \exp\Big\{ -\tfrac{1}{2}\,\mathrm{tr}(\Sigma^{-1}\mathbf{x}\mathbf{x}^T) + \mu^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \log|\Sigma|^{1/2} \Big\}

Exponential family representation (natural parameterization \eta):

\eta = \big[ \Sigma^{-1}\mu ;\; -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \big] = [\eta_1; \mathrm{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \;\; \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
T(\mathbf{x}) = [\mathbf{x} ;\; \mathrm{vec}(\mathbf{x}\mathbf{x}^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \log|\Sigma|^{1/2}
h(\mathbf{x}) = (2\pi)^{-k/2}

Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).
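The sketch below implements the moment-to-natural map written above and its inverse; the example values of μ and Σ are arbitrary.

```python
import numpy as np

# Sketch: natural parameters of N(mu, Sigma) per the slide,
#   eta1 = Sigma^{-1} mu,   eta2 = -0.5 * Sigma^{-1},
# and the inverse map back to the moment parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

Lambda = np.linalg.inv(Sigma)            # precision matrix
eta1 = Lambda @ mu
eta2 = -0.5 * Lambda

# Invert: Sigma = -0.5 * eta2^{-1},  mu = Sigma @ eta1
Sigma_back = -0.5 * np.linalg.inv(eta2)
mu_back = Sigma_back @ eta1
print(np.allclose(Sigma_back, Sigma), np.allclose(mu_back, mu))   # True True
```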
Example: Multinomial distribution

For a binary (one-hot) vector random variable \mathbf{x} \sim \mathrm{multi}(\mathbf{x} \mid \pi):

p(\mathbf{x} \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\Big\{ \sum_k x_k \ln\pi_k \Big\}
 = \exp\Big\{ \sum_{k=1}^{K-1} x_k \ln\pi_k + \Big(1 - \sum_{k=1}^{K-1} x_k\Big)\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) \Big\}
 = \exp\Big\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{j=1}^{K-1}\pi_j} + \ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) \Big\}

Exponential family representation:

\eta = \big[ \ln(\pi_k / \pi_K) ;\; 0 \big], \quad \text{where } \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
T(\mathbf{x}) = [\mathbf{x}]
A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big( \sum_{k=1}^{K} e^{\eta_k} \Big)
h(\mathbf{x}) = 1
Why exponential family? Moment generating property
\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta)
 = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = \frac{1}{Z(\eta)}\frac{d}{d\eta}\int h(x)\exp\{\eta^T T(x)\}\,dx
 = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx
 = E[T(x)]

\frac{d^2A}{d\eta^2} = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\Big(T(x) - \frac{dA}{d\eta}\Big)\,dx
 = \int T^2(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx - E[T(x)]\frac{dA}{d\eta}
 = E[T^2(x)] - E[T(x)]^2
 = \mathrm{Var}[T(x)]
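A quick numerical check of these two identities for the Bernoulli case is sketched below; the choice of η and the finite-difference step are arbitrary.

```python
import numpy as np

# Numerical check of the moment-generating property for a Bernoulli:
# A(eta) = log(1 + e^eta), T(x) = x, so A'(eta) = mu and A''(eta) = mu*(1-mu).
A = lambda eta: np.log1p(np.exp(eta))
eta, h = 0.7, 1e-4

dA = (A(eta + h) - A(eta - h)) / (2 * h)              # first derivative (central difference)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second derivative

mu = 1 / (1 + np.exp(-eta))                           # E[T(x)] = E[x]
print(dA, mu)                 # approximately equal
print(d2A, mu * (1 - mu))     # approximately equal (Var[x])
```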
Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta). The q-th derivative gives the q-th centered moment:

\frac{dA(\eta)}{d\eta} = \mu \quad \text{(mean)}
\frac{d^2A(\eta)}{d\eta^2} = \sigma^2 \quad \text{(variance)}

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters

The moment parameter \mu can be derived from the natural (canonical) parameter:

\mu \stackrel{\text{def}}{=} E[T(x)] = \frac{dA(\eta)}{d\eta}

A(\eta) is convex since

\frac{d^2A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] \ge 0

[Figure: A(\eta) plotted as a convex function of \eta]

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

\eta \stackrel{\text{def}}{=} \psi(\mu)

A distribution in the exponential family can be parameterized not only by the canonical parameterization, but also by the moment parameterization.
MLE for Exponential Family

For iid data, the log-likelihood is

\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\}
 = \sum_n \log h(x_n) + \eta^T\Big(\sum_n T(x_n)\Big) - N A(\eta)

Take the derivative and set it to zero:

\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0
\;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n)
\;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)

This amounts to moment matching. We can infer the canonical parameters using

\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})
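The sketch below carries out this moment matching for the Poisson case, where T(x) = x, so the MLE of the mean is just the sample mean and the natural parameter is recovered with η = log μ; the simulated data and rate are assumptions.

```python
import numpy as np

# Sketch: MLE in an exponential family = moment matching.
# For Poisson, T(x) = x and A(eta) = e^eta, so mu = e^eta and
#   mu_MLE = (1/N) sum_n x_n,   eta_MLE = psi(mu_MLE) = log(mu_MLE).
rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=10_000)   # simulated observations (rate assumed)

mu_mle = data.mean()                       # matches the empirical E[T(x)]
eta_mle = np.log(mu_mle)                   # invert the mean map
print(mu_mle, eta_mle, np.log(3.5))        # eta_mle is close to log(3.5)
```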
Sufficiency

For p(x \mid \theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can throw away X for the purpose of inference w.r.t. \theta.

Bayesian view: \theta \perp X \mid T(x), i.e.,
p(\theta \mid T(x), x) = p(\theta \mid T(x))

Frequentist view: X \perp \theta \mid T(x), i.e.,
p(x \mid T(x), \theta) = p(x \mid T(x))

The Neyman factorization theorem: T(x) is sufficient for \theta if

p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x))
\;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))
Examples

Gaussian:
\eta = \big[\Sigma^{-1}\mu ;\; -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1})\big]
T(\mathbf{x}) = [\mathbf{x} ;\; \mathrm{vec}(\mathbf{x}\mathbf{x}^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \log|\Sigma|^{1/2}
h(\mathbf{x}) = (2\pi)^{-k/2}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(\mathbf{x}_n) = \frac{1}{N}\sum_n \mathbf{x}_n

Multinomial:
\eta = \big[\ln(\pi_k/\pi_K) ;\; 0\big]
T(\mathbf{x}) = [\mathbf{x}]
A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K} e^{\eta_k}\Big)
h(\mathbf{x}) = 1
\Rightarrow \hat{\pi}_{MLE} = \frac{1}{N}\sum_n \mathbf{x}_n

Poisson:
\eta = \log\lambda
T(x) = x
A(\eta) = \lambda = e^{\eta}
h(x) = \frac{1}{x!}
\Rightarrow \hat{\lambda}_{MLE} = \frac{1}{N}\sum_n x_n
Bayesian est.
Generalized Linear Models (GLIMs)

The graphical model:
[Figure: a plate model with X_n -> Y_n, repeated for n = 1..N]

Linear regression
Discriminative linear classification
Commonality: model E_p(Y) = \mu = f(\theta^T X)
  What is p()? the conditional distribution of Y.
  What is f()? the response function.

GLIM:
  The observed input x is assumed to enter into the model via a linear combination of its elements, \xi = \theta^T\mathbf{x}.
  The conditional mean \mu is represented as a function f(\xi) of \xi, where f is known as the response function.
  The observed output y is assumed to be characterized by an exponential family distribution with conditional mean \mu.
GLIM, cont.

[Figure: x -> \xi = \theta^T x -> f -> \mu -> EXP -> y]

The conditional distribution of y is an exponential family with natural parameter \eta(x):

p(y \mid \eta) = h(y)\exp\{ \eta(x)^T y - A(\eta) \}
(or, with a scale parameter \phi: p(y \mid \eta, \phi) = h(y, \phi)\exp\{ \phi^{-1}(\eta(x)^T y - A(\eta)) \})

The choice of exp family is constrained by the nature of the data Y.
  Example: y is a continuous vector: multivariate Gaussian; y is a class label: Bernoulli or multinomial.
The choice of the response function:
  Following some mild constraints, e.g., range in [0, 1], positivity, ...
  Canonical response function: f = \psi^{-1}.
  In this case \theta^T x directly corresponds to the canonical parameter \eta.
Example canonical response functions
MLE for GLIMs with canonical response

Log-likelihood:

\ell = \sum_n \log h(y_n) + \sum_n \big( \theta^T\mathbf{x}_n y_n - A(\eta_n) \big)

Derivative of the log-likelihood:

\frac{d\ell}{d\theta} = \sum_n \Big( \mathbf{x}_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \Big)
 = \sum_n (y_n - \mu_n)\mathbf{x}_n
 = X^T(\mathbf{y} - \mu)

This is a fixed-point function because \mu is a function of \theta.

Online learning for canonical GLIMs
Stochastic gradient ascent = least mean squares (LMS) algorithm:

\theta^{t+1} = \theta^t + \rho\,(y_n - \mu_n^t)\,\mathbf{x}_n

where \mu_n^t = f\big((\theta^t)^T\mathbf{x}_n\big) (simply (\theta^t)^T\mathbf{x}_n for linear regression) and \rho is a step size.
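A minimal sketch of this online update for one canonical GLIM, the logistic (Bernoulli) case, is shown below; the synthetic data, response function, and step size are illustrative choices rather than part of the slide.

```python
import numpy as np

# Sketch: online learning for a canonical GLIM,
#   theta <- theta + rho * (y_n - mu_n) * x_n,  with mu_n = f(theta^T x_n).
# Here f is the logistic response (Bernoulli output); data are synthetic.
rng = np.random.default_rng(0)
N, d = 2000, 3
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(d)
rho = 0.1                                    # step size (assumed)
for n in range(N):
    mu_n = 1 / (1 + np.exp(-theta @ X[n]))   # canonical response for the Bernoulli case
    theta += rho * (y[n] - mu_n) * X[n]
print(theta)                                 # noisy estimate, roughly aligned with theta_true
```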
Batch learning for canonical GLIMs

The Hessian matrix:

H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\mathbf{x}_n
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\theta^T}
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T}
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\eta_n}\mathbf{x}_n^T \quad (\text{since } \eta_n = \theta^T\mathbf{x}_n)
 = -X^T W X

where X = [\mathbf{x}_1^T; \mathbf{x}_2^T; \ldots; \mathbf{x}_N^T] is the design matrix, \mathbf{y} = [y_1; y_2; \ldots; y_N], and

W = \mathrm{diag}\Big( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \Big)

which can be computed by calculating the 2nd derivative of A(\eta_n).
Recall LMS

Cost function in matrix form:

J(\theta) = \frac{1}{2}\sum_{i=1}^n \big(\mathbf{x}_i^T\theta - y_i\big)^2 = \frac{1}{2}\big(X\theta - \vec{y}\big)^T\big(X\theta - \vec{y}\big)

where X = [\mathbf{x}_1^T; \mathbf{x}_2^T; \ldots; \mathbf{x}_n^T] and \vec{y} = [y_1; y_2; \ldots; y_n].

To minimize J(\theta), take the derivative and set it to zero:

\nabla_\theta J = \frac{1}{2}\nabla_\theta\,\mathrm{tr}\big( \theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y} \big)
 = \frac{1}{2}\big( X^T X\theta + X^T X\theta - 2X^T\vec{y} \big)
 = X^T X\theta - X^T\vec{y} = 0

\Rightarrow X^T X\theta = X^T\vec{y} \quad \text{(the normal equations)}
\Rightarrow \theta^* = (X^T X)^{-1}X^T\vec{y}
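A direct sketch of solving the normal equations on synthetic data follows; in practice one would typically call a least-squares solver rather than form the explicit inverse, as the second line shows.

```python
import numpy as np

# Sketch: solve the normal equations X^T X theta = X^T y (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)        # solves the normal equations
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]    # numerically preferred route
print(theta_star, theta_lstsq)                        # essentially the same
```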
Iteratively Reweighted Least Squares (IRLS)

Recall Newton-Raphson methods with cost function J:

\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta J

We now have:

\nabla_\theta\ell = X^T(\mathbf{y} - \mu)
H = -X^T W X

Now:

\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta\ell
 = \theta^t + (X^T W^t X)^{-1} X^T(\mathbf{y} - \mu^t)
 = (X^T W^t X)^{-1}\big[ X^T W^t X\theta^t + X^T(\mathbf{y} - \mu^t) \big]
 = (X^T W^t X)^{-1} X^T W^t \mathbf{z}^t

where the adjusted response is

\mathbf{z}^t = X\theta^t + (W^t)^{-1}(\mathbf{y} - \mu^t)

This can be understood as solving the following "iteratively reweighted least squares" problem (cf. \theta^* = (X^T X)^{-1}X^T\mathbf{y} for ordinary least squares):

\theta^{t+1} = \arg\min_\theta\, (\mathbf{z} - X\theta)^T W (\mathbf{z} - X\theta)
Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y \mid x) = \mu(x)^y\big(1 - \mu(x)\big)^{1-y}

where \mu(x) is a logistic function

\mu(x) = \frac{1}{1 + e^{-\eta(x)}}

p(y|x) is an exponential family function, with mean:

E[y \mid x] = \mu = \frac{1}{1 + e^{-\theta^T x}}

and canonical response function

\eta = \xi = \theta^T x

IRLS:

\frac{d\mu}{d\eta} = \mu(1 - \mu)
W = \mathrm{diag}\big( \mu_1(1 - \mu_1), \ldots, \mu_N(1 - \mu_N) \big)
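Putting the last two slides together, here is a compact sketch of the IRLS iteration θ ← (XᵀWX)⁻¹XᵀWz with W = diag(μ(1−μ)) for logistic regression; the synthetic data, the iteration count, and the tiny ridge term added for numerical stability are assumptions.

```python
import numpy as np

# Sketch of IRLS for logistic regression:
#   mu = sigmoid(X theta),  W = diag(mu * (1 - mu)),
#   z = X theta + W^{-1}(y - mu),  theta <- (X^T W X)^{-1} X^T W z.
rng = np.random.default_rng(0)
N, d = 500, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(d)
for _ in range(10):                       # a handful of Newton/IRLS steps
    mu = 1 / (1 + np.exp(-X @ theta))
    w = mu * (1 - mu)
    z = X @ theta + (y - mu) / w          # adjusted response
    XtW = X.T * w                         # X^T W without forming diag(w)
    theta = np.linalg.solve(XtW @ X + 1e-8 * np.eye(d), XtW @ z)
print(theta)                              # close to theta_true
```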
Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

p(y = \pm 1 \mid x, \theta) = \sigma(y\,\theta^T x) = \frac{1}{1 + e^{-y\,\theta^T x}}
p(\theta) = \mathrm{Normal}(0, \lambda^{-1}I)
\ell(\theta) = \sum_n \log\sigma(y_n\,\theta^T\mathbf{x}_n) - \frac{\lambda}{2}\,\theta^T\theta

IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of input x.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large; cf. the perceptron rule, with per-example gradient

\nabla\ell_n = \big(1 - \sigma(y_n\,\theta^T\mathbf{x}_n)\big)\,y_n\mathbf{x}_n
Example 2: linear regression

The conditional distribution: a Gaussian

p(y \mid \mathbf{x}, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2}\big(y - \mu(\mathbf{x})\big)^T\Sigma^{-1}\big(y - \mu(\mathbf{x})\big) \Big\}
 = h(y)\exp\big\{ \eta(\mathbf{x})^T y - A(\eta) \big\}

where \mu(x) is a linear function

\mu(\mathbf{x}) = \theta^T\mathbf{x}

p(y|x) is an exponential family function, with mean:

E[y \mid \mathbf{x}] = \mu = \theta^T\mathbf{x}

and canonical response function

\eta = \xi = \theta^T\mathbf{x}

IRLS:

\frac{d\mu}{d\eta} = 1, \qquad W = I

\theta^{t+1} = (X^T W X)^{-1} X^T W\mathbf{z}^t
 = (X^T X)^{-1} X^T\big( X\theta^t + \mathbf{y} - \mu^t \big)
 = \theta^t + (X^T X)^{-1} X^T(\mathbf{y} - \mu^t) \quad \text{(steepest-descent direction, rescaled by } (X^T X)^{-1}\text{)}
 = (X^T X)^{-1} X^T\mathbf{y} \quad \text{(the normal equation, reached in a single step)}
Simple GMs are the building blocks of complex BNs

Classification: generative and discriminative approaches
[Figure: Q -> X and Q <- X]

Regression: linear, conditional mixture, nonparametric
[Figure: X -> Y]

Density estimation: parametric and nonparametric methods
[Figure: a single node X with parameters mu, sigma]
An (incomplete) genealogy of graphical models

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful and indeed favored because they put human knowledge to good use ...
MLE for general BNs

If we assume the parameters for each CPD (a GLIM) are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

\ell(\theta; D) = \log p(D \mid \theta) = \log\prod_n\prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i\Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)

Therefore, MLE-based parameter estimation of a GM reduces to local estimation of each GLIM.

[Figure: each local CPD is estimated from its own counts, e.g., conditioning on X_2 = 0/1 and X_5 = 0/1]
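For fully observed discrete CPDs, this decomposition means each local multinomial is estimated from normalized counts, as sketched below on a made-up two-node network A -> B.

```python
from collections import Counter

# Sketch: because the log-likelihood decomposes per node, MLE for a fully
# observed discrete BN reduces to normalized counts per CPD.
# Tiny made-up dataset over the network A -> B.
data = [("a0", "b0"), ("a0", "b1"), ("a0", "b1"),
        ("a1", "b1"), ("a1", "b1"), ("a1", "b0")]

# Local estimate for P(A): counts of A values, normalized.
count_a = Counter(a for a, _ in data)
P_A = {a: c / len(data) for a, c in count_a.items()}

# Local estimate for P(B | A): counts of (a, b), normalized within each a.
count_ab = Counter(data)
P_B_given_A = {(a, b): count_ab[(a, b)] / count_a[a] for (a, b) in count_ab}

print(P_A)           # {'a0': 0.5, 'a1': 0.5}
print(P_B_given_A)   # e.g. P(b1 | a0) = 2/3
```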
[Figure: the alarm network: Burglary -> Alarm <- Earthquake, Earthquake -> Radio, Alarm -> Call]

Factorization:

p(\mathbf{X}) = \prod_{i=1}^{M} p(x_i \mid \mathbf{x}_{\pi_i})

Local distributions defined by, e.g., multinomial parameters:

\theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j} = p(x_i^k \mid \mathbf{x}_{\pi_i}^j)

How to define the parameter prior p(\theta \mid G)?

Assumptions (Geiger & Heckerman 97, 99):
  Complete Model Equivalence
  Global Parameter Independence
  Local Parameter Independence
  Likelihood and Prior Modularity
Global & Local Parameter Independence

Global Parameter Independence: for every DAG model,

p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)

Local Parameter Independence: for every node,

p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G)

e.g., P(Call \mid Alarm = YES) is independent of P(Call \mid Alarm = NO).

[Figure: the alarm network (Burglary, Earthquake, Alarm, Radio, Call) annotated with its per-node parameters]
Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform Bayesian updates on each parameter independently!

[Figure: two samples of (X_1, X_2) sharing parameter nodes \theta_1 (global parameter independence) and \theta_{2|1} (local parameter independence)]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)
Discrete DAG Models:

x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j})

Dirichlet prior:

P(\theta) = C(\alpha)\prod_k \theta_k^{\alpha_k - 1} = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1}

Gaussian DAG Models:

x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)

Normal prior:

p(\mu \mid \nu, \Phi) = (2\pi)^{-n/2}|\Phi|^{-1/2}\exp\Big\{ -\tfrac{1}{2}(\mu - \nu)^T\Phi^{-1}(\mu - \nu) \Big\}

Normal-Wishart prior:

p(\mu \mid W, \nu, \alpha_\mu) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big)
p(W \mid \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\Big\{ -\tfrac{1}{2}\mathrm{tr}(TW) \Big\}
where W = \Sigma^{-1}.
Summary: Parameterizing GMs

For exponential family distributions, MLE amounts to moment matching.

GLIM:
  Natural response
  Iteratively Reweighted Least Squares as a general algorithm
GLIMs are building blocks of most GMs in practical use
Parameter independence and appropriate priors