School of Computer Science
Probabilistic Graphical Models
Generalized linear models
Eric Xing, Lecture 5, February 1, 2017
Reading: KF-chap 17
[Figure: example directed and undirected graphs over nodes X1, X2, X3, X4]
Parameterizing graphical models: Bayesian network

P(\mathbf{X}) = \prod_{i=1:d} P(X_i \mid \mathbf{X}_{\pi_i})

Example: A -> C <- B

Discrete (tabular CPDs):

P(A):  a0 0.75   a1 0.25
P(B):  b0 0.33   b1 0.67
P(C | A, B):
        a0b0   a0b1   a1b0   a1b1
  c0    0.45   1      0.9    0.7
  c1    0.55   0      0.1    0.3

Or continuous (Gaussian CPDs):

A ~ N(\mu_a, \Sigma_a),   B ~ N(\mu_b, \Sigma_b),   C ~ N(A + B, \Sigma_c)

Or hybrid: the local conditionals can mix discrete and continuous variables.
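As a small illustration, the sketch below shows how the discrete CPTs above define the joint P(A, B, C) = P(A) P(B) P(C | A, B); the table values are taken from the slide, and the dictionary encoding is just one convenient choice.

```python
# Minimal sketch: the discrete CPTs above define the joint
# P(A, B, C) = P(A) P(B) P(C | A, B) for the network A -> C <- B.
P_A = {"a0": 0.75, "a1": 0.25}
P_B = {"b0": 0.33, "b1": 0.67}
P_C_given_AB = {  # keyed by (a, b); values give P(C = c | a, b)
    ("a0", "b0"): {"c0": 0.45, "c1": 0.55},
    ("a0", "b1"): {"c0": 1.0,  "c1": 0.0},
    ("a1", "b0"): {"c0": 0.9,  "c1": 0.1},
    ("a1", "b1"): {"c0": 0.7,  "c1": 0.3},
}

def joint(a, b, c):
    """P(A=a, B=b, C=c) via the BN factorization."""
    return P_A[a] * P_B[b] * P_C_given_AB[(a, b)][c]

if __name__ == "__main__":
    print(joint("a0", "b1", "c0"))   # 0.75 * 0.67 * 1.0
    # The joint sums to one over all assignments.
    total = sum(joint(a, b, c) for a in P_A for b in P_B for c in ("c0", "c1"))
    print(total)                     # 1.0 (up to floating point)
```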
Recall: Linear Regression

Let us assume that the target variable and the inputs are related by the equation:

y_i = \theta^T \mathbf{x}_i + \epsilon_i

where \epsilon_i is an error term capturing unmodeled effects or random noise.

Now assume that \epsilon follows a Gaussian N(0, \sigma^2); then we have:

p(y_i \mid \mathbf{x}_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T \mathbf{x}_i)^2}{2\sigma^2} \right)

We can use the LMS algorithm, a gradient ascent/descent approach, to estimate the parameters.
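A minimal sketch of gradient-based (LMS-style) estimation under this Gaussian noise model is shown below; the synthetic data, step size, and iteration count are illustrative assumptions, not part of the slide.

```python
import numpy as np

# Sketch of batch gradient ascent for y_i = theta^T x_i + eps_i, eps_i ~ N(0, sigma^2).
# Data, step size rho, and iteration count are assumed for illustration.
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.1 * rng.normal(size=N)

theta = np.zeros(d)
rho = 0.1                                # step size
for _ in range(200):
    grad = X.T @ (y - X @ theta) / N     # gradient of the average log-likelihood (up to 1/sigma^2)
    theta += rho * grad                  # gradient ascent step
print(theta)                             # approaches theta_true
```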
Recall: Logistic Regression (sigmoid classifier, perceptron, etc.)
The conditional distribution: a Bernoulli

p(y \mid x) = \mu(x)^y \big(1 - \mu(x)\big)^{1-y}

where \mu(x) is a logistic function

\mu(x) = \frac{1}{1 + e^{-\theta^T x}}

We can use the brute-force gradient method as in linear regression, but we can also apply generic results by observing that p(y|x) is an exponential family distribution; more specifically, a generalized linear model!
Parameterizing graphical models: Markov random fields

p(\mathbf{x}) = \frac{1}{Z} \exp\Big\{ \sum_{c \in C} \phi_c(\mathbf{x}_c) \Big\} = \frac{1}{Z} \exp\{ -H(\mathbf{x}) \}

e.g., for a pairwise (Ising-like) model over binary variables:

p(\mathbf{X}) = \frac{1}{Z(\theta)} \exp\Big\{ \sum_{(i,j) \in N} \theta_{ij} X_i X_j + \sum_i \theta_{i0} X_i \Big\}
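Below is a toy sketch of this pairwise parameterization on a 4-node chain of ±1 spins; the edge couplings and node biases are made up, and the partition function Z is computed by brute-force enumeration (only feasible for tiny models).

```python
import itertools
import numpy as np

# Sketch: p(X) proportional to exp{ sum_(i,j) theta_ij X_i X_j + sum_i theta_i0 X_i }
# on a 4-node chain; weights are assumptions for illustration.
n = 4
edges = [(0, 1), (1, 2), (2, 3)]
theta_edge = {e: 0.8 for e in edges}           # assumed coupling strengths
theta_node = np.array([0.1, -0.2, 0.0, 0.3])   # assumed biases

def score(x):
    """Exponent (negative energy) for configuration x in {-1,+1}^n."""
    pair = sum(theta_edge[(i, j)] * x[i] * x[j] for (i, j) in edges)
    return pair + float(theta_node @ x)

configs = [np.array(c) for c in itertools.product([-1, 1], repeat=n)]
Z = sum(np.exp(score(x)) for x in configs)     # partition function by enumeration
p = lambda x: np.exp(score(np.array(x))) / Z
print(p([1, 1, 1, 1]), sum(p(c) for c in configs))  # the second number is 1.0
```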
Restricted Boltzmann Machines

[Figure: a bipartite graph of hidden units h and visible units x]

p(\mathbf{x}, \mathbf{h} \mid \theta) = \exp\Big\{ \sum_i \theta_i \phi_i(x_i) + \sum_j \theta_j \phi_j(h_j) + \sum_{i,j} \theta_{i,j} \phi_{i,j}(x_i, h_j) - A(\theta) \Big\}
Conditional Random Fields
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \Big\}

[Figure: linear-chain CRFs over labels Y_1, ..., Y_T with observations X_1, ..., X_T, and a general CRF over labels Y_1, Y_2, ..., Y_5 with features X_1, ..., X_n]

Discriminative
The X_i's are treated as features and may be inter-dependent
When labeling X_i, future observations are taken into account
Conditional Distribution

If the graph G = (V, E) of Y is a tree, then by the Hammersley-Clifford theorem the conditional distribution over the label sequence Y = y, given X = x, is:

p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big( \sum_{e \in E,\,k} \lambda_k f_k(e, \mathbf{y}|_e, \mathbf{x}) + \sum_{v \in V,\,k} \mu_k g_k(v, \mathbf{y}|_v, \mathbf{x}) \Big)

─ x is a data sequence
─ y is a label sequence
─ v is a vertex from vertex set V = set of label random variables
─ e is an edge from edge set E over V
─ f_k and g_k are given and fixed; g_k is a Boolean vertex feature, f_k is a Boolean edge feature
─ k is the index number of the features
─ \theta = (\lambda_1, \lambda_2, \ldots, \lambda_n; \mu_1, \mu_2, \ldots, \mu_n); \lambda_k and \mu_k are parameters to be estimated
─ y|_e is the set of components of y defined by edge e
─ y|_v is the set of components of y defined by vertex v
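To make the form above concrete, here is a toy sketch for a length-3 binary label chain; the observation sequence, the Boolean vertex/edge features g and f, and the weights are all hypothetical, and Z(x) is computed by enumerating the (small) label space.

```python
import itertools
import math

# Toy sketch of p(y|x) proportional to
#   exp{ sum_e lambda * f(e, y|_e, x) + sum_v mu * g(v, y|_v, x) }
# with hypothetical features and weights.
x = ["rainy", "rainy", "sunny"]          # observed sequence (assumed)
labels = [0, 1]

def g(v, y_v, x):                        # vertex feature: label matches observation type
    return 1.0 if (x[v] == "sunny") == (y_v == 1) else 0.0

def f(e, y_e, x):                        # edge feature: neighbouring labels agree
    return 1.0 if y_e[0] == y_e[1] else 0.0

mu, lam = 2.0, 1.0                       # hypothetical weights

def score(y):
    vertex = sum(mu * g(v, y[v], x) for v in range(len(x)))
    edge = sum(lam * f(e, (y[e], y[e + 1]), x) for e in range(len(x) - 1))
    return vertex + edge

ys = list(itertools.product(labels, repeat=len(x)))
Z = sum(math.exp(score(y)) for y in ys)              # Z(x) by enumeration
p = {y: math.exp(score(y)) / Z for y in ys}
print(max(p, key=p.get))                             # most probable labeling: (0, 0, 1)
```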
2-D Conditional Random Fields
Allow arbitrary dependencies on input
Clique dependencies on labels
Use approximate inference for general graphs
p_\theta(\mathbf{y} \mid \mathbf{x}) = \frac{1}{Z(\mathbf{x})} \exp\Big\{ \sum_c \theta_c f_c(\mathbf{x}, \mathbf{y}_c) \Big\}
Exponential family, a basic building block

For a numeric random variable X,

p(x \mid \eta) = h(x) \exp\{ \eta^T T(x) - A(\eta) \} = \frac{1}{Z(\eta)}\, h(x) \exp\{ \eta^T T(x) \}

is an exponential family distribution with natural (canonical) parameter \eta.

Function T(x) is a sufficient statistic.
Function A(\eta) = \log Z(\eta) is the log normalizer.
Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
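As a quick sanity check of this form, the sketch below writes the Bernoulli distribution (mentioned on the slide only by name) in the h(x) exp{η T(x) − A(η)} representation and compares it with the standard parameterization; the particular value of μ is arbitrary.

```python
import numpy as np

# Sketch: Bernoulli(x | mu) = exp{ x * log(mu/(1-mu)) + log(1-mu) }
#   => eta = log(mu/(1-mu)), T(x) = x, A(eta) = log(1 + e^eta), h(x) = 1.
def bernoulli_pmf_standard(x, mu):
    return mu**x * (1 - mu)**(1 - x)

def bernoulli_pmf_expfam(x, eta):
    A = np.log1p(np.exp(eta))            # log normalizer
    return np.exp(eta * x - A)           # h(x) = 1

mu = 0.3
eta = np.log(mu / (1 - mu))
for x in (0, 1):
    print(bernoulli_pmf_standard(x, mu), bernoulli_pmf_expfam(x, eta))  # identical
```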
Example: Multivariate Gaussian Distribution

For a continuous vector random variable X in R^k, the moment parameterization (\mu, \Sigma) gives

p(\mathbf{x} \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2}(\mathbf{x} - \mu)^T \Sigma^{-1} (\mathbf{x} - \mu) \Big\}
 = \frac{1}{(2\pi)^{k/2}} \exp\Big\{ -\tfrac{1}{2}\,\mathrm{tr}(\Sigma^{-1}\mathbf{x}\mathbf{x}^T) + \mu^T\Sigma^{-1}\mathbf{x} - \tfrac{1}{2}\mu^T\Sigma^{-1}\mu - \log|\Sigma|^{1/2} \Big\}

Exponential family representation (natural parameterization \eta):

\eta = \big[ \Sigma^{-1}\mu ;\; -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1}) \big] = [\eta_1; \mathrm{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \;\; \eta_2 = -\tfrac{1}{2}\Sigma^{-1}
T(\mathbf{x}) = [\mathbf{x} ;\; \mathrm{vec}(\mathbf{x}\mathbf{x}^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \log|\Sigma|^{1/2}
h(\mathbf{x}) = (2\pi)^{-k/2}

Note: a k-dimensional Gaussian is a (k + k^2)-parameter distribution with a (k + k^2)-element vector of sufficient statistics (but because of symmetry and positive definiteness, the parameters are constrained and have lower degrees of freedom).
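The sketch below implements the moment-to-natural map written above and its inverse; the example values of μ and Σ are arbitrary.

```python
import numpy as np

# Sketch: natural parameters of N(mu, Sigma) per the slide,
#   eta1 = Sigma^{-1} mu,   eta2 = -0.5 * Sigma^{-1},
# and the inverse map back to the moment parameters.
mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.3],
                  [0.3, 1.0]])

Lambda = np.linalg.inv(Sigma)            # precision matrix
eta1 = Lambda @ mu
eta2 = -0.5 * Lambda

# Invert: Sigma = -0.5 * eta2^{-1},  mu = Sigma @ eta1
Sigma_back = -0.5 * np.linalg.inv(eta2)
mu_back = Sigma_back @ eta1
print(np.allclose(Sigma_back, Sigma), np.allclose(mu_back, mu))   # True True
```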
Example: Multinomial distribution

For a binary (one-hot) vector random variable \mathbf{x} \sim \mathrm{multi}(\mathbf{x} \mid \pi):

p(\mathbf{x} \mid \pi) = \pi_1^{x_1}\pi_2^{x_2}\cdots\pi_K^{x_K} = \exp\Big\{ \sum_k x_k \ln\pi_k \Big\}
 = \exp\Big\{ \sum_{k=1}^{K-1} x_k \ln\pi_k + \Big(1 - \sum_{k=1}^{K-1} x_k\Big)\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) \Big\}
 = \exp\Big\{ \sum_{k=1}^{K-1} x_k \ln\frac{\pi_k}{1 - \sum_{j=1}^{K-1}\pi_j} + \ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) \Big\}

Exponential family representation:

\eta = \big[ \ln(\pi_k / \pi_K) ;\; 0 \big], \quad \text{where } \pi_K = 1 - \sum_{k=1}^{K-1}\pi_k
T(\mathbf{x}) = [\mathbf{x}]
A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big( \sum_{k=1}^{K} e^{\eta_k} \Big)
h(\mathbf{x}) = 1
Why exponential family? Moment generating property
\frac{dA}{d\eta} = \frac{d}{d\eta}\log Z(\eta)
 = \frac{1}{Z(\eta)}\frac{d}{d\eta}Z(\eta)
 = \frac{1}{Z(\eta)}\frac{d}{d\eta}\int h(x)\exp\{\eta^T T(x)\}\,dx
 = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx
 = E[T(x)]

\frac{d^2A}{d\eta^2} = \int T(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\Big(T(x) - \frac{dA}{d\eta}\Big)\,dx
 = \int T^2(x)\,h(x)\frac{\exp\{\eta^T T(x)\}}{Z(\eta)}\,dx - E[T(x)]\frac{dA}{d\eta}
 = E[T^2(x)] - E[T(x)]^2
 = \mathrm{Var}[T(x)]
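A quick numerical check of these two identities for the Bernoulli case is sketched below; the choice of η and the finite-difference step are arbitrary.

```python
import numpy as np

# Numerical check of the moment-generating property for a Bernoulli:
# A(eta) = log(1 + e^eta), T(x) = x, so A'(eta) = mu and A''(eta) = mu*(1-mu).
A = lambda eta: np.log1p(np.exp(eta))
eta, h = 0.7, 1e-4

dA = (A(eta + h) - A(eta - h)) / (2 * h)              # first derivative (central difference)
d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2   # second derivative

mu = 1 / (1 + np.exp(-eta))                           # E[T(x)] = E[x]
print(dA, mu)                 # approximately equal
print(d2A, mu * (1 - mu))     # approximately equal (Var[x])
```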
Moment estimation

We can easily compute moments of any exponential family distribution by taking derivatives of the log normalizer A(\eta). The q-th derivative gives the q-th centered moment:

\frac{dA(\eta)}{d\eta} = \mu \quad \text{(mean)}
\frac{d^2A(\eta)}{d\eta^2} = \sigma^2 \quad \text{(variance)}

When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
Moment vs canonical parameters

The moment parameter \mu can be derived from the natural (canonical) parameter:

\mu \stackrel{\text{def}}{=} E[T(x)] = \frac{dA(\eta)}{d\eta}

A(\eta) is convex since

\frac{d^2A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] \ge 0

[Figure: A(\eta) plotted as a convex function of \eta]

Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

\eta \stackrel{\text{def}}{=} \psi(\mu)

A distribution in the exponential family can be parameterized not only by the canonical parameterization, but also by the moment parameterization.
MLE for Exponential Family

For iid data, the log-likelihood is

\ell(\eta; D) = \log\prod_n h(x_n)\exp\{\eta^T T(x_n) - A(\eta)\}
 = \sum_n \log h(x_n) + \eta^T\Big(\sum_n T(x_n)\Big) - N A(\eta)

Take the derivative and set it to zero:

\frac{\partial\ell}{\partial\eta} = \sum_n T(x_n) - N\frac{\partial A(\eta)}{\partial\eta} = 0
\;\Rightarrow\; \frac{\partial A(\eta)}{\partial\eta} = \frac{1}{N}\sum_n T(x_n)
\;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T(x_n)

This amounts to moment matching. We can infer the canonical parameters using

\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})
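The sketch below carries out this moment matching for the Poisson case, where T(x) = x, so the MLE of the mean is just the sample mean and the natural parameter is recovered with η = log μ; the simulated data and rate are assumptions.

```python
import numpy as np

# Sketch: MLE in an exponential family = moment matching.
# For Poisson, T(x) = x and A(eta) = e^eta, so mu = e^eta and
#   mu_MLE = (1/N) sum_n x_n,   eta_MLE = psi(mu_MLE) = log(mu_MLE).
rng = np.random.default_rng(0)
data = rng.poisson(lam=3.5, size=10_000)   # simulated observations (rate assumed)

mu_mle = data.mean()                       # matches the empirical E[T(x)]
eta_mle = np.log(mu_mle)                   # invert the mean map
print(mu_mle, eta_mle, np.log(3.5))        # eta_mle is close to log(3.5)
```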
Sufficiency

For p(x \mid \theta), T(x) is sufficient for \theta if there is no information in X regarding \theta beyond that in T(x). We can throw away X for the purpose of inference w.r.t. \theta.

Bayesian view: \theta \perp X \mid T(x), i.e.,
p(\theta \mid T(x), x) = p(\theta \mid T(x))

Frequentist view: X \perp \theta \mid T(x), i.e.,
p(x \mid T(x), \theta) = p(x \mid T(x))

The Neyman factorization theorem: T(x) is sufficient for \theta if

p(x, T(x), \theta) = \psi_1(T(x), \theta)\,\psi_2(x, T(x))
\;\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\,h(x, T(x))
Examples

Gaussian:
\eta = \big[\Sigma^{-1}\mu ;\; -\tfrac{1}{2}\mathrm{vec}(\Sigma^{-1})\big]
T(\mathbf{x}) = [\mathbf{x} ;\; \mathrm{vec}(\mathbf{x}\mathbf{x}^T)]
A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \log|\Sigma|^{1/2}
h(\mathbf{x}) = (2\pi)^{-k/2}
\Rightarrow \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(\mathbf{x}_n) = \frac{1}{N}\sum_n \mathbf{x}_n

Multinomial:
\eta = \big[\ln(\pi_k/\pi_K) ;\; 0\big]
T(\mathbf{x}) = [\mathbf{x}]
A(\eta) = -\ln\Big(1 - \sum_{k=1}^{K-1}\pi_k\Big) = \ln\Big(\sum_{k=1}^{K} e^{\eta_k}\Big)
h(\mathbf{x}) = 1
\Rightarrow \hat{\pi}_{MLE} = \frac{1}{N}\sum_n \mathbf{x}_n

Poisson:
\eta = \log\lambda
T(x) = x
A(\eta) = \lambda = e^{\eta}
h(x) = \frac{1}{x!}
\Rightarrow \hat{\lambda}_{MLE} = \frac{1}{N}\sum_n x_n
Bayesian est.
Generalized Linear Models (GLIMs)

The graphical model:
[Figure: a plate model with X_n -> Y_n, repeated for n = 1..N]

Linear regression
Discriminative linear classification
Commonality: model E_p(Y) = \mu = f(\theta^T X)
  What is p()? the conditional distribution of Y.
  What is f()? the response function.

GLIM:
  The observed input x is assumed to enter into the model via a linear combination of its elements, \xi = \theta^T\mathbf{x}.
  The conditional mean \mu is represented as a function f(\xi) of \xi, where f is known as the response function.
  The observed output y is assumed to be characterized by an exponential family distribution with conditional mean \mu.
GLIM, cont.

[Figure: x -> \xi = \theta^T x -> f -> \mu -> EXP -> y]

The conditional distribution of y is an exponential family with natural parameter \eta(x):

p(y \mid \eta) = h(y)\exp\{ \eta(x)^T y - A(\eta) \}
(or, with a scale parameter \phi: p(y \mid \eta, \phi) = h(y, \phi)\exp\{ \phi^{-1}(\eta(x)^T y - A(\eta)) \})

The choice of exp family is constrained by the nature of the data Y.
  Example: y is a continuous vector: multivariate Gaussian; y is a class label: Bernoulli or multinomial.
The choice of the response function:
  Following some mild constraints, e.g., range in [0, 1], positivity, ...
  Canonical response function: f = \psi^{-1}.
  In this case \theta^T x directly corresponds to the canonical parameter \eta.
Example canonical response functions
MLE for GLIMs with canonical response

Log-likelihood:

\ell = \sum_n \log h(y_n) + \sum_n \big( \theta^T\mathbf{x}_n y_n - A(\eta_n) \big)

Derivative of the log-likelihood:

\frac{d\ell}{d\theta} = \sum_n \Big( \mathbf{x}_n y_n - \frac{dA(\eta_n)}{d\eta_n}\frac{d\eta_n}{d\theta} \Big)
 = \sum_n (y_n - \mu_n)\mathbf{x}_n
 = X^T(\mathbf{y} - \mu)

This is a fixed-point function because \mu is a function of \theta.

Online learning for canonical GLIMs
Stochastic gradient ascent = least mean squares (LMS) algorithm:

\theta^{t+1} = \theta^t + \rho\,(y_n - \mu_n^t)\,\mathbf{x}_n

where \mu_n^t = f\big((\theta^t)^T\mathbf{x}_n\big) (simply (\theta^t)^T\mathbf{x}_n for linear regression) and \rho is a step size.
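A minimal sketch of this online update for one canonical GLIM, the logistic (Bernoulli) case, is shown below; the synthetic data, response function, and step size are illustrative choices rather than part of the slide.

```python
import numpy as np

# Sketch: online learning for a canonical GLIM,
#   theta <- theta + rho * (y_n - mu_n) * x_n,  with mu_n = f(theta^T x_n).
# Here f is the logistic response (Bernoulli output); data are synthetic.
rng = np.random.default_rng(0)
N, d = 2000, 3
X = rng.normal(size=(N, d))
theta_true = np.array([2.0, -1.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(d)
rho = 0.1                                    # step size (assumed)
for n in range(N):
    mu_n = 1 / (1 + np.exp(-theta @ X[n]))   # canonical response for the Bernoulli case
    theta += rho * (y[n] - mu_n) * X[n]
print(theta)                                 # noisy estimate, roughly aligned with theta_true
```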
Batch learning for canonical GLIMs

The Hessian matrix:

H = \frac{d^2\ell}{d\theta\,d\theta^T} = \frac{d}{d\theta^T}\sum_n (y_n - \mu_n)\mathbf{x}_n
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\theta^T}
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\eta_n}\frac{d\eta_n}{d\theta^T}
 = -\sum_n \mathbf{x}_n\frac{d\mu_n}{d\eta_n}\mathbf{x}_n^T \quad (\text{since } \eta_n = \theta^T\mathbf{x}_n)
 = -X^T W X

where X = [\mathbf{x}_1^T; \mathbf{x}_2^T; \ldots; \mathbf{x}_N^T] is the design matrix, \mathbf{y} = [y_1; y_2; \ldots; y_N], and

W = \mathrm{diag}\Big( \frac{d\mu_1}{d\eta_1}, \ldots, \frac{d\mu_N}{d\eta_N} \Big)

which can be computed by calculating the 2nd derivative of A(\eta_n).
Recall LMS

Cost function in matrix form:

J(\theta) = \frac{1}{2}\sum_{i=1}^n \big(\mathbf{x}_i^T\theta - y_i\big)^2 = \frac{1}{2}\big(X\theta - \vec{y}\big)^T\big(X\theta - \vec{y}\big)

where X = [\mathbf{x}_1^T; \mathbf{x}_2^T; \ldots; \mathbf{x}_n^T] and \vec{y} = [y_1; y_2; \ldots; y_n].

To minimize J(\theta), take the derivative and set it to zero:

\nabla_\theta J = \frac{1}{2}\nabla_\theta\,\mathrm{tr}\big( \theta^T X^T X\theta - \theta^T X^T\vec{y} - \vec{y}^T X\theta + \vec{y}^T\vec{y} \big)
 = \frac{1}{2}\big( X^T X\theta + X^T X\theta - 2X^T\vec{y} \big)
 = X^T X\theta - X^T\vec{y} = 0

\Rightarrow X^T X\theta = X^T\vec{y} \quad \text{(the normal equations)}
\Rightarrow \theta^* = (X^T X)^{-1}X^T\vec{y}
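A direct sketch of solving the normal equations on synthetic data follows; in practice one would typically call a least-squares solver rather than form the explicit inverse, as the second line shows.

```python
import numpy as np

# Sketch: solve the normal equations X^T X theta = X^T y (synthetic data).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, -1.0]) + 0.05 * rng.normal(size=50)

theta_star = np.linalg.solve(X.T @ X, X.T @ y)        # solves the normal equations
theta_lstsq = np.linalg.lstsq(X, y, rcond=None)[0]    # numerically preferred route
print(theta_star, theta_lstsq)                        # essentially the same
```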
Iteratively Reweighted Least Squares (IRLS)

Recall Newton-Raphson methods with cost function J:

\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta J

We now have:

\nabla_\theta\ell = X^T(\mathbf{y} - \mu)
H = -X^T W X

Now:

\theta^{t+1} = \theta^t - H^{-1}\nabla_\theta\ell
 = \theta^t + (X^T W^t X)^{-1} X^T(\mathbf{y} - \mu^t)
 = (X^T W^t X)^{-1}\big[ X^T W^t X\theta^t + X^T(\mathbf{y} - \mu^t) \big]
 = (X^T W^t X)^{-1} X^T W^t \mathbf{z}^t

where the adjusted response is

\mathbf{z}^t = X\theta^t + (W^t)^{-1}(\mathbf{y} - \mu^t)

This can be understood as solving the following "iteratively reweighted least squares" problem (cf. \theta^* = (X^T X)^{-1}X^T\mathbf{y} for ordinary least squares):

\theta^{t+1} = \arg\min_\theta\, (\mathbf{z} - X\theta)^T W (\mathbf{z} - X\theta)
Example 1: logistic regression (sigmoid classifier)

The conditional distribution: a Bernoulli

p(y \mid x) = \mu(x)^y\big(1 - \mu(x)\big)^{1-y}

where \mu(x) is a logistic function

\mu(x) = \frac{1}{1 + e^{-\eta(x)}}

p(y|x) is an exponential family function, with mean:

E[y \mid x] = \mu = \frac{1}{1 + e^{-\theta^T x}}

and canonical response function

\eta = \xi = \theta^T x

IRLS:

\frac{d\mu}{d\eta} = \mu(1 - \mu)
W = \mathrm{diag}\big( \mu_1(1 - \mu_1), \ldots, \mu_N(1 - \mu_N) \big)
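Putting the last two slides together, here is a compact sketch of the IRLS iteration θ ← (XᵀWX)⁻¹XᵀWz with W = diag(μ(1−μ)) for logistic regression; the synthetic data, the iteration count, and the tiny ridge term added for numerical stability are assumptions.

```python
import numpy as np

# Sketch of IRLS for logistic regression:
#   mu = sigmoid(X theta),  W = diag(mu * (1 - mu)),
#   z = X theta + W^{-1}(y - mu),  theta <- (X^T W X)^{-1} X^T W z.
rng = np.random.default_rng(0)
N, d = 500, 3
X = rng.normal(size=(N, d))
theta_true = np.array([1.5, -2.0, 0.5])
y = (rng.random(N) < 1 / (1 + np.exp(-X @ theta_true))).astype(float)

theta = np.zeros(d)
for _ in range(10):                       # a handful of Newton/IRLS steps
    mu = 1 / (1 + np.exp(-X @ theta))
    w = mu * (1 - mu)
    z = X @ theta + (y - mu) / w          # adjusted response
    XtW = X.T * w                         # X^T W without forming diag(w)
    theta = np.linalg.solve(XtW @ X + 1e-8 * np.eye(d), XtW @ z)
print(theta)                              # close to theta_true
```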
Logistic regression: practical issues

It is very common to use regularized maximum likelihood:

p(y = \pm 1 \mid x, \theta) = \sigma(y\,\theta^T x) = \frac{1}{1 + e^{-y\,\theta^T x}}
p(\theta) = \mathrm{Normal}(0, \lambda^{-1}I)
\ell(\theta) = \sum_n \log\sigma(y_n\,\theta^T\mathbf{x}_n) - \frac{\lambda}{2}\,\theta^T\theta

IRLS takes O(Nd^3) per iteration, where N = number of training cases and d = dimension of input x.
Quasi-Newton methods, which approximate the Hessian, work faster.
Conjugate gradient takes O(Nd) per iteration, and usually works best in practice.
Stochastic gradient descent can also be used if N is large; cf. the perceptron rule, with per-example gradient

\nabla\ell_n = \big(1 - \sigma(y_n\,\theta^T\mathbf{x}_n)\big)\,y_n\mathbf{x}_n
Example 2: linear regression

The conditional distribution: a Gaussian

p(y \mid \mathbf{x}, \theta, \Sigma) = \frac{1}{(2\pi)^{k/2}|\Sigma|^{1/2}} \exp\Big\{ -\tfrac{1}{2}\big(y - \mu(\mathbf{x})\big)^T\Sigma^{-1}\big(y - \mu(\mathbf{x})\big) \Big\}
 = h(y)\exp\big\{ \eta(\mathbf{x})^T y - A(\eta) \big\}

where \mu(x) is a linear function

\mu(\mathbf{x}) = \theta^T\mathbf{x}

p(y|x) is an exponential family function, with mean:

E[y \mid \mathbf{x}] = \mu = \theta^T\mathbf{x}

and canonical response function

\eta = \xi = \theta^T\mathbf{x}

IRLS:

\frac{d\mu}{d\eta} = 1, \qquad W = I

\theta^{t+1} = (X^T W X)^{-1} X^T W\mathbf{z}^t
 = (X^T X)^{-1} X^T\big( X\theta^t + \mathbf{y} - \mu^t \big)
 = \theta^t + (X^T X)^{-1} X^T(\mathbf{y} - \mu^t) \quad \text{(steepest-descent direction, rescaled by } (X^T X)^{-1}\text{)}
 = (X^T X)^{-1} X^T\mathbf{y} \quad \text{(the normal equation, reached in a single step)}
Simple GMs are the building blocks of complex BNs

Classification: generative and discriminative approaches
[Figure: Q -> X and Q <- X]

Regression: linear, conditional mixture, nonparametric
[Figure: X -> Y]

Density estimation: parametric and nonparametric methods
[Figure: a single node X with parameters mu, sigma]
An (incomplete) genealogy of graphical models

The structures of most GMs (e.g., all listed here) are not learned from data, but designed by humans.

But such designs are useful and indeed favored because they put human knowledge to good use ...
MLE for general BNs

If we assume the parameters for each CPD (a GLIM) are globally independent, and all nodes are fully observed, then the log-likelihood function decomposes into a sum of local terms, one per node:

\ell(\theta; D) = \log p(D \mid \theta) = \log\prod_n\prod_i p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) = \sum_i\Big( \sum_n \log p(x_{n,i} \mid \mathbf{x}_{n,\pi_i}, \theta_i) \Big)

Therefore, MLE-based parameter estimation of a GM reduces to local estimation of each GLIM.

[Figure: each local CPD is estimated from its own counts, e.g., conditioning on X_2 = 0/1 and X_5 = 0/1]
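For fully observed discrete CPDs, this decomposition means each local multinomial is estimated from normalized counts, as sketched below on a made-up two-node network A -> B.

```python
from collections import Counter

# Sketch: because the log-likelihood decomposes per node, MLE for a fully
# observed discrete BN reduces to normalized counts per CPD.
# Tiny made-up dataset over the network A -> B.
data = [("a0", "b0"), ("a0", "b1"), ("a0", "b1"),
        ("a1", "b1"), ("a1", "b1"), ("a1", "b0")]

# Local estimate for P(A): counts of A values, normalized.
count_a = Counter(a for a, _ in data)
P_A = {a: c / len(data) for a, c in count_a.items()}

# Local estimate for P(B | A): counts of (a, b), normalized within each a.
count_ab = Counter(data)
P_B_given_A = {(a, b): count_ab[(a, b)] / count_a[a] for (a, b) in count_ab}

print(P_A)           # {'a0': 0.5, 'a1': 0.5}
print(P_B_given_A)   # e.g. P(b1 | a0) = 2/3
```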
[Figure: the alarm network: Burglary -> Alarm <- Earthquake, Earthquake -> Radio, Alarm -> Call]

Factorization:

p(\mathbf{X}) = \prod_{i=1}^{M} p(x_i \mid \mathbf{x}_{\pi_i})

Local distributions defined by, e.g., multinomial parameters:

\theta_{x_i^k \mid \mathbf{x}_{\pi_i}^j} = p(x_i^k \mid \mathbf{x}_{\pi_i}^j)

How to define the parameter prior p(\theta \mid G)?

Assumptions (Geiger & Heckerman 97, 99):
  Complete Model Equivalence
  Global Parameter Independence
  Local Parameter Independence
  Likelihood and Prior Modularity
Global & Local Parameter Independence

Global Parameter Independence: for every DAG model,

p(\theta_m \mid G) = \prod_{i=1}^{M} p(\theta_i \mid G)

Local Parameter Independence: for every node,

p(\theta_i \mid G) = \prod_{j=1}^{q_i} p(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j} \mid G)

e.g., P(Call \mid Alarm = YES) is independent of P(Call \mid Alarm = NO).

[Figure: the alarm network (Burglary, Earthquake, Alarm, Radio, Call) annotated with its per-node parameters]
Parameter Independence, Graphical View

Provided all variables are observed in all cases, we can perform Bayesian updates on each parameter independently!

[Figure: two samples of (X_1, X_2) sharing parameter nodes \theta_1 (global parameter independence) and \theta_{2|1} (local parameter independence)]
Which PDFs Satisfy Our Assumptions? (Geiger & Heckerman 97,99)
Discrete DAG Models:

x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Multi}(\theta_{x_i \mid \mathbf{x}_{\pi_i}^j})

Dirichlet prior:

P(\theta) = C(\alpha)\prod_k \theta_k^{\alpha_k - 1} = \frac{\Gamma(\sum_k \alpha_k)}{\prod_k \Gamma(\alpha_k)}\prod_k \theta_k^{\alpha_k - 1}

Gaussian DAG Models:

x_i \mid \mathbf{x}_{\pi_i}^j \sim \mathrm{Normal}(\mu, \Sigma)

Normal prior:

p(\mu \mid \nu, \Phi) = (2\pi)^{-n/2}|\Phi|^{-1/2}\exp\Big\{ -\tfrac{1}{2}(\mu - \nu)^T\Phi^{-1}(\mu - \nu) \Big\}

Normal-Wishart prior:

p(\mu \mid W, \nu, \alpha_\mu) = \mathrm{Normal}\big(\nu, (\alpha_\mu W)^{-1}\big)
p(W \mid \alpha_w, T) = c(n, \alpha_w)\,|T|^{\alpha_w/2}\,|W|^{(\alpha_w - n - 1)/2}\exp\Big\{ -\tfrac{1}{2}\mathrm{tr}(TW) \Big\}
where W = \Sigma^{-1}.
Summary: Parameterizing GMs

For exponential family distributions, MLE amounts to moment matching.

GLIM:
  Natural response
  Iteratively Reweighted Least Squares as a general algorithm
GLIMs are building blocks of most GMs in practical use
Parameter independence and appropriate priors