Lecture 2: Generative Learning
Tuo Zhao
Schools of ISYE and CSE, Georgia Tech
CS7641/ISYE/CSE 6740: Machine Learning/Computational Data Analysis
Generative Learning
Modeling Dogs
Modeling Cats
Discriminative Learning
[Figure: generative vs. discriminative views of the data.]
Which One is Better for Classification?
Joint and Posterior Distributions
We consider a binary classification problem:
Feature: X ∈ R^d
Response: Y ∈ {0, 1}
Class Prior: P(Y = 1) = p and P(Y = 0) = 1 − p
Posterior: the conditional probability of Y given X, i.e.,
P(Y | X) = P(Y) P(X | Y) / P(X).
Discriminative Learning
Posterior is sufficient for prediction:
ŷ = argmax_y P(Y = y | X = x)
  = argmax_y P(Y = y) P(X = x | Y = y) / P(X = x)
  = argmax_y P(Y = y) P(X = x | Y = y)
  = argmax_y P(X = x, Y = y)
Which one to model?
Joint Distribution? Conditional Distribution?
Gaussian Discriminant Analysis
Multivariate Gaussian Distribution: X ∼ N(µ,Σ)
Probability Density Function
f(x; µ, Σ) = 1 / ((2π)^{d/2} |Σ|^{1/2}) · exp(−(1/2) (x − µ)^T Σ^{−1} (x − µ))
Expectation: E[X] = µ
Covariance: E[(X − µ)(X − µ)^T] = Σ
Standard Gaussian Distribution: µ = 0 and Σ = I_d.
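As a quick check of the formula, here is a minimal NumPy sketch that evaluates the density; the function name and the test point are illustrative, not from the lecture.

import numpy as np

def gaussian_pdf(x, mu, Sigma):
    # Evaluate the multivariate Gaussian density f(x; mu, Sigma).
    d = mu.shape[0]
    diff = x - mu
    # Solve Sigma^{-1} (x - mu) without forming the explicit inverse.
    quad = diff @ np.linalg.solve(Sigma, diff)
    norm_const = (2 * np.pi) ** (d / 2) * np.linalg.det(Sigma) ** 0.5
    return np.exp(-0.5 * quad) / norm_const

# Standard Gaussian in d = 2: mu = 0, Sigma = I_2.
x = np.array([0.0, 0.0])
print(gaussian_pdf(x, np.zeros(2), np.eye(2)))  # ~ 1 / (2*pi) ~= 0.159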
Gaussian Discriminant Analysis
Multivariate Gaussian Distribution: X ∼ N(µ,Σ)
The covariance can also be defined as Cov(Z) = E[Z Z^T] − (E[Z])(E[Z])^T. (You should be able to prove to yourself that these two definitions are equivalent.) If X ∼ N(µ, Σ), then Cov(X) = Σ.
Here are some examples of what the density of a Gaussian distribution looks like:
[Figure: three surface plots of two-dimensional Gaussian densities.]
The left-most figure shows a Gaussian with mean zero (that is, the 2x1 zero-vector) and covariance matrix Σ = I (the 2x2 identity matrix). A Gaussian with zero mean and identity covariance is also called the standard normal distribution. The middle figure shows the density of a Gaussian with zero mean and Σ = 0.6I; and the rightmost figure shows one with Σ = 2I. We see that as Σ becomes larger, the Gaussian becomes more “spread-out,” and as it becomes smaller, the distribution becomes more “compressed.”
Let's look at some more examples.
[Figure: three surface plots of two-dimensional Gaussian densities with correlated components.]
The figures above show Gaussians with mean 0, and with covariance matrices respectively
Σ = [1 0; 0 1];   Σ = [1 0.5; 0.5 1];   Σ = [1 0.8; 0.8 1].
The leftmost figure shows the familiar standard normal distribution, and we see that as we increase the off-diagonal entry in Σ, the density becomes more “compressed” towards the 45° line (given by x1 = x2). We can see this more clearly when we look at the contours of the same three densities:
[Figure: contour plots of the same three densities.]
Here's one last set of examples generated by varying Σ:
[Figure: three contour plots for covariance matrices with negative correlation and unequal variances.]
The plots above used, respectively,
Σ = [1 −0.5; −0.5 1];   Σ = [1 −0.8; −0.8 1];   Σ = [3 0.8; 0.8 1].
From the leftmost and middle figures, we see that by decreasing the off-diagonal elements of the covariance matrix, the density now becomes “compressed” again, but in the opposite direction. Lastly, as we vary the parameters, more generally the contours will form ellipses (the rightmost figure showing an example).
As our last set of examples, fixing Σ = I, by varying µ, we can also move the mean of the density around.
[Figure: three surface plots of standard-covariance Gaussians with shifted means.]
The figures above were generated using Σ = I, and respectively
µ = [1; 0];   µ = [−0.5; 0];   µ = [−1; −1.5].
Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Y = y) = p^y (1 − p)^{1−y}
P(X | Y = 0) ∼ N(µ0, Σ)
P(X = x | Y = 0) = exp(−(1/2) (x − µ0)^T Σ^{−1} (x − µ0)) / ((2π)^{d/2} |Σ|^{1/2})
P(X | Y = 1) ∼ N(µ1, Σ)
P(X = x | Y = 1) = exp(−(1/2) (x − µ1)^T Σ^{−1} (x − µ1)) / ((2π)^{d/2} |Σ|^{1/2})
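To see the generative view in action, the following sketch draws a small synthetic data set by first sampling Y and then X given Y; the parameter values p, mu0, mu1, Sigma are made-up illustrations.

import numpy as np

rng = np.random.default_rng(0)
p, mu0, mu1 = 0.4, np.array([-1.0, 0.0]), np.array([1.5, 1.0])
Sigma = np.array([[1.0, 0.3], [0.3, 1.0]])     # shared covariance

n = 500
y = rng.binomial(1, p, size=n)                 # Y ~ Bernoulli(p)
means = np.where(y[:, None] == 1, mu1, mu0)    # pick mu_y for each sample
x = np.array([rng.multivariate_normal(m, Sigma) for m in means])  # X | Y ~ N(mu_y, Sigma)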
Generative Learning
Maximum Likelihood Estimation:
L(p, µ0, µ1, Σ) = log ∏_{i=1}^n f(xi, yi; p, µ0, µ1, Σ)
                = log ∏_{i=1}^n h(xi | yi; p, µ0, µ1, Σ) g(yi; p).
By maximizing ℓ with respect to the parameters, we find the maximum likelihood estimate of the parameters (see problem set 1) to be:
φ = (1/m) ∑_{i=1}^m 1{y(i) = 1}
µ0 = ∑_{i=1}^m 1{y(i) = 0} x(i) / ∑_{i=1}^m 1{y(i) = 0}
µ1 = ∑_{i=1}^m 1{y(i) = 1} x(i) / ∑_{i=1}^m 1{y(i) = 1}
Σ = (1/m) ∑_{i=1}^m (x(i) − µ_{y(i)})(x(i) − µ_{y(i)})^T.
Pictorially, what the algorithm is doing can be seen as follows:
[Figure: training data with the contours of the two fitted Gaussians and the linear decision boundary.]
Shown in the figure are the training set, as well as the contours of the two Gaussian distributions that have been fit to the data in each of the two classes. Note that the two Gaussians have contours that are the same shape and orientation, since they share a covariance matrix Σ, but they have different means µ0 and µ1. Also shown in the figure is the straight line giving the decision boundary at which p(y = 1|x) = 0.5. On one side of the boundary, we'll predict y = 1 to be the most likely outcome, and on the other side, we'll predict y = 0.
Discussion: GDA and logistic regression
The GDA model has an interesting relationship to logistic regression. If we view the quantity p(y = 1|x; φ, µ0, µ1, Σ) as a function of x, we'll find that it can be expressed as a logistic function of x.
Generative Learning
Maximum Likelihood Estimation:
L(p, µ0, µ1, Σ) = log ∏_{i=1}^n h(xi | yi; p, µ0, µ1, Σ) g(yi; p).
Convex Minimization
µ0 = ∑_{i=1}^n xi · (1 − yi) / (n − ∑_{i=1}^n yi)   and   µ1 = ∑_{i=1}^n xi · yi / ∑_{i=1}^n yi
p = ∑_{i=1}^n yi / n   and   Σ = (1/n) ∑_{i=1}^n (xi − µ_{yi})(xi − µ_{yi})^T
d(d + 1)/2 + 2d + 1 parameters to estimate.
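These closed-form estimates take only a few lines of NumPy. A minimal sketch, assuming x is an n × d array and y an n-vector of {0, 1} labels (the function name fit_gda is illustrative):

import numpy as np

def fit_gda(x, y):
    # Closed-form MLE for GDA with a shared covariance matrix.
    n, d = x.shape
    p = y.mean()
    mu0 = x[y == 0].mean(axis=0)
    mu1 = x[y == 1].mean(axis=0)
    # Pool the centered samples from both classes for the shared Sigma.
    centered = x - np.where(y[:, None] == 1, mu1, mu0)
    Sigma = centered.T @ centered / n
    return p, mu0, mu1, Sigma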
Gaussian Discriminant Analysis
Prediction: Given X ∈ R^d, we predict
Ŷ = argmax_{Y ∈ {0,1}} P(Y | X; p, µ0, µ1, Σ).
Since we have [Analytical Problem in HW3]
log(P(Y = 1|X) / (1 − P(Y = 1|X))) = −(1/2) (µ1 + µ0)^T Σ^{−1} (µ1 − µ0) + (µ1 − µ0)^T Σ^{−1} X + log(p / (1 − p)),
this is actually a logistic regression model!
But different from maximizing the conditional log likelihood!
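In code, the log-odds above yield a linear classifier. A minimal sketch using fitted parameters (the names p, mu0, mu1, Sigma are placeholders for the estimates from the previous slide):

import numpy as np

def gda_log_odds(x, p, mu0, mu1, Sigma):
    # log P(Y=1|x) - log P(Y=0|x) under the fitted GDA model.
    w = np.linalg.solve(Sigma, mu1 - mu0)               # Sigma^{-1} (mu1 - mu0)
    b = -0.5 * (mu1 + mu0) @ w + np.log(p / (1 - p))    # intercept term
    return x @ w + b                                    # linear in x, as in logistic regression

def gda_predict(x, p, mu0, mu1, Sigma):
    return (gda_log_odds(x, p, mu0, mu1, Sigma) > 0).astype(int)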
GDA v.s. Logistic Regression
Gaussian Discriminant Analysis
Modeling Assumption: Terrible
d(d + 1)/2 + 2d + 1 parameters: Terrible
Simple with a closed form solution: Not very useful!
Logistic Regression
Modeling Assumption: More Robust!
d parameters: Fewer!
Need an iterative optimization algorithm: Not bad!
Naive Bayes Classification
Naive Bayes Gaussian Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(X|Y = 0) ∼ N(µ0,Σ)
P(X|Y = 1) ∼ N(µ1,Σ)
Σ = diag(σ1², σ2², ..., σd²)
Naive Bayes Gaussian Discriminant Analysis
Conditional Independence:
P(X | Y) = ∏_{j=1}^d P(Xj | Y) = P(X1 | Y) P(X2 | Y) · · · P(Xd | Y)
A Simpler Decision Rule:
P(Y = 1 | X) = P(X | Y = 1) P(Y = 1) / P(X)
             = ∏_{j=1}^d P(Xj | Y = 1) P(Y = 1) / [∏_{j=1}^d P(Xj | Y = 1) P(Y = 1) + ∏_{j=1}^d P(Xj | Y = 0) P(Y = 0)]
Naive Bayes Gaussian Discriminant Analysis
Maximum Likelihood Estimation:
µ0 = ∑_{i=1}^n xi · (1 − yi) / (n − ∑_{i=1}^n yi)   and   µ1 = ∑_{i=1}^n xi · yi / ∑_{i=1}^n yi
p = ∑_{i=1}^n yi / n   and   σj² = (1/n) ∑_{i=1}^n (xi,j − µ_{yi,j})²
3d + 1 parameters to estimate.
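A minimal sketch of these estimates, assuming x is n × d and y takes values in {0, 1}; only one pooled variance per feature is estimated, which is where the 3d + 1 count comes from:

import numpy as np

def fit_nb_gda(x, y):
    # Naive Bayes GDA: class means plus one pooled variance per feature.
    p = y.mean()
    mu0 = x[y == 0].mean(axis=0)
    mu1 = x[y == 1].mean(axis=0)
    centered = x - np.where(y[:, None] == 1, mu1, mu0)
    sigma2 = (centered ** 2).mean(axis=0)     # one variance per coordinate
    return p, mu0, mu1, sigma2                # 3d + 1 numbers in total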
Naive Bayes Gaussian Discriminant Analysis
Missing values?
Example: X = (X1, ..., X_{d−1}), i.e., the last feature Xd is missing. Simply drop it from the product:
P(Y = 1 | X) = P(X | Y = 1) P(Y = 1) / P(X)
             = ∏_{j=1}^{d−1} P(Xj | Y = 1) P(Y = 1) / [∏_{j=1}^{d−1} P(Xj | Y = 1) P(Y = 1) + ∏_{j=1}^{d−1} P(Xj | Y = 0) P(Y = 0)]
GDA v.s. Naive Bayes GDA
Gaussian Discriminant Analysis
Stronger Modeling Assumption: Terrible
d(d + 1)/2 + 2d + 1 parameters: Terrible
A simple closed form solution: Not very useful!
Naive Bayes GDA
Even Stronger Modeling Assumption: Terrible!
3d + 1 parameters: Good!
A super simple closed form solution: Useful sometimes!
Naive Bayes Bernoulli Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Xj | Y = 0) ∼ Bernoulli(γj^(0))
P(Xj | Y = 1) ∼ Bernoulli(γj^(1))
Conditional Independence:
P(X | Y) = ∏_{j=1}^d P(Xj | Y) = P(X1 | Y) P(X2 | Y) · · · P(Xd | Y)
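For binary features, the γ's are simply class-conditional frequencies of each feature being 1. A sketch, with a small Laplace-smoothing constant alpha added to avoid zero probabilities (the smoothing is an assumption, not shown on the slide):

import numpy as np

def fit_bernoulli_nb(x, y, alpha=1.0):
    # x: n x d binary matrix, y: labels in {0, 1}, alpha: Laplace smoothing.
    p = y.mean()
    gamma0 = (x[y == 0].sum(axis=0) + alpha) / ((y == 0).sum() + 2 * alpha)
    gamma1 = (x[y == 1].sum(axis=0) + alpha) / ((y == 1).sum() + 2 * alpha)
    return p, gamma0, gamma1

def bernoulli_nb_log_odds(x, p, gamma0, gamma1):
    # Sum of per-feature log likelihood ratios plus the prior log-odds.
    ll1 = x @ np.log(gamma1) + (1 - x) @ np.log(1 - gamma1)
    ll0 = x @ np.log(gamma0) + (1 - x) @ np.log(1 - gamma0)
    return ll1 - ll0 + np.log(p / (1 - p))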
Naive Bayes Poisson Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Xj | Y = 0) ∼ Poisson(λj^(0))
P(Xj | Y = 1) ∼ Poisson(λj^(1))
Conditional Independence:
P(X | Y) = ∏_{j=1}^d P(Xj | Y) = P(X1 | Y) P(X2 | Y) · · · P(Xd | Y)
Example: Spam Email Classification
Data Set:
4601 email messages
Goal: predict whether an email message is spam or good.
Features: the frequencies in a message of 48 of the most commonly occurring words in all these email messages.
We coded spam as 1 and email as 0.
Example: Spam Email Classification
Transforming Features:
Naive Bayes GDA:
Relative Frequency of “free” = (# of “free” in this email) / (# of all words in this email)
Naive Bayes Bernoulli DA:
Indicator of “free” = 1 if “free” appears in this email
Naive Bayes Poisson DA: No transformation needed
Coding Problem in HW2.
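A sketch of the three encodings starting from a raw count matrix; counts is a hypothetical n × 48 array of word counts per message, and the relative-frequency denominator is approximated here by the total count of the tracked words:

import numpy as np

counts = np.random.default_rng(0).poisson(0.5, size=(10, 48))  # stand-in word counts
total_words = counts.sum(axis=1, keepdims=True)

rel_freq = counts / np.maximum(total_words, 1)   # Naive Bayes GDA: relative frequencies
indicator = (counts > 0).astype(int)             # Naive Bayes Bernoulli DA: word present or not
raw = counts                                     # Naive Bayes Poisson DA: raw counts, no transformation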
Multiclass Fisher Discriminant Analysis
Revisiting GDA
A Dimensionality Reduction Perspective:
Between-Class Scatter Matrix:
Γ = ∑_{k=0,1} (nk/n) (µk − µ)(µk − µ)^T,
where
µ = (1/n) ∑_{i=1}^n xi,   n1 = ∑_{i=1}^n yi   and   n0 = n − n1.
Rayleigh Quotient Formulation
w = argmax_w (w^T Γ w) / (w^T Σ w) = argmax_w w^T Γ w   s.t.   w^T Σ w = 1.
FDA and Dimension Reduction
Multiclass Fisher Discriminant Analysis
Generative View:
Y ∼ Discrete(p1, ..., pm) with ∑_{k=1}^m pk = 1
P(X | Y = k) ∼ N(µk, Σ)
Between-Class Scatter Matrix:
Γ = (1/m) ∑_{k=1}^m (nk/n) (µk − µ)(µk − µ)^T with nk = ∑_{i=1}^n 1(yi = k).
Rayleigh Quotient Formulation
W = argmax_{W ∈ R^{d×r}} trace(W^T Γ W)   s.t.   W^T Σ W = I_r.
Eigenvalue Problem (Rank-1)
Rayleigh Quotient Formulation
w = argmax_{w ∈ R^d} w^T A w   s.t.   w^T w = 1.
Lagrangian Multiplier Method: λ ∈ R
L(w, λ) = w^T A w − λ (w^T w − 1).
We only need eigenvectors of A, since
∇_w L(w, λ) = 2Aw − 2λw = 0,
∇_λ L(w, λ) = w^T w − 1 = 0.
Eigenvalue Problem (Rank-r)
Rayleigh Quotient Formulation
U = argmax_{U ∈ R^{d×r}} trace(U^T A U)   s.t.   U^T U = I_r.
Lagrangian Multiplier Method: Λ ∈ R^{r×r} and Λ = Λ^T
L(U, Λ) = trace(U^T A U) − trace(Λ^T (U^T U − I_r))
We only need eigenvectors of A, since
∇_U L(U, Λ) = 2AU − 2UΛ = 0,
∇_Λ L(U, Λ) = U^T U − I_r = 0.
Generalized Eigenvalue Problem
Rayleigh Quotient Formulation
W = argmax_{W ∈ R^{d×r}} trace(W^T Γ W)   s.t.   W^T Σ W = I_r.
Replace U = Σ^{1/2} W:
U = argmax_{U ∈ R^{d×r}} trace(U^T A U)   s.t.   U^T U = I_r,
where A = Σ^{−1/2} Γ Σ^{−1/2}.
Eigenvalue Problem!
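Computationally, the substitution suggests the following sketch: form Σ^{−1/2}, take the top-r eigenvectors of A, and map back. Gamma and Sigma are assumed to be the scatter and covariance matrices already estimated from data, with Σ positive definite:

import numpy as np

def fda_directions(Gamma, Sigma, r):
    # Top-r directions W solving max trace(W^T Gamma W) s.t. W^T Sigma W = I_r.
    # Sigma^{-1/2} via the eigendecomposition of the symmetric positive definite Sigma.
    evals, evecs = np.linalg.eigh(Sigma)
    Sigma_inv_half = evecs @ np.diag(1.0 / np.sqrt(evals)) @ evecs.T
    A = Sigma_inv_half @ Gamma @ Sigma_inv_half
    # Top-r eigenvectors of A (eigh returns eigenvalues in ascending order).
    a_vals, a_vecs = np.linalg.eigh(A)
    U = a_vecs[:, ::-1][:, :r]
    # Map back: W = Sigma^{-1/2} U satisfies W^T Sigma W = I_r.
    return Sigma_inv_half @ U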
Eigenvalue Problem
Power Iteration:
U^{(t+1)} = QR(Θ U^{(t)})
When r = 1, we have
u^{(t+1)} = Θ u^{(t)} / ||Θ u^{(t)}||_2,
where Θ = Σ^{−1/2} Γ Σ^{−1/2}.
We need T = O(gap · log(1/ε)) iterations to guarantee
|u^T u^{(T)}| ≥ 1 − ε, where gap = λ1/(λ1 − λ2) and u is the leading eigenvector of Θ.
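A sketch of the r = 1 power iteration and its rank-r variant with a QR step, for a symmetric matrix Theta:

import numpy as np

def power_iteration(Theta, T=100, seed=0):
    # Leading eigenvector of a symmetric matrix Theta by power iteration.
    rng = np.random.default_rng(seed)
    u = rng.standard_normal(Theta.shape[0])
    u /= np.linalg.norm(u)
    for _ in range(T):
        v = Theta @ u
        u = v / np.linalg.norm(v)    # renormalize at every step
    return u

def orthogonal_iteration(Theta, r, T=100, seed=0):
    # Top-r invariant subspace via repeated QR of Theta @ U.
    rng = np.random.default_rng(seed)
    U = np.linalg.qr(rng.standard_normal((Theta.shape[0], r)))[0]
    for _ in range(T):
        U, _ = np.linalg.qr(Theta @ U)
    return U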
Quadratic Discriminant Analysis
Generative View:
Y ∼ Bernoulli(p)
P(Y = y) = p^y (1 − p)^{1−y}
P(X | Y = 0) ∼ N(µ0, Σ0)
P(X = x | Y = 0) = exp(−(1/2) (x − µ0)^T Σ0^{−1} (x − µ0)) / ((2π)^{d/2} |Σ0|^{1/2})
P(X | Y = 1) ∼ N(µ1, Σ1)
P(X = x | Y = 1) = exp(−(1/2) (x − µ1)^T Σ1^{−1} (x − µ1)) / ((2π)^{d/2} |Σ1|^{1/2})
Quadratic Discriminant Analysis
Maximum Likelihood Estimation:
L(p, µ0, µ1, Σ0, Σ1) = log ∏_{i=1}^n h(xi | yi; p, µ0, µ1, Σ0, Σ1) g(yi; p).
Convex Minimization
µ0 = ∑_{i=1}^n xi · (1 − yi) / (n − ∑_{i=1}^n yi)   and   µ1 = ∑_{i=1}^n xi · yi / ∑_{i=1}^n yi
p = ∑_{i=1}^n yi / n   and   Σk = (1/nk) ∑_{yi=k} (xi − µk)(xi − µk)^T
d(d + 1) + 2d + 1 parameters to estimate.
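The resulting log posterior ratio is quadratic in x because Σ0 ≠ Σ1. A minimal sketch using fitted parameters; the (2π)^{d/2} factors cancel in the ratio, so only the quadratic forms and log-determinants are needed:

import numpy as np

def qda_log_odds(x, p, mu0, mu1, Sigma0, Sigma1):
    # log P(Y=1|x) - log P(Y=0|x); quadratic in x since Sigma0 != Sigma1.
    def log_gauss(x, mu, Sigma):
        diff = x - mu
        _, logdet = np.linalg.slogdet(Sigma)
        return -0.5 * (diff @ np.linalg.solve(Sigma, diff) + logdet)
    return (log_gauss(x, mu1, Sigma1) - log_gauss(x, mu0, Sigma0)
            + np.log(p / (1 - p)))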
GDA v.s. QDA
Gaussian Discriminant Analysis
Stronger Modeling Assumption: Terrible
d(d + 1)/2 + 2d + 1 parameters: Terrible
A simple closed form solution: Not very useful!
Quadratic Discriminant Analysis
Weaker Modeling Assumption: Still Terrible!
d(d + 1) + 2d + 1 parameters: More Terrible!
A simple closed form solution: Not very useful!
Multiclass Classification
K-Nearest Neighbor Classification
Very intuitive....
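A minimal sketch of the K-NN rule with Euclidean distance; x_train, y_train, and x_query are assumed arrays of shapes (n, d), (n,), and (d,):

import numpy as np

def knn_predict(x_train, y_train, x_query, k=5):
    # Majority label among the k nearest training points (Euclidean distance).
    dists = np.linalg.norm(x_train - x_query, axis=1)
    nearest = np.argsort(dists)[:k]
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]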
Model Complexity?
More flexible for larger K’s?
Not really!
Curse of Dimensionality
Local Linear Regression
Build linear regression models using ONLY neighbors
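A sketch of this idea for a single query point: select the k nearest neighbors and fit ordinary least squares on them only (k and the variable names are illustrative):

import numpy as np

def local_linear_predict(x_train, y_train, x_query, k=20):
    # Fit least squares on the k nearest neighbors of x_query only.
    dists = np.linalg.norm(x_train - x_query, axis=1)
    idx = np.argsort(dists)[:k]
    X = np.hstack([np.ones((k, 1)), x_train[idx]])       # add an intercept column
    beta, *_ = np.linalg.lstsq(X, y_train[idx], rcond=None)
    return np.concatenate([[1.0], x_query]) @ beta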
Local Logistic Regression
Build logistic regression models using ONLY neighbors
Multinomial Regression
Given x1, ..., xn ∈ R^d, y1, ..., yn ∈ {1, 2, ..., m}, and θ1*, ..., θ_{m−1}* ∈ R^d, for k = 1, ..., m − 1 and i = 1, ..., n,
P(yi = k) = exp(−xi^T θk*) / (1 + ∑_{l=1}^{m−1} exp(−xi^T θl*)),
P(yi = m) = 1 / (1 + ∑_{l=1}^{m−1} exp(−xi^T θl*)).
Maximum Likelihood Estimation: Still a convex problem.
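A sketch of these class probabilities for a single input x, with the m − 1 parameter vectors stored as the columns of a d × (m − 1) matrix Theta (an illustrative layout, not specified on the slide):

import numpy as np

def multinomial_probs(x, Theta):
    # Class probabilities P(y = 1), ..., P(y = m) for one input x.
    # Theta: d x (m-1) matrix whose k-th column is theta_k*.
    scores = np.exp(-x @ Theta)              # exp(-x^T theta_k), k = 1, ..., m-1
    denom = 1.0 + scores.sum()
    return np.append(scores, 1.0) / denom    # last entry is the reference class m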