CSE 575: Statistical Machine Learning
Jingrui He CIDSE, ASU
Instance-based Learning
1-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? One
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the same output as the nearest neighbor.
Consistency of 1-NN
Consider an estimator f_n trained on n examples, e.g., 1-NN, regression, ...
The estimator is consistent if the true error goes to zero as the amount of data increases; e.g., for noise-free data, consistent if the true error of f_n goes to 0 as n goes to infinity.
Regression is not consistent! (Representation bias)
1-NN is consistent (under some mild fine print).
What about variance???
1-NN overfits?
k-Nearest Neighbor
Four things make a memory-based learner:
1. A distance metric: Euclidean (and many more)
2. How many nearby neighbors to look at? k
3. A weighting function (optional): Unused
4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
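A minimal numpy sketch of these four choices (illustrative only; the function and variable names are mine, not from the course). With k = 1 it reduces to the 1-NN learner above; for classification one would take a majority vote instead of the mean:

    import numpy as np

    def knn_predict(X_train, y_train, x_query, k=1):
        # 1. Distance metric: Euclidean distance to every stored training point
        dists = np.linalg.norm(X_train - x_query, axis=1)
        # 2. Neighbors: indices of the k nearest points
        nearest = np.argsort(dists)[:k]
        # 3. Weighting function: unused (all k neighbors count equally)
        # 4. Fit with the local points: predict the average of their outputs
        return y_train[nearest].mean()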
k-Nearest Neighbor (here k=9)
K-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?
Curse of dimensionality for instance-based learning
Must store and retrieve all data! Most real work is done at testing time: for every test sample, we must search through the whole dataset; very slow!
There are fast methods for dealing with large datasets, e.g., tree-based methods, hashing methods, ... (a small example follows below).
Instance-based learning is often poor with noisy or irrelevant features.
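For instance, a k-d tree answers nearest-neighbor queries without scanning every stored point; a small sketch using SciPy on random, purely illustrative data:

    import numpy as np
    from scipy.spatial import cKDTree

    X_train = np.random.rand(10000, 5)     # 10,000 stored examples in 5 dimensions
    tree = cKDTree(X_train)                # build the tree once, offline

    x_query = np.random.rand(5)
    dists, idx = tree.query(x_query, k=3)  # distances and indices of the 3 nearest neighbors

In very high dimensions such structures lose much of their advantage, which is one more face of the curse of dimensionality.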
Support Vector Machines
Linear classifiers: Which line is better?
w.x = Σ_j w^(j) x^(j)
Data: pairs (x_i, y_i), i = 1, ..., n
Example i: x_i = (x_i^(1), ..., x_i^(m)), with label y_i in {+1, -1}
Pick the one with the largest margin!
w.x + b = 0
Maximize the margin
w.x + b = 0
But there are many planes
w.x + b = 0
Review: Normal to a plane
w.x + b = 0
Normalized margin: Canonical hyperplanes
[Figure: separating hyperplane w.x + b = 0 with canonical hyperplanes w.x + b = +1 and w.x + b = -1; x+ and x- are the closest points on either side; margin 2γ]
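(The algebra on this figure did not survive extraction; the standard reconstruction: x+ = x- + 2γ w/||w||, and since w.x+ + b = +1 and w.x- + b = -1, subtracting gives w.(x+ - x-) = 2, i.e., 2γ ||w|| = 2, so γ = 1/||w|| and the margin is 2γ = 2/||w||.)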
Margin maximization using canonical hyperplanes
[Figure: canonical hyperplanes w.x + b = +1 and w.x + b = -1 around w.x + b = 0; margin 2γ = 2/||w||]
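(Reconstructed in standard form, since the slide's equations did not extract: maximizing the margin 2/||w|| over canonical hyperplanes is the quadratic program
    minimize over w, b:   w.w
    subject to:           y_i (w.x_i + b) ≥ 1   for every training example i.)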
Support vector machines (SVMs)
[Figure: canonical hyperplanes w.x + b = +1 and w.x + b = -1 around w.x + b = 0; margin 2γ]
Solve efficiently by quadratic programming (QP): well-studied solution algorithms.
Hyperplane defined by support vectors.
What if the data is not linearly separable?
Use features of features of features of features.
What if the data is still not linearly separable?
Minimize w.w and the number of training mistakes: trade off two criteria?
Trade off #(mistakes) and w.w: 0/1 loss with slack penalty C.
Not QP anymore. Also doesn't distinguish near misses and really bad mistakes.
Slack variables: Hinge loss
If margin ≥ 1, don't care. If margin < 1, pay a linear penalty.
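(The standard soft-margin program these bullets describe, reconstructed since the slide's equations did not extract: add a slack variable ξ_i ≥ 0 per example and solve
    minimize over w, b, ξ:   w.w + C Σ_i ξ_i
    subject to:              y_i (w.x_i + b) ≥ 1 - ξ_i,   ξ_i ≥ 0,
which is the same as charging each example the hinge loss max(0, 1 - y_i (w.x_i + b)).)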
Side note: What's the difference between SVMs and logistic regression?
SVM: hinge loss, max(0, 1 - y (w.x + b)).
Logistic regression: log loss, log(1 + exp(-y (w.x + b))).
Constrained optimization
Lagrange multipliers: Dual variables
Move the constraint into the objective function via the Lagrangian, then solve the resulting min-max problem.
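(The slide's concrete example did not extract; the generic pattern, in standard notation: for  minimize_x f(x)  subject to  g(x) ≤ 0, form the Lagrangian L(x, α) = f(x) + α g(x) with dual variable α ≥ 0, and solve  min_x max_{α ≥ 0} L(x, α); under suitable conditions this equals the dual  max_{α ≥ 0} min_x L(x, α).)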
Lagrange multipliers: Dual variables
Solving: set the gradient of the Lagrangian with respect to x to zero, solve for x in terms of α, and substitute back to obtain a problem in α alone.
Dual SVM derivation (1): the linearly separable case
Dual SVM derivation (2): the linearly separable case
Dual SVM interpretation
w.x + b = 0
Dual SVM formulation: the linearly separable case
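(Reconstructed standard form of this dual, since the equations did not extract:
    maximize over α:   Σ_i α_i - (1/2) Σ_i Σ_j α_i α_j y_i y_j (x_i.x_j)
    subject to:        α_i ≥ 0 for all i,   Σ_i α_i y_i = 0,
with w recovered as w = Σ_i α_i y_i x_i and b set from any support vector, i.e., any example with α_i > 0. Note that the data enter only through the dot products x_i.x_j.)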
Dual SVM derivation: the non-separable case
Dual SVM formulation: the non-separable case
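(A standard fact worth noting here, since the slide's own equations did not extract: the non-separable, soft-margin dual is the same problem as above with the box constraint 0 ≤ α_i ≤ C in place of α_i ≥ 0.)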
Why did we learn about the dual SVM?
There are some quadratic programming algorithms that can solve the dual faster than the primal.
But, more importantly, the kernel trick!!! Another little detour ...
Reminder from last time: What if the data is not linearly separable?
Use features of features of features of features.
Feature space can get really large really quickly!
Higher order polynomials
[Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]
m input features, d degree of polynomial: grows fast! d = 6, m = 100: about 1.6 billion terms.
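A quick check of that count (a small illustrative computation, not from the slides): the number of monomials of degree exactly d in m variables is C(m + d - 1, d).

    from math import comb

    m, d = 100, 6
    print(comb(m + d - 1, d))   # 1609344100 terms of degree 6 -> roughly 1.6 billion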
Dual formulation only depends on dot products, not on w!
Dot-product of polynomials
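(A reconstruction of the degree-2 case this slide works through, in my notation: with the feature map Φ(x) = (x^(1)², √2 x^(1) x^(2), x^(2)²), one gets Φ(x).Φ(z) = (x^(1) z^(1))² + 2 x^(1) x^(2) z^(1) z^(2) + (x^(2) z^(2))² = (x.z)². More generally, the dot product of all degree-d monomial features collapses to (x.z)^d, computable in time linear in the input dimension.)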
Finally: the kernel trick!
Never represent features explicitly; compute dot products in closed form.
Constant-time high-dimensional dot products for many classes of features.
Very interesting theory: Reproducing Kernel Hilbert Spaces.
Polynomial kernels
All monomials of degree d in O(d) operations: K(x, z) = (x.z)^d.
How about all monomials of degree up to d?
Solution 0: handle each degree separately and add up the resulting kernels.
Better solution: K(x, z) = (x.z + 1)^d.
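A small numerical check of the degree-2 identity (purely illustrative; the feature-map ordering and variable names are mine):

    import numpy as np

    x = np.array([0.5, -1.0])
    z = np.array([2.0, 0.3])

    def phi(v):
        # explicit degree-2 monomial features (with the sqrt(2) weighting)
        return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])

    print(np.dot(phi(x), phi(z)))   # explicit feature dot product
    print(np.dot(x, z) ** 2)        # kernel evaluation (x.z)^2 -- same value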
Common kernels
Polynomials of degree d: K(x, z) = (x.z)^d
Polynomials of degree up to d: K(x, z) = (x.z + 1)^d
Gaussian kernels: K(x, z) = exp(-||x - z||² / (2σ²))
Sigmoid: K(x, z) = tanh(a (x.z) + c), for suitable constants a, c
Overfitting?
Huge feature space with kernels: what about overfitting???
Maximizing the margin leads to a sparse set of support vectors.
Some interesting theory says that SVMs search for a simple hypothesis with a large margin.
Often robust to overfitting.
What about at classification time?
For a new input x, if we need to represent Φ(x), we are in trouble!
Recall the classifier: sign(w.Φ(x) + b). Using kernels we are cool!
SVMs with kernels
Choose a set of features and a kernel function.
Solve the dual problem to obtain the coefficients α_i.
At classification time, compute f(x) = Σ_i α_i y_i K(x_i, x) + b and classify as sign(f(x)).
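As a concrete illustration (my example using scikit-learn, not part of the original slides), a kernelized SVM exposes exactly these ingredients: the support vectors x_i, their coefficients α_i y_i, and the offset b:

    import numpy as np
    from sklearn.svm import SVC

    # Toy data: two Gaussian blobs (purely illustrative)
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
    y = np.array([1] * 50 + [-1] * 50)

    clf = SVC(kernel="rbf", C=1.0, gamma=0.5).fit(X, y)

    print(clf.support_vectors_.shape)   # the support vectors x_i
    print(clf.dual_coef_.shape)         # their coefficients alpha_i * y_i
    print(clf.predict([[1.5, 1.0]]))    # sign of the kernel expansion + intercept_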
What's the difference between SVMs and Logistic Regression?

                                            SVMs          Logistic Regression
    Loss function                           Hinge loss    Log loss
    High-dimensional features with kernels  Yes!          No
Kernels in logistic regression
Define the weights in terms of the support vectors: w = Σ_i α_i Φ(x_i).
Derive a simple gradient descent rule on the α_i.
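A minimal numpy sketch of that idea, assuming a Gaussian (RBF) kernel and labels y_i in {-1, +1}; the function names, step size, and kernel choice here are mine, not the course's:

    import numpy as np

    def rbf_kernel(A, B, gamma=0.5):
        # Gram matrix K[i, j] = exp(-gamma * ||A_i - B_j||^2)
        d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)

    def kernel_logreg(X, y, steps=200, lr=0.1, gamma=0.5):
        # (no intercept term, for brevity)
        K = rbf_kernel(X, X, gamma)
        alpha = np.zeros(len(X))
        for _ in range(steps):
            f = K @ alpha                             # f(x_i) = sum_j alpha_j K(x_j, x_i)
            grad = -K @ (y / (1.0 + np.exp(y * f)))   # gradient of the log loss w.r.t. alpha
            alpha -= lr * grad
        return alpha

Predictions on a new point x are then sign(Σ_i α_i K(x_i, x)); because the log loss never exactly zeroes out a coefficient, the resulting α is almost never sparse (compare the revisited table below).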
What's the difference between SVMs and Logistic Regression? (Revisited)

                                            SVMs          Logistic Regression
    Loss function                           Hinge loss    Log loss
    High-dimensional features with kernels  Yes!          Yes!
    Solution sparse                         Often yes!    Almost always no!
    Semantics of output                     Margin        Real probabilities