CSE 575: Statistical Machine Learning. Jingrui He, CIDSE, ASU
Transcript
  • CSE 575: Statistical Machine Learning

    Jingrui He CIDSE, ASU

  • Instance-based Learning

  • 3

    1-Nearest Neighbor

    Four things make a memory-based learner:
    1. A distance metric: Euclidean (and many more)
    2. How many nearby neighbors to look at? One
    3. A weighting function (optional): unused
    4. How to fit with the local points? Just predict the same output as the nearest neighbor.
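
    A minimal NumPy sketch of these four choices (Euclidean distance, one neighbor, no weighting, copy the neighbor's output); the function and variable names are illustrative, not from the slides:

        import numpy as np

        def one_nn_predict(X_train, y_train, x_query):
            """Return the training output of the single nearest neighbor
            of x_query under Euclidean distance (no weighting)."""
            dists = np.linalg.norm(X_train - x_query, axis=1)  # distance to every stored example
            return y_train[np.argmin(dists)]                   # copy that neighbor's output

        # toy usage
        X_train = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
        y_train = np.array([0, 1, 1])
        print(one_nn_predict(X_train, y_train, np.array([0.2, 0.1])))   # prints 0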

  • 4

    Consistency of 1-NN: Consider an estimator fₙ trained on n examples,

    e.g., 1-NN, regression, ... The estimator is consistent if its true error goes to zero as the amount of

    data increases; e.g., for noise-free data, consistent if: (see the reconstruction after this slide)

    Regression is not consistent! Representation bias

    1-NN is consistent (under some mild fine print)

    What about variance???
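
    The consistency condition itself did not survive the transcript; a standard way to state it, in LaTeX notation (my reconstruction, not necessarily the slide's exact formula):

        \text{consistent:}\quad \lim_{n \to \infty} \operatorname{error}_{\text{true}}(f_n) = 0,
        \qquad \text{e.g., for noise-free data:}\quad
        \lim_{n \to \infty} \mathbb{E}_x\!\left[\big(f_n(x) - f(x)\big)^2\right] = 0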

  • 5

    1-NN overfits?

  • 6

    k-Nearest Neighbor

    Four things make a memory-based learner:
    1. A distance metric: Euclidean (and many more)
    2. How many nearby neighbors to look at? k
    3. A weighting function (optional): unused
    4. How to fit with the local points? Just predict the average output among the k nearest neighbors.
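
    The matching NumPy sketch for k-NN, now averaging the outputs of the k closest training points (again, names are illustrative, not from the slides):

        import numpy as np

        def knn_predict(X_train, y_train, x_query, k=9):
            """Average the outputs of the k nearest training examples
            under Euclidean distance (no weighting function)."""
            dists = np.linalg.norm(X_train - x_query, axis=1)
            nearest = np.argsort(dists)[:k]     # indices of the k closest stored points
            return y_train[nearest].mean()      # predict their average output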

  • 7

    k-Nearest Neighbor (here k=9)

    k-nearest neighbor for function fitting smooths away noise, but there are clear deficiencies. What can we do about all the discontinuities that k-NN gives us?

  • 8

    Curse of dimensionality for instance-based learning

    Must store and retrieve all data! Most real work is done during testing: for every test sample, we must search through the whole dataset, which is very slow! There are fast methods for dealing with large datasets, e.g., tree-based

    methods and hashing methods; one such method is sketched below.

    Instance-based learning is often poor with noisy or irrelevant features.
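
    One example of such a tree-based method is a k-d tree, here via scikit-learn (a sketch assuming scikit-learn is available; not from the slides):

        import numpy as np
        from sklearn.neighbors import KDTree

        X = np.random.rand(10000, 3)           # stored training data
        tree = KDTree(X)                       # build the tree once
        x_query = np.random.rand(1, 3)
        dist, ind = tree.query(x_query, k=5)   # 5 nearest neighbors without a full linear scan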

  • Support Vector Machines

  • 10

    w.x = Σⱼ w(j) x(j)

    Linear classifiers: Which line is better?

    Data:

    Example i:
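
    The notation the slide refers to, reconstructed in LaTeX (the equations for "Data" and "Example i" did not survive extraction):

        \text{Data: } \{(x_i, y_i)\}_{i=1}^{n}, \qquad x_i \in \mathbb{R}^m, \;\; y_i \in \{-1, +1\},
        \qquad \text{Linear classifier: } \hat{y} = \operatorname{sign}(w \cdot x + b)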

  • 11

    Pick the one with the largest margin!

    w.x = Σⱼ w(j) x(j)

    w.x + b = 0

  • 12

    Maximize the margin

    w.x + b = 0

  • 13

    But there are many planes

    w.x + b = 0

  • 14

    w.x + b = 0

    Review: Normal to a plane

  • 15

    w.x + b = +1

    w.x + b = -1

    w.x + b = 0

    margin 2γ

    x⁻  x⁺

    Normalized margin Canonical hyperplanes

  • 16

    Normalized margin Canonical hyperplanes

    w.x + b = +1

    w.x + b = -1

    w.x + b = 0

    margin 2γ

    x⁻  x⁺

  • 17

    Margin maximization using canonical hyperplanes

    w.x + b = +1

    w.x + b = -1

    w.x + b = 0

    margin 2γ
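
    The key step of this derivation, reconstructed in LaTeX notation (the slide's equations did not survive extraction): for points x⁺ and x⁻ on the two canonical hyperplanes, with x⁺ - x⁻ along the normal direction w,

        w \cdot x^{+} + b = +1, \qquad w \cdot x^{-} + b = -1
        \;\;\Rightarrow\;\; w \cdot (x^{+} - x^{-}) = 2
        \;\;\Rightarrow\;\; \text{margin } 2\gamma = \frac{2}{\|w\|}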

  • 18

    Support vector machines (SVMs)

    w.x + b = +1

    w.x + b = -1

    w.x + b = 0

    margin 2γ

    Solve efficiently by quadratic programming (QP): well-studied solution algorithms

    Hyperplane defined by support vectors
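
    The resulting quadratic program, in standard form (a reconstruction; the slide's formula was lost in extraction):

        \min_{w,\,b}\;\; \tfrac{1}{2}\, w \cdot w
        \qquad \text{subject to} \qquad
        y_i\,(w \cdot x_i + b) \ge 1, \quad i = 1, \dots, n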

  • 19

    What if the data is not linearly separable?

    Use features of features of features of features.

  • 20

    What if the data is still not linearly separable?

    Minimize w.w and the number of training mistakes. Trade off two criteria?

    Trade off #(mistakes) and w.w: 0/1 loss, slack penalty C. Not QP anymore. Also doesn't distinguish near misses

    and really bad mistakes

  • 21

    Slack variables: hinge loss

    If margin ≥ 1, don't care. If margin < 1, pay linear penalty.
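
    The soft-margin objective with slack variables ξᵢ, written out in standard form (a reconstruction of the formula that did not survive the transcript):

        \min_{w,\,b,\,\xi}\;\; \tfrac{1}{2}\, w \cdot w \;+\; C \sum_{i=1}^{n} \xi_i
        \qquad \text{s.t.} \qquad
        y_i\,(w \cdot x_i + b) \ge 1 - \xi_i, \quad \xi_i \ge 0

    Equivalently, each example pays the hinge loss max(0, 1 - yᵢ (w.xᵢ + b)).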

  • 22

    Side note: What's the difference between SVMs and logistic regression?

    SVM: Logistic regression:

    Log loss:
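
    The two losses being compared, as functions of the signed margin y (w.x + b) (standard forms; the slide's formulas did not survive extraction):

        \text{SVM (hinge loss):}\quad \max\!\big(0,\; 1 - y\,(w \cdot x + b)\big)
        \qquad
        \text{Logistic regression (log loss):}\quad \log\!\big(1 + e^{-y\,(w \cdot x + b)}\big)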

  • 23

    Constrained optimization

  • 24

    Lagrange multipliers: dual variables

    Moving the constraint into the objective function. Lagrangian:

    Solve:

  • 25

    Lagrange multipliers: dual variables

    Solving:
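
    A generic sketch of the recipe in LaTeX notation (my reconstruction, not the slide's exact equations): for a constrained problem,

        \min_{x} f(x) \;\;\text{s.t.}\;\; g(x) \le 0
        \qquad\Longrightarrow\qquad
        L(x, \alpha) = f(x) + \alpha\, g(x), \quad \alpha \ge 0

    and one solves the dual problem: maximize over α ≥ 0 the minimum over x of L(x, α).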

  • 26

    Dual SVM derivation (1): the linearly separable case

  • 27

    Dual SVM derivation (2): the linearly separable case

  • 28

    Dual SVM interpretation

    w.x + b = 0

  • 29

    Dual SVM formulation: the linearly separable case
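
    The dual problem in standard form (a reconstruction; the slide's equations were lost in extraction):

        \max_{\alpha}\;\; \sum_{i=1}^{n} \alpha_i \;-\; \tfrac{1}{2} \sum_{i,j} \alpha_i \alpha_j\, y_i y_j\, (x_i \cdot x_j)
        \qquad \text{s.t.}\qquad \alpha_i \ge 0, \quad \sum_{i} \alpha_i y_i = 0

    with w = Σᵢ αᵢ yᵢ xᵢ; the training points with αᵢ > 0 are the support vectors.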

  • 30

    Dual SVM derivation: the non-separable case

  • 31

    Dual SVM formulation: the non-separable case
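
    For reference (standard result, stated here as a reconstruction): the dual is identical to the separable case except that each αᵢ is also bounded above by the slack penalty C,

        0 \le \alpha_i \le C, \qquad \sum_{i} \alpha_i y_i = 0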

  • 32

    Why did we learn about the dual SVM?

    There are some quadratic programming algorithms that can solve the dual faster than the primal

    But, more importantly, the kernel trick!!! Another little detour

  • 33

    Reminder from last time: What if the data is not linearly separable?

    Use features of features of features of features.

    Feature space can get really large really quickly!

  • 34

    Higher order polynomials

    [Plot: number of monomial terms vs. number of input dimensions, for polynomial degrees d = 2, 3, 4]

    m = number of input features, d = degree of polynomial

    The number of terms grows fast! For d = 6, m = 100: about 1.6 billion terms.
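
    For reference, the count behind the 1.6 billion figure (not spelled out in the transcript): the number of degree-d monomials in m input features is

        \binom{m + d - 1}{d}, \qquad \binom{100 + 6 - 1}{6} = \binom{105}{6} \approx 1.6 \times 10^{9}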

  • 35

    The dual formulation only depends on dot products, not on w!

  • 36

    Dot-product of polynomials
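
    The slide's equations did not survive; the classic degree-2 example (a reconstruction) is that with Φ(x) holding all degree-2 monomials, appropriately scaled,

        \Phi(x) \cdot \Phi(z) \;=\; \Big( \sum_{j} x^{(j)} z^{(j)} \Big)^{2} \;=\; (x \cdot z)^{2}

    so the dot product in the huge monomial feature space costs only O(m) operations.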

  • 37

    Finally: the kernel trick!

    Never represent features explicitly. Compute dot products in closed form.

    Constant-time high-dimensional dot products for many classes of features

    Very interesting theory: Reproducing Kernel Hilbert Spaces

  • 38

    Polynomial kernels

    All monomials of degree d in O(d) operations:

    How about all monomials of degree up to d? Solution 0:

    Better solution:
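
    Reconstructed in standard notation (the slide's formulas were lost): K(x, z) = (x.z)^d covers all monomials of exactly degree d; one simple fix for degrees up to d is to sum these kernels, and the usual better solution is a single kernel. Matching the labels "Solution 0" and "Better solution" to the slide is my best guess:

        \text{degree exactly } d: \quad K(x, z) = (x \cdot z)^{d}
        \qquad
        \text{Solution 0:} \quad K(x, z) = \sum_{k=0}^{d} (x \cdot z)^{k}
        \qquad
        \text{Better solution:} \quad K(x, z) = (x \cdot z + 1)^{d}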

  • 39

    Common kernels

    Polynomials of degree d

    Polynomials of degree up to d

    Gaussian kernels

    Sigmoid
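
    Standard formulas for these kernels (reconstructed; the slide's equations did not survive extraction):

        \text{Polynomial of degree } d: \quad K(x, z) = (x \cdot z)^{d}
        \text{Polynomial up to degree } d: \quad K(x, z) = (x \cdot z + 1)^{d}
        \text{Gaussian (RBF):} \quad K(x, z) = \exp\!\big(-\|x - z\|^{2} / 2\sigma^{2}\big)
        \text{Sigmoid:} \quad K(x, z) = \tanh(\eta\, x \cdot z + \nu)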

  • 40

    Overfitting?

    Huge feature space with kernels, what about overfitting??? Maximizing the margin leads to a sparse set of support vectors.

    Some interesting theory says that SVMs search for a simple hypothesis with a large margin

    Often robust to overfitting

  • 41

    What about at classification time? For a new input x, if we need to represent φ(x), we are in trouble!

    Recall the classifier: sign(w.φ(x) + b). Using kernels we are cool!

  • 42

    SVMs with kernels: Choose a set of features and a kernel function. Solve the dual problem to obtain the support vectors αᵢ.

    At classification time, compute:

    Classify as:
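
    The prediction rule in standard form (a reconstruction of the equations that did not survive extraction): with dual solution αᵢ and support-vector set SV,

        f(x) \;=\; \sum_{i \in SV} \alpha_i\, y_i\, K(x_i, x) \;+\; b,
        \qquad \text{classify as } \operatorname{sign}\!\big(f(x)\big)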

  • 43

    What's the difference between SVMs and Logistic Regression?

                                               SVMs          Logistic Regression
     Loss function                             Hinge loss    Log-loss
     High-dimensional features with kernels    Yes!          No

  • 44

    Kernels in logistic regression

    Define weights in terms of support vectors:

    Derive a simple gradient descent rule on αᵢ
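
    Reconstructed in standard notation: writing the weight vector as a combination of the training features,

        w \;=\; \sum_{i} \alpha_i\, \phi(x_i)
        \qquad\Longrightarrow\qquad
        w \cdot \phi(x) \;=\; \sum_{i} \alpha_i\, K(x_i, x)

    so the log loss depends only on kernel evaluations and can be minimized by gradient descent on the αᵢ.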

  • 45

    What's the difference between SVMs and Logistic Regression? (Revisited)

                                               SVMs              Logistic Regression
     Loss function                             Hinge loss        Log-loss
     High-dimensional features with kernels    Yes!              Yes!
     Solution sparse                           Often yes!        Almost always no!
     Semantics of output                       Margin            Real probabilities

