Introduction to Learning Theory

Piyush Rai

Machine Learning (CS771A)

Oct 24, 2016

Machine Learning (CS771A) Introduction to Learning Theory 1

Why Learning Theory?

How can we tell whether our learning algorithm will do a good job in the future (at test time)?

Experimental results

Theoretical analysis

Why theory?

We can only run a limited number of experiments...

Experiments rarely tell us what will go wrong

We may want to deploy our learning algorithms on Mars

Using learning theory, we can make formal statements/guarantees about:

The expected performance ("generalization") of a learning algorithm on test data

The number of examples required to attain a given level of test accuracy

The hardness of learning problems in general

"Theory is the first term in the Taylor series expansion of Practice" - T. Cover

Machine Learning (CS771A) Introduction to Learning Theory 2


Hypothesis Class, Training and True Error

A hypothesis class H is a set of functions/hypotheses (assumed finite for now)

The learning algorithm, given training data, learns a hypothesis h ∈ H

Assume h is learned using a sample D of N i.i.d. training examples {(x_n, y_n)}_{n=1}^{N} drawn from P(x, y) (also denoted D ∼ P^N)

The 0-1 training error (also called the empirical error) of h:

L_D(h) = (1/N) ∑_{n=1}^{N} I(h(x_n) ≠ y_n)

The 0-1 true error (also called the expected error) of h:

L_P(h) = E_{(x,y)∼P}[I(h(x) ≠ y)]

The true error, in general, is worse than the training error

We want to know how much worse it is...

...without doing experiments

Machine Learning (CS771A) Introduction to Learning Theory 3
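As a concrete illustration of the two quantities, the sketch below uses a toy setup that is not from the slides: a fixed threshold classifier on a synthetic distribution with 10% label noise. It computes the training error L_D(h) on a small sample and approximates the true error L_P(h), which is an expectation under P, with a very large sample:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical distribution P(x, y): x ~ Uniform(0, 1), y = 1[x > 0.5],
# with each label flipped independently with probability 0.1
def sample(n):
    x = rng.uniform(0, 1, n)
    y = (x > 0.5).astype(int)
    flip = rng.uniform(0, 1, n) < 0.1
    return x, np.where(flip, 1 - y, y)

# A fixed hypothesis h: threshold at 0.5 (the noise-free rule)
def h(x):
    return (x > 0.5).astype(int)

# Empirical (training) 0-1 error L_D(h) on a small sample D of N = 50 examples
x_tr, y_tr = sample(50)
L_D = np.mean(h(x_tr) != y_tr)

# L_P(h) is an expectation over P; approximate it with a huge i.i.d. sample
x_big, y_big = sample(1_000_000)
L_P = np.mean(h(x_big) != y_big)

print(f"L_D(h) ~= {L_D:.3f}   L_P(h) ~= {L_P:.3f}")  # L_P is close to the 0.1 noise rate
```

With this h, the true error is just the label-noise rate, while the training error on a small D fluctuates around it from sample to sample.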


Case 1: Zero Training Error

Assume some h ∈ H can achieve zero training error

Assume its true error satisfies L_P(h) > ε

The probability of h being correct on a single training example is ≤ 1 − ε

The probability of h having zero error on a training set of N i.i.d. examples:

P_{D∼P^N}(L_D(h) = 0 ∩ L_P(h) > ε) ≤ (1 − ε)^N

Call the event L_D(h) = 0 ∩ L_P(h) > ε "h is bad"

Consider K hypotheses {h_1, ..., h_K}. The probability that at least one of them is bad:

P_{D∼P^N}("h_1 is bad" ∪ ... ∪ "h_K is bad") ≤ K(1 − ε)^N (union bound)

Since K ≤ |H|, K can be replaced by the size of the set H:

P_{D∼P^N}(∃h : "h is bad") ≤ |H|(1 − ε)^N (uniform convergence)

Machine Learning (CS771A) Introduction to Learning Theory 4
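The (1 − ε)^N step can be checked numerically. The sketch below uses a hypothetical setup: a single "bad" hypothesis that misclassifies each i.i.d. example independently with probability p = 0.10, so its true error exceeds ε = 0.05. It simulates many training sets and compares the fraction on which h attains zero training error against the slide's bound:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy parameters: true error p of the bad hypothesis, threshold epsilon,
# training set size N, and number of simulated training sets
p, epsilon, N, trials = 0.10, 0.05, 20, 100_000

# Each row is a training set; an entry is True where h makes a mistake
mistakes = rng.random((trials, N)) < p

# Fraction of training sets on which h has zero training error
frac_zero_error = np.mean(~mistakes.any(axis=1))

bound = (1 - epsilon) ** N  # the slide's bound, using only L_P(h) > epsilon
exact = (1 - p) ** N        # the exact probability for this particular h

print(f"simulated = {frac_zero_error:.4f}  exact = {exact:.4f}  bound = {bound:.4f}")
```

The simulated fraction matches (1 − p)^N and sits below (1 − ε)^N, as the derivation predicts: the bound only uses p > ε, so it is loose for any specific h.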


Case 1: Zero Training Error

Using (1 − ε) < e^{−ε}, we get:

P_{D∼P^N}(∃h : "h is bad") ≤ |H| e^{−Nε}

Set |H| e^{−Nε} = δ. Then, for a given ε and δ,

N ≥ (1/ε)(log |H| + log(1/δ))

...gives the minimum number of training examples needed to ensure that a "bad" h exists with probability at most δ (equivalently, that no h is bad with probability at least 1 − δ)

Essentially, this is a condition under which h will be probably (with probability 1 − δ) and approximately (with error at most ε) correct, given at least this many examples

This is the framework of "Probably Approximately Correct" (PAC) learning

Likewise, given N and δ, with probability at least 1 − δ the true error satisfies:

L_P(h) ≤ (log |H| + log(1/δ)) / N

Machine Learning (CS771A) Introduction to Learning Theory 5
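Both directions of the bound are straightforward to evaluate. A minimal sketch (the function names are my own, and log is the natural logarithm, consistent with the derivation from e^{−Nε}):

```python
import math

def pac_sample_size(H_size, epsilon, delta):
    """Minimum N from N >= (1/epsilon) * (log|H| + log(1/delta))."""
    return math.ceil((math.log(H_size) + math.log(1 / delta)) / epsilon)

def pac_error_bound(H_size, N, delta):
    """Bound on L_P(h): (log|H| + log(1/delta)) / N, holding w.p. >= 1 - delta."""
    return (math.log(H_size) + math.log(1 / delta)) / N

# Example: a hypothesis class of a million functions, target error 1%,
# failure probability 5%
N = pac_sample_size(H_size=10**6, epsilon=0.01, delta=0.05)
print(N)  # 1682

# Sanity check: with that many examples, the error bound is back at ~1%
print(pac_error_bound(10**6, N, 0.05))
```

Note that N grows only logarithmically in |H| and in 1/δ, but linearly in 1/ε.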


PAC Learnability and Efficient PAC Learnability

Definition: An algorithm A is an (ε, δ)-PAC learning algorithm if, for all distributions D, given samples from D, the probability that it returns a "bad" hypothesis h is at most δ, where a "bad" hypothesis is one with error rate greater than ε on D

Definition: An algorithm A is an efficient (ε, δ)-PAC learning algorithm if it is an (ε, δ)-PAC learning algorithm with runtime polynomial in 1/ε and 1/δ

Note: a similar notion of efficient (ε, δ)-PAC learning holds in terms of the number of training examples required (polynomial in 1/ε and 1/δ)

Machine Learning (CS771A) Introduction to Learning Theory 6


Case 2: Non-Zero Training Error

Given N i.i.d. random variables z_1, ..., z_N taking values in [0, 1], the empirical mean:

z̄ = (1/N) ∑_{n=1}^{N} z_n

Let the true mean be μ_z

Hoeffding's inequality (one-sided form) says:

P(μ_z − z̄ ≥ ε) ≤ e^{−2Nε^2}

Since L_D(h) is the empirical mean of the i.i.d. 0-1 losses I(h(x_n) ≠ y_n), whose true mean is L_P(h), the same result gives, for any single hypothesis h ∈ H:

P(L_P(h) − L_D(h) ≥ ε) ≤ e^{−2Nε^2}

Using the union bound over all of H, we have:

P(∃h : L_P(h) − L_D(h) ≥ ε) ≤ |H| e^{−2Nε^2}

Machine Learning (CS771A) Introduction to Learning Theory 7
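Hoeffding's inequality is easy to sanity-check by simulation. The sketch below (not from the lecture; the Bernoulli(0.5) variables, trial count, and seed are illustrative choices) estimates the deviation probability empirically and compares it to the bound. Note the simulation checks the two-sided event |µz − z̄| ≥ ε, whose bound carries an extra factor of 2 relative to the one-sided statement above.

```python
import math
import random

# Empirical check of Hoeffding's inequality for variables bounded in [0, 1]:
# here z_n ~ Bernoulli(0.5), so the true mean mu_z = 0.5.
def deviation_prob(N, eps, trials=20000, p=0.5, seed=0):
    """Empirical estimate of P(|mu_z - zbar| >= eps)."""
    rng = random.Random(seed)
    bad = 0
    for _ in range(trials):
        zbar = sum(rng.random() < p for _ in range(N)) / N
        if abs(p - zbar) >= eps:
            bad += 1
    return bad / trials

N, eps = 100, 0.1
empirical = deviation_prob(N, eps)
hoeffding = 2 * math.exp(-2 * N * eps ** 2)   # two-sided Hoeffding bound
print(empirical, "<=", hoeffding)
```

As the bound predicts, the empirical deviation probability sits well below 2e^(−2Nε²); the bound is distribution-free, so it is typically loose for any particular distribution.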

Case 2: Non-Zero Training Error

Suppose |H| e^(−2Nε²) = δ. Then for a given ε and δ,

    N ≥ (1/(2ε²)) (log |H| + log(1/δ))

.. gives the minimum number of training examples required to ensure that LP(h) − LD(h) ≤ ε with probability at least 1 − δ.

Note: The number of examples grows as the square of 1/ε (note: ε < 1)

In the zero training error case, it grows linearly with 1/ε

For given ε, δ, the non-zero training error case requires more examples

Likewise, given N and δ, with probability 1 − δ, the true error satisfies

    LP(h) ≤ LD(h) + √( (log |H| + log(1/δ)) / (2N) )

Machine Learning (CS771A) Introduction to Learning Theory 8
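The sample-size bound above can be evaluated directly. A minimal sketch (the particular |H|, ε, and δ values are illustrative, not from the lecture):

```python
import math

# Minimum N from the slide's bound: N >= (log|H| + log(1/delta)) / (2*eps^2).
def min_samples(H_size, eps, delta):
    return math.ceil((math.log(H_size) + math.log(1.0 / delta)) / (2 * eps ** 2))

# Illustrative: |H| = 2^20 hypotheses, eps = 0.05, delta = 0.01
N = min_samples(2 ** 20, 0.05, 0.01)
print(N)  # 3694
```

Halving ε roughly quadruples the required N (the 1/ε² dependence noted above), while shrinking δ or growing |H| only costs logarithmically.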

Example: Decision Trees

Let's consider the hypothesis class Hk of decision trees with k leaves

Suppose the data has D binary features/attributes

A loose bound (using Stirling's approximation): |Hk| ≤ D^(k−1) 2^(2k−1)

Thus log2 |Hk| ≤ (k − 1) log2 D + 2k − 1 (linear in k)

Machine Learning (CS771A) Introduction to Learning Theory 9
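Plugging the bound in shows the linear growth in k concretely (the feature count D = 16 below is an illustrative choice):

```python
import math

# The slide's loose count of decision trees with k leaves over D binary
# features: |H_k| <= D^(k-1) * 2^(2k-1), hence
#   log2 |H_k| <= (k-1)*log2(D) + 2k - 1   (affine, i.e. linear growth in k).
def log2_Hk_bound(k, D):
    return (k - 1) * math.log2(D) + 2 * k - 1

for k in (2, 4, 8):
    print(k, log2_Hk_bound(k, D=16))   # 7.0, 19.0, 43.0
```

Since the sample-complexity bound depends on log |H|, allowing more leaves increases the number of examples needed only linearly in k, not exponentially.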

Infinite Sized Hypothesis Spaces

For a finite sized hypothesis class H,

    LP(h) ≤ LD(h) + √( (log |H| + log(1/δ)) / (2N) )

What happens when the hypothesis class size |H| is infinite?

Example: the set of all linear classifiers

The above bound doesn't apply (it just becomes trivial)

We need some other way of measuring the size of H

One way: use the complexity of H as a measure of its size

.. enter the Vapnik-Chervonenkis dimension (VC dimension)

VC dimension: a measure of the complexity of a hypothesis class

Machine Learning (CS771A) Introduction to Learning Theory 10

Shattering

A set of points is shattered by a hypothesis class H if, no matter how the points are labeled, there exists some h ∈ H that can separate the points

Figure (not reproduced in this extract): 3 points in 2D; H: the set of linear classifiers

Machine Learning (CS771A) Introduction to Learning Theory 11
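The figure's claim, that linear classifiers shatter 3 non-collinear points in 2D, can be checked by brute force. The sketch below (not from the lecture) uses a perceptron as the separability test, since the perceptron converges if and only if a labeling is linearly separable:

```python
from itertools import product

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]   # 3 non-collinear points in 2D

def perceptron_separates(pts, labels, max_epochs=1000):
    """True iff some linear classifier (with bias) separates pts under labels."""
    w = [0.0, 0.0, 0.0]             # (w1, w2, b), via augmented input (x1, x2, 1)
    for _ in range(max_epochs):
        mistakes = 0
        for (x1, x2), y in zip(pts, labels):
            if y * (w[0] * x1 + w[1] * x2 + w[2]) <= 0:   # misclassified
                w[0] += y * x1; w[1] += y * x2; w[2] += y
                mistakes += 1
        if mistakes == 0:
            return True             # converged: a separator was found
    return False                    # separable data would have converged by now

# Shattering: every one of the 2^3 labelings must be separable
shattered = all(perceptron_separates(points, labels)
                for labels in product((-1, +1), repeat=3))
print(shattered)  # True
```

By contrast, three collinear points labeled (+, −, +) are not separable by any line, which is why the shattering game (next slide) lets us choose the point positions.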

VC Dimension: The Shattering Game

The concept of shattering is used to define the VC dimension of hypothesis classes

Consider the following shattering game between us and an adversary:

We choose d points in an input space, positioned however we want

The adversary labels these d points

We find a hypothesis h ∈ H that separates the points

Note: Shattering just one configuration of d points is enough to win

The VC dimension of H, in that input space, is the maximum d we can choose so that we always succeed in the game

Machine Learning (CS771A) Introduction to Learning Theory 12

VC Dimension

VC dimension of linear classifiers in R² = 3

VC dimension of linear classifiers in R³ = 4

What about the VC dimension of linear classifiers in R^D? VC dimension = D + 1

Recall: a linear classifier in R^D is defined by D + 1 parameters (D weights plus a bias)

For linear classifiers, high D ⇒ high VC dimension ⇒ high complexity

What about the VC dimension of 1-nearest neighbors? Infinite. Why?

What about the VC dimension of an SVM with RBF kernel? Infinite. Why?

VC dimension intuition: how many points the hypothesis class can "memorize"

Machine Learning (CS771A) Introduction to Learning Theory 13
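The "memorization" intuition makes the infinite VC dimension of 1-NN easy to see in code: each training point is its own nearest neighbor, so 1-NN achieves zero training error on any labeling of any number of distinct points, i.e., it shatters arbitrarily large sets. A minimal sketch (the point set and labels below are randomly generated for illustration):

```python
import random

# 1-NN prediction by squared Euclidean distance (no ML library needed).
def one_nn_predict(train_x, train_y, x):
    i = min(range(len(train_x)),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(train_x[j], x)))
    return train_y[i]

rng = random.Random(0)
X = [(rng.random(), rng.random()) for _ in range(50)]   # 50 distinct points
y = [rng.choice((-1, 1)) for _ in X]                    # an arbitrary labeling

# Each training point's nearest neighbor is itself, so every label is recovered.
train_errors = sum(one_nn_predict(X, y, xi) != yi for xi, yi in zip(X, y))
print(train_errors)  # 0, whatever the labeling was
```

The same argument applies to an SVM with an RBF kernel of small enough bandwidth: it can also fit any labeling, hence the infinite VC dimension noted above.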

Using VC Dimension in Generalization Bounds

Recall the PAC based generalization bound:

    ExpectedLoss(h) ≤ TrainingLoss(h) + √( (log |H| + log(1/δ)) / (2N) )

For hypothesis classes with infinite size (|H| = ∞), but VC dimension d:

    ExpectedLoss(h) ≤ TrainingLoss(h) + √( (d (log(2N/d) + 1) + log(4/δ)) / (2N) )

For linear classifiers, what does it imply?

Having fewer features is better (since it means a smaller VC dimension)

Machine Learning (CS771A) Introduction to Learning Theory 14
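The VC-based complexity term can be computed directly to see the "fewer features is better" claim quantitatively. A sketch (the N and δ values are illustrative; for linear classifiers d = D + 1 as on the previous slide):

```python
import math

# Complexity term of the VC-based bound:
#   sqrt( (d*(log(2N/d) + 1) + log(4/delta)) / (2N) )
def vc_bound_term(d, N, delta):
    return math.sqrt((d * (math.log(2 * N / d) + 1) + math.log(4 / delta))
                     / (2 * N))

for D in (10, 100, 1000):   # fewer features -> smaller d = D + 1 -> tighter bound
    print(D, round(vc_bound_term(D + 1, N=100_000, delta=0.05), 4))
```

The term grows with d (for fixed N) and shrinks roughly as √(d log N / N) with more data, so a higher-dimensional classifier needs proportionally more examples to reach the same guarantee.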

VC Dimension of Support Vector Machines

Recall: the VC dimension of an SVM with RBF kernel is infinite. Is it a bad thing?

Not really. The SVM's large margin property ensures good generalization

Theorem (Vapnik, 1982):
• Given N data points in R^D: X = {x1, . . . , xN} with ||xn|| ≤ R
• Define Hγ: the set of classifiers in R^D having margin γ on X
The VC dimension of Hγ is bounded by:

    VC(Hγ) ≤ min{ D, ⌈4R²/γ²⌉ }

Generalization bound for the SVM:

    ExpectedLoss(h) ≤ TrainingLoss(h) + √( (VC(Hγ) (log(2N/VC(Hγ)) + 1) + log(4/δ)) / (2N) )

Large γ ⇒ small VC dim. ⇒ small complexity of Hγ ⇒ good generalization

Machine Learning (CS771A) Introduction to Learning Theory 15
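The margin-based cap is worth seeing with numbers: even when the ambient dimension D is huge (or effectively infinite, as with the RBF kernel), a large margin keeps the bound small. A sketch implementing the slide's formula (the D, R, and γ values are illustrative):

```python
import math

# The slide's margin-based bound: VC(H_gamma) <= min(D, ceil(4*R^2 / gamma^2)),
# where R bounds the norm of the data and gamma is the margin on it.
def margin_vc_bound(D, R, gamma):
    return min(D, math.ceil(4 * R ** 2 / gamma ** 2))

D, R = 10 ** 6, 1.0      # very high-dimensional feature space, data in unit ball
for gamma in (0.05, 0.2, 0.5):
    print(gamma, margin_vc_bound(D, R, gamma))   # 1600, 100, 16
```

Quadrupling the margin cuts the effective VC dimension by a factor of 16, which is exactly why maximizing the margin, rather than limiting D, is the SVM's route to good generalization.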

Page 84: Introduction to Learning Theory - CSE · Introduction to Learning Theory Piyush Rai Machine Learning (CS771A) Oct 24, 2016 ... Hardness of learning problems in general \Theory is

Things to Remember..

We care about the expected error, not the training error

Generalization bounds quantify the difference between these two errors

These bounds have the following general form:

ExpLoss(h) ≤ TrainLoss(h) + f(model complexity, N), where f approaches 0 as N → ∞

Finite-sized hypothesis spaces: log |H| is a measure of complexity

Infinite-sized hypothesis spaces: VC dimension is a measure of complexity

Often these bounds are loose for moderate values of N

Tighter generalization bounds exist (often data-dependent; e.g., using complexity measures such as Rademacher complexity)

But even loose bounds are often useful for understanding the basic properties of learning models/algorithms
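As an illustration of the finite-|H| case (all numbers hypothetical), a standard uniform-convergence bound has the general form above with f = √((log |H| + log(1/δ)) / (2N)). The sketch below shows this additive term shrinking toward 0 as N grows:

```python
import math

def gap(H_size, N, delta=0.05):
    """Additive term sqrt((log|H| + log(1/delta)) / (2N)) of the
    finite-hypothesis-space uniform-convergence bound."""
    return math.sqrt((math.log(H_size) + math.log(1 / delta)) / (2 * N))

# Even with a million hypotheses, the gap decays as N grows:
for N in (100, 1000, 100000):
    print(N, round(gap(10**6, N), 4))
```

This also illustrates why the bounds can be loose at moderate N: at N = 100 the guaranteed gap is still around 0.29, which says little, while at N = 100000 it drops below 0.01.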

Machine Learning (CS771A) Introduction to Learning Theory 16
