Machine Learning Week 2 Lecture 2

Machine Learning

Week 2, Lecture 2

Hand In

• It is online.
• Web board forum for Matlab questions
• Comments and corrections very welcome. I will upload new versions as we go along. Currently we are at version 3.
• Your data is coming. We might change it over time.

Quiz

• Go through all Questions

Recap

Impossibility of Learning!

x1 x2 x3 | f(x)
 0  0  0 |  1
 1  0  0 |  0
 0  1  0 |  1
 1  1  0 |  1
 0  0  1 |  0
 1  0  1 |  ?
 0  1  1 |  ?
 1  1  1 |  ?

What is f?

There are 256 potential functions; 8 of them have in-sample error 0 (they agree with the 5 labelled rows and differ only on the 3 unknown rows).
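A quick brute-force check of these two counts (a minimal Python sketch; the encoding of the truth table is only for illustration):

from itertools import product

inputs = list(product([0, 1], repeat=3))                 # the 8 possible (x1, x2, x3)
observed = {(0, 0, 0): 1, (1, 0, 0): 0, (0, 1, 0): 1,    # the 5 labelled rows above
            (1, 1, 0): 1, (0, 0, 1): 0}

# every function f: {0,1}^3 -> {0,1} is one assignment of outputs to the 8 inputs
all_functions = list(product([0, 1], repeat=len(inputs)))
consistent = [f for f in all_functions
              if all(f[inputs.index(x)] == y for x, y in observed.items())]

print(len(all_functions))   # 256 potential functions
print(len(consistent))      # 8 of them have in-sample error 0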

Assumptions are needed

No Free Lunch"All models are wrong, but some models are useful.” George Box

Machine Learning has many different models and algorithms

Assumptions that work well in one domain may fail in another

There is no single model that works best for all problems (No Free Lunch Theorem)

Probabilistic Approach
Flip a coin with unknown heads probability μ; repeat N times independently

Sample mean: ν = #heads / N

Sample: h, h, h, t, t, h, t, t, h

μ is unknown

Hoeffding's Inequality

P(|ν − μ| > ε) ≤ 2·exp(−2ε²N)   for any ε > 0

The sample mean is Probably Approximately Correct (PAC)
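A small simulation of this (a sketch; the bias μ = 0.5 and the other parameters are arbitrary choices, not values from the lecture):

import numpy as np

rng = np.random.default_rng(0)
mu, N, eps, trials = 0.5, 100, 0.1, 100_000

# nu = fraction of heads in each of `trials` repetitions of N flips
nu = rng.binomial(N, mu, size=trials) / N

empirical = np.mean(np.abs(nu - mu) > eps)   # estimated P(|nu - mu| > eps)
hoeffding = 2 * np.exp(-2 * eps**2 * N)      # Hoeffding upper bound

print(empirical, "<=", hoeffding)            # roughly 0.035 <= 0.271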

Classification Connection: Testing a Hypothesis

Fixed hypothesis h, unknown target f

μ is the probability of picking x such that f(x) ≠ h(x); 1 − μ is the probability of picking x such that f(x) = h(x)

μ is the sum of the probabilities of all the points x where the hypothesis is wrong

Probability distribution over x

The sample mean ν (the error rate measured on the sample) estimates the true error rate μ

Learning?

• This is only verification, not learning

• For finite hypothesis sets we used the union bound

• Make sure E_out is close to E_in, and minimize E_in

Error Functions

h(x) \ f(x)   Lying   True
Est. Lying      0       ?
Est. True       ?       0

The diagonal is 0 (no penalty when the estimate agrees with the target); the off-diagonal entries are the penalties for the two kinds of error.

Walmart: a discount for a given person. Error Function:

h(x) \ f(x)   Lying   True
Est. Lying      0       ?
Est. True       ?       0

CIA Access (Friday bar stock). Error Function:

h(x) \ f(x)   Lying   True
Est. Lying      0       1
Est. True     1000      0

Accepting an intruder (Est. True when the person is Lying) costs 1000; rejecting a genuine person costs only 1.

The point being: the error function depends on the application.
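A sketch of how such an application-dependent error function is used. The cost matrix below is the CIA-style one above (the exact placement of the 1000 and the 1 is a reconstruction), and the sample predictions are made up for illustration:

# cost[(estimate, truth)] = penalty; classes: "lying" (intruder) / "true" (genuine)
cost = {("lying", "lying"): 0,    ("lying", "true"): 1,
        ("true",  "lying"): 1000, ("true",  "true"): 0}

def weighted_error(predictions, truths, cost):
    """Average application-specific cost of a set of predictions."""
    return sum(cost[(p, t)] for p, t in zip(predictions, truths)) / len(truths)

preds  = ["true", "true",  "lying", "true"]
truths = ["true", "lying", "lying", "true"]
print(weighted_error(preds, truths, cost))   # one false accept -> (0+1000+0+0)/4 = 250.0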

Final Diagram

[Diagram of the full learning setup: the unknown target, expressed as P(y | x), and the unknown input distribution P(x), which determines which inputs it is important to learn; a data set drawn from them; a hypothesis set; a learning algorithm producing the final hypothesis; and an error measure e.]

Today

• We are still only talking about classification
• Test sets
• Work towards learning with infinite-size hypothesis spaces for classification
  – Reinvestigate the union bound
  – Dichotomies
  – Break points

The Test Set

For a fixed hypothesis h, N independent data points, and any ε > 0:
P(|E_test(h) − E_out(h)| > ε) ≤ 2·exp(−2ε²N)

• Split your data into two parts, D-train and D-test
• Train on D-train and select hypothesis h
• Test h on D-test, giving the test error E_test(h)
• Apply the Hoeffding bound to E_test(h)

Test Set

• Strong bound: with 1000 test points, then with at least 98% probability the test error will be within 5% of the out-of-sample error (check: 2·exp(−2·0.05²·1000) = 2·exp(−5) ≈ 0.013, so the failure probability is below 2%)

• Unbiased
  – The test error is just as likely to be better than the out-of-sample error as worse

• Problem: we lose data for training
• If the test error is high, it is no help that we now know the error will also be high in practice
• The test set can NOT be used to select h (contamination)

Learning

Pick a tolerance (risk) δ of failing that you can accept

Set the RHS of the bound (2M·exp(−2ε²N) for a finite hypothesis set of size M, via the union bound) equal to δ and solve for ε:
ε = sqrt( ln(2M/δ) / (2N) )

With probability 1 − δ:

Generalization Bound
E_out(g) ≤ E_in(g) + sqrt( ln(2M/δ) / (2N) )

This is why we minimize the in-sample error.
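A small helper that evaluates this ε (a sketch; the example numbers are arbitrary):

import math

def generalization_eps(N, M, delta):
    """epsilon = sqrt( ln(2M/delta) / (2N) ), from setting 2*M*exp(-2*eps^2*N) = delta."""
    return math.sqrt(math.log(2 * M / delta) / (2 * N))

# e.g. 1000 examples, 100 hypotheses, 5% risk of failure
print(generalization_eps(N=1000, M=100, delta=0.05))   # ~0.064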

Union Bound

Union Bound Learning

The learning algorithm picks a hypothesis h_l

P(h_l is bad) ≤ P(some hypothesis in H is bad) ≤ Σ_i P(h_i is bad) ≤ 2M·exp(−2ε²N)

We did not subtract overlapping events!!!

Hypotheses seem correlated: if h1 is bad (poor generalization) then probably so is h2

Hope to improve union bound result

Change

Goal

• Replace M with something like the effective number of hypotheses

• A general bound, i.e. independent of the target function and the input distribution

• Simple would be nice

Look at finite point sets

Dichotomy: a bit string of length N

Fixed set of N points X = (x1, …, xN) and hypothesis set H

Each h ∈ H gives a dichotomy (h(x1), …, h(xN))

Capturing the “expressiveness” of the hypothesis set on X

How many different dichotomies do we get? At most 2^N.

Growth Function
For a fixed set of N points X = (x1, …, xN) and hypothesis set H, count the dichotomies |{(h(x1), …, h(xN)) : h ∈ H}|. The growth function is the maximum over all choices of the N points:
m_H(N) = max over (x1, …, xN) of |{(h(x1), …, h(xN)) : h ∈ H}| ≤ 2^N

Example 1: Positive Rays
1-dimensional input space (points on the real line); h_a(x) = +1 for x > a, −1 otherwise

The dichotomy only changes when a moves to a different interval between the points, so m_H(N) = N + 1 (see the sketch below)
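A brute-force count of the positive-ray dichotomies (a sketch; the random points are only for illustration):

import numpy as np

def ray_dichotomies(points):
    """Count distinct dichotomies of h_a(x) = +1 iff x > a, over all placements of a."""
    pts = np.sort(np.asarray(points, dtype=float))
    # one candidate threshold in each of the N+1 gaps (before, between and after the points)
    candidates = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return len({tuple(np.where(pts > a, 1, -1)) for a in candidates})

N = 5
print(ray_dichotomies(np.random.default_rng(0).uniform(size=N)))   # N + 1 = 6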

Example 2: Intervals
1-dimensional input space (points on the real line); h(x) = +1 inside the interval [a1, a2], −1 outside

Choose a1 and a2 in two different gaps between the points: C(N+1, 2) dichotomies. Put a1 and a2 in the same gap: one more (the all-minus dichotomy). So m_H(N) = C(N+1, 2) + 1 (see the sketch below)
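The same brute-force count for intervals (a sketch):

import numpy as np
from itertools import combinations_with_replacement
from math import comb

def interval_dichotomies(points):
    """Count distinct dichotomies of h(x) = +1 iff a1 <= x <= a2."""
    pts = np.sort(np.asarray(points, dtype=float))
    gaps = np.concatenate(([pts[0] - 1], (pts[:-1] + pts[1:]) / 2, [pts[-1] + 1]))
    return len({tuple(np.where((pts >= a1) & (pts <= a2), 1, -1))
                for a1, a2 in combinations_with_replacement(gaps, 2)})

N = 5
print(interval_dichotomies(np.random.default_rng(0).uniform(size=N)))   # 16
print(comb(N + 1, 2) + 1)                                               # C(6, 2) + 1 = 16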

Example 3: Convex Sets
2-dimensional input space (points in the plane); h(x) = +1 inside a convex region. Place the N points on a circle: every subset can be cut out by a convex polygon, so m_H(N) = 2^N

Goal Continued

Imagine we can replace M with the growth function m_H(N), so the RHS of the bound becomes 2·m_H(N)·exp(−2ε²N)

The exponential factor in the RHS is dropping exponentially fast in N

If the growth function is a polynomial in N, then the RHS (a polynomial times an exponential) still drops to zero essentially exponentially in N

Generalization Bound (with M replaced by the growth function): with probability 1 − δ,
E_out(g) ≤ E_in(g) + sqrt( ln(2·m_H(N)/δ) / (2N) )

Bright idea: prove the growth function is polynomial in N

Prove we can replace M with the growth function

Bounding Growth Function

• Might be hard to compute

• Instead of computing the exact value

• Prove that it is bounded by a polynomial

Shattering and Break Point

If H can produce all 2^N dichotomies on (x1, …, xN), then we say that H shatters (x1, …, xN)

If no data set of size k can be shattered by H, then k is a break point for H

If k is a break point for H, then so are all numbers larger than k. Why?

Revisit Examples

• Positive rays: m_H(N) = N + 1, so no 2 points can be shattered; break point 2

• Intervals: m_H(N) = C(N+1, 2) + 1, so no 3 points can be shattered; break point 3

• Convex sets: m_H(N) = 2^N; no break point

2D Linear Classification (Hyperplanes)

2D Linear Classification: 3 points on a line cannot be shattered, but 3 points in general position can, so m_H(3) = 2³ = 8 (the growth function takes the maximum over point sets)

For the 2D linear classification hypothesis set, 4 is a break point: no set of 4 points can be shattered (checked on concrete configurations in the sketch below)
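A check of the two shattering claims on concrete point sets (a sketch; it uses an LP feasibility test for linear separability, and the coordinates are arbitrary). Note it only tests one particular 4-point configuration, whereas the break-point claim is about every configuration:

import numpy as np
from itertools import product
from scipy.optimize import linprog

def separable(X, y):
    """True iff some w, b satisfy y_i * (w . x_i + b) >= 1 for all i (LP feasibility)."""
    X, y = np.asarray(X, float), np.asarray(y, float)
    A = -y[:, None] * np.hstack([X, np.ones((len(X), 1))])   # variables z = (w1, w2, b)
    res = linprog(c=[0, 0, 0], A_ub=A, b_ub=-np.ones(len(X)),
                  bounds=[(None, None)] * 3, method="highs")
    return res.status == 0

triangle = [(0, 0), (1, 0), (0, 1)]        # 3 points in general position
print(all(separable(triangle, y) for y in product([-1, 1], repeat=3)))   # True: shattered

xor = [(0, 0), (1, 1), (0, 1), (1, 0)]     # 4 points with the XOR labelling
print(separable(xor, [1, 1, -1, -1]))      # False: this dichotomy cannot be realized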

Break Points and Growth Function

If H has a break point, then the growth function is polynomial in N (needs proof)

If not, then it is not! By the definition of a break point, a hypothesis set with no break point has m_H(N) = 2^N for all N

Break Point Game

The hypothesis set has break point 2, i.e. no 2 points may get all four label patterns:

x1 x2
 0  0
 0  1
 1  0
 1  1

Game: on 3 points, list as many dichotomies as possible without shattering any 2 of the points. The eight candidate rows are:

Row  x1 x2 x3
 1    0  0  1
 2    0  0  0
 3    0  1  0
 4    0  1  1
 5    1  0  0
 6    1  0  1
 7    1  1  0
 8    1  1  1

Rows 1, 2, 3 and 5 can be listed together. Adding any of the remaining rows is impossible:

Rows 1, 2, 3, 4 shatter (x2, x3)
Rows 6, 5, 2, 1 shatter (x1, x3)
Rows 7, 5, 3, 2 shatter (x1, x2)
Rows 8, 5, 3, 2 shatter (x1, x2)

So at most 4 dichotomies are possible on 3 points with break point 2 (a brute-force check is sketched below).
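A brute-force version of the game (a sketch): search over all sets of dichotomies on 3 points and keep the largest one that shatters no pair of points.

from itertools import product, combinations

points = range(3)
rows = list(product([0, 1], repeat=3))          # the 8 candidate dichotomies

def shatters_a_pair(dichotomies):
    """True iff some 2 of the 3 points receive all 4 label patterns."""
    return any(len({(d[i], d[j]) for d in dichotomies}) == 4
               for i, j in combinations(points, 2))

best = max((S for r in range(len(rows) + 1) for S in combinations(rows, r)
            if not shatters_a_pair(S)), key=len)
print(len(best), best)    # 4 dichotomies, matching B(3, 2) = 4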

Proof Coming

If H has a break point, then the growth function is polynomial

Definition: B(N, k) is the maximal number of dichotomies possible on N points such that no subset of k points can be shattered by the dichotomies

(Recall: if no data set of size k can be shattered by H, then k is a break point for H)

This is more general than hypothesis sets:

m_H(N) ≤ B(N, k) for any H with break point k

Computing B(N, k): Boundary Cases

B(N, 1) = 1: we cannot shatter any set of size 1, so there is no way of picking dichotomies that give different classes to some point. Only one dichotomy is possible, since a second dichotomy would give a different class to at least one point.

B(1, k) = 2 for k > 1: there is only one point, thus only 2 dichotomies are possible.

Computing B(N, k): Recursion (N, k > 1)

Take a list L of B(N, k) dichotomies achieving the maximum. Split L into S1, the dichotomies whose pattern on the first N − 1 points occurs only once in L (say α of them), and S2, the dichotomies whose pattern on the first N − 1 points occurs twice, once with each value of the last point (say 2β of them), so B(N, k) = α + 2β.

Consider the first N − 1 points: there are α + β different dichotomies (the two halves of S2 become identical here). They still cannot shatter k points, i.e. B(N − 1, k) is an upper bound: α + β ≤ B(N − 1, k).

Consider the first N − 1 points in S2. If they could shatter k − 1 points, we could extend with the last point, where both values occur for all of these dichotomies; this gives k points we can shatter, a contradiction. So β ≤ B(N − 1, k − 1).

Hence B(N, k) = α + 2β ≤ B(N − 1, k) + B(N − 1, k − 1).

Proof Coming

Claim: B(N, k) ≤ Σ_{i=0}^{k−1} C(N, i)

Base Cases:
B(N, 1) = 1 = C(N, 0), and B(1, k) = 2 = C(1, 0) + C(1, 1) for k > 1

Induction Step

Assume the claim holds for N0 (for every k); show it for N0 + 1 and k > 1 (k = 1 was a base case):

B(N0 + 1, k) ≤ B(N0, k) + B(N0, k − 1) ≤ Σ_{i=0}^{k−1} C(N0, i) + Σ_{i=0}^{k−2} C(N0, i)

Change parameter: shift the index of the second sum by one, so it runs over i = 1, …, k − 1 with terms C(N0, i − 1)

Continue

Make it into one sum:
B(N0 + 1, k) ≤ C(N0, 0) + Σ_{i=1}^{k−1} [ C(N0, i) + C(N0, i − 1) ]

Recurrence for binomials: C(N0, i) + C(N0, i − 1) = C(N0 + 1, i)

Add in the zero index again, using C(N0, 0) = 1 = C(N0 + 1, 0):

B(N0 + 1, k) ≤ Σ_{i=0}^{k−1} C(N0 + 1, i)

QED
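A small sketch that evaluates the recursion and compares it with the binomial sum (the two agree, and the sum is a polynomial in N of degree k − 1):

from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def B_upper(N, k):
    """Upper bound on B(N, k) built from the base cases and the recursion above."""
    if k == 1:
        return 1
    if N == 1:
        return 2
    return B_upper(N - 1, k) + B_upper(N - 1, k - 1)

for N in range(1, 8):
    for k in range(1, 5):
        assert B_upper(N, k) == sum(comb(N, i) for i in range(k))

print(B_upper(100, 4), sum(comb(100, i) for i in range(4)))   # 166751 166751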

