
Batch online learning

Sham Kakade · Adam Kalai
Toyota Technological Institute (TTI)

[Title-slide art: "batch" (i.i.d.) vs. "transductive" online learning [Littlestone89]]

Batch learning vs. Online learning

Family of functions F (e.g. halfspaces).

Batch learning: a distribution over X × {−,+}.
[Figure: an i.i.d. sample of +/− labeled points, with a hypothesis h separating most of them]

Agnostic model [Kearns, Schapire, Sellie 94]:
  Alg. H: (x1,y1),…,(xn,yn) → h ∈ F
  Def. H learns F if, ∀ dist: E[err(h)] ≤ min_{f∈F} err(f) + n^{−c},
  and H runs in time poly(n).
  ERM = "best on data".

Online learning: (x1,y1),…,(xn,yn) ∈ X × {−,+}, an arbitrary sequence.
[Figure: the learner picks h1, then sees −x1; picks h2, sees +x2; picks h3, sees +x3; …]

Goal: err(alg) ≤ min_{f∈F} err(f) + n^{−c}.
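ERM itself is just a search for the empirical minimizer. A minimal sketch in Python (the 1-D threshold class and the noisy source are illustrative stand-ins, not the slides' halfspaces):

    import random

    def erm(F, data):
        """Empirical risk minimization: the f in F with fewest errors on data."""
        return min(F, key=lambda f: sum(f(x) != y for x, y in data))

    # Stand-in class: 1-D thresholds (one-dimensional halfspaces).
    F = [lambda x, t=t: +1 if x >= t else -1 for t in [i / 10 for i in range(11)]]

    # i.i.d. sample from a distribution over X x {-,+} with 10% label noise,
    # so no f in F is perfect: the agnostic setting.
    data = []
    for _ in range(200):
        x = random.random()
        y = +1 if x >= 0.5 else -1
        if random.random() < 0.1:
            y = -y
        data.append((x, y))

    h = erm(F, data)  # "best on data"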




Transductive online learning [Ben-David, Kushilevitz, Mansour 95]

The unlabeled examples {x1,x2,…,xn} are given in advance; only the labels arrive online.
[Figure: the n unlabeled points; labels −x1, +x2, +x3 revealed so far; current hypothesis h1]

"Proper" learning: output h(i) ∈ F.

Analogous definition:
  Alg. H: (x1,y1),…,(x_{i−1},y_{i−1}) → h_i ∈ F
  H learns F if, ∀ (x1,y1),…,(xn,yn):
  E[err(H)] ≤ min_{f∈F} err(f) + n^{−c},
  and H runs in time poly(n).
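In code, the protocol is a short loop; a sketch (the names `learner`, `xs`, `labels` are illustrative, not the slides' notation):

    def transductive_online(xs, labels, learner):
        """xs = {x1,…,xn} is known up front; labels are revealed one at a time."""
        history, mistakes = [], 0
        for i, x in enumerate(xs):
            h_i = learner(xs, history)   # proper learning: h_i must lie in F
            mistakes += (h_i(x) != labels[i])
            history.append((x, labels[i]))
        return mistakes / len(xs)        # err(H)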

Our results

Theorem 1. In the online transductive setting,
  err(HHERM) ≤ min_{f∈F} err(f) + O(n^{−1/4}),
and HHERM requires one ERM computation per sample.

Theorem 2. When ERM can be done efficiently and VC(F) is finite, these are equivalent for proper learning:
  ⇔ F is agnostically learnable
  ⇔ ERM agnostically learns F
  ⇔ F is online transductively learnable
  ⇔ HHERM online transductively learns F

HHERM = Hallucination + ERM.

Online ERM algorithm

Choose h_i ∈ F with minimal errors on (x1,y1),…,(x_{i−1},y_{i−1}):
  h_i = argmin_{f∈F} |{ j < i : f(x_j) ≠ y_j }|

It can fail badly online. Take X = {(0,0)} and F = {−,+}, the two constant functions:
  x1 = (0,0), y1 = −    h1(x) = +
  x2 = (0,0), y2 = +    h2(x) = −
  x3 = (0,0), y3 = −    h3(x) = +
  x4 = (0,0), y4 = +    h4(x) = −   …
ERM errs on every round (sucks), while the best fixed f ∈ F errs only half the time.
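The alternating-label example, simulated (a minimal sketch; ties are broken toward the first function, matching h1 = + above):

    F = [lambda x: +1, lambda x: -1]               # F = {+, -}, X = {(0,0)}

    def online_erm_mistakes(labels):
        history, mistakes = [], 0
        for y in labels:
            # h_i = argmin_{f in F} errors on the history so far
            h = min(F, key=lambda f: sum(f(x) != yj for x, yj in history))
            mistakes += (h((0, 0)) != y)
            history.append(((0, 0), y))
        return mistakes

    n = 100
    labels = [(-1) ** (i + 1) for i in range(n)]   # -, +, -, +, ...
    print(online_erm_mistakes(labels))             # n mistakes; best fixed f: n/2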

Online ERM algorithm: a stability bound

Choose h_i ∈ F with minimal errors on (x1,y1),…,(x_{i−1},y_{i−1}):
  h_i = argmin_{f∈F} |{ j < i : f(x_j) ≠ y_j }|

Online "stability" lemma [Kalai, Vempala 01]:
  err(ERM) ≤ min_{f∈F} err(f) + Pr_{i∈{1,…,n}}[h_i ≠ h_{i+1}]

Proof by induction on n = #examples (easy!). Intuition: playing the next leader h_{i+1} at round i already matches the best fixed f ("be the leader"), so ERM pays extra only on rounds where the leader changes.
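A quick numerical check of the lemma on the toy class (mistake counts compared exactly; h_{n+1} is the ERM on all n examples):

    import random

    F = [lambda x: +1, lambda x: -1]
    pts = [((0, 0), random.choice([-1, +1])) for _ in range(200)]

    history, hs = [], []
    for x, y in pts:
        hs.append(min(F, key=lambda f: sum(f(xj) != yj for xj, yj in history)))
        history.append((x, y))
    hs.append(min(F, key=lambda f: sum(f(xj) != yj for xj, yj in history)))  # h_{n+1}

    n = len(pts)
    m_erm = sum(hs[i](pts[i][0]) != pts[i][1] for i in range(n))
    m_best = min(sum(f(x) != y for x, y in pts) for f in F)
    switches = sum(hs[i] is not hs[i + 1] for i in range(n))
    assert m_erm <= m_best + switches   # the lemma, in mistake counts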

Online HHERM algorithm

Inputs: the example set {x1,x2,…,xn} and an integer R.
For each x in the set, hallucinate r_x copies of (x,+) and r̄_x copies of (x,−), with r_x, r̄_x drawn independently at random from {1,2,…,R}.
Choose h_i ∈ F that minimizes errors on the hallucinated data + (x1,y1),…,(x_{i−1},y_{i−1}).

Stability: ∀i, Pr_{r_{x_i}, r̄_{x_i}}[h_i ≠ h_{i+1}] ≤ R^{−1}.
[Figure: x_i with its r_{x_i} hallucinated copies (x_i,+),…,(x_i,+), and likewise for (x_i,−)]

[Photo: James Hannan, whose random-perturbation idea the hallucination trick echoes]
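A sketch of HHERM on the same toy setup (the per-point count lists and the scoring helper are implementation choices, not the slides'):

    import random

    def hherm_error(xs, labels, F, R):
        """Hallucination + ERM: ERM over random fake examples plus the real history."""
        r_plus = [random.randint(1, R) for _ in xs]    # r_x copies of (x, +)
        r_minus = [random.randint(1, R) for _ in xs]   # r_bar_x copies of (x, -)

        def score(f, history):
            fake = sum(rp * (f(x) != +1) + rm * (f(x) != -1)
                       for x, rp, rm in zip(xs, r_plus, r_minus))
            return fake + sum(f(xj) != yj for xj, yj in history)

        history, mistakes = [], 0
        for x, y in zip(xs, labels):
            h = min(F, key=lambda f: score(f, history))  # ERM incl. hallucinations
            mistakes += (h(x) != y)
            history.append((x, y))
        return mistakes / len(xs)

    # e.g. the alternating-label example above:
    xs = [(0, 0)] * 100
    labels = [(-1) ** (i + 1) for i in range(100)]
    F = [lambda x: +1, lambda x: -1]
    print(hherm_error(xs, labels, F, R=10))   # typically ~0.5, vs. 1.0 for plain ERM

The random offsets make the leader sticky: the real error counts of the two constants never drift more than 1 apart, so the hallucinated leader almost never switches and the error approaches the best fixed error of about 1/2.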

Online HHERM algorithm: analysis

Combine the online "stability" lemma with per-round stability (∀i, Pr[h_i ≠ h_{i+1}] ≤ R^{−1}) and the hallucination cost.

Theorem 1. For R = n^{1/4}:
  err(HHERM) ≤ min_{f∈F} err(f) + O(n^{−1/4}),
and it requires one ERM computation per example.
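Where R = n^{1/4} comes from, as a back-of-the-envelope balance (a sketch; the R√n scale for the hallucination cost is the natural concentration estimate, with VC-dependent factors suppressed):

\[
\underbrace{\frac{n}{R}}_{\text{expected \#switches}}
\;+\;
\underbrace{O(R\sqrt{n})}_{\text{hallucination cost}}
\quad\text{balances at}\quad
\frac{n}{R} = R\sqrt{n}
\;\Longleftrightarrow\;
R = n^{1/4},
\]

so the total regret is O(n^{3/4}), i.e. an additive error rate of O(n^{−1/4}).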

Being more adaptive (shifting bounds)

(x1,y1), …, (x_i,y_i), …, (x_{i+W},y_{i+W}), …, (xn,yn)
             └──────────── window ────────────┘

Goal: do well on every window.

Related work

• Inequivalence of batch and online learning in the noiseless setting [Blum90, Balcan06]
  – our ERM black box is noiseless
  – the inequivalence is for computational reasons!
• Inefficient alg. for online trans. learning [Ben-David, Kushilevitz, Mansour 95]:
  – list all ≤ (n+1)^{VC(F)} labelings (Sauer's lemma)
  – run weighted majority over them [Littlestone, Warmuth 92]
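The inefficient route, sketched: treat each achievable labeling of {x1,…,xn} as an expert and run deterministic weighted majority. Here the labelings are assumed to be given as ±1 tuples (enumerating them via F is the expensive part), and halving weights is one standard choice:

    def weighted_majority(labelings, labels, beta=0.5):
        """Experts are full labelings (tuples of +/-1) of x1,…,xn."""
        w = [1.0] * len(labelings)
        mistakes = 0
        for i, y in enumerate(labels):
            vote = sum(wk * lab[i] for wk, lab in zip(w, labelings))
            mistakes += ((+1 if vote >= 0 else -1) != y)
            # shrink the weight of every expert that mislabeled round i
            w = [wk * (beta if lab[i] != y else 1.0) for wk, lab in zip(w, labelings)]
        return mistakes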

Conclusions

• An algorithm for removing the i.i.d. assumption, efficiently, using unlabeled data.
• An interesting way to use unlabeled data online, reminiscent of bootstrap/bagging.
• Adaptive version: can do well on every window.
• Find the "right" algorithm/analysis.

