Batch (i.i.d.) vs. transductive online [Littlestone89] learning
Sham Kakade, Adam Kalai
Toyota Technological Institute (TTI)
Batch learning vs. online learning

Family of functions F (e.g., halfspaces)

Batch learning: a distribution over X × {–,+};
i.i.d. sample (x1,y1),…,(xn,yn); output hypothesis h
[figure: + and – points in the plane, separated by a halfspace h]

Agnostic model [Kearns,Schapire,Sellie94]:
Alg. H: (x1,y1),…,(xn,yn) → h ∈ F
Def. H learns F if, ∀ dist.: E[err(h)] ≤ min_{f∈F} err(f) + n^{-c},
and H runs in time poly(n). ERM = "best on data".

Online learning: (x1,y1),…,(xn,yn) ∈ X × {–,+} arbitrary;
output h1 before seeing any data, h2 after (x1,y1), …; h_i predicts on x_i.
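To make "ERM = best on data" concrete, here is a minimal sketch. It assumes a small finite class that we can enumerate by brute force; the 1-D threshold functions below stand in for halfspaces and are purely illustrative.

```python
# Minimal ERM sketch (assumption: F is a small finite class we can enumerate;
# 1-D thresholds stand in here for the slide's halfspaces).
def erm(F, sample):
    # ERM = "best on data": the f in F with the fewest mistakes on the sample
    return min(F, key=lambda f: sum(f(x) != y for x, y in sample))

# thresholds t define f_t(x) = '+' iff x >= t
F = [lambda x, t=t: '+' if x >= t else '-' for t in range(6)]
sample = [(0, '-'), (1, '-'), (2, '-'), (3, '+'), (4, '+')]
h = erm(F, sample)
print(sum(h(x) != y for x, y in sample))  # → 0: the threshold t=3 fits perfectly
```

With a realizable sample, ERM drives the empirical error to zero; the agnostic definition above only asks that it be within n^{-c} of the best f ∈ F.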
Family of functions F (e.g. halfspaces)
Batch learning vs.dist. over X £ {––,++}
––
––
++
––
––
––
––
––
––
––
++
++
++
++
––
––––
––
––
––
––
––++
++
++
++++
++
++
++
++
++
++
++
++++
++++
++
++
––
––––
––
++++
++
XX
h
Online learning (x1,y1)…(xn,yn) 2 X £ {––,++}
XX
arbitrary
––x1
++x2
ERM = “best on data”
Goal: err(alg) ≤ min_{f∈F} err(f) + n^{-c}   [Ben-David,Kushilevitz,Mansour95]
Transductive online learning

The unlabeled set {x1,x2,…,xn} is known in advance.
[figure: the n points, shown without labels]

"Proper" learning: output h_i ∈ F (equivalent)

Analogous definition:
Alg. H: (x1,y1),…,(x_{i-1},y_{i-1}) → h_i ∈ F
H learns F if, ∀ (x1,y1),…,(xn,yn):
E[err(H)] ≤ min_{f∈F} err(f) + n^{-c},
and H runs in time poly(n).
Our results

Theorem 1. In the online transductive setting, HHERM learns F and
requires one ERM computation per sample.

Theorem 2. These are equivalent for proper learning:
• F is agnostically learnable
• ERM agnostically learns F (ERM can be done efficiently and VC(F) is finite)
• F is online transductively learnable
• HHERM online transductively learns F

HHERM = Hallucination + ERM
Online ERM algorithm
Choose h_i ∈ F with minimal errors on (x1,y1),…,(x_{i-1},y_{i-1}):
h_i = argmin_{f∈F} |{ j < i : f(xj) ≠ yj }|

Bad example: F = {–,+} (the two constant functions), X = { (0,0) }
x1 = (0,0), y1 = –    h1(x) = +
x2 = (0,0), y2 = +    h2(x) = –
x3 = (0,0), y3 = –    h3(x) = +
x4 = (0,0), y4 = +    h4(x) = – …
ERM errs on every single example. (sucks)
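The failure mode above is easy to reproduce. A minimal simulation, breaking ties toward '+' as the slide's h1(x) = + does:

```python
# The slide's bad example for online ERM: F = {always-, always+},
# every point is x = (0,0), and labels alternate -, +, -, +, ...
# ERM (ties broken toward '+', matching h1 on the slide) errs every round.
def online_erm(labels):
    errors = 0
    m = {'+': 0, '-': 0}              # m[y] = count of label y seen so far
    for y in labels:
        # h_i = argmin_f (past mistakes); f='+' errs on the '-' labels seen
        h = '+' if m['-'] <= m['+'] else '-'
        errors += (h != y)
        m[y] += 1
    return errors / len(labels)

print(online_erm(['-', '+'] * 50))    # → 1.0: ERM errs on all 100 rounds
```

The best fixed f ∈ F errs on only half the sequence, so plain online ERM is a factor of two worse than the comparator here.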
Online "stability" lemma [Kalai,Vempala01]:
err(ERM) ≤ min_{f∈F} err(f) + Pr_{i∈{1,…,n}}[h_i ≠ h_{i+1}],
where h_i = argmin_{f∈F} |{ j < i : f(xj) ≠ yj }| and i is uniform over {1,…,n}.
Proof by induction on n = #examples — easy!
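The induction the slide alludes to is the standard "be-the-leader" argument; the following sketch is my reconstruction, written in mistake counts rather than error rates.

```latex
\textbf{Claim (be the leader).}\quad
\sum_{i=1}^{n}\mathbf 1[h_{i+1}(x_i)\ne y_i]\;\le\;
\min_{f\in F}\sum_{i=1}^{n}\mathbf 1[f(x_i)\ne y_i].

\emph{Proof sketch (induction on $n$).} Trivial for $n=0$. For the step,
the inductive hypothesis instantiated at $f=h_{n+1}$ gives
\[
\sum_{i=1}^{n-1}\mathbf 1[h_{i+1}(x_i)\ne y_i]
\;\le\;\sum_{i=1}^{n-1}\mathbf 1[h_{n+1}(x_i)\ne y_i],
\]
and adding the round-$n$ term $\mathbf 1[h_{n+1}(x_n)\ne y_n]$ to both sides
yields
\[
\sum_{i=1}^{n}\mathbf 1[h_{i+1}(x_i)\ne y_i]
\;\le\;\sum_{i=1}^{n}\mathbf 1[h_{n+1}(x_i)\ne y_i]
\;=\;\min_{f\in F}\sum_{i=1}^{n}\mathbf 1[f(x_i)\ne y_i],
\]
since $h_{n+1}$ is exactly ERM on all $n$ examples. Finally,
$\mathbf 1[h_i(x_i)\ne y_i]\le\mathbf 1[h_{i+1}(x_i)\ne y_i]
+\mathbf 1[h_i\ne h_{i+1}]$; summing over $i$ and dividing by $n$ gives
$\operatorname{err}(\mathrm{ERM})\le\min_{f\in F}\operatorname{err}(f)
+\Pr_{i}[h_i\ne h_{i+1}]$.
```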
Online HHERM algorithm
Inputs: the unlabeled set {x1,x2,…,xn} and an integer R.
For each x in the set, hallucinate r_x⁺ copies of (x,+) and r_x⁻ copies of (x,–),
with r_x⁺, r_x⁻ each drawn uniformly at random from {1,2,…,R}.
Choose h_i ∈ F that minimizes errors on the hallucinated data plus
(x1,y1),…,(x_{i-1},y_{i-1}).

Stability: ∀i, Pr_{r_{x_i}⁺, r_{x_i}⁻}[h_i ≠ h_{i+1}] ≤ R^{-1}
(the hallucination idea goes back to James Hannan)
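To see the hallucination at work, here is a sketch of HHERM on the same toy instance where plain online ERM errs every round (F = the two constant functions, all points equal). The structure of F and the tie-breaking are illustrative simplifications.

```python
# HHERM sketch on the toy class F = {always-, always+}, X = {(0,0)}.
# Hallucinate r+ copies of (x,+) and r- copies of (x,-), with r+, r-
# uniform in {1,...,R} and drawn once up front, then run ERM on
# hallucinated + seen data each round.
import random

def hherm(labels, R, rng):
    rp = rng.randint(1, R)        # hallucinated (x,+) copies
    rm = rng.randint(1, R)        # hallucinated (x,-) copies
    m = {'+': 0, '-': 0}          # real label counts so far
    errors = 0
    for y in labels:
        # mistakes of f='+': real '-' labels seen + hallucinated (x,'-') copies
        cost_plus, cost_minus = m['-'] + rm, m['+'] + rp
        h = '+' if cost_plus <= cost_minus else '-'
        errors += (h != y)
        m[y] += 1
    return errors / len(labels)

rng = random.Random(1)
labels = ['-', '+'] * 100                  # best fixed f errs on 50%
avg = sum(hherm(labels, R=20, rng=rng) for _ in range(50)) / 50
print(avg)   # close to 0.5, vs. 1.0 for plain online ERM
```

On this instance the hallucinated counts effectively fix a random tie-break, so HHERM behaves like one of the two constant functions and errs about half the time, matching the best f ∈ F up to the roughly 1/R chance that the two draws collide.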
Analysis of HHERM:
err(HHERM) ≤ min_{f∈F} err(f) + hallucination cost + stability cost
• the online "stability" lemma bounds the cost of switching hypotheses
• hallucination cost: extra errors charged by the fake examples

Theorem 1. For R = n^{1/4}:
err(HHERM) ≤ min_{f∈F} err(f) + O(n^{-1/4})
(up to factors depending on VC(F) and log n).
It requires one ERM computation per example.
Being more adaptive (shifting bounds)
(x1,y1),…,(x_i,y_i),…,(x_{i+W},y_{i+W}),…,(xn,yn)
window: (x_i,y_i),…,(x_{i+W},y_{i+W})
Related work
• Inequivalence of batch and online learning in the noiseless setting
  [Blum90,Balcan06]
  – the ERM black box is noiseless
  – for computational reasons!
• Inefficient alg. for online transductive learning
  [Ben-David,Kushilevitz,Mansour95]:
  – list all ≤ (n+1)^{VC(F)} labelings (Sauer's lemma)
  – run weighted majority [Littlestone,Warmuth92]
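The inefficient baseline can be sketched directly: treat each labeling of {x1,…,xn} induced by F as one expert and run randomized weighted majority over them. Here the list of labelings is passed in explicitly (in reality it would be enumerated from F, which is what makes the approach inefficient), and the update rule is the standard multiplicative penalty.

```python
# Sketch of the inefficient transductive algorithm: one expert per candidate
# labeling of {x1,...,xn} (Sauer: at most (n+1)^VC(F) of them), with
# randomized weighted majority over the experts. 'labelings' is assumed given.
import random

def weighted_majority(labelings, true_labels, eta=0.5, seed=0):
    rng = random.Random(seed)
    w = [1.0] * len(labelings)
    mistakes = 0
    for i, y in enumerate(true_labels):
        # randomized prediction: sample an expert with prob. proportional to weight
        k = rng.choices(range(len(labelings)), weights=w)[0]
        mistakes += (labelings[k][i] != y)
        # multiplicatively penalize every expert that mislabels x_i
        for j, lab in enumerate(labelings):
            if lab[i] != y:
                w[j] *= (1 - eta)
    return mistakes

# toy run: 3 candidate labelings of 4 points; the truth is among them
labelings = [['+', '+', '-', '-'], ['-', '-', '+', '+'], ['+', '-', '+', '-']]
truth = ['+', '+', '-', '-']
print(weighted_majority(labelings, truth))
```

The mistake bound of weighted majority is logarithmic in the number of experts, i.e. O(VC(F) log n) here, but enumerating the labelings takes time n^{VC(F)}, which is exactly the inefficiency the slide contrasts with HHERM's one-ERM-call-per-example.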