Administrivia, Introduction to Online Learning
CS 159: Advanced Topics in Machine Learning
3/29/2016
Class Details
• Instructor: Yisong Yue
• TAs: Hoang Le, Stephan Zheng
• Course Website: http://www.yisongyue.com/courses/cs159/
Style of Course
• Graduate level course
• Give students an overview of topics
• Dig deep into one topic for final project
• Assume students are mathematically mature
– Goal is to understand basic concepts
– Understand specific mathematical details depending on your interest
Grading Breakdown
• Participation (20%)
• Mini-quizzes (10%)
• Final Project (70%)
Paper Reading & Discussion
• Paper Reading Course
– Reading assignments for each lecture
– Lectures more like discussion
• Student Presentations
– Presentation schedule signup soon
– Present in groups
– Can choose which paper(s) to present
Mini-quizzes
• Evening after every lecture
– Very short
– Easy if you read the material & attended lecture
• Released via Piazza
– Also use Piazza for Q&A
Final Project
• Can be on any topic related to the course
• Work in groups
• Will release timeline of progress reports soon
• Peer review (?)
Topics
• Online Learning
• Multi-armed Bandits
• Active Learning
• Crowdsourcing
• Reinforcement Learning
• Models of Human Decision Making
Focus of Course
• Rigorous algorithm design
– Math intensive, but nothing too hard
– Will walk through relevant math in class
• Apply to interesting applications
– What are the right ways to model a problem?
What Does Rigorous Mean?
• Formal model
– Explicitly state your assumptions
• Rigorously reason about how your algorithm solves the model
– Sometimes with provable guarantees
• Argue that your model is a reasonable one
What Makes a Good Final Project?
• Pure Theory
– Study proof techniques, try to extend proofs, or apply to new settings
• Algorithms
– Extend algorithms, design new ones, for new settings
• Modeling
– Model new settings; what are the right assumptions?
Outline
• First 3–5 lectures
– Review basic algorithms
– Somewhat dry, but necessary
• Topics/readings chosen by students
– With curating from instructor & TAs
– List of papers already on website
• But is negotiable
Rest of Today
• Introduction to Online Learning
– Follow the Leader
– Perceptron
• Brief Overview of Other Topics in Course
Introduction to Online Learning
(Most Basic) Online Learning
• For t = 1 … T (sometimes T is unknown)
– Algorithm chooses $p_t$
– World reveals loss function $L_t$
– Algorithm suffers loss $L_t(p_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(p_t)$
• What are the semantics of $p_t$? What is the loss? How is the loss chosen?
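The protocol above can be sketched directly as a loop. A minimal Python sketch; `algorithm_choose` and `world_reveal_loss` are hypothetical stand-ins for the algorithm and the world, not from any library:

```python
# Minimal sketch of the online learning protocol; the two callables
# are hypothetical stand-ins for the algorithm and the world.
def run_online_learning(algorithm_choose, world_reveal_loss, T):
    """Run T rounds and return the total loss suffered by the algorithm."""
    total_loss = 0.0
    history = []                              # everything observed so far
    for t in range(1, T + 1):
        p_t = algorithm_choose(history)       # algorithm chooses p_t
        L_t = world_reveal_loss(t)            # world reveals loss function L_t
        total_loss += L_t(p_t)                # algorithm suffers L_t(p_t)
        history.append((p_t, L_t))
    return total_loss

# Toy usage: always predict 0 under squared loss against a fixed target 3.
loss = run_online_learning(
    algorithm_choose=lambda history: 0.0,
    world_reveal_loss=lambda t: (lambda p: (3.0 - p) ** 2),
    T=4,
)
# loss = 4 * 9.0 = 36.0
```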
Recall: Supervised Learning
• Training set $S = \{(x_i, y_i)\}_{i=1}^N$; objective: $\arg\min_w \sum_{i=1}^N L\left(y_i, f(x_i|w)\right)$
• Optimize via Stochastic Gradient Descent
– Maintain a $w_t$
– Each iteration receive: $L_t(w_t) = L\left(y_i, f(x_i|w_t)\right)$
– Assume $(x_i, y_i)$ sampled randomly from S
– Choose $w_{t+1}$ based on $w_t$ and $L_t$
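As an illustration of the loop above, here is a minimal SGD sketch assuming (my choice, not stated on the slide) a linear model $f(x|w) = \langle w, x\rangle$ with squared loss:

```python
import random

# Illustrative SGD loop, assuming a linear model f(x|w) = <w, x> and
# squared loss L(y, f) = (y - f)^2; these choices are for illustration.
def sgd(S, eta, iters, seed=0):
    """Return w after `iters` SGD steps on training set S."""
    rng = random.Random(seed)
    w = [0.0] * len(S[0][0])
    for _ in range(iters):
        x, y = rng.choice(S)                         # sample (x_i, y_i) from S
        pred = sum(wj * xj for wj, xj in zip(w, x))
        grad = [-2.0 * (y - pred) * xj for xj in x]  # grad of (y - <w, x>)^2
        w = [wj - eta * gj for wj, gj in zip(w, grad)]  # w_{t+1} from w_t, L_t
    return w

# Usage: recover y = 2 * x from noiseless 1-D data.
S = [([x], 2.0 * x) for x in (1.0, 2.0, 3.0)]
w = sgd(S, eta=0.05, iters=500)
# w[0] converges to approximately 2.0
```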
(Most Basic) Online Learning
• For t = 1 … T (sometimes T is unknown)
– Algorithm chooses $p_t$ ($p_t = w_t$)
– World reveals loss function $L_t$ ($L_t(w_t) = L\left(y_t, f(x_t|w_t)\right)$, with $L_t$ chosen randomly)
– Algorithm suffers loss $L_t(p_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(p_t)$
What if…
• We receive a constant stream of data?
– Don't know T a priori
• We receive data in some arbitrary way?
– Not sampled independently from some distribution
• Can we still (provably) achieve good performance?
Quantifying Performance
• In supervised learning, we care about a single w:
$\sum_{i=1}^N L\left(y_i, f(x_i|w)\right) = \sum_{i=1}^N L_i(w)$
• In online learning, we care about a sequence of $w_t$:
$\sum_{t=1}^T L\left(y_t, f(x_t|w_t)\right) = \sum_{t=1}^T L_t(w_t)$
Quantifying Performance
• Compete against the single best w in hindsight:
$\sum_{t=1}^T L_t(w^*) = \min_w \sum_{t=1}^T L_t(w)$
• "Regret": $R(T) = \sum_{t=1}^T L_t(w_t) - \sum_{t=1}^T L_t(w^*)$
• Interpretation: best possible loss w.r.t. supervised learning
Interpreting Regret
• Expected training error is: $\frac{1}{T}\sum_{t=1}^T L_t(w_t)$
• Want expected training error to (quickly) converge to optimal
– Equivalent to average regret (quickly) converging to 0:
$\frac{1}{T}R(T) = \frac{1}{T}\left[\sum_{t=1}^T L_t(w_t) - \sum_{t=1}^T L_t(w^*)\right] \rightarrow 0$
• Satisfied when regret grows sublinearly w.r.t. T!
Summary of Regret
• Generic way to quantify performance
– Characterizes speed of convergence for SGD
• Applies to many online learning settings
• We'll see other ways to quantify performance later in the course
Follow the Leader
Basic Online Convex Optimization
• For t = 1 … T (T unknown)
– Algorithm chooses $p_t \in \mathbb{R}^D$
– World reveals loss function $L_t(p_t) = \|y_t - p_t\|^2$ (squared distance to $y_t$; in general, any convex loss)
– Algorithm suffers loss $L_t(p_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(p_t)$
Follow the Leader Algorithm
• The "leader" is the best point given what we know so far:
$p_t = \arg\min_p \sum_{t'=1}^{t-1} L_{t'}(p) = \arg\min_p \sum_{t'=1}^{t-1} \|y_{t'} - p\|^2 = \frac{1}{t-1}\sum_{t'=1}^{t-1} y_{t'}$
• This is the entire algorithm!
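Since the leader under squared loss is just the running mean of the points seen so far, the algorithm can maintain it incrementally instead of re-solving the argmin each round. A minimal Python sketch (the class name is my own):

```python
# Follow the Leader for squared loss L_t(p) = ||y_t - p||^2 (sketch).
# The leader is the mean of all points seen so far, so it can be
# maintained incrementally instead of re-solving the argmin each round.
class FollowTheLeader:
    def __init__(self, dim):
        self.mean = [0.0] * dim    # running mean of y_1, ..., y_{t-1}
        self.count = 0

    def predict(self):
        """Play p_t = argmin_p sum_{t' < t} ||y_{t'} - p||^2, i.e. the mean."""
        return list(self.mean)

    def observe(self, y):
        """Fold the revealed y_t into the running mean."""
        self.count += 1
        self.mean = [m + (yi - m) / self.count
                     for m, yi in zip(self.mean, y)]

ftl = FollowTheLeader(dim=1)
for y in ([1.0], [3.0], [5.0]):
    p = ftl.predict()              # prediction made before seeing y
    ftl.observe(y)
# After seeing 1, 3, 5 the leader is their mean, [3.0]
```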
Benefits and Drawbacks
• Benefits:
– Efficient regret bounds (will see in the next few slides)
– Conceptually very simple
– Can be applied to many settings
• Drawbacks:
– Can be computationally very expensive for arbitrary loss functions (can't use a running average all the time)
Definitions
• Best hindsight choice over the first t time steps:
$p_t^* = \arg\min_p \sum_{t'=1}^{t} L_{t'}(p) = \arg\min_p \sum_{t'=1}^{t} \|y_{t'} - p\|^2 = \frac{1}{t}\sum_{t'=1}^{t} y_{t'}$
• Follow the Leader plays $p_t = p_{t-1}^*$:
$p_t = \arg\min_p \sum_{t'=1}^{t-1} L_{t'}(p) = \arg\min_p \sum_{t'=1}^{t-1} \|y_{t'} - p\|^2 = \frac{1}{t-1}\sum_{t'=1}^{t-1} y_{t'}$
Goal
• Minimize Regret:
$R(T) = \sum_{t=1}^T L_t(p_t) - \sum_{t=1}^T L_t(p_T^*)$
where $p_T^* = \arg\min_p \sum_{t=1}^{T} L_t(p) = \arg\min_p \sum_{t=1}^{T} \|y_t - p\|^2 = \frac{1}{T}\sum_{t=1}^{T} y_t$
Lemma 1
$\sum_{t=1}^T L_t(p_t^*) \leq \sum_{t=1}^T L_t(p_T^*)$
• Interpretation: the moving hindsight best is at least as good as the final hindsight best
• Proof by Induction
– Base case (T = 1): $L_1(p_1^*) = L_1(p_1^*)$
Proof Continued
• Inductive Case (T > 1):
– Remove the last term from each side (both equal $L_T(p_T^*)$), so it suffices to show:
$\sum_{t=1}^{T-1} L_t(p_t^*) \leq \sum_{t=1}^{T-1} L_t(p_T^*)$
– Observe:
$\sum_{t=1}^{T-1} L_t(p_t^*) \leq \sum_{t=1}^{T-1} L_t(p_{T-1}^*) \leq \sum_{t=1}^{T-1} L_t(p_T^*)$
(first inequality: inductive hypothesis; second: definition of $p_{T-1}^*$ as the minimizer over the first T−1 losses)
Regret Bound
$R(T) = \sum_{t=1}^T L_t(p_t) - \sum_{t=1}^T L_t(p_T^*)$
$= \sum_{t=1}^T L_t(p_{t-1}^*) - \sum_{t=1}^T L_t(p_T^*)$ (definition of Follow the Leader)
$\leq \sum_{t=1}^T L_t(p_{t-1}^*) - \sum_{t=1}^T L_t(p_t^*)$ (Lemma 1)
Regret Bound (continued)
$\sum_{t=1}^T L_t(p_{t-1}^*) - \sum_{t=1}^T L_t(p_t^*) = \sum_{t=1}^T \|p_{t-1}^* - y_t\|^2 - \sum_{t=1}^T \|p_t^* - y_t\|^2$
$= \sum_{t=1}^T \left\langle p_{t-1}^* - p_t^*,\; p_{t-1}^* + p_t^* - 2y_t \right\rangle$
$\leq \sum_{t=1}^T \|p_{t-1}^* - p_t^*\| \cdot \|p_{t-1}^* + p_t^* - 2y_t\|$ (Cauchy–Schwarz)
$\leq \sum_{t=1}^T \|p_{t-1}^* - p_t^*\| \cdot \left(\|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\|\right)$ (Triangle Inequality)
Regret Bound (continued)
Assume each $y_t$ has norm bounded by B. Note that each $p^*$ (an average of the $y_t$'s) also has norm bounded by B. Then:
$\sum_{t=1}^T \|p_{t-1}^* - p_t^*\| \cdot \left(\|p_{t-1}^*\| + \|p_t^*\| + 2\|y_t\|\right) \leq 4B \sum_{t=1}^T \|p_{t-1}^* - p_t^*\|$
Regret Bound (continued)
Use the fact that $p_t^* = \frac{(t-1)\,p_{t-1}^* + y_t}{t}$:
$\|p_{t-1}^* - p_t^*\| = \left\|p_{t-1}^* - \frac{(t-1)\,p_{t-1}^* + y_t}{t}\right\| = \frac{1}{t}\|p_{t-1}^* - y_t\| \leq \frac{1}{t}\left(\|p_{t-1}^*\| + \|y_t\|\right) \leq \frac{2B}{t}$
(Triangle inequality; each term has norm at most B.)
Regret Bound (complete)
$R(T) = \sum_{t=1}^T L_t(p_t) - \sum_{t=1}^T L_t(p_T^*)$
$\leq \sum_{t=1}^T L_t(p_{t-1}^*) - \sum_{t=1}^T L_t(p_t^*)$
$\leq 4B \sum_{t=1}^T \|p_{t-1}^* - p_t^*\|$
$\leq 8B^2 \sum_{t=1}^T \frac{1}{t} = O\left(B^2 \ln T\right)$ (Logarithmic Regret!)
Independent of how each $y_t$ is chosen!
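A quick numerical sanity check (my own addition, not from the slides): run FTL with squared loss on a bounded random stream and verify that its regret stays below the $8B^2\sum_{t=1}^T 1/t$ bound:

```python
import random

# Numerical sanity check (not from the slides): FTL with squared loss on
# a bounded stream should have regret below 8 * B^2 * sum_{t=1}^T 1/t.
def ftl_regret(ys):
    """Total FTL loss minus the loss of the best fixed point in hindsight."""
    T, mean, losses = len(ys), 0.0, []
    for t, y in enumerate(ys, start=1):
        p = mean                           # p_t = mean of y_1..y_{t-1} (p_1 = 0)
        losses.append((y - p) ** 2)
        mean += (y - mean) / t             # update the leader
    best = sum(ys) / T                     # p_T^* = mean of the whole stream
    return sum(losses) - sum((y - best) ** 2 for y in ys)

rng = random.Random(0)
B = 1.0
ys = [rng.uniform(-B, B) for _ in range(1000)]
regret = ftl_regret(ys)
bound = 8 * B ** 2 * sum(1.0 / t for t in range(1, 1001))
# regret is nonnegative and below the logarithmic bound (about 59.9 here)
```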
Recall: Interpreting Regret
• Expected training error is: $\frac{1}{T}\sum_{t=1}^T L_t(w_t)$
• Want expected training error to (quickly) converge to optimal
– Equivalent to average regret (quickly) converging to 0:
$\frac{1}{T}R(T) = \frac{1}{T}\left[\sum_{t=1}^T L_t(w_t) - \sum_{t=1}^T L_t(w^*)\right] \rightarrow 0$
• Satisfied when regret grows sublinearly w.r.t. T!
When Should You Use FTL in Practice?
• When solving each optimization problem is not the bottleneck
– For the simple squared distance, it is trivial
– For more complex loss functions, it might require expensive optimization
• We will see an analysis of SGD-style algorithms next Tuesday
– Make small updates to $p_t$ using only $L_t$
Perceptron
Binary Classification Online Learning
• For t = 1 … T (sometimes T is unknown)
– Algorithm chooses $w_t \in \mathbb{R}^D$
– World reveals loss function: $L_t(w_t) = \mathbf{1}\left[y_t \neq \mathrm{sign}\left(\langle w_t, x_t\rangle\right)\right]$ (0/1 Loss)
– Algorithm suffers loss $L_t(w_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(w_t)$
Perceptron Learning Algorithm
• $x \in \mathbb{R}^D$, $y \in \{-1, +1\}$
• If $L_t(w_t) = 1$: $w_{t+1} = w_t + y_t x_t$
• Else: $w_{t+1} = w_t$
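The update rule above as a minimal Python sketch; the function name and the toy stream are illustrative, not from the slides:

```python
# Minimal sketch of the Perceptron update rule (names are my own).
def perceptron_step(w, x, y):
    """One round: predict sign(<w, x>), update weights only on a mistake."""
    score = sum(wi * xi for wi, xi in zip(w, x))
    pred = 1 if score > 0 else -1            # sign(<w_t, x_t>), ties -> -1
    if pred != y:                            # L_t(w_t) = 1: mistake
        return [wi + y * xi for wi, xi in zip(w, x)], 1
    return w, 0                              # L_t(w_t) = 0: no change

# Toy separable stream: y is the sign of the first coordinate.
stream = [([2.0, 1.0], 1), ([-1.0, 0.5], -1),
          ([1.5, -1.0], 1), ([-2.0, -1.0], -1)]
w, mistakes = [0.0, 0.0], 0
for x, y in stream * 5:                      # several passes over the stream
    w, m = perceptron_step(w, x, y)
    mistakes += m
# Only the very first example is misclassified; w converges to [2.0, 1.0]
```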
Perceptron Learning (Assume Linearly Separable)
[Animation over several slides: examples arrive one at a time; each misclassified point triggers an update, correct points leave w unchanged, until all training examples are correctly classified.]
Regret Bound = Mistake Bound (for Separable Case)
• For the separable case: $\sum_{t=1}^T L_t(w^*) = 0$
• So $R(T) = \sum_{t=1}^T L_t(w_t) - \sum_{t=1}^T L_t(w^*) =$ # mistakes the Perceptron makes
Lemma 2
$\left\|\sum_{t\in I} y_t x_t\right\|^2 \leq \sum_{t\in I} \|x_t\|^2$, where I is the set of mistake iterations.
Proof (assuming $w_1 = 0$; w only changes on mistakes):
$\sum_{t\in I} y_t x_t = \sum_{t\in I} (w_{t+1} - w_t) = w_{T+1}$ (telescoping sum)
$\|w_{T+1}\|^2 = \sum_{t\in I} \left(\|w_{t+1}\|^2 - \|w_t\|^2\right)$ (telescoping sum)
$= \sum_{t\in I} \left(\|w_t + y_t x_t\|^2 - \|w_t\|^2\right)$ (update definition)
$= \sum_{t\in I} \left(2 y_t \langle w_t, x_t\rangle + \|x_t\|^2\right)$
$\leq \sum_{t\in I} \|x_t\|^2$ (on a mistake, $y_t \langle w_t, x_t\rangle \leq 0$)
Perceptron Mistake Bound
• # mistakes bounded by: $\frac{B^2}{\gamma^2}$ (** if linearly separable)
– $\gamma$: margin
– $B = \max_x \|x\|$: "radius" of feature space
• Holds for any ordering of training examples!
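A small empirical check of the bound (my own construction, not from the slides): run the Perceptron to convergence on a separable set and compare its mistake count against $B^2/\gamma^2$, computing the margin from a known separator $w^*$ (this lower-bounds the best achievable margin, so the resulting bound is only looser):

```python
import math

# Empirical check (my own construction) of the B^2 / gamma^2 mistake
# bound on a small linearly separable dataset.
def count_perceptron_mistakes(data):
    """Cycle over the data until no mistakes remain; return total mistakes."""
    w, mistakes, changed = [0.0] * len(data[0][0]), 0, True
    while changed:
        changed = False
        for x, y in data:
            score = sum(wi * xi for wi, xi in zip(w, x))
            if (1 if score > 0 else -1) != y:        # 0/1 loss is 1: mistake
                w = [wi + y * xi for wi, xi in zip(w, x)]
                mistakes += 1
                changed = True
    return mistakes

data = [([1.0, 2.0], 1), ([2.0, 0.5], 1), ([-1.0, -1.0], -1), ([-0.5, -2.0], -1)]

# Margin from a known separator w* = (1, 1); this lower-bounds the true
# (max-margin) gamma, so B^2 / gamma^2 computed here is only looser.
w_star = [1.0, 1.0]
norm_w = math.sqrt(sum(wi ** 2 for wi in w_star))
gamma = min(y * sum(wi * xi for wi, xi in zip(w_star, x)) / norm_w
            for x, y in data)
B = max(math.sqrt(sum(xi ** 2 for xi in x)) for x, _ in data)

mistakes = count_perceptron_mistakes(data)
# Here B^2 / gamma^2 = 5 / 2 = 2.5, and the Perceptron makes 1 mistake
```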
Proof
• Margin: $\gamma = \max_w \min_{(x_t, y_t)} \left\{ \frac{y_t \langle w, x_t\rangle}{\|w\|} \right\}$
– Must be positive due to linear separability; let $w^*$ achieve the max
$|I|\,\gamma \leq \frac{\left\langle w^*, \sum_{t\in I} y_t x_t\right\rangle}{\|w^*\|} \leq \left\|\sum_{t\in I} y_t x_t\right\| \leq \sqrt{\sum_{t\in I} \|x_t\|^2} \leq \sqrt{|I|\, B^2}$
(Cauchy–Schwarz, then Lemma 2)
$|I|\,\gamma \leq \sqrt{|I|}\, B \;\Rightarrow\; |I| \leq \frac{B^2}{\gamma^2}$
Interpretation
• If the data is linearly separable
• Then ANY ordering of (x, y) will cause the Perceptron to converge with finitely many mistakes
• No dependence on IID sampling from a true distribution
Brief Overview of Other Topics
Contextual Online Learning (aka Online Learning with Experts)
• Given: set of experts $\{f_k\}$
• For t = 1 … T (sometimes T is unknown)
– Each expert predicts $f_{k,t}$
– Algorithm chooses $p_t$
– World reveals loss function $L_t$
– Algorithm suffers loss $L_t(p_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(p_t)$
• Generalizes Boosting
Partial Information Online Learning
• For t = 1 … T (sometimes T is unknown)
– Algorithm chooses $p_t$
– World reveals only the loss $L_t(p_t)$ (we don't know the loss of other choices)
– Algorithm suffers loss $L_t(p_t)$
• Goal: minimize total loss $\sum_{t=1}^T L_t(p_t)$
• Need to "explore" to measure the loss of alternatives
Basic Active Learning (for supervised learning)
• For t = 1 …
– Algorithm chooses x
– World reveals associated label y
– Add (x, y) to training set
• Terminate when sufficiently confident of best model
Simple Example
• 1 feature
• Learn threshold function
• Passive learning: sample from the distribution [figure: true model vs. learned model]

Simple Example
• 1 feature
• Learn threshold function
• Active learning: binary search [figure: true model]
Comparison with Passive Learning
• # samples to be within ε of the true model
• Passive Learning: $O\left(\frac{1}{\varepsilon}\right)$
• Active Learning: $O\left(\log \frac{1}{\varepsilon}\right)$
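The binary-search active learner for the 1-feature threshold example can be sketched as follows; the function name and label oracle are illustrative:

```python
# Sketch of the active learner for the 1-feature threshold example.
# Binary search locates the threshold to within eps using O(log 1/eps)
# label queries, versus the O(1/eps) samples needed by passive learning.
def binary_search_threshold(label, eps):
    """Return (estimate, #queries) for a threshold somewhere in [0, 1]."""
    lo, hi, queries = 0.0, 1.0, 0
    while hi - lo > eps:
        mid = (lo + hi) / 2.0
        queries += 1
        if label(mid) == 1:        # mid is at or above the true threshold
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0, queries

true_threshold = 0.3
oracle = lambda x: 1 if x >= true_threshold else 0
estimate, queries = binary_search_threshold(oracle, eps=0.001)
# 10 queries (since 2**-10 < 0.001) instead of ~1000 random samples
```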
Crowdsourcing
[Figure from Y. LeCun & M.A. Ranzato: object recognition (Krizhevsky, Sutskever, Hinton 2012). Unlabeled data ("0", "1", "2", "Sports", "News", "Science", …) is cheap and abundant; labeled data ("Mushroom", "Crystal", "Needle", "Empty", …) requires human experts, special equipment, or experiments, and is expensive and scarce.]
[Figure: labeling loop; the labeled set is initially empty, and examples are repeatedly moved from the unlabeled pool to the labeled set.]
How Reliable are Annotators?
• If we knew what the labels were
– Can judge workers on label quality
• If we knew who the good workers were
– Can create labels from their annotations
• Chicken and egg problem!
Reinforcement Learning
• In previous settings:
– Actions do not impact state
– "Stateless"
• Reinforcement Learning
– Actions affect the state you're in
– Reward function depends on state
– Example: playing Go
Off-Policy Evaluation
• Example: We have hospital logs of pneumonia deaths under various conditions.
– Want to train a model to predict who is most at risk
– Model predicts that asthma patients have LOWER risk of pneumonia death…
– Because doctors pay closer attention to asthma patients!
Modeling Human Decision Making
• How do humans react in sequential decision-making processes?
– Do they behave like Follow the Leader?
– Do they behave like a Perceptron?