Learning with the human in the loop
Michele Sebag
Riad Akrour, Marc Schoenauer
TAO
Constructive Machine Learning Workshop, ICML 2015
Evolution of Computer Science

1970s  Specifications: languages & theorem proving
1990s  Programming by Examples: pattern recognition & ML
2010s  Interactive Learning and Optimization

Motivations
- no explicit specification
- open world: P(x) changes
- under-specified goal
Summary

- Machine Learning needs logic, data, optimization...
- Machine Learning needs feedback: the human in the loop.
- Co-evolution of the human in the loop and the learner.
If the computer could read the user's mind

Shannon's Mind Reading Machine
http://cs.williams.edu/~bailey/applets/MindReader/index.html

The 20Q game: 2^20 ≈ 10^6 > #words ≈ 10^5
"...prepare to be eerily amused." (Lonnie Brown, "The Ledger", Florida)
87,135,942 games played online

[Screenshot, 20q.net: "Think of something and 20Q will read your mind by asking a few simple questions. The object you think of should be something that most people would know about, but not a proper noun or a specific person, place, or thing."
Q1. Is it classified as Animal, Vegetable or Mineral? (Animal, Vegetable, Mineral, Concept, Unknown)
Suggestions: some things 20Q has chosen at random... jacks (child's game), talcum powder, anmitsu (bean paste with honey), a hot tub, an apricot.]

20Q.net Inc., http://www.20q.net/
Overview
Interactive Learning and Optimization in Search
Reinforcement Learning
Programming by Feedback
Interactive learning and optimization

Optimizing the coffee taste (Herdy et al., 96)
Black-box optimization: F : Ω → IR, find argmax F.
The user in the loop replaces F.

Optimizing visual rendering (Brochu et al., 07)
Optimal recommendation sets (Viappiani & Boutilier, 10)
Information retrieval (Shivaswamy & Joachims, 12)
Interactive optimization

Features
- Search space X ⊂ IR^d (recipe x: 33% arabica, 25% robusta, etc.)
- Hardly any available features; unknown objective
- Expert emits preferences: x ≺ x′

Iterative scheme
1. At step t, the algorithm generates candidates x_t^(1), x_t^(2)
2. Expert emits a preference x_t^(1) ≻ x_t^(2)
3. t → t + 1

Issues
- Asking as few questions as possible (≠ active ranking)
- Modelling the expert's preferences: a surrogate optimization objective
- Enforcing the exploration vs. exploitation trade-off
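For concreteness, a minimal runnable sketch of this scheme (the simulated expert, the logistic surrogate fit, and the perturbation-based exploration are all illustrative assumptions, not the cited algorithms):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # dimension of the search space X
w_true = rng.normal(size=d)              # hidden expert utility (simulation only)

def expert_prefers(x1, x2):
    """Simulated noiseless expert: prefers the candidate with higher true utility."""
    return x1 if x1 @ w_true > x2 @ w_true else x2

def fit_utility(prefs, n_iter=200, lr=0.1):
    """Crude logistic fit of a linear surrogate utility from preference pairs."""
    w = np.zeros(d)
    for _ in range(n_iter):
        for winner, loser in prefs:
            z = np.clip((winner - loser) @ w, -30, 30)
            w += lr * (1 - 1 / (1 + np.exp(-z))) * (winner - loser)
    return w

prefs = []
best = rng.random(d)                     # incumbent candidate
for t in range(30):
    w_hat = fit_utility(prefs) if prefs else rng.normal(size=d)
    # exploration (random perturbation) + exploitation (follow the surrogate)
    challenger = np.clip(best + 0.3 * rng.normal(size=d) + 0.1 * w_hat, 0, 1)
    winner = expert_prefers(best, challenger)        # one preference query
    loser = challenger if winner is best else best
    prefs.append((winner, loser))
    best = winner

print("true utility of final candidate:", best @ w_true)
```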
Optimal Bayesian Recommendation Sets
Viappiani & Boutilier, 2010

Notations
- Objects in a finite domain Y ⊂ {0, 1, . . .}^d
- Generalized additive independence model: U(y) = ⟨w, y⟩
- Belief P(w; θ)

Algorithm: for t = 1 . . . T do
  * Propose a set y_1 . . . y_k (selection criterion, see next)
  * Observe the preferred y
  * Update θ
Selection criterion

Expected utility of solution y:
  EU(y; θ) = ∫_W ⟨w, y⟩ dP(w; θ)

Maximum expected utility:
  EU*(θ) = max_y EU(y; θ)

Selection criterion: return the solution with maximum
- expected utility;
- expected posterior utility, given y* the best solution so far:
  EPU(y; θ) = Pr(y ≻ y*; θ) EU*(θ | y ≻ y*) + Pr(y ≺ y*; θ) EU*(θ | y ≺ y*)
- expected utility of selection:
  EUS(y; θ) = Pr(y ≻ y*; θ) EU(y; θ | y ≻ y*) + Pr(y ≺ y*; θ) EU(y*; θ | y ≺ y*)
Optimal Bayesian Recommendation Sets, 2

Comments
- Max expected utility: the greedy choice.
- Max expected posterior utility: greedy with 1-step look-ahead (maximizes the expected utility of the solution found after the user has expressed her preference). But computing EPU(y) requires solving two optimization problems.
- Max expected utility of selection: limited loss of performance compared to max EPU, and much less computationally expensive.
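These quantities are straightforward to estimate by Monte Carlo over the belief; below is a hedged sketch (the Gaussian belief over w and the random pool of binary objects are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 6, 4000
W = rng.normal(size=(n_samples, d))     # samples from the belief P(w; theta)
Y = rng.integers(0, 2, size=(20, d))    # a small pool of candidate objects

def EU(y, W):
    """Expected utility of object y under the belief samples."""
    return float((W @ y).mean())

def EUS(y, y_star, W):
    """Expected utility of selection for the pair (y, y*)."""
    u_y, u_s = W @ y, W @ y_star
    return float(np.maximum(u_y, u_s).mean())   # the user picks the preferred one

def eu_star(W_sub, Y):
    """EU* under a (conditioned) belief: re-optimize over the pool."""
    if len(W_sub) == 0:
        return 0.0
    return max(float((W_sub @ yy).mean()) for yy in Y)

def EPU(y, y_star, W, Y):
    """Expected posterior utility of the query (y vs y*): 1-step look-ahead."""
    pick_y = (W @ y) > (W @ y_star)     # posterior event "y preferred to y*"
    p = float(pick_y.mean())
    return p * eu_star(W[pick_y], Y) + (1 - p) * eu_star(W[~pick_y], Y)

y_star = max(Y, key=lambda y: EU(y, W))             # greedy: max expected utility
y_eus = max(Y, key=lambda y: EUS(y, y_star, W))     # best challenger under EUS
print(f"EU*={EU(y_star, W):.3f}  EUS={EUS(y_eus, y_star, W):.3f}  "
      f"EPU={EPU(y_eus, y_star, W, Y):.3f}")
```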
Co-active Learning
Shivaswamy & Joachims, 2012

Context
Refining a search engine: given query x, propose an ordered list y.

Notations
- User utility U(y|x)
- Search space of linear models: U(y|x) = ⟨w, φ(x, y)⟩

Algorithm: for t = 1 . . . T
  * Given x_t, propose y_t = argmax_y ⟨w_t, φ(x_t, y)⟩
  * Get feedback ȳ_t from the user (swapping items in y_t)
  * Update the utility model:
      w_{t+1} = w_t + φ(x_t, ȳ_t) − φ(x_t, y_t)

Differences w.r.t. the multi-class perceptron
- Feedback: ȳ_t is a rearrangement of y_t (not the true label)
- Criterion: regret (not misclassification),
      R = (1/T) Σ_{t=1..T} [ U(y*_t | x_t) − U(y_t | x_t) ]
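A hedged sketch of this update rule (the joint feature map φ and the one-swap simulated user are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_items, T = 4, 8, 200
w_true = rng.normal(size=d)              # hidden user utility (simulation only)
w = np.zeros(d)                          # learned linear model

def phi(items, ranking):
    """Toy joint feature map: position-discounted sum of item features."""
    return sum(items[j] / np.log2(r + 2) for r, j in enumerate(ranking))

regret = 0.0
for t in range(T):
    items = rng.normal(size=(n_items, d))        # the "query" x_t: a set of items
    y_t = np.argsort(-(items @ w))               # proposed ranking: maximizes <w, phi>
    fb = y_t.copy()                              # user feedback: one improving swap
    scores = items[fb] @ w_true
    for r in range(n_items - 1):
        if scores[r] < scores[r + 1]:
            fb[r], fb[r + 1] = fb[r + 1], fb[r]
            break
    w += phi(items, fb) - phi(items, y_t)        # perceptron-style update
    y_opt = np.argsort(-(items @ w_true))        # best ranking in hindsight
    regret += w_true @ (phi(items, y_opt) - phi(items, y_t))

print("average regret:", regret / T)
```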
Interactive Intent Modelling

The vocabulary issue in human-machine interaction (Furnas et al., 87)
- A single access term chosen by a single designer will provide very poor access:
- humans are likely to use different vocabularies to encode and decode their intended meaning.
Two translation tasks

...not equally difficult:
A. From mother tongue to foreign language: one has to know vocabulary and grammar.
B. From foreign language to mother tongue: disambiguation from context, by guessing, etc.

Search
- Writing a query: an A-task
- Assessing relevance: a B-task
Interactive Intent Modelling, 2

A human-in-the-loop approach (Ruotsalo et al., 15)
- Show candidate documents
- Ask the user's preferences
- Focus the query
Overview
Interactive Learning and Optimization in Search
Reinforcement Learning
Programming by Feedback
Reinforcement Learning

Generalities
- An agent, spatially and temporally situated
- Stochastic and uncertain environment
- Goal: select an action at each time step...
- ...in order to maximize the expected cumulative reward over a time horizon

What is learned?
A policy = strategy: state ↦ action
Reinforcement Learning, formal background

Notations
- State space S
- Action space A
- Transition model p(s, a, s′) ∈ [0, 1]
- Reward r(s)
- Discount factor 0 < γ < 1

Goal: a policy π mapping states onto actions,
  π : S ↦ A,
maximizing the expected discounted cumulative reward
  E[π | s_0] = r(s_0) + Σ_t γ^{t+1} p(s_t, a = π(s_t), s_{t+1}) r(s_{t+1})
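A minimal sketch of this objective, estimated by Monte Carlo rollouts in a toy chain MDP (the MDP and the policy are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, gamma, horizon = 5, 0.95, 200
reward = np.zeros(n_states)
reward[-1] = 1.0                         # reward only in the final state

def step(s, a):
    """Toy stochastic transition: action 1 moves right w.p. 0.8, else stay."""
    if a == 1 and rng.random() < 0.8:
        return min(s + 1, n_states - 1)
    return s

def expected_return(policy, n_rollouts=500):
    """Monte Carlo estimate of E[ sum_t gamma^t r(s_t) | s_0 = 0 ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, g = 0, 1.0
        for _ in range(horizon):
            total += g * reward[s]
            s = step(s, policy[s])
            g *= gamma
    return total / n_rollouts

go_right = np.ones(n_states, dtype=int)
print("E[return | always right] ≈", expected_return(go_right))
```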
Reinforcement learning

Tasks (model-based RL)
- Learn the value function (see the sketch after this slide)
- Learn the transition model
- Explore

Algorithmic & learning issues
- Representation of the state/action space
- Approximation of the value function
- Scaling w.r.t. the state-action space dimension
- Exploration / exploitation

Expert's duty: design the reward function such that
- the optimum corresponds to the desired behavior,
- (approximate) optimization is tractable.
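For the value-function task, a standard value-iteration sketch (the toy MDP is made up for illustration):

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """P[a, s, s'] = transition probability, r[s] = reward; returns V, policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = r[None, :] + gamma * (P @ V)   # Q[a, s] = r(s) + gamma sum_s' P V
        V_new = Q.max(axis=0)              # Bellman optimality backup
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# toy 3-state, 2-action MDP
P = np.array([[[1, 0, 0], [1, 0, 0], [0, 0, 1]],          # action 0: drift left
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]], float)  # action 1: move right
r = np.array([0.0, 0.0, 1.0])
V, pi = value_iteration(P, r)
print("V =", V, " policy =", pi)
```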
Designing the reward function

Sparse
- Only rewarding the treasure: a needle-in-the-haystack optimization problem

Informed
- Significant expertise (in the problem domain, in RL) required
Using expert demonstrations

Using the demonstrations to train a classifier s ↦ π(s) ... yields brittle policies.

Inverse Reinforcement Learning (Ng & Russell, 00; Abbeel & Ng, 04)
Infer the reward function explaining the expert's behavior.
Sidestepping numerical rewards

Medical prescription (Furnkranz et al., 2012)
Avoid quantifying the cost of a fatal event: compare the effects of actions,
  (s, a, π) ≺ (s, a′, π)

Co-Active Learning (Shivaswamy & Joachims, 15)
The user responds by (slightly) improving the machine's output.
Relaxing Expertise Requirements in RL

Expert (required expertise decreases down the list)
- Associates a reward to each state             RL
- Demonstrates a (nearly) optimal behavior      Inverse RL
- Compares and revises agent demonstrations     Co-Active Learning
- Compares demonstrations                       Preference RL, PF

Agent (autonomy increases down the list)
- Computes the optimal policy based on rewards  RL
- Imitates the expert's demonstration verbatim  IRL
- Imitates and modifies                         IRL
- Learns the expert's utility                   IRL, CAL
- Learns, and selects demonstrations            CAL, PRL, PF
- Accounts for the expert's mistakes            PF
Motivating application: Swarm Robotics

Swarm-bot (2001-2005); Swarm Foraging, UWE
Symbrion IP, 2008-2013; http://symbrion.org/

Inverse RL is not applicable: the target individual behavior is unknown.
Programming by feedback
Akrour et al., 14

Loop
1. The computer presents the expert with a pair of behaviors y_1, y_2
2. The expert emits a preference y_1 ≻ y_2
3. The computer learns the expert's utility function ⟨w, y⟩
4. The computer searches for behaviors with best utility

Key issues
- Asking few preference queries (not active preference learning: sequential model-based optimization)
- Accounting for human noise
Human noise

Human beings often
- are irrational
- are inconsistent
- make errors
- adapt themselves
- are kind...

Preferences often
- do not pre-exist
- are constructed on the fly

D. Kahneman, Thinking, Fast and Slow, 2011
Formal setting

X  search space / solution space      (controllers, IR^D)
Y  evaluation space / behavior space  (trajectories, IR^d)
Φ : X ↦ Y

Utility function (on the behavior space)
  U* : Y ↦ IR,   U*(y) = ⟨w*, y⟩

Requisites
- Evaluation space: simple, to learn from few queries
- Search space: sufficiently expressive
Programming by Feedback

Ingredients
- Learning the expert's utility, to avoid asking too many preference queries
- Modelling the expert's competence, to accommodate expert inconsistencies
- Selecting the next best behaviors to be demonstrated:
  - which optimization criterion,
  - how to optimize it
(algorithmic details at the end)
Modelling the expert's competence: noise model

Given two solutions y and y′, for w* the true utility, define the preference margin
  z = ⟨w*, y − y′⟩

The probability of error, as a function of z, is
- 0 if the absolute margin is above a threshold δ,
- piecewise linear for −δ < z < δ (reaching 1/2 at z = 0),

[Plot: probability of error vs. the preference margin z, peaking at 1/2 for z = 0 and vanishing beyond ±δ.]

where δ is uniform in [0, M], and M quantifies the expert's inconsistency / incompetence: the lower M, the more consistent the expert.
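A hedged simulation of this noise model, checking the response probability it induces against the closed form used for the likelihood later on (the Monte Carlo check itself is an illustration, not part of the original algorithm):

```python
import numpy as np

rng = np.random.default_rng(6)
M = 1.0                                    # expert inconsistency parameter

def expert_says_first(z, delta, rng):
    """Noisy comparison: margin z = <w*, y - y'>, threshold delta."""
    if abs(z) >= delta:                    # large margin: no error
        return z > 0
    p_err = 0.5 * (1 - abs(z) / delta)     # piecewise-linear error probability
    correct = rng.random() >= p_err
    return (z > 0) == correct

def p_first_closed_form(z, M):
    """P(expert prefers the first solution), delta ~ U[0, M], |z| capped to M."""
    z = np.clip(z, -M, M)
    return 0.5 + (z / (2 * M)) * (1 + np.log(M / max(abs(z), 1e-12)))

z = 0.3
emp = np.mean([expert_says_first(z, rng.uniform(0, M), rng)
               for _ in range(200_000)])
print(f"empirical {emp:.4f}  vs  closed form {p_first_closed_form(z, M):.4f}")
```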
Experimental validation

- Sensitivity to expert competence: simulated expert, grid world
- Other benchmarks (details at the end):
  - continuous case, no generative model: the cartpole
  - continuous case, with generative model: the bicycle
  - training in situ: the Nao robot
The learner and the (simulated) human in the loop

Grid world: discrete case, no generative model.
25 states, 5 actions, horizon 300; 50% of the transitions leave the agent motionless.

[Figure: the true w*, with state values 1, 1/2, 1/4, ..., 1/64, 1/128, 1/256 decreasing away from the goal.]

Sensitivity study
M_E        expert inconsistency
M_A ≥ M_E  computer estimate of the expert's inconsistency
[Plots: true utility of x_t and expert error rate vs. number of queries, for M_E ∈ {.25, .5, 1} and estimates M_A ≥ M_E.]
The learner and the (simulated) human in the loop, 2

Findings
- The learner's estimate M_A of the expert's inconsistency M_E does influence the number of mistakes made by the expert.
- No psychological effects though: this is a simulated expert.
- In the short run, a learner trusting a (mildly) incompetent expert does better than a learner distrusting a (more) competent expert.

Interpretation
- The higher M_A, the smoother the learned preference model, and the more often the learner presents the expert with pairs of solutions with a low margin;
- the lower the margin, the higher the mistake probability:
- a cumulative (dis)advantage phenomenon.

For low M_A, the computer learns faster and submits more relevant demonstrations to the expert, thus priming a virtuous educational process.
Partial conclusion

Feasibility of Programming by Feedback for simple tasks.

An old research agenda
  "One could carry through the organization of an intelligent machine with only two interfering inputs, one for pleasure or reward, and the other for pain or punishment." (A. Turing, 1948)

CS + learning from the human in the loop
- No need to debug if you can just say "No!" and the computer reacts (appropriately).
- I had a dream: a world where I don't need to read the manual.
Learning and Optimization with the Human in the Loop

[Figure contrasting the knowledge-constrained side and the computation- and memory-constrained side of the loop.]
Bibliography

B. Akgun, K. Subramanian, J. Shim, and A. Lockerd Thomaz. Learning tasks and skills together from a human teacher. In W. Burgard and D. Roth, editors, AAAI. AAAI Press, 2011.

R. Akrour, M. Schoenauer, M. Sebag, and J.-C. Souplet. Programming by feedback. In ICML, volume 32 of JMLR Proceedings, pages 1503–1511. JMLR.org, 2014.

E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS 20, pages 409–416, 2008.

C. Furtlehner, M. Sebag, and X. Zhang. Scaling analysis of affinity propagation. Physical Review E, 81:066102, 2010.

R. Garnett, Y. Krishnamurthy, X. Xiong, J. G. Schneider, and R. Mann. Bayesian optimal active search and surveying. In ICML. Omnipress, 2012.

S. Gulwani. Automating string processing in spreadsheets using input-output examples. In T. Ball and M. Sagiv, editors, POPL, pages 317–330. ACM, 2011.

A. Jain, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. In NIPS, 2013.

W. B. Knox, P. Stone, and C. Breazeal. Training a robot via human feedback: a case study. In Int. Conf. on Social Robotics, volume 8239 of LNCS, pages 460–470. Springer, 2013.
Bibliography, 2

P. Liang, M. I. Jordan, and D. Klein. Learning programs: a hierarchical Bayesian approach. In J. Furnkranz and T. Joachims, editors, ICML, pages 639–646. Omnipress, 2010.

S. H. Muggleton and D. Lin. Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. In F. Rossi, editor, Proc. 23rd IJCAI. IJCAI/AAAI, 2013.

P.-Y. Oudeyer, A. Baranes, and F. Kaplan. Intrinsically motivated exploration for developmental and active sensorimotor learning. In From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence, pages 107–146. Springer Verlag, 2010.

A. Pease and S. Colton. Computational creativity theory: inspirations behind the FACE and IDEA models. In Proc. Intl Conf. on Computational Creativity, pages 72–77, 2011.

P. Shivaswamy and T. Joachims. Online structured prediction via coactive learning. In ICML, 2012.

A. Tversky and D. Kahneman.

P. Viappiani and C. Boutilier. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In J. D. Lafferty et al., editors, NIPS 23, pages 2352–2360. Curran Associates, Inc., 2010.
Bibliography, 3

A. Wilson, A. Fern, and P. Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In P. L. Bartlett et al., editors, NIPS 25, pages 1142–1150, 2012.

Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In A. P. Danyluk et al., editors, Proc. 26th ICML, volume 382 of ACM Intl Conf. Proc. Series, pages 1169–1176. ACM, 2009.

X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag. Data stream clustering with affinity propagation. IEEE Trans. Knowl. Data Eng., 26(7):1644–1656, 2014.
Programming by feedback
Akrour et al., 14

Algorithm
1. Learning the expert's utility function given the preference archive
2. Finding the best pair of demonstrations (y, y′) (expected posterior utility under the noise model)
3. Achieving optimization in demonstration space (e.g. trajectory space)
4. Achieving optimization in solution space (e.g. a neural net)
Learning the expert's utility function

Data: U_t = {y_0, y_1, . . . ; (y_{i,1} ≻ y_{i,2}), i = 1 . . . t}
- trajectories y_i
- preferences y_{i,1} ≻ y_{i,2}

Learning: find θ_t, the posterior on W (W = linear functions on Y).

Proposition. Given U_t,
  θ_t(w) ∝ Π_{i=1..t} P(y_{i,1} ≻ y_{i,2} | w)
         = Π_{i=1..t} ( 1/2 + (w_i / 2M) (1 + log(M / |w_i|)) )
with w_i = ⟨w, y_{i,1} − y_{i,2}⟩, capped to [−M, M].

Learned utility:
  U_t(y) = E_{w∼θ_t}[⟨w, y⟩]
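A hedged sketch of this posterior, represented by importance-weighted samples of w (the particle representation and the simulated noiseless expert are assumptions for illustration; the likelihood is the proposition's formula):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n_particles, M = 3, 5000, 0.5
W = rng.normal(size=(n_particles, d))        # particles representing theta_t
p = np.full(n_particles, 1.0 / n_particles)  # particle weights
w_true = np.array([1.0, -0.5, 0.2])          # hidden expert utility (simulation only)

def pref_likelihood(w_i, M):
    """P(y_i1 > y_i2 | w), margin w_i = <w, y_i1 - y_i2> capped to [-M, M]."""
    w_i = np.clip(w_i, -M, M)
    lik = 0.5 + (w_i / (2 * M)) * (1 + np.log(M / np.maximum(np.abs(w_i), 1e-12)))
    return np.clip(lik, 1e-9, 1.0)           # floored for numerical stability

for t in range(50):
    y1, y2 = rng.normal(size=d), rng.normal(size=d)   # a pair of behaviors
    if w_true @ (y1 - y2) < 0:
        y1, y2 = y2, y1                               # noiseless expert prefers y1
    p *= pref_likelihood(W @ (y1 - y2), M)            # Bayesian update of theta_t
    p /= p.sum()

U_t = lambda y: float(p @ (W @ y))           # U_t(y) = E_{w ~ theta_t}[<w, y>]
w_hat = p @ W                                # posterior mean of w
print("cos(w_hat, w*):",
      w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true)))
print("U_t on a random behavior:", U_t(rng.normal(size=d)))
```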
Best demonstration pair (y, y′)
after Viappiani & Boutilier, 10

EUS: expected utility of selection (greedy)
  EUS(y, y′) = E_θt[⟨w, y − y′⟩ > 0] · U_{w∼θt, y≻y′}(y)
             + E_θt[⟨w, y − y′⟩ < 0] · U_{w∼θt, y≺y′}(y′)

EPU: expected posterior utility (look-ahead)
  EPU(y, y′) = E_θt[⟨w, y − y′⟩ > 0] · max_{y″} U_{w∼θt, y≻y′}(y″)
             + E_θt[⟨w, y − y′⟩ < 0] · max_{y″} U_{w∼θt, y≺y′}(y″)
             = E_θt[⟨w, y − y′⟩ > 0] · U_{w∼θt, y≻y′}(y*)
             + E_θt[⟨w, y − y′⟩ < 0] · U_{w∼θt, y≺y′}(y′*)

Therefore
  argmax EPU(y, y′) ≤ argmax EUS(y, y′)
Optimization in demonstration space

NL: noiseless; N: noisy.

Proposition
  EUS^NL(y, y′) − L ≤ EUS^N(y, y′) ≤ EUS^NL(y, y′)

Proposition
  max EUS^NL_t(y, y′) − L ≤ max EPU^N_t(y, y′) ≤ max EUS^NL_t(y, y′) + L

Limited loss incurred (L ∼ M/20).
Optimization in solution space

1. Finding the best pair (y, y′) reduces to finding the best y, to be compared with the best behavior so far y*_t: the game of hot and cold.
2. Expectation of behavior utility → utility of expected behavior: given the mapping Φ : search space ↦ demonstration space,
     E_Φ[EUS^NL(Φ(x), y*_t)] ≥ EUS^NL(E_Φ[Φ(x)], y*_t)
3. Iterative solution optimization (see the sketch below):
   - draw w_0 ∼ θ_t and let x_1 = argmax ⟨w_0, E_Φ[Φ(x)]⟩;
   - iteratively, find x_{i+1} = argmax ⟨E_θi[w], E_Φ[Φ(x)]⟩, with θ_i the posterior given E_Φ[Φ(x_i)] ≻ y*_t.

Proposition. The sequence monotonically converges toward a local optimum of EUS^NL.
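A hedged sketch of this iteration over a finite candidate pool, with Φ taken as the identity and θ_t represented by samples (all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_candidates, n_particles = 4, 200, 1000
X = rng.normal(size=(n_candidates, d))     # candidate solutions; E[Phi(x)] = x here
W = rng.normal(size=(n_particles, d))      # samples representing theta_t
y_star = rng.normal(size=d)                # best behavior so far y*_t

def condition(W, y, y_star):
    """Posterior theta_i: keep the samples with <w, Phi(x_i) - y*_t> > 0."""
    keep = W @ (y - y_star) > 0
    return W[keep] if keep.any() else W

w0 = W[rng.integers(len(W))]               # draw w_0 ~ theta_t
x = X[np.argmax(X @ w0)]                   # x_1 = argmax <w_0, E[Phi(x)]>
for i in range(20):                        # monotone ascent to a local EUS optimum
    Wi = condition(W, x, y_star)           # theta_i, posterior to Phi(x_i) > y*_t
    x_next = X[np.argmax(X @ Wi.mean(axis=0))]  # argmax <E_{theta_i}[w], E[Phi(x)]>
    if np.array_equal(x_next, x):
        break                              # fixed point reached
    x = x_next

print("selected solution vs incumbent (posterior-mean utility):",
      float(x @ W.mean(axis=0)), float(y_star @ W.mean(axis=0)))
```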
Experimental validation of Programming by Feedback
Continuous Case, no Generative Model

The cartpole
State space IR^2, 3 actions.
Demonstration space IR^9, demonstration length 3,000.

[Plots: true utility of x_t vs. number of queries for the (M_E, M_A) settings; estimated feature weights (Gaussian features centered on the equilibrium state; fraction of time in equilibrium).]

Two interactions required on average to solve the cartpole problem.
No sensitivity to noise.
Continuous Case, with Generative Model

The bicycle
Solution space IR^210 (NN weight vector).
State space IR^4, action space IR^2, demonstration length ≤ 30,000.

[Plot: true utility vs. number of queries, M_E = 1, M_A = 1.]
Optimization component: CMA-ES (Hansen et al., 2001)
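A hedged usage sketch of CMA-ES on a 210-dimensional weight vector, assuming the pycma package (`pip install cma`); the objective below is a toy stand-in for the negated behavior utility, not the actual bicycle simulator:

```python
import numpy as np
import cma  # pycma package: pip install cma

def neg_utility(weights):
    """Toy stand-in for -E[U(behavior of the policy with these NN weights)]."""
    return float(np.sum((np.asarray(weights) - 0.5) ** 2))

# 210-dimensional weight vector, as in the bicycle setting
es = cma.CMAEvolutionStrategy(210 * [0.0], 0.5, {'maxiter': 50, 'verbose': -9})
while not es.stop():
    candidates = es.ask()                      # sample a population of weight vectors
    es.tell(candidates, [neg_utility(x) for x in candidates])  # rank and adapt
best = es.result.xbest
print("best objective:", neg_utility(best))
```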
15 interactions required on average to solve the problem for low noise, versus 20 queries with discrete actions in the state of the art.
Training in situ

The Nao robot

[Plot: Nao, true utility of x_t vs. number of queries, for 13 and 20 states.]

Goal: reaching a given state.
Transition matrix estimated from 1,000 random (s, a, s′) triplets.
Demonstration length 10, fixed initial state.
12 interactions for 13 states; 25 interactions for 20 states.