Learning with the human in the loop
Michele Sebag
Riad Akrour, Marc Schoenauer
TAO
Constructive Machine Learning Workshop, ICML 2015
Evolution of Computer Science

1970s  Specifications: languages & theorem proving
1990s  Programming by Examples: pattern recognition & ML
2010s  Interactive Learning and Optimization

Motivations
- no explicit specification
- open world: P(x) changes
- under-specified goal
Summary

- Machine Learning needs logic, data, optimization...
- Machine Learning needs feedback: the human in the loop.
- Co-evolution of the human in the loop and the learner.
If the computer could read the user's mind

Shannon's Mind Reading Machine
http://cs.williams.edu/~bailey/applets/MindReader/index.html

The 20Q game: 2^20 ≈ 10^6 > #words ≈ 10^5
"...prepare to be eerily amused." (Lonnie Brown, "The Ledger", Florida)
87,135,942 games played online

[Screenshot, 20q.net: "Think of something and 20Q will read your mind by asking a few simple questions. The object you think of should be something that most people would know about, but not a proper noun or a specific person, place, or thing."
Q1. Is it classified as Animal, Vegetable or Mineral? (Animal, Vegetable, Mineral, Concept, Unknown)
Suggestions: some things 20Q has chosen at random... jacks (child's game), talcum powder, anmitsu (bean paste with honey), a hot tub, an apricot.]

20Q.net Inc., http://www.20q.net/
Overview
Interactive Learning and Optimization in Search
Reinforcement Learning
Programming by Feedback
Interactive learning and optimization

Optimizing the coffee taste (Herdy et al., 96)
Black-box optimization: F : Ω → IR, find argmax F.
The user in the loop replaces F.

Optimizing visual rendering (Brochu et al., 07)
Optimal recommendation sets (Viappiani & Boutilier, 10)
Information retrieval (Shivaswamy & Joachims, 12)
Interactive optimization

Features
- Search space X ⊂ IR^d (recipe x: 33% arabica, 25% robusta, etc.)
- Hardly any available features; unknown objective
- Expert emits preferences: x ≺ x′

Iterative scheme
1. At step t, the algorithm generates candidates x_t^(1), x_t^(2)
2. Expert emits a preference x_t^(1) ≻ x_t^(2)
3. t → t + 1

Issues
- Asking as few questions as possible (≠ active ranking)
- Modelling the expert's preferences: a surrogate optimization objective
- Enforcing the exploration vs. exploitation trade-off
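For concreteness, a minimal runnable sketch of this scheme (the simulated expert, the logistic surrogate fit, and the perturbation-based exploration are all illustrative assumptions, not the cited algorithms):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5                                    # dimension of the search space X
w_true = rng.normal(size=d)              # hidden expert utility (simulation only)

def expert_prefers(x1, x2):
    """Simulated noiseless expert: prefers the candidate with higher true utility."""
    return x1 if x1 @ w_true > x2 @ w_true else x2

def fit_utility(prefs, n_iter=200, lr=0.1):
    """Crude logistic fit of a linear surrogate utility from preference pairs."""
    w = np.zeros(d)
    for _ in range(n_iter):
        for winner, loser in prefs:
            z = np.clip((winner - loser) @ w, -30, 30)
            w += lr * (1 - 1 / (1 + np.exp(-z))) * (winner - loser)
    return w

prefs = []
best = rng.random(d)                     # incumbent candidate
for t in range(30):
    w_hat = fit_utility(prefs) if prefs else rng.normal(size=d)
    # exploration (random perturbation) + exploitation (follow the surrogate)
    challenger = np.clip(best + 0.3 * rng.normal(size=d) + 0.1 * w_hat, 0, 1)
    winner = expert_prefers(best, challenger)        # one preference query
    loser = challenger if winner is best else best
    prefs.append((winner, loser))
    best = winner

print("true utility of final candidate:", best @ w_true)
```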
Optimal Bayesian Recommendation Sets
Viappiani & Boutilier, 2010

Notations
- Objects in a finite domain Y ⊂ {0, 1, . . .}^d
- Generalized additive independence model: U(y) = ⟨w, y⟩
- Belief P(w; θ)

Algorithm: for t = 1 . . . T do
  * Propose a set y_1 . . . y_k (selection criterion, see next)
  * Observe the preferred y
  * Update θ
Selection criterion

Expected utility of solution y:
  EU(y; θ) = ∫_W ⟨w, y⟩ dP(w; θ)

Maximum expected utility:
  EU*(θ) = max_y EU(y; θ)

Selection criterion: return the solution with maximum
- expected utility;
- expected posterior utility, given y* the best solution so far:
  EPU(y; θ) = Pr(y ≻ y*; θ) EU*(θ | y ≻ y*) + Pr(y ≺ y*; θ) EU*(θ | y ≺ y*)
- expected utility of selection:
  EUS(y; θ) = Pr(y ≻ y*; θ) EU(y; θ | y ≻ y*) + Pr(y ≺ y*; θ) EU(y*; θ | y ≺ y*)
Optimal Bayesian Recommendation Sets, 2

Comments
- Max expected utility: the greedy choice.
- Max expected posterior utility: greedy with 1-step look-ahead (maximizes the expected utility of the solution found after the user has expressed her preference). But computing EPU(y) requires solving two optimization problems.
- Max expected utility of selection: limited loss of performance compared to max EPU, and much less computationally expensive.
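These quantities are straightforward to estimate by Monte Carlo over the belief; below is a hedged sketch (the Gaussian belief over w and the random pool of binary objects are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_samples = 6, 4000
W = rng.normal(size=(n_samples, d))     # samples from the belief P(w; theta)
Y = rng.integers(0, 2, size=(20, d))    # a small pool of candidate objects

def EU(y, W):
    """Expected utility of object y under the belief samples."""
    return float((W @ y).mean())

def EUS(y, y_star, W):
    """Expected utility of selection for the pair (y, y*)."""
    u_y, u_s = W @ y, W @ y_star
    return float(np.maximum(u_y, u_s).mean())   # the user picks the preferred one

def eu_star(W_sub, Y):
    """EU* under a (conditioned) belief: re-optimize over the pool."""
    if len(W_sub) == 0:
        return 0.0
    return max(float((W_sub @ yy).mean()) for yy in Y)

def EPU(y, y_star, W, Y):
    """Expected posterior utility of the query (y vs y*): 1-step look-ahead."""
    pick_y = (W @ y) > (W @ y_star)     # posterior event "y preferred to y*"
    p = float(pick_y.mean())
    return p * eu_star(W[pick_y], Y) + (1 - p) * eu_star(W[~pick_y], Y)

y_star = max(Y, key=lambda y: EU(y, W))             # greedy: max expected utility
y_eus = max(Y, key=lambda y: EUS(y, y_star, W))     # best challenger under EUS
print(f"EU*={EU(y_star, W):.3f}  EUS={EUS(y_eus, y_star, W):.3f}  "
      f"EPU={EPU(y_eus, y_star, W, Y):.3f}")
```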
Co-active Learning
Shivaswamy & Joachims, 2012

Context
Refining a search engine: given query x, propose an ordered list y.

Notations
- User utility U(y|x)
- Search space of linear models: U(y|x) = ⟨w, φ(x, y)⟩

Algorithm: for t = 1 . . . T
  * Given x_t, propose y_t = argmax_y ⟨w_t, φ(x_t, y)⟩
  * Get feedback ȳ_t from the user (swapping items in y_t)
  * Update the utility model:
      w_{t+1} = w_t + φ(x_t, ȳ_t) − φ(x_t, y_t)

Differences w.r.t. the multi-class perceptron
- Feedback: ȳ_t is a rearrangement of y_t (not the true label)
- Criterion: regret (not misclassification),
      R = (1/T) Σ_{t=1..T} [ U(y*_t | x_t) − U(y_t | x_t) ]
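A hedged sketch of this update rule (the joint feature map φ and the one-swap simulated user are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(2)
d, n_items, T = 4, 8, 200
w_true = rng.normal(size=d)              # hidden user utility (simulation only)
w = np.zeros(d)                          # learned linear model

def phi(items, ranking):
    """Toy joint feature map: position-discounted sum of item features."""
    return sum(items[j] / np.log2(r + 2) for r, j in enumerate(ranking))

regret = 0.0
for t in range(T):
    items = rng.normal(size=(n_items, d))        # the "query" x_t: a set of items
    y_t = np.argsort(-(items @ w))               # proposed ranking: maximizes <w, phi>
    fb = y_t.copy()                              # user feedback: one improving swap
    scores = items[fb] @ w_true
    for r in range(n_items - 1):
        if scores[r] < scores[r + 1]:
            fb[r], fb[r + 1] = fb[r + 1], fb[r]
            break
    w += phi(items, fb) - phi(items, y_t)        # perceptron-style update
    y_opt = np.argsort(-(items @ w_true))        # best ranking in hindsight
    regret += w_true @ (phi(items, y_opt) - phi(items, y_t))

print("average regret:", regret / T)
```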
Interactive Intent Modelling

The vocabulary issue in human-machine interaction (Furnas et al., 87)
- A single access term chosen by a single designer will provide very poor access:
- humans are likely to use different vocabularies to encode and decode their intended meaning.
Two translation tasks

...not equally difficult:
A. From mother tongue to foreign language: one has to know vocabulary and grammar.
B. From foreign language to mother tongue: disambiguation from context, by guessing, etc.

Search
- Writing a query: an A-task
- Assessing relevance: a B-task
Interactive Intent Modelling, 2

A human-in-the-loop approach (Ruotsalo et al., 15)
- Show candidate documents
- Ask the user's preferences
- Focus the query
Overview
Interactive Learning and Optimization in Search
Reinforcement Learning
Programming by Feedback
Reinforcement Learning

Generalities
- An agent, spatially and temporally situated
- Stochastic and uncertain environment
- Goal: select an action at each time step...
- ...in order to maximize the expected cumulative reward over a time horizon

What is learned?
A policy = strategy: state ↦ action
Reinforcement Learning, formal background

Notations
- State space S
- Action space A
- Transition model p(s, a, s′) ∈ [0, 1]
- Reward r(s)
- Discount factor 0 < γ < 1

Goal: a policy π mapping states onto actions,
  π : S ↦ A,
maximizing the expected discounted cumulative reward
  E[π | s_0] = r(s_0) + Σ_t γ^{t+1} p(s_t, a = π(s_t), s_{t+1}) r(s_{t+1})
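A minimal sketch of this objective, estimated by Monte Carlo rollouts in a toy chain MDP (the MDP and the policy are assumptions for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, gamma, horizon = 5, 0.95, 200
reward = np.zeros(n_states)
reward[-1] = 1.0                         # reward only in the final state

def step(s, a):
    """Toy stochastic transition: action 1 moves right w.p. 0.8, else stay."""
    if a == 1 and rng.random() < 0.8:
        return min(s + 1, n_states - 1)
    return s

def expected_return(policy, n_rollouts=500):
    """Monte Carlo estimate of E[ sum_t gamma^t r(s_t) | s_0 = 0 ]."""
    total = 0.0
    for _ in range(n_rollouts):
        s, g = 0, 1.0
        for _ in range(horizon):
            total += g * reward[s]
            s = step(s, policy[s])
            g *= gamma
    return total / n_rollouts

go_right = np.ones(n_states, dtype=int)
print("E[return | always right] ≈", expected_return(go_right))
```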
Reinforcement learning

Tasks (model-based RL)
- Learn the value function (see the sketch after this slide)
- Learn the transition model
- Explore

Algorithmic & learning issues
- Representation of the state/action space
- Approximation of the value function
- Scaling w.r.t. the state-action space dimension
- Exploration / exploitation

Expert's duty: design the reward function such that
- the optimum corresponds to the desired behavior,
- (approximate) optimization is tractable.
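For the value-function task, a standard value-iteration sketch (the toy MDP is made up for illustration):

```python
import numpy as np

def value_iteration(P, r, gamma=0.95, tol=1e-8):
    """P[a, s, s'] = transition probability, r[s] = reward; returns V, policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = r[None, :] + gamma * (P @ V)   # Q[a, s] = r(s) + gamma sum_s' P V
        V_new = Q.max(axis=0)              # Bellman optimality backup
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

# toy 3-state, 2-action MDP
P = np.array([[[1, 0, 0], [1, 0, 0], [0, 0, 1]],          # action 0: drift left
              [[0, 1, 0], [0, 0, 1], [0, 0, 1]]], float)  # action 1: move right
r = np.array([0.0, 0.0, 1.0])
V, pi = value_iteration(P, r)
print("V =", V, " policy =", pi)
```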
Designing the reward function

Sparse
- Only rewarding the treasure: a needle-in-the-haystack optimization problem

Informed
- Significant expertise (in the problem domain, in RL) required
Using expert demonstrations

Using the demonstrations to train a classifier s ↦ π(s) ... yields brittle policies.

Inverse Reinforcement Learning (Ng & Russell, 00; Abbeel & Ng, 04)
Infer the reward function explaining the expert's behavior.
Sidestepping numerical rewards

Medical prescription (Furnkranz et al., 2012)
Avoid quantifying the cost of a fatal event: compare the effects of actions,
  (s, a, π) ≺ (s, a′, π)

Co-Active Learning (Shivaswamy & Joachims, 15)
The user responds by (slightly) improving the machine's output.
Relaxing Expertise Requirements in RL

Expert (required expertise decreases down the list)
- Associates a reward to each state             RL
- Demonstrates a (nearly) optimal behavior      Inverse RL
- Compares and revises agent demonstrations     Co-Active Learning
- Compares demonstrations                       Preference RL, PF

Agent (autonomy increases down the list)
- Computes the optimal policy based on rewards  RL
- Imitates the expert's demonstration verbatim  IRL
- Imitates and modifies                         IRL
- Learns the expert's utility                   IRL, CAL
- Learns, and selects demonstrations            CAL, PRL, PF
- Accounts for the expert's mistakes            PF
Motivating application: Swarm Robotics

Swarm-bot (2001-2005); Swarm Foraging, UWE
Symbrion IP, 2008-2013; http://symbrion.org/

Inverse RL is not applicable: the target individual behavior is unknown.
Programming by feedback
Akrour et al., 14

Loop
1. The computer presents the expert with a pair of behaviors y_1, y_2
2. The expert emits a preference y_1 ≻ y_2
3. The computer learns the expert's utility function ⟨w, y⟩
4. The computer searches for behaviors with best utility

Key issues
- Asking few preference queries (not active preference learning: sequential model-based optimization)
- Accounting for human noise
Human noise

Human beings often
- are irrational
- are inconsistent
- make errors
- adapt themselves
- are kind...

Preferences often
- do not pre-exist
- are constructed on the fly

D. Kahneman, Thinking, Fast and Slow, 2011
Formal setting

X  search space / solution space      (controllers, IR^D)
Y  evaluation space / behavior space  (trajectories, IR^d)
Φ : X ↦ Y

Utility function (on the behavior space)
  U* : Y ↦ IR,   U*(y) = ⟨w*, y⟩

Requisites
- Evaluation space: simple, to learn from few queries
- Search space: sufficiently expressive
Programming by Feedback

Ingredients
- Learning the expert's utility, to avoid asking too many preference queries
- Modelling the expert's competence, to accommodate expert inconsistencies
- Selecting the next best behaviors to be demonstrated:
  - which optimization criterion,
  - how to optimize it
(algorithmic details at the end)
Modelling the expert's competence: noise model

Given two solutions y and y′, for w* the true utility, define the preference margin
  z = ⟨w*, y − y′⟩

The probability of error, as a function of z, is
- 0 if the absolute margin is above a threshold δ,
- piecewise linear for −δ < z < δ (reaching 1/2 at z = 0),

[Plot: probability of error vs. the preference margin z, peaking at 1/2 for z = 0 and vanishing beyond ±δ.]

where δ is uniform in [0, M], and M quantifies the expert's inconsistency / incompetence: the lower M, the more consistent the expert.
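A hedged simulation of this noise model, checking the response probability it induces against the closed form used for the likelihood later on (the Monte Carlo check itself is an illustration, not part of the original algorithm):

```python
import numpy as np

rng = np.random.default_rng(6)
M = 1.0                                    # expert inconsistency parameter

def expert_says_first(z, delta, rng):
    """Noisy comparison: margin z = <w*, y - y'>, threshold delta."""
    if abs(z) >= delta:                    # large margin: no error
        return z > 0
    p_err = 0.5 * (1 - abs(z) / delta)     # piecewise-linear error probability
    correct = rng.random() >= p_err
    return (z > 0) == correct

def p_first_closed_form(z, M):
    """P(expert prefers the first solution), delta ~ U[0, M], |z| capped to M."""
    z = np.clip(z, -M, M)
    return 0.5 + (z / (2 * M)) * (1 + np.log(M / max(abs(z), 1e-12)))

z = 0.3
emp = np.mean([expert_says_first(z, rng.uniform(0, M), rng)
               for _ in range(200_000)])
print(f"empirical {emp:.4f}  vs  closed form {p_first_closed_form(z, M):.4f}")
```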
Experimental validation

- Sensitivity to expert competence: simulated expert, grid world
- Other benchmarks (details at the end):
  - continuous case, no generative model: the cartpole
  - continuous case, with generative model: the bicycle
  - training in situ: the Nao robot
The learner and the (simulated) human in the loop

Grid world: discrete case, no generative model.
25 states, 5 actions, horizon 300; 50% of the transitions leave the agent motionless.

[Figure: the true w*, with state values 1, 1/2, 1/4, ..., 1/64, 1/128, 1/256 decreasing away from the goal.]

Sensitivity study
M_E        expert inconsistency
M_A ≥ M_E  computer estimate of the expert's inconsistency
[Plots: true utility of x_t and expert error rate vs. number of queries, for M_E ∈ {.25, .5, 1} and estimates M_A ≥ M_E.]
The learner and the (simulated) human in the loop, 2

Findings
- The learner's estimate M_A of the expert's inconsistency M_E does influence the number of mistakes made by the expert.
- No psychological effects though: this is a simulated expert.
- In the short run, a learner trusting a (mildly) incompetent expert does better than a learner distrusting a (more) competent expert.

Interpretation
- The higher M_A, the smoother the learned preference model, and the more often the learner presents the expert with pairs of solutions with a low margin;
- the lower the margin, the higher the mistake probability:
- a cumulative (dis)advantage phenomenon.

For low M_A, the computer learns faster and submits more relevant demonstrations to the expert, thus priming a virtuous educational process.
Partial conclusion

Feasibility of Programming by Feedback for simple tasks.

An old research agenda
  "One could carry through the organization of an intelligent machine with only two interfering inputs, one for pleasure or reward, and the other for pain or punishment." (A. Turing, 1948)

CS + learning from the human in the loop
- No need to debug if you can just say "No!" and the computer reacts (appropriately).
- I had a dream: a world where I don't need to read the manual.
Learning and Optimization with the Human in the Loop

[Figure contrasting the knowledge-constrained side and the computation- and memory-constrained side of the loop.]
Bibliography

B. Akgun, K. Subramanian, J. Shim, and A. Lockerd Thomaz. Learning tasks and skills together from a human teacher. In W. Burgard and D. Roth, editors, AAAI. AAAI Press, 2011.

R. Akrour, M. Schoenauer, M. Sebag, and J.-C. Souplet. Programming by feedback. In ICML, volume 32 of JMLR Proceedings, pages 1503–1511. JMLR.org, 2014.

E. Brochu, N. de Freitas, and A. Ghosh. Active preference learning with discrete choice data. In J. C. Platt, D. Koller, Y. Singer, and S. T. Roweis, editors, NIPS 20, pages 409–416, 2008.

C. Furtlehner, M. Sebag, and X. Zhang. Scaling analysis of affinity propagation. Physical Review E, 81:066102, 2010.

R. Garnett, Y. Krishnamurthy, X. Xiong, J. G. Schneider, and R. Mann. Bayesian optimal active search and surveying. In ICML. Omnipress, 2012.

S. Gulwani. Automating string processing in spreadsheets using input-output examples. In T. Ball and M. Sagiv, editors, POPL, pages 317–330. ACM, 2011.

A. Jain, T. Joachims, and A. Saxena. Learning trajectory preferences for manipulators via iterative improvement. In NIPS, 2013.

W. B. Knox, P. Stone, and C. Breazeal. Training a robot via human feedback: a case study. In Int. Conf. on Social Robotics, volume 8239 of LNCS, pages 460–470. Springer, 2013.
Bibliography, 2

P. Liang, M. I. Jordan, and D. Klein. Learning programs: a hierarchical Bayesian approach. In J. Furnkranz and T. Joachims, editors, ICML, pages 639–646. Omnipress, 2010.

S. H. Muggleton and D. Lin. Meta-interpretive learning of higher-order dyadic datalog: predicate invention revisited. In F. Rossi, editor, Proc. 23rd IJCAI. IJCAI/AAAI, 2013.

P.-Y. Oudeyer, A. Baranes, and F. Kaplan. Intrinsically motivated exploration for developmental and active sensorimotor learning. In From Motor Learning to Interaction Learning in Robots, volume 264 of Studies in Computational Intelligence, pages 107–146. Springer Verlag, 2010.

A. Pease and S. Colton. Computational creativity theory: inspirations behind the FACE and IDEA models. In Proc. Intl Conf. on Computational Creativity, pages 72–77, 2011.

P. Shivaswamy and T. Joachims. Online structured prediction via coactive learning. In ICML, 2012.

A. Tversky and D. Kahneman.

P. Viappiani and C. Boutilier. Optimal Bayesian recommendation sets and myopically optimal choice query sets. In J. D. Lafferty et al., editors, NIPS 23, pages 2352–2360. Curran Associates, Inc., 2010.
Bibliography, 3

A. Wilson, A. Fern, and P. Tadepalli. A Bayesian approach for policy learning from trajectory preference queries. In P. L. Bartlett et al., editors, NIPS 25, pages 1142–1150, 2012.

Y. Yue and T. Joachims. Interactively optimizing information retrieval systems as a dueling bandits problem. In A. P. Danyluk et al., editors, Proc. 26th ICML, volume 382 of ACM Intl Conf. Proc. Series, pages 1169–1176. ACM, 2009.

X. Zhang, C. Furtlehner, C. Germain-Renaud, and M. Sebag. Data stream clustering with affinity propagation. IEEE Trans. Knowl. Data Eng., 26(7):1644–1656, 2014.
Programming by feedback
Akrour et al., 14

Algorithm
1. Learning the expert's utility function given the preference archive
2. Finding the best pair of demonstrations (y, y′) (expected posterior utility under the noise model)
3. Achieving optimization in demonstration space (e.g. trajectory space)
4. Achieving optimization in solution space (e.g. a neural net)
Learning the expert's utility function

Data: U_t = {y_0, y_1, . . . ; (y_{i,1} ≻ y_{i,2}), i = 1 . . . t}
- trajectories y_i
- preferences y_{i,1} ≻ y_{i,2}

Learning: find θ_t, the posterior on W (W = linear functions on Y).

Proposition. Given U_t,
  θ_t(w) ∝ Π_{i=1..t} P(y_{i,1} ≻ y_{i,2} | w)
         = Π_{i=1..t} ( 1/2 + (w_i / 2M) (1 + log(M / |w_i|)) )
with w_i = ⟨w, y_{i,1} − y_{i,2}⟩, capped to [−M, M].

Learned utility:
  U_t(y) = E_{w∼θ_t}[⟨w, y⟩]
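A hedged sketch of this posterior, represented by importance-weighted samples of w (the particle representation and the simulated noiseless expert are assumptions for illustration; the likelihood is the proposition's formula):

```python
import numpy as np

rng = np.random.default_rng(7)
d, n_particles, M = 3, 5000, 0.5
W = rng.normal(size=(n_particles, d))        # particles representing theta_t
p = np.full(n_particles, 1.0 / n_particles)  # particle weights
w_true = np.array([1.0, -0.5, 0.2])          # hidden expert utility (simulation only)

def pref_likelihood(w_i, M):
    """P(y_i1 > y_i2 | w), margin w_i = <w, y_i1 - y_i2> capped to [-M, M]."""
    w_i = np.clip(w_i, -M, M)
    lik = 0.5 + (w_i / (2 * M)) * (1 + np.log(M / np.maximum(np.abs(w_i), 1e-12)))
    return np.clip(lik, 1e-9, 1.0)           # floored for numerical stability

for t in range(50):
    y1, y2 = rng.normal(size=d), rng.normal(size=d)   # a pair of behaviors
    if w_true @ (y1 - y2) < 0:
        y1, y2 = y2, y1                               # noiseless expert prefers y1
    p *= pref_likelihood(W @ (y1 - y2), M)            # Bayesian update of theta_t
    p /= p.sum()

U_t = lambda y: float(p @ (W @ y))           # U_t(y) = E_{w ~ theta_t}[<w, y>]
w_hat = p @ W                                # posterior mean of w
print("cos(w_hat, w*):",
      w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true)))
print("U_t on a random behavior:", U_t(rng.normal(size=d)))
```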
Best demonstration pair (y, y′)
after Viappiani & Boutilier, 10

EUS: expected utility of selection (greedy)
  EUS(y, y′) = E_θt[⟨w, y − y′⟩ > 0] · U_{w∼θt, y≻y′}(y)
             + E_θt[⟨w, y − y′⟩ < 0] · U_{w∼θt, y≺y′}(y′)

EPU: expected posterior utility (look-ahead)
  EPU(y, y′) = E_θt[⟨w, y − y′⟩ > 0] · max_{y″} U_{w∼θt, y≻y′}(y″)
             + E_θt[⟨w, y − y′⟩ < 0] · max_{y″} U_{w∼θt, y≺y′}(y″)
             = E_θt[⟨w, y − y′⟩ > 0] · U_{w∼θt, y≻y′}(y*)
             + E_θt[⟨w, y − y′⟩ < 0] · U_{w∼θt, y≺y′}(y′*)

Therefore
  argmax EPU(y, y′) ≤ argmax EUS(y, y′)
Optimization in demonstration space

NL: noiseless; N: noisy.

Proposition
  EUS^NL(y, y′) − L ≤ EUS^N(y, y′) ≤ EUS^NL(y, y′)

Proposition
  max EUS^NL_t(y, y′) − L ≤ max EPU^N_t(y, y′) ≤ max EUS^NL_t(y, y′) + L

Limited loss incurred (L ∼ M/20).
Optimization in solution space

1. Finding the best pair (y, y′) reduces to finding the best y, to be compared with the best behavior so far y*_t: the game of hot and cold.
2. Expectation of behavior utility → utility of expected behavior: given the mapping Φ : search space ↦ demonstration space,
     E_Φ[EUS^NL(Φ(x), y*_t)] ≥ EUS^NL(E_Φ[Φ(x)], y*_t)
3. Iterative solution optimization (see the sketch below):
   - draw w_0 ∼ θ_t and let x_1 = argmax ⟨w_0, E_Φ[Φ(x)]⟩;
   - iteratively, find x_{i+1} = argmax ⟨E_θi[w], E_Φ[Φ(x)]⟩, with θ_i the posterior given E_Φ[Φ(x_i)] ≻ y*_t.

Proposition. The sequence monotonically converges toward a local optimum of EUS^NL.
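A hedged sketch of this iteration over a finite candidate pool, with Φ taken as the identity and θ_t represented by samples (all simplifying assumptions):

```python
import numpy as np

rng = np.random.default_rng(5)
d, n_candidates, n_particles = 4, 200, 1000
X = rng.normal(size=(n_candidates, d))     # candidate solutions; E[Phi(x)] = x here
W = rng.normal(size=(n_particles, d))      # samples representing theta_t
y_star = rng.normal(size=d)                # best behavior so far y*_t

def condition(W, y, y_star):
    """Posterior theta_i: keep the samples with <w, Phi(x_i) - y*_t> > 0."""
    keep = W @ (y - y_star) > 0
    return W[keep] if keep.any() else W

w0 = W[rng.integers(len(W))]               # draw w_0 ~ theta_t
x = X[np.argmax(X @ w0)]                   # x_1 = argmax <w_0, E[Phi(x)]>
for i in range(20):                        # monotone ascent to a local EUS optimum
    Wi = condition(W, x, y_star)           # theta_i, posterior to Phi(x_i) > y*_t
    x_next = X[np.argmax(X @ Wi.mean(axis=0))]  # argmax <E_{theta_i}[w], E[Phi(x)]>
    if np.array_equal(x_next, x):
        break                              # fixed point reached
    x = x_next

print("selected solution vs incumbent (posterior-mean utility):",
      float(x @ W.mean(axis=0)), float(y_star @ W.mean(axis=0)))
```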
Experimental validation of Programming by Feedback
Continuous Case, no Generative Model

The cartpole
State space IR^2, 3 actions.
Demonstration space IR^9, demonstration length 3,000.

[Plots: true utility of x_t vs. number of queries for the (M_E, M_A) settings; estimated feature weights (Gaussian features centered on the equilibrium state; fraction of time in equilibrium).]

Two interactions required on average to solve the cartpole problem.
No sensitivity to noise.
Continuous Case, with Generative Model

The bicycle
Solution space IR^210 (NN weight vector).
State space IR^4, action space IR^2, demonstration length ≤ 30,000.

[Plot: true utility vs. number of queries, M_E = 1, M_A = 1.]
Optimization component: CMA-ES (Hansen et al., 2001)
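A hedged usage sketch of CMA-ES on a 210-dimensional weight vector, assuming the pycma package (`pip install cma`); the objective below is a toy stand-in for the negated behavior utility, not the actual bicycle simulator:

```python
import numpy as np
import cma  # pycma package: pip install cma

def neg_utility(weights):
    """Toy stand-in for -E[U(behavior of the policy with these NN weights)]."""
    return float(np.sum((np.asarray(weights) - 0.5) ** 2))

# 210-dimensional weight vector, as in the bicycle setting
es = cma.CMAEvolutionStrategy(210 * [0.0], 0.5, {'maxiter': 50, 'verbose': -9})
while not es.stop():
    candidates = es.ask()                      # sample a population of weight vectors
    es.tell(candidates, [neg_utility(x) for x in candidates])  # rank and adapt
best = es.result.xbest
print("best objective:", neg_utility(best))
```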
15 interactions required on average to solve the problem for low noise, versus 20 queries with discrete actions in the state of the art.
Training in situ

The Nao robot

[Plot: Nao, true utility of x_t vs. number of queries, for 13 and 20 states.]

Goal: reaching a given state.
Transition matrix estimated from 1,000 random (s, a, s′) triplets.
Demonstration length 10, fixed initial state.
12 interactions for 13 states; 25 interactions for 20 states.