Theexploration-exploitation
trade-off
PantelisPipergiasAnalytis
Exploration-exploitationproblems
Themulti-armedbanditframework
Strategies
Contextualbandits
Results from areal worldexperiment
Conclusions
The exploration-exploitation trade-off
Pantelis Pipergias Analytis
Cornell University
February 5, 2018
Examples of exploration and exploitation in real life
1. Going to your favorite restaurant/bar vs. trying a new one.
2. Listening to music from a band you love vs. discovering new ones.
3. Preparing a meal that you have made successfully in the past and enjoyed vs. cooking up something new.
4. Reading a newspaper article from a journalist you like vs. reading something from a newcomer in the field.
5. A chimpanzee foraging in a new territory with unknown food resources, as opposed to the known home territory.
6. An organization trying a new organizational structure vs. a decently working existing one.
Multi-armed bandit (MAB) problem

[Figure: two slot-machine options with unknown reward distributions. Option 1 pays according to N(µ1, σ1), here N(12, 3), and has not yet been tried ("??"); Option 2 pays according to N(µ2, σ2), here N(15, 3), and has produced the draws 7.4 and 17.3.]
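The setting can be made concrete with a tiny simulation. Below is a minimal sketch; the arm means and standard deviations follow the example on this slide, while everything else (seed, number of draws) is illustrative:

```python
import random

# Two arms with Gaussian payoffs, as in the slide's example:
# arm 0 ~ N(12, 3), arm 1 ~ N(15, 3). The decision maker sees
# only the draws, never the underlying parameters.
ARMS = [(12.0, 3.0), (15.0, 3.0)]

def pull(arm: int) -> float:
    """Draw one noisy reward from the chosen arm."""
    mu, sigma = ARMS[arm]
    return random.gauss(mu, sigma)

random.seed(0)
rewards = [pull(1) for _ in range(5)]
print(rewards)  # five noisy observations of arm 1
```

The trade-off is exactly that each pull both earns a reward and buys information about the arm's unknown mean.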
History of the problem
1. "The [MAB] problem was formulated during the war, and efforts to solve it so sapped the energies and minds of Allied scientists that the suggestion was made that the problem be dropped over Germany, as the ultimate instrument of intellectual sabotage." (Whittle, 1980)
2. The first papers and strategies on the topic were written by Thompson (1933) and Robbins (1952).
3. Bellman and Gittins provided backward-looking and forward-looking solutions to the problem.
4. Today the MAB framework is behind numerous algorithms used in the online world.
5. Note the similarities to the search problem considered last week: the two problems fold into each other.
Domains where MABs have been applied
1. Developing new medicine: clinical trials.
2. One of the steam engines for studying human (and animal) learning.
3. A very general framework for autonomous AI decision making; it is used as an alternative to A/B testing.
4. Currently used to allocate ads on the web; companies like Criteo rely heavily on this framework.
5. Used to decide which learning algorithm to use in a specific context.
6. Used to model how companies might choose among organizational structures or technologies of unknown merit.
Different strategies for coping with the multi-armed bandit problem
- Go optimal: not always possible, and often computationally very expensive.
- Go greedy: always try the best alternative.
- Add some noise: randomize once in a while (ε-greedy).
- When randomizing, choose options with higher expected return with higher probability (softmax).
- Probability matching: choose actions according to their probability of being the best.
- Optimism in the face of uncertainty: prefer actions that are more uncertain, as they may turn out to be really good.
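The ε-greedy rule from the list above can be sketched in a few lines. This is a toy example: the two Gaussian arms, the horizon, and the value of ε are illustrative assumptions, not part of the slide.

```python
import random

def epsilon_greedy(n_trials: int = 1000, eps: float = 0.1) -> list[float]:
    """Run epsilon-greedy on two Gaussian arms, N(12, 3) and N(15, 3)."""
    arms = [(12.0, 3.0), (15.0, 3.0)]
    counts = [0, 0]          # pulls per arm
    estimates = [0.0, 0.0]   # running mean reward per arm
    for _ in range(n_trials):
        if random.random() < eps:
            arm = random.randrange(2)              # explore: random arm
        else:
            arm = estimates.index(max(estimates))  # exploit: best estimate
        reward = random.gauss(*arms[arm])
        counts[arm] += 1
        # incremental update of the sample mean
        estimates[arm] += (reward - estimates[arm]) / counts[arm]
    return estimates

random.seed(1)
print(epsilon_greedy())  # estimates should approach the true means 12 and 15
```

The ε parameter is the whole trade-off in one number: exploration happens on a fixed fraction of trials, regardless of how much has already been learned.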
The Gittins index (Christian and Griffiths, ch. 2)
Possible to calculate for Bernoulli bandits with stable discounting of future trials.
A simple example (Sutton and Barto, ch. 2)
Performance of the ε-greedy algorithm
Starting optimistically
Discussion: A/B testing and exploration-exploitation
The softmax rule
- Biases exploration towards the more promising actions.
- The softmax rule grades the choice probabilities according to the options' estimated values:

P(C(t) = j) = exp(θ E_j(t)) / Σ_{k=1}^{K} exp(θ E_k(t)),

where θ is a temperature parameter controlling how biased towards the best options the algorithm will be.
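The rule on this slide translates directly into code. A minimal sketch, where the value estimates and the temperature are made-up inputs (subtracting the maximum before exponentiating is a standard numerical-stability trick and does not change the probabilities):

```python
import math
import random

def softmax_probs(estimates: list[float], theta: float = 1.0) -> list[float]:
    """Choice probabilities under the softmax rule:
    P(C = j) proportional to exp(theta * E_j)."""
    m = max(estimates)  # shift for numerical stability
    weights = [math.exp(theta * (e - m)) for e in estimates]
    total = sum(weights)
    return [w / total for w in weights]

def softmax_choice(estimates: list[float], theta: float = 1.0) -> int:
    """Sample an arm index according to the softmax probabilities."""
    probs = softmax_probs(estimates, theta)
    return random.choices(range(len(estimates)), weights=probs)[0]

print(softmax_probs([12.0, 15.0], theta=1.0))
```

As θ → 0 the rule approaches uniform random exploration; as θ grows it approaches the greedy rule.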
Optimism in the face of uncertainty and the upper confidence bound (UCB)
- The more uncertain you are about the value of an option, the more important it is to explore it.
- That option could turn out to be really good and, in the long term, improve your overall utility.

UCB: P(C = i) ∝ exp(θ m_i + α √var_i)
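The soft UCB rule stated on this slide can be sketched as follows. The means, variances, and parameter values are illustrative; the point is that an arm earns probability both for a high estimated mean and for high uncertainty:

```python
import math
import random

def ucb_soft_choice(means: list[float], variances: list[float],
                    theta: float = 1.0, alpha: float = 1.0) -> int:
    """Soft UCB rule from the slide: P(C = i) ∝ exp(theta*m_i + alpha*sqrt(var_i)).
    Arms with higher estimated means OR higher uncertainty are favoured."""
    scores = [theta * m + alpha * math.sqrt(v)
              for m, v in zip(means, variances)]
    s = max(scores)  # shift for numerical stability
    weights = [math.exp(x - s) for x in scores]
    return random.choices(range(len(means)), weights=weights)[0]

# Two arms with EQUAL means: the more uncertain arm is chosen far more often.
random.seed(2)
picks = [ucb_soft_choice([10.0, 10.0], [1.0, 9.0], alpha=2.0)
         for _ in range(1000)]
print(picks.count(1) / 1000)  # well above 0.5
```

Setting α = 0 recovers plain softmax on the means; the uncertainty bonus is what makes the rule optimistic.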
UCB against ε-greedy
Probability matching, changing environments and Thompson sampling
- Probability matching suggests sampling alternatives according to their rewards or their probability of being the best.
- Thompson sampling is an implementation of the probability matching principle.
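For Bernoulli arms, Thompson sampling has a particularly clean form: keep a Beta posterior per arm, draw one sample from each posterior, and play the arm whose draw is highest. A minimal sketch; the true success probabilities, priors, and horizon below are made-up values for illustration:

```python
import random

def thompson_bernoulli(true_probs: list[float], n_trials: int = 2000) -> list[int]:
    """Thompson sampling for Bernoulli bandits with Beta(1, 1) priors.
    Returns how often each arm was pulled."""
    successes = [1] * len(true_probs)  # Beta alpha parameters
    failures = [1] * len(true_probs)   # Beta beta parameters
    counts = [0] * len(true_probs)
    for _ in range(n_trials):
        # one posterior draw per arm; play the argmax
        samples = [random.betavariate(a, b)
                   for a, b in zip(successes, failures)]
        arm = samples.index(max(samples))
        counts[arm] += 1
        if random.random() < true_probs[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return counts

random.seed(3)
print(thompson_bernoulli([0.3, 0.6]))  # the 0.6 arm attracts most pulls
```

Because each arm is chosen with exactly its posterior probability of being the best, exploration fades automatically as the posteriors concentrate.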
Collective exploration
- Rogers' paradox: produce or scrounge?
- The social learning tournament (Rendell et al., 2010).
- Counter-intuitive more-or-less effects (Toyokawa et al., 2014).
The typical bandit setting is like blind tasting...
My grandma's problem: Choosing the best place to swim
The machine learner's problem

Li, L., Chu, W., Langford, J., & Schapire, R. E. (2010, April). A contextual-bandit approach to personalized news article recommendation. In Proceedings of the 19th International Conference on World Wide Web.
A contextual bandit experiment
A contextual bandit experiment: Results
Contextual multi-armed bandit (CMAB) problem

[Figure: the two-option bandit diagram, but each option's mean payoff now depends on observable features: Option i pays according to N(f(·), σi), e.g. f(x) = w1x1 + w2x2, rather than a fixed, context-free N(µi, σi).]
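A tiny simulation of this setting, with a linear reward function f(x) = w1x1 + w2x2 as in the diagram. The weights, noise level, and feature distribution are illustrative assumptions:

```python
import random

W = (2.0, -1.0)   # true but unknown weights of the linear reward function
SIGMA = 1.0       # reward noise

def contextual_reward(features: tuple[float, float]) -> float:
    """Reward for choosing an option with observed features:
    y ~ N(w1*x1 + w2*x2, sigma)."""
    x1, x2 = features
    return random.gauss(W[0] * x1 + W[1] * x2, SIGMA)

random.seed(4)
# Each round, options appear with visible features; the learner must infer
# the function f to predict which option pays more.
options = [(random.random(), random.random()) for _ in range(2)]
rewards = [contextual_reward(o) for o in options]
print(options, rewards)
```

Unlike the plain MAB, experience generalizes here: learning the weights tells you about options you have never tried.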
Realistic decision problem...
Motivation

Why is the CMAB problem interesting?
1. It better captures the important characteristics of decisions in the wild.
2. We can study how function learning interacts with decision making, how people deal with novelty, and transfer of learning.
3. TD(λ) and the curse of dimensionality: function learning offers a solution. These problems are notoriously hard to solve using optimization techniques.
4. There is no realistic framework within which we can study how people learn their preferences; the CMAB might provide us with one.
CMAB task
MAB task
One-shot choices in the test phase

Three alternatives:
- Dominating: highest function value.
- Neutral: middle function value.
- Dominated: lowest function value.
Experimental Design

Training phase
- Between-subjects design: CMAB or MAB.
- Contextual multi-armed bandit (CMAB) task: two informative features are visually displayed.
- Classic multi-armed bandit (MAB) task: control group; features are not visible.
- 20 alternatives, 100 trials.

Test phase
- Designed to test the functional knowledge.
- One-shot choices, no outcome feedback.
- 3 arms in 70 trials.
Gaussian process (GP) based "optimal" solutions

Goal: simultaneously learn and optimize an unknown function

y = f(x) + ε,  ε ∼ N(0, σ²).

GP-based function learning process:

f(x) ∼ GP(m(x), K(x, x′)),
K(x, x′) = σ_f² exp(−(x − x′)² / (2l²)).

Two versions of the choice process:
1. Upper confidence bound (GP-UCB): argmax_i m_i + 2√var_i.
2. Thompson sampling (GP-Th): draw from p(θ|D, M) for each arm, take the max.
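The posterior that GP-UCB relies on takes only a few lines with numpy. This is a sketch under assumptions: a zero-mean GP with the squared-exponential kernel from the slide, and illustrative hyperparameters and training points (the sine function stands in for the unknown f):

```python
import numpy as np

def sq_exp_kernel(a: np.ndarray, b: np.ndarray,
                  l: float = 1.0, sf: float = 1.0) -> np.ndarray:
    """Squared-exponential kernel K(x, x') = sf^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = a[:, None] - b[None, :]
    return sf**2 * np.exp(-d**2 / (2 * l**2))

def gp_posterior(x_train, y_train, x_test, noise=0.1, l=1.0):
    """Posterior mean and variance of a zero-mean GP at the test inputs."""
    K = sq_exp_kernel(x_train, x_train, l) + noise**2 * np.eye(len(x_train))
    Ks = sq_exp_kernel(x_train, x_test, l)
    Kss = sq_exp_kernel(x_test, x_test, l)
    alpha = np.linalg.solve(K, y_train)
    mean = Ks.T @ alpha
    var = np.diag(Kss - Ks.T @ np.linalg.solve(K, Ks))
    return mean, var

# GP-UCB choice: evaluate mean + 2*sqrt(var) on a grid, pick the argmax.
x_train = np.array([0.1, 0.5, 0.9])
y_train = np.sin(2 * np.pi * x_train)   # stand-in for the unknown f
x_grid = np.linspace(0.0, 1.0, 101)
mean, var = gp_posterior(x_train, y_train, x_grid)
ucb = mean + 2 * np.sqrt(np.maximum(var, 0.0))
print(x_grid[np.argmax(ucb)])
```

The argmax balances the two terms on the slide: high posterior mean (exploit) against high posterior uncertainty (explore).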
GP prior, 1D example
[Figure: draws from a GP prior with a squared-exponential kernel (l = 1), and GP-UCB posteriors over f(x) on x ∈ [0, 1] after trials 3, 5, and 20.]
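The prior panel in the figure can be reproduced by drawing functions from a GP prior with the squared-exponential kernel. A minimal sketch, with an arbitrary grid and seed:

```python
import numpy as np

def se_kernel(x, l=1.0, sf=1.0):
    """Squared-exponential kernel on a 1D grid: sf^2 * exp(-(x - x')^2 / (2 l^2))."""
    d = x[:, None] - x[None, :]
    return sf**2 * np.exp(-d**2 / (2 * l**2))

x = np.linspace(0.0, 1.0, 50)                      # input grid on [0, 1]
K = se_kernel(x, l=1.0) + 1e-8 * np.eye(len(x))    # tiny jitter for stability

rng = np.random.default_rng(1)
# Three sample functions from the prior: f ~ N(0, K)
draws = rng.multivariate_normal(np.zeros(len(x)), K, size=3)
```

Shorter length-scales l produce wigglier sample functions; l = 1 on the unit interval gives the smooth curves shown in the prior panel.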
GP-Thompson, 1D example
[Figure: GP-Thompson posteriors over f(x) on x ∈ [0, 1] at trials 2, 3, 20, and 100.]
How much do people rely on knowledge of the relationships between features and alternatives' values when making decisions?
Can we model people's behavior using traditional machine learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, to learn the relationships?
Experiment 1 – Positive linear function
Experiment 1a – Amazon Turk
Feature values x drawn from U(0.1, 0.9)
For each arm j in trial t, the payoffs Rj(t) were computed as:
Rj(t) = 2 × x1,j + 1 × x2,j + εj(t),
with εj(t) drawn independently for each arm in every trial from N(0, 0.25).
Task was to maximize the cumulative reward.
186 participants – monetary payoffs.
Experiment 1b – lab replication
Weights and noise rescaled: w1 = 20, w2 = 10, N(0, 2.5).
75 UPF lab participants – monetary payoffs.
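The payoff environment of Experiment 1a can be simulated directly from the slide's description. One assumption here: the slide's N(0, 0.25) is read as mean and standard deviation (consistent with the lab replication's tenfold rescaling to N(0, 2.5)).

```python
import numpy as np

rng = np.random.default_rng(0)
n_arms = 20

# Feature values drawn once per arm from U(0.1, 0.9)
x = rng.uniform(0.1, 0.9, size=(n_arms, 2))

def payoff(j, rng):
    """R_j(t) = 2*x1,j + 1*x2,j + eps_j(t), eps drawn fresh every trial.
    N(0, 0.25) is assumed to give the standard deviation."""
    return 2 * x[j, 0] + 1 * x[j, 1] + rng.normal(0, 0.25)

# Lab replication (Experiment 1b): weights and noise scaled by 10,
# i.e. 20*x1 + 10*x2 + N(0, 2.5).
```

Expected payoffs range from 2(0.1) + 0.1 = 0.3 to 2(0.9) + 0.9 = 2.7, so ranking the 20 arms amounts to learning the two linear weights.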
Mean choice rank - Exp 1a
[Figure: mean rank of the chosen alternative by block (1–5) for the MAB, CMAB, and GP-UCB conditions, with random performance as a baseline.]
One-shot choices in the test phase
[Figure: mean proportion of choices by rank of the chosen alternative (1–3) in the CMABn condition, for the Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test items.]
Exploration in the feature space
[Figure: proportion of choices in each Feature 1 × Feature 2 bin, (0.1,0.3] through (0.7,0.9], for the MAB and CMABn conditions.]
Exploration in the feature space – First 10 trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, for the MAB and CMABn conditions.]
Exploration in the feature space – All trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, for the MAB and CMABn conditions.]
Inter-individual differences: Function-based and naive learners
Clustering according to the test-phase performance
[Figure: mean proportion of choices by rank of the chosen alternative (1–3) for clusters 1 and 2, across the Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test items.]
Clusters: Performance in the CMAB task
[Figure: mean rank of the chosen alternative by block (1–5) in the CMABn condition for the two clusters (N1 = 43, N2 = 53; training ranks Rtr = 7 vs. 4.59; test ranks Rte = 1.94 vs. 1.24).]
Clusters: Feature space, first 10 trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, for CMABn clusters 1 and 2.]
How much do people rely on knowledge of the relationships between features and alternatives' values when making decisions?
Can we model people's behavior using traditional machine learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, to learn the relationships?
Modeling user behavior
Learning: We model participants as function learners (GP) or as tracking mean rewards (BMT).
1 Gaussian process (GP) function learning model: f(x) ~ GP(m(x), K(x, x')), K(x, x') = σf² exp(−(x − x')² / (2l²))
2 Bayesian mean reward tracking (BMT)
Choices: Participants either use uncertainty in balancing exploration and exploitation (UCB) or not (SM).
1 Upper confidence bound (UCB): P(C = i) ∝ exp(θmi + α√vari)
2 Softmax (SM): P(C = i) ∝ exp(θmi)
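The two choice rules differ only in whether an uncertainty bonus enters the softmax score. A minimal sketch, with θ and α set to illustrative values rather than fitted estimates:

```python
import numpy as np

def choice_probs(m, var=None, theta=5.0, alpha=1.0):
    """Choice probabilities over arms.
    With var given:  P(C = i) ∝ exp(theta * m_i + alpha * sqrt(var_i))  (UCB)
    Without var:     P(C = i) ∝ exp(theta * m_i)                        (SM)
    """
    score = theta * np.asarray(m, dtype=float)
    if var is not None:
        score = score + alpha * np.sqrt(np.asarray(var, dtype=float))
    score -= score.max()          # subtract max for numerical stability
    p = np.exp(score)
    return p / p.sum()
```

Under the UCB rule, an arm with equal mean but higher posterior variance receives a higher choice probability; under plain softmax, uncertainty plays no role.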
Modeling user behavior
Model     BICw        N    #   C1         C2
BMT-SM    .54 (.38)   19   4   .48 (12)   .72 (7)
BMT-UCB   .05 (.55)   0    5   .04 (0)    .07 (0)
GP-SM     .27 (.38)   12   4   .34 (11)   .09 (1)
GP-UCB    .02 (.04)   0    5   .03 (0)    .01 (0)
RCM       .11 (.32)   4    0   .11 (3)    .11 (1)

There is evidence for the GP models, especially for participants who knew the function well (according to the test task). Models with UCB perform poorly.
How much do people rely on knowledge of the relationships between features and alternatives' values when making decisions?
Can we model people's behavior using traditional machine learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, to learn the relationships?
Experiment 1c – Quadratic and mixed linear function
Training phase
2×2 between-subject design: type of task (CMAB, MAB) and type of function (quadratic, mixed)
Quadratic function: 1 + 60(x1 − .02)² + 60(x2 − .02)² + 30x1x2, noise N(0, 2.5)
Mixed linear function: w1 = 40, w2 = −30, noise N(0, 2.5)
376 participants – Amazon Turk – monetary payoffs.
Test phase
Test items for the mixed linear function are the same as for the positive linear one.
Special items for the quadratic function, testing whether people detected the nonlinear nature of the relationship.
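The two Experiment 1c payoff functions can be sketched as below. One assumption is flagged: the slide prints the squared term (x1 − .02)² twice, and the second occurrence is taken here to be x2 (noise is added separately, as in Experiment 1a).

```python
def mixed_linear(x1, x2):
    """Mixed linear payoff: w1 = 40, w2 = -30 (noise N(0, 2.5) added per trial)."""
    return 40 * x1 - 30 * x2

def quadratic(x1, x2):
    """Quadratic payoff as read from the slide, assuming the second squared
    term is in x2: 1 + 60*(x1 - .02)^2 + 60*(x2 - .02)^2 + 30*x1*x2."""
    return 1 + 60 * (x1 - 0.02) ** 2 + 60 * (x2 - 0.02) ** 2 + 30 * x1 * x2
```

The mixed function rewards high x1 and low x2, so linear extrapolation still works; the quadratic function is symmetric in its two features and nonmonotone, which is what the special test items probe.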
Exploration in the feature space, first 10 trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, for the MAB mixed, CMAB mixed, MAB quadratic, and CMAB quadratic conditions.]
Exploration in the feature space, all trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, for the MAB mixed, CMAB mixed, MAB quadratic, and CMAB quadratic conditions.]
How much do people rely on knowledge of the relationships between features and alternatives' values when making decisions?
Can we model people's behavior using traditional machine learning models?
How do priors about functional relationships affect decision making?
Do people explore the choice set strategically, to learn the relationships?
Experiment 2 – Function learning pretraining
Exploration to learn the function should depend on...
Uncertainty about the function.
Type of function.
Horizon.
Expecting the need for generalization.
Training phase
Mixed design – two between factors: type of function (positive linear, quadratic) × horizon (100- or 30-trial CMAB phase), and one within factor (with or without a function learning phase).
Function learning task – 100 trials with a single alternative, same two features and function, accuracy incentivized.
Same positive linear and quadratic functions as before, but alternatives now include randomly drawn intercepts!
425 participants – Amazon Turk – monetary payoffs.
Mean choice ranks
[Figure: mean rank of the chosen alternative by block (1–10) for the linear and quadratic functions, comparing the CMAB, fCMAB, and fCMABs conditions, with random performance as a baseline.]
Exploration in the feature space, first 10 trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin during the first 10 trials, for the CMAB, fCMAB, and fCMABs conditions, linear and quadratic functions.]
Exploration in the feature space, all trials
[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, for the CMAB, fCMAB, and fCMABs conditions, linear and quadratic functions.]
Summary
People learn the function and generalize their knowledge to new decision situations.
But there are inter-individual differences – some people rely on learning the function, others are naive learners; akin to model-based vs. model-free RL.
A new flavour of the exploration-exploitation trade-off – evidence that people simultaneously learn and optimize the function.
Priors about the functional relationship can hurt performance.
People do not seem to take the time horizon into account. People exploit more aggressively when they have been pre-trained on the function.
Summary
Challenges and future directions
The goal is to develop a function-learning-based RL model – at the algorithmic level.
Moreover, it is difficult to fit function learning models without prediction data. However, asking for predictions along with choices changes the behaviour.
How do people behave in the presence of information about the alternatives and other contextual information?
Acknowledgments
Funding:
FPU grant, Ministry of Education, Culture and Sports,Spain
Max Planck Institute for Human Development, Berlin
Barcelona Graduate School of Economics
Quadratic function – An illustration
Individual behaviour in the training phase - Experiment 1a
[Figure: rank of the chosen alternative over 100 trials for subject e2-0124, CMABn condition, LowNoise experiment.]
Individual behaviour in the training phase - Experiment 1a
[Figure: rank of the chosen alternative over 100 trials for subject e2-0065, CMABn condition, LowNoise experiment.]
Mean choice rank – Lab replication
[Figure: mean rank of the chosen alternative by block (1–5) for the MAB and CMAB conditions, with random performance as a baseline.]
One-shot choices – Lab replication
[Figure: mean proportion of choices by rank of the chosen alternative (1–3) in the CMABn condition, for the Diff/Extra, Diff/Inter, Easy/Extra, Easy/Inter, and Weight test items.]
Feature space, all trials – Lab replication
[Figure: proportion of choices in each Feature 1 × Feature 2 bin across all trials, for the MAB and CMABn conditions.]
Feature space, first 10 trials – Lab replication
[Figure: proportion of choices over the Feature 1 × Feature 2 space, binned into (0.1,0.3], (0.3,0.5], (0.5,0.7], (0.7,0.9], for the MAB and CMABn conditions. Link: Back to Exp 1a]
Mean choice rank – Mixed and quadratic
[Figure: mean rank of the chosen alternative (1–13) by block (1–5); panels: Mixed and Quadratic; conditions: MAB mixed, CMAB mixed, MAB quadratic, CMAB quadratic, with random performance marked. Links: Feature space, all; Feature space, subset]
One-shot choices – Mixed and quadratic
[Figure: mean proportion of choices (0–1) by rank of the chosen alternative (1–3); panels: CMAB mixed (Easy, Difficult, Weight test) and CMAB quadratic (Max test 1, Min test 1, Slope test 1, Max test 2, Min test 2, Slope test 2). Links: Feature space, all; Feature space, subset]
Mean choice rank – Positive and quadratic
[Figure: mean rank of the chosen alternative (1–13) by block (1–10); panels: Linear and Quadratic; conditions: CMAB linear, fCMAB linear, fCMABs linear, CMAB quadratic, fCMAB quadratic, fCMABs quadratic, with random performance marked. Links: Feature space, all; Feature space, subset]
One-shot choices – Positive linear
[Figure: mean proportion of choices (0–1) by rank of the chosen alternative (1–3); panels: CMAB lin and fCMAB lin, each with Easy, Difficult, and Weight test. Links: Feature space, all; Feature space, subset]
One-shot choices – Quadratic
[Figure: mean proportion of choices (0–1) by rank of the chosen alternative (1–3); panels: CMAB q and fCMAB q, each with Max test 1, Min test 1, Slope test 1, Max test 2, Min test 2, Slope test 2. Links: Feature space, all; Feature space, subset]