An Information-theoretic On-line Learning Principle for Specialization in Hierarchical Decision-Making Systems
Heinke Hihn, Sebastian Gottwald, and Daniel A. Braun
Ulm University
Institute for Neural Information Processing
58th IEEE Conference on Decision and Control, Nice, December 12, 2019
Emergence of Specialized Decision-Makers

A decision-maker optimizes a utility U in state s:

$a^*_s = \arg\max_a U(s,a)$

Central Idea: limited resources, such as
- linear decision-makers
- limited information processing,

drive specialization.¹ ²

Motivation: linear decision-makers are easy to analyze.
¹ Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
² Hihn, H., Gottwald, S., and Braun, D. A. Bounded rational decision-making with adaptive neural network priors. In IAPR Workshop on Artificial Neural Networks in Pattern Recognition, 2018.
Bounded Rationality and Specialization
Herbert A. Simon coined the term Bounded Rationality: intelligent agents must invest their resources such that they optimally trade off utility against processing costs.³ ⁴

Consequence: specialization.
³ Simon, H. A. A behavioral model of rational choice. The Quarterly Journal of Economics, 69(1):99–118, 1955.
⁴ Gershman, S. J., et al. Computational rationality: A converging paradigm for intelligence in brains, minds, and machines. Science, 2015.
Information-theoretic Bounded Rationality⁵
[Figure: upon an observation, the action distribution p(a|s) is sharpened. With unlimited resources, high uncertainty becomes low uncertainty; with limited resources, some uncertainty remains.]
$\max_{p(a|s)} \; \mathbb{E}_{p(s),p(a|s)}[U(s,a)] \quad \text{s.t.} \quad I(S;A) \le C$    (1)

$p^*(a|s) = \arg\max_{p(a|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta} I(S;A)$    (2)
Mutual information: $I(S;A) = \mathbb{E}_{p(s)}[D_{\mathrm{KL}}(p(a|s)\,\|\,p(a))]$
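The optimum of (2) has the self-consistent form $p^*(a|s) \propto p(a)\exp(\beta U(s,a))$ with the prior $p(a)$ equal to the marginal of the posterior, so in the tabular case it can be computed by alternating the two updates. A minimal sketch (the function name and iteration count are illustrative assumptions):

```python
import numpy as np

def bounded_rational_policy(U, p_s, beta, iters=200):
    """Tabular fixed-point iteration for eq. (2):
    p*(a|s) is proportional to p(a) exp(beta U(s,a)), and p(a) is its marginal."""
    n_states, n_actions = U.shape
    p_a = np.full(n_actions, 1.0 / n_actions)      # uniform initial action prior
    for _ in range(iters):
        p_a_s = p_a * np.exp(beta * U)             # soft-maximize utility around the prior
        p_a_s /= p_a_s.sum(axis=1, keepdims=True)  # normalize per state
        p_a = p_s @ p_a_s                          # prior update: marginal of the posterior
    return p_a_s, p_a
```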
⁵ Ortega, P. A., and Braun, D. A. Thermodynamics as a theory of decision-making with information-processing costs. Proceedings of the Royal Society of London A: Mathematical, Physical and Engineering Sciences, 469(2153), 2013.
Hierarchical Decision-Making
Extend to a two-level hierarchy with experts $x \in X$:⁶

$S \to X \to A$    (3)
Extended objective:

$\max_{p(a|s,x),\,p(x|s)} \; \mathbb{E}[U(s,a)] - \frac{1}{\beta_1} I(S;X) - \frac{1}{\beta_2} I(S;A|X)$    (4)
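To make the two information costs in (4) concrete, here is a minimal tabular sketch that evaluates $I(S;X)$ and $I(S;A|X)$ from $p(s)$, $p(x|s)$, and $p(a|s,x)$ (assuming strictly positive probabilities so the logarithms stay finite; all names are illustrative):

```python
import numpy as np

def information_costs(p_s, p_x_s, p_a_sx):
    """I(S;X) and I(S;A|X) for tabular p(s), p(x|s), and p(a|s,x)."""
    p_x = p_s @ p_x_s                                   # marginal p(x)
    I_sx = np.sum(p_s[:, None] * p_x_s * np.log(p_x_s / p_x))
    p_sx = p_s[:, None] * p_x_s                         # joint p(s, x)
    p_s_x = p_sx / p_sx.sum(axis=0, keepdims=True)      # conditional p(s|x)
    p_a_x = np.einsum('sx,sxa->xa', p_s_x, p_a_sx)      # marginal p(a|x)
    I_sa_x = np.sum(p_sx[:, :, None] * p_a_sx * np.log(p_a_sx / p_a_x))
    return I_sx, I_sa_x
```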
⁶ Genewein, T., Leibfried, F., Grau-Moya, J., and Braun, D. A. Bounded rationality, abstraction, and hierarchical decision-making: An information-theoretic optimality principle. Frontiers in Robotics and AI, 2:27, 2015.
Learning via Gradient Descent

Parametrize the distributions with parameters $\theta$ and $\vartheta$:

$J(s,x,a) = U(s,a) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)} - \frac{1}{\beta_2}\log\frac{p_\vartheta(a|s,x)}{p(a|x)}$    (5)

$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\left[ f(x,s) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)} \right]$    (6)

$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(a|x,s)}\left[ U(s,a) - \frac{1}{\beta_2}\log\frac{p_\vartheta(a|s,x)}{p(a|x)} \right]}_{\text{Expert Objective}}$    (7)
Approximate the prior distributions p(x) and p(a|x) by running means.
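A minimal sketch of such a running-mean prior update (the decay rate and names are illustrative assumptions, not values from the slides):

```python
import numpy as np

def update_running_prior(prior, posteriors, rate=0.01):
    """Exponential running mean of the batch posteriors, approximating the
    marginal p(x); the same update applies per expert for p(a|x)."""
    batch_marginal = posteriors.mean(axis=0)    # average posterior over the batch
    return (1.0 - rate) * prior + rate * batch_marginal
```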
Utilities for Classification and Regression
1. Cross-entropy loss: $\mathcal{L}(y,\hat{y}) = \sum_i y_i \log\frac{1}{\hat{y}_i} = -\sum_i y_i \log \hat{y}_i$
2. Mean squared error: $\mathcal{L}(y,\hat{y}) = \sum_i (y_i - \hat{y}_i)^2$

$\max_\theta \; \mathbb{E}_{p_\theta(x|s)}\left[ f(x,s) - \frac{1}{\beta_1}\log\frac{p_\theta(x|s)}{p(x)} \right]$    (8)

$f(x,s) = \underbrace{\mathbb{E}_{p_\vartheta(y|x,s)}\left[ -\mathcal{L}(y,\hat{y}) - \frac{1}{\beta_2}\log\frac{p_\vartheta(y|s,x)}{p(y|x)} \right]}_{\text{Expert Objective}}$    (9)
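One plausible reading of the classification case of eq. (9) in code, a sketch assuming one-hot targets and strictly positive predictive distributions (names are illustrative):

```python
import numpy as np

def expert_objective(y_onehot, p_y_sx, p_y_x, beta2):
    """Negative cross-entropy utility minus the scaled information cost of
    eq. (9), evaluated per sample for a single expert x."""
    utility = np.sum(y_onehot * np.log(p_y_sx), axis=-1)     # -L(y, y_hat)
    info = np.sum(p_y_sx * np.log(p_y_sx / p_y_x), axis=-1)  # KL(p(y|s,x) || p(y|x))
    return utility - info / beta2
```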
Classification
[Figure: classification results on the Circles, Half Moons, and Blobs datasets. Panels per dataset: accuracy for 1, 2, and 4 experts; the learned state partition; the information costs I(W;X) and I(W;A|X) in bits; and the selection prior p(x) over the experts.]
Reinforcement Learning: Setup
A Markov Decision Process is a tuple (S, A, P, r), where
- S is the set of states
- A is the set of actions
- P : S × A × S → [0,1] is the transition probability
- r : S × A → R is the reward function
Find a policy $\pi_\theta$ maximizing the expected reward:

$\theta^* = \arg\max_\theta \; \underbrace{\mathbb{E}_{\tau\sim\pi_\theta}\left[ \sum_{t=0}^{\infty} r(s_t,a_t) \right]}_{J(\pi_\theta)}$    (10)
RL Objective
Penalize deviation from a prior policy:

$\arg\max_\pi \; \mathbb{E}_\pi\left[ \sum_{t=0}^{\infty} \gamma^t \left( r(s_t,a_t) - \frac{1}{\beta} \log\frac{\pi(a_t|s_t)}{\pi(a_t)} \right) \right]$    (11)
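A minimal sketch of the regularized return inside eq. (11), assuming per-step log-probabilities from the policy and the running prior are available (names are illustrative):

```python
import numpy as np

def regularized_return(rewards, log_pi, log_prior, beta, gamma=0.99):
    """Discounted sum of the per-step regularized rewards in eq. (11)."""
    reg = np.asarray(rewards) - (np.asarray(log_pi) - np.asarray(log_prior)) / beta
    discounts = gamma ** np.arange(len(reg))    # gamma^t for each step t
    return np.sum(discounts * reg)
```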
Similar to MaxEnt RL,⁷ Trust Region Policy Optimization,⁸ and mutual-information regularized RL.⁹
⁷ Eysenbach, B., and Levine, S. If MaxEnt RL is the Answer, What is the Question? arXiv preprint, 2019.
⁸ Schulman, J., et al. Trust region policy optimization. In International Conference on Machine Learning, 2015.
⁹ Leibfried, F., and Grau-Moya, J. Mutual-information regularization in Markov decision processes and actor-critic learning. Conference on Robot Learning, 2019.
RL Objectives
Advantage Actor-Critic¹⁰ Selection Stage Objective:

$\max_\theta \; \mathbb{E}_{\pi_\theta(x|s)}\left[ f(s,x) - \frac{1}{\beta_1}\log\frac{\pi_\theta(x|s)}{\pi(x)} \right]$    (12)

where

$f(s,x) = \underbrace{\mathbb{E}_{\pi_\vartheta(a|s,x)}\left[ r(s,a) - \frac{1}{\beta_2}\log\frac{\pi_\vartheta(a|s,x)}{\pi(a|x)} \right]}_{\text{Expert Objective}}$    (13)
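One possible score-function surrogate for the selection stage in eq. (12), a sketch under the common simplification of treating the bracketed term as a fixed return (the PyTorch framing and all names are assumptions, not the authors' implementation):

```python
import torch

def selection_surrogate(logits, x, f_sx, log_prior, beta1):
    """REINFORCE-style surrogate for eq. (12): log pi_theta(x|s)
    weighted by the detached regularized return."""
    log_pi = torch.log_softmax(logits, dim=-1)          # (batch, n_experts)
    log_pi_x = log_pi.gather(1, x.unsqueeze(1)).squeeze(1)
    ret = f_sx - (log_pi_x - log_prior[x]) / beta1      # bracketed term in (12)
    return (log_pi_x * ret.detach()).mean()             # maximize this surrogate
```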
¹⁰ Schulman, J., et al. High-dimensional continuous control using generalized advantage estimation. International Conference on Learning Representations, 2015.
Reinforcement Learning - State Partition
[Figure: pairwise plots of the CartPole state variables (cart position, cart velocity, pole angle, pole velocity), with states colored by the applied action (u = -1 vs. u = +1), visualizing the partition of the state space across the experts.]
Reinforcement Learning - Continuous Control Problems
[Figure: task schematic (state x, angle α, action a) and results: cumulative reward per episode over 10000 episodes for 1 expert, 5 experts, and TRPO;¹¹ the learned action priors p(a); and the mean expert DKL in bits over episodes.]
¹¹ Schulman, J., et al. Trust region policy optimization. In International Conference on Machine Learning, 2015.
Gain Scheduling
$\dot{x} = A_i x + B_i u + \varepsilon, \quad \text{for } x \in X_i, \qquad B_i = \begin{cases} +1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}$    (14)
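A minimal simulation sketch of this switching plant, assuming a forward-Euler discretization (A, the step size, and the noise scale are illustrative, not values from the slides):

```python
import numpy as np

def step(x, u, A=-0.5, dt=0.1, noise=0.01):
    """One Euler step of eq. (14); the input gain B flips sign with the state."""
    B = 1.0 if x >= 0 else -1.0
    dx = A * x + B * u + noise * np.random.randn()
    return x + dt * dx
```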
[Figure: gain-scheduling results. Panels: cumulative control cost over 2500 iterations; the policy entropies H(X|S) and H(A|S,X) in bits; the mean expert DKL and mean selector DKL in bits; and the controlled plant trajectory x(t) over 64 steps compared with the optimal control.]
Conclusion
- Principled method applicable to a variety of tasks
- Resource limitation drives specialization
- No prior task information required: utility-driven partitioning
- Normative framework to analyze hierarchical structures
- System built only from linear decision-makers
- Open questions:
  - High-dimensional tasks
  - Sample efficiency in RL