Information Theory of Decisions and Actions
Naftali Tishby and Daniel Polani
Contents
• Guiding Questions
• Introduction
Guiding Questions

Q1. How are Fuster's perception-action cycle and Shannon's information theory related? How is this analogy related to reinforcement learning?

Q2. What is value-to-go? What is information-to-go? How do we trade off between these two terms? Give a formulation that captures this trade-off. Hint: free-energy principle. How can we find the optimal policy, i.e. the one minimizing its information-to-go under a constraint on the attained value-to-go?

Q3. Define entropy. Define relative entropy (the Kullback-Leibler divergence). Define the Markov decision process (MDP). Define the value function of an MDP. How is the value function optimized? What is the Bellman equation, and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation?

Q4. Use a Bayesian network (graphical model; see the figure in the Bayesian Network slides below) to describe the perception-action cycle of an agent with sensors and memory. What are the characteristics of this agent?
Introduction
• We want to understand how intelligent behaviour arises in organisms, and how to develop it for artificial agents.
• The "cycle" view, i.e. the perception-action cycle, helps to identify the biases, incentives and constraints for the self-organized formation of intelligent processing in living organisms.
• There are many ways to model the perception-action cycle quantitatively.
• An information-theoretic treatment of the perception-action cycle makes it possible to compare scenarios with differing computational models.
• The Markov decision process (MDP) framework solves the problem of finding the optimal policy, i.e. the one maximizing the reward achieved by the agent.
• The goal of the paper is to marry the MDP formalism with an information-theoretic treatment of the processing cost required by the agent to attain a given level of performance.
Shannon's Information Theory

What is Shannon's information theory?
A branch of applied mathematics, electrical engineering, and computer science involving the quantification of information.
A key measure of information is the entropy, usually expressed as the average number of bits needed to store or communicate one symbol in a message.
Entropy quantifies the uncertainty involved in predicting the value of a random variable.
Shannon's Information Theory

Entropy and Information

Entropy of a random variable X:
H(X) = -\sum_{x} p(x) \log p(x)

The entropy is a measure of the uncertainty about the outcome of the random variable before it has been measured, or seen, and is a natural choice for this purpose.
It attains its maximum for the uniform distribution, reflecting the state of maximal uncertainty.
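As a quick illustration (ours, not from the slides), the entropy of a discrete distribution can be computed in a few lines of Python; the function name entropy and the example distributions are our own choices:

```python
import math

def entropy(p):
    """Shannon entropy H(X) = -sum_x p(x) log2 p(x), in bits."""
    return -sum(px * math.log2(px) for px in p if px > 0)

# The uniform distribution over 4 outcomes attains the maximum, log2(4) = 2 bits.
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0
print(entropy([0.9, 0.05, 0.03, 0.02]))   # ~0.62 bits: much less uncertain
```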
Shannon's Information Theory

Conditional entropy of random variables X and Y:
H(X \mid Y) = -\sum_{x,y} p(x,y) \log p(x \mid y)

→ The conditional entropy measures the remaining uncertainty about X if Y is known.
Shannon's Information Theory

Joint entropy of random variables X and Y:
H(X, Y) = -\sum_{x,y} p(x,y) \log p(x,y)
→ The joint entropy is a measure of the uncertainty associated with a set of variables.

Mutual information between X and Y:
I(X; Y) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X, Y)
→ The mutual information of two random variables is a quantity that measures their mutual dependence.
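The chain rule H(X|Y) = H(X,Y) - H(Y) and the definition of mutual information can be checked numerically; below is a minimal sketch with an arbitrary 2x2 joint distribution of our own choosing:

```python
import math

def H(probs):
    """Entropy of a collection of probabilities, in bits."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Arbitrary joint distribution p(x, y) over two binary variables.
p_xy = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_x = {x: sum(p for (x2, _), p in p_xy.items() if x2 == x) for x in (0, 1)}
p_y = {y: sum(p for (_, y2), p in p_xy.items() if y2 == y) for y in (0, 1)}

H_xy = H(p_xy.values())               # joint entropy H(X,Y), ~1.72 bits
H_x_given_y = H_xy - H(p_y.values())  # chain rule: H(X|Y) = H(X,Y) - H(Y)
I_xy = H(p_x.values()) - H_x_given_y  # I(X;Y) = H(X) - H(X|Y), ~0.28 bits
print(H_xy, H_x_given_y, I_xy)
```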
Shannon's Information Theory

Relative entropy (Kullback-Leibler divergence):
D_{KL}(p \,\|\, q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}

The relative entropy measures how much "compression" (or prediction, both in bits) could be gained if, instead of a hypothesized distribution q(x), the concrete distribution p(x) is utilized.
One has D_{KL}(p \,\|\, q) \geq 0, with equality if and only if p(x) = q(x) everywhere. The relative entropy can become infinite if one assumes a probability q(x) = 0 for an outcome that can occur with nonzero probability p(x) > 0.
The mutual information between two variables X and Y can be expressed as
I(X; Y) = D_{KL}\big( p(x,y) \,\|\, p(x)\,p(y) \big)
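To make the last identity concrete, a short sketch (our own) verifying on the same toy joint distribution that I(X;Y) equals the divergence between the joint and the product of the marginals, and that the divergence is asymmetric:

```python
import math

def kl(p, q):
    """D_KL(p || q) in bits; becomes infinite if q is 0 where p is not."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# The joint from the previous slide, flattened as (0,0), (0,1), (1,0), (1,1),
# and the corresponding product of marginals p(x)p(y) (both marginals uniform).
p_joint = [0.4, 0.1, 0.1, 0.4]
p_prod = [0.25, 0.25, 0.25, 0.25]

print(kl(p_joint, p_prod))  # ~0.278 bits, matching I(X;Y) computed before
print(kl(p_prod, p_joint))  # also >= 0 but different: the divergence is asymmetric
```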
Markov Decision Processes
MDP: Definition
• A discrete-time stochastic control process
• The basic model for the interaction of an organism (or an artificial agent) with a stochastic environment
• The core problem of MDPs is to find a "policy" for the decision maker
• The goal is to choose a policy that will maximize some cumulative function of the random rewards, typically the expected discounted sum over a potentially infinite horizon
Markov Decision Processes

Given a state set S, and for each state s ∈ S an action set A(s), an MDP is specified by the tuple (P, R), defined for all s, s′ ∈ S and a ∈ A(s):
• P(s′ | s, a): the probability that performing action a in state s will move the agent to state s′
• R(s, a, s′): the expected reward for this particular transition
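To fix the notation, here is a toy two-state MDP written out as plain Python tables; the state and action names and all numbers are invented for illustration:

```python
# Toy MDP: states 0 and 1, actions "stay" and "go". All numbers are invented.
S = [0, 1]
A = {0: ["stay", "go"], 1: ["stay", "go"]}

# P[(s, a)][s2] = probability of landing in s2 after performing a in s.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.8, 1: 0.2},
}
# R[(s, a, s2)] = expected reward for this particular transition
# (here: reward 1 whenever the agent lands in state 1).
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0)
     for (s, a), ps in P.items() for s2 in ps}
```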
Markov Decision Processes

Value function of the MDP and its optimization
• Policy: π(a | s) specifies an explicit probability to select action a if the agent is in state s
• Total cumulated reward: R_t = \sum_{t' \geq t} \gamma^{t'-t} r_{t'}, with discount factor 0 < γ ≤ 1
• Future expected cumulative reward value (Bellman equation):
V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^\pi(s') \big]
• Per-action value function Q, which is expanded from the value function V:
Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V^\pi(s') \big]
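For a fixed policy the Bellman equation is a fixed-point equation that can be solved by simple iteration; a minimal sketch on the toy MDP above (the uniform policy and the iteration count are our own choices):

```python
# Toy MDP from the previous sketch (numbers are invented).
P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.8, 1: 0.2},
}
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0)
     for (s, a), ps in P.items() for s2 in ps}
gamma = 0.9
pi = {0: {"stay": 0.5, "go": 0.5}, 1: {"stay": 0.5, "go": 0.5}}  # uniform policy

V = {0: 0.0, 1: 0.0}
for _ in range(500):  # fixed-point iteration of the Bellman equation for V^pi
    V = {s: sum(pi[s][a] * sum(p * (R[(s, a, s2)] + gamma * V[s2])
                               for s2, p in P[(s, a)].items())
                for a in pi[s])
         for s in V}

# Per-action value Q expanded from the converged V.
Q = {(s, a): sum(p * (R[(s, a, s2)] + gamma * V[s2])
                 for s2, p in P[(s, a)].items())
     for (s, a) in P}
print(V, Q)
```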
Markov Decision Processes

Bellman equation
• A dynamic decision problem:
V(s_0) = \max_{a_0, a_1, \dots} \mathbb{E}\Big[ \sum_{t=0}^{\infty} \gamma^t R(s_t, a_t, s_{t+1}) \Big]
subject to the constraint a_t ∈ A(s_t), with s_{t+1} drawn according to P(· | s_t, a_t)
• Bellman's principle of optimality: An optimal policy has the property that whatever the initial state and initial decisions are, the remaining decisions must constitute an optimal policy with regard to the state resulting from the first decision.
• Bellman equation:
V(s) = \max_{a \in A(s)} \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma V(s') \big]
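Replacing the expectation over the policy with a maximization gives value iteration; a minimal sketch, again on the invented toy MDP:

```python
P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.8, 1: 0.2},
}
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0)
     for (s, a), ps in P.items() for s2 in ps}
gamma, actions = 0.9, ["stay", "go"]

V = {0: 0.0, 1: 0.0}
for _ in range(500):  # repeatedly apply the Bellman optimality operator
    V = {s: max(sum(p * (R[(s, a, s2)] + gamma * V[s2])
                    for s2, p in P[(s, a)].items())
                for a in actions)
         for s in V}

# Greedy (optimal) policy read off from the converged values.
policy = {s: max(actions,
                 key=lambda a: sum(p * (R[(s, a, s2)] + gamma * V[s2])
                                   for s2, p in P[(s, a)].items()))
          for s in V}
print(V, policy)
```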
Markov Decision Processes

Reinforcement learning
• If the probabilities or rewards are unknown, the problem is one of reinforcement learning.
• For this purpose it is useful to define a further function Q(s, a), which corresponds to taking action a in state s and then continuing optimally:
Q(s, a) = \sum_{s'} P(s' \mid s, a) \big[ R(s, a, s') + \gamma \max_{a'} Q(s', a') \big]
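When P and R can only be sampled, Q can be estimated from experience; a bare-bones Q-learning sketch in which the step size, exploration rate and step count are arbitrary choices of ours:

```python
import random

# The environment is only accessible through sampling (the RL setting);
# internally we reuse the invented toy dynamics.
P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.8, 1: 0.2},
}
def step(s, a):
    s2 = random.choices(list(P[(s, a)]), weights=list(P[(s, a)].values()))[0]
    return s2, (1.0 if s2 == 1 else 0.0)  # reward only for reaching state 1

gamma, alpha, eps, actions = 0.9, 0.1, 0.1, ["stay", "go"]
Q = {(s, a): 0.0 for s in (0, 1) for a in actions}

s = 0
for _ in range(20000):
    # epsilon-greedy action selection
    if random.random() < eps:
        a = random.choice(actions)
    else:
        a = max(actions, key=lambda a2: Q[(s, a2)])
    s2, r = step(s, a)
    # Q-learning update: move Q(s,a) toward r + gamma * max_a' Q(s',a')
    Q[(s, a)] += alpha * (r + gamma * max(Q[(s2, a2)] for a2 in actions) - Q[(s, a)])
    s = s2
print(Q)
```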
Markov Decision Processes
• Problem
– The MDP framework is concerned with describing the task and with solving the problem of finding the optimal policy (value-to-go).
– It is not concerned with the actual processing cost that is involved in carrying out the given policies (information-to-go).
Guiding Questions
• Q1. How are Fuster's perception-action cycle and Shannon's information theory related? How is this analogy related to reinforcement learning?
• Q3. Define entropy. Define relative entropy (the Kullback-Leibler divergence). Define the Markov decision process (MDP). Define the value function of an MDP. How is the value function optimized? What is the Bellman equation, and how is it related to the MDP problem? What is the relationship between reinforcement learning, the MDP, and the Bellman equation?
Bayesian Network
• Bayesian network of a general agent
[Figure: Bayesian network of the perception-action cycle, with world states W_{t-3}, …, W_{t+1}; sensor states S_{t-3}, …, S_t; memory states M_{t-3}, …, M_t; and actions A_{t-3}, …, A_t]

W: world state
S: sensor of the agent
M: memory of the agent
A: action
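To make the wiring of the graph concrete, a tiny simulation (entirely our own construction, with placeholder dynamics) that samples one rollout of the cycle:

```python
import random

# All distributions below are invented placeholders; the point is the wiring:
# the world evolves from (W, A), the sensor reads W, the memory integrates
# (M, S), and the action is chosen from M -- the edges of the Bayesian network.
def world_step(w, a):  # p(w_{t+1} | w_t, a_t)
    return (w + a) % 4
def sense(w):          # p(s_t | w_t): a noisy reading of the world state
    return w if random.random() < 0.8 else random.randrange(4)
def remember(m, s):    # p(m_t | m_{t-1}, s_t)
    return (m + s) % 4
def act(m):            # pi(a_t | m_t): here simply a random binary action
    return random.choice([0, 1])

w, m = 0, 0
for t in range(5):
    s = sense(w)          # S_t depends on W_t
    m = remember(m, s)    # M_t depends on M_{t-1} and S_t
    a = act(m)            # A_t depends on M_t
    print(f"t={t}: W={w} S={s} M={m} A={a}")
    w = world_step(w, a)  # W_{t+1} depends on W_t and A_t
```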
Bayesian Network
• Characteristics of the agent
– The agent can be considered an all-knowing observer.
– The agent has full access to the world state.
– The memory of this reactive agent can be ignored.
• Applying these assumptions to the previous graph yields:
[Figure: simplified Bayesian network of the reactive agent, the chain S_{t-3} → A_{t-3} → S_{t-2} → A_{t-2} → S_{t-1} → A_{t-1} → S_t → A_t]
Value-to-Go / Information-to-Go
• Value-to-go
– The future expected reward in the course of a behaviour sequence towards a goal
• Information-to-go
– The cumulated information processing cost or bandwidth required to specify the future decision and action sequence
• Trade-off
– In view of the biological ramifications, the organism seeks the optimal reward it can accumulate under given constraints on its informational bandwidth
– How much reward the organism can accumulate vs. how much informational bandwidth it needs for that
Information-to-Go
• Formalism
– The cumulated information processing cost or bandwidth required to specify the future decision and action sequence
– This is computed by specifying a given starting state s_t and initial action a_t and accumulating the information-to-go into the open-ended future
– Let \hat{p}(s)\,\hat{\pi}(a) be a fixed prior on the distribution of successive states and actions
– Define the process complexity (the information-to-go) as the Kullback-Leibler divergence between the actual distribution of states and actions after time t and the one assumed in the prior:
I^\pi(s_t, a_t) \equiv D_{KL}\big[ p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots \mid s_t, a_t) \,\big\|\, \hat{p}(s_{t+1})\,\hat{\pi}(a_{t+1})\,\hat{p}(s_{t+2})\,\hat{\pi}(a_{t+2}) \cdots \big]
Information-to-Go
• Formalism
– Since the process is Markovian, the future state and action variables are conditionally independent given their predecessors, so the joint distribution factorizes:
p(s_{t+1}, a_{t+1}, s_{t+2}, a_{t+2}, \dots \mid s_t, a_t) = \prod_{t' > t} p(s_{t'} \mid s_{t'-1}, a_{t'-1}) \, \pi(a_{t'} \mid s_{t'})
– The action distributions are consistent with the state distributions via the policy \pi(a \mid s), which we assume constant over time for all t.
Information-to-Go
• Formalism
– With this factorization, the information-to-go satisfies a recursive, Bellman-like equation:
I^\pi(s_t, a_t) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \Big[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{\hat{p}(s_{t+1})} + \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1}) \Big( \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\hat{\pi}(a_{t+1})} + I^\pi(s_{t+1}, a_{t+1}) \Big) \Big]
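This recursion can be evaluated numerically by truncating to a finite horizon and recursing backwards; a sketch on the toy MDP, where the uniform priors and the horizon are our own illustrative choices:

```python
import math

P = {
    (0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8},
    (1, "stay"): {0: 0.1, 1: 0.9}, (1, "go"): {0: 0.8, 1: 0.2},
}
actions = ["stay", "go"]
pi = {s: {a: 0.5 for a in actions} for s in (0, 1)}  # policy pi(a|s)
p_hat = {0: 0.5, 1: 0.5}                             # fixed state prior
pi_hat = {a: 0.5 for a in actions}                   # fixed action prior

# Finite-horizon information-to-go I[s_t, a_t], recursed backwards from I = 0.
I = {(s, a): 0.0 for (s, a) in P}
for _ in range(20):  # horizon of 20 steps
    I = {(s, a): sum(p * (math.log2(p / p_hat[s2])
                          + sum(pi[s2][a2] * (math.log2(pi[s2][a2] / pi_hat[a2])
                                              + I[(s2, a2)])
                                for a2 in actions))
                     for s2, p in P[(s, a)].items())
         for (s, a) in P}
print(I)  # here the policy matches its prior, so only the state terms contribute
```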
Calculating the Trade-Off
• Using the Lagrange method
– The constrained optimization problem of finding the minimal information-to-go at a given level of value-to-go can be turned into an unconstrained one.
– With Lagrange multiplier β, define the free energy
F^\pi(s_t, a_t; \beta) \equiv I^\pi(s_t, a_t) - \beta\, Q^\pi(s_t, a_t)
– This Lagrangian builds a link to the free-energy formalism of statistical physics: the information-to-go corresponds to the physical entropy, and the (negative) value-to-go corresponds to the energy of the system.
– This provides additional justification for the minimization of the information-to-go under value-to-go constraints.
• Minimization of F^\pi identifies the least committed policy, in the sense that the future is the least informative.
Calculating the Trade-Off
• Using the Lagrange method
– To find the optimal policy, compute
F(s_t, a_t; \beta) = \min_{\pi} F^\pi(s_t, a_t; \beta)
where the minimization ranges over all policies.
– To resolve this equation, expand the free energy one step ahead, analogously to the Bellman equation:
F^\pi(s_t, a_t; \beta) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \Big[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{\hat{p}(s_{t+1})} - \beta\, R(s_t, a_t, s_{t+1}) + \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1}) \Big( \log \frac{\pi(a_{t+1} \mid s_{t+1})}{\hat{\pi}(a_{t+1})} + F^\pi(s_{t+1}, a_{t+1}; \beta) \Big) \Big]
Calculating the Trade-Off
• Using the Lagrange method
– Extending the above equation by a Lagrange term for the normalization of \pi(a \mid s), taking the gradient with respect to \pi(a_{t+1} \mid s_{t+1}), and setting the gradient of F to 0 provides
\pi(a_{t+1} \mid s_{t+1}) = \frac{\hat{\pi}(a_{t+1}) \, \exp\!\big( -F^\pi(s_{t+1}, a_{t+1}; \beta) \big)}{Z(s_{t+1}; \beta)}
with partition function Z(s_{t+1}; \beta) = \sum_{a} \hat{\pi}(a) \exp\!\big( -F^\pi(s_{t+1}, a; \beta) \big)
Calculating the Trade-Off
• Using the Lagrange method
– Iterating this system of self-consistent equations until convergence, for every state, produces an optimal policy.
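Putting the pieces together, a sketch of the self-consistent iteration (free-energy recursion alternated with the Boltzmann-like policy update) on a small episodic example; β, the priors, the absorbing-goal construction and the sweep count are our own choices, so this illustrates the scheme rather than reproducing the paper's exact setup:

```python
import math

# Toy episodic MDP (our own): state 0 is the start, state 1 an absorbing goal.
# Reward 1 is received on reaching the goal; afterwards the episode ends, so
# the goal contributes no further reward or information: F(goal, .) = 0.
P = {(0, "stay"): {0: 0.9, 1: 0.1}, (0, "go"): {0: 0.2, 1: 0.8}}
R = {(s, a, s2): (1.0 if s2 == 1 else 0.0)
     for (s, a), ps in P.items() for s2 in ps}
actions, beta = ["stay", "go"], 5.0  # beta trades value-to-go vs information-to-go
p_hat = {0: 0.5, 1: 0.5}             # fixed state prior
pi_hat = {a: 0.5 for a in actions}   # fixed action prior

pi = {a: 0.5 for a in actions}       # pi(a | s=0), initialized uniformly
F = {a: 0.0 for a in actions}        # F(s=0, a; beta)
for _ in range(200):                 # iterate the self-consistent equations
    # Free-energy recursion F = I - beta*Q; the goal adds no future terms.
    F = {a: sum(p * (math.log(p / p_hat[s2]) - beta * R[(0, a, s2)]
                     + (sum(pi[a2] * (math.log(pi[a2] / pi_hat[a2]) + F[a2])
                            for a2 in actions) if s2 == 0 else 0.0))
                for s2, p in P[(0, a)].items())
         for a in actions}
    # Policy update: pi(a|s) proportional to pi_hat(a) * exp(-F(s, a; beta)).
    w = {a: pi_hat[a] * math.exp(-F[a]) for a in actions}
    Z = sum(w.values())
    pi = {a: w[a] / Z for a in actions}

print(pi)  # larger beta leans toward "go"; as beta -> 0 pi falls back to the prior
```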