Why can simple organisms cope with complex environments?
NIPS 2009 Workshop: The curse of dimensionality – how can the brain solve it?
Naftali Tishby
Interdisciplinary Center for Neural ComputationHebrew University, Jerusalem, Israel
Outline
• Is the RL “curse of dimensionality” real?
  …the “number of parameters” debate revisited…
• The brain’s primary task: making valuable predictions
  – The perception-action cycle of information
  – Optimal solution: the Past-Future Information Bottleneck
• Predictive information is rare
  – Only a tiny fraction of the world’s complexity is relevant
  – How difficult is it to extract?
• The brain’s complexity reflects behavior (not the world)
  – New bounds on predictive representation complexity
  – Information-bounded Reinforcement Learning
  – Robustness and generalization theorems
The brain’s primary task is making valuable predictions
Perception is goal-oriented, directed by active predictions
Hierarchies and reverse hierarchies
Tsotsos 1990; Hochstein and Ahissar 2002
The auditory pathways
[Figure: a feed-forward hierarchy paired with a feedback (reverse) hierarchy. Low-level representations are sensitive to fine temporal cues, at μs resolution; the top is the phonological/semantic level (e.g. day/bay, night/dream).]
Initial perception is based on high-level, phonological representations
Nelken et al, 2005
Perception-Action Cycles
Multiple cycles with multiple time scales!
The Perception-Action Cycle
The circular flow of information that takes place between the organism and its environment in the course of a sensory-guided sequence of behavior towards a goal. (JM Fuster)
Why Predictability? Life is all about making good predictions…
The essence of the cycle
[Diagram: the perception-action cycle around the NOW, trading sensing costs against prediction value via internal representations]
The Environment: a stationary stochastic process
Internal Representations
[Diagram: the past X and the future Y of the process, with an internal representation T between them]
(Optimal) Internal Representations – we like to think probabilistically
X → T → Y
• Environment: P(X,Y)
• Internal representation: P(T|X), P(Y|T)
X → T → Y, with I(T;X) and I(T;Y)
• Environment: I(X;Y) – predictive information
• Internal representation: I(T;X), I(T;Y) – compression & prediction
(Optimal) Internal Representations – and we want a computational principle…
X → T → Y (past → model → future), with I(T;X) and I(T;Y)
Model Quantifiers:
• Complexity (“cost”): I(T;X)
• Predictive Info (“value”): I(T;Y)
Optimality Trade-off:
• minimize complexity
• maximize predictive info
(past → model → future)
(Optimal) Internal Representations – and a computational principle…
• Environment: I(X;Y) – predictive information
• Internal representation: I(T;X), I(T;Y) – compression & prediction
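The two quantifiers can be computed directly from a joint table P(X,Y) and a candidate encoder P(T|X). A minimal sketch (the 4×2 toy distribution and the grouping encoder are hypothetical, chosen only to make the numbers clean):

```python
import numpy as np

def mutual_info(pab):
    """I(A;B) in bits from a joint distribution table."""
    pa, pb = pab.sum(axis=1), pab.sum(axis=0)
    nz = pab > 0
    return float(np.sum(pab[nz] * np.log2(pab[nz] / np.outer(pa, pb)[nz])))

# hypothetical environment P(X,Y): 4 values of X, 2 values of Y
pxy = np.array([[0.20, 0.05],
                [0.20, 0.05],
                [0.05, 0.20],
                [0.05, 0.20]])

# a candidate internal representation P(T|X): T groups x in {1,2} vs {3,4}
pt_x = np.array([[1.0, 0.0],
                 [1.0, 0.0],
                 [0.0, 1.0],
                 [0.0, 1.0]])

px = pxy.sum(axis=1)
pxt = pt_x * px[:, None]                   # joint P(X,T)
pty = np.einsum('xt,xy->ty', pt_x, pxy)    # joint P(T,Y)

complexity = mutual_info(pxt)              # I(T;X): the "cost"
value = mutual_info(pty)                   # I(T;Y): the "value"
predictive = mutual_info(pxy)              # I(X;Y): available predictive info
```

Here T keeps all of the predictive information (I(T;Y) = I(X;Y) ≈ 0.28 bits, since P(Y|X) is constant within each group) at a complexity cost of only 1 bit, versus H(X) = 2 bits for a perfect copy.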
A simple illustration
The environment: P(X,Y), with x ∈ {1, 2, …, 18} and y ∈ {A, B}.
[Figure: the joint distribution P(X,Y) and the conditional P(Y=B|X) over x = 1…18]
A simple illustration
[Figure: the information curve I(T;Y) vs. I(T;X); the encoder P(T|X); and the predictions P(‘B’|X), for T = X with I(T;X) = H(X) – most complex, perfect copy, perfect predictions]
A simple illustration
[Figure: the information curve, the encoder P(T|X), and the predictions P(‘B’|X) for I(T;X) = 3 bits]
A simple illustration
[Figure: the information curve, the encoder P(T|X), and the predictions P(‘B’|X) for I(T;X) = 2 bits]
A simple illustration
[Figure: the information curve, the encoder P(T|X), and the predictions P(‘B’|X) for I(T;X) = 1 bit]
A simple illustration
[Figure: the information curve, the encoder P(T|X), and the predictions P(‘B’|X) for I(T;X) = 0.5 bit]
A simple illustration
[Figure: the information curve, the encoder P(T|X), and the predictions P(‘B’|X) for I(T;X) = 0 bits]
How much of the past does the brain really need?
Predictive Information: the capacity of the past-future channel (with Bialek and Nemenman, 2001)
– Estimate $P_T(W^{(-)}, W^{(+)})$: the T-window past-future distribution
[Diagram: a signal W(t) with a past T-window $W^{(-)}$ and a future T-window $W^{(+)}$ around t = 0]
$I_{pred}(T) = \left\langle \log \frac{p(W_{future} \mid W_{past})}{p(W_{future})} \right\rangle_{p(W_{past},\, W_{future})}$
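The predictive information between past and future windows can be estimated by plug-in from a long sample. A sketch on a hypothetical "sticky" two-state Markov chain, with a memoryless coin as control:

```python
import numpy as np
from collections import Counter

def predictive_info(seq, w):
    """Plug-in estimate (bits) of I(W(-); W(+)) between adjacent
    length-w past and future windows of a symbol sequence."""
    pairs = [(tuple(seq[i-w:i]), tuple(seq[i:i+w]))
             for i in range(w, len(seq) - w + 1)]
    n = len(pairs)
    pj = Counter(pairs)
    pp = Counter(p for p, _ in pairs)
    pf = Counter(f for _, f in pairs)
    return sum(c/n * np.log2((c/n) / ((pp[p]/n) * (pf[f]/n)))
               for (p, f), c in pj.items())

rng = np.random.default_rng(1)
# sticky two-state Markov chain (stays put with prob 0.9): predictable past
state, markov = 0, []
for _ in range(200_000):
    markov.append(state)
    if rng.random() > 0.9:
        state = 1 - state
memoryless = rng.integers(0, 2, size=200_000).tolist()   # fair-coin control
```

For the sticky chain the estimate is close to the true value $1 - H_2(0.1) \approx 0.53$ bits, while the memoryless control carries essentially no predictive information despite having the maximal entropy rate.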
Entropy of words in a Spin Chain
$S(N) = -\sum_{K=0}^{2^N - 1} P(W_K^N) \log_2 P(W_K^N)$
Entropy of Spin Chains
• $J_{ij} = 1$ for $|i-j| = 1$, $J_{ij} = 0$ otherwise
• $J_{ij} = J$, with $J$ taken at random from $N(0,1)$ every 400,000 spins
• $J_{ij}$ taken at random from $N(0, 1/(i-j))$ every 400,000 spins
($1 \cdot 10^9$ total spins)
Entropy is extensive: it shows no distinction between the cases!
Predictive Information – the Subextensive Component of the Entropy
It shows a qualitative distinction between the cases!
The growth of the subextensive component reflects the underlying complexity!
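The extensive/subextensive distinction can be checked numerically on two binary sequences engineered to have the same entropy rate (a sketch; a sticky Markov chain and a biased coin stand in for the spin-chain couplings):

```python
import numpy as np
from collections import Counter

def block_entropy(seq, n):
    """Plug-in entropy S(N) (bits) of length-n words of a sequence."""
    words = Counter(tuple(seq[i:i+n]) for i in range(len(seq) - n + 1))
    total = sum(words.values())
    p = np.array(list(words.values())) / total
    return float(-(p * np.log2(p)).sum())

rng = np.random.default_rng(2)
# sticky Markov chain, flip prob 0.1 -> entropy rate H2(0.1) ≈ 0.469 bits/symbol
state, markov = 0, []
for _ in range(300_000):
    markov.append(state)
    if rng.random() < 0.1:
        state = 1 - state
# iid coin with p(1) = 0.1 -> the SAME entropy rate, but zero predictive info
iid = (rng.random(300_000) < 0.1).astype(int).tolist()

# extensive part: the entropy rate S(N+1) - S(N) does not distinguish them
rate_markov = block_entropy(markov, 6) - block_entropy(markov, 5)
rate_iid = block_entropy(iid, 6) - block_entropy(iid, 5)
# subextensive part at N=2: I_pred = 2*S(1) - S(2) = I(X_t; X_{t+1})
ipred_markov = 2 * block_entropy(markov, 1) - block_entropy(markov, 2)
ipred_iid = 2 * block_entropy(iid, 1) - block_entropy(iid, 2)
```

The entropy rates agree, but the subextensive component is about $1 - H_2(0.1) \approx 0.53$ bits for the Markov chain and essentially zero for the coin, exactly the qualitative distinction the slide claims.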
Logarithmic growth for finite-dimensional processes
• Finite-parameter processes (e.g. Markov chains)
• Similar to stochastic complexity (MDL)
$I_{pred}(T) \simeq \frac{\dim(\theta)}{2} \log T$
Power-law growth
• Fast growth is a signature of infinite-dimensional processes (e.g. speech)
• Power laws appear when the interactions/correlations have long range.
$I_{pred}(T) \propto T^{\alpha}, \quad \alpha < 1$
Efficient predictors: Prediction Suffix Trees
Deep sparse trees do better than full trees [Ron, Singer, Tishby, 1994–95]
– Most of the past is irrelevant for the future!
– The “relevant components” can typically be extracted efficiently from small samples, much smaller than required for reliable entropy estimation!
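A minimal sketch of the idea behind Prediction Suffix Trees: grow a suffix only when it changes the next-symbol prediction enough relative to its parent context, so the tree is deep only where the past actually matters (the growth criterion and thresholds here are simplified assumptions, not the exact learning algorithm of the paper):

```python
import math
import random
from collections import defaultdict

def build_pst(seq, max_depth=5, min_count=50, kl_thresh=0.05):
    """Deep-but-sparse context tree in the spirit of Ron, Singer & Tishby:
    a context (suffix of the past) is stored only if it changes the
    next-symbol prediction by more than kl_thresh bits."""
    counts = defaultdict(lambda: defaultdict(int))
    for d in range(max_depth + 1):
        for i in range(d, len(seq)):
            counts[tuple(seq[i-d:i])][seq[i]] += 1

    def dist(ctx):  # Laplace-smoothed next-symbol distribution, binary alphabet
        c, tot = counts[ctx], sum(counts[ctx].values())
        return {s: (c.get(s, 0) + 1) / (tot + 2) for s in (0, 1)}

    tree, frontier = {(): dist(())}, [()]
    while frontier:
        ctx = frontier.pop()
        if len(ctx) == max_depth:
            continue
        for s in (0, 1):
            child = (s,) + ctx                 # one more symbol into the past
            if sum(counts[child].values()) < min_count:
                continue
            p = dist(child)
            q = tree.get(ctx, dist(ctx))       # prediction without the extension
            kl = sum(p[y] * math.log2(p[y] / q[y]) for y in (0, 1))
            if kl > kl_thresh:
                tree[child] = p                # store only informative contexts
            frontier.append(child)             # deeper suffixes may still matter
    return tree

def predict(tree, history):
    """Next-symbol distribution from the longest stored suffix of history."""
    for d in range(len(history), -1, -1):
        ctx = tuple(history[len(history) - d:])
        if ctx in tree:
            return tree[ctx]

# order-2 source: the next symbol copies the symbol 2 steps back w.p. 0.9
random.seed(3)
seq = [0, 1]
for _ in range(30_000):
    seq.append(seq[-2] if random.random() < 0.9 else 1 - seq[-2])
pst = build_pst(seq)
```

The learned tree stores only a handful of depth-2 contexts rather than all $2 + 4 + \dots + 32$ possible ones, illustrating why the relevant part of the past can be found from far fewer samples than full entropy estimation needs.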
But WHAT – in the past – is predictive?
[Diagram: a signal W(t) with a past T-window $W^{(-)}$ and a future T-window $W^{(+)}$ around t = 0]
How much information is needed
for valuable behavior?
Bellman meets Shannon
Perception-Action-Cycles © 2009 Naftali Tishby
Richard Ernest Bellman (August 26, 1920 – March 19, 1984)
Claude Elwood Shannon (April 30, 1916 – February 24, 2001)
Value and Information parallels …

Value (Bellman):
• Agent and environment interact at time steps t = 0, 1, 2, …
• Agent observes the TRUE state at step t: $s_t \in S$
• produces an action at step t: $a_t \in A(s_t)$, with $\pi(a|s)$
• gets the resulting reward: $r_{t+1}$
• resulting next state: $s_{t+1}$, with $P^a_{ss'} = p(s'|s,a)$
Bellman equation for value:
$V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P^a_{ss'} \left[ R^a_{ss'} + V^\pi(s') \right]$
solved for $V^\pi(s)$ by DP, given $P^a_{ss'}$ and $R^a_{ss'}$.

[Diagram: Agent (internal knowledge) and Environment (complex external states), coupled by action/sensing $a_t$ and information gain $\Delta I_t$; the “state of knowledge” about the goal g lives on a simplex – the estimate of $p(g|s_t)$]

Information:
• Agent has a goal variable $g \in G$
• interacts with the environment at time steps t = 0, 1, 2, …
• estimates/infers an internal state $\hat{s}_t \in \hat{S}$, characterized by $p(\hat{s}|s_t)$, $p(g|\hat{s})$
• produces an action at step t: $a_t \in A(\hat{s})$, with $\pi(a|\hat{s})$
• gets/estimates an information gain: $\Delta I_t$
• resulting next world state: $s_{t+1}$, with $P^a_{ss'}$
Bellman equation for information:
$I(\hat{s}; g) = \sum_a \pi(a|\hat{s}) \sum_{\hat{s}'} P^a_{\hat{s}\hat{s}'} \left[ \Delta I^a_{\hat{s}\hat{s}'} + I(\hat{s}'; g) \right]$
solved for $I$ using DP and probabilistic inference.
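The value side of the parallel can be checked on a tiny MDP: iterative policy evaluation solves the Bellman equation by DP. A sketch with hypothetical transition and reward numbers; a discount factor is added to keep the fixed point finite:

```python
import numpy as np

# Tiny 2-state, 2-action MDP (hypothetical numbers).
P = np.zeros((2, 2, 2))                 # P[s, a, s'] = p(s'|s,a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                        # reward 1 for landing in state 1
gamma = 0.9                             # discount (added for convergence)
pi = np.full((2, 2), 0.5)               # a fixed uniform policy pi[s, a]

V = np.zeros(2)
for _ in range(500):                    # iterative policy evaluation (DP)
    # V(s) = sum_a pi(a|s) sum_s' P(s'|s,a) [R(s,a,s') + gamma*V(s')]
    V = np.einsum('sa,sax,sax->s', pi, P, R + gamma * V[None, None, :])
```

At the fixed point the Bellman residual vanishes, and state 1 (which both actions tend to keep rewarding) has the higher value.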
Combining (future) Value and Information
In cases where information is free, we can maximize value irrespective of its information cost.
In general, however, we want to:
(1) reduce decision complexity (get home in the simplest way)
(2) maximize the environment information gain (e.g. with the coins)
(3) increase robustness to model fluctuations
All three can be obtained by combining the Information and Value equations.
Trading Value and (future) Information

The information along the cycle is $I = \Delta I_1 + \Delta I_2 + \cdots$, with per-step terms

$\Delta I_t = \log \frac{\pi(a_t \mid s_t)}{p(a_t)} + \mathbb{E}_{p(s_{t+1} \mid s_t, a_t)} \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})}$

We want: $\pi^* = \arg\min_\pi \left( I - \beta V \right)$, with $Q(s_t, a_t) = \mathbb{E}\left[ R^{a_t}_{s_t s_{t+1}} + V(s_{t+1}) \right]$. Defining the free energy $F(s;\beta) = I(s) - \beta V(s)$ gives the recursion

$F(s_t;\beta) = \min_\pi \sum_{a_t} \pi(a_t \mid s_t) \left\{ \log \frac{\pi(a_t \mid s_t)}{p(a_t)} + \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \left[ \log \frac{p(s_{t+1} \mid s_t, a_t)}{p(s_{t+1})} - \beta R^{a_t}_{s_t s_{t+1}} + F(s_{t+1};\beta) \right] \right\}$

or, in state-action form, $F(s_t, a_t;\beta) = \mathbb{E}\left[ \Delta I_t - \beta R^{a_t}_{s_t s_{t+1}} + F(s_{t+1}, a_{t+1};\beta) \right]$.
Information bounded RL
, '
, '
,define
the "optimal" (reward as ) transition probabilities
( '): ( ' | , )
(
( ')
,
and the state prior,
sufficient statist
exp , , )
exp , is the local pa
( , ,
i
rtition) ')
c
(
a
s s
a
s s
a
s
s s
p sq s s a q
Z
p s
Z s a
sR
Rs
a
p
Then the state-action free energy Bellman equation is:

$F(s_t, a_t;\beta) = \sum_{s_{t+1}} p(s_{t+1} \mid s_t, a_t) \left\{ \log \frac{p(s_{t+1} \mid s_t, a_t)}{q(s_{t+1} \mid s_t, a_t)} + \sum_{a_{t+1}} \pi(a_{t+1} \mid s_{t+1}) \left[ \log \frac{\pi(a_{t+1} \mid s_{t+1})}{p(a_{t+1})} + F(s_{t+1}, a_{t+1};\beta) \right] \right\}$
The desired optimal policy is (somewhat surprisingly):

$\pi(a \mid s) = \frac{p(a)}{Z(s)} \exp\left(-F(s,a;\beta)\right), \qquad Z(s) = \sum_a p(a) \exp\left(-F(s,a;\beta)\right)$

These 3 equations should be iterated till convergence for every state (like Blahut-Arimoto).
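A minimal numerical sketch of this Blahut-Arimoto-like alternation on a hypothetical 2-state, 2-action MDP. It is a simplification of the equations above: only the control-information (policy) cost is kept, the action prior is held fixed, and a discount keeps the free energy finite:

```python
import numpy as np

# Hypothetical MDP used only to illustrate the iteration.
P = np.zeros((2, 2, 2))                  # P[s, a, s'] = p(s'|s,a)
P[0, 0] = [0.9, 0.1]; P[0, 1] = [0.2, 0.8]
P[1, 0] = [0.5, 0.5]; P[1, 1] = [0.1, 0.9]
R = np.zeros((2, 2, 2)); R[:, :, 1] = 1.0   # reward for reaching state 1
beta, gamma = 5.0, 0.9                   # value/information trade-off; discount

rho = np.full(2, 0.5)                    # action prior p(a), held fixed here
F = np.zeros((2, 2))                     # state-action free energy F(s,a)
for _ in range(500):                     # alternate policy and free energy
    # policy from the free energy: pi(a|s) ∝ p(a) exp(-F(s,a))
    logits = np.log(rho)[None, :] - F
    pi = np.exp(logits - logits.max(axis=1, keepdims=True))
    pi /= pi.sum(axis=1, keepdims=True)
    # free-energy Bellman update (control-information cost only):
    # F(s,a) = sum_s' p(s'|s,a)[-beta*R + gamma*sum_a' pi(a'|s')(log(pi/p)+F(s',a'))]
    nxt = (pi * (np.log(pi / rho[None, :]) + F)).sum(axis=1)   # indexed by s'
    F = np.einsum('sax,sax->sa', P, -beta * R + gamma * nxt[None, None, :])
```

At large β the resulting policy approaches the greedy reward-maximizing one; as β → 0 it collapses onto the prior p(a), spending no control information.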
Biological evidence?
Auditory cortex encodes surprise
(with Eli Nelken and Jonathan Rubin)
The predictive bottleneck
[Figure: the predictive bottleneck information curve – optimal Predictive Power (bits) vs. Model Complexity (bits) – together with the optimal models found along the curve]
Information curve showing the optimal predictive information (surprise) as a function of the complexity of the internal model (memory bits) for the next-tone prediction of oddball sequences using a memory duration of 5 tones back.
Left: scatter plots of the neural responses to either ‘A’ (blue) or ‘B’ (red) against the surprise values calculated for a specific model. Dots mark the mean response at a given surprise level, and the error bars represent the 25th and 75th percentiles of the data. Right: (1) PSTH for stimulus ‘A’; each row is the averaged PSTH corresponding to a single point in the scatter plot, sorted from low to high surprise level. (2) PSTH for stimulus ‘B’. (3) Correlations for ‘A’ (as explained before). (4) Correlations for ‘B’.
The PSTH plots help show which part of the signal is correlated with the surprise. For instance, the onset seems fairly constant (and absent in the responses to ‘B’), whereas the sustained part seems strongly correlated with the surprise.
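The "surprise" that the neural responses are correlated against can be illustrated with a toy computation: $-\log_2 p(\text{tone} \mid \text{last 5 tones})$ under a simple sequentially-updated count model (the oddball statistics here are hypothetical, not the actual stimulus set):

```python
import numpy as np
from collections import defaultdict

rng = np.random.default_rng(0)
# oddball sequence: frequent 'A' tones with 10% rare 'B' oddballs
tones = ['B' if rng.random() < 0.1 else 'A' for _ in range(5000)]

counts = defaultdict(lambda: {'A': 1, 'B': 1})   # Laplace-smoothed per context
surprise = []
for i in range(5, len(tones)):
    ctx = tuple(tones[i-5:i])                    # memory of 5 tones back
    c = counts[ctx]
    surprise.append(-np.log2(c[tones[i]] / (c['A'] + c['B'])))
    c[tones[i]] += 1                             # update the model online

mean_A = np.mean([s for s, t in zip(surprise, tones[5:]) if t == 'A'])
mean_B = np.mean([s for s, t in zip(surprise, tones[5:]) if t == 'B'])
```

The rare ‘B’ tones carry several times more surprise than the frequent ‘A’ tones, which is the quantity the sustained part of the responses above appears to track.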
Conclusions
- Prediction complexity is governed by the “predictive information” of the environment – NOT by its complexity (entropy). The predictive information is a tiny (exponentially small) fraction of the full entropy of the environment.
- The brain can extract/learn efficient (good enough) predictors from small samples. There is no need to capture the full complexity of the world.
- There is accumulating experimental evidence that the brain represents predictive information (surprises).
- This view is in full agreement with the top-down (reverse hierarchy) models of perception and attention.
- Bellman’s “curse of dimensionality” is avoided (not solved) by the brain, because the brain’s main task is making predictions, not modeling the world.