A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN
INRA-MIA Toulouse. E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
Plan

1. Problem statement.
2. General approach.
3. Formulation using a dynamic model.
4. Reinforcement learning solution.
5. Experiments.
6. Conclusions.
PROBLEM STATEMENT: Adaptive sampling

[Figure: Markov random field over the variables X(1), ..., X(9)]

• Adaptive selection of the variables to observe, for the reconstruction of the random vector X = (X(1), ..., X(n))
• c(A, x(A)) → cost of observing the variables X(A) in state x(A)
• B → initial budget
• Observations are reliable

Problem: find a sampling policy that adaptively selects the variables to observe in order to:
optimize the quality of the reconstruction of X / respect the initial budget
DEFINITION: Adaptive sampling policy

For any sampling plans A1, ..., At and observations x(A1), ..., x(At), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A1, x(A1)), ..., (At, x(At))) = At+1.

-- Example --
A1 = δ1 = {8}, with observation x(A1) = 0
A2 = δ2((8, 0)) = {6, 7}, with observation x(A2) = (1, 3)
DEFINITIONS

[Figure: example trajectory on the 9-variable grid: A1 = δ1 = {8} with x(A1) = 0, then A2 = δ2((8, 0)) = {6, 7} with x(A2) = (1, 3)]

• Vocabulary:
  • A history {(A1, x(A1)), ..., (AH, x(AH))} is a trajectory followed when applying δ
  • τδ : the set of all reachable histories of δ
  • c(δ) ≤ B : the cost of any history respects the initial budget
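To make the definition above concrete, here is a minimal Python sketch of one way to represent and run an adaptive sampling policy under a budget. All names (Policy, run_policy, observe, cost) are illustrative choices, not notation from the talk.

```python
from typing import Callable, Dict, List, Set, Tuple

# A history is a list of (sampling plan, observed values) pairs.
History = List[Tuple[Set[int], Dict[int, int]]]

# An adaptive sampling policy maps the history so far to the next
# set of variables to observe (the empty set means "stop").
Policy = Callable[[History], Set[int]]

def run_policy(policy: Policy,
               observe: Callable[[Set[int]], Dict[int, int]],
               budget: float,
               cost: Callable[[Set[int], Dict[int, int]], float]) -> History:
    """Apply a policy until it stops or the budget B is exhausted."""
    history: History = []
    spent = 0.0
    while True:
        plan = policy(history)          # A_{t+1} = delta((A1,x(A1)),...,(At,x(At)))
        if not plan:
            break
        values = observe(plan)          # observations are reliable (no noise)
        step_cost = cost(plan, values)  # c(A, x(A)): cost may depend on the observed state
        if spent + step_cost > budget:  # c(delta) <= B: respect the initial budget
            break
        spent += step_cost
        history.append((plan, values))
    return history
```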
GENERAL APPROACH

1. Find a distribution ℙ that describes the phenomenon under study well.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
STATE OF THE ART

1. Distribution: X is a continuous random vector; ℙ is a multivariate Gaussian joint distribution.
2. Policy value: entropy-based criterion, kriging variance.
3. Resolution method: greedy algorithm.
OUR CONTRIBUTION

1. Distribution: X is a discrete random vector; ℙ is a Markov random field distribution.
2. Policy value: Maximum Posterior Marginals (MPM).
3. Resolution method: reinforcement learning.
Formulation using a dynamic model
An adapted framework for reinforcement learning
Summarize the knowledge on X in a random vector S of length n.

• Observing variables updates our knowledge on X → evolution of S.
• Example: s = (-1, ..., k, ..., -1): the i-th entry equals k when variable X(i) was observed in state k; unobserved variables are marked -1.

[Figure: trajectory s1 →A1→ s2 →A2→ s3 → ... → sH+1, with final reward U(sH+1)]
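A minimal sketch of this state encoding, assuming states are coded as non-negative integers and -1 marks an unobserved variable (indices are 0-based here, unlike the slides):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """All variables start unobserved, encoded by -1."""
    return -np.ones(n, dtype=int)

def update_state(s: np.ndarray, observations: dict) -> np.ndarray:
    """Observing X(i) in state k sets s[i] = k; other entries are unchanged."""
    s = s.copy()
    for i, k in observations.items():
        s[i] = k
    return s

# Example with n = 9: observe X at index 8 in state 0,
# then X at indices 6 and 7 in states 1 and 3.
s1 = initial_state(9)
s2 = update_state(s1, {8: 0})
s3 = update_state(s2, {6: 1, 7: 3})
```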
Reinforcement learning solution

Find the optimal policy: the Q-function

• Q*(st, At) = "the expected value of the history when starting in st, observing the variables X(At), and then following the optimal policy δ*"
• Compute Q* → compute δ*: in each state, the optimal policy observes the action with the largest Q*-value.
• How to compute Q*: classical solution (Q-learning, ...)

1. Initialize Q.
2. Simulate a history: at each step, take the greedy action with probability 1-ε and a random action with probability ε; update Q(s1, A1), Q(s2, A2), ... along the trajectory.
3. Repeat many times: Q converges to Q*.

[Figure: simulated trajectory s1 →A1→ s2 →A2→ s3 with final reward U(s3), and Q-updates at each step]
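A minimal tabular Q-learning sketch of the classical scheme above (ε-greedy simulation, incremental updates). The environment interface (simulate_episode_start, simulate_step, actions) is assumed, states must be hashable (e.g. tuples), and a tabular Q only fits very small problems — which motivates the linear approximation introduced next.

```python
import random
from collections import defaultdict

def q_learning(simulate_episode_start, simulate_step, actions,
               horizon, n_episodes=10000, alpha=0.1, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    simulate_episode_start() -> initial state s1 (hashable)
    simulate_step(s, a) -> (next state, reward)  # reward is U(s_{H+1}) at the end, 0 before
    actions(s) -> list of feasible sampling actions in state s
    """
    Q = defaultdict(float)                      # 1. Initialize Q
    for _ in range(n_episodes):                 # 2. Simulate histories, many times
        s = simulate_episode_start()
        for _t in range(horizon):
            acts = actions(s)
            if not acts:
                break
            if random.random() < eps:           # random action with probability eps
                a = random.choice(acts)
            else:                               # greedy action with probability 1 - eps
                a = max(acts, key=lambda b: Q[(s, b)])
            s_next, r = simulate_step(s, a)
            best_next = max((Q[(s_next, b)] for b in actions(s_next)), default=0.0)
            # Update Q(s, a) towards the one-step lookahead target
            Q[(s, a)] += alpha * (r + best_next - Q[(s, a)])
            s = s_next
    return Q                                    # Q converges towards Q*
```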
Alternative approach

• Linear approximation of the Q-function: Qt(s, A) ≈ Σi wt,i Φi(s, A)
• Choice of the functions Φi

LSDP Algorithm

• Linear approximation of the Q-function, with one weight vector wt per decision step t = 1, ..., H
• Compute the weights using "backward induction"
LSDP Algorithm: application to sampling

1. Computation of Φi(st, At)
2. Computation of the weights w1, ..., wH
3. Computation of the policy

• We fix |At| = 1 and use the approximation above.
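A sketch of the linear approximation and of the greedy policy it induces when |At| = 1; the feature map phi is left abstract, since the talk's specific choice of Φi is not reproduced here.

```python
import numpy as np

def q_linear(w_t: np.ndarray, phi, s: np.ndarray, a: int) -> float:
    """Linear Q-value at decision step t: Q_t(s, a) ~ sum_i w_t[i] * Phi_i(s, a)."""
    return float(np.dot(w_t, phi(s, a)))

def greedy_action(w_t: np.ndarray, phi, s: np.ndarray) -> int:
    """With |A_t| = 1, observe the single unobserved variable
    maximising the approximate Q-value."""
    candidates = [i for i in range(len(s)) if s[i] == -1]  # unobserved variables
    return max(candidates, key=lambda i: q_linear(w_t, phi, s, i))
```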
Experiments

• Regular grid with first-order neighbourhood.
• The X(i) are binary variables.
• ℙ is a Potts model with β = 0.5.
• Simple cost: observing each variable costs 1.

[Figure: regular grid of variables X(1), ..., X(9) with first-order neighbourhood]
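For reference, a minimal Gibbs sampler for this binary Potts model on a grid with first-order neighbourhood (a standard construction, not code from the talk):

```python
import numpy as np

def gibbs_potts(h, w, beta=0.5, n_sweeps=200, rng=None):
    """Sample a binary Potts field on an h x w grid (first-order neighbourhood)
    by Gibbs sampling: P(x) ~ exp(beta * sum_{i~j} 1[x_i = x_j])."""
    rng = rng or np.random.default_rng()
    x = rng.integers(0, 2, size=(h, w))
    for _ in range(n_sweeps):
        for i in range(h):
            for j in range(w):
                # Collect the first-order neighbours of site (i, j)
                neigh = []
                if i > 0: neigh.append(x[i - 1, j])
                if i < h - 1: neigh.append(x[i + 1, j])
                if j > 0: neigh.append(x[i, j - 1])
                if j < w - 1: neigh.append(x[i, j + 1])
                n1 = sum(neigh)           # neighbours in state 1
                n0 = len(neigh) - n1      # neighbours in state 0
                # Conditional distribution of x[i, j] given its neighbours
                p1 = np.exp(beta * n1)
                p0 = np.exp(beta * n0)
                x[i, j] = rng.random() < p1 / (p0 + p1)
    return x
```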
Experiments: compared policies

• Random policy
• BP-max heuristic: at each time step, the variable to observe is chosen from the max posterior marginals (computed by belief propagation)
• LSPI policy, a common reinforcement learning algorithm
• LSDP policy

• Policies are compared using an MPM-based reconstruction score.
Experiment: 100 variables (n=100)
[Figures: LSDP and BP-max policies after zero, one and two observations; each panel shows the action values of the LSDP policy and the max marginals]
Experiment: 100 variables, constrained moves

• Only moves to the second-order neighbourhood are allowed!
Experiment: 200 variables, different costs

Initial budget = 38

                              Random   BP-max   LSDP    Min Cost
Policy value                  60.27%   61.77%   64.8%   64.58%
Mean # of observed variables  26.65    19       27.3    38

• Cost repartition: [Figure: cost repartition for the Random, BP-max and LSDP policies]
Experiment: 100 variables, different costs

                              Random   BP-max   LSDP
Policy value                  65.3%    66.2%    67%
Mean # of observed variables  15.4     15.4     15.4
WR 1                          65.9%    65.4%    67%
WR 0                          61%      63.2%    64%

• Initial budget = 30
• Cost of 1 when the observed variable is in state 0
• Cost of 3 when the observed variable is in state 1
Conclusions

• An adapted framework for adaptive sampling of discrete random variables
• LSDP: a reinforcement learning approach for finding a near-optimal policy
  - Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem
  - "Off-line" computation of a near-optimal policy
  - New policies that outperform simple heuristics and a usual RL method
• Possible application: weed sampling in crop fields
THANK YOU!
Reconstruction of X(R) and trajectory value

A1 = δ1 = {8}, with observation x(A1) = 0
A2 = δ2((8, 0)) = {6, 7}, with observation x(A2) = (4, 2)

• Maximum Posterior Marginal (MPM) for the reconstruction of X(R): each unobserved variable X(i), i ∈ R, is set to the state maximising its posterior marginal given the observations.
• Quality of a trajectory: measured through the posterior marginals of the MPM reconstruction.
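A sketch of MPM reconstruction from posterior marginals, plus one plausible trajectory-quality score (the expected number of correctly reconstructed variables); the exact score used in the talk is not reproduced here, so treat trajectory_quality as an assumption.

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray) -> np.ndarray:
    """MPM estimate: for each variable i in R, pick the state with the
    largest posterior marginal P(X(i) = k | x(A_1), ..., x(A_H)).

    marginals: (|R|, K) array of posterior marginals over the K states.
    """
    return marginals.argmax(axis=1)

def trajectory_quality(marginals: np.ndarray) -> float:
    """Hypothetical quality score: expected number of correctly
    reconstructed variables, i.e. the sum of the winning marginals."""
    return float(marginals.max(axis=1).sum())
```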
LSDP Algorithm

• Linear approximation of the Q-function: Qt(s, A) ≈ Σi wt,i Φi(s, A)
• How to compute w1, ..., wH: backward induction on simulated trajectories.

[Figure: simulated trajectories s1 →a1→ s2 → ... → sH →aH→ sH+1, with final reward U(sH+1)]

• Step H: the final rewards U(sH+1) define a LINEAR SYSTEM whose solution is wH.
• Step H-1: given wH, a second LINEAR SYSTEM yields wH-1, and so on backwards down to w1.
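A sketch of the backward pass over the weights, assuming each wt is obtained by least-squares regression of step-t targets onto the features (the final rewards U(sH+1) at step H, the fitted next-step values at earlier steps). This on-trajectory simplification stands in for the talk's exact linear systems.

```python
import numpy as np

def lsdp_weights(trajectories, phi, n_features, horizon):
    """Fit one weight vector per decision step, backwards from t = H to t = 1.

    trajectories: list of histories [(s_1, a_1), ..., (s_H, a_H), U]
                  where U is the final reward U(s_{H+1}).
    phi(s, a):    feature vector of length n_features.
    """
    W = [np.zeros(n_features) for _ in range(horizon)]
    # Targets at step H are the final rewards U(s_{H+1})
    targets = np.array([traj[-1] for traj in trajectories], dtype=float)
    for t in reversed(range(horizon)):        # t = H-1, ..., 0 (0-based)
        X = np.stack([phi(*traj[t]) for traj in trajectories])
        # Solve the least-squares linear system X w_t ~ targets
        W[t], *_ = np.linalg.lstsq(X, targets, rcond=None)
        if t > 0:
            # Targets for step t-1: fitted value of the step-t state-action pairs
            targets = X @ W[t]
    return W
```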