A reinforcement learning algorithm for sampling design in Markov random fields
Mathieu BONNEAU, Nathalie PEYRARD, Régis SABBADIN
INRA-MIA Toulouse. E-mail: {mbonneau,peyrard,sabbadin}@toulouse.inra.fr
MSTGA, Bordeaux, 7 2012
Plan

1. Problem statement.
2. General approach.
3. Formulation using a dynamic model.
4. Reinforcement learning solution.
5. Experiments.
6. Conclusions.
PROBLEM STATEMENT: Adaptive sampling

[Figure: Markov random field over the variables X(1), ..., X(9)]

• Adaptive selection of the variables to observe, for the reconstruction of the random vector X = (X(1), ..., X(n))
• c(A, x(A)) → cost of observing the variables X(A) in state x(A)
• B → initial budget
• Observations are reliable

Problem: find a sampling policy that adaptively selects the variables to observe in order to:
optimize the quality of the reconstruction of X / respect the initial budget
DEFINITION: Adaptive sampling policy

For any sampling plans A1, ..., At and observations x(A1), ..., x(At), an adaptive sampling policy δ is a function giving the next variable(s) to observe:
δ((A1, x(A1)), ..., (At, x(At))) = At+1.

-- Example --
A1 = δ1 = {8}, with observation x(A1) = 0
A2 = δ2((8, 0)) = {6, 7}, with observation x(A2) = (1, 3)
DEFINITIONS

[Figure: example trajectory on the 9-variable grid: A1 = δ1 = {8} with x(A1) = 0, then A2 = δ2((8, 0)) = {6, 7} with x(A2) = (1, 3)]

• Vocabulary:
  • A history {(A1, x(A1)), ..., (AH, x(AH))} is a trajectory followed when applying δ
  • τδ : the set of all reachable histories of δ
  • c(δ) ≤ B : the cost of any history respects the initial budget
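To make the definition above concrete, here is a minimal Python sketch of one way to represent and run an adaptive sampling policy under a budget. All names (Policy, run_policy, observe, cost) are illustrative choices, not notation from the talk.

```python
from typing import Callable, Dict, List, Set, Tuple

# A history is a list of (sampling plan, observed values) pairs.
History = List[Tuple[Set[int], Dict[int, int]]]

# An adaptive sampling policy maps the history so far to the next
# set of variables to observe (the empty set means "stop").
Policy = Callable[[History], Set[int]]

def run_policy(policy: Policy,
               observe: Callable[[Set[int]], Dict[int, int]],
               budget: float,
               cost: Callable[[Set[int], Dict[int, int]], float]) -> History:
    """Apply a policy until it stops or the budget B is exhausted."""
    history: History = []
    spent = 0.0
    while True:
        plan = policy(history)          # A_{t+1} = delta((A1,x(A1)),...,(At,x(At)))
        if not plan:
            break
        values = observe(plan)          # observations are reliable (no noise)
        step_cost = cost(plan, values)  # c(A, x(A)): cost may depend on the observed state
        if spent + step_cost > budget:  # c(delta) <= B: respect the initial budget
            break
        spent += step_cost
        history.append((plan, values))
    return history
```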
GENERAL APPROACH

1. Find a distribution ℙ that describes the phenomenon under study well.
2. Define the value of an adaptive sampling policy.
3. Define an approximate resolution method for finding a near-optimal policy.
STATE OF THE ART

1. Distribution: X is a continuous random vector; ℙ is a multivariate Gaussian joint distribution.
2. Policy value: entropy-based criterion, kriging variance.
3. Resolution method: greedy algorithm.
OUR CONTRIBUTION

1. Distribution: X is a discrete random vector; ℙ is a Markov random field distribution.
2. Policy value: Maximum Posterior Marginals (MPM).
3. Resolution method: reinforcement learning.
Formulation using a dynamic model
An adapted framework for reinforcement learning
Summarize the knowledge on X in a random vector S of length n.

• Observing variables updates our knowledge on X → evolution of S.
• Example: s = (-1, ..., k, ..., -1): the i-th entry equals k when variable X(i) was observed in state k; unobserved variables are marked -1.

[Figure: trajectory s1 →A1→ s2 →A2→ s3 → ... → sH+1, with final reward U(sH+1)]
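A minimal sketch of this state encoding, assuming states are coded as non-negative integers and -1 marks an unobserved variable (indices are 0-based here, unlike the slides):

```python
import numpy as np

def initial_state(n: int) -> np.ndarray:
    """All variables start unobserved, encoded by -1."""
    return -np.ones(n, dtype=int)

def update_state(s: np.ndarray, observations: dict) -> np.ndarray:
    """Observing X(i) in state k sets s[i] = k; other entries are unchanged."""
    s = s.copy()
    for i, k in observations.items():
        s[i] = k
    return s

# Example with n = 9: observe X at index 8 in state 0,
# then X at indices 6 and 7 in states 1 and 3.
s1 = initial_state(9)
s2 = update_state(s1, {8: 0})
s3 = update_state(s2, {6: 1, 7: 3})
```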
Reinforcement learning solution

Find the optimal policy: the Q-function

• Q*(st, At) = "the expected value of the history when starting in st, observing the variables X(At), and then following the optimal policy δ*"
• Compute Q* → compute δ*: in each state, the optimal policy observes the action with the largest Q*-value.
• How to compute Q*: classical solution (Q-learning, ...)

1. Initialize Q.
2. Simulate a history: at each step, take the greedy action with probability 1-ε and a random action with probability ε; update Q(s1, A1), Q(s2, A2), ... along the trajectory.
3. Repeat many times: Q converges to Q*.

[Figure: simulated trajectory s1 →A1→ s2 →A2→ s3 with final reward U(s3), and Q-updates at each step]
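A minimal tabular Q-learning sketch of the classical scheme above (ε-greedy simulation, incremental updates). The environment interface (simulate_episode_start, simulate_step, actions) is assumed, states must be hashable (e.g. tuples), and a tabular Q only fits very small problems — which motivates the linear approximation introduced next.

```python
import random
from collections import defaultdict

def q_learning(simulate_episode_start, simulate_step, actions,
               horizon, n_episodes=10000, alpha=0.1, eps=0.1):
    """Tabular Q-learning with an epsilon-greedy behaviour policy.

    simulate_episode_start() -> initial state s1 (hashable)
    simulate_step(s, a) -> (next state, reward)  # reward is U(s_{H+1}) at the end, 0 before
    actions(s) -> list of feasible sampling actions in state s
    """
    Q = defaultdict(float)                      # 1. Initialize Q
    for _ in range(n_episodes):                 # 2. Simulate histories, many times
        s = simulate_episode_start()
        for _t in range(horizon):
            acts = actions(s)
            if not acts:
                break
            if random.random() < eps:           # random action with probability eps
                a = random.choice(acts)
            else:                               # greedy action with probability 1 - eps
                a = max(acts, key=lambda b: Q[(s, b)])
            s_next, r = simulate_step(s, a)
            best_next = max((Q[(s_next, b)] for b in actions(s_next)), default=0.0)
            # Update Q(s, a) towards the one-step lookahead target
            Q[(s, a)] += alpha * (r + best_next - Q[(s, a)])
            s = s_next
    return Q                                    # Q converges towards Q*
```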
Alternative approach

• Linear approximation of the Q-function: Qt(s, A) ≈ Σi wt,i Φi(s, A)
• Choice of the functions Φi

LSDP Algorithm

• Linear approximation of the Q-function, with one weight vector wt per decision step t = 1, ..., H
• Compute the weights using "backward induction"
LSDP Algorithm: application to sampling

1. Computation of Φi(st, At)
2. Computation of the weights w1, ..., wH
3. Computation of the policy

• We fix |At| = 1 and use the approximation above.
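A sketch of the linear approximation and of the greedy policy it induces when |At| = 1; the feature map phi is left abstract, since the talk's specific choice of Φi is not reproduced here.

```python
import numpy as np

def q_linear(w_t: np.ndarray, phi, s: np.ndarray, a: int) -> float:
    """Linear Q-value at decision step t: Q_t(s, a) ~ sum_i w_t[i] * Phi_i(s, a)."""
    return float(np.dot(w_t, phi(s, a)))

def greedy_action(w_t: np.ndarray, phi, s: np.ndarray) -> int:
    """With |A_t| = 1, observe the single unobserved variable
    maximising the approximate Q-value."""
    candidates = [i for i in range(len(s)) if s[i] == -1]  # unobserved variables
    return max(candidates, key=lambda i: q_linear(w_t, phi, s, i))
```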
Experiments

• Regular grid with first-order neighbourhood.
• The X(i) are binary variables.
• ℙ is a Potts model with β = 0.5.
• Simple cost: observing each variable costs 1.

[Figure: regular grid of variables X(1), ..., X(9) with first-order neighbourhood]
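For reference, a minimal Gibbs sampler for this binary Potts model on a grid with first-order neighbourhood (a standard construction, not code from the talk):

```python
import numpy as np

def gibbs_potts(h, w, beta=0.5, n_sweeps=200, rng=None):
    """Sample a binary Potts field on an h x w grid (first-order neighbourhood)
    by Gibbs sampling: P(x) ~ exp(beta * sum_{i~j} 1[x_i = x_j])."""
    rng = rng or np.random.default_rng()
    x = rng.integers(0, 2, size=(h, w))
    for _ in range(n_sweeps):
        for i in range(h):
            for j in range(w):
                # Collect the first-order neighbours of site (i, j)
                neigh = []
                if i > 0: neigh.append(x[i - 1, j])
                if i < h - 1: neigh.append(x[i + 1, j])
                if j > 0: neigh.append(x[i, j - 1])
                if j < w - 1: neigh.append(x[i, j + 1])
                n1 = sum(neigh)           # neighbours in state 1
                n0 = len(neigh) - n1      # neighbours in state 0
                # Conditional distribution of x[i, j] given its neighbours
                p1 = np.exp(beta * n1)
                p0 = np.exp(beta * n0)
                x[i, j] = rng.random() < p1 / (p0 + p1)
    return x
```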
Experiments: compared policies

• Random policy
• BP-max heuristic: at each time step, the variable to observe is chosen from the max posterior marginals (computed by belief propagation)
• LSPI policy, a common reinforcement learning algorithm
• LSDP policy

• Policies are compared using an MPM-based reconstruction score.
Experiment: 100 variables (n=100)
[Figures: LSDP and BP-max policies after zero, one and two observations; each panel shows the action values of the LSDP policy and the max marginals]
Experiment: 100 variables, constrained moves

• Only moves to the second-order neighbourhood are allowed!
Experiment: 200 variables, different costs

Initial budget = 38

                              Random   BP-max   LSDP    Min Cost
Policy value                  60.27%   61.77%   64.8%   64.58%
Mean # of observed variables  26.65    19       27.3    38

• Cost repartition: [Figure: cost repartition for the Random, BP-max and LSDP policies]
Experiment: 100 variables, different costs

                              Random   BP-max   LSDP
Policy value                  65.3%    66.2%    67%
Mean # of observed variables  15.4     15.4     15.4
WR 1                          65.9%    65.4%    67%
WR 0                          61%      63.2%    64%

• Initial budget = 30
• Cost of 1 when the observed variable is in state 0
• Cost of 3 when the observed variable is in state 1
Conclusions

• An adapted framework for adaptive sampling of discrete random variables
• LSDP: a reinforcement learning approach for finding a near-optimal policy
  - Adaptation of a common reinforcement learning algorithm to the adaptive sampling problem
  - "Off-line" computation of a near-optimal policy
  - New policies that outperform simple heuristics and a usual RL method
• Possible application: weed sampling in crop fields
THANK YOU!
Reconstruction of X(R) and trajectory value

A1 = δ1 = {8}, with observation x(A1) = 0
A2 = δ2((8, 0)) = {6, 7}, with observation x(A2) = (4, 2)

• Maximum Posterior Marginal (MPM) for the reconstruction of X(R): each unobserved variable X(i), i ∈ R, is set to the state maximising its posterior marginal given the observations.
• Quality of a trajectory: measured through the posterior marginals of the MPM reconstruction.
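A sketch of MPM reconstruction from posterior marginals, plus one plausible trajectory-quality score (the expected number of correctly reconstructed variables); the exact score used in the talk is not reproduced here, so treat trajectory_quality as an assumption.

```python
import numpy as np

def mpm_reconstruction(marginals: np.ndarray) -> np.ndarray:
    """MPM estimate: for each variable i in R, pick the state with the
    largest posterior marginal P(X(i) = k | x(A_1), ..., x(A_H)).

    marginals: (|R|, K) array of posterior marginals over the K states.
    """
    return marginals.argmax(axis=1)

def trajectory_quality(marginals: np.ndarray) -> float:
    """Hypothetical quality score: expected number of correctly
    reconstructed variables, i.e. the sum of the winning marginals."""
    return float(marginals.max(axis=1).sum())
```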
LSDP Algorithm

• Linear approximation of the Q-function: Qt(s, A) ≈ Σi wt,i Φi(s, A)
• How to compute w1, ..., wH: backward induction on simulated trajectories.

[Figure: simulated trajectories s1 →a1→ s2 → ... → sH →aH→ sH+1, with final reward U(sH+1)]

• Step H: the final rewards U(sH+1) define a LINEAR SYSTEM whose solution is wH.
• Step H-1: given wH, a second LINEAR SYSTEM yields wH-1, and so on backwards down to w1.
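A sketch of the backward pass over the weights, assuming each wt is obtained by least-squares regression of step-t targets onto the features (the final rewards U(sH+1) at step H, the fitted next-step values at earlier steps). This on-trajectory simplification stands in for the talk's exact linear systems.

```python
import numpy as np

def lsdp_weights(trajectories, phi, n_features, horizon):
    """Fit one weight vector per decision step, backwards from t = H to t = 1.

    trajectories: list of histories [(s_1, a_1), ..., (s_H, a_H), U]
                  where U is the final reward U(s_{H+1}).
    phi(s, a):    feature vector of length n_features.
    """
    W = [np.zeros(n_features) for _ in range(horizon)]
    # Targets at step H are the final rewards U(s_{H+1})
    targets = np.array([traj[-1] for traj in trajectories], dtype=float)
    for t in reversed(range(horizon)):        # t = H-1, ..., 0 (0-based)
        X = np.stack([phi(*traj[t]) for traj in trajectories])
        # Solve the least-squares linear system X w_t ~ targets
        W[t], *_ = np.linalg.lstsq(X, targets, rcond=None)
        if t > 0:
            # Targets for step t-1: fitted value of the step-t state-action pairs
            targets = X @ W[t]
    return W
```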