BioSystems 117 (2014) 1–9

Amoeba-inspired Tug-of-War algorithms for exploration–exploitation dilemma in extended Bandit Problem

Masashi Aono (a,b,*), Song-Ju Kim (c), Masahiko Hara (d), Toshinori Munakata (e)

a Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan
b PRESTO, Japan Science and Technology Agency, 4-1-8 Honcho, Kawaguchi-shi, Saitama 332-0012, Japan
c WPI Center for Materials Nanoarchitectonics (MANA), National Institute for Materials Science (NIMS), 1-1 Namiki, Tsukuba, Ibaraki 305-0044, Japan
d Department of Electronic Chemistry, Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology, 4259 Nagatsuta, Midori-ku, Yokohama 226-8503, Japan
e Computer and Information Science Department, Cleveland State University, Cleveland, OH 44115, USA

* Corresponding author at: Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1 Ookayama, Meguro-ku, Tokyo 152-8550, Japan. Tel.: +81 357342854; fax: +81 357343416. E-mail address: [email protected] (M. Aono).

Article history: received 23 March 2013; received in revised form 18 December 2013; accepted 21 December 2013.

Keywords: natural computing; multi-armed Bandit Problem; decision making; resource allocation; Physarum polycephalum

Abstract

The true slime mold Physarum polycephalum, a single-celled amoeboid organism, is capable of efficiently allocating a constant amount of intracellular resource to the pseudopod-like branches that best fit an environment in which dynamic light stimuli are applied. Inspired by this resource allocation process, the authors formulated a concurrent search algorithm, called the Tug-of-War (TOW) model, for maximizing the profit in the multi-armed Bandit Problem (BP). A player (gambler) of the BP must decide as quickly and accurately as possible which of N slot machines to invest in, and thus faces an "exploration–exploitation dilemma": a trade-off between the speed and accuracy of the decision making, which are conflicting objectives. The TOW model maintains a constant intracellular resource volume while collecting environmental information by concurrently expanding and shrinking its branches. The conservation law entails a nonlocal correlation among the branches, i.e., a volume increment in one branch is immediately compensated by volume decrement(s) in the other branch(es). Owing to this nonlocal correlation, the TOW model can manage the dilemma efficiently. In this study, we extend the TOW model to apply it to a generalized variant of BP, the Extended Bandit Problem (EBP), the problem of selecting the best M-tuple of the N machines. We demonstrate that the extended TOW model exhibits better performances for 2-tuple-3-machine and 2-tuple-4-machine instances of EBP than extended versions of two well-known algorithms for BP, the ε-Greedy and SoftMax algorithms, particularly in terms of the short-term decision-making capability that is essential for the survival of the amoeba in a hostile environment.

© 2013 Elsevier Ireland Ltd. All rights reserved. http://dx.doi.org/10.1016/j.biosystems.2013.12.007

1. Introduction

The speed and accuracy of decision making are crucial but conflicting objectives for resource-limited gamblers and organisms striving to survive in uncertain environments. The multi-armed Bandit Problem (BP) (Sutton and Barto, 1998), the problem of finding the most rewarding machine among N slot machines,[1] is a good example of why and how the difficulty of decision making must be overcome in real-world situations. Suppose each machine i emits a reward, for example, a coin,[2] with an individual probability p_i, but no player knows the reward probabilities of the machines in advance. A player tries to maximize the total reward obtained after playing the machines for a certain number of trials. The player needs to drop coins into a bank of machines to correctly evaluate the best machine, but has to complete the evaluation quickly to minimize the waste. Thus, the player confronts the "exploration–exploitation dilemma" created by incompatible demands: one either "exploits" the rewards obtainable with already collected knowledge or "explores" new alternatives for acquiring higher rewards at some risk. How can we optimize these conflicting objectives? The solution is to determine the optimal strategy, i.e., an efficient algorithm for selecting the machine that yields maximum rewards by referring to past experiences.

[1] Although the original definition of BP is stated with the term "bandit," here we explain BP using "N slot machines" instead of an "N-armed bandit" because a slot machine is equivalent to a one-armed bandit.
[2] In this study, we deal with a simplified variant of the general BP. We assume that all slot machines return a uniform reward (i.e., at most one coin per play), but their probabilities of emitting the reward differ.

Fig. 1. (a) A unicellular amoeboid organism, the true slime mold P. polycephalum (scale bar = 7 mm); (b) a gold-coated plastic chip resting on an agar plate (scale bar = 7 mm). The organism remains inside the chip where the agar is exposed because of its aversion to metal surfaces. In the absence of light stimulation, the spherically shaped organism (top) flattens to elongate its four branches (bottom) while keeping its total volume almost constant; (c) schematic illustration of the body architecture of the organism and the experimental setup for applying dynamic light stimuli. The intracellular sol (green) flows along the pressure difference, i.e., the sol is absorbed during relaxation of the sponge-like gel layer (yellow) and is squeezed out of the gel layer when it contracts. Light stimuli (blue), which enhance the contraction tendency of the gel layer and induce the illuminated branches to withdraw, are updated at each time interval as the organism changes its shape. A digital image of the shape of the organism, taken by a video camera (VC), is digitally processed by a personal computer (PC) to project a high-intensity illumination pattern (a visible white-light image) using a commercial projector (PJ). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

Many algorithms for tree searches are subject to this dilemma in practical situations. Thus, efficient algorithms for BP are useful for a wide range of applications requiring powerful tree-search capability. We believe that living organisms generally encounter a similar trade-off between the speed and accuracy of their decision making and have developed efficient methods to overcome the dilemma for their survival in an unknown world.

A plasmodium of the true slime mold Physarum polycephalum (Fig. 1a) is an amoeboid, multi-nucleated, unicellular organism that has been studied actively in terms of its sophisticated computational capabilities. Nakagaki et al. showed that the organism is capable of connecting the optimal routes among food sources, despite the absence of a central nervous system (Nakagaki et al., 2000). This capability can be applied to solving some geometric path-planning problems (Tero et al., 2006, 2010; Nakagaki et al., 2000). As shown in Fig. 1b, when the organism is placed in a four-lane stellate chip resting on an agar plate, it elongates its four terminal branches and changes its shape by expanding or shrinking the branches. Although the shape of the organism can deform arbitrarily, its total volume remains almost constant during our experimental observations. Thus, the shape-changing behavior of the organism can be considered a process for allocating a constant amount of intracellular resource to its branches.

Fig. 1c shows a schematic illustration of the architecture of the body of the organism (Kessler, 1982). In the body, the intracellular sol (resource) shuttles through tubular channels (branches), while the extracellular gel layer, like a sponge, rhythmically oscillates its contraction tension to squeeze and absorb the sol. Depending on the phase difference of the oscillation, the contraction tension of the gel layer varies from site to site, and the sol flows along the pressure difference (gradient) produced by this difference in local contraction tension. Each branch expands further as the sol influx raises the growth rate of the volume of the branch.

The branch of the organism shrinks when illuminated by visible light. The light stimuli are believed to enhance the contraction tendency of the stimulated gel layer and intensify the sol efflux (extrusion) from the illuminated branch. This negative phototactic behavior enables us to make the organism withdraw its branches from illuminated lanes and expand toward non-illuminated lanes.

By introducing an optical feedback system (Fig. 1c), which automatically updates the illumination condition with changes in the shape of the organism, Aono et al. created a biocomputer that exploits the capability of the organism to search for solutions to combinatorial optimization problems (Aono and Gunji, 2003; Aono et al., 2007; Aono and Hara, 2008). The amoeba-based computer can be used to solve the N-city traveling salesman problem (TSP) when we place the organism in an N²-lane stellate chip and update the illumination condition according to a recurrent neural network model (Hopfield and Tank, 1986). The optimal solution is obtained when the organism succeeds in exclusively elongating the correct combination of the N branches, maximizing its body area for maximal nutrient absorption from the agar plate while minimizing the risk of being illuminated. The amoeba-based computer showed a high probability of deriving the optimal solution to the TSP (Aono et al., 2009a; Zhu et al., 2013). These results indicate the optimization capability of the organism to efficiently allocate the intracellular resource to the elongated branches that are most adaptive to the dynamic illumination condition.

To extract the physical essence of the resource allocation process of the amoeba, Kim et al. formulated a discrete time-state dynamical system, called the Tug-of-War (TOW) model (Kim et al., 2009, 2010a,b, 2011; Kim and Aono, 2014). As in the illustration shown in Fig. 1c, the TOW model is a star network consisting of amoeba-like branches connected to a hub node, where the total sum of the volumes is kept constant.

The TOW model can be used as an algorithm to solve BP. In the TOW model, branch i plays machine i when the volume displacement of branch i is positive. The positive branch i is stimulated by light with probability 1 − p_i, i.e., it is "punished" instead of being "rewarded" with probability p_i. Namely, BP is translated equivalently into the problem of finding the least frequently punished lane. This problem setting can be compared to a situation created in the amoeba-based computer, in which the amoeba changes its shape to maximize the energy (nutrient) acquisition from the agar plate by elongating the most appropriate branches, those that minimize the probability of being illuminated. The amoeba needs to elongate its branches in the least frequently illuminated lanes because the branches consume energy for their withdrawal movements when illuminated. To correctly evaluate the least frequently illuminated lanes, the amoeba has to elongate its branches in all the lanes. However, at the same time, the amoeba has to complete the evaluation quickly to minimize the energy consumption caused by the elongation movements. Thus, the amoeba encounters the exploration–exploitation dilemma, which arises as the difficulty of achieving maximal energy acquisition with minimal energy consumption.


The TOW model is a concurrent algorithm in which the branches play more than one machine simultaneously at each time step, whereas in any sequential algorithm the player is allowed to play only one machine during each trial, according to the original definition of BP. To make unbiased comparisons between the performances of sequential and concurrent algorithms, if a concurrent algorithm plays Y (≤ N) machines at a time, its score should be evaluated as the result obtained after Y trials. Even under this comparison guideline, the TOW model was evaluated as more efficient and adaptive for 2-machine and 3-machine instances than well-known sequential algorithms such as the modified ε-Greedy algorithm and the modified SoftMax algorithm (Kim et al., 2009, 2010a,b).

In our previous studies on amoeba-based computing for the N-city TSP, the problem was to find the optimal combination of N branches out of N² branches, and the organism succeeded in elongating the optimal branches with a high probability (Aono et al., 2009a; Zhu et al., 2013). We expect that the TOW model can be extended to predict the elongation of more than one branch efficiently and can be used to explore the essential dynamics that produce the problem-solving capability of the organism. In this study, we revise the settings of BP and consider a new problem termed the "Extended Bandit Problem (EBP)," the problem of finding the optimal M-tuple out of N machines.

EBP represents a common situation in which one should determine the optimal strategy for allocating resource to a specified number of options simultaneously. One example occurs in internet advertising. An ad agent should maximize the total click-through count of the banner advertisements of N companies, but a targeted website has only M spaces in which banners can be placed simultaneously. The agent, therefore, should try many combinations of M banners and find the maximally clicked one. Another example of EBP is the optimization of the usability of telecommunication lines in terms of cognitive radio technology. Suppose there are N lines that are preferentially allocated to licensed users, where each line is assumed to be occupied with a fixed probability. To make effective use of communication resources, cognitive radio technology allows public access to unoccupied lines for M unlicensed users. In this case, the problem is to find the combination of M lines that are the least frequently occupied. Thus, efficient algorithms for EBP will be useful for many real-world applications.

This paper is organized as follows. In the next section, we review the definition of the original TOW model, referred to as TOW1, after introducing two well-known sequential strategies for the original BP, i.e., the modified ε-Greedy algorithm (EGR1) and the modified SoftMax algorithm (SMX1). Then we extend these three algorithms so that they can be applied to EBP. The extended algorithms are referred to as TOW2, EGR2, and SMX2, respectively. In Section 3, we first show the results of performance comparisons among TOW1, EGR1, and SMX1 for 3-machine and 4-machine instances of BP. Then we compare the performances of TOW2, EGR2, and SMX2 for 2-tuple-3-machine and 2-tuple-4-machine instances of EBP. TOW2 exhibits better short-term decision capability than EGR2 and SMX2. We conclude this paper by discussing some implications of the results from biological and physical perspectives.

2. Models

2.1. Multi-armed Bandit Problem

The multi-armed Bandit Problem (BP) is stated as follows. Suppose there are N slot machines, and their probabilities of emitting a reward are independent and unknown to a player. Let I = {1, 2, ..., N} be the set of all slot machines and p_i ∈ [0.0, 1.0] be the reward probability of machine i ∈ I. Here we assume that all the machines release at most a unit of reward, for example, a coin. The player should maximize the total sum of rewards obtained after playing the machines for a certain number of trials. Thus, the problem is to find, as quickly and accurately as possible, the optimal machine i* whose reward probability is maximum, i.e., i* = argmax_{i∈I} {p_i}.

BP was originally described by Robbins (1952), although the same problem, in essence, was studied by Thompson (1933). A different version of the Bandit Problem has also been studied, in which the reward distributions are assumed to be known to the player; in this version, the optimal strategy is known only for a limited class of problems (Gittins and Jones, 1974; Gittins, 1979). In the original version, a popular measure of the performance of an algorithm is the "regret," i.e., the expected loss of rewards caused by not selecting the correct machine at all times. Lai and Robbins first showed that the regret has to increase at least logarithmically in the number of selections (Lai and Robbins, 1985). They defined the condition that an optimal strategy must satisfy asymptotically. However, the computation of their algorithm is generally difficult because of the Kullback–Leibler divergence. Agrawal proposed algorithms in which the index can be expressed as a simple function of the total reward obtained from a machine (Agrawal, 1995). These algorithms are considerably easier to compute than those developed by Lai and Robbins; however, the regret retains the asymptotic logarithmic behavior, albeit with a larger leading constant. Auer et al. proposed a simple algorithm, called Upper Confidence Bound 1 (UCB1), that achieves logarithmic regret uniformly over time, rather than only asymptotically (Auer et al., 2002). In addition, they proved that a family of modified ε-Greedy algorithms (the ε-decreasing algorithm) also achieves logarithmic regret. Vermorel and Mohri concluded that the most naive approach, the modified ε-Greedy algorithm with a carefully chosen parameter, is the best (Vermorel and Mohri, 2005). It is also known that the performance of the modified SoftMax algorithm (the decreasing SoftMax algorithm) is comparable to that of the modified ε-Greedy algorithm (Vermorel and Mohri, 2005). Therefore, we evaluate these two algorithms in the performance comparisons of this study.
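To make the setting concrete, the following minimal Python sketch simulates the simplified BP described above (the paper defines no code; the class and its names are ours). Machine i pays at most one coin per play, with a fixed probability p_i unknown to the player:

    import random

    class BanditEnvironment:
        """Simplified BP of Section 2.1: machine i pays one coin with probability p[i]."""
        def __init__(self, p, seed=None):
            self.p = list(p)                   # reward probabilities, unknown to the player
            self.rng = random.Random(seed)
        def play(self, i):
            # r_i^t = 1 if rewarded, 0 otherwise (cf. Eq. (2) below)
            return 1 if self.rng.random() < self.p[i] else 0

    env = BanditEnvironment([0.2, 0.5, 0.8])   # the P3E instance used in Section 3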

2.1.1. EGR1: modified ε-Greedy algorithm

The greedy algorithm is one of the most popular sequential algorithms for solving BP (Sutton and Barto, 1998). In this algorithm, a player switches between "random exploration" and "greedy exploitation" in a probabilistic manner.

We write u_i^t = 1 when machine i is played at time t, i.e., when the player drops a coin into machine i:

u_i^t = 1 (if machine i is played), 0 (otherwise).   (1)

When the player obtains a reward released from machine i, we write r_i^t = 1:

r_i^t = 1 (if rewarded), 0 (otherwise).   (2)

The player accumulates past experiences and calculates the estimates g_i^t ∈ R+ for all the machines:

g_i^t = R_i^t / U_i^t,   (3)

U_i^t = U_i^{t−1} + u_i^t,   (4)

R_i^t = R_i^{t−1} + r_i^t.   (5)


The player randomly "explores" which machine to play with probability ε or performs a "greedy" action with probability 1 − ε. In the greedy mode, the player selects the known best machine i, the one with the largest estimate, i.e., i = argmax_{i∈I} {g_i^t}. When machine i is played (u_i^t = 1), the other machines cannot be played (u_k^t = 0 for all k ∈ I \ {i}).

In the original ε-Greedy algorithm, ε ∈ [0.0, 1.0] is constant. However, in this study, we use the time-dependent ε_t given as follows:

ε_t = 1 / (1 + λ·t),   (6)

where λ is the decay rate.[3] Hereafter, we refer to this modified ε-Greedy algorithm as EGR1.

[3] Instead of ε_t = min{1, 1/(λ·t)}, which is well known as the "ε-decreasing algorithm," we used this ε_t for simplicity.
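Eqs. (1)–(6) translate directly into a short implementation. The following sketch is our own code (with lam standing for the decay rate λ); it keeps the counters U_i^t and R_i^t and switches between exploration and greedy exploitation:

    import random

    class EGR1:
        """Modified epsilon-Greedy algorithm: eps_t = 1 / (1 + lam * t), Eq. (6)."""
        def __init__(self, n, lam, seed=None):
            self.U = [0] * n    # U_i^t: play counts, Eq. (4)
            self.R = [0] * n    # R_i^t: reward sums, Eq. (5)
            self.lam = lam
            self.rng = random.Random(seed)
        def estimates(self):
            # g_i^t = R_i^t / U_i^t, Eq. (3); machines never played default to 0
            return [r / u if u > 0 else 0.0 for r, u in zip(self.R, self.U)]
        def select(self, t):
            # t = 1, 2, ... is the trial index
            eps_t = 1.0 / (1.0 + self.lam * t)
            if self.rng.random() < eps_t:                  # random exploration
                return self.rng.randrange(len(self.U))
            g = self.estimates()
            return max(range(len(g)), key=g.__getitem__)   # greedy exploitation
        def update(self, i, reward):
            self.U[i] += 1
            self.R[i] += reward

    # Example run (with the BanditEnvironment sketch from Section 2.1):
    #   agent = EGR1(n=3, lam=0.25)
    #   for t in range(1, 201):
    #       i = agent.select(t)
    #       agent.update(i, env.play(i))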

2.1.2. SMX1: modified SoftMax algorithm

The SoftMax algorithm is another well-known sequential algorithm for BP. Some studies have reported that it is the best algorithm in the context of decision making (Daw et al., 2006; Cohen et al., 2007).

In this algorithm, the player always selects which machine to play in a probabilistic manner. The probability of selecting machine i, e_i^t ∈ [0.0, 1.0], is given by the following Boltzmann distribution:

e_i^t = exp(β·g_i^t) / Σ_{k∈I} exp(β·g_k^t),   (7)

where β is the growth rate, and the estimate g_i^t, defined by Eq. (3), is the same as the one used in EGR1. Note that Σ_{i∈I} e_i^t = 1.

Similar to ε in EGR1, β was modified to a time-dependent form in this study, as follows:

β_t = β·t,   (8)

where β = 0 corresponds to random selection and β → ∞ corresponds to a greedy action. We call this modified SoftMax algorithm SMX1.
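A corresponding sketch of SMX1 (again our own code): the selection probabilities of Eq. (7) are sampled directly, with the rate scaled as β_t = β·t per Eq. (8):

    import math
    import random

    class SMX1:
        """Modified SoftMax algorithm with beta_t = beta * t, Eq. (8)."""
        def __init__(self, n, beta, seed=None):
            self.U = [0] * n
            self.R = [0] * n
            self.beta = beta
            self.rng = random.Random(seed)
        def select(self, t):
            g = [r / u if u > 0 else 0.0 for r, u in zip(self.R, self.U)]  # Eq. (3)
            beta_t = self.beta * t
            w = [math.exp(beta_t * gi) for gi in g]    # Boltzmann weights, Eq. (7)
            x = self.rng.random() * sum(w)
            for i, wi in enumerate(w):                 # sample machine i with e_i^t
                x -= wi
                if x <= 0:
                    return i
            return len(w) - 1
        def update(self, i, reward):
            self.U[i] += 1
            self.R[i] += reward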

2.1.3. TOW1: Tug-of-War model

The original Tug-of-War model, TOW1, represents the amoeba's body as a star network consisting of N terminal branches connected to a hub node (Fig. 1c). For each branch i ∈ I at time t, let x_i^t be the displacement of the volume from an arbitrarily assumed normal value, where that of the hub node is denoted by x_0^t.

We consider that branch i pulls the lever of slot machine i when making a positive volume displacement, θ(x_i^t) = 1, where θ is a step function:

θ(x) = 1 (if x > 0), 0 (otherwise).   (9)

Because the amoeba is allowed to play up to N machines at a time t, there exist 2^N possible moves, which are included in the set {⟨θ(x_1), θ(x_2), ..., θ(x_N)⟩ | θ(x_i) ∈ {0, 1}}.

If machine i is played, branch i is stimulated by light with probability 1 − p_i as a "punishment," i.e., an effect opposite to a "reward." For each branch i, the light condition l_i^t is given as follows:

l_i^t = 1 (if θ(x_i^t) = 1, light ON with probability 1 − p_i), −1 (otherwise, light OFF).   (10)


The volume displacement x_i^t is updated according to the following difference equations:

x_i^{t+1} = x_i^t + v_i^t   (i ∈ I),
x_0^{t+1} = x_0^t − Σ_{j=1}^{N} v_j^t   (hub node),   (11)

v_i^t = v_i^{t−1} + a_i^t,   (12)

where v_i^t and a_i^t denote the velocity and acceleration of the corresponding volume displacement, respectively. Owing to Eq. (11), the conservation law Σ_{i=0}^{N} x_i^t = Σ_{i=0}^{N} x_i^0 = const. holds, i.e., the total volume remains unchanged from the initial condition.

The acceleration a_i^t is determined by the following function, which is expressed equivalently by Table 1:

a_i^t = −l_i^t · (1 − θ(l_i^t · s_i^t)),   (13)

Table 1. The acceleration a_i^t defined by Eq. (13).

              s_i^t > 0    s_i^t = 0    s_i^t < 0
  l_i^t = −1      1            1            0
  l_i^t = 1       0           −1           −1

where s_i^t denotes the deviation of the amount of intracellular sol allocated to branch i. The positive acceleration (a_i^t = 1) drives branch i to grow, whereas the negative acceleration (a_i^t = −1) results in the withdrawal of the branch. Table 1 shows that branch i grows when the light stimulation is not applied (l_i^t = −1) and the sol is sufficiently allocated (s_i^t ≥ 0). The sol, therefore, can be considered a resource required for growth. The photoavoidance behavior of the branch is expressed as the negative acceleration (a_i^t = −1) in response to the illumination (l_i^t = 1). However, if the sol is abundant (s_i^t > 0), the branch cannot withdraw even when illuminated, because the acceleration cannot be negative (a_i^t = 0). The movements of the branches, therefore, cannot be determined solely by the external light stimuli. The decision to expand or shrink particular branches is jointly determined by the external stimuli and the internal resource-allocating dynamics.

The sol deviation s_i^t is defined as a function of the hub volume displacement x_0^t and the accumulated information q_i^{t−1} regarding past events:

s_i^t = x_0^t + q_i^{t−1} − mean_{j∈I\{i}} {q_j^{t−1}},   (14)

q_i^t = q_i^{t−1} + α · (ζ_i^t + ω · Σ_{j∈I\{i}} η_j^t),   (15)

where ζ_i^t = θ(x_i^t) − θ(l_i^t) detects a rewarded (unpunished) play, η_i^t = θ(l_i^t) detects a punished play, α ∈ R+ is a parameter for adjusting the accumulation rate, and ω ∈ R+ is a parameter for emphasizing the information transmitted from punished branches. Table 2 shows how the reward detector ζ_i^t and the punishment detector η_i^t operate. The parameter ω is always set to ω = 1 in this study.

Table 2. Signals from the reward detector ζ_i^t and punishment detector η_i^t.

            Rewarded    Punished    Not played
  ζ_i^t        1            0            0
  η_i^t        0            1            0
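Read together, Eqs. (9)–(15) specify one update step of TOW1. The following Python sketch is our translation of those equations (the paper gives no code; the function layout, and the assumption that the detectors of Eq. (15) are evaluated on the pre-update state that determined the plays, are ours):

    import random

    def tow1_step(x, x0, v, q, p, alpha=3.0, omega=1.0, rng=random):
        """One TOW1 update. x, v, q are per-branch lists; x0 is the hub displacement."""
        n = len(x)
        theta = lambda y: 1 if y > 0 else 0                       # step function, Eq. (9)
        played = [theta(xi) for xi in x]                          # branch i plays machine i if x_i > 0
        # Light stimulation: a played branch is punished with probability 1 - p[i], Eq. (10)
        l = [1 if played[i] == 1 and rng.random() < 1 - p[i] else -1 for i in range(n)]
        # Sol deviation from hub displacement and accumulated information, Eq. (14)
        s = [x0 + q[i] - sum(q[j] for j in range(n) if j != i) / (n - 1) for i in range(n)]
        # Acceleration, Eq. (13) (equivalently, Table 1)
        a = [-l[i] * (1 - theta(l[i] * s[i])) for i in range(n)]
        # Detectors of Table 2: zeta flags rewarded (unpunished) plays, eta flags punishments
        zeta = [played[i] - theta(l[i]) for i in range(n)]
        eta = [theta(l[i]) for i in range(n)]
        # Learning term, Eq. (15)
        for i in range(n):
            q[i] += alpha * (zeta[i] + omega * sum(eta[j] for j in range(n) if j != i))
        # Velocity and displacement updates, Eqs. (12) and (11); the hub keeps the total constant
        for i in range(n):
            v[i] += a[i]
            x[i] += v[i]
        x0 -= sum(v)
        return x, x0, v, q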

The intrinsic dynamics of TOW1 are deterministic. However, the accelerations a_i^t are determined stochastically, as the external light stimuli are applied in a probabilistic manner. Fig. 2 shows a typical time evolution of the four-branch TOW1, where the initial conditions (x_i^0, v_i^0, q_i^0) are always set to zero for all i in this study. TOW1 comes to play only machine 4, the machine with the highest reward probability, although initially all machines were played.

Fig. 2. Time evolution of the four-branch original Tug-of-War model (TOW1) for the multi-armed Bandit Problem, where ⟨p1, p2, p3, p4⟩ = ⟨0.35, 0.45, 0.55, 0.65⟩, (α, ω) = (3, 1), and x_i^0 = v_i^0 = q_i^0 = 0. (a) Volume displacement x_i^t; the hub node 0 and branches 1, 2, 3, and 4 are indicated by the red dotted, yellow dotted, green broken, light blue broken, and blue solid lines, respectively; (b) velocity v_i^t of x_i^t; (c) acceleration a_i^t ∈ {−1, 0, 1} of x_i^t; (d) sol deviation s_i^t; (e) accumulated information q_i^t; and (f) AccuracyRateSeries as a function of Play. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

2.2. Extended Bandit Problem

We define the Extended Bandit Problem (EBP) as an extended notion of the multi-armed Bandit Problem (BP). The only difference between EBP and BP is that a player of the former should pick the optimal M-tuple out of the N slot machines, whereas a player of the latter is required to select only the single best one.

In this section, we deal with the case where M = 2, for simplicity in describing the definitions of EGR2, SMX2, and TOW2, the extended versions of the previously introduced algorithms. The optimal 2-tuple ⟨i*, j*⟩ is the pair of machines such that the sum of their reward probabilities is maximum, i.e., ⟨i*, j*⟩ = argmax_{⟨i,j⟩∈I_2} {p_i + p_j}, where I_2 = {⟨i, j⟩ | i, j ∈ I, i < j} is the set of all pairs. The number of all possible pairs is N·(N − 1)/2.
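For evaluation purposes, the optimal M-tuple of an instance with known probabilities can be found by brute force over all N·(N − 1)/2 pairs (or, in general, all M-tuples); a one-line sketch in Python (the function name is ours):

    from itertools import combinations

    def optimal_tuple(p, m=2):
        """Return the M-tuple with the maximum sum of reward probabilities."""
        return max(combinations(range(len(p)), m), key=lambda t: sum(p[i] for i in t))

    print(optimal_tuple([0.35, 0.45, 0.55, 0.65], 2))  # -> (2, 3), i.e., machines 3 and 4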

2.2.1. EGR2: extended ε-Greedy algorithm

Only by substituting Eqs. (3)–(5) with the following equations, EGR1 can be naturally extended to EGR2:

g_{⟨i,j⟩}^t = R_{⟨i,j⟩}^t / U_{⟨i,j⟩}^t,   (16)

U_{⟨i,j⟩}^t = U_{⟨i,j⟩}^{t−1} + u_i^t · u_j^t,   (17)

R_{⟨i,j⟩}^t = R_{⟨i,j⟩}^{t−1} + r_i^t · r_j^t.   (18)

That is, if ⟨i, j⟩ = argmax_{⟨i,j⟩∈I_2} {g_{⟨i,j⟩}^t}, EGR2 takes a greedy action to play machines i and j simultaneously (i.e., u_i^t = u_j^t = 1 and u_k^t = 0 for all k ∈ I \ {i, j}) with probability 1 − ε_t, or explores a randomly chosen pair with probability ε_t given by Eq. (6).

for all k ∈ I \ {i, j}) with the probability 1 − �t or explores a randomlychosen pair with the probability �t given by Eq. (6).

2.2.2. SMX2: extended SoftMax algorithm

SMX1 can be developed into SMX2 by replacing Eq. (7) with the following equation:

e_{⟨i,j⟩}^t = exp(β_t · g_{⟨i,j⟩}^t) / Σ_{⟨k,k′⟩∈I_2} exp(β_t · g_{⟨k,k′⟩}^t),   (19)

where β_t = β·t is the time-dependent growth rate and g_{⟨i,j⟩}^t is calculated by Eqs. (16)–(18). At each time t, SMX2 plays machines i and j simultaneously with probability e_{⟨i,j⟩}^t.

2.2.3. TOW2: extended TOW model

We revised three points in extending TOW1 to TOW2. First, we set an upper limit on the absolute value of the velocity v_i^t in Eq. (12) as follows:

v_i^t = L(v_i^{t−1} + a_i^t),   (20)

where L(v) = ℓ (if v ≥ ℓ), −ℓ (if v ≤ −ℓ), and v (otherwise), and ℓ ≥ 0 is the limit value. In this study, ℓ is fixed at ℓ = 10.

Second, we updated the acceleration a_i^t, determined discretely by Eq. (13), to be modulated continuously depending on the volume displacements of the hub x_0^t and branch x_i^t:

a_i^t = −l_i^t · (1 − θ(l_i^t · s_i^t)) + γ · (x_0^t − x_i^t),   (21)

where γ is a parameter for adjusting the restoring effect, which narrows the gap between x_0^t and x_i^t. We also fix this parameter, at γ = 0.002.

Third, we modified Eq. (15) so that TOW2 can accumulate information on coinciding rewards and punishments:

q_i^t = q_i^{t−1} + Σ_{j∈I\{i}} α · (ζ_i^t · η_j^t + ω · Σ_{⟨k,k′⟩∈I_2\{⟨i,j⟩}} η_k^t · η_{k′}^t).   (22)

A typical time evolution of the four-branch TOW2 is shown in Fig. 3. TOW2 finally stabilized to play only the two machines 3 and 4, while their two volume displacements oscillate in antiphase synchronization. Comparing Figs. 2 and 3, we can confirm a sharp contrast between the movements of TOW1 and TOW2: the volume displacements of the latter are confined to a bounded state space owing to the velocity limit and the restoring effect, whereas those of the former appear to diverge.

Although in this study we consider the case where M = 2, TOW2 can be extended to an arbitrary M using the following generalized form of Eq. (22):

q_i^t = q_i^{t−1} + Σ_{J∈I_{M−1}^i} α · (ζ_i^t · Π_{j∈J} η_j^t + ω · Σ_{K∈I_M\{⟨i,J⟩}} Π_{k∈K} η_k^t),   (23)

where I_M is the set of all M-tuples, and I_{M−1}^i is the set of all (M − 1)-tuples that do not contain i. Using Eq. (23), we confirmed that the model elongates M branches exclusively in principle.
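The three revisions amount to small changes to the TOW1 step sketched in Section 2.1.3. Below is our corresponding sketch, with ell and gamma standing for the velocity limit ℓ and the restoring parameter γ (the same caveats as before apply):

    import random

    def tow2_step(x, x0, v, q, p, alpha=7.0, omega=1.0, ell=10.0, gamma=0.002, rng=random):
        """One TOW2 update for the M = 2 case of Eq. (22)."""
        n = len(x)
        theta = lambda y: 1 if y > 0 else 0
        played = [theta(xi) for xi in x]
        l = [1 if played[i] == 1 and rng.random() < 1 - p[i] else -1 for i in range(n)]   # Eq. (10)
        s = [x0 + q[i] - sum(q[j] for j in range(n) if j != i) / (n - 1) for i in range(n)]  # Eq. (14)
        # Eq. (21): discrete tug-of-war term plus continuous restoring term
        a = [-l[i] * (1 - theta(l[i] * s[i])) + gamma * (x0 - x[i]) for i in range(n)]
        zeta = [played[i] - theta(l[i]) for i in range(n)]
        eta = [theta(l[i]) for i in range(n)]
        # Eq. (22): learn from coincidences of own rewards and others' punishments
        for i in range(n):
            for j in (j for j in range(n) if j != i):
                coin = sum(eta[k] * eta[kp]
                           for k in range(n) for kp in range(k + 1, n)
                           if (k, kp) != (min(i, j), max(i, j)))
                q[i] += alpha * (zeta[i] * eta[j] + omega * coin)
        for i in range(n):
            v[i] = max(-ell, min(ell, v[i] + a[i]))   # velocity limiter L, Eq. (20)
            x[i] += v[i]
        x0 -= sum(v)                                  # hub update keeps the total constant, Eq. (11)
        return x, x0, v, q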

Fig. 3. Time evolution of the four-branch extended Tug-of-War model (TOW2) for the Extended Bandit Problem, where ⟨p1, p2, p3, p4⟩ = ⟨0.35, 0.45, 0.55, 0.65⟩, (α, ω, ℓ, γ) = (7, 1, 10, 0.002), and x_i^0 = v_i^0 = q_i^0 = 0. (a) Volume displacement x_i^t; the color coding is the same as in Fig. 2; (b) velocity v_i^t of x_i^t; (c) acceleration a_i^t ∈ R of x_i^t; (d) sol deviation s_i^t; (e) accumulated information q_i^t; and (f) AccuracyRateSeries as a function of Play. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

2.3. Average accuracy rate

As a measure for evaluating the performances of the algorithms introduced above, in this study we use the "average accuracy rate" instead of the "regret," which is commonly used for analyzing the logarithmic asymptotic behavior mentioned in Section 1. The logarithmic regret behavior generally belongs to the long-term regime. However, long-term behavior can only unfold in constant environments, which are rare in the natural world. We are more interested in the short-term decision capability required for survival in the dynamic environments that commonly occur. The average accuracy rate defined in this section allows us to focus on the early-stage performances of the algorithms.

First, we define the measure AverageAccuracyRate for evaluating EGR1, SMX1, and TOW1. Let I* = {i* | i* = argmax_{i∈I} {p_i}} be the set of all correct machines and I^t = {i ∈ I | u_i^t = 1 or θ(x_i^t) = 1} be the set of all machines played at time t. The AccuracyRate^t at time t is calculated as follows:

AccuracyRate^t = (Σ_{s=1}^{t} Correct^s) / (Σ_{s=1}^{t} Play^s),   (24)

Correct^s = #(I* ∩ I^s),   (25)

Play^s = #(I^s),   (26)

where #(I) counts the number of elements in the set I.

Recall that we should compare the sequential and concurrent algorithms fairly. For that purpose, if a concurrent algorithm plays Y (≤ N) machines at time t, we consider that AccuracyRate^t is subsequently continued over a Y-play span. More formally, AccuracyRate^t in concurrent time is mapped to sequential time as follows:

AccuracyRateSeries = Append_{t=1}^{T} {Copy(AccuracyRate^t, Play^t)},   (27)

where Copy(x, m) gives m copies of x, Append_{t=1}^{T} (X^t) generates the conjunction of the series (X^1, X^2, ..., X^T), and T = 1000 is the maximum observation time.

After Monte Carlo simulations of each algorithm, we obtain a collection of AccuracyRateSeries. Finally, the measure AverageAccuracyRate is calculated as the series averaged over 1000 samples of AccuracyRateSeries.

To evaluate EGR2, SMX2, and TOW2, AverageAccuracyRate can be extended by replacing the set of all correct machines I* and the set of all played machines I^t in Eqs. (25) and (26) with I*_2 = {⟨i*, j*⟩ | ⟨i*, j*⟩ = argmax_{⟨i,j⟩∈I_2} {p_i + p_j}} and I^t_2 = {⟨i, j⟩ | i, j ∈ I^t, i < j}, respectively.
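The bookkeeping of Eqs. (24)–(27) can be summarized in a few lines. The sketch below (our code) shows how each concurrent step contributes Play^t copies of the running accuracy rate, so that concurrent and sequential algorithms are compared over equal numbers of plays:

    def accuracy_rate_series(played_sets, correct_set, T=1000):
        """played_sets[t] is the set I^t of machines played at step t; correct_set is I*."""
        series, correct_sum, play_sum = [], 0, 0
        for t in range(T):
            played = played_sets[t]
            correct_sum += len(correct_set & played)             # Correct^t, Eq. (25)
            play_sum += len(played)                              # Play^t, Eq. (26)
            rate = correct_sum / play_sum if play_sum else 0.0   # Eq. (24)
            series.extend([rate] * len(played))                  # Copy(AccuracyRate^t, Play^t), Eq. (27)
        return series

    # Averaging this series over 1000 Monte Carlo runs yields AverageAccuracyRate.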

3. Results

We are interested in the early-stage performances of the algorithms because many real-world situations generally do not allow an algorithm to collect information over long periods of time. Therefore, we focused on the initial rise of AverageAccuracyRate. For each algorithm, we optimized the performance by changing a single parameter, i.e., λ for the EGRs, β for the SMXs, and α for the TOWs, so that the maximal AverageAccuracyRate is achieved at 100 and 200 Plays.

In this study, we demonstrate the results for four illustrative sets of reward probabilities: P3E = ⟨p1, p2, p3⟩ = ⟨0.2, 0.5, 0.8⟩, P3H = ⟨0.4, 0.5, 0.6⟩, P4E = ⟨p1, p2, p3, p4⟩ = ⟨0.2, 0.4, 0.6, 0.8⟩, and P4H = ⟨0.35, 0.45, 0.55, 0.65⟩. We chose these probability sets for two reasons. First, they are "symmetric" in the sense that the differences of all values from the average value 0.5 are symmetrically distributed. In our previous study, we confirmed that TOW1 exhibited the best performances for symmetric probability sets of 3-machine instances when setting the parameter ω = 1. The symmetric probability sets, therefore, allow us to skip the parameter optimization for ω. Second, these sets enable us to evaluate the dependence of the performances on the difficulty of the problem instances. Because BP becomes difficult when the differences among the probabilities are small, P3E is "easier" than P3H, and P4H is "harder" than P4E.

3.1. Multi-armed Bandit Problem

Fig. 4 shows the performances of TOW1, EGR1, and SMX1 at 200 Plays for the four instances of BP. For each instance, we determined the optimal parameter α of TOW1 by comparing the maximal AverageAccuracyRates under changes Δα = 1, where ω = 1. The optimal parameters for EGR1 and SMX1 were selected with changes Δλ = Δβ = 0.05.

Fig. 4. Average accuracy rate of the original Tug-of-War model (TOW1: solid red line), the modified SoftMax algorithm (SMX1: blue dotted line), and the modified ε-Greedy algorithm (EGR1: green broken line) for the multi-armed Bandit Problem (BP). The horizontal and vertical axes denote the number of Plays and the AverageAccuracyRate of 1000 samples, respectively. For each algorithm, a single parameter was optimized, i.e., α for TOW1, β for SMX1, and λ for EGR1. The parameter ω of TOW1 was fixed at ω = 1. (a) A relatively easier 3-machine BP instance P3E = ⟨0.2, 0.5, 0.8⟩, where the optimized parameters were (α, β, λ) = (3, 0.2, 0.25); (b) a relatively harder 3-machine instance P3H = ⟨0.4, 0.5, 0.6⟩, where (α, β, λ) = (3, 0.25, 0.05); (c) a relatively easier 4-machine instance P4E = ⟨0.2, 0.4, 0.6, 0.8⟩, where (α, β, λ) = (1, 0.20, 0.15); and (d) a relatively harder 4-machine instance P4H = ⟨0.35, 0.45, 0.55, 0.65⟩, where (α, β, λ) = (3, 0.20, 0.05). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

For all the instances, the AverageAccuracyRates of the optimized TOW1 at 100 and 200 Plays were higher than those of the optimized SMX1 and EGR1. That is, a player of BP using TOW1 can obtain a larger total reward sum after 100 and 200 trials than SMX1 and EGR1 users. Fig. 4 also shows that the AverageAccuracyRates of all the algorithms degrade as the problems become harder.

The initial rise of the AverageAccuracyRates of SMX1 was higher than that of EGR1 in most cases.[4]

[4] The performances of EGR1 and EGR2 are likely to improve when ε_t is changed to ε_t = min{1, 1/(λ·t)}.

3.2. Extended Bandit Problem

The performances of TOW2, EGR2, and SMX2 at 200 Plays for the four instances of EBP are shown in Fig. 5. The optimal parameters were determined with changes Δα = 0.5 and Δλ = Δβ = 0.01, respectively. For TOW2, we fixed all the other parameters at ℓ = 10, γ = 0.002, and ω = 1.

As in the previous cases, the AverageAccuracyRates of the optimized TOW2 at 100 and 200 Plays were larger than those of the optimized SMX2 and EGR2. We could also confirm that the difference in performance between TOW2 and the other algorithms becomes greater as the problem becomes more difficult.

3.3. Summary

In summary, the TOWs were stronger in their early-stage performances than the EGRs and SMXs for the symmetric probability sets. However, we have to report not only the strong points but also the weak points of the TOWs. The TOWs were sometimes overtaken by the EGRs and SMXs after long periods of observation time.[5] In addition, TOW2 tends to be good at solving harder problem instances but weak on easier ones. We confirmed that SMX2 sometimes defeated TOW2 by narrow margins on very easy problems, for example, ⟨0.05, 0.35, 0.65, 0.95⟩.

[5] However, the EGRs and SMXs require significant changes in their optimal parameters to achieve their best performances when the observation time gets longer. In contrast, the TOWs do not need large parameter changes even if the observation time is extended.

Fig. 5. Average accuracy rate of the extended Tug-of-War model (TOW2: solid red line), the extended SoftMax algorithm (SMX2: blue dotted line), and the extended ε-Greedy algorithm (EGR2: green broken line) for the Extended Bandit Problem (EBP). For each algorithm, a single parameter was optimized, i.e., α for TOW2, β for SMX2, and λ for EGR2. The other parameters of TOW2 were fixed at (ω, ℓ, γ) = (1, 10, 0.002). (a) A relatively easier 2-tuple-3-machine EBP instance P3E = ⟨0.2, 0.5, 0.8⟩, where the optimized parameters were (α, β, λ) = (7, 0.30, 0.16); (b) a relatively harder 2-tuple-3-machine instance P3H = ⟨0.4, 0.5, 0.6⟩, where (α, β, λ) = (7, 0.38, 0.07); (c) a relatively easier 2-tuple-4-machine instance P4E = ⟨0.2, 0.4, 0.6, 0.8⟩, where (α, β, λ) = (2.5, 0.20, 0.07); and (d) a relatively harder 2-tuple-4-machine instance P4H = ⟨0.35, 0.45, 0.55, 0.65⟩, where (α, β, λ) = (5.5, 0.26, 0.15). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of the article.)

4. Discussion

There is a study that connects an efficient algorithm for BP with human decision-making capability (Shinohara et al., 2007; Takahashi et al., 2010). The TOW models will be useful for studying the origin of the efficient resource-allocating capability of the amoeboid organism, because they were formulated on the basis of mechanics so as to be grounded in physical laws. The learning term of TOW2 (Eq. (22)), which enables the "exploitation," represents a physically plausible process of accumulating information on coinciding events between two branches of the organism. Indeed, this learning process, which is realized in a manner similar to Hebbian reinforcement learning, has been confirmed experimentally: the organism was found to be capable of strengthening the connectivity between two parts that are in contact with food sources (Nakagaki et al., 2000; Tero et al., 2006, 2010).

Unlike many nature-inspired metaheuristics, the TOW model is able to "explore" without needing a random number generator, which would leave the question of the origin of the decision to the external environment. The intrinsic dynamics of the TOW model are capable of spontaneously switching between the "exploitation" mode and the "exploration" mode. This spontaneous mode-switching behavior was observed experimentally (Takamatsu, 2006) and was reproduced by the authors' ordinary differential equation model (Aono et al., 2009b, 2011; Hirata et al., 2010).


The TOW model maintains a constant volume while collecting environmental information by concurrently growing and withdrawing its branches. We are interested in the effect of this conservation law on the computational capabilities of the organism because it yields a nonlocal correlation among the oscillating branches in terms of their spatiotemporal dynamics. In our previous study, we showed that the conservation law enhances the efficiency and adaptability of solution searching, as the resource-increment information in one branch is instantaneously transmitted to the others so that they can immediately decrease their resources to compensate for the increment (Kim et al., 2010b). It is an interesting subject to verify the problem-solving skill of the organism by investigating the correlation between the branches and the learning processes assumed in the TOW algorithms.

5. Conclusion

In this study, we proposed two concurrent search algorithms that extract the physical nature of the efficient resource-allocating process of an amoeboid organism, the true slime mold P. polycephalum. The Tug-of-War algorithm (TOW1) and its extended version (TOW2) were applied to solving the multi-armed Bandit Problem (BP) and the Extended Bandit Problem (EBP), respectively. Two well-known algorithms for BP, the modified ε-Greedy algorithm (EGR1) and the modified SoftMax algorithm (SMX1), were also extended, to EGR2 and SMX2 respectively, so that they could be applied to EBP.

Optimizing a single parameter for each algorithm, we compared the performances of the TOWs, EGRs, and SMXs in terms of their short-term decision-making capabilities, represented by average accuracy rates. Although the TOWs have more than one parameter, they exhibited better performances than the optimized EGRs and SMXs by adjusting only the parameter α. Moreover, it is noteworthy that, even when the parameter α was fixed, the TOWs did not degrade their early-stage performances significantly. Indeed, TOW1 with α = 3 and TOW2 with α = 7 outperformed the other algorithms on almost all the problem instances examined in this study. We will report on the parameter robustness of the TOWs elsewhere.

The proposed algorithms for BP and EBP are good at managing the exploration–exploitation dilemma, a trade-off between the speed and accuracy of decision making, which are vital but incompatible objectives for achieving successful business and quick adaptation in unpredictable worlds. Thus, we believe that our TOW models can be exploited for a broad range of real-world applications (Kim et al., 2013) and will be useful for exploring the physical nature of biological information processing.

References

Agrawal, R., 1995. Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Adv. Appl. Prob. 27, 1054–1078.

Aono, M., Gunji, Y.-P., 2003. Beyond input–output computings: error-driven emergence with parallel non-distributed slime mold computer. Biosystems 71, 257–287.

Aono, M., Hara, M., Aihara, K., 2007. Amoeba-based neurocomputing with chaotic dynamics. Commun. ACM 50 (9), 69–72.

Aono, M., Hara, M., 2008. Spontaneous deadlock breaking on amoeba-based neurocomputer. Biosystems 91, 83–93.

Aono, M., Hirata, Y., Hara, M., Aihara, K., 2009a. Amoeba-based chaotic neurocomputing: combinatorial optimization by coupled biological oscillators. New Gener. Comput. 27, 129–157.

Aono, M., Hirata, Y., Hara, Y., Aihara, K., 2009b. Resource-competing oscillator network as a model of amoeba-based neurocomputer. In: Calude, C. (Ed.), Unconventional Computation. Lecture Notes in Computer Science, vol. 5715. Springer-Verlag, Berlin Heidelberg, pp. 56–69.

Aono, M., Hirata, Y., Hara, M., Aihara, K., 2011. Greedy versus social: resource-competing oscillator network as a model of amoeba-based neurocomputer. Nat. Comput. 10, 1219–1244.

Auer, P., Cesa-Bianchi, N., Fischer, P., 2002. Finite-time analysis of the multiarmed bandit problem. Mach. Learn. 47, 235–256.

Cohen, J., McClure, S., Yu, A., 2007. Should I stay or should I go? How the human brain manages the trade-off between exploitation and exploration. Philos. Trans. R. Soc. B 362 (1481), 933–942.

Daw, N., O'Doherty, J., Dayan, P., Seymour, B., Dolan, R., 2006. Cortical substrates for exploratory decisions in humans. Nature 441, 876–879.

Gittins, J., Jones, D., 1974. A dynamic allocation index for the sequential design of experiments. In: Gans, J. (Ed.), Progress in Statistics. North Holland, Amsterdam, pp. 241–266.

Gittins, J., 1979. Bandit processes and dynamic allocation indices. J. R. Stat. Soc. B 41, 148–177.

Hirata, Y., Aono, M., Hara, M., Aihara, K., 2010. Spontaneous mode switching in coupled oscillators competing for constant amounts of resources. Chaos 20, 013117.

Hopfield, J.J., Tank, D.W., 1986. Computing with neural circuits: a model. Science 233, 625–633.

Kessler, D., 1982. Plasmodial structure and motility. In: Aldrich, H.C., Daniel, J.W. (Eds.), Cell Biology of Physarum and Didymium, vol. 1. Academic Press, New York, pp. 145–208.

Kim, S.-J., Aono, M., Hara, M., 2009. Tug-of-war model for two-bandit problem. In: Calude, C. (Ed.), Unconventional Computation. Lecture Notes in Computer Science, vol. 5715. Springer-Verlag, Berlin Heidelberg, p. 289.

Kim, S.-J., Aono, M., Hara, M., 2010a. Tug-of-war model for multi-armed bandit problem. In: Calude, C. (Ed.), Unconventional Computation. Lecture Notes in Computer Science, vol. 6079. Springer-Verlag, Berlin Heidelberg, pp. 69–80.

Kim, S.-J., Aono, M., Hara, M., 2010b. Tug-of-war model for two-bandit problem: nonlocally correlated parallel exploration via resource conservation. Biosystems 101, 29–36.

Kim, S.-J., Nameda, E., Aono, M., Hara, M., 2011. Adaptive tug-of-war model for two-armed bandit problem. In: Proceedings of the International Symposium on Nonlinear Theory and Its Applications (NOLTA2011), IEICE, pp. 176–179.

Kim, S.-J., Aono, M., 2014. Amoeba-inspired algorithm for cognitive medium access. Nonlinear Theory and Its Applications (NOLTA) E5-N, IEICE (in press).

Kim, S.-J., Naruse, M., Aono, M., Ohtsu, M., Hara, M., 2013. Decision maker based on nanoscale photo-excitation transfer. Sci. Rep. 3, 2370, http://dx.doi.org/10.1038/srep02370.

Lai, T., Robbins, H., 1985. Asymptotically efficient adaptive allocation rules. Adv. Appl. Math. 6, 4–22.

Nakagaki, T., Yamada, H., Toth, A., 2000. Maze-solving by an amoeboid organism. Nature 407, 470.

Robbins, H., 1952. Some aspects of the sequential design of experiments. Bull. Am. Math. Soc. 58, 527–536.

Shinohara, S., Taguchi, R., Katsurada, K., Nitta, T., 2007. A model of belief formation based on causality and application to n-armed bandit problem. Trans. Jpn. Soc. Artif. Intell. 22 (1), 58–68 (in Japanese).

Sutton, R., Barto, A., 1998. Reinforcement Learning: An Introduction. The MIT Press, Cambridge, MA.

Takahashi, T., Nakano, M., Shinohara, S., 2010. Cognitive symmetries: illogical but rational biases. Symmetry Cult. Sci. 21 (1–3), 275–294.

Takamatsu, A., 2006. Spontaneous switching among multiple spatio-temporal patterns in three-oscillator systems constructed with oscillatory cells of true slime mold. Physica D 223, 180–188.

Tero, A., Kobayashi, R., Nakagaki, T., 2006. Physarum solver: a biologically inspired method of road-network navigation. Physica A 363, 115–119.

Tero, A., Takagi, S., Saigusa, T., Ito, K., Bebber, D.P., Fricker, M.D., Yumiki, K., Kobayashi, R., Nakagaki, T., 2010. Rules for biologically inspired adaptive network design. Science 327 (5964), 439–442.

Thompson, W., 1933. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika 25, 285–294.

Vermorel, J., Mohri, M., 2005. Multi-armed bandit algorithms and empirical evaluation. In: Gama, J. (Ed.), 16th European Conference on Machine Learning. Lecture Notes in Artificial Intelligence, vol. 3720. Springer-Verlag, Berlin Heidelberg, pp. 437–448.

Zhu, L., Aono, M., Kim, S.-J., Hara, M., 2013. Amoeba-based computing for traveling salesman problem: long-term correlations between spatially separated individual cells of Physarum polycephalum. Biosystems 112, 1–10.

